License: CC BY 4.0
arXiv:2312.04584v2 [cs.CR] 11 Dec 2023

Towards Sample-specific Backdoor Attack with Clean Labels via Attribute Trigger

Yiming Li, Mingyan Zhu, Junfeng Guo, Tao Wei, Shu-Tao Xia, Zhan Qin

The first two authors contributed equally to this work. Yiming Li is with ZJU-Hangzhou Global Scientific and Technological Innovation Center (HIC), Hangzhou, 311215, China, and also with the School of Cyber Science and Technology, Zhejiang University, Hangzhou, 311200, China. Mingyan Zhu is with Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, 518055, China. Junfeng Guo is with the Department of Computer Science, University of Maryland, College Park, MD 20742, USA. Tao Wei is with Ant Group, Hangzhou, 310023, China. Shu-Tao Xia is with Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, 518055, China, and also with the Research Center of Artificial Intelligence, Peng Cheng Laboratory, Shenzhen, 518000, China. Zhan Qin is with The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, 311200, China, and also with the School of Cyber Science and Technology, Zhejiang University, Hangzhou, 311200, China.
Abstract

Currently, sample-specific backdoor attacks (SSBAs) are the most advanced and malicious methods since they can easily circumvent most current backdoor defenses. In this paper, we reveal that SSBAs are not sufficiently stealthy due to their poisoned-label nature: users can discover anomalies if they check the image-label relationship. In particular, we demonstrate that it is ineffective to directly generalize existing SSBAs to their clean-label variants by poisoning samples solely from the target class. We reveal that this is primarily due to two reasons: (1) the ‘antagonistic effects’ of ground-truth features and (2) the learning difficulty of sample-specific features. Accordingly, trigger-related features of existing SSBAs cannot be effectively learned under the clean-label setting because of the mild trigger intensity required for ensuring stealthiness. We argue that the intensity constraint of existing SSBAs arises mostly because their trigger patterns are ‘content-irrelevant’ and therefore act as ‘noise’ for both humans and DNNs. Motivated by this understanding, we propose to exploit content-relevant features, a.k.a. (human-relied) attributes, as the trigger patterns to design clean-label SSBAs. This new attack paradigm is dubbed backdoor attack with attribute trigger (BAAT). Extensive experiments on benchmark datasets verify the effectiveness of our BAAT and its resistance to existing defenses.

Index Terms:
Backdoor Attack, Sample-specific Attack, Clean-label Attack, Trustworthy ML, AI Security

1 Introduction

Deep neural networks (DNNs) have demonstrated their effectiveness and efficiency in many applications, such as face recognition [1, 2, 3] and speech recognition [4, 5, 6]. In practice, training well-performing DNNs usually requires a large number of training samples and substantial computational resources. Accordingly, third-party resources (e.g., samples or pre-trained models) are often involved in the training process of DNNs to reduce these costs.

Figure 1: The limitations of existing sample-specific and clean-label backdoor attacks. The first two poisoned samples are generated by sample-specific attacks; users can notice their anomalies from the image-label inconsistency (marked in red). The last two are produced by clean-label attacks; detection algorithms can reveal their trigger patterns (marked in the red boxes) since these patterns are sample-agnostic. This example indicates that adversaries should design sample-specific attacks with clean labels to truly achieve stealthiness, since only such attacks can bypass both human inspection and machine detection.

However, recent studies revealed that using third-party training resources could bring a new security threat, known as the backdoor attack [7, 8, 9]. In general, backdoor attacks intend to implant a hidden backdoor, i.e., a latent connection between an adversary-specified trigger pattern and the target label, by maliciously manipulating the training process of DNNs. Currently, there are many different types of backdoor attacks, such as invisible attacks [10, 11, 12], physical attacks [13, 14, 15], and sample-specific backdoor attacks [16, 17, 18]. Among them, sample-specific attacks are usually regarded as the most advanced and malicious backdoor paradigm [8]. Their trigger patterns are sample-specific instead of sample-agnostic, so they can easily circumvent most existing backdoor defenses by breaking the defenses' fundamental assumptions.

In this paper, we revisit sample-specific backdoor attacks (SSBAs). We notice that existing SSBAs [16, 18, 17] all operate under the poisoned-label setting, where the labels of poisoned samples are inconsistent with their ground-truth labels. For example, a cat-like image may be labeled as a ‘dog’. As such, existing SSBAs are not stealthy under human inspection, since victim dataset users can discover anomalies if they check the image-label relationship of samples (as shown in Figure 1). In particular, we show that it is ineffective to directly generalize existing SSBAs to their clean-label variants by poisoning samples solely from the target class.

We argue that this failure is mostly due to two latent mechanisms: (1) the ‘antagonistic effects’ of ground-truth features and (2) the learning difficulty of sample-specific features. Specifically, during the training process of clean-label attacks, DNNs may exploit both trigger-related features and ground-truth features (i.e., features related to the ground-truth class) to learn the target class, and learning ground-truth features undermines the learning of trigger patterns [19]. In other words, the trigger features must be sufficiently ‘strong’; otherwise, DNNs may not learn them. Unfortunately, as we verify empirically and theoretically in Section 3.2, it is more difficult for DNNs to learn sample-specific triggers than the sample-agnostic ones used in existing clean-label attacks [20, 21, 19] (with the same intensity). As such, trigger-related features of existing SSBAs cannot be effectively learned under the clean-label setting because of the mild intensity required for ensuring stealthiness (as shown in Section 3.2). This raises an intriguing question: Is it really impossible to design a sample-specific backdoor attack with clean labels?

The answer to this question is negative. We argue that the intensity constraint of existing SSBAs arises mostly because their trigger patterns are ‘content-irrelevant’ and therefore act as ‘noise’ for both humans and DNNs. Motivated by this understanding, we propose to exploit content-relevant features, a.k.a. (human-relied) attributes, as the trigger patterns to design clean-label SSBAs. This new attack paradigm is dubbed backdoor attack with attribute trigger (BAAT). In general, our method is inspired by the human decision process. For example, we can use an adversary-defined hairstyle as our attribute trigger in facial recognition tasks. Since an attribute is a high-level and complex feature, the modifications between poisoned images (i.e., the modified images containing trigger patterns) and their benign versions are sample-specific and can be large (i.e., high intensity) while still preserving stealthiness. Selecting and designing attributes also offers a feasible way to incorporate domain knowledge of the target task.

In conclusion, the main contributions of our paper are four-fold: (1) We demonstrate the limitations of both existing sample-specific and clean-label backdoor attacks. (2) We reveal, both empirically and theoretically, the inherent reasons (i.e., antagonistic effects and learning difficulty) for the failure of directly generalizing existing SSBA methods to the clean-label setting. (3) Based on our analyses, we design the first effective clean-label sample-specific backdoor attack (i.e., BAAT), which exploits attributes as trigger patterns. Besides, we propose a simple yet effective method to implement BAAT. (4) We empirically verify the effectiveness of our BAAT and its resistance to representative backdoor defenses on benchmark datasets.

The rest of this paper is organized as follows. In Section 2, we briefly review related works on backdoor attacks and defenses. After that, we revisit existing sample-specific and clean-label backdoor attacks in Section 3. Specifically, we demonstrate in Section 3.1 that it is ineffective to directly generalize existing SSBAs to their clean-label variants by poisoning samples solely from the target class, and discuss the reasons in Section 3.2. We also reveal the latent limitations of existing clean-label backdoor attacks in Section 3.3. Based on these analyses, we propose our backdoor attack with attribute trigger (BAAT) in Section 4. We conduct experiments in Section 5 and conclude this paper in Section 7.

2 Related Works

2.1 Backdoor Attacks

Backdoor attack is an emerging yet severe threat that reveals the training-phase security concerns of DNNs [8]. Specifically, backdoored models behave normally on benign samples, whereas their predictions are maliciously changed whenever the adversary-specified trigger patterns appear. In this paper, we focus on poison-only backdoor attacks (i.e., the adversaries can only modify the training dataset) in image classification. Backdoor threats under other threat models [16, 22, 23] or in other tasks [24, 25, 26, 27, 28] are beyond the scope of this paper.

In general, existing poison-only backdoor attacks can be divided into two main categories, based on label properties of poisoned samples, as follows.

Backdoor Attacks with Poisoned Labels. In these attacks, the adversary-assigned labels of poisoned samples differ from the ground-truth labels of their benign versions. This is currently the most widespread attack paradigm owing to its simplicity and effectiveness. [7] first revealed the backdoor threat in the training of DNNs and proposed the BadNets attack. Specifically, BadNets randomly selected some samples from the original benign training dataset and modified their images by stamping an adversary-specified trigger pattern (e.g., a white-black square) on them. The labels of the modified images were re-assigned to the pre-defined target label. These poisoned samples, together with the remaining benign ones, form the poisoned training set, which was released to the victims for training their models. After that, [10] argued that poisoned images should be similar to their benign versions to ensure stealthiness, based on which they proposed the blended attack. There are also many other attacks (e.g., [29, 30, 31]) in this area. Among them, the sample-specific backdoor attack (SSBA) [16, 18, 17] is currently the most advanced attack paradigm, where the trigger patterns are sample-specific instead of sample-agnostic as in previous attacks. Specifically, IAD [16] proposed to adopt random sample-specific patches as the trigger patterns. However, IAD required controlling the whole training process and its trigger patterns were visible, which significantly reduced its threat in real-world applications; WaNet [18] exploited image warping as the backdoor trigger, which was sample-specific and invisible; most recently, [17] used a pre-trained encoder to generate sample-specific trigger patterns, inspired by DNN-based image steganography [32]. In particular, these SSBAs broke the fundamental assumption (i.e., the trigger is sample-agnostic) of most existing defenses and could therefore easily bypass them. Accordingly, it is of great significance to further explore this attack paradigm. These SSBAs are the main focus of this paper.
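The BadNets-style procedure described above (stamp a trigger on a random subset, relabel that subset to the target class, and mix it with the remaining benign samples) can be sketched as follows. This is a minimal illustration, not the original BadNets code: images are assumed to be grayscale lists of lists with values in [0, 1], and the corner position and checkerboard layout of the square are illustrative choices; the function names are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Set, Tuple

Image = List[List[float]]  # grayscale image, pixel values in [0, 1] (assumed representation)

def stamp_square(img: Image, size: int = 3) -> Image:
    """Stamp a black-white checkerboard square on the bottom-right corner (illustrative trigger)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]  # copy so the benign image is left untouched
    for r in range(size):
        for c in range(size):
            out[h - size + r][w - size + c] = float((r + c) % 2)  # alternate black (0) and white (1)
    return out

def poison_with_label_flip(
    dataset: Sequence[Tuple[Image, int]],
    target: int,
    rate: float,
    trigger: Callable[[Image], Image],
    seed: int = 0,
) -> Tuple[List[Tuple[Image, int]], Set[int]]:
    """Poisoned-label attack: stamp the trigger on a random subset and relabel it as `target`."""
    rng = random.Random(seed)
    poisoned_idx = set(rng.sample(range(len(dataset)), int(rate * len(dataset))))
    result = []
    for i, (x, y) in enumerate(dataset):
        if i in poisoned_idx:
            result.append((trigger(x), target))  # label overwritten -> image-label mismatch
        else:
            result.append((x, y))  # benign samples are kept as-is
    return result, poisoned_idx
```

The relabeling line is exactly what makes these attacks detectable by inspecting image-label pairs, which motivates the clean-label attacks discussed next.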

Backdoor Attacks with Clean Labels. Turner et al. [20] argued that dataset users could still identify poisoned-label backdoor attacks by examining the image-label relationship, even though their poisoned images can be similar to their benign versions. For example, if a cat-like image is labeled as a deer, users can treat it as a malicious sample even if the image looks innocent. Accordingly, they proposed to poison samples only from the target class to design an attack with clean labels. However, this simple approach usually fails, since the ‘ground-truth features’ related to the target label contained in the poisoned samples hinder the learning of trigger patterns. To alleviate this problem, they first leveraged adversarial perturbations to modify the selected images from the target class before adding trigger patterns, reducing the effect of those ‘ground-truth features’. Recently, [21] addressed this problem from another perspective by using a ‘stronger’ trigger pattern. Specifically, they exploited a targeted universal adversarial perturbation [33] instead of the handcrafted black-white patch as the trigger pattern. This attack paradigm is stealthy under human inspection and therefore also worth further exploration.
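In contrast to poisoned-label attacks, the clean-label recipe restricts poisoning to target-class samples and never changes a label. The sketch below illustrates only that selection logic under stated assumptions: `perturb` is a hypothetical placeholder for the adversarial-perturbation step of [20] (no specific attack is implemented here), `trigger` is any trigger-adding function, and the poisoning rate is taken relative to the target class.

```python
import random
from typing import Callable, List, Optional, Sequence, Set, Tuple

def poison_clean_label(
    dataset: Sequence[Tuple[object, int]],
    target: int,
    rate: float,
    trigger: Callable,
    perturb: Optional[Callable] = None,  # hypothetical hook for an adversarial perturbation
    seed: int = 0,
) -> Tuple[List[Tuple[object, int]], Set[int]]:
    """Clean-label poisoning: only target-class samples are modified and every label is kept."""
    rng = random.Random(seed)
    target_idx = [i for i, (_, y) in enumerate(dataset) if y == target]
    chosen = set(rng.sample(target_idx, int(rate * len(target_idx))))  # rate relative to target class
    out = []
    for i, (x, y) in enumerate(dataset):
        if i in chosen:
            if perturb is not None:
                x = perturb(x)  # weaken ground-truth features first (assumed ordering, following [20])
            x = trigger(x)
        out.append((x, y))  # label unchanged, so image-label pairs stay consistent
    return out, chosen
```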

2.2 Backdoor Defenses

In general, existing defenses can be roughly separated into four main categories, as follows.

Model-repairing-based Defenses. In these methods, defenders intend to erase hidden backdoors contained in the given models. For example, [34, 35, 36] demonstrated that fine-tuning the attacked DNNs with a few benign samples for only a few iterations can effectively remove their hidden backdoors, inspired by catastrophic forgetting [37]; [38, 39, 40] revealed that defenders can remove hidden backdoors via model pruning, based on the understanding that backdoors are mainly encoded in specific neurons that can be disentangled from benign neurons.
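The pruning idea above rests on backdoor neurons being separable from benign ones. One common heuristic for locating them, used here purely as an illustrative assumption rather than the exact criterion of [38, 39, 40], is to rank channels by their mean activation on benign samples and prune the least active ones. The function names below are hypothetical.

```python
from typing import List, Sequence

def least_active_channels(activations: Sequence[Sequence[float]], prune_ratio: float) -> List[int]:
    """Indices of the channels with the lowest mean activation on benign samples.

    activations[n][c] is the (e.g., spatially averaged) activation of channel c on benign sample n.
    """
    n_samples, n_channels = len(activations), len(activations[0])
    means = [sum(a[c] for a in activations) / n_samples for c in range(n_channels)]
    k = int(prune_ratio * n_channels)
    # Channels that stay quiet on clean inputs are treated as candidates for encoding the backdoor.
    return sorted(range(n_channels), key=lambda c: means[c])[:k]

def mask_channels(channel_weights: Sequence[float], pruned: Sequence[int]) -> List[float]:
    """Zero the weights of pruned channels (a simple stand-in for removing them)."""
    pruned_set = set(pruned)
    return [0.0 if c in pruned_set else w for c, w in enumerate(channel_weights)]
```

In practice, such pruning is usually followed by a short fine-tuning phase to recover benign accuracy, which connects the two lines of work above.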

Trigger-synthesis-based Defenses. Instead of removing hidden backdoors directly, these defenses first synthesize potential trigger patterns and then suppress their effects. Specifically, [41, 42, 43] reversed the trigger based on targeted universal adversarial attacks, inspired by the similarities between backdoor attacks and adversarial attacks in the inference process; [44, 45] exploited Grad-CAM [46] to extract critical regions from input images towards each class, and then located the trigger regions based on boundary analysis and anomaly detection.

Pre-processing-based Defenses. These approaches pre-process test images before feeding them into the model for prediction, motivated by the observation that backdoor attacks may lose effectiveness when the trigger used for attacking differs from the one used for poisoning [34, 13, 47]. These defenses are usually efficient since they do not require modifying the suspicious models.

Sample-filtering-based Defenses. These methods aim to filter out poisoned samples. For example, defenders can identify malicious training samples based on their distinctive behaviors in the hidden feature space [48, 49, 50]. Recently, [51] proposed to filter poisoned testing samples by superimposing different images on the suspicious sample and observing the resulting predictions. The smaller the prediction randomness, the more likely the sample is poisoned. Most recently, Guo et al. [52] detected poisoned samples by analyzing their prediction consistency under pixel-wise amplification. The more consistent a sample's predictions, the more likely it is poisoned.
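The two test-time filtering criteria above can be expressed as simple scores. The sketch below is a simplified illustration in the spirit of [51] and [52], not their reference implementations: images are assumed to be flattened lists in [0, 1], `predict_proba` and `predict_label` are hypothetical model callables, and the blending weight and amplification factors are illustrative assumptions.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]  # flattened image, pixel values in [0, 1]

def entropy(p: Sequence[float]) -> float:
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)

def superimposition_entropy(x: Vector, clean_pool: Sequence[Vector],
                            predict_proba: Callable[[Vector], Sequence[float]],
                            blend: float = 0.5) -> float:
    """Mean prediction entropy after blending x with benign images ([51]-style score).

    A low value means the prediction barely changes, hinting that a trigger dominates.
    """
    scores = []
    for c in clean_pool:
        mixed = [blend * a + (1 - blend) * b for a, b in zip(x, c)]
        scores.append(entropy(predict_proba(mixed)))
    return sum(scores) / len(scores)

def scaling_consistency(x: Vector, predict_label: Callable[[Vector], int],
                        factors: Sequence[float] = (3, 5, 7, 9, 11)) -> float:
    """Share of amplified copies of x that keep the original label ([52]-style score).

    Pixel values are multiplied by each factor and clipped to [0, 1]; a high value hints at poisoning.
    """
    base = predict_label(x)
    labels = [predict_label([min(1.0, v * k) for v in x]) for k in factors]
    return sum(lbl == base for lbl in labels) / len(labels)
```

Both scores are thresholded in practice; the thresholds are defense-specific and not modeled here.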

3 Revisiting Existing Backdoor Attacks

3.1 Design Clean-label Sample-specific Attacks by Poisoning Samples only from the Target Class

As discussed in Section 2.1, sample-specific backdoor attacks can circumvent most existing backdoor defenses. However, since these attacks all use poisoned labels, users can still identify them by examining the image-label relationship (as shown in Figure 1). To alleviate this problem, the most straightforward method is to design their clean-label variants by poisoning samples only from the target class instead of all classes. In this section, we demonstrate that this approach is largely ineffective.

Settings. We conduct experiments on a subset of the ImageNet dataset [53] containing 100 random classes. Each class contains 500 images for training and 50 images for testing. We construct the clean-label variants of WaNet and ISSBA (dubbed ‘WaNet-C’ and ‘ISSBA-C’, respectively) by poisoning samples only from the target class. Specifically, we set the target class as $y_t = 1$ (i.e., ‘n01443537’) and poison 80% of the samples from the target class. We conduct all attacks on both VGG-16 [54] and ResNet-18 [55] and implement them based on the code in BackdoorBox [56]. We use the default settings of ISSBA and adopt the settings of WaNet (without noise mode) with the kernel size set to 32.
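Because only target-class images are poisoned, the 80% rate above corresponds to a small fraction of the whole training set. The short computation below uses only the numbers stated in the settings (100 classes, 500 training images each); the function name is hypothetical.

```python
def overall_poisoning_rate(n_classes: int, per_class: int, target_fraction: float) -> float:
    """Share of the whole training set that is poisoned when only target-class images are modified."""
    n_poisoned = round(target_fraction * per_class)  # 0.8 * 500 = 400 poisoned images
    return n_poisoned / (n_classes * per_class)      # 400 / 50,000

rate = overall_poisoning_rate(n_classes=100, per_class=500, target_fraction=0.8)
print(f"{rate:.2%}")  # prints 0.80%
```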

TABLE I: The performance of WaNet and ISSBA variants with clean labels (i.e., ‘WaNet-C’ and ‘ISSBA-C’) on ImageNet. We mark all failed cases (i.e., ASR < 20%) in red.
Model | Metric | WaNet-C | ISSBA-C
VGG-16 | BA (%) | 85.32 | 85.20
VGG-16 | ASR (%) | 2.16 | 0.90
ResNet-18 | BA (%) | 79.58 | 77.60
ResNet-18 | ASR (%) | 0.96 | 0.90

Results. As shown in Table I, both WaNet-C and ISSBA-C fail to create backdoors in all cases (ASR ≤ 2.16%). These results indicate that their trigger patterns cannot compete with the ‘ground-truth features’ (i.e., features related to the target class) contained in the poisoned images. We analyze the reasons in the next subsection.
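The two metrics reported in Table I can be computed as sketched below. This follows a common convention that we assume here (the exact evaluation protocol is not spelled out in this section): BA is accuracy on the benign test set, and ASR is the fraction of triggered test images whose ground-truth label is not the target class that are classified as the target. `model` and `trigger` are hypothetical callables.

```python
from typing import Callable, Sequence, Tuple

def benign_accuracy(model: Callable, test_set: Sequence[Tuple[object, int]]) -> float:
    """BA: accuracy on the unmodified benign test set."""
    return sum(model(x) == y for x, y in test_set) / len(test_set)

def attack_success_rate(model: Callable, test_set: Sequence[Tuple[object, int]],
                        target: int, trigger: Callable) -> float:
    """ASR: fraction of triggered non-target test images classified as `target`."""
    eligible = [x for x, y in test_set if y != target]  # target-class images excluded (assumed convention)
    return sum(model(trigger(x)) == target for x in eligible) / len(eligible)
```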

3.2 Why Are Clean-label Sample-specific Backdoor Attacks Difficult to Succeed?

As demonstrated in [19], DNNs exploit both trigger-related features and ground-truth features (i.e., features related to the ground-truth class) to learn the target class, and learning ground-truth features undermines the learning of trigger patterns. Accordingly, the direct extension of existing sample-specific backdoor attacks discussed in the previous subsection fails mostly because existing sample-specific trigger patterns are less effective than ground-truth features. In this subsection, we verify and explain this.

3.2.1 Ground-truth Features are Highly Effective

In this part, we demonstrate that ground-truth features are highly effective by showing that we can still get a well-performed model even after distorting them.

Settings. We reduce the effectiveness of ground-truth features by adding adversarial noise, generated by an adversarially trained model, to all training samples, since adversarially robust DNNs mostly exploit ground-truth features for predictions [57]. Specifically, we conduct experiments on the ImageNet subset with VGG-16 and ResNet-18. We use a pre-trained adversarially robust DNN (https://github.com/MadryLab/robustness) to generate adversarial perturbations with budget $\epsilon$ ranging from 0 to 16/255.
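To make the role of the budget $\epsilon$ concrete, the sketch below runs untargeted L-infinity projected gradient descent (PGD) against a toy logistic-regression model. This is an assumption-laden stand-in: the paper uses a pre-trained robust DNN from the robustness library, whose API is not reproduced here; the step size, number of steps, and untargeted objective are illustrative choices, and only $\epsilon = 16/255$ comes from the text.

```python
import math
from typing import List, Sequence

def pgd_linf_logistic(x: Sequence[float], y: int, w: Sequence[float], b: float,
                      eps: float, step: float, n_steps: int) -> List[float]:
    """Untargeted L-infinity PGD against a toy model p(y=1|x) = sigmoid(w.x + b).

    Each step moves along the sign of the input gradient of the cross-entropy loss,
    then projects back onto the eps-ball around x and the valid pixel range [0, 1].
    """
    x_adv = list(x)
    for _ in range(n_steps):
        z = sum(wi * xi for wi, xi in zip(w, x_adv)) + b
        p = 1.0 / (1.0 + math.exp(-z))
        grad = [(p - y) * wi for wi in w]  # d(BCE)/dx for logistic regression
        x_adv = [xa + step * math.copysign(1.0, g) if g != 0 else xa for xa, g in zip(x_adv, grad)]
        # Project onto the L-infinity ball of radius eps around x, then clip to [0, 1].
        x_adv = [min(max(xa, xo - eps), xo + eps) for xa, xo in zip(x_adv, x)]
        x_adv = [min(max(xa, 0.0), 1.0) for xa in x_adv]
    return x_adv

adv = pgd_linf_logistic([0.5, 0.5], y=1, w=[1.0, -1.0], b=0.0, eps=16 / 255, step=4 / 255, n_steps=10)
```

Every pixel of the result stays within 16/255 of the original, which is the sense in which Table II's budgets bound the distortion.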

Figure 2: The attack success rate (ASR, %) of WaNet, ISSBA, and their sample-agnostic versions on the ImageNet dataset with respect to the poisoning rate (%). Panels: (a) WaNet; (b) ISSBA.
TABLE II: The accuracy (%) of models trained on adversarially perturbed samples with budget $\epsilon$ on ImageNet.
Model / $\epsilon$ | 0 | 4/255 | 8/255 | 12/255 | 16/255
VGG-16 | 86.04 | 84.74 | 83.94 | 80.80 | 76.72
ResNet-18 | 79.82 | 78.06 | 75.66 | 70.44 | 64.82

Results. As shown in Table II, the models still maintain high accuracy on benign testing samples even when all training samples are adversarially perturbed with a relatively large budget (e.g., $\epsilon = 16/255$, where VGG-16 still reaches 76.72%). These results verify that ground-truth features are highly effective.

3.2.2 Sample-specific Triggers are More Difficult for DNNs to Learn than Sample-agnostic Ones

In this part, we show empirically and theoretically that sample-specific trigger patterns are more difficult for DNNs to learn than sample-agnostic ones.

Settings. We compare ISSBA and WaNet with their sample-agnostic versions on the ImageNet subset with ResNet-18 under different poisoning rates. We randomly select three different poisoned samples generated by the standard ISSBA and use their pixel-wise differences from their benign versions as trigger patterns, yielding three sample-agnostic versions of ISSBA (dubbed ‘ISSBA-A (a)’, ‘ISSBA-A (b)’, and ‘ISSBA-A (c)’). We design three sample-agnostic versions of WaNet in the same way.
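The construction of a sample-agnostic variant can be sketched as follows: the residual of one (benign, poisoned) pair becomes a fixed trigger that is added to every image. This is a minimal sketch assuming flattened images in [0, 1] and clipping after addition (the clipping is our assumption); the function names are hypothetical.

```python
from typing import List, Sequence

def extract_fixed_trigger(benign: Sequence[float], poisoned: Sequence[float]) -> List[float]:
    """Pixel-wise difference between one poisoned image and its benign version."""
    return [p - b for p, b in zip(poisoned, benign)]

def apply_fixed_trigger(x: Sequence[float], delta: Sequence[float]) -> List[float]:
    """Add the same (sample-agnostic) delta to any image, clipping to [0, 1]."""
    return [min(max(xi + di, 0.0), 1.0) for xi, di in zip(x, delta)]
```

Unlike the original ISSBA, where each image receives its own encoder-generated residual, every image here receives an identical pattern, which is what makes the variant sample-agnostic.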

Results. As shown in Figure 2, the attack success rates (ASRs) of all sample-agnostic versions of ISSBA and WaNet are higher than those of their sample-specific counterparts under all poisoning rates. This gap is significant (i.e., larger than 30%), especially when the poisoning rate is relatively low (e.g., 1%). These results verify the learning difficulty of sample-specific trigger patterns.

To further explain this intriguing phenomenon and understand the difficulty of performing effective sample-specific backdoor attacks, we build on recent studies of the neural tangent kernel (NTK) [58] (inspired by previous works [43, 52]) to analyze models backdoored by sample-specific and sample-agnostic attacks, as follows.

TABLE III: The performance (%) of WaNet-C with different intensities (i.e., strengths) on ImageNet.
Metric / Strength | 0 | 0.5 | 1 | 1.5 | 2
BA | 79.58 | 79.30 | 79.52 | 79.54 | 79.48
ASR | 0.96 | 1.44 | 13.98 | 40.50 | 60.02
TABLE IV: The performance (%) of ISSBA-C with different intensities (i.e., amplification factors) on ImageNet.
Metric / Factor | 0 | 2 | 4 | 6 | 8
BA | 77.60 | 77.84 | 77.74 | 77.66 | 77.76
ASR | 0.90 | 0.94 | 0.92 | 1.10 | 1.48
Theorem 1.

Suppose the training dataset consists of $N_b$ benign samples $\{(\bm{x}_i, y_i)\}_{i=1}^{N_b}$ and $N_p$ poisoned samples $\{(\bm{x}_j', y_t)\}_{j=1}^{N_p}$, whose images are i.i.d. sampled from a uniform distribution and belong to $K$ classes. Assume that the DNN $f(\cdot;\bm{\theta})$ is a multivariate kernel regression $K(\cdot)$ and is trained via

$$\min_{\bm{\theta}} \sum_{i=1}^{N_b} \mathcal{L}\big(f(\bm{x}_i;\bm{\theta}), y_i\big) + \sum_{j=1}^{N_p} \mathcal{L}\big(f(\bm{x}_j';\bm{\theta}), y_t\big),$$

while trigger patterns are additive perturbations. Let $f^{(a)}$ and $f^{(s)}$ denote the models attacked by sample-agnostic and sample-specific attacks, respectively, where both attacks select the same benign samples for poisoning on the same dataset. For their expected predictive confidences over the target label $y_t$, we have:

$$\mathbb{E}_{\hat{\bm{x}}}\big[f^{(a)}(\hat{\bm{x}})\big] - \mathbb{E}_{\widetilde{\bm{x}}}\big[f^{(s)}(\widetilde{\bm{x}})\big] \geq 0, \qquad (1)$$

where $\hat{\bm{x}}$ and $\widetilde{\bm{x}}$ are poisoned testing samples of the sample-agnostic and sample-specific attacks, respectively.

In general, Theorem 1 indicates that models attacked by sample-agnostic attacks predict poisoned samples as the target class with higher confidence than models attacked by sample-specific attacks. In other words, the phenomenon observed above is fundamental: sample-specific triggers are more difficult for DNNs to learn. The proof (with a tighter bound) is given in the appendix.

3.2.3 Can We Achieve Clean-label Sample-specific Backdoor Attacks by Simply Increasing Trigger Intensity?

In Sections 3.2.1-3.2.2, we demonstrated that ground-truth features are ‘strong’ while sample-specific triggers are hard to learn. As such, direct extensions of existing SSBAs to their clean-label versions (with the same trigger settings) may not succeed. A natural question arises: can we achieve an effective clean-label SSBA by simply increasing the intensity of backdoor triggers? We investigate this question below.

Settings. In this part, we conduct experiments on WaNet-C and ISSBA-C with different trigger intensities. Specifically, we set the intensity-related parameter $s$ of WaNet-C as $s \in \{0, 0.5, 1, 1.5, 2\}$, and we amplify the trigger perturbations of ISSBA-C with a factor from 0 to 8 (i.e., $\{0, 2, 4, 6, 8\}$). Other settings are the same as those used in Section 3.1.
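One natural reading of "amplifying the trigger perturbation by a factor" is to scale the residual between the poisoned and benign image and clip the result. The sketch below adopts that reading as an assumption (the exact implementation is not described here) for flattened images in [0, 1]; the function name is hypothetical.

```python
from typing import List, Sequence

def amplify_trigger(benign: Sequence[float], poisoned: Sequence[float], factor: float) -> List[float]:
    """Scale the residual (poisoned - benign) by `factor`, then clip to [0, 1].

    Under this assumed definition, factor = 0 returns the benign image and
    factor = 1 reproduces the original poisoned image.
    """
    return [min(max(b + factor * (p - b), 0.0), 1.0) for b, p in zip(benign, poisoned)]
```

Clipping caps how far a pixel can move, but large factors still magnify every residual uniformly, which is consistent with the visible artifacts discussed below.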

Figure 3: The poisoned images generated by WaNet-C and ISSBA-C with different intensities (i.e., strengths for WaNet-C and amplification factors for ISSBA-C) on the ImageNet dataset. Panels: (a) WaNet-C; (b) ISSBA-C. As shown in this figure, all poisoned images with relatively large intensities are suspicious under human inspection due to their blurring and ringing artifacts.

Results. As shown in Tables III and IV, simply increasing trigger intensity has only a limited effect on the attack success rate, especially for ISSBA-C, whose ASR stays below 1.5% even with an amplification factor of 8; WaNet-C reaches an ASR of 60.02% only at the largest strength ($s = 2$). Moreover, as shown in Figure 3, all poisoned images with relatively large intensities are suspicious under human inspection due to their blurring and ringing artifacts. This is mostly because their trigger patterns are ‘content-irrelevant’ and therefore act as ‘noise’ for both humans and DNNs. In conclusion, we cannot design effective clean-label SSBAs simply by increasing the trigger intensity.

3.3 The Limitations of Clean-label Attacks

As described in Section 2.1, clean-label backdoor attacks are stealthy under human inspection. However, many backdoor defenses can detect them since their trigger patterns are sample-agnostic. Besides, these attacks need a surrogate model to generate poisoned samples, whereas victim users may train a different model architecture. Accordingly, they may suffer from low attack transferability across model architectures. In this subsection, we verify these limitations.

Settings. We adopt the label-consistent attack [20] with a $3 \times 3$ black-white trigger pattern located at the bottom-left corner for discussion. The transparency is set to 0.2, and we train a VGG-16 and a ResNet-18 on the poisoned CIFAR-10 dataset, respectively. The poisoned training dataset is generated based on a pre-trained benign VGG-16 via BackdoorBox [56], where we set the poisoning rate to 8% and adopt its default training settings. Besides, we use neural cleanse [41] to reverse-engineer the trigger pattern for backdoor detection.
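The trigger described above is applied by alpha-blending a small patch into one corner. The sketch below illustrates that step only, as an assumption about how "transparency 0.2" is applied ($x' = (1-\alpha)x + \alpha \cdot \text{patch}$ inside the patch region); it is not BackdoorBox's API. Images are single-channel lists of lists for simplicity (for RGB CIFAR-10 images, the same blend would be applied per channel), and the checkerboard layout of the 3×3 pattern is an illustrative guess.

```python
from typing import List

Image = List[List[float]]  # single-channel image, pixel values in [0, 1]

def blend_corner_patch(img: Image, patch: Image, alpha: float = 0.2) -> Image:
    """Blend `patch` into the bottom-left corner: x' = (1 - alpha) * x + alpha * patch there."""
    h = len(img)
    ph, pw = len(patch), len(patch[0])
    out = [row[:] for row in img]  # leave the benign image untouched
    for r in range(ph):
        for c in range(pw):
            rr = h - ph + r  # bottom rows; columns 0..pw-1 are the leftmost ones
            out[rr][c] = (1 - alpha) * img[rr][c] + alpha * patch[r][c]
    return out

# 3x3 black-white checkerboard (an assumed layout; the exact pattern is not specified here)
CHECKER = [[float((r + c) % 2) for c in range(3)] for r in range(3)]
```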

Figure 4: The ground-truth trigger pattern and the pattern synthesized by neural cleanse for the label-consistent attack. Panels: (a) ground-truth; (b) synthesized.
TABLE V: The performance of the label-consistent attack with different DNNs trained on the poisoned CIFAR-10 generated based on VGG-16. We mark the ASR in red when the victim model is inconsistent with the surrogate model.
Metric / Model | VGG-16 | ResNet-18
BA (%) | 91.55 | 91.70
ASR (%) | 86.99 | 65.78

Results. As shown in Figure 4, the trigger synthesized by neural cleanse is similar to the ground-truth one, i.e., neural cleanse can successfully detect the label-consistent attack. Moreover, as shown in Table V, the attack success rate decreases significantly (by more than 20%, from 86.99% to 65.78%) if the model used by dataset users differs from the one used for generating poisoned samples. This is mainly because existing clean-label backdoor attacks rely on adversarial perturbations, which are model-dependent.

4 The Proposed Method

4.1 Preliminaries

Threat Model. In this paper, we focus on the poison-only backdoor attack in image classification tasks. Poison-only is the hardest attack setting for the adversary and covers the most widespread threat scenarios [8]. Specifically, we assume that the adversaries can only modify some benign samples to generate the poisoned training dataset, and have neither knowledge of nor the ability to modify other training components (e.g., training loss, training schedule, and model structure). The generated poisoned dataset will be released to victims, who will train their DNNs on it. Besides, we assume that the attack is with clean labels, i.e., the adversaries can only poison samples from the target class.

Adversary’s Goals. In general, backdoor adversaries have two main goals: effectiveness and stealthiness. Specifically, effectiveness requires that attacked DNNs predict the target label whenever the backdoor trigger appears, while their performance on benign samples remains on par with that of a model trained on the benign dataset. Stealthiness requires that the attack evades both human inspection and machine detection.

Figure 5: The main pipeline of our backdoor attack with attribute trigger (BAAT). Our method consists of three main stages. In the first stage, the adversaries generate poisoned samples by randomly selecting some benign samples from the target class and modifying the adversary-specified attribute to a pre-defined pattern based on a pre-trained attribute editor. In the second stage, the adversaries release the generated poisoned samples and the remaining benign ones to victim users, who will train their models on them. In the third stage, the adversaries can activate model backdoors to manipulate model predictions to the target label by changing the image attribute to the pre-defined one. In this example, we adopt a pre-defined hairstyle (i.e., ‘purple hi-top’) as our attribute trigger, while the target label is ‘Tom’.

4.2 Backdoor Attack with Attribute Trigger (BAAT)

As demonstrated in Section 3, sample-specific trigger patterns are difficult for DNNs to learn, while the adversaries cannot simply increase the trigger intensity due to stealthiness requirements. We argue that this intensity constraint of existing SSBAs is mostly because their trigger patterns are ‘content-irrelevant’ and therefore act as ‘noises’ for both humans and DNNs.

Motivated by this understanding, we propose to exploit content-relevant features, a.k.a. (human-relied) attributes, as triggers to design clean-label SSBAs. This new attack paradigm is dubbed backdoor attack with attribute trigger (BAAT). We describe its technical details in this section.

Before we describe how to exploit a specific attribute as the trigger pattern, we first briefly review the main pipeline of poison-only backdoor attacks, as follows:

The Main Pipeline of Poison-only Backdoor Attacks. Let $\mathcal{D}=\{(\bm{x}_i,y_i)\}_{i=1}^{N}$ denote the benign training set, where $\bm{x}_i\in\mathcal{X}=\{0,1,\ldots,255\}^{C\times H\times W}$ is the image, $y_i\in\mathcal{Y}=\{1,\ldots,K\}$ is its label, and $K$ is the number of classes. The core of poison-only attacks is generating the poisoned dataset $\mathcal{D}_p$. Specifically, $\mathcal{D}_p$ consists of two disjoint subsets: the modified version of a selected subset $\mathcal{D}_s$ of $\mathcal{D}$, and the remaining benign samples, i.e., $\mathcal{D}_p=\mathcal{D}_m\cup\mathcal{D}_b$, where $y_t$ is an adversary-specified target label, $\mathcal{D}_b=\mathcal{D}\backslash\mathcal{D}_s$, $\mathcal{D}_m=\{(\bm{x}',y_t)\,|\,\bm{x}'=G(\bm{x};\bm{\theta}),(\bm{x},y)\in\mathcal{D}_s\}$, $\gamma\triangleq\frac{|\mathcal{D}_s|}{|\mathcal{D}|}$ is the poisoning rate, and $G_{\bm{\theta}}:\mathcal{X}\rightarrow\mathcal{X}$ is an adversary-specified poisoned image generator with parameter $\bm{\theta}$. Moreover, poison-only backdoor attacks are mainly characterized by their poisoned image generator $G$. For example, $G(\bm{x})=\bm{x}+\bm{t}$ in ISSBA [17], where $\bm{t}$ is the trigger pattern. In particular, $y=y_t,\ \forall(\bm{x},y)\in\mathcal{D}_s$ holds for attacks with clean labels.
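The construction of $\mathcal{D}_p=\mathcal{D}_m\cup\mathcal{D}_b$ under the clean-label constraint can be sketched as follows. This is an illustrative Python sketch, not the authors' code: images are abstracted as opaque values, `generator` stands in for the poisoned image generator $G$, and the function name and toy data are hypothetical.

```python
import random
from typing import Callable, List, Tuple, TypeVar

X = TypeVar("X")  # placeholder type for an image

def build_clean_label_poisoned_set(
    dataset: List[Tuple[X, int]],
    target_label: int,
    generator: Callable[[X], X],
    fraction_of_target: float,
    seed: int = 0,
) -> Tuple[List[Tuple[X, int]], float]:
    """Return (D_p, gamma), where D_p = D_m ∪ D_b and only target-class samples are modified."""
    rng = random.Random(seed)
    target_idx = [i for i, (_, y) in enumerate(dataset) if y == target_label]
    # D_s: a random subset drawn only from the target class (clean-label setting).
    selected = set(rng.sample(target_idx, round(fraction_of_target * len(target_idx))))
    poisoned = []
    for i, (x, y) in enumerate(dataset):
        if i in selected:
            poisoned.append((generator(x), target_label))  # D_m: y already equals y_t
        else:
            poisoned.append((x, y))                        # D_b = D \ D_s, untouched
    gamma = len(selected) / len(dataset)
    return poisoned, gamma

toy = [(i, i % 4) for i in range(40)]  # 40 toy "images" (ints) over 4 classes
d_p, gamma = build_clean_label_poisoned_set(
    toy, target_label=1, generator=lambda x: x + 1000, fraction_of_target=0.8)
print(gamma)  # 0.2
```

Because only target-class samples are selected, every label in $\mathcal{D}_p$ matches its original label, which is exactly the condition $y=y_t$ for all samples in $\mathcal{D}_s$.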

In general, attributes are the high-level features that humans use to describe objects and make predictions. However, it is difficult to formally define an attribute, since the mechanism of the human visual system and the concept of features are very complicated and remain unclear. Fortunately, based on some recent studies [59, 29, 60], we can at least find suitable attributes in image classification tasks. Here we use two representative tasks, i.e., facial image recognition and natural image recognition, as examples to describe how to design our attack with attribute triggers.

Task 1: Design Attribute Triggers in Facial Image Recognition. Facial attribute editing [59, 61, 62] is a classical task that manipulates pre-defined attributes of facial images (e.g., hairstyle) while preserving other details. In this paper, we propose to exploit the attribute editor as our poisoned image generator $G$ to design attribute triggers. We assume that dataset users have no domain knowledge about the target identity, i.e., they have no information about its ground-truth attributes. Specifically, given a (pre-trained) attribute vector $\bm{a}$, the attribute editor $G_{\bm{a}}:\mathcal{X}\rightarrow\mathcal{X}$ transforms input images into variants with attribute $\bm{a}$. For example, $\bm{a}$ could be a specific hairstyle with a special color. Note that the adversaries should assign $\bm{a}$ a value that rarely appears in the dataset. Otherwise, the attack could fail, since samples that share the same attribute but carry labels other than the target one are antagonistic to learning the trigger.
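One way to operationalize the rarity requirement is sketched below. It assumes the adversary has per-sample attribute annotations (e.g., produced by an off-the-shelf attribute classifier, which the paper does not specify) and picks, among candidate trigger values, the one that appears least often outside the target class. All names and data are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence

def attribute_frequencies(attrs: Sequence[Hashable], labels: Sequence[int],
                          target_label: int) -> Dict[Hashable, float]:
    """Share of non-target-class samples that carry each attribute value."""
    others = [a for a, y in zip(attrs, labels) if y != target_label]
    counts = Counter(others)
    return {a: c / len(others) for a, c in counts.items()}

def pick_rare_trigger(candidates: List[Hashable], attrs: Sequence[Hashable],
                      labels: Sequence[int], target_label: int) -> Hashable:
    """Candidate attribute value that is rarest outside the target class."""
    freqs = attribute_frequencies(attrs, labels, target_label)
    # Values never observed outside the target class get frequency 0.0.
    return min(candidates, key=lambda a: freqs.get(a, 0.0))

attrs = ["black", "black", "blond", "purple hi-top", "black", "blond"]
labels = [0, 0, 1, 2, 1, 2]
print(pick_rare_trigger(["purple hi-top", "green hi-top"], attrs, labels, target_label=1))
# green hi-top
```

Here ‘purple hi-top’ already occurs in a non-target class, so those samples would act against the backdoor; the unseen value is preferred.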

Task 2: Design Attribute Triggers in Natural Image Recognition. How to define attributes for natural images is less clear than for facial images. In this paper, we propose to exploit a particular image style (e.g., an ink-like or cartoon-like style) as the attribute trigger. We assume that dataset users have limited domain knowledge of the dataset and therefore treat images whose semantic content is consistent with their label as valid samples. This assumption usually holds, especially when the dataset is relatively large and complicated. Specifically, given an adversary-specified style image $\bm{s}$, we use a (trained) style transformer $T:\mathcal{X}\times\mathcal{X}\rightarrow\mathcal{X}$ as the poisoned image generator, i.e., $G(\bm{x})=T(\bm{x},\bm{s})$, to stylize the selected images for poisoning.

The Main Pipeline of BAAT. Once $\mathcal{D}_p$ is obtained by our BAAT, it will be released to train the victim model $f_{\bm{w}}$ by $\min_{\bm{w}}\sum_{(\bm{x},y)\in\mathcal{D}_p}\mathcal{L}(f_{\bm{w}}(\bm{x}),y)$, where $\mathcal{L}$ is the loss function (e.g., cross-entropy). As such, during inference, the attacked DNNs behave normally on benign samples, while their predictions will be maliciously and consistently changed to $y_t$ whenever the trigger patterns appear. The main pipeline of our BAAT is shown in Figure 5.

Figure 6: Examples of samples involved in different backdoor attacks on (a) VGGFace2 and (b) ImageNet. In this figure, we also provide the assigned label of each image. We mark labels that match the ground-truth label of their corresponding images in green and those that differ in red.

5 Experiments

5.1 Settings

Dataset and Model. In this paper, we conduct experiments on two classical benchmark datasets, VGGFace2 [63] and ImageNet [53], with VGG-16 [54] and ResNet-18 [55]. For simplicity, we select a random subset of 20 identities from VGGFace2 and a random subset of 100 classes from ImageNet. Each VGGFace2 identity contains 400 images for training and 100 images for testing, and the settings of the ImageNet subset are the same as those used in Section 3.1. All images are resized to 3×128×128.

Baseline Selection. We compare our BAAT with four classical attacks, including WaNet [18], ISSBA [17], label-consistent attack (dubbed ‘LC’) [20], and TUAP [21]. The first two methods are representative of poison-only sample-specific backdoor attacks with poisoned labels, while the last two methods are representative of attacks with clean labels. We also provide the clean-label variants of WaNet and ISSBA and the model trained on the benign dataset (dubbed ‘No Attack’) as other baselines for reference.

Attack Setup. We set $y_t=1$ and poison 80% of the samples from the target class for all clean-label attacks on both datasets. We poison the same number of samples for poisoned-label attacks, i.e., 4% on VGGFace2 and 0.8% on ImageNet. Specifically, we use HairCLIP [62] to apply a purple ‘hi-top’ hairstyle as our attribute trigger on VGGFace2, and ArtFlow [64] to apply an oil-painting style as our attribute trigger on ImageNet. Unless otherwise specified, the settings of WaNet, WaNet-C, ISSBA, and ISSBA-C are the same as those used in Section 3. For the label-consistent attack, unlike on the CIFAR-10 dataset, we adopt a 6×6 black-white square on the four corners as the trigger pattern, with maximum adversarial perturbation size $\epsilon=8/255$. We set the maximum adversarial perturbation size $\epsilon=4/255$ for TUAP. Examples of poisoned samples generated by different attacks are shown in Figure 6.
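The poisoning rates quoted for the poisoned-label baselines follow directly from $\gamma\triangleq|\mathcal{D}_s|/|\mathcal{D}|$. The quick check below assumes balanced classes, which holds for the 400-images-per-identity VGGFace2 subset and is assumed here for the ImageNet subset, whose per-class count is given in Section 3.1. Under that assumption the per-class count cancels, and $\gamma$ is the poisoned fraction of the target class divided by the number of classes. The helper name is hypothetical.

```python
def poisoning_rate(num_classes: int, fraction_of_target: float) -> float:
    """gamma = |D_s| / |D| when only the target class is poisoned.

    With balanced classes of n samples each:
    gamma = fraction * n / (num_classes * n) = fraction / num_classes.
    """
    return fraction_of_target / num_classes

print(round(0.8 * 400))          # 320 poisoned images of the target identity on VGGFace2
print(poisoning_rate(20, 0.8))   # VGGFace2, 80% of the target class: about 0.04
print(poisoning_rate(100, 0.8))  # ImageNet subset, 80% of the target class: about 0.008
print(poisoning_rate(20, 0.6), poisoning_rate(100, 0.6))  # 60% setting: about 0.03, 0.006
```

The 60% case reproduces the $\gamma=3\%$ and $\gamma=0.6\%$ values used in the poisoning-rate ablation.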

Training Setup. Following the settings in [17], we train models from scratch on VGGFace2 and fine-tune models pre-trained on the full ImageNet dataset on our ImageNet subset. Specifically, we use the SGD optimizer with momentum 0.9, weight decay of $5\times10^{-4}$, and an initial learning rate of 0.001. The batch size is set to 64 on VGGFace2 and 128 on ImageNet, and the learning rate is decayed by a factor of 0.1 after epochs 15 and 20. We adopt random horizontal flipping as our data augmentation. All experiments are conducted with a single Tesla V100 GPU.
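The step schedule can be written as a small function. This sketch assumes that epochs are 0-indexed and that the decay is applied once per epoch. Under that reading it should match PyTorch's `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[15, 20], gamma=0.1)` paired with `torch.optim.SGD(params, lr=1e-3, momentum=0.9, weight_decay=5e-4)`, although the paper does not state which implementation was used. The total number of training epochs is not given here, so the example only probes a few epochs.

```python
from typing import Tuple

def learning_rate(epoch: int, base_lr: float = 1e-3,
                  milestones: Tuple[int, ...] = (15, 20), factor: float = 0.1) -> float:
    """Step decay: multiply base_lr by `factor` once per milestone already reached.

    `epoch` is 0-indexed, so epoch 15 is the first epoch after 15 full epochs.
    """
    num_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** num_decays

for e in (0, 14, 15, 19, 20):
    print(e, learning_rate(e))  # 1e-3 until epoch 14, then 1e-4, then 1e-5 from epoch 20
```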

Evaluation Metric. Following the classical settings used in the existing backdoor attacks, we use the benign accuracy (BA) and attack success rate (ASR) for evaluation. In general, the larger the BA and ASR, the better the attack.
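For concreteness, here is a minimal sketch of the two metrics. It follows the common convention that ASR is computed only on triggered test images whose ground-truth label differs from $y_t$. The paper does not restate this detail here, so treat it as an assumption. The function names are hypothetical.

```python
from typing import Sequence

def benign_accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """BA: fraction of benign test images classified correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def attack_success_rate(triggered_preds: Sequence[int], true_labels: Sequence[int],
                        target_label: int) -> float:
    """ASR: fraction of triggered images (ground truth != target) predicted as target."""
    kept = [p for p, y in zip(triggered_preds, true_labels) if y != target_label]
    return sum(p == target_label for p in kept) / len(kept)

print(benign_accuracy([0, 1, 2, 1], [0, 1, 1, 1]))         # 0.75
print(attack_success_rate([1, 0, 1, 1], [0, 2, 1, 3], 1))  # 2 of the 3 non-target images
```

Excluding target-class images avoids inflating ASR with inputs the model would assign to $y_t$ anyway.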

TABLE VI: Results on the VGGFace2 dataset. Among all clean-label backdoor attacks, the best result is indicated in boldface while the underlined value denotes the second-best result. Besides, we mark all failed cases (i.e., ASR < 20%) in red.
Model | Metric | No Attack | WaNet | WaNet-C | ISSBA | ISSBA-C | LC | TUAP | BAAT (Ours)
VGG-16 | BA (%) | 80.20 | 79.30 | 79.60 | 75.85 | 77.05 | 80.00 | 79.50 | 79.65
VGG-16 | ASR (%) | N/A | 71.90 | 14.45 | 9.15 | 4.70 | 4.55 | 46.40 | 78.15
ResNet-18 | BA (%) | 78.60 | 73.95 | 75.85 | 71.05 | 73.45 | 77.75 | 76.25 | 77.15
ResNet-18 | ASR (%) | N/A | 29.25 | 9.90 | 8.75 | 4.15 | 4.55 | 55.90 | 80.60
TABLE VII: Results on the ImageNet dataset. Among all clean-label backdoor attacks, the best result is indicated in boldface while the underlined value denotes the second-best result. Besides, we mark all failed cases (i.e., ASR < 20%) in red.
Model | Metric | No Attack | WaNet | WaNet-C | ISSBA | ISSBA-C | LC | TUAP | BAAT (Ours)
VGG-16 | BA (%) | 86.04 | 85.44 | 85.32 | 85.04 | 85.20 | 86.08 | 86.22 | 87.40
VGG-16 | ASR (%) | N/A | 76.42 | 2.16 | 1.46 | 0.90 | 0.72 | 16.28 | 66.44
ResNet-18 | BA (%) | 79.82 | 79.42 | 79.58 | 77.74 | 77.60 | 79.74 | 79.38 | 82.46
ResNet-18 | ASR (%) | N/A | 40.82 | 0.96 | 1.78 | 0.90 | 0.82 | 19.06 | 59.28

5.2 Main Results

As shown in Tables VI-VII, our BAAT is significantly better than all clean-label backdoor attacks, whether they are variants of sample-specific attacks (i.e., WaNet-C and ISSBA-C) or designed with sample-agnostic triggers (i.e., LC and TUAP). For example, the attack success rates (ASRs) of our method are more than 40 percentage points higher than those of all other clean-label attacks on the ImageNet dataset. The ASR values of our BAAT are larger than 55% in all cases. In particular, the attack performance of our method is on par with or even better than that of sample-specific backdoor attacks with poisoned labels (i.e., WaNet and ISSBA). Moreover, the benign accuracy (BA) of models under our BAAT is also on par with that of models trained on the benign dataset. Interestingly, on ImageNet, the BAs under our method are even higher than those under no attack. This is most probably because the style transfer used in our attack serves, to some extent, as an effective data augmentation (since we do not re-assign the labels of poisoned samples), which is harmless or even beneficial. We will further explore this in future work. These results verify the effectiveness of our attribute-based trigger patterns.

5.3 Ablation Study

In this section, we discuss the effects of key hyper-parameters involved in our BAAT. We adopt ResNet-18 as an example for discussions. Unless otherwise specified, all settings are the same as those illustrated in Section 5.1.

5.3.1 The Effects of Trigger Pattern

Figure 7: Four style images used in our ablation study.

Settings. In this part, we discuss whether our method is still effective with different trigger patterns. Specifically, we use four different hairstyles on the VGGFace2 dataset: (a) purple hi-top, (b) green hi-top, (c) purple jewfro, and (d) green jewfro. Besides, we adopt four different style images (shown in Figure 7) on the ImageNet dataset.

Results. As shown in Table VIII, our BAAT is effective with each trigger pattern, although the performance fluctuates somewhat. Specifically, the ASRs are larger than 70% in all cases on the VGGFace2 dataset and larger than 55% in all cases on the ImageNet dataset. These results verify that our BAAT method can achieve promising attack performance with different adversary-specified trigger patterns.

TABLE VIII: The effectiveness of our BAAT method with different trigger patterns on VGGFace2 and ImageNet.
Dataset | Metric | Pattern (a) | Pattern (b) | Pattern (c) | Pattern (d)
VGGFace2 | BA (%) | 77.15 | 76.90 | 77.00 | 76.90
VGGFace2 | ASR (%) | 80.60 | 86.60 | 74.05 | 81.55
ImageNet | BA (%) | 82.46 | 82.48 | 82.26 | 82.26
ImageNet | ASR (%) | 59.28 | 59.12 | 55.76 | 64.26
TABLE IX: The effectiveness of our BAAT method with different target labels on VGGFace2 and ImageNet.
Dataset | Metric | Label 1 | Label 2 | Label 3 | Label 4
VGGFace2 | BA (%) | 77.15 | 76.45 | 76.55 | 77.30
VGGFace2 | ASR (%) | 80.60 | 78.10 | 88.80 | 84.45
ImageNet | BA (%) | 82.46 | 82.54 | 82.52 | 82.56
ImageNet | ASR (%) | 59.28 | 58.32 | 59.34 | 57.70

5.3.2 The Effects of Target Label

To verify that our BAAT is still effective with different target labels, we evaluate it with four different target labels. As shown in Table IX, our BAAT is effective in all cases, although the performance fluctuates somewhat. For example, the ASRs are larger than 75% in all cases on the VGGFace2 dataset and larger than 55% in all cases on the ImageNet dataset. These results again verify the effectiveness of BAAT.

Figure 8: The effects of poisoning rate on our BAAT on (a) VGGFace2 and (b) ImageNet.
Figure 9: The resistance to fine-tuning on (a) VGGFace2 and (b) ImageNet.
Figure 10: The resistance to model pruning on (a) VGGFace2 and (b) ImageNet.

5.3.3 The Effects of Poisoning Rate

In this part, we analyze how the poisoning rate affects our BAAT. As shown in Figure 8, the attack success rate (ASR) increases with the poisoning rate $\gamma$. In particular, our BAAT reaches a high ASR (>50%) on both datasets by poisoning only 60% of the training samples from the target class (i.e., $\gamma=3\%$ on VGGFace2 and $\gamma=0.6\%$ on ImageNet). Besides, the benign accuracy (BA) decreases as $\gamma$ increases, although the decline is relatively slow. In other words, there is a trade-off between ASR and BA to some extent. Accordingly, the adversaries should set $\gamma$ based on their specific needs.

5.4 The Resistance to Potential Defenses

In this section, we verify that our BAAT is resistant to representative backdoor defenses. For simplicity, we hereby also adopt ResNet-18 for our discussions.

5.4.1 The Resistance to Classical Model Repairing

Model repairing intends to directly remove backdoors from the attacked models by modifying their parameters. In this part, we explore the resistance of our BAAT to two classical and representative methods, including fine-tuning [34, 38, 65] and model pruning [38, 39, 66].

Settings. For fine-tuning, we fine-tune the fully-connected layers of the attacked model with 50% of the benign training samples for 30 epochs, with the learning rate set to 0.1. The benign accuracy and attack success rate are evaluated after each epoch. For model pruning, we conduct channel pruning [67] on the output of the last convolutional layer with 10% of the benign training samples on both datasets. The pruning rate is set to $\beta\in\{0\%,2\%,\cdots,98\%\}$.
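The pruning sweep can be sketched as follows. The paper only cites channel pruning [67]. As an illustration, this sketch assumes the activation-based criterion popularized by fine-pruning, which masks the channels of the last convolutional layer with the smallest mean activation on held-out benign samples. Whether [67] ranks channels exactly this way is an assumption, and the helper names and activations are hypothetical.

```python
from typing import List, Sequence

def channels_to_prune(mean_activations: Sequence[float], rate: float) -> List[int]:
    """Indices of the least-activated channels to mask at pruning rate `rate`."""
    n_prune = round(len(mean_activations) * rate)  # nearest integer number of channels
    order = sorted(range(len(mean_activations)), key=lambda c: mean_activations[c])
    return sorted(order[:n_prune])

def apply_mask(feature_map: List[List[float]], pruned: List[int]) -> List[List[float]]:
    """Zero the pruned channels of a (channels x positions) feature map."""
    pruned_set = set(pruned)
    return [[0.0] * len(row) if c in pruned_set else list(row)
            for c, row in enumerate(feature_map)]

# Pruning rates beta = 0%, 2%, ..., 98%, as in the sweep described above.
betas = [k / 100 for k in range(0, 100, 2)]
acts = [0.5, 0.1, 0.9, 0.05, 0.3]  # hypothetical mean activations of 5 channels
print(channels_to_prune(acts, 0.4))  # [1, 3]
```

For each $\beta$, BA and ASR would then be re-evaluated on the masked model.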

Results. As shown in Figures 9-10, our method is resistant to fine-tuning and model pruning on both the VGGFace2 and ImageNet datasets. Specifically, the attack success rate (ASR) remains larger than 70% throughout the fine-tuning process on VGGFace2. Besides, model pruning can significantly reduce our ASR, but only at the cost of a large drop in benign accuracy. These results verify the robustness of our BAAT method.

Figure 11: The ground-truth trigger pattern of BadNets and synthesized patterns of BadNets and our BAAT. (a) The ground-truth trigger pattern; (b)&(d) The synthesized trigger patterns of BadNets on VGGFace2 and ImageNet, respectively; (c)&(e) The synthesized trigger patterns of our BAAT on VGGFace2 and ImageNet, respectively.
Figure 12: The Grad-CAM of poisoned samples generated by BadNets and our BAAT on (a) VGGFace2 and (b) ImageNet.
TABLE X: The resistance to MCR and NAD.
Method | VGGFace2 BA (%) | VGGFace2 ASR (%) | ImageNet BA (%) | ImageNet ASR (%)
No Defense | 77.15 | 80.60 | 82.46 | 59.28
MCR | 77.65 | 17.60 | 82.06 | 43.08
NAD | 74.40 | 76.25 | 68.16 | 14.38

5.4.2 The Resistance to Advanced Model Repairing

Settings. We evaluate the resistance of our BAAT to advanced and representative model-repairing-based methods, including mode connectivity repair (MCR) [35] and neural attention distillation (NAD) [36]. Specifically, for MCR, we use the model after fine-tuning as the second attacked DNN (i.e., the other endpoint of the curve) and train a Bezier connection curve with 10% of the benign training samples for 100 epochs. Besides, we set $t=0.2$ for repairing. For NAD, we set the hyper-parameter of the attention loss to 1. We implement both methods based on the code provided in BackdoorBox [56].
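MCR connects two endpoint models with a curve in weight space and repairs the backdoor by taking the model at an intermediate point $t$. The sketch below shows the quadratic Bezier parameterization commonly used for this, $\phi_{\bm{\theta}}(t)=(1-t)^2\bm{w}_1+2t(1-t)\bm{\theta}+t^2\bm{w}_2$. Here $\bm{w}_1$ and $\bm{w}_2$ are the flattened weights of the two attacked DNNs, and $\bm{\theta}$ is the control point trained on benign data (its training loop is omitted). Weights are plain Python lists purely for illustration, and the helper name is hypothetical.

```python
from typing import List, Sequence

def bezier_point(w1: Sequence[float], theta: Sequence[float],
                 w2: Sequence[float], t: float) -> List[float]:
    """Weights at position t of a quadratic Bezier curve from w1 (t=0) to w2 (t=1)."""
    a, b, c = (1 - t) ** 2, 2 * t * (1 - t), t ** 2
    return [a * p + b * q + c * r for p, q, r in zip(w1, theta, w2)]

# Toy 2-parameter "models"; in MCR, theta would be trained on the benign samples.
repaired = bezier_point([1.0, 0.0], [0.0, 0.0], [0.0, 1.0], t=0.2)
print(repaired)  # approximately [0.64, 0.04]
```

At $t=0.2$ the repaired model stays close to the first endpoint, with 32% of the weight placed on the trained control point.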

Results. As shown in Table X, our BAAT preserves a relatively high attack success rate (>15%) after defenses in most cases. In particular, the ASR remains above 10% on the ImageNet dataset under NAD, while NAD decreases the benign accuracy by nearly 15%. In conclusion, our BAAT is also resistant to these defenses to a large extent.

5.4.3 The Resistance to Trigger-synthesis-based Defenses

In this part, we show that our BAAT is also resistant to neural cleanse [41] and SentiNet [45], which are two representative types of trigger-synthesis-based defenses.

Settings. We adopt BadNets with a 12×12 white square located at the right corner of images as a reference, since it can be detected by neural cleanse and SentiNet. All other settings are the same as those presented in Section 5.1. For neural cleanse, we implement it based on its open-source code and default settings. For SentiNet, we generate the saliency maps of DNNs attacked by BadNets and our BAAT based on Grad-CAM [46] with its default settings.

Results. As shown in Figure 11, the synthesized pattern of BadNets is similar to its ground-truth trigger pattern, whereas that of our attack is meaningless (i.e., it is neither scattered throughout the whole image nor concentrated at the hair location). Besides, as shown in Figure 12, SentiNet can locate the trigger regions in samples generated by BadNets, while it fails to detect those in samples generated by our BAAT, since the saliency maps cover roughly the object outline or even the whole image. These results indicate that our attack resists both neural cleanse and SentiNet.

TABLE XI: The resistance to Auto-Encoder and ShrinkPad.
Method | VGGFace2 BA (%) | VGGFace2 ASR (%) | ImageNet BA (%) | ImageNet ASR (%)
No Defense | 77.15 | 80.60 | 82.46 | 59.28
Auto-Encoder | 73.85 | 68.55 | 64.74 | 47.20
ShrinkPad | 67.60 | 35.65 | 73.88 | 37.62
TABLE XII: The entropy generated by STRIP of different attacks. The higher the entropy, the harder the detection.
Attack | VGGFace2 | ImageNet
BadNets | 0.220 | 0.446
BAAT (Ours) | 0.814 | 1.039

5.4.4 The Resistance to Pre-processing-based Defenses

In this part, we discuss whether our BAAT is resistant to auto-encoder-based pre-processing (dubbed ‘Auto-Encoder’) [34] and ShrinkPad [13], which are two representative pre-processing-based defenses.

Settings. For Auto-Encoder, we adopt an auto-encoder pre-trained on the ImageNet dataset. Specifically, we first resize the images from 3×128×128 to 3×224×224 before feeding them into the auto-encoder. After that, we shrink the pre-processed images back to 3×128×128 and compute the benign accuracy and the attack success rate on them. We implement ShrinkPad based on BackdoorBox, with the shrinking size set to 12 pixels on both datasets.
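The sketch below illustrates the ShrinkPad transformation on a single-channel image stored as nested lists. The image is shrunk by `shrink` pixels and then zero-padded back to its original size at a random offset. Nearest-neighbour resizing is an assumption made to keep the example dependency-free, and BackdoorBox's implementation may resize differently. For RGB inputs, the same offset would be applied to every channel. The helper name is hypothetical.

```python
import random
from typing import List

Image = List[List[float]]  # single-channel H x W image

def shrink_pad(img: Image, shrink: int, rng: random.Random) -> Image:
    """Shrink `img` by `shrink` pixels per side length, then zero-pad at a random offset."""
    h, w = len(img), len(img[0])
    nh, nw = h - shrink, w - shrink
    # Nearest-neighbour downscaling to nh x nw (an assumption; other interpolations exist).
    small = [[img[i * h // nh][j * w // nw] for j in range(nw)] for i in range(nh)]
    # Random placement of the shrunk image inside an all-zero canvas of the original size.
    top, left = rng.randint(0, shrink), rng.randint(0, shrink)
    out = [[0.0] * w for _ in range(h)]
    for i in range(nh):
        for j in range(nw):
            out[top + i][left + j] = small[i][j]
    return out

rng = random.Random(0)
image = [[1.0] * 128 for _ in range(128)]  # a dummy 128 x 128 channel
defended = shrink_pad(image, shrink=12, rng=rng)
```

The random offset is what disrupts static, location-dependent triggers, while large attribute triggers largely survive the shift.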

Results. As shown in Table XI, Auto-Encoder brings only minor benefits in reducing our attack success rate: the ASRs remain larger than 45% after Auto-Encoder on both datasets. This is mostly because our triggers are not small-magnitude additive perturbations, although they are still stealthy to human inspection. Besides, our attack is also resistant to ShrinkPad to a large extent, although ShrinkPad reduces our ASR (to 35.65% on VGGFace2 and 37.62% on ImageNet) while also lowering the benign accuracy. This is mostly because our trigger patterns are large and not static.

5.4.5 The Resistance to Sample-filtering-based Defenses

In this part, we examine whether our attack can circumvent representative sample-level backdoor detection methods, including STRIP [51] and SCALE-UP [52].

TABLE XIII: The AUROC of SCALE-UP in detecting BadNets and our BAAT on VGGFace2 and ImageNet datasets.
Attack | VGGFace2 | ImageNet
BadNets | 0.853 | 0.936
BAAT (Ours) | 0.472 | 0.310

Settings. We adopt the same BadNets model obtained in Section 5.4.3 for comparative experiments on STRIP. Following the settings in [52], we use a 12×12 random-noise trigger pattern to train a new BadNets model for comparative experiments on SCALE-UP. We implement STRIP and SCALE-UP based on their open-source code.

Results. As shown in Table XII, the entropy of our BAAT is significantly higher than that of BadNets on both datasets. These results indicate that STRIP can hardly detect our attack. Besides, as shown in Table XIII, our attack can also circumvent the detection of SCALE-UP, whereas BadNets cannot. These results verify the stealthiness of our BAAT.
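To clarify what Table XII measures: STRIP superimposes each test input with several clean images, queries the model, and averages the Shannon entropy of the resulting predictions. A backdoored input tends to keep predicting the target label, which yields low entropy. The sketch below shows only the scoring step on precomputed probability vectors. It assumes base-2 logarithms and a simple linear blend with a hypothetical weight `alpha`; the absolute scale of the entropy depends on these choices, and the helper names are hypothetical.

```python
import math
from typing import List, Sequence

def superimpose(x: Sequence[float], clean: Sequence[float], alpha: float = 0.5) -> List[float]:
    """Linear blend of a (flattened) test input with a clean image."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(x, clean)]

def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in bits of one softmax output."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def strip_score(perturbed_probs: Sequence[Sequence[float]]) -> float:
    """Mean prediction entropy over the superimposed copies of one input."""
    return sum(shannon_entropy(p) for p in perturbed_probs) / len(perturbed_probs)

# Hypothetical model outputs on two blended copies of the same input.
print(strip_score([[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]))  # 0.5
```

A higher score, as for BAAT in Table XII, means the blended copies no longer collapse onto a single label, so the input looks benign to STRIP.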

6 Discussions

6.1 The Comparison to Related Works

6.1.1 The Comparison to Data Poisoning

As introduced in [8], there are two types of data poisoning: classical data poisoning [68] and advanced data poisoning [69]. Specifically, the former intends to reduce model generalization, so that attacked models correctly predict training samples but perform poorly on testing samples. The latter leads attacked models to have satisfactory test accuracy while misclassifying some adversary-specified (unmodified) samples. Both our BAAT and data poisoning intend to implant malicious prediction behaviors by poisoning some training samples. However, they still have many intrinsic differences.

The Comparison to Classical Data Poisoning. Firstly, our BAAT has a different purpose: it preserves high accuracy in predicting benign testing samples, whereas classical data poisoning does not. Accordingly, our method is more stealthy, since users can easily detect classical data poisoning by evaluating model performance on a local verification set, while such a check is of little use against our BAAT. Secondly, our method has a different mechanism. Specifically, the effectiveness of classical data poisoning is mostly due to the sensitivity of the training process, so that even a small domain shift of training samples may lead to significantly different decision surfaces of attacked models. In contrast, BAAT relies on the data-driven model training process and domain shift between training and testing samples.

The Comparison to Advanced Data Poisoning. Firstly, advanced data poisoning can only cause a few pre-defined images to be misclassified, whereas our BAAT can cause the misclassification of all images containing the trigger pattern. This is mostly due to their second difference: advanced data poisoning does not modify the images before they are fed into the attacked DNNs during inference, whereas BAAT adds the trigger at that stage. Thirdly, the effectiveness of advanced data poisoning is mainly because DNNs are over-parameterized, so the decision surface can have sophisticated structures near the adversary-specified samples that lead to their misclassification. This mechanism also differs from that of our BAAT.

6.1.2 The Comparison to Adversarial Attacks

Both our BAAT and adversarial attacks [70] intend to make the DNNs misclassify samples during the inference process by adding malicious perturbations. However, they still have many essential differences, as follows.

Firstly, the success of adversarial attacks is mostly due to the behavioral differences between DNNs and humans, which is different from the mechanism of our attack. Secondly, the malicious perturbations of BAAT are pre-defined (i.e., not optimized), whereas adversarial attacks need to obtain them through an optimization process. As such, adversarial attacks cannot run in real time in many cases, since the optimization requires querying the DNNs multiple times under either white-box or black-box settings. Lastly, our BAAT requires modifying the training samples without any additional requirements in the inference process, while adversarial attacks need to control the inference process to some extent.

6.1.3 The Comparison to Style-based Attacks

We notice that a few other works [71, 29] also attack DNNs based on style transfer. In this part, we compare our BAAT to them.

[71] adopted style transfer to generate adversarial examples in both digital and physical-world scenarios. Similar to existing adversarial attacks, this method obtained (style-based) perturbations by optimization, which takes time. Besides, it was designed under the white-box setting where the adversary can obtain the source files of the target model. In contrast, our BAAT does not have these limitations.

[29] also adopted style transfer to design the backdoor attack, which is closely related to our method. However, this attack needed to control the training process of attacked DNNs, whereas our BAAT only needs to poison a few training samples. Besides, this attack was designed under the poisoned-label setting, whereas our method is under the clean-label setting. These differences make our attack more practical and therefore more threatening.

Note that we adopt style transfer only as an example of how to generate attribute triggers for natural images. Users may use other methods based on their domain knowledge of the target task.

6.2 Potential Negative Societal Impacts & Limitations

In this paper, our main goal is to design a simple yet effective tool to evaluate the backdoor robustness of existing DNN-based classifiers. However, we notice that our BAAT is resistant to existing backdoor defenses and could be used by backdoor adversaries for malicious purposes. The adversaries may also design similar attacks against other tasks inspired by our research. Although an effective defense is yet to be developed, one may mitigate or even avoid this threat by using only fully trusted training resources. Our next step is to design principled and advanced defenses against BAAT-type backdoor attacks.

Our method cannot optimize the attribute trigger because of its discontinuity and non-differentiability, although handcrafted attributes (as used in BAAT) already achieve a sufficiently high attack success rate. Our work is only the first step towards clean-label sample-specific backdoor attacks. In future work, we will explore how to optimize attribute triggers and how to generalize BAAT to other modalities, such as audio and text.

7 Conclusion

In this paper, we revisited the sample-specific backdoor attack (SSBA). We revealed that existing SSBAs are not sufficiently stealthy due to their poisoned-label nature, since users can discover anomalies by checking the image-label relationship. We found that extending existing methods to the clean-label setting simply by poisoning samples only from the target class has minor effects, and we analyzed the reasons for this failure. Based on our analyses, we designed the backdoor attack with attribute trigger (BAAT), inspired by the decision process of humans. BAAT is the first effective sample-specific backdoor attack with clean labels, and it is also resistant to existing defenses to a large extent. We hope that our attack can serve as a strong baseline to facilitate the design of more robust and secure DNNs.

Acknowledgments

This work was partly done when Yiming Li was a research intern at Ant Group. We also sincerely thank Mr. Chengxiao Luo from Tsinghua University for implementing some preliminary experiments on the VGGFace2 dataset, and Prof. Yong Jiang from Tsinghua University and Dr. Haiqin Weng from Ant Group for their valuable comments and suggestions on an early draft of this paper.

References

  • [1] D. Gong, Z. Li, J. Liu, and Y. Qiao, “Multi-feature canonical correlation analysis for face photo-sketch image retrieval,” in ACM MM, 2013.
  • [2] H. Qiu, B. Yu, D. Gong, Z. Li, W. Liu, and D. Tao, “SynFace: Face recognition with synthetic data,” in ICCV, 2021.
  • [3] Y. Ren, Z. Song, S. Sun, J. Liu, and G. Feng, “Outsourcing LDA-based face recognition to an untrusted cloud,” IEEE Transactions on Dependable and Secure Computing, 2022.
  • [4] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in ICASSP, 2018.
  • [5] G. Chen, Z. Zhao, F. Song, S. Chen, L. Fan, F. Wang, and J. Wang, “Towards understanding and mitigating audio adversarial examples for speaker recognition,” IEEE Transactions on Dependable and Secure Computing, 2022.
  • [6] P. Cheng, Y. Wu, Y. Hong, Z. Ba, F. Lin, L. Lu, and K. Ren, “UniAP: Protecting speech privacy with non-targeted universal adversarial perturbations,” IEEE Transactions on Dependable and Secure Computing, 2023.
  • [7] T. Gu, K. Liu, B. Dolan-Gavitt, and S. Garg, “BadNets: Evaluating backdooring attacks on deep neural networks,” IEEE Access, vol. 7, pp. 47230–47244, 2019.
  • [8] Y. Li, Y. Jiang, Z. Li, and S.-T. Xia, “Backdoor learning: A survey,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [9] Y. Wang, M. Zhao, S. Li, X. Yuan, and W. Ni, “Dispersed pixel perturbation-based imperceptible backdoor trigger for image classifier models,” IEEE Transactions on Information Forensics and Security, vol. 17, pp. 3091–3106, 2022.
  • [10] X. Chen, C. Liu, B. Li, K. Lu, and D. Song, “Targeted backdoor attacks on deep learning systems using data poisoning,” arXiv preprint arXiv:1712.05526, 2017.
  • [11] X. Gong, Y. Chen, H. Huang, W. Kong, Z. Wang, C. Shen, and Q. Wang, “KerbNet: A QoE-aware kernel-based backdoor attack framework,” IEEE Transactions on Dependable and Secure Computing, 2023.
  • [12] W. Jiang, H. Li, G. Xu, and T. Zhang, “Color backdoor: A robust poisoning attack in color space,” in CVPR, 2023.
  • [13] Y. Li, T. Zhai, Y. Jiang, Z. Li, and S.-T. Xia, “Backdoor attack in the physical world,” in ICLR Workshop, 2021.
  • [14] E. Wenger, J. Passananti, A. N. Bhagoji, Y. Yao, H. Zheng, and B. Y. Zhao, “Backdoor attacks against deep learning systems in the physical world,” in CVPR, 2021.
  • [15] X. Gong, Z. Wang, Y. Chen, M. Xue, Q. Wang, and C. Shen, “Kaleidoscope: Physical backdoor attacks against deep neural networks with RGB filters,” IEEE Transactions on Dependable and Secure Computing, 2023.
  • [16] T. A. Nguyen and A. Tran, “Input-aware dynamic backdoor attack,” in NeurIPS, 2020.
  • [17] Y. Li, Y. Li, B. Wu, L. Li, R. He, and S. Lyu, “Invisible backdoor attack with sample-specific triggers,” in ICCV, 2021.
  • [18] A. Nguyen and A. Tran, “WaNet – Imperceptible warping-based backdoor attack,” in ICLR, 2021.
  • [19] Y. Gao, Y. Li, L. Zhu, D. Wu, Y. Jiang, and S.-T. Xia, “Not all samples are born equal: Towards effective clean-label backdoor attacks,” Pattern Recognition, vol. 139, p. 109512, 2023.
  • [20] A. Turner, D. Tsipras, and A. Madry, “Label-consistent backdoor attacks,” arXiv preprint arXiv:1912.02771, 2019.
  • [21] S. Zhao, X. Ma, X. Zheng, J. Bailey, J. Chen, and Y.-G. Jiang, “Clean-label backdoor attacks on video recognition models,” in CVPR, 2020.
  • [22] Z. Wang, J. Zhai, and S. Ma, “BppAttack: Stealthy and efficient trojan attacks against deep neural networks via image quantization and contrastive adversarial learning,” in CVPR, 2022.
  • [23] Z. Zhao, X. Chen, Y. Xuan, Y. Dong, D. Wang, and K. Liang, “DEFEAT: Deep hidden feature backdoor attacks by imperceptible perturbation and latent representation constraints,” in CVPR, 2022.
  • [24] T. Zhai, Y. Li, Z. Zhang, B. Wu, Y. Jiang, and S.-T. Xia, “Backdoor attack against speaker verification,” in ICASSP, 2021.
  • [25] Z. Xi, R. Pang, S. Ji, and T. Wang, “Graph backdoor,” in USENIX Security, 2021.
  • [26] Z. Xiang, D. J. Miller, S. Chen, X. Li, and G. Kesidis, “A backdoor attack against 3D point cloud classifiers,” in ICCV, 2021.
  • [27] J. Guo, A. Li, L. Wang, and C. Liu, “PolicyCleanse: Backdoor detection and mitigation for competitive reinforcement learning,” in ICCV, 2023, pp. 4699–4708.
  • [28] J. Guo and C. Liu, “Practical poisoning attacks on neural networks,” in ECCV, 2020, pp. 142–158.
  • [29] S. Cheng, Y. Liu, S. Ma, and X. Zhang, “Deep feature space trojan attack of neural networks by controlled detoxification,” in AAAI, 2021.
  • [30] E. Bagdasaryan and V. Shmatikov, “Blind backdoors in deep learning models,” in USENIX Security, 2021.
  • [31] Y. Zeng, W. Park, Z. M. Mao, and R. Jia, “Rethinking the backdoor attacks’ triggers: A frequency perspective,” in ICCV, 2021.
  • [32] M. Tancik, B. Mildenhall, and R. Ng, “StegaStamp: Invisible hyperlinks in physical photographs,” in CVPR, 2020.
  • [33] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard, “Universal adversarial perturbations,” in CVPR, 2017.
  • [34] Y. Liu, Y. Xie, and A. Srivastava, “Neural trojans,” in ICCD, 2017.
  • [35] P. Zhao, P.-Y. Chen, P. Das, K. N. Ramamurthy, and X. Lin, “Bridging mode connectivity in loss landscapes and adversarial robustness,” in ICLR, 2020.
  • [36] Y. Li, X. Lyu, N. Koren, L. Lyu, B. Li, and X. Ma, “Neural attention distillation: Erasing backdoor triggers from deep neural networks,” in ICLR, 2021.
  • [37] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
  • [38] K. Liu, B. Dolan-Gavitt, and S. Garg, “Fine-pruning: Defending against backdooring attacks on deep neural networks,” in RAID, 2018.
  • [39] D. Wu and Y. Wang, “Adversarial neuron pruning purifies backdoored deep models,” in NeurIPS, 2021.
  • [40] R. Zheng, R. Tang, J. Li, and L. Li, “Data-free backdoor removal based on channel Lipschitzness,” in ECCV, 2022.
  • [41] B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao, “Neural cleanse: Identifying and mitigating backdoor attacks in neural networks,” in IEEE S&P, 2019.
  • [42] Y. Dong, X. Yang, Z. Deng, T. Pang, Z. Xiao, H. Su, and J. Zhu, “Black-box detection of backdoor attacks with limited information and data,” in ICCV, 2021.
  • [43] J. Guo, A. Li, and C. Liu, “AEVA: Black-box backdoor detection using adversarial extreme value analysis,” in ICLR, 2022.
  • [44] X. Huang, M. Alzantot, and M. Srivastava, “NeuronInspect: Detecting backdoors in neural networks via output explanations,” arXiv preprint arXiv:1911.07399, 2019.
  • [45] E. Chou, F. Tramer, and G. Pellegrino, “SentiNet: Detecting localized universal attack against deep learning systems,” in IEEE S&P Workshop, 2020.
  • [46] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in ICCV, 2017.
  • [47] H. Qiu, Y. Zeng, S. Guo, T. Zhang, M. Qiu, and B. Thuraisingham, “DeepSweep: An evaluation framework for mitigating DNN backdoor attacks using data augmentation,” in Asia CCS, 2021.
  • [48] B. Tran, J. Li, and A. Madry, “Spectral signatures in backdoor attacks,” in NeurIPS, 2018.
  • [49] J. Hayase and W. Kong, “SPECTRE: Defending against backdoor attacks using robust covariance estimation,” in ICML, 2021.
  • [50] X. Qi, T. Xie, Y. Li, S. Mahloujifar, and P. Mittal, “Revisiting the assumption of latent separability for backdoor defenses,” in ICLR, 2023.
  • [51] Y. Gao, Y. Kim, B. G. Doan, Z. Zhang, G. Zhang, S. Nepal, D. Ranasinghe, and H. Kim, “Design and evaluation of a multi-domain trojan detection method on deep neural networks,” IEEE Transactions on Dependable and Secure Computing, vol. 19, no. 4, pp. 2349–2364, 2022.
  • [52] J. Guo, Y. Li, X. Chen, H. Guo, L. Sun, and C. Liu, “SCALE-UP: An efficient black-box input-level backdoor detection via analyzing scaled prediction consistency,” in ICLR, 2023.
  • [53] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in CVPR, 2009.
  • [54] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
  • [55] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  • [56] Y. Li, M. Ya, Y. Bai, Y. Jiang, and S.-T. Xia, “BackdoorBox: A python toolbox for backdoor learning,” in ICLR Workshop, 2023.
  • [57] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry, “Adversarial examples are not bugs, they are features,” in NeurIPS, 2019.
  • [58] A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Convergence and generalization in neural networks,” in NeurIPS, 2018.
  • [59] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, “AttGAN: Facial attribute editing by only changing what you want,” IEEE Transactions on Image Processing, vol. 28, no. 11, pp. 5464–5478, 2019.
  • [60] Y. Li, L. Zhu, X. Jia, Y. Jiang, S.-T. Xia, and X. Cao, “Defending against model stealing via verifying embedded external features,” in AAAI, 2022.
  • [61] Y.-C. Chen, X. Shen, Z. Lin, X. Lu, I. Pao, J. Jia et al., “Semantic component decomposition for face attribute manipulation,” in CVPR, 2019.
  • [62] T. Wei, D. Chen, W. Zhou, J. Liao, Z. Tan, L. Yuan, W. Zhang, and N. Yu, “HairCLIP: Design your hair by text and reference image,” in CVPR, 2022.
  • [63] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “VGGFace2: A dataset for recognising faces across pose and age,” in FG, 2018.
  • [64] J. An, S. Huang, Y. Song, D. Dou, W. Liu, and J. Luo, “ArtFlow: Unbiased image style transfer via reversible neural flows,” in CVPR, 2021.
  • [65] S. Yang, Y. Li, Y. Jiang, and S.-T. Xia, “Backdoor defense via suppressing model shortcuts,” in ICASSP, 2023.
  • [66] R. Zheng, R. Tang, J. Li, and L. Liu, “Data-free backdoor removal based on channel Lipschitzness,” in ECCV, 2022.
  • [67] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” in ICCV, 2017.
  • [68] H. Xiao, B. Biggio, G. Brown, G. Fumera, C. Eckert, and F. Roli, “Is feature selection secure against training data poisoning?” in ICML, 2015.
  • [69] A. Schwarzschild, M. Goldblum, A. Gupta, J. P. Dickerson, and T. Goldstein, “Just how toxic is data poisoning? A unified benchmark for backdoor and data poisoning attacks,” in ICML, 2021.
  • [70] B. He, J. Liu, Y. Li, S. Liang, J. Li, X. Jia, and X. Cao, “Generating transferable 3D adversarial point cloud via random perturbation factorization,” in AAAI, 2023.
  • [71] R. Duan, X. Ma, Y. Wang, J. Bailey, A. K. Qin, and Y. Yang, “Adversarial camouflage: Hiding physical-world attacks with natural styles,” in CVPR, 2020.
Dr. Yiming Li is currently a Research Professor in the School of Cyber Science and Technology at Zhejiang University. Before that, he received his Ph.D. degree with honors in Computer Science and Technology from Tsinghua University (2023) and his B.S. degree with honors in Mathematics and Applied Mathematics from Ningbo University (2018). His research interests are in the domain of Trustworthy ML and AI Security, especially backdoor learning and copyright protection in deep learning. His research has been published in multiple top-tier conferences and journals, such as ICLR, NeurIPS, and IEEE TIFS. He has served as a senior program committee member of AAAI, a program committee member of ICLR, NeurIPS, ICML, etc., and a reviewer for IEEE TPAMI, IEEE TDSC, IEEE TIFS, etc. His research has been featured by major media outlets, such as IEEE Spectrum. He received the Best Paper Award at PAKDD 2023 and the Rising Star Award at WAIC 2023.
Mingyan Zhu received his B.S. degree in Computer Science and Technology from Harbin Institute of Technology, China, in 2020. He is currently pursuing the Ph.D. degree at Tsinghua Shenzhen International Graduate School, Tsinghua University. His research interests are in the domain of Trustworthy ML and AI security.
Dr. Junfeng Guo is currently a Research Associate in the Department of Computer Science at the University of Maryland. Before that, he received his Ph.D. degree in Computer Science from the University of Texas at Dallas (2023) and his B.S. degree from the University of Texas at Dallas (2018). His research interests are in the domain of Trustworthy ML and AI Security, especially backdoor learning and copyright protection in deep learning. His research has been published in multiple top-tier conferences and journals, such as ICLR, NeurIPS, and MobiCom. He has served as a program committee member of ICLR, NeurIPS, ICML, etc., and a reviewer for IEEE TPAMI, IEEE TNNLS, etc.
Dr. Wei Tao received the B.S. and Ph.D. degrees from Peking University, China, in 1997 and 2007, respectively. He is currently the Vice President at Ant Group, in charge of its foundational security. He is also an Adjunct Professor at Peking University. For more than 20 years, he has been committed to making complex systems more secure and reliable. His work has helped Windows, Android, iOS, and other operating systems improve their security capabilities. He also led the development of many well-known open-source security projects, such as Mesatee/Teaclave, MesaLink TLS, OpenRASP, and Advbox Adversarial Toolbox. His research has been published in multiple top-tier journals and conferences, including IEEE TDSC, IEEE TIFS, IEEE S&P, and USENIX Security.
Dr. Shu-Tao Xia received the B.S. degree in mathematics and the Ph.D. degree in applied mathematics from Nankai University, Tianjin, China, in 1992 and 1997, respectively. Since January 2004, he has been with the Tsinghua Shenzhen International Graduate School of Tsinghua University, Guangdong, China, where he is currently a full professor. From September 1997 to March 1998 and from August to September 1998, he visited the Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong. His research interests include coding and information theory, machine learning, and deep learning. His papers have been published in multiple top-tier journals and conferences, such as IEEE TPAMI, IEEE TIFS, IEEE TDSC, CVPR, ICLR, and NeurIPS.
Dr. Zhan Qin is currently a ZJU100 Young Professor with both the College of Computer Science and the School of Cyber Science and Technology at Zhejiang University, China. He was an assistant professor at the Department of Electrical and Computer Engineering of the University of Texas at San Antonio after receiving the Ph.D. degree from the Computer Science and Engineering Department at the State University of New York at Buffalo in 2017. His current research interests include data security and privacy, secure computation outsourcing, artificial intelligence security, and cyber-physical security in the context of the Internet of Things. His work explores and develops novel security-sensitive algorithms and protocols for computation and communication in the general context of Cloud and Internet devices. He is an associate editor of IEEE TDSC.

Appendix A The Proof of Theorem 1

Theorem 1.

Suppose the training dataset consists of $N_b$ benign samples $\{(\bm{x}_i, y_i)\}_{i=1}^{N_b}$ and $N_p$ poisoned samples $\{(\bm{x}'_j, y_t)\}_{j=1}^{N_p}$, whose images are i.i.d. sampled from a uniform distribution and belong to $k$ classes.
Assume that the DNN $f(\cdot;\bm{\theta})$ is a multivariate kernel regression with kernel $K(\cdot)$ and is trained via $\min_{\bm{\theta}} \sum_{i=1}^{N_b}\mathcal{L}(f(\bm{x}_i;\bm{\theta}), y_i) + \sum_{j=1}^{N_p}\mathcal{L}(f(\bm{x}'_j;\bm{\theta}), y_t)$, and that trigger patterns are additive perturbations. Let $f^{(a)}$ and $f^{(s)}$ denote the models attacked by sample-agnostic and sample-specific attacks, respectively, where both attacks select the same benign samples for poisoning on the same dataset. For their expected predictive confidences on the target label $y_t$, we have:

$$\mathbb{E}_{\hat{\bm{x}}}[f^{(a)}(\hat{\bm{x}})]-\mathbb{E}_{\widetilde{\bm{x}}}[f^{(s)}(\widetilde{\bm{x}})] \geq 0, \tag{1}$$

where $\hat{\bm{x}}$ and $\widetilde{\bm{x}}$ are the poisoned testing samples of the sample-agnostic and sample-specific attacks, respectively.

Proof.

Since trigger patterns are additive, the target poisoned sample is $\bm{x}'_t = \bm{x}_t + \bm{t}$. Accordingly, each sample-specific poisoned sample is $\widetilde{\bm{x}}'_i = \bm{x}_i + \bm{t}_i$, while each sample-agnostic poisoned sample is $\hat{\bm{x}}'_i = \bm{x}_i + \bm{t}$, where $\bm{t}$ denotes the shared backdoor trigger and $\bm{t}_i$ the trigger specific to $\bm{x}_i$.

We treat our model as a $k$-way kernel least-squares classifier and use a cross-entropy loss for training the kernel. The output of $f(\cdot)$ is a $k$-dimensional vector. Let $\phi_t(\cdot)\in\mathbb{R}$ denote the expected predictive confidence for the target class $t$. Following previous works [58, 43], the kernel regression solution is:

$$\phi_t(\cdot)=\frac{\sum_{i=1}^{N_b}K(\cdot,\bm{x}_i)\cdot\bm{y}_i+\sum_{i=1}^{N_p}K(\cdot,\bm{x}'_i)\cdot\bm{y}_t}{\sum_{i=1}^{N_b}K(\cdot,\bm{x}_i)+\sum_{i=1}^{N_p}K(\cdot,\bm{x}'_i)}, \tag{2}$$

where $K$ is the RBF kernel and $\bm{y}$ is the one-hot version of the label $y$.
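Eq. (2) has the form of a normalized kernel-weighted average of one-hot labels (a Nadaraya–Watson-style estimator). The sketch below evaluates it with NumPy on toy data, assuming an RBF kernel $K(\bm{u},\bm{v})=\exp(-\|\bm{u}-\bm{v}\|^2/(2\sigma^2))$; the bandwidth `sigma`, the helper names `rbf` and `kernel_confidence`, the trigger value, and the data sizes are all hypothetical choices for illustration, not the paper's settings.

```python
import numpy as np


def rbf(query: np.ndarray, points: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """RBF kernel values K(query, p) for every row p of `points` (sigma is an assumed bandwidth)."""
    sq_dist = np.sum((points - query) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_confidence(query, x_benign, y_benign, x_poison, y_target, k, sigma=1.0):
    """Eq. (2) evaluated at `query`: a length-k vector whose entry y_target is phi_t(query)."""
    w_b = rbf(query, x_benign, sigma)              # K(query, x_i) for benign samples
    w_p = rbf(query, x_poison, sigma)              # K(query, x'_i) for poisoned samples
    onehot_b = np.eye(k)[y_benign]                 # one-hot labels y_i, shape (N_b, k)
    onehot_t = np.eye(k)[y_target]                 # one-hot target label y_t, shape (k,)
    numerator = w_b @ onehot_b + w_p.sum() * onehot_t
    denominator = w_b.sum() + w_p.sum()
    return numerator / denominator


rng = np.random.default_rng(0)
k, d, n_b, n_p, target = 3, 2, 30, 3, 0
x_benign = rng.uniform(size=(n_b, d))              # i.i.d. uniform images (Theorem 1 assumption)
y_benign = np.repeat(np.arange(k), n_b // k)       # evenly distributed: N_b / k samples per class
trigger = np.full(d, 0.1)                          # hypothetical additive trigger
x_poison = x_benign[y_benign == target][:n_p] + trigger  # poisoned copies of target-class samples
query = rng.uniform(size=d) + trigger              # a poisoned test input

conf = kernel_confidence(query, x_benign, y_benign, x_poison, target, k)
print("class confidences:", conf, "phi_t:", conf[target])
```

Because every label is one-hot, the entries of the returned vector sum to one, and its target entry coincides with the reduced form given in Eq. (3).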

We assume the training samples are evenly distributed across classes, so there are $\frac{N_b}{k}$ benign samples belonging to $y_t$. Without loss of generality, we index these samples as $i=1,\dots,N_b/k$. Since $\phi_t$ only reads the target entry of each one-hot label, which equals 1 for these samples and for all poisoned samples and 0 for the others, the regression solution can be re-formulated as:

$$\phi_t(\cdot)=\frac{\sum_{i=1}^{N_b/k}K(\cdot,\bm{x}_i)+\sum_{i=1}^{N_p}K(\cdot,\bm{x}'_i)}{\sum_{i=1}^{N_b}K(\cdot,\bm{x}_i)+\sum_{i=1}^{N_p}K(\cdot,\bm{x}'_i)}. \tag{3}$$

Accordingly, for sample-specific attacks, we have:

$$\mathbb{E}_{\widetilde{\bm{x}}}[f^{(s)}(\widetilde{\bm{x}})]\triangleq\phi_t(\bm{x}'_t)=\frac{\sum_{i=1}^{N_b/k}K(\bm{x}'_t,\bm{x}_i)+\sum_{i=1}^{N_p}K(\bm{x}'_t,\widetilde{\bm{x}}'_i)}{\sum_{i=1}^{N_b}K(\bm{x}'_t,\bm{x}_i)+\sum_{i=1}^{N_p}K(\bm{x}'_t,\widetilde{\bm{x}}'_i)}. \tag{4}$$

Similarly, for sample-agnostic attacks with the same $\mathcal{D}_s$ and $\mathcal{D}_b$ configurations, we have:

$$\mathbb{E}_{\hat{\bm{x}}}\big[f^{(a)}(\hat{\bm{x}})\big]\triangleq\hat{\phi}_t(\bm{x}'_t)=\frac{\sum_{i=1}^{N_b/k}K(\bm{x}'_t,\bm{x}_i)+\sum_{i=1}^{N_p}K(\bm{x}'_t,\hat{\bm{x}}'_i)}{\sum_{i=1}^{N_b}K(\bm{x}'_t,\bm{x}_i)+\sum_{i=1}^{N_p}K(\bm{x}'_t,\hat{\bm{x}}'_i)}. \tag{5}$$
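Eq. (5) has the form of a kernel (Nadaraya–Watson) regression estimate of the target-class score. Its numerator is the kernel mass on target-labelled points and its denominator is the total kernel mass. The following is a minimal pure-Python sketch under stated assumptions. Inputs are flattened vectors and the RBF bandwidth `gamma` is hypothetical. The first $N_b/k$ benign samples are taken to be the target-class ones. The helper names `rbf` and `phi_hat` and all numeric values are illustrative and do not come from the paper.

```python
import math
import random


def rbf(a, b, gamma):
    """RBF kernel K(a, b) = exp(-gamma * ||a - b||_2^2) on flattened vectors."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def phi_hat(x_test, benign, n_target, poisoned, gamma):
    """Evaluate the ratio in Eq. (5) at the triggered test input x_test.

    benign:   the N_b benign samples; the first n_target (= N_b / k) are
              assumed to be the target-class ones.
    poisoned: the N_p poisoned samples (trigger already added), all of
              which carry the target label.
    """
    k_target = sum(rbf(x_test, x, gamma) for x in benign[:n_target])
    k_benign = sum(rbf(x_test, x, gamma) for x in benign)
    k_poison = sum(rbf(x_test, x, gamma) for x in poisoned)
    return (k_target + k_poison) / (k_benign + k_poison)


# Toy setup; every value below is a hypothetical placeholder.
random.seed(0)
d, n_b, k, n_p, gamma = 4, 12, 3, 3, 0.1
benign = [[random.random() for _ in range(d)] for _ in range(n_b)]
t = [0.5] * d                                    # sample-agnostic trigger
x_t = [random.random() for _ in range(d)]
x_test = [a + b for a, b in zip(x_t, t)]         # x'_t = x_t + t
poisoned = [[a + b for a, b in zip(x, t)] for x in benign[:n_p]]  # x_i + t

score_clean = phi_hat(x_test, benign, n_b // k, [], gamma)
score_attack = phi_hat(x_test, benign, n_b // k, poisoned, gamma)
print(f"{score_clean:.4f} {score_attack:.4f}")
```

Because $\sum_{i=1}^{N_b/k}K(\bm{x}'_t,\bm{x}_i)<\sum_{i=1}^{N_b}K(\bm{x}'_t,\bm{x}_i)$, the ratio grows strictly with the poisoned kernel mass. As a result, `score_attack` exceeds `score_clean` here regardless of the random draw, and both scores remain below 1.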

Accordingly, we have

\begin{align}
&\mathbb{E}_{\hat{\bm{x}}}\big[f^{(a)}(\hat{\bm{x}})\big]-\mathbb{E}_{\widetilde{\bm{x}}}\big[f^{(s)}(\widetilde{\bm{x}})\big] \tag{6}\\
&=\frac{\Big(\sum_{i=1}^{N_p}K(\bm{x}'_t,\widetilde{\bm{x}}'_i)-\sum_{i=1}^{N_p}K(\bm{x}'_t,\hat{\bm{x}}'_i)\Big)\sum_{i=1}^{N_b/k}K(\bm{x}'_t,\bm{x}_i)-\Big(\sum_{i=1}^{N_p}K(\bm{x}'_t,\widetilde{\bm{x}}'_i)-\sum_{i=1}^{N_p}K(\bm{x}'_t,\hat{\bm{x}}'_i)\Big)\sum_{i=1}^{N_b}K(\bm{x}'_t,\bm{x}_i)}{\Big(\sum_{i=1}^{N_p}K(\bm{x}'_t,\widetilde{\bm{x}}'_i)+\sum_{i=1}^{N_b}K(\bm{x}'_t,\bm{x}_i)\Big)\Big(\sum_{i=1}^{N_p}K(\bm{x}'_t,\hat{\bm{x}}'_i)+\sum_{i=1}^{N_b}K(\bm{x}'_t,\bm{x}_i)\Big)} \tag{7}\\
&=C\cdot\frac{\sum_{i=1}^{N_p}K(\bm{x}'_t,\hat{\bm{x}}'_i)-\sum_{i=1}^{N_p}K(\bm{x}'_t,\widetilde{\bm{x}}'_i)}{\Big(\sum_{i=1}^{N_p}K(\bm{x}'_t,\widetilde{\bm{x}}'_i)+\sum_{i=1}^{N_b}K(\bm{x}'_t,\bm{x}_i)\Big)\Big(\sum_{i=1}^{N_p}K(\bm{x}'_t,\hat{\bm{x}}'_i)+\sum_{i=1}^{N_b}K(\bm{x}'_t,\bm{x}_i)\Big)}, \tag{8}
\end{align}

where $C=\sum_{i=1}^{N_b}K(\bm{x}'_t,\bm{x}_i)-\sum_{i=1}^{N_b/k}K(\bm{x}'_t,\bm{x}_i)$.

In particular, $C>0$. This holds because $\{\bm{x}_i\}_{i=1}^{N_b/k}$ is a proper subset of $\{\bm{x}_i\}_{i=1}^{N_b}$ (for $k>1$) and the RBF kernel is strictly positive.
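The algebra leading from Eq. (5) to Eq. (8) can be sanity-checked numerically. The minimal sketch below assumes, consistent with Eq. (7), that the sample-specific score has the same form as Eq. (5) with $\hat{\bm{x}}'_i$ replaced by $\widetilde{\bm{x}}'_i=\bm{x}_i+\bm{t}_i$. The data, the triggers, `gamma`, and the shorthand `A`, `B`, `P`, `S` for the four kernel sums are all hypothetical. The poisoned samples are drawn from the first $N_p\le N_b/k$ (target-class) benign samples.

```python
import math
import random


def rbf(a, b, gamma):
    """RBF kernel K(a, b) = exp(-gamma * ||a - b||_2^2)."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def vadd(a, b):
    return [ai + bi for ai, bi in zip(a, b)]


random.seed(1)
d, n_b, k, n_p, gamma = 5, 12, 3, 4, 0.5      # hypothetical sizes
n_target = n_b // k                            # N_b / k target-class benign samples
benign = [[random.random() for _ in range(d)] for _ in range(n_b)]
base = benign[:n_p]                            # samples that get poisoned (target class)
x_t = [random.random() for _ in range(d)]
t = [0.3] * d                                  # sample-agnostic trigger
t_i = [[random.uniform(-0.3, 0.3) for _ in range(d)] for _ in range(n_p)]  # sample-specific
x_test = vadd(x_t, t)                          # x'_t = x_t + t

# The four kernel sums appearing in Eqs. (5)-(8).
A = sum(rbf(x_test, x, gamma) for x in benign[:n_target])
B = sum(rbf(x_test, x, gamma) for x in benign)
P = sum(rbf(x_test, vadd(x, t), gamma) for x in base)                 # hat x'_i
S = sum(rbf(x_test, vadd(x, ti), gamma) for x, ti in zip(base, t_i))  # tilde x'_i

f_a = (A + P) / (B + P)          # sample-agnostic score, Eq. (5)
f_s = (A + S) / (B + S)          # sample-specific score (same form, tilde samples)
C = B - A                        # constant defined after Eq. (8)

lhs = f_a - f_s                                  # Eq. (6)
rhs = C * (P - S) / ((B + S) * (B + P))          # Eq. (8)
print(f"C = {C:.4f}, lhs = {lhs:.6e}, rhs = {rhs:.6e}")
```

Expanding $(A+P)(B+S)-(A+S)(B+P)$ leaves $(B-A)(P-S)$. This is why the two printed values agree up to floating-point rounding.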

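For the numerator of Eq. (8), we use the RBF kernel $K(\bm{a},\bm{b})=e^{-\gamma\|\bm{a}-\bm{b}\|_2^2}$ together with $\bm{x}'_t=\bm{x}_t+\bm{t}$, $\hat{\bm{x}}'_i=\bm{x}_i+\bm{t}$, and $\widetilde{\bm{x}}'_i=\bm{x}_i+\bm{t}_i$, which gives: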

\begin{align}
&\sum_{i=1}^{N_p}K(\bm{x}'_t,\hat{\bm{x}}'_i)-\sum_{i=1}^{N_p}K(\bm{x}'_t,\widetilde{\bm{x}}'_i)=\sum_{i=1}^{N_p}\Big(e^{-\gamma\|\bm{x}'_t-\hat{\bm{x}}'_i\|_2^2}-e^{-\gamma\|\bm{x}'_t-\widetilde{\bm{x}}'_i\|_2^2}\Big)=\sum_{i=1}^{N_p}\Big(e^{-\gamma\|\bm{x}_t+\bm{t}-\bm{x}_i-\bm{t}\|_2^2}-e^{-\gamma\|\bm{x}_t+\bm{t}-\bm{x}_i-\bm{t}_i\|_2^2}\Big) \tag{9}\\
&=\sum_{i=1}^{N_p}e^{-\gamma\|\bm{x}_t-\bm{x}_i\|_2^2}\Big(1-e^{-\gamma\|\bm{t}-\bm{t}_i\|_2^2}\cdot e^{-2\gamma\bm{\Delta t}^\top\bm{\Delta x}}\Big) \tag{10}\\
&\geq\sum_{i=1}^{N_p}e^{-\gamma\|\bm{x}_t-\bm{x}_i\|_2^2}\Big(1-e^{-2\gamma\bm{\Delta t}^\top\bm{\Delta x}}\Big) \tag{11}\\
&=\sum_{i=1}^{N_p}K(\bm{x}_t,\bm{x}_i)\Big(1-e^{-2\gamma\bm{\Delta t}^\top\bm{\Delta x}}\Big), \tag{12}
\end{align}

where $\bm{\Delta t}=\bm{t}-\bm{t}_i\in\mathbb{R}^{C\times H\times W}$ and $\bm{\Delta x}=\bm{x}_t-\bm{x}_i\in\mathbb{R}^{C\times H\times W}$. Both depend on $i$ and are flattened to vectors in the inner product $\bm{\Delta t}^\top\bm{\Delta x}$, and $\gamma>0$. Eq. (10) follows from $\|\bm{\Delta x}+\bm{\Delta t}\|_2^2=\|\bm{\Delta x}\|_2^2+\|\bm{\Delta t}\|_2^2+2\bm{\Delta t}^\top\bm{\Delta x}$. Eq. (11) holds because $0<e^{-\gamma\|\bm{t}-\bm{t}_i\|_2^2}\leq 1$ and $e^{-2\gamma\bm{\Delta t}^\top\bm{\Delta x}}>0$.
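The per-sample identity behind Eqs. (9) and (10), and the relaxation in Eq. (11), can be spot-checked on random vectors. This pure-Python sketch compares one summand at a time. The dimension, `gamma`, and the sampling range are hypothetical choices made for illustration only.

```python
import math
import random


def vsub(a, b):
    return [ai - bi for ai, bi in zip(a, b)]


def vadd(a, b):
    return [ai + bi for ai, bi in zip(a, b)]


def sq_norm(v):
    return sum(vi * vi for vi in v)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def rbf(a, b, gamma):
    return math.exp(-gamma * sq_norm(vsub(a, b)))


def summand_eq9(x_t, x_i, t, t_i, gamma):
    """K(x'_t, x_i + t) - K(x'_t, x_i + t_i), with x'_t = x_t + t."""
    x_test = vadd(x_t, t)
    return rbf(x_test, vadd(x_i, t), gamma) - rbf(x_test, vadd(x_i, t_i), gamma)


def summand_eq10(x_t, x_i, t, t_i, gamma):
    """Factorized form: e^{-g|dx|^2} (1 - e^{-g|dt|^2} e^{-2g dt.dx})."""
    dx, dt = vsub(x_t, x_i), vsub(t, t_i)
    return math.exp(-gamma * sq_norm(dx)) * (
        1 - math.exp(-gamma * sq_norm(dt)) * math.exp(-2 * gamma * dot(dt, dx)))


def summand_eq11(x_t, x_i, t, t_i, gamma):
    """Lower bound obtained by dropping the factor e^{-g|dt|^2} <= 1."""
    dx, dt = vsub(x_t, x_i), vsub(t, t_i)
    return math.exp(-gamma * sq_norm(dx)) * (1 - math.exp(-2 * gamma * dot(dt, dx)))


def rand_vec(d):
    return [random.uniform(-0.5, 0.5) for _ in range(d)]


random.seed(2)
d, gamma = 6, 0.7
max_gap = 0.0        # largest |Eq. (9) summand - Eq. (10) summand|
bound_holds = True   # whether Eq. (10) summand >= Eq. (11) summand in every trial
for _ in range(1000):
    x_t, x_i, t, t_i = rand_vec(d), rand_vec(d), rand_vec(d), rand_vec(d)
    s9 = summand_eq9(x_t, x_i, t, t_i, gamma)
    s10 = summand_eq10(x_t, x_i, t, t_i, gamma)
    s11 = summand_eq11(x_t, x_i, t, t_i, gamma)
    max_gap = max(max_gap, abs(s9 - s10))
    bound_holds = bound_holds and s10 >= s11 - 1e-12
print(max_gap, bound_holds)
```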

Putting all of the above together, we have:

$$\mathbb{E}_{\hat{\bm{x}}}\big[f^{(a)}(\hat{\bm{x}})\big]-\mathbb{E}_{\widetilde{\bm{x}}}\big[f^{(s)}(\widetilde{\bm{x}})\big]\geq C\cdot\frac{\sum_{i=1}^{N_p}K(\bm{x}_t,\bm{x}_i)\big(1-e^{-2\gamma\bm{\Delta t}^\top\bm{\Delta x}}\big)}{\Big(\sum_{i=1}^{N_p}K(\bm{x}'_t,\widetilde{\bm{x}}'_i)+\sum_{i=1}^{N_b}K(\bm{x}'_t,\bm{x}_i)\Big)\Big(\sum_{i=1}^{N_p}K(\bm{x}'_t,\hat{\bm{x}}'_i)+\sum_{i=1}^{N_b}K(\bm{x}'_t,\bm{x}_i)\Big)}\geq 0. \tag{13}$$

The first inequality combines Eqs. (8) and (12) with $C>0$ and a positive denominator. The second holds whenever $\bm{\Delta t}^\top\bm{\Delta x}\geq 0$ for every $i$, since each factor $1-e^{-2\gamma\bm{\Delta t}^\top\bm{\Delta x}}$ is then non-negative.
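Eq. (13) can be sanity-checked on toy data. Each factor $1-e^{-2\gamma\bm{\Delta t}^\top\bm{\Delta x}}$ is non-negative exactly when $\bm{\Delta t}^\top\bm{\Delta x}\geq 0$, so the sketch tries two hypothetical trigger choices. The first uses "aligned" triggers $\bm{t}_i=\bm{t}-c(\bm{x}_t-\bm{x}_i)$ with an illustrative $c>0$, which make $\bm{\Delta t}=c\,\bm{\Delta x}$. The second uses unconstrained random $\bm{t}_i$. The helper `gap_and_bound` and all numeric values are assumptions for illustration only.

```python
import math
import random


def vadd(a, b):
    return [ai + bi for ai, bi in zip(a, b)]


def vsub(a, b):
    return [ai - bi for ai, bi in zip(a, b)]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def rbf(a, b, gamma):
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def gap_and_bound(benign, n_target, x_t, t, t_list, gamma):
    """Return (left-hand side, middle expression) of Eq. (13).

    The first len(t_list) benign samples are the ones that get poisoned,
    as x_i + t (sample-agnostic) or x_i + t_i (sample-specific)."""
    x_test = vadd(x_t, t)                            # x'_t = x_t + t
    base = benign[:len(t_list)]
    A = sum(rbf(x_test, x, gamma) for x in benign[:n_target])
    B = sum(rbf(x_test, x, gamma) for x in benign)
    P = sum(rbf(x_test, vadd(x, t), gamma) for x in base)
    S = sum(rbf(x_test, vadd(x, ti), gamma) for x, ti in zip(base, t_list))
    C = B - A
    gap = (A + P) / (B + P) - (A + S) / (B + S)
    num = sum(rbf(x_t, x, gamma)
              * (1 - math.exp(-2 * gamma * dot(vsub(t, ti), vsub(x_t, x))))
              for x, ti in zip(base, t_list))
    return gap, C * num / ((B + S) * (B + P))


random.seed(3)
d, n_b, k, n_p, gamma, c = 5, 12, 3, 4, 0.5, 0.2    # hypothetical values
benign = [[random.random() for _ in range(d)] for _ in range(n_b)]
x_t = [random.random() for _ in range(d)]
t = [0.3] * d

# t_i = t - c (x_t - x_i) gives Delta t = c * Delta x, so Delta t^T Delta x >= 0.
aligned = [vsub(t, [c * v for v in vsub(x_t, x)]) for x in benign[:n_p]]
# Unconstrained random sample-specific triggers.
unconstrained = [[random.uniform(-0.3, 0.3) for _ in range(d)] for _ in range(n_p)]

gap_al, bound_al = gap_and_bound(benign, n_b // k, x_t, t, aligned, gamma)
gap_un, bound_un = gap_and_bound(benign, n_b // k, x_t, t, unconstrained, gamma)
print(f"aligned: gap={gap_al:.3e} bound={bound_al:.3e}")
print(f"random:  gap={gap_un:.3e} bound={bound_un:.3e}")
```

With aligned triggers, both inequalities of Eq. (13) hold. With unconstrained triggers, the first inequality still holds, but the middle expression is not guaranteed to be non-negative.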
