Magnitude-based Neuron Pruning for Backdoor Defense

Nan Li
School of Cyber Science and Engineering
Shanghai Jiao Tong University
Shanghai, China, 200240
[email protected]
Haoyu Jiang
School of Cyber Science and Engineering
Shanghai Jiao Tong University
Shanghai, China, 200240
[email protected]
** Yi
School of Cyber Science and Engineering
Shanghai Jiao Tong University
Shanghai, China, 200240
[email protected]
Abstract

Deep Neural Networks (DNNs) are known to be vulnerable to backdoor attacks, which pose a serious threat to their reliable deployment. Recent research reveals that backdoors can be erased from infected DNNs by pruning a specific group of neurons, but how to effectively identify and remove these backdoor-associated neurons remains an open challenge. In this paper, we investigate the correlation between backdoor behavior and neuron magnitude, and find that backdoor neurons deviate from the magnitude-saliency correlation of the model. This deviation inspires us to propose a Magnitude-based Neuron Pruning (MNP) method to detect and prune backdoor neurons. Specifically, MNP uses three magnitude-guided objective functions to manipulate the magnitude-saliency correlation of backdoor neurons, with the respective purposes of exposing backdoor behavior, eliminating backdoor neurons and preserving clean neurons. Experiments show that our pruning strategy achieves state-of-the-art backdoor defense performance against a variety of backdoor attacks with a limited amount of clean data, demonstrating the crucial role of magnitude in guiding backdoor defenses.

1 Introduction

In recent years, Deep Neural Networks (DNNs) have demonstrated remarkable capabilities in solving real-world problems. However, the wide application of DNNs has raised concerns about their security and trustworthiness. Recent works have shown that DNNs are vulnerable to backdoor attacks [10], in which an adversary injects malicious triggers into the victim model by poisoning the training data, manipulating the training process, or directly modifying model parameters. The backdoored model performs well on clean samples but can be triggered into false predictions by poisoned samples containing trigger patterns. As pre-trained weights and outsourced training are widely used to cut the computational cost of training DNNs, backdoor attacks are becoming a security issue that cannot be ignored.

To address this issue, numerous methods have been proposed for detecting and mitigating backdoor attacks. Backdoor detection methods [31, 21, 12] identify whether a model is backdoored or a dataset is poisoned, while backdoor mitigation methods [19, 16] eliminate the injected triggers from backdoored models. Recent research [34, 18] has observed a subset of neurons contributing the most to backdoor behaviors in infected DNNs. By pruning these backdoor-associated neurons, the backdoor behavior of the infected model can be effectively mitigated. Backdoor neurons are believed to have certain properties. For example, they can only be activated by trigger patterns [39], and are more sensitive to input [37] or perturbation [34]. These properties can be used to design certain pruning strategies to mitigate the injected backdoor.

Magnitude is considered an important indicator to guide pruning in model compression studies [15, 11]. Most of these studies assume a positive correlation between neuron magnitude and neuron importance for model performance, as neurons with smaller magnitudes have less numerical impact on the output of the model. We empirically show that backdoor neurons deviate from this correlation, as they contain extra weights used to trigger the backdoor. Similar observations are also implied in [37, 39]. Motivated by our findings, we propose a Magnitude-based Neuron Pruning (MNP) method to defend against backdoor attacks. Specifically, MNP first detects the injected backdoor by analyzing the correlation between neuron magnitude and neuron saliency, and then optimizes neuron masks with three magnitude-guided objective functions to expose and prune backdoor neurons. Given a small subset of clean samples, MNP can effectively defend against backdoors injected by a variety of backdoor attacks. Experiments show MNP is competitive for both backdoor detection and mitigation among a set of state-of-the-art backdoor defense methods. Our defense strategy significantly surpasses the previous state-of-the-art method RNP [18], achieving superior defense outcomes against over ten types of attacks on both the CIFAR-10 and ImageNet datasets, with the majority of results reaching optimal performance levels.

To summarize, our main contributions are three-fold. 1) We explore the correlation between neuron magnitude and the contribution of neurons to backdoor behavior, and reinterpret the mechanism of backdoor defenses from the perspective of neuron magnitude. 2) Beyond offering a hypothesis for how neuron magnitude works in backdoor defense methods, we validate our claims by using them as principles to construct our own backdoor defense method, MNP, which uses three optimization objectives to manipulate the magnitude of neurons. 3) We empirically show that our method is competitive with ten state-of-the-art backdoor defense methods against ten challenging backdoor attacks across different model architectures and datasets.

2 Related Work

Backdoor Attack.

Depending on how the trigger pattern is injected, backdoor attacks fall into two main categories: input-space attacks, which poison the training dataset, and feature-space attacks, which manipulate the training process or directly modify model parameters. Input-space attacks, also known as poisoning-based attacks, modify a small subset of the training data, commonly by patching trigger patterns into the samples and shifting the corresponding labels to the target class. Note that there are also clean-label attacks [30] that do not relabel any sample. The model trained on the poisoned data learns both the clean task, which preserves its accuracy on clean samples, and the backdoor task, which tricks it into false predictions when the trigger pattern is present. Input-space attacks can be further categorized into static and dynamic attacks depending on the trigger pattern they use. Static attacks use the same trigger pattern for all samples, such as black-white squares [10], Gaussian noise [5] and adversarial perturbations [30], while dynamic attacks [24, 25] use sample-wise triggers to enhance their stealthiness. Feature-space attacks occur under a different threat model, where the adversary has full access to the training process and the model weights. These attacks may directly manipulate the training process with certain optimization objectives [27, 6, 36], or perturb the model weights [9, 26] to inject backdoors in the feature space, making them more challenging to defend against.

Backdoor Defense.

Backdoor defense involves two primary tasks: backdoor detection and backdoor mitigation. Backdoor detection methods focus on identifying backdoored models [1, 21, 4, 13] or backdoored samples [8, 29]. Some advanced detection methods also reverse-engineer the backdoored model to recover the trigger pattern [31, 28, 33, 12]. Backdoor mitigation methods aim to remove the injected backdoor from the infected model with minimal degradation of its performance on clean samples. Existing techniques include fine-tuning, distillation [16], unlearning [35], pruning [19], and training-time defenses [17, 32, 21]. Recent works on pruning have demonstrated remarkable performance in backdoor mitigation. FP [19] assumes that backdoor-associated neurons can only be activated by the trigger pattern, so pruning neurons that are dormant when the model is fed clean samples can mitigate the injected backdoor. ANP [34] perturbs neuron weights to maximize the classification loss of the model, then repairs the model by pruning the neurons that are most sensitive to the adversarial perturbation, as these neurons are believed to be strongly related to the injected backdoor. RNP [18] optimizes masks through an unlearning-recovering process to expose backdoor neurons. CLP [37] introduces the channel Lipschitz value to evaluate each neuron's sensitivity to input and prunes backdoor neurons with high sensitivity. FT-SAM [39] reveals that backdoor neurons tend to have larger magnitudes, and incorporates sharpness-aware minimization into fine-tuning to purify backdoored models.

3 Preliminaries

3.1 Backdoor Learning

Consider a standard $K$-class classification problem on a training set $\mathcal{D}=\{(\boldsymbol{x}_{i},y_{i})\}^{D}_{i=1}\subseteq\mathcal{X}\times\mathcal{Y}$, with $\mathcal{X}\subset\mathbb{R}^{d}$ as the sample space and $\mathcal{Y}\subset\{1,2,...,K\}$ as the label space. Given a subset $\mathcal{D}_{b}\subseteq\mathcal{D}$, the standard poisoning-based backdoor attack involves injecting the trigger pattern into input samples with the poisoning function $\delta:\mathcal{X}\rightarrow\mathcal{X}$ and modifying corresponding labels with the label shifting function $S:\mathcal{Y}\rightarrow\mathcal{Y}$. The backdoor attack can be viewed as a multi-task learning problem [18] on both the clean subset $\mathcal{D}_{c}=\mathcal{D}-\mathcal{D}_{b}$ and the backdoor subset $\mathcal{D}_{b}$. Let $F$ denote the victim model with parameter $\theta$.
We consider a parametric hypothesis that the parameter space of $F$ can be decomposed into $\theta=\theta_{c}\cup\theta_{b}\cup\theta_{sh}$, where $\theta_{c}$, $\theta_{b}$ and $\theta_{sh}$ denote the clean, the backdoor and the shared neurons, respectively. We further assume that $\theta_{sh}$ can be omitted since the backdoor and the clean tasks are highly independent, i.e., backdoor attacks are designed not to affect the model performance on clean samples [10]. The standard backdoor learning process can be expressed as follows:

$$\underset{\theta=\theta_{c}\cup\theta_{b}}{\arg\min}\big[\underbrace{\mathbb{E}_{(\boldsymbol{x},y)\in\mathcal{D}_{c}}\mathcal{L}(F(\boldsymbol{x};\theta_{c}),y)}_{\textrm{clean task}}+\underbrace{\mathbb{E}_{(\boldsymbol{x},y)\in\mathcal{D}_{b}}\mathcal{L}(F(\delta(\boldsymbol{x});\theta_{b}),S(y))}_{\textrm{backdoor task}}\big]. \tag{1}$$
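To make the roles of the poisoning function and the label shifting function concrete, the following minimal sketch builds a clean subset and a poisoned subset from a toy dataset. The patch trigger, the all-to-one target label, the poisoning rate and all function names are illustrative assumptions, not details taken from any particular attack.

```python
import random

def delta(x, trigger_value=1.0, patch=2):
    """Hypothetical poisoning function: stamp a patch x patch square
    into the bottom-right corner of a 2-D image (list of rows)."""
    poisoned = [row[:] for row in x]  # copy so the clean sample is untouched
    h, w = len(x), len(x[0])
    for r in range(h - patch, h):
        for c in range(w - patch, w):
            poisoned[r][c] = trigger_value
    return poisoned

def shift_label(y, target=0):
    """Hypothetical all-to-one label shifting function S."""
    return target

def poison_dataset(dataset, rate, seed=0):
    """Split D into the clean subset D_c and the poisoned version of D_b."""
    rng = random.Random(seed)
    n_poison = int(len(dataset) * rate)
    idx_b = set(rng.sample(range(len(dataset)), n_poison))
    d_c = [(x, y) for i, (x, y) in enumerate(dataset) if i not in idx_b]
    d_b = [(delta(x), shift_label(y))
           for i, (x, y) in enumerate(dataset) if i in idx_b]
    return d_c, d_b

# Toy dataset: ten all-zero 4x4 "images" with labels in {0, 1, 2}.
dataset = [([[0.0] * 4 for _ in range(4)], i % 3) for i in range(10)]
d_c, d_b = poison_dataset(dataset, rate=0.2)
```

Training on the union of `d_c` and `d_b` corresponds to minimizing the sum of the clean-task and backdoor-task losses in Eq. (1).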

3.2 Neuron Magnitude and Saliency

Consider a convolutional network $F$ with $L$ layers, regarding the fully connected layer as the convolutional layer with $1\times 1$ kernels. Let $f^{(i)}$ denote the $i$-th layer with the weight matrix $\theta^{(i)}\in\mathbb{R}^{c_{i}\times c_{i-1}\times h\times w}$, where $c_{i}$, $h$ and $w$ denote the channel number of $f^{(i)}$, the height and the width of the convolutional kernel, respectively. $\theta^{(i)}$ consists of $c_{i}$ filters $\{\theta^{(i,j)}\in\mathbb{R}^{c_{i-1}\times h\times w}\}_{j=1}^{c_{i}}$. Pruning the $j$-th filter of the $i$-th layer refers to setting $\theta^{(i,j)}$ to an all-zero matrix, thus removing the corresponding output feature map.
We denote the $l_{p}$-norm of a filter as $\|\theta^{(i,j)}\|_{p}$, which is widely used to evaluate the magnitude of each filter in model compression methods [15, 11]. Recent research [39] has observed a strong positive correlation between backdoor behavior and the weight norms for each neuron, which implies that backdoor neurons may have larger magnitudes than clean ones. To further quantify the contribution of each neuron to the backdoor and the clean task, we introduce the concept of neuron saliency, which is commonly defined as the loss change induced by pruning that neuron in model compression research [14]. Given a test set $\mathcal{D}_{t}$, for each filter, we define the saliency metrics Clean Loss Change (CLC) and Backdoor Loss Change (BLC) as follows:

$$\textrm{CLC}(\theta,i,j)=\mathbb{E}_{(\boldsymbol{x},y)\in\mathcal{D}_{t}}\mathcal{L}(F(\boldsymbol{x};\theta|\theta^{(i,j)}=0),y)-\mathbb{E}_{(\boldsymbol{x},y)\in\mathcal{D}_{t}}\mathcal{L}(F(\boldsymbol{x};\theta),y), \tag{2}$$
$$\textrm{BLC}(\theta,i,j)=\mathbb{E}_{(\boldsymbol{x},y)\in\mathcal{D}_{t}}\mathcal{L}(F(\delta(\boldsymbol{x});\theta|\theta^{(i,j)}=0),S(y))-\mathbb{E}_{(\boldsymbol{x},y)\in\mathcal{D}_{t}}\mathcal{L}(F(\delta(\boldsymbol{x});\theta),S(y)). \tag{3}$$
Figure 1: Scatter plots depicting the BLC and CLC of filters in the shallow and deep convolutional layers of backdoored ResNet18 models, attacked by BadNets, Trojan, Blend, CLA, IAB, and WaNet. Quadrants are determined by the horizontal and vertical lines (x-axis and y-axis) at $\textrm{CLC}=0$ and $\textrm{BLC}=0$. The color of each point indicates the $l_{2}$-norm of the corresponding filter weight, with deeper colors representing larger $l_{2}$-norms.
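The two saliency metrics of Eqs. (2) and (3) can be illustrated on a toy model. The sketch below is only an assumption-laden illustration: the network is a hand-built two-layer ReLU classifier whose three hidden units stand in for filters, the third input feature plays the role of the trigger, and every weight and sample is made up for the example.

```python
import math

def forward(x, W1, W2):
    """Two-layer toy network: ReLU hidden units (one per 'filter'), linear output."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

def cross_entropy(logits, y):
    """Numerically stable softmax cross-entropy for a single sample."""
    top = max(logits)
    return top + math.log(sum(math.exp(l - top) for l in logits)) - logits[y]

def mean_loss(data, W1, W2):
    return sum(cross_entropy(forward(x, W1, W2), y) for x, y in data) / len(data)

def loss_change(data, W1, W2, j):
    """Loss after zeroing filter j minus loss of the unpruned model."""
    pruned = [[0.0] * len(row) if k == j else row[:] for k, row in enumerate(W1)]
    return mean_loss(data, pruned, W2) - mean_loss(data, W1, W2)

# Filters 0 and 1 read the two clean features; filter 2 reads only the trigger
# feature and pushes the prediction toward the target class 0.
W1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W2 = [[2.0, 0.0, 5.0], [0.0, 2.0, 0.0]]

clean_test = [([1.0, 0.0, 0.0], 0), ([0.0, 1.0, 0.0], 1)]
# Triggered inputs delta(x) paired with shifted labels S(y) = 0.
backdoor_test = [([x[0], x[1], 1.0], 0) for x, _ in clean_test]

clc = [loss_change(clean_test, W1, W2, j) for j in range(3)]     # Eq. (2)
blc = [loss_change(backdoor_test, W1, W2, j) for j in range(3)]  # Eq. (3)
```

In this toy setup, filter 2 has zero CLC and a large positive BLC, which is the pattern one would expect from a pure backdoor filter, while filters 0 and 1 have positive CLC.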

3.3 Observations about Backdoor and Clean Neurons

We investigate the distribution of the CLC, BLC and magnitude of filters across different layers of backdoored models, as shown in Fig 1. Our key observations are as follows:

Backdoor and clean neurons may overlap.

Most recent works [19, 38] assume that backdoor and clean neurons hardly overlap at the filter level, i.e., each filter is associated with either the backdoor task or the clean task. However, filters in the first quadrant in Fig 1 have positive BLC and positive CLC, which implies they contribute to both clean accuracy and backdoor behavior. Instead of simply defining a filter as backdoored or clean, we suggest a fine-grained approach with thresholds $\tau\geq 0,\epsilon\geq 0$ to categorize filters into backdoor, clean, hybrid and redundant filters. We follow the assumption mentioned in Section 3.1, that the backdoor and clean parameters are separated at the neuron level. A filter is composed of a large number of neurons, as it commonly contains multiple channels and kernel weights. Backdoor filters primarily consist of backdoor neurons that are crucial for the backdoor task but have minimal contribution to the clean task, while clean filters exhibit the opposite behavior. Hybrid filters are composed of both backdoor neurons and clean neurons; hence, pruning them results in an increase in both backdoor loss and clean loss. Redundant filters contain unimportant neurons and can be pruned without significantly affecting the overall performance of the model, as is widely observed in model compression studies [15, 11].
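One way to read the four categories is as a two-threshold rule on BLC and CLC. The sketch below is a hypothetical interpretation: it assumes that the threshold tau applies to BLC and epsilon to CLC, and that "above threshold" means strictly greater, neither of which is specified in the text above.

```python
def categorize_filter(clc, blc, tau=0.0, eps=0.0):
    """Assign a filter to one of four categories from its saliency values.

    Assumed rule (not taken from the paper): a filter matters for the
    backdoor task when BLC > tau and for the clean task when CLC > eps.
    """
    high_blc = blc > tau
    high_clc = clc > eps
    if high_blc and high_clc:
        return "hybrid"
    if high_blc:
        return "backdoor"
    if high_clc:
        return "clean"
    return "redundant"

# Hypothetical (CLC, BLC) pairs for four filters.
examples = {
    "f0": (0.30, 0.01),
    "f1": (0.00, 1.10),
    "f2": (0.25, 0.40),
    "f3": (0.00, 0.00),
}
labels = {name: categorize_filter(c, b, tau=0.05, eps=0.05)
          for name, (c, b) in examples.items()}
```

Raising either threshold moves borderline hybrid filters into the backdoor or clean category, which is why the thresholds trade off backdoor removal against clean accuracy.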

The correlation between neuron magnitude and saliency.

As weight decay is widely applied to reduce the complexity of DNNs, neurons of less importance tend to have smaller magnitudes. Based on the above assumption, magnitude-based model compression methods [15, 11] use the $l_{p}$-norm of filters as a statistical indicator of their contribution to the final prediction result of the model. Since CLC measures the importance of filters for model performance, we assume a positive correlation between neuron magnitude and the CLC in the clean parameter space $\theta_{c}$. Denoting the correlation by $C:\mathbb{R}\rightarrow\mathbb{R}$, every clean filter $\theta^{(i,j)}$ satisfies $\textrm{CLC}(F,i,j)=C(\|\theta^{(i,j)}\|_{p})$. Since the backdoor model is trained on both the backdoor task and the clean task, each filter $\theta^{(i,j)}$ can be decomposed into $\theta^{(i,j)}=\theta^{(i,j)}_{c}\cup\theta^{(i,j)}_{b}$. Its contribution to the clean task satisfies:

$$\textrm{CLC}(F,i,j)=C(\|\theta^{(i,j)}_{c}\|_{p})\leq C(\|\theta^{(i,j)}\|_{p}), \tag{4}$$

which indicates that the $l_{p}$-norm of the backdoor or hybrid filter does not correspond to its actual contribution to the clean task, for it contains additional parameters used to trigger the backdoor. This observation is consistent with the results in Fig 1, where the majority of backdoor filters have larger $l_{p}$-norms compared to clean filters with the same CLC value.
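The inequality in Eq. (4) can be unpacked in one line. The sketch below makes three assumptions that are not stated explicitly above: the correlation $C$ is monotonically non-decreasing, the clean and backdoor parts of a filter are disjoint sets of weights, and $1\leq p<\infty$.

```latex
\|\theta^{(i,j)}\|_{p}^{p}
  = \|\theta^{(i,j)}_{c}\|_{p}^{p} + \|\theta^{(i,j)}_{b}\|_{p}^{p}
  \;\geq\; \|\theta^{(i,j)}_{c}\|_{p}^{p}
\quad\Longrightarrow\quad
\|\theta^{(i,j)}_{c}\|_{p} \leq \|\theta^{(i,j)}\|_{p}
\quad\Longrightarrow\quad
C(\|\theta^{(i,j)}_{c}\|_{p}) \leq C(\|\theta^{(i,j)}\|_{p}).
```

The two norms coincide only when $\|\theta^{(i,j)}_{b}\|_{p}=0$, i.e., when the filter carries no backdoor weights, so under these assumptions any filter with backdoor weights has a norm larger than its clean contribution alone would account for.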

4 Methodology

4.1 Basic settings

Assumptions.

Our research is based on two fundamental assumptions about the neurons of the backdoored DNN: 1) A filter can be associated with both the backdoor task and the clean task; if so, it is composed of both clean and backdoor neurons. 2) The model is trained with weight decay or other parameter regularization techniques, so the $l_{p}$-norm of a filter is positively correlated with its overall contribution to clean accuracy and backdoor behavior.

Defense setting.

We adopt a typical defense setting where the defender has downloaded a backdoored model from an untrustworthy third party without knowledge of the attack or training data. We assume a small amount of clean data $\mathcal{D}_{d}$ is available for defense, which can be collected from the Internet or carefully selected from the training data. The defender aims to first determine if the model is backdoored, then remove the backdoor behavior from the infected model with minimum degradation to its clean accuracy.

4.2 Magnitude-guided Optimization

A number of recent works [34, 18, 2, 39] on backdoor mitigation have employed a min-max optimization process to expose backdoor neurons. However, most of these works have not explicitly included the magnitude of neurons in their optimization objectives. More analyses can be found in Appendix A.1. Based on our analysis of how neuron magnitude works in backdoor defense approaches, we consider three optimization objectives to control the magnitude of neurons: weight penalty, clean suppression and clean preserving.

Weight Penalty.

As discussed in Section 3.3, the $l_{p}$-norms of clean filters approximately follow a positive correlation with their contribution to clean accuracy, while backdoor or hybrid filters contain backdoor neurons and have larger $l_{p}$-norms that deviate from the positive correlation. The deviation inspires us to develop a strategy to prune filters with large $l_{p}$-norms while preserving (or recovering) the clean accuracy. For each filter, we apply a mask $m\in[0,1]$, which acts on the magnitude of each filter without changing the specific weight. For model $F$ with $L$ layers, the collection of all masks is denoted by $\mathcal{M}=\{\boldsymbol{m}_{i}\}_{i=1}^{L}$, where $\boldsymbol{m}_{i}\in[0,1]^{c_{i}}$ is a vector of masks for the $i$-th layer with $c_{i}$ filters. The masked network $F_{\mathcal{M}}$ has the same architecture as $F$ with weight matrices of all the convolutional layers set to $\boldsymbol{m}_{i}\odot\theta^{(i)}$.
To expose the backdoor filters with large $l_{p}$-norms and minimal impact on the clean loss, we formulate our problem as follows:

$$\min_{\boldsymbol{m}_{i}\in[0,1]^{c_{i}}}\big[\mathbb{E}_{(\boldsymbol{x},y)\in\mathcal{D}_{d}}\mathcal{L}(F_{\mathcal{M}}(\boldsymbol{x};\theta),y)+\lambda\sum_{i=1}^{L}\|\boldsymbol{w}_{i}\odot\boldsymbol{m}_{i}\|_{1}\big], \tag{5}$$

where $\odot$ denotes the Hadamard product, $\boldsymbol{w}_{i}\in\mathbb{R}^{c_{i}}$ is the vector of $l_{2}$-norms of filters of the $i$-th convolutional layer, and $\lambda>0$ is a hyperparameter balancing the loss and the weight penalty term.
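The weight penalty objective of Eq. (5) can be sketched with projected gradient descent on a toy model. Everything concrete here is an assumption made for illustration: the hand-built two-layer network, the use of finite-difference gradients instead of backpropagation, and the values of the hypothetical hyperparameters `lam`, `lr` and `steps`.

```python
import math

# Toy model: hidden units 0 and 1 are clean filters, unit 2 is a backdoor
# filter that only reads the trigger feature and has a larger l2-norm.
W1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 3.0]]
W2 = [[2.0, 0.0, 5.0], [0.0, 2.0, 0.0]]
CLEAN = [([1.0, 0.0, 0.0], 0), ([0.0, 1.0, 0.0], 1)]  # small clean defense set

def clean_loss(masks):
    """Mean cross-entropy of the masked network on the clean defense data."""
    total = 0.0
    for x, y in CLEAN:
        hidden = [max(0.0, m * sum(w * v for w, v in zip(row, x)))
                  for m, row in zip(masks, W1)]
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in W2]
        top = max(logits)
        total += top + math.log(sum(math.exp(l - top) for l in logits)) - logits[y]
    return total / len(CLEAN)

def objective(masks, norms, lam):
    # For masks in [0, 1] and non-negative norms, sum(w * m) equals the
    # l1-norm of the Hadamard product w (.) m used in Eq. (5).
    penalty = sum(w * m for w, m in zip(norms, masks))
    return clean_loss(masks) + lam * penalty

def optimize_masks(lam=0.05, lr=0.5, steps=30, h=1e-5):
    norms = [math.sqrt(sum(w * w for w in row)) for row in W1]
    masks = [1.0] * len(W1)
    for _ in range(steps):
        grads = []
        for j in range(len(masks)):
            up, down = masks[:], masks[:]
            up[j] += h
            down[j] -= h
            grads.append((objective(up, norms, lam)
                          - objective(down, norms, lam)) / (2 * h))
        # Gradient step followed by projection back onto [0, 1].
        masks = [min(1.0, max(0.0, m - lr * g)) for m, g in zip(masks, grads)]
    return masks

masks = optimize_masks()
```

In this toy setup the mask of the large-norm filter that does not affect the clean loss is driven to zero, while the masks of the two clean filters stay at one because lowering them would increase the clean loss by more than the penalty saves.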

Clean Suppression.

The clean suppression objective is designed to reduce the magnitude of most clean neurons and expose the backdoor behavior of the infected model:

$$\max_{\theta}\big[\mathbb{E}_{(\boldsymbol{x},y)\in\mathcal{D}_{d}}\mathcal{L}(F(\boldsymbol{x};\theta),y)-\mu\|\theta\|_{2}^{2}\big], \tag{6}$$

where $\mu>0$ is the weight decay hyperparameter. Since suppressing clean neurons both increases the classification loss and reduces the $l_{2}$ regularization term, the magnitudes of clean neurons decrease at a faster rate than those of backdoor and hybrid neurons during the optimization.
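A heavily simplified sketch of the clean suppression objective in Eq. (6) is given below. It is not the full optimization over all parameters: to keep the example small, it assumes one scalar scale per filter (so each filter can only grow or shrink as a whole), keeps the output layer fixed, and uses finite-difference gradient ascent. The toy network and the hypothetical hyperparameters `mu`, `lr` and `steps` are chosen only for illustration.

```python
import math

# Same toy layout as a backdoored model: units 0 and 1 are clean filters,
# unit 2 is a backdoor filter that is inactive on clean inputs.
W1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 3.0]]
W2 = [[2.0, 0.0, 5.0], [0.0, 2.0, 0.0]]
CLEAN = [([1.0, 0.0, 0.0], 0), ([0.0, 1.0, 0.0], 1)]

def clean_loss(scales):
    """Mean cross-entropy when filter j is rescaled by scales[j]."""
    total = 0.0
    for x, y in CLEAN:
        hidden = [max(0.0, s * sum(w * v for w, v in zip(row, x)))
                  for s, row in zip(scales, W1)]
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in W2]
        top = max(logits)
        total += top + math.log(sum(math.exp(l - top) for l in logits)) - logits[y]
    return total / len(CLEAN)

def suppress_clean(mu=0.001, lr=0.1, steps=20, h=1e-5):
    sq_norms = [sum(w * w for w in row) for row in W1]
    scales = [1.0] * len(W1)

    def objective(s):
        # Squared l2-norm of the rescaled first-layer weights.
        decay = sum(n * v * v for n, v in zip(sq_norms, s))
        return clean_loss(s) - mu * decay

    for _ in range(steps):
        grads = []
        for j in range(len(scales)):
            up, down = scales[:], scales[:]
            up[j] += h
            down[j] -= h
            grads.append((objective(up) - objective(down)) / (2 * h))
        # Gradient ASCENT: the objective in Eq. (6) is maximized.
        scales = [s + lr * g for s, g in zip(scales, grads)]
    return scales

scales = suppress_clean()
```

In this toy setup the clean filters shrink under both the loss-maximization term and the decay term, whereas the backdoor filter shrinks only through the decay term, so its relative magnitude ends up larger than that of the clean filters.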

Clean Preserving.

In contrast to clean suppression, the clean preserving objective is designed to preserve an important subset of clean neurons. This includes most clean filters and the part of the hybrid filters that is critical for the clean accuracy. Similarly to the weight penalty process, we initialize $\boldsymbol{m}_{i}\in[1,2]^{c_{i}}$ for each filter and optimize the masks to increase the magnitude of filters as much as possible while retaining the clean loss:

$$\min_{\boldsymbol{m}_{i}\in[1,2]^{c_{i}}}\big[\mathbb{E}_{(\boldsymbol{x},y)\in\mathcal{D}_{d}}\mathcal{L}(F_{\mathcal{M}}(\boldsymbol{x};\theta),y)-\lambda\sum_{i=1}^{L}\|\boldsymbol{w}_{i}\odot\boldsymbol{m}_{i}\|_{1}\big]. \tag{7}$$
Figure 2: Overview of our proposed MNP framework, in comparison with 3 existing backdoor mitigation methods: ANP, RNP, and FP. MNP exposes backdoor neurons and preserves clean neurons by amplifying and reducing the magnitude of neuron weights, then prunes backdoor filters and high-BLC hybrid filters to balance the backdoor mitigation performance and the clean accuracy.
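
To make Eq. (7) concrete, the following minimal sketch evaluates the clean preserving objective for a toy set of filters and applies one projected gradient step to the masks. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy numbers, and the stand-in clean-loss gradients are hypothetical, and each filter is assumed to carry a single scalar mask, so that $\|\boldsymbol{w}\odot\boldsymbol{m}\|_1$ reduces to the mask value times the filter's $l_1$-norm.

```python
def l1_norm(filter_weights):
    """l1-norm of one filter's weights."""
    return sum(abs(w) for w in filter_weights)


def clean_preserving_objective(clean_loss, filters, masks, lam):
    """Eq. (7) for a given clean loss value: loss - lam * sum_j m_j * ||w_j||_1."""
    magnitude = sum(m * l1_norm(f) for f, m in zip(filters, masks))
    return clean_loss - lam * magnitude


def projected_mask_step(masks, loss_grads, filters, lam, lr, low=1.0, high=2.0):
    """One gradient-descent step on the masks, then projection onto [low, high].

    loss_grads[j] is d(clean loss)/d(m_j); here it is a hypothetical stand-in
    for the value that backpropagation would provide.
    """
    updated = []
    for m, g, f in zip(masks, loss_grads, filters):
        grad = g - lam * l1_norm(f)          # gradient of Eq. (7) w.r.t. m_j
        updated.append(min(high, max(low, m - lr * grad)))
    return updated


# Toy example: two filters, both masks start at the lower bound 1.
filters = [[0.5, -0.5], [2.0, 1.0]]
masks = [1.0, 1.0]
loss_grads = [0.0, 10.0]                     # hypothetical clean-loss gradients
objective = clean_preserving_objective(0.3, filters, masks, lam=0.5)
updated = projected_mask_step(masks, loss_grads, filters, lam=0.5, lr=0.1)
```

In this toy run, amplifying the first filter does not change the clean loss, so its mask grows; amplifying the second filter would raise the clean loss, so the projection holds its mask at the lower bound.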

4.3 Proposed Method

Backdoor Mitigation with MNP.

MNP aims to prune filters that deviate from the magnitude-saliency correlation, as these filters may contain potential backdoor neurons. The weight penalty objective is designed to achieve this purpose. Conducting the weight penalty process directly on backdoor models is enough to defend against classical attacks including BadNets [10] and WaNet [25]. However, we have empirically found that weight penalty alone cannot thoroughly remove the backdoors injected by some advanced attacks such as DFST [6] or LIRA [7], as the backdoor filters produced by these attacks may have smaller magnitudes, making them difficult to distinguish from redundant filters. Therefore, we adopt the clean suppression objective to reduce the magnitude of clean neurons in the backdoor model, thereby exposing the low-magnitude backdoor filters. The combination of the clean suppression and weight penalty processes is enough to purify most backdoor models. To further reduce the loss in model performance, we conduct the clean preserving process on the original backdoor model to preserve critical clean neurons, which other backdoor mitigation methods may mistakenly prune at the cost of clean accuracy. During this process, the masks of the backdoor and redundant filters remain almost unchanged, while the masks of the subset of clean and hybrid filters that are critical for clean accuracy increase significantly. MNP adds the masks obtained from the clean preserving process to the masks obtained from the weight penalty process and prunes the filters with lower mask values, thus better balancing clean accuracy and backdoor mitigation performance. More discussion of the mechanism of MNP can be found in Appendix A.2.
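
The final combination step described above can be sketched as follows. This is a minimal illustration with hypothetical mask values and a hypothetical helper name (`combine_and_prune`), assuming weight-penalty masks in [0, 1] and clean-preserving masks in [1, 2], so that the summed score lies in [1, 3] and a filter is kept only when its score exceeds the pruning threshold.

```python
def combine_and_prune(penalty_masks, preserving_masks, eps):
    """Sum the two per-filter masks and keep filters whose score exceeds eps."""
    scores = [a + b for a, b in zip(penalty_masks, preserving_masks)]
    keep = [s > eps for s in scores]
    return scores, keep


# Hypothetical masks for four filters of one layer.
penalty_masks = [0.9, 0.1, 0.8, 0.2]      # from the weight penalty process
preserving_masks = [1.8, 1.0, 1.0, 1.9]   # from the clean preserving process
scores, keep = combine_and_prune(penalty_masks, preserving_masks, eps=1.5)
```

In this toy case the last filter receives a low weight-penalty mask and would be pruned by that mask alone, but its high clean-preserving mask lifts the combined score above the threshold, which mirrors the intended role of the clean preserving process.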

Backdoor Detection with MNP.

MNP detects backdoor models by measuring the magnitude-saliency correlation of neurons across the model. For each filter, the squared $l_2$-norm of its weights $\|\theta^{(i,j)}\|_2^2$ and $\textrm{CLC}(\theta,i,j)$ are chosen as the magnitude and saliency metrics, respectively. Spearman's rank correlation coefficient $\rho(F,\theta)$ between $\|\theta^{(i,j)}\|_2^2$ and $\textrm{CLC}(\theta,i,j)$ can then be used to measure the strength of the magnitude-saliency correlation of model $F$. We compare the correlations of the original model $F(\cdot;\theta)$ and the suppressed model $F(\cdot;\theta_s)$ obtained from the clean suppression process. In clean models, the unlearning (clean suppression) process reduces the magnitude of clean neurons at approximately equal rates, so the magnitude-saliency correlation is maintained, whereas the correlation of backdoored models is significantly weakened. Given a threshold $\delta\in[0,1]$, the set of backdoored models can be defined as follows:

$$\mathcal{B}_{\delta}=\left\{F(\cdot;\theta):\frac{\rho(F,\theta_s)}{\rho(F,\theta)}\leq\delta\right\} \qquad (8)$$

Note that we have omitted the approximation of CLC and the normalization of the magnitude and saliency metrics. The detailed detection method can be found in Appendix A.3.
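
As a rough illustration of the decision rule in Eq. (8), the sketch below computes Spearman's rank correlation in plain Python and compares the correlation before and after suppression. The per-filter magnitude and saliency values are made up, the function names are hypothetical, and the sketch skips the CLC approximation and the normalization that the paper defers to the appendix; it also assumes the original correlation is non-zero.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def is_backdoored(rho_original, rho_suppressed, delta):
    """Eq. (8): flag the model when the correlation ratio drops to delta or below."""
    return rho_suppressed / rho_original <= delta


# Hypothetical per-filter magnitudes and saliencies, before and after suppression.
magnitude, saliency = [1.0, 2.0, 3.0, 4.0, 5.0], [0.1, 0.2, 0.3, 0.4, 0.5]
magnitude_s, saliency_s = [5.0, 1.0, 2.0, 3.0, 4.0], [0.1, 0.2, 0.3, 0.4, 0.5]
rho = spearman(magnitude, saliency)
rho_s = spearman(magnitude_s, saliency_s)
flagged = is_backdoored(rho, rho_s, delta=0.2)
```

Here the correlation collapses after suppression, so the toy model is flagged; a model whose correlation is largely retained would not be.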

Algorithm 1 Magnitude-based Pruning for Backdoor Defense (MNP)
Input: Defense data $\mathcal{D}_d$, backdoored model $F$ with parameters $\theta$, learning rates $\eta_1>0$, $\eta_2>0$, hyperparameters $\lambda$, $\mu$, max iteration number $T$, detection threshold $\delta$, pruning threshold $\epsilon\in[1,3]$
1:  Initialize $\mathcal{M}_1=\{[1]^{c_i}\}_{i=1}^{L}$, $\mathcal{M}_2=\{[1]^{c_i}\}_{i=1}^{L}$, $\theta_s=\theta$
2:  for $t=0,\dots,T-1$ do
3:     Sample a mini-batch $\mathcal{B}=\{(\boldsymbol{x}_i,y_i)\}_{i=1}^{b}\subset\mathcal{D}_d$
4:     $\theta_s\leftarrow\theta_s+\eta_1\nabla_{\theta_s}\big[\mathcal{L}(F(\boldsymbol{x}_i;\theta_s),y_i)-\mu\|\theta_s\|_2^2\big]$
5:  end for
6:  Compute Spearman's rank correlation coefficients $\rho(F,\theta)$, $\rho(F,\theta_s)$
7:  repeat
8:     Sample a mini-batch $\{(\boldsymbol{x}_i,y_i)\}_{i=1}^{b}\subset\mathcal{D}_d$
9:     $\mathcal{M}_1\leftarrow\mathcal{M}_1-\eta_2\nabla_{\mathcal{M}_1}\big[\mathcal{L}(F_{\mathcal{M}}(\boldsymbol{x}_i;\theta_s),y_i)+\lambda\sum_{i=1}^{L}\|\boldsymbol{w}_i\odot\boldsymbol{m}_i\|_1\big]$
10:     $\mathcal{M}_2\leftarrow\mathcal{M}_2-\eta_2\nabla_{\mathcal{M}_2}\big[\mathcal{L}(F_{\mathcal{M}}(\boldsymbol{x}_i;\theta),y_i)-\lambda\sum_{i=1}^{L}\|\boldsymbol{w}_i\odot\boldsymbol{m}_i\|_1\big]$
11:     $\mathcal{M}_1\leftarrow\max(0,\min(\mathcal{M}_1,1))$
12:     $\mathcal{M}_2\leftarrow\max(1,\min(\mathcal{M}_2,2))$
13:  until training converges
14:  $\hat{\theta}=\theta\odot\mathbb{I}\left((\mathcal{M}_1+\mathcal{M}_2)>\epsilon\right)$
Output: Purified model $F$ with parameters $\hat{\theta}$, detection result $\mathbb{I}\left(\frac{\rho(F,\theta_s)}{\rho(F,\theta)}>\delta\right)$
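
Expanding the gradient in step 4 of Algorithm 1 makes the two effects of clean suppression explicit. The following is only a rewriting of the update as stated, using $\nabla_{\theta_s}\|\theta_s\|_2^2=2\theta_s$:

```latex
\begin{align*}
\theta_s &\leftarrow \theta_s + \eta_1 \nabla_{\theta_s}\big[\mathcal{L}(F(\boldsymbol{x}_i;\theta_s),y_i) - \mu\|\theta_s\|_2^2\big] \\
         &= \theta_s + \eta_1 \nabla_{\theta_s}\mathcal{L}(F(\boldsymbol{x}_i;\theta_s),y_i) - 2\eta_1\mu\,\theta_s \\
         &= (1 - 2\eta_1\mu)\,\theta_s + \eta_1 \nabla_{\theta_s}\mathcal{L}(F(\boldsymbol{x}_i;\theta_s),y_i)
\end{align*}
```

The first term shrinks every weight by the factor $(1-2\eta_1\mu)$, and the second term is a gradient ascent step on the clean loss. With the values reported in Section 5.1 ($\eta_1=0.01$, $\mu=0.001$), the shrink factor is $1-2\cdot0.01\cdot0.001=0.99998$ per step.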

5 Experiments

5.1 Experimental Setup

Attack Setup.

We evaluate MNP against 10 challenging attacks. These include 3 static attacks: BadNets [10], Trojan [20] and Blend [5]; 2 clean-label attacks: CL [30] and SIG [1]; 2 dynamic attacks: IAB [24] and WaNet [25]; 2 feature-space attacks: FC [27] and DFST [6]; and 1 adaptive attack: LIRA [7]. Default settings from the original papers and open-source code are applied for most attacks, including the backdoor trigger pattern and size. The backdoor label of all attacks is set to class 0, with a default poisoning rate of 10%. Attacks are performed on CIFAR-10 with ResNet18 and on a 12-class subset of ImageNet with ResNet34. For training, Stochastic Gradient Descent (SGD) is used with an initial learning rate of 0.1, weight decay of 5e-4, and momentum of 0.9, with batch size 128 for 200 epochs on CIFAR-10 and batch size 64 for 300 epochs on the ImageNet subset. A cosine scheduler is employed to adjust the learning rate.
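
For readers unfamiliar with cosine scheduling, the sketch below shows one common form of it for the CIFAR-10 setting (initial learning rate 0.1, 200 epochs). The paper only states that a cosine scheduler is used, so the exact variant is an assumption here: the sketch anneals once per epoch down to zero, without warm-up or restarts, and `cosine_lr` is a hypothetical helper name.

```python
import math


def cosine_lr(epoch, total_epochs, base_lr=0.1, min_lr=0.0):
    """Cosine annealing from base_lr at epoch 0 to min_lr at total_epochs."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start of each of the 200 CIFAR-10 epochs, plus the end point.
schedule = [cosine_lr(epoch, 200) for epoch in range(201)]
```

Under these assumptions the learning rate starts at 0.1, passes through half of that at the midpoint of training, and decays smoothly to zero at the end.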

Defense Setup.

We compare MNP with a total of 10 backdoor defense methods. These include 4 pruning-based backdoor mitigation methods: FP [19], ANP [34], CLP [37] and RNP [18]; 3 non-pruning mitigation methods: NC [31], NAD [16] and I-BAU [35]; and 4 detection methods: NC, AC [3], SC [29] and STRIP [8] (NC serves as both a mitigation and a detection method). All defenses except CLP share limited access to 1% of the benign training data. Hyperparameters for these defenses are adjusted based on open-source code to obtain the best performance against different attacks. For MNP, we set the hyperparameter $\mu$ to 0.001, $\lambda$ to 0.0005, the number of suppression epochs $T$ to 20, and the learning rates $\eta_1$ to 0.01 and $\eta_2$ to 0.1. The detection threshold $\delta$ is set to 0.2, and the pruning threshold $\epsilon$ is dynamically adjusted to prune 30 neurons for each backdoor model.
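
One simple way to adjust the pruning threshold so that a fixed number of neurons is pruned is to rank the combined mask scores and cut at the k-th smallest one. The sketch below illustrates this idea with made-up scores; the helper name `prune_k_lowest` and the tie-breaking by index are assumptions, since the paper does not spell out the adjustment procedure.

```python
def prune_k_lowest(scores, k):
    """Prune the k filters with the lowest combined mask score.

    Returns (eps, keep): eps is the largest pruned score, so that keeping
    filters with score > eps reproduces the same selection when scores are
    distinct; keep[j] is False for pruned filters.
    """
    k = max(0, min(k, len(scores)))
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    pruned = set(order[:k])
    eps = scores[order[k - 1]] if k > 0 else float("-inf")
    keep = [j not in pruned for j in range(len(scores))]
    return eps, keep


# Hypothetical combined scores (sum of the two masks) for five filters.
scores = [2.7, 1.1, 1.8, 2.1, 1.3]
eps, keep = prune_k_lowest(scores, k=2)
```

With the setting used in the experiments, `k` would be 30 and the scores would span all filters of the model rather than five toy values.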

Evaluation Metric.

We adopt two metrics for evaluating backdoor mitigation performance: 1) Clean Accuracy (CA), which is the model's accuracy on clean test data; and 2) Attack Success Rate (ASR), which is the model's accuracy on backdoored test data, i.e., the fraction of triggered samples classified as the backdoor label. To evaluate detection performance, we additionally adopt 3) Detection Rate (DR), which is the accuracy of the defense in identifying backdoor models.
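
The three metrics can be computed as in the minimal sketch below. It reads ASR as the share of triggered samples predicted as the backdoor label and excludes samples whose true label already equals that label; both choices follow common practice and are assumptions here, as are the function names and the toy predictions.

```python
def clean_accuracy(preds, labels):
    """CA (%): share of clean test samples predicted correctly."""
    return 100.0 * sum(p == y for p, y in zip(preds, labels)) / len(labels)


def attack_success_rate(preds_on_triggered, true_labels, target_label=0):
    """ASR (%): share of triggered samples predicted as the backdoor label.

    Samples whose true label is already the target are skipped (assumption).
    """
    kept = [p for p, y in zip(preds_on_triggered, true_labels) if y != target_label]
    return 100.0 * sum(p == target_label for p in kept) / len(kept)


def detection_rate(verdicts, is_backdoored):
    """DR (%): share of models whose backdoor/clean verdict is correct."""
    return 100.0 * sum(v == t for v, t in zip(verdicts, is_backdoored)) / len(verdicts)


# Toy predictions; the backdoor label is class 0, as in the attack setup.
ca = clean_accuracy([0, 1, 2, 2], [0, 1, 2, 3])
asr = attack_success_rate([0, 0, 2, 0, 1], [0, 1, 2, 3, 1])
dr = detection_rate([True, True, False, False], [True, True, True, False])
```

A successful defense keeps CA high while driving ASR toward zero.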

Table 1: Comparison with the state-of-the-art defenses on CIFAR-10 dataset with 1% benign data on ResNet18 (%). Each cell reports CA / ASR.

| Attack | Backdoored | FP | ANP | CLP | RNP | NC | NAD | I-BAU | MNP (Ours) |
|---|---|---|---|---|---|---|---|---|---|
| BadNets | 92.43 / 100.00 | 81.45 / 25.31 | 90.27 / 1.12 | 91.54 / 1.34 | 91.09 / 0.54 | 89.32 / 5.54 | 89.57 / 1.10 | 90.19 / 12.73 | 92.11 / 0.47 |
| Trojan | 92.68 / 100.00 | 82.63 / 62.57 | 90.78 / 1.31 | 91.16 / 2.87 | 91.95 / 2.03 | 90.91 / 52.72 | 86.73 / 5.74 | 90.35 / 10.55 | 92.24 / 0.92 |
| Blend | 92.15 / 99.99 | 83.26 / 76.44 | 90.82 / 0.90 | 90.61 / 1.67 | 91.53 / 1.33 | 91.84 / 84.31 | 89.68 / 13.24 | 89.94 / 2.24 | 91.79 / 0.87 |
| CL | 91.55 / 98.93 | 81.38 / 36.42 | 89.96 / 5.47 | 89.51 / 1.54 | 90.05 / 0.75 | 90.13 / 5.66 | 86.74 / 15.18 | 87.75 / 20.12 | 91.18 / 0.62 |
| SIG | 93.52 / 99.14 | 88.17 / 23.56 | 91.57 / 5.39 | 90.14 / 10.35 | 90.63 / 0.89 | 90.50 / 90.15 | 91.37 / 3.46 | 86.71 / 25.62 | 93.50 / 1.32 |
| IAB | 94.60 / 99.66 | 85.42 / 38.95 | 93.67 / 1.52 | 93.20 / 4.71 | 93.15 / 2.34 | 94.19 / 97.98 | 89.94 / 12.16 | 87.68 / 18.34 | 94.45 / 0.48 |
| WaNet | 92.12 / 98.81 | 80.53 / 69.74 | 91.06 / 8.89 | 90.58 / 6.43 | 91.60 / 4.02 | 91.03 / 96.50 | 83.32 / 13.18 | 88.91 / 25.48 | 91.75 / 3.64 |
| FC | 93.61 / 100.00 | 88.92 / 98.04 | 85.95 / 77.42 | 80.21 / 65.87 | 89.39 / 1.55 | 92.41 / 99.78 | 90.14 / 30.37 | 85.79 / 18.22 | 91.23 / 1.87 |
| DFST | 95.50 / 100.00 | 85.53 / 80.76 | 91.24 / 19.80 | 87.45 / 58.82 | 92.15 / 25.67 | 93.51 / 99.02 | 87.43 / 15.70 | 85.22 / 26.84 | 94.79 / 20.40 |
| LIRA | 91.20 / 97.78 | 86.64 / 90.52 | 84.17 / 21.25 | 80.38 / 60.56 | 89.76 / 18.62 | 90.08 / 97.30 | 86.52 / 30.15 | 84.33 / 57.09 | 89.29 / 9.72 |
| Average | 94.42 / 98.83 | 84.39 / 60.23 | 89.95 / 14.31 | 88.68 / 21.41 | 91.39 / 7.60 | 91.46 / 72.90 | 88.14 / 14.03 | 88.69 / 21.72 | 92.23 / 4.03 |

Table 2: Comparison with the state-of-the-art defenses on ImageNet subset dataset with 1% benign data on ResNet34 (%). Each cell reports CA / ASR.

| Attack | Backdoored | FP | ANP | CLP | RNP | NC | NAD | I-BAU | MNP (Ours) |
|---|---|---|---|---|---|---|---|---|---|
| BadNets | 91.89 / 100.00 | 81.43 / 93.98 | 89.18 / 6.34 | 85.52 / 9.81 | 90.16 / 1.54 | 83.80 / 10.27 | 81.45 / 9.83 | 80.72 / 20.95 | 92.21 / 0.87 |
| Trojan | 91.50 / 99.87 | 81.17 / 90.93 | 90.31 / 1.78 | 85.98 / 7.51 | 89.06 / 1.51 | 85.53 / 80.98 | 83.05 / 5.74 | 84.36 / 10.55 | 90.03 / 1.08 |
| Blend | 89.24 / 100.00 | 79.05 / 84.41 | 82.18 / 9.20 | 82.69 / 3.31 | 87.20 / 4.37 | 83.86 / 97.93 | 79.03 / 12.71 | 80.21 / 19.35 | 89.04 / 2.56 |
| CL | 88.91 / 90.32 | 81.44 / 84.73 | 80.75 / 12.81 | 84.30 / 8.07 | 87.13 / 5.05 | 80.66 / 30.21 | 83.17 / 24.76 | 84.06 / 11.57 | 86.92 / 3.41 |
| WaNet | 88.69 / 95.48 | 82.17 / 90.32 | 81.62 / 14.73 | 79.14 / 21.56 | 84.91 / 13.68 | 83.25 / 94.56 | 81.40 / 33.15 | 80.82 / 44.68 | 86.85 / 8.42 |
| FC | 87.58 / 94.36 | 79.45 / 87.42 | 73.81 / 53.69 | 75.33 / 77.92 | 82.49 / 10.32 | 86.60 / 92.68 | 81.65 / 56.40 | 79.36 / 38.35 | 87.04 / 10.63 |
| Average | 89.64 / 96.67 | 80.79 / 88.63 | 82.98 / 16.43 | 82.16 / 21.36 | 86.83 / 6.08 | 83.95 / 67.77 | 81.63 / 23.77 | 87.53 / 24.24 | 88.68 / 4.10 |

5.2 Main Defense Results

Results on CIFAR-10.

Table 1 presents the defense performance of 8 backdoor mitigation methods against 10 backdoor attacks on CIFAR-10. MNP outperforms the other defense methods by cutting the average ASR down to 4.03% with only a slight drop in CA (2.19% on average). In comparison, the state-of-the-art methods ANP, RNP, NAD and I-BAU reduce the average ASR to 14.31%, 7.60%, 14.03%, and 21.72%, respectively. An important observation is that each defense method has its limits. For instance, NAD achieves the lowest ASR (15.70%) on DFST but is outperformed by MNP on most other attacks. MNP is slightly weaker than RNP in ASR on SIG and FC, yet it has the best overall performance and the highest clean accuracy in most settings. Nearly all defenses struggle against DFST and LIRA, suggesting the need for more sophisticated mechanisms to counter advanced attacks.

Results on ImageNet Subset.

Table 2 presents the defense performance of 8 backdoor mitigation methods against 6 backdoor attacks on the ImageNet subset. Note that some attacks are not conducted on the ImageNet subset because they are difficult to reproduce on high-resolution datasets. Pruning larger models trained on high-resolution datasets poses a greater challenge because backdoor neurons are harder to locate. Nevertheless, MNP effectively defends against the evaluated attacks and still demonstrates an advantage in maintaining clean accuracy. For example, MNP reduces the ASR of BadNets to 0.87% while even slightly improving the clean accuracy. MNP also generally outperforms the state-of-the-art defenses, except for a slightly lower clean accuracy on Trojan compared to ANP (90.03% vs. 90.31%) and a slightly higher ASR on FC compared to RNP (10.63% vs. 10.32%). Overall, MNP better balances clean accuracy and ASR, as it cuts the average ASR down to 4.10% with a decline of less than 1% in CA.

Detection Performance.

The clean suppression process of MNP can be used to detect backdoor models by the strength of the magnitude-saliency correlation, and the experimental results can be found in Table 4 in Appendix B. MNP is able to detect backdoor models generated by six attacks and achieves a 98.00% DR, outperforming NC (51.33%), AC (89.75%), SC (83.08%), and STRIP (80.67%).

5.3 Ablation Studies

Impact of the Defense Data Size.

In this part, we evaluate the impact of defense data size on the performance of MNP with a backdoored ResNet18 attacked by BadNets. We use 0.1% (50), 0.5% (250), 1% (500) and 5% (2500) images from the CIFAR-10 training set for defense, respectively. Results in Fig. 3 show that as the defense data size increases, MNP demonstrates better backdoor mitigation performance. In some cases, MNP only needs 0.1% of clean samples to reduce the ASR to < 5%, highlighting its potential to identify backdoor neurons in few-shot settings. The impact of defense data size is more pronounced on CA. As the number of clean samples grows, the distribution of the defense set gets closer to that of the original dataset, allowing MNP to preserve clean neurons more accurately. Overall, MNP can effectively eliminate backdoor neurons under few-shot settings, and 1% of clean samples is sufficient for MNP to limit the degradation of CA to < 3%.

Figure 3: Defense performance of MNP with different defense data size against BadNets
Figure 4: Defense performance of MNP with different hyperparameter settings
Table 3: Defense performance of MNP against BadNets attack on CIFAR-10 with different poisoning rates and A2O/A2A settings. Column groups give the poisoning ratio (number of poisoned samples); ND denotes No Defense.

| Attack | Metric | 0.5% (250) ND | 0.5% (250) MNP | 1% (500) ND | 1% (500) MNP | 5% (2500) ND | 5% (2500) MNP | 10% (5000) ND | 10% (5000) MNP | 20% (10000) ND | 20% (10000) MNP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BadNets-A2O | CA | 92.35 | 85.76 | 92.47 | 88.29 | 92.60 | 92.15 | 92.43 | 92.11 | 89.97 | 88.31 |
| BadNets-A2O | ASR | 99.05 | 10.40 | 98.93 | 3.26 | 100.00 | 0.39 | 100.00 | 0.47 | 100.00 | 1.28 |
| BadNets-A2A | CA | 93.05 | 92.02 | 92.97 | 92.16 | 92.44 | 91.89 | 92.65 | 92.07 | 86.31 | 84.12 |
| BadNets-A2A | ASR | 63.51 | 44.38 | 79.52 | 8.75 | 90.45 | 0.78 | 91.38 | 0.97 | 94.56 | 0.44 |

Performance against Different Poisoning Rates and All-to-All Attacks.

We conducted experiments on MNP with different poisoning rates on CIFAR-10. We also tested the effectiveness of MNP against the all-to-all attack, where the target label of a backdoored sample is set to its original label plus one ($S(y)=y+1$). Experimental results are shown in Table 3. For attacks with poisoning rates $\geq$ 5%, MNP effectively reduces the ASR to approximately 1% without causing a significant decrease in CA. For attacks with poisoning rates $\leq$ 1%, MNP still maintains a certain level of defense performance. Additionally, MNP effectively defends against all-to-all attacks with poisoning rates $\geq$ 5%, reducing the ASR to below 1% with roughly 2% CA degradation or less.
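
The difference between the all-to-one and all-to-all label mappings can be sketched as follows. The wrap-around for the last class (modulo the number of classes) is an assumption commonly made for this attack; the text only states $S(y)=y+1$. The function names are hypothetical.

```python
def all_to_one(label, target=0):
    """A2O: every triggered sample is relabeled to the same target class."""
    return target


def all_to_all(label, num_classes=10):
    """A2A: S(y) = y + 1, wrapping the last class to 0 (assumed)."""
    return (label + 1) % num_classes


# CIFAR-10 has 10 classes, labeled 0..9.
a2o_targets = [all_to_one(y) for y in range(10)]
a2a_targets = [all_to_all(y) for y in range(10)]
```

Because the A2A target depends on the source class, the backdoor has no single target label.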

Impact of hyperparameters.

The most critical hyperparameters in MNP are $\lambda$ and $\mu$, as higher values of $\lambda$ or $\mu$ cause the optimization process to be dominated by the magnitude of neurons rather than the classification loss of the model. We kept the other hyperparameters the same as in Section 5.1 and adjusted $\mu$ to values of 0, 1e-4, 1e-3, 1e-2, and 1e-1, and $\lambda$ to values of 0, 5e-5, 5e-4, 5e-3, and 5e-2. The results in Fig. 4 indicate that MNP achieves the best defense performance when $\mu$ and $\lambda$ are set to suitably small values. A larger $\mu$ simultaneously reduces the magnitude of both backdoor and clean neurons and makes the clean suppression process ineffective. Similarly, a larger $\lambda$ makes MNP ignore the classification loss, which prevents it from effectively preserving clean neurons. Overall, setting $\lambda$ and $\mu$ to smaller values helps maintain stable performance, while appropriately larger values of $\lambda$ and $\mu$ help achieve the best performance.

6 Conclusion

This paper proposes Magnitude-based Neuron Pruning (MNP), a novel method to detect and mitigate backdoor attacks. The core idea of MNP is to manipulate the magnitude of backdoor and clean neurons through three processes: clean suppression, weight penalty and clean preserving. Clean suppression reduces the magnitude of clean neurons to expose and identify backdoor behavior. Weight penalty eliminates neurons with large magnitude but little contribution to the clean task, i.e., potential backdoor neurons. Clean preserving increases the magnitude of critical clean neurons to prevent them from being pruned. The empirical success of MNP across a range of backdoor attack scenarios highlights the potential of neuron magnitude for backdoor defense.

References

  • Barni et al. [2019] Mauro Barni, Kassem Kallas, and Benedetta Tondi. A new backdoor attack in cnns by training set corruption without label poisoning. In 2019 IEEE International Conference on Image Processing (ICIP), pages 101–105. IEEE, 2019.
  • Chai and Chen [2022] Shuwen Chai and Jinghui Chen. One-shot Neural Backdoor Erasing via Adversarial Weight Masking. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 22285–22299. Curran Associates, Inc., 2022.
  • Chen et al. [2018] Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Taesung Lee, Ian Molloy, and Biplav Srivastava. Detecting Backdoor Attacks on Deep Neural Networks by Activation Clustering, November 2018.
  • Chen et al. [2019] Huili Chen, Cheng Fu, Jishen Zhao, and Farinaz Koushanfar. Deepinspect: A black-box Trojan detection and mitigation framework for deep neural networks. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI’19, pages 4658–4664. AAAI Press, 2019. ISBN 978-0-9992411-4-1.
  • Chen et al. [2017] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning. CoRR, abs/1712.05526, 2017.
  • Cheng et al. [2021] Siyuan Cheng, Yingqi Liu, Shiqing Ma, and Xiangyu Zhang. Deep feature space trojan attack of neural networks by controlled detoxification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1148–1156, 2021.
  • Doan et al. [2021] Khoa D Doan, Yingjie Lao, Weijie Zhao, and Ping Li. LIRA: Learnable, imperceptible and robust backdoor attacks. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 11946–11956, 2021.
  • Gao et al. [2020] Yansong Gao, Chang Xu, Derui Wang, Shiping Chen, Damith C. Ranasinghe, and Surya Nepal. STRIP: A Defence Against Trojan Attacks on Deep Neural Networks, January 2020.
  • Garg et al. [2020] Siddhant Garg, Adarsh Kumar, Vibhor Goel, and Yingyu Liang. Can Adversarial Weight Perturbations Inject Neural Backdoors. In Mathieu d’Aquin, Stefan Dietze, Claudia Hauff, Edward Curry, and Philippe Cudré-Mauroux, editors, CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, pages 2029–2032. ACM, 2020. doi: 10.1145/3340531.3412130.
  • Gu et al. [2017] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733, 2017.
  • He et al. [2018] Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft Filter Pruning for Accelerating Deep Convolutional Neural Networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. International Joint Conferences on Artificial Intelligence Organization, 2018.
  • Hu et al. [2021] Xiaoling Hu, Xiao Lin, Michael Cogswell, Yi Yao, Susmit Jha, and Chao Chen. Trigger Hunting with a Topological Prior for Trojan Detection. In International Conference on Learning Representations, 2021.
  • Kolouri et al. [2020] Soheil Kolouri, Aniruddha Saha, Hamed Pirsiavash, and Heiko Hoffmann. Universal litmus patterns: Revealing backdoor attacks in CNNs. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 298–307. Computer Vision Foundation / IEEE, 2020. doi: 10.1109/CVPR42600.2020.00038.
  • LeCun et al. [1989] Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. In D. Touretzky, editor, Advances in Neural Information Processing Systems, volume 2. Morgan-Kaufmann, 1989.
  • Li et al. [2016] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning Filters for Efficient ConvNets. CoRR, abs/1608.08710, 2016.
  • Li et al. [2020] Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks. In International Conference on Learning Representations, 2020.
  • Li et al. [2021] Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. Anti-Backdoor Learning: Training Clean Models on Poisoned Data. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, Virtual, pages 14900–14912, 2021.
  • Li et al. [2023] Yige Li, Xixiang Lyu, Xingjun Ma, Nodens Koren, Lingjuan Lyu, Bo Li, and Yu-Gang Jiang. Reconstructive Neuron Pruning for Backdoor Defense. In Proceedings of the 40th International Conference on Machine Learning, ICML’23, Honolulu, Hawaii, USA, 2023. JMLR.org.
  • Liu et al. [2018a] Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Fine-pruning: Defending against backdooring attacks on deep neural networks. In International Symposium on Research in Attacks, Intrusions, and Defenses, pages 273–294. Springer, 2018a.
  • Liu et al. [2018b] Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning Attack on Neural Networks. In 25th Annual Network and Distributed System Security Symposium, NDSS 2018, San Diego, California, USA, February 18-21, 2018. The Internet Society, 2018b.
  • Liu et al. [2019] Yingqi Liu, Wen-Chuan Lee, Guanhong Tao, Shiqing Ma, Yousra Aafer, and Xiangyu Zhang. ABS: Scanning Neural Networks for Back-doors by Artificial Brain Stimulation. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pages 1265–1282, London United Kingdom, November 2019. ACM. ISBN 978-1-4503-6747-9. doi: 10.1145/3319535.3363216.
  • Molchanov et al. [2017] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
  • Molchanov et al. [2019] Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance estimation for neural network pruning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 11264–11272. Computer Vision Foundation / IEEE, 2019. doi: 10.1109/CVPR.2019.01152.
  • Nguyen and Tran [2020] Tuan Anh Nguyen and Anh Tuan Tran. Input-Aware Dynamic Backdoor Attack. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, Virtual, 2020.
  • Nguyen and Tran [2021] Tuan Anh Nguyen and Anh Tuan Tran. WaNet - Imperceptible Warping-based Backdoor Attack. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
  • Qi et al. [2022] Xiangyu Qi, Tinghao Xie, Ruizhe Pan, Jifeng Zhu, Yong Yang, and Kai Bu. Towards Practical Deployment-Stage Backdoor Attack on Deep Neural Networks. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13337–13347, New Orleans, LA, USA, June 2022. IEEE. ISBN 978-1-66546-946-3. doi: 10.1109/CVPR52688.2022.01299.
  • Shafahi et al. [2018] Ali Shafahi, W. Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Dumitras, and Tom Goldstein. Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 6106–6116, 2018.
  • Tao et al. [2022] Guanhong Tao, Guangyu Shen, Yingqi Liu, Shengwei An, Qiuling Xu, Shiqing Ma, Pan Li, and Xiangyu Zhang. Better Trigger Inversion Optimization in Backdoor Scanning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13368–13378, June 2022.
  • Tran et al. [2018] Brandon Tran, Jerry Li, and Aleksander Madry. Spectral signatures in backdoor attacks. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 8011–8021, 2018.
  • Turner et al. [2019] Alexander Turner, Dimitris Tsipras, and Aleksander Madry. Clean-label backdoor attacks. 2019.
  • Wang et al. [2019] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. In 2019 IEEE Symposium on Security and Privacy (SP), pages 707–723. IEEE Computer Society, 2019.
  • Wang et al. [2022a] Zhenting Wang, Hailun Ding, Juan Zhai, and Shiqing Ma. Training with More Confidence: Mitigating Injected and Natural Backdoors During Training. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022a.
  • Wang et al. [2022b] Zhenting Wang, Kai Mei, Hailun Ding, Juan Zhai, and Shiqing Ma. Rethinking the Reverse-engineering of Trojan Triggers. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 9738–9753. Curran Associates, Inc., 2022b.
  • Wu and Wang [2021] Dongxian Wu and Yisen Wang. Adversarial Neuron Pruning Purifies Backdoored Deep Models. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 16913–16925. Curran Associates, Inc., 2021.
  • Zeng et al. [2022] Yi Zeng, Si Chen, Won Park, Zhuoqing Mao, Ming Jin, and Ruoxi Jia. Adversarial Unlearning of Backdoors via Implicit Hypergradient. In International Conference on Learning Representations, 2022.
  • Zhao et al. [2022] Zhendong Zhao, Xiaojun Chen, Yuexin Xuan, Ye Dong, Dakui Wang, and Kaitai Liang. DEFEAT: Deep Hidden Feature Backdoor Attacks by Imperceptible Perturbation and Latent Representation Constraints. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 15192–15201. IEEE, 2022. doi: 10.1109/CVPR52688.2022.01478.
  • Zheng et al. [2022a] Runkai Zheng, Rongjun Tang, Jianze Li, and Li Liu. Data-free backdoor removal based on channel lipschitzness. In European Conference on Computer Vision, pages 175–191. Springer, 2022a.
  • Zheng et al. [2022b] Runkai Zheng, Rongjun Tang, Jianze Li, and Li Liu. Pre-activation Distributions Expose Backdoor Neurons. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022b.
  • Zhu et al. [2023] Mingli Zhu, Shaokui Wei, Li Shen, Yanbo Fan, and Baoyuan Wu. Enhancing Fine-Tuning based Backdoor Defense with Sharpness-Aware Minimization. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4443–4454, Paris, France, October 2023. IEEE. ISBN 9798350307184. doi: 10.1109/ICCV51070.2023.00412.

Appendix A Theoretical Analysis and Mechanism of MNP

A.1 Min-Max Optimization for Backdoor Defense

A number of recent works [34, 18, 2, 39] on backdoor mitigation have employed a min-max optimization process to expose backdoor neurons. This paradigm can be reinterpreted from the perspective of neuron magnitude.

ANP [34], one of the state-of-the-art backdoor mitigation methods, adversarially perturbs the magnitude of each filter to maximize the clean loss of the model. The adversarial perturbation may increase the magnitude of the backdoor and hybrid filters, thus amplifying the backdoor activation and triggering the backdoor behavior. ANP then recovers the model performance by optimizing a mask for each filter. During this process, the masks of the filters that contribute most to the backdoor behavior are reduced to zero, so these filters can be identified and pruned.

Another state-of-the-art method, RNP [18], unlearns the model to maximize the clean loss. The difference is that the unlearning process is conducted at the neuron level. During unlearning, the magnitude of clean neurons is reduced, and the magnitude of backdoor neurons may also increase. RNP then applies a similar filter-level recovery process to restore the clean accuracy and expose the backdoor filters. RNP is empirically more effective than ANP, as the unlearning process may reduce the magnitude of most clean neurons, so that the magnitude of all backdoor neurons increases in relative terms. Since the clean neurons are neutralized, some hybrid filters may also transform into backdoor filters. In this way, the majority of the backdoor-associated filters are exposed. In contrast, the adversarial perturbation in ANP only amplifies the magnitude of a subset of backdoor filters, so some low-contribution backdoor filters and hybrid filters are generally ignored.

A.2 More Understanding of MNP

Why clean suppression is needed before weight penalty.

We assume that backdoor or hybrid filters contain backdoor neurons and have larger $l_p$-norms that deviate from the magnitude-saliency correlation. We find that this assumption holds strongly for highly regularized models, and directly conducting the weight penalty process on these models is enough to erase the injected backdoor. However, for less regularized models or models that have not fully converged, the magnitude-saliency correlation is not as pronounced as in highly regularized models. In this case, the weight penalty process alone is not enough to thoroughly remove the injected backdoor. By applying the clean suppression process before the weight penalty, we can reduce the magnitude of most clean neurons and expose more backdoor-associated filters, leading to better backdoor mitigation performance.

Why clean suppression rather than adversarial perturbation.

As discussed in Section A.1, the adversarial perturbation, whether conducted at the filter level or the neuron level, can only amplify the magnitude of a subset of high-contribution backdoor filters, since the classification loss can be effectively increased by perturbing a small number of backdoor neurons. In contrast, the clean suppression process reduces the magnitude of most clean neurons, so the magnitude of all backdoor neurons increases in relative terms and more backdoor-associated filters can be exposed.

How clean preserving helps reduce the compromise of clean accuracy.

Backdoor neurons exist across different layers of the infected model. Some hybrid filters (for example, filters in the first convolution layer) may contain clean neurons critical for the model performance, which may be neutralized by the clean suppression process. Although the weight penalty process can expose most backdoor and hybrid filters, it cannot completely recover the clean accuracy. In other words, some hybrid filters that are critical for the clean task may be pruned. To reduce the compromise of clean accuracy, we apply the clean preserving process to preserve critical clean neurons. The clean preserving process is performed at the filter level. During the process, the masks of the backdoor and redundant filters remain almost unchanged, while the masks of a subset of clean and hybrid filters that are critical for the clean accuracy increase significantly. We add the masks obtained from the clean preserving process to the masks obtained from the weight penalty process, and prune filters with lower mask values, thus better balancing clean accuracy and backdoor mitigation performance. The pruning strategy of MNP differs from most previous methods [34, 18], which prune filters directly by their contribution to the backdoor task. For example, if a hybrid filter contributes significantly to both the clean task and the backdoor task, it may be directly pruned in these methods. In contrast, in MNP, as the clean preserving process increases the mask of that filter, the priority of pruning it is lower than that of pruning filters whose removal reduces the backdoor contribution but does not affect the clean accuracy, i.e., purely backdoor neurons.
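The mask combination described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `select_filters_to_prune`, the fixed pruning budget `num_to_prune`, and all mask values are hypothetical. The source only states that the two sets of masks are added and that filters with lower mask values are pruned.

```python
def select_filters_to_prune(penalty_masks, preserve_masks, num_to_prune):
    """Add the two mask sets and return the indices with the lowest sums.

    penalty_masks  : per-filter masks from the weight penalty process
    preserve_masks : per-filter masks from the clean preserving process
    num_to_prune   : pruning budget (an assumption; a threshold on the
                     combined mask would work equally well)
    """
    if len(penalty_masks) != len(preserve_masks):
        raise ValueError("both mask lists must cover the same filters")
    combined = [wp + cp for wp, cp in zip(penalty_masks, preserve_masks)]
    # Sort filter indices by combined mask value, lowest first.
    order = sorted(range(len(combined)), key=lambda k: combined[k])
    return sorted(order[:num_to_prune]), combined


# Hypothetical masks for four filters: clean, hybrid, backdoor, redundant.
penalty_masks = [0.9, 0.2, 0.1, 0.3]
preserve_masks = [0.5, 0.6, 0.0, 0.0]

pruned, combined = select_filters_to_prune(penalty_masks, preserve_masks, 2)
pruned_wp_only, _ = select_filters_to_prune(penalty_masks, [0.0] * 4, 2)

print("pruned with clean preserving:   ", pruned)
print("pruned with weight penalty only:", pruned_wp_only)
```

In this toy setting the hybrid filter (index 1) is pruned when only the weight penalty masks are used, but is kept once the clean preserving masks are added, which mirrors the lower pruning priority described in the text.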

A.3 Detect Backdoor Models with MNP

Based on the assumptions in Section 4.1, the magnitude of a neuron is positively correlated with its contribution to the prediction results. In benign models, the magnitude-CLC correlation is approximately equal to the magnitude-contribution correlation. In backdoor models, the magnitude-CLC correlation should be weaker than the actual magnitude-contribution correlation, because CLC only partially reflects the contribution of backdoor and hybrid neurons, as mentioned in Section 3.3. Intuitively, we can detect backdoor models by measuring the strength of the correlation between neuron magnitude and saliency. However, directly computing CLC is computationally expensive, because it requires pruning each filter $\theta^{(i,j)}$ and measuring the change in loss. The clean loss $\mathcal{L}_{cl}$ can be denoted as follows:

$$\mathcal{L}_{cl}(\theta)=\mathbb{E}_{(\boldsymbol{x},y)\in\mathcal{D}_{t}}\mathcal{L}(F(\boldsymbol{x};\theta),y) \tag{9}$$

We can approximate CLC in the vicinity of $\theta$ by the first-order Taylor expansion of the clean loss:

$$\mathcal{L}_{cl}(\theta)=\mathcal{L}_{cl}(\theta|{\theta^{(i,j)}=0})+\frac{\delta\mathcal{L}_{cl}}{\delta\theta^{(i,j)}}\theta^{(i,j)}+O(\|\theta^{(i,j)}\|_{p}^{2}) \tag{10}$$

We can neglect the first-order remainder to avoid computational difficulties, since the widely-used ReLU activation function encourages a smaller second-order term. By substituting Eq. 10 into Eq. 2 and ignoring the remainder, we have:

$$\widetilde{\textrm{CLC}}(\theta,i,j)=\mathcal{L}_{cl}(\theta)-\frac{\delta\mathcal{L}_{cl}}{\delta\theta^{(i,j)}}\theta^{(i,j)}-\mathcal{L}_{cl}(\theta)=-\frac{\delta\mathcal{L}_{cl}}{\delta\theta^{(i,j)}}\theta^{(i,j)}, \tag{11}$$

which is computationally friendly, as the gradient $\frac{\delta\mathcal{L}_{cl}}{\delta\theta^{(i,j)}}$ can be easily computed through backpropagation. Another problem is that the scale of neuron magnitude and saliency varies across different layers of DNNs. Thus we apply a simple layer-wise $l_2$-normalization to conduct rescaling across layers. The magnitude and saliency metrics for each filter are defined as follows:

$$m_{ij}=\frac{\|\theta^{(i,j)}\|_{p}}{\sqrt{\sum_{j}\left(\|\theta^{(i,j)}\|_{p}\right)^{2}}} \tag{12}$$

$$s_{ij}=\frac{|\widetilde{\textrm{CLC}}(\theta,i,j)|}{\sqrt{\sum_{j}\widetilde{\textrm{CLC}}(\theta,i,j)^{2}}} \tag{13}$$
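To make the first-order approximation in Eq. 11 concrete, the following sketch compares the exact loss change from zeroing a parameter with its Taylor estimate. It is only an illustration under stated assumptions: the "model" is a hypothetical one-sample linear regressor with squared loss, each scalar weight stands in for a filter, CLC is taken to be the clean loss after pruning minus the clean loss before pruning (the form implied by the middle expression of Eq. 11), and the gradient is written by hand instead of being obtained through backpropagation.

```python
def clean_loss(theta, x, y):
    """Squared error of a toy linear model on a single clean sample."""
    residual = sum(t * xi for t, xi in zip(theta, x)) - y
    return residual ** 2


def exact_clc(theta, x, y, j):
    """Loss change measured by actually zeroing parameter j."""
    pruned = list(theta)
    pruned[j] = 0.0
    return clean_loss(pruned, x, y) - clean_loss(theta, x, y)


def taylor_clc(theta, x, y, j):
    """First-order estimate: minus the gradient times the parameter."""
    residual = sum(t * xi for t, xi in zip(theta, x)) - y
    grad_j = 2.0 * residual * x[j]  # analytic d(loss)/d(theta_j)
    return -grad_j * theta[j]


# Hypothetical parameters and one hypothetical clean sample.
theta = [0.05, -0.8, 0.3]
x = [1.0, 2.0, -1.5]
y = 0.5

for j in range(len(theta)):
    exact = exact_clc(theta, x, y, j)
    approx = taylor_clc(theta, x, y, j)
    print(f"param {j}: exact={exact:.4f} taylor={approx:.4f} gap={exact - approx:.4f}")
```

For this quadratic toy loss the gap between the two values is exactly $(\theta_j x_j)^2$, i.e. the second-order term that the approximation drops, so the estimate is tight for small-magnitude parameters and looser for large ones.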

Note that it is necessary to use the absolute value of CLC as the saliency metric, since CLC cannot reflect the uncertainty of model prediction brought by pruning a single filter, as discussed in model compression studies [23, 22]. For each model $F$, we compute Spearman's rank correlation coefficient $\rho_F$ between the magnitude metric $m_{ij}$ and the saliency metric $s_{ij}$. Under the same training settings, the coefficient for backdoor models is significantly lower than that for clean models.
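The detection statistic described above can be sketched in plain Python as follows. This is a hedged illustration, not the authors' code: the helper names (`l2_normalize`, `average_ranks`, `spearman`), the two-layer toy data, and the use of average ranks for ties are assumptions, and the per-filter norms and approximate CLC values are taken as given inputs.

```python
import math


def l2_normalize(values):
    """Layer-wise l2 normalization of absolute values (Eqs. 12 and 13)."""
    norm = math.sqrt(sum(v * v for v in values))
    if norm == 0.0:
        return [0.0 for _ in values]
    return [abs(v) / norm for v in values]


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0
        for k in order[pos:end + 1]:
            ranks[k] = shared
        pos = end + 1
    return ranks


def spearman(a, b):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(ra, rb))
    var_a = sum((p - mean_a) ** 2 for p in ra)
    var_b = sum((q - mean_b) ** 2 for q in rb)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / math.sqrt(var_a * var_b)


# Hypothetical per-layer filter norms and approximate CLC values.
layers = [
    {"norms": [1.0, 2.0, 4.0], "clc": [-0.02, 0.05, -0.30]},
    {"norms": [10.0, 30.0, 20.0], "clc": [0.50, 0.10, 1.00]},
]
magnitude, saliency = [], []
for layer in layers:
    magnitude.extend(l2_normalize(layer["norms"]))
    saliency.extend(l2_normalize(layer["clc"]))

rho = spearman(magnitude, saliency)
print(f"rho_F = {rho:.3f}")
```

Normalizing within each layer before pooling the filters keeps layers with large raw norms from dominating the ranking, which is the purpose of the rescaling step in the text.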

However, the defender lacks information about the training process and cannot compare the untrustworthy model with the corresponding benign model. We instead tune the model with the clean suppression objective to perturb the magnitude of clean neurons. In clean models, the unlearning process reduces the magnitude of clean neurons at approximately equal rates, so the magnitude-saliency correlation is maintained, while the correlation of backdoored models is significantly weakened. The idea is that the unlearning process has different effects on backdoor and clean neurons, making the hybrid filters and backdoor filters deviate further from the magnitude-saliency correlation, which is also implied in [18].

A.4 Limitations of MNP

Despite the promising results of our Magnitude-based Neuron Pruning (MNP) method, there are several limitations that warrant attention:

  1. Clean data acquisition. Although MNP requires only a modest amount of clean data to achieve effective backdoor defense, the availability of clean data remains a significant limitation. Without access to a reliable source of clean data, the application of our method may be restricted. Future work should focus on enhancing MNP to operate with even less clean data or on developing techniques that can identify and utilize clean data within a poisoned dataset.

  2. Defense against low poisoning rate attacks. MNP shows robust performance against a variety of backdoor attacks but faces challenges when the poisoning rate is exceedingly low (e.g., $\leq 1\%$). In such cases, backdoor neurons are more adept at blending with clean neurons, making them harder to detect and prune accurately. Addressing this limitation will involve refining our detection algorithms to be more sensitive to subtle anomalies in neuron behavior.

  3. Theoretical guarantees against advanced attacks. Our study is grounded in observations and regularization assumptions, which means that we cannot theoretically guarantee that MNP can defend against all sophisticated attacks. The evolving nature of adversarial strategies necessitates continuous research to strengthen the theoretical underpinnings of our method and ensure it remains effective against future threats.

Appendix B Additional Experimental Results

Table 4: Comparison of the DR (%) of MNP with 4 different detection methods against 6 existing attacks.

| Defense | BadNets | Trojan | Blend | CL | IAB | WaNet | Avg |
|---|---|---|---|---|---|---|---|
| NC | 100.00 | 100.00 | 20.50 | 52.50 | 9.50 | 25.50 | 51.33 |
| AC | 100.00 | 100.00 | 91.50 | 84.50 | 75.50 | 87.00 | 89.75 |
| SC | 100.00 | 100.00 | 95.50 | 55.00 | 57.50 | 90.50 | 83.08 |
| STRIP | 100.00 | 100.00 | 83.50 | 50.00 | 50.50 | 100.00 | 80.67 |
| MNP (Ours) | 100.00 | 100.00 | 100.00 | 95.50 | 92.50 | 100.00 | 98.00 |

B.1 Run-time Analysis

We record the running time of MNP on an RTX 3090 Ti GPU with 500 CIFAR-10 samples and a ResNet18 model attacked by BadNets. MNP takes 1 minute and 27 seconds to reduce the ASR to 0.47, which is close to the running time of FP (30 seconds), ANP (45 seconds) and RNP (1 minute and 11 seconds). This computing cost is acceptable compared with retraining the model from scratch (more than 1 hour).