Magnitude-based Neuron Pruning for Backdoor Defense

Nan Li
School of Cyber Science and Engineering
Shanghai Jiao Tong University
Shanghai, China, 200240
[email protected]
Haoyu Jiang
School of Cyber Science and Engineering
Shanghai Jiao Tong University
Shanghai, China, 200240
[email protected]
** Yi
School of Cyber Science and Engineering
Shanghai Jiao Tong University
Shanghai, China, 200240
[email protected]

Abstract

Deep Neural Networks (DNNs) are known to be vulnerable to backdoor attacks, posing concerning threats to their reliable deployment. Recent research reveals that backdoors can be erased from infected DNNs by pruning a specific group of neurons, while how to effectively identify and remove these backdoor-associated neurons remains an open challenge. In this paper, we investigate the correlation between backdoor behavior and neuron magnitude, and find that backdoor neurons deviate from the magnitude-saliency correlation of the model. The deviation inspires us to propose a Magnitude-based Neuron Pruning (MNP) method to detect and prune backdoor neurons. Specifically, MNP uses three magnitude-guided objective functions to manipulate the magnitude-saliency correlation of backdoor neurons, thus achieving the purpose of exposing backdoor behavior, eliminating backdoor neurons and preserving clean neurons, respectively. Experiments show our pruning strategy achieves state-of-the-art backdoor defense performance against a variety of backdoor attacks with a limited amount of clean data, demonstrating the crucial role of magnitude for guiding backdoor defenses.

1 Introduction

In recent years, Deep Neural Networks (DNNs) have demonstrated remarkable capabilities in solving real-world problems. However, the wide application of DNNs has raised concerns about their security and trustworthiness. Recent works have shown that DNNs are vulnerable to backdoor attacks [10], in which an adversary injects malicious triggers into the victim model through data poisoning, manipulating the training process, or directly modifying model parameters. The backdoored model performs well on clean samples but can be triggered into false predictions by poisoned samples containing trigger patterns. As pre-trained weights and outsourced training are widely used to cut the computational cost of training DNNs, backdoor attacks are becoming an undeniable security issue.

To address this issue, numerous methods have been proposed for detecting and mitigating backdoor attacks. Backdoor detection methods [31, 21, 12] identify whether a model is backdoored or a dataset is poisoned, while backdoor mitigation methods [19, 16] eliminate the injected triggers from backdoored models. Recent research [34, 18] has observed a subset of neurons contributing the most to backdoor behaviors in infected DNNs. By pruning these backdoor-associated neurons, the backdoor behavior of the infected model can be effectively mitigated. Backdoor neurons are believed to have certain properties. For example, they can only be activated by trigger patterns [39], and are more sensitive to input [37] or perturbation [34]. These properties can be used to design certain pruning strategies to mitigate the injected backdoor.

Magnitude is considered an important indicator to guide pruning in model compression studies [15, 11]. Most of these studies assume a positive correlation between neuron magnitude and neuron importance for model performance, as neurons with smaller magnitudes have less numerical impact on the output of the model. We empirically show that backdoor neurons deviate from this correlation, as they contain extra weights used to trigger the backdoor. Similar observations are also implied in [37, 39]. Motivated by our findings, we propose a Magnitude-based Neuron Pruning (MNP) method to defend against backdoor attacks. Specifically, MNP first detects the injected backdoor by analyzing the correlation between neuron magnitude and neuron saliency, and then optimizes neuron masks with three magnitude-guided objective functions to expose and prune backdoor neurons. Given a small subset of clean samples, MNP can effectively defend against backdoors injected by a variety of backdoor attacks. Experiments show MNP is competitive for both backdoor detection and mitigation among a set of state-of-the-art backdoor defense methods. Our defense strategy significantly surpasses the previous state-of-the-art method RNP [18], achieving superior defense outcomes against over ten types of attacks on both the CIFAR-10 and ImageNet datasets, with the majority of results reaching optimal performance levels.

To summarize, our main contributions are three-fold. 1) We explore the correlation between neuron magnitude and the contribution of neurons to backdoor behavior, and reinterpret the mechanism of backdoor defenses from the perspective of neuron magnitude. 2) Beyond offering hypotheses for how neuron magnitude works in backdoor defense methods, we further validate our claims by using them as principles to construct our own backdoor defense method, MNP, which utilizes three optimization objectives to manipulate the magnitude of neurons. 3) We empirically show that our method is competitive compared to ten state-of-the-art backdoor defense methods against ten challenging backdoor attacks across different model architectures and datasets.

2 Related Work

Backdoor Attack.

Depending on how the trigger pattern is injected, backdoor attacks fall into two main categories: input-space attacks poisoning the training dataset and feature-space attacks manipulating the training process or directly modifying model parameters. Input-space attacks, also known as poisoning-based attacks, are conducted by modifying a small subset of training data, which commonly includes patching trigger patterns into the sample and shifting the corresponding label to the targeted class. Note that there are also clean-label attacks [30] that do not relabel any sample. The model trained on the poisoned data learns both the clean task permitting its accuracy on clean samples, and the backdoor task tricking it into false predictions with the trigger pattern. The input-space attacks can be further categorized into static and dynamic attacks depending on the trigger pattern they use. Static attacks use the same trigger pattern for all samples, such as black-white squares [10], Gaussian noise [5] and adversarial perturbations [30], while dynamic attacks [24, 25] use sample-wise triggers to enhance their stealthiness. Feature-space attacks occur under a different threat model, where the adversary has full access to the training process and the model weights. These attacks may directly manipulate the training process with certain optimization objectives [27, 6, 36], or perturb the model weights [9, 26] to inject backdoor in the feature space, making them more challenging to defend against.

Backdoor Defense.

Backdoor defense involves two primary tasks: backdoor detection and backdoor mitigation. Backdoor detection methods focus on identifying backdoored models [1, 21, 4, 13] or backdoored samples [8, 29]. Some advanced detection methods also conduct reverse engineering with the backdoored model to recover the trigger pattern [31, 28, 33, 12]. Backdoor mitigation methods aim to remove the injected backdoor from the infected model with minimal degradation of its performance on clean samples. Existing techniques include fine-tuning, distillation [16], unlearning [35], pruning [19], and training-time defenses [17, 32, 21]. Recent works on pruning have demonstrated remarkable performance in backdoor mitigation. FP [19] assumes that backdoor-associated neurons can only be activated by the trigger pattern, so pruning neurons that are dormant when the model is fed clean samples can mitigate the injected backdoor. ANP [34] perturbs neuron weights to maximize the classification loss of the model, then fixes the model by pruning the neurons that are most sensitive to the adversarial perturbation, as these neurons are believed to be strongly related to the injected backdoor. RNP [18] optimizes masks through an unlearning-recovering process to expose backdoor neurons. CLP [37] introduces the channel Lipschitz value to evaluate each neuron's sensitivity to input and prunes backdoor neurons with high sensitivity. FT-SAM [39] reveals that backdoor neurons tend to have larger magnitudes, and incorporates sharpness-aware minimization into fine-tuning to purify the infected models.

3 Preliminaries

3.1 Backdoor Learning

Consider a standard $K$-class classification problem on a training set $\mathcal{D}=\{(\boldsymbol{x}_{i},y_{i})\}^{D}_{i=1}\subseteq\mathcal{X}\times\mathcal{Y}$, with $\mathcal{X}\subset\mathbb{R}^{d}$ as the sample space and $\mathcal{Y}\subset\{1,2,...,K\}$ as the label space. Given a subset $\mathcal{D}_{b}\subseteq\mathcal{D}$, the standard poisoning-based backdoor attack involves injecting the trigger pattern into input samples with the poisoning function $\delta:\mathcal{X}\rightarrow\mathcal{X}$ and modifying the corresponding labels with the label shifting function $S:\mathcal{Y}\rightarrow\mathcal{Y}$. The backdoor attack can be viewed as a multi-task learning problem [18] on both the clean subset $\mathcal{D}_{c}=\mathcal{D}-\mathcal{D}_{b}$ and the backdoor subset $\mathcal{D}_{b}$. Let $F$ denote the victim model with parameters $\theta$. We consider a parametric hypothesis that the parameter space of $F$ can be decomposed into $\theta=\theta_{c}\cup\theta_{b}\cup\theta_{sh}$, where $\theta_{c}$, $\theta_{b}$ and $\theta_{sh}$ denote the clean, the backdoor and the shared neurons, respectively. We further assume that $\theta_{sh}$ can be omitted since the backdoor and the clean tasks are highly independent, i.e., backdoor attacks are designed not to affect the model performance on clean samples [10]. The standard backdoor learning process can be expressed as follows:

\underset{\theta=\theta_{c}\cup\theta_{b}}{\arg\min}\Big[\underbrace{\mathbb{E}_{(\boldsymbol{x},y)\in\mathcal{D}_{c}}\mathcal{L}\left(F(\boldsymbol{x};\theta_{c}),y\right)}_{\textrm{clean task}}+\underbrace{\mathbb{E}_{(\boldsymbol{x},y)\in\mathcal{D}_{b}}\mathcal{L}\left(F(\delta(\boldsymbol{x});\theta_{b}),S(y)\right)}_{\textrm{backdoor task}}\Big].

(1)
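As an illustration of the poisoning function $\delta$ and the label shifting function $S$ above, the following minimal sketch poisons a fraction of a toy dataset. The names `apply_trigger`, `shift_label` and `poison_dataset`, the corner-patch trigger, and the all-to-one relabeling are assumptions chosen for illustration; they are not taken from any specific attack implementation.

```python
import random


def apply_trigger(image, patch_value=1.0, patch_size=2):
    """Hypothetical poisoning function delta: stamp a square patch
    into the bottom-right corner of a 2-D image (list of rows)."""
    poisoned = [row[:] for row in image]  # copy so the clean sample is untouched
    height, width = len(poisoned), len(poisoned[0])
    for r in range(height - patch_size, height):
        for c in range(width - patch_size, width):
            poisoned[r][c] = patch_value
    return poisoned


def shift_label(label, target_class=0):
    """Hypothetical label shifting function S: all-to-one relabeling."""
    return target_class


def poison_dataset(dataset, poison_rate=0.1, seed=0):
    """Return a copy of `dataset` in which a random subset D_b is poisoned,
    together with the indices of the poisoned samples."""
    rng = random.Random(seed)
    n_poison = int(round(poison_rate * len(dataset)))
    backdoor_idx = set(rng.sample(range(len(dataset)), n_poison))
    poisoned = []
    for i, (x, y) in enumerate(dataset):
        if i in backdoor_idx:
            poisoned.append((apply_trigger(x), shift_label(y)))
        else:
            poisoned.append((x, y))
    return poisoned, backdoor_idx
```

Training on the returned dataset corresponds to minimizing the two terms of Eq. (1) jointly: the untouched samples form $\mathcal{D}_{c}$ and the stamped, relabeled samples play the role of $\mathcal{D}_{b}$.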

3.2 Neuron Magnitude and Saliency

Consider a convolutional network $F$ with $L$ layers, treating each fully connected layer as a convolutional layer with $1\times 1$ kernels. Let $f^{(i)}$ denote the $i$-th layer with the weight matrix $\theta^{(i)}\in\mathbb{R}^{c_{i}\times c_{i-1}\times h\times w}$, where $c_{i}$, $h$ and $w$ denote the channel number of $f^{(i)}$, the height and the width of the convolutional kernel, respectively. $\theta^{(i)}$ consists of $c_{i}$ filters $\{\theta^{(i,j)}\in\mathbb{R}^{c_{i-1}\times h\times w}\}_{j=1}^{c_{i}}$. Pruning the $j$-th filter of the $i$-th layer refers to setting $\theta^{(i,j)}$ to an all-zero matrix, thus removing the corresponding output feature map. We denote the $l_{p}$-norm of a filter as $\|\theta^{(i,j)}\|_{p}$, which is widely used to evaluate the magnitude of each filter in model compression methods [15, 11]. Recent research [39] has observed a strong positive correlation between backdoor behavior and the weight norms for each neuron, which implies that backdoor neurons may have larger magnitudes than clean ones. To further quantify the contribution of each neuron to the backdoor and the clean task, we introduce the concept of neuron saliency, which is commonly defined in model compression research [14] as the loss change induced by pruning that neuron. Given a test set $\mathcal{D}_{t}$, for each filter, we define the saliency metrics Clean Loss Change (CLC) and Backdoor Loss Change (BLC) as follows:

\textrm{CLC}(\theta,i,j)=\mathbb{E}_{(\boldsymbol{x},y)\in\mathcal{D}_{t}}\mathcal{L}(F(\boldsymbol{x};\theta\|{\theta^{(i,j)}=0}),y)-\mathbb{E}_{(\boldsymbol{x},y)\in\mathcal{D}_{t}}\mathcal{L}(F(\boldsymbol{x};\theta),y),

(2)

\textrm{BLC}(\theta,i,j)=\mathbb{E}_{(\boldsymbol{x},y)\in\mathcal{D}_{t}}\mathcal{L}(F(\delta(\boldsymbol{x});\theta\|{\theta^{(i,j)}=0}),S(y))-\mathbb{E}_{(\boldsymbol{x},y)\in\mathcal{D}_{t}}\mathcal{L}(F(\delta(\boldsymbol{x});\theta),S(y)).

(3)
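To make the two saliency metrics concrete, the sketch below evaluates the loss change caused by zeroing one hidden unit of a toy two-layer ReLU network. The network, the helper names (`forward`, `cross_entropy`, `mean_loss`, `loss_change`) and the idea of treating each hidden unit as a "filter" are assumptions for illustration only. Passing clean samples gives CLC; passing triggered samples with shifted labels gives BLC.

```python
import math


def forward(x, w1, w2):
    """Toy network: logits = w2 @ relu(w1 @ x). Each row of w1 plays the role of a filter."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for a single sample."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]


def mean_loss(samples, w1, w2):
    return sum(cross_entropy(forward(x, w1, w2), y) for x, y in samples) / len(samples)


def loss_change(samples, w1, w2, j):
    """Loss with filter j zeroed minus loss of the intact model (the form of Eqs. 2 and 3)."""
    pruned = [row[:] for row in w1]
    pruned[j] = [0.0] * len(pruned[j])
    return mean_loss(samples, pruned, w2) - mean_loss(samples, w1, w2)


# Toy setup: input = [clean feature, trigger feature]; class 0 is the backdoor target.
w1 = [[2.0, 0.0], [0.0, 3.0]]      # unit 0 reads the clean feature, unit 1 the trigger
w2 = [[0.0, 2.0], [1.0, 0.0]]      # unit 1 drives class 0, unit 0 drives class 1
clean_samples = [([1.0, 0.0], 1)]  # (x, y)
triggered_samples = [([1.0, 1.0], 0)]  # (delta(x), S(y))

clc = [loss_change(clean_samples, w1, w2, j) for j in range(2)]
blc = [loss_change(triggered_samples, w1, w2, j) for j in range(2)]
```

In this toy case unit 0 has positive CLC and unit 1 has positive BLC, which mirrors the clean and backdoor roles described in the text.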

Figure 1: Scatter plots depicting the BLC and CLC of filters in the shallow and deep convolutional layers of backdoored ResNet18 models, attacked by BadNets, Trojan, Blend, CLA, IAB, and WaNet. Quadrants are determined by the horizontal and vertical lines (x-axis and y-axis) at $\textrm{CLC}=0$ and $\textrm{BLC}=0$. The color of each point indicates the $l_{2}$-norm of the corresponding filter weight, with deeper colors representing larger $l_{2}$-norms.

3.3 Observations about Backdoor and Clean Neurons

We investigate the distribution of the CLC, BLC and magnitude of filters across different layers of backdoored models, as shown in Fig 1. Our key observations are as follows:

Backdoor and clean neurons may overlap.

Most recent works [19, 38] assume that backdoor and clean neurons hardly overlap at the filter level, i.e., each filter is associated with either the backdoor task or the clean task. However, filters in the first quadrant of the plots in Fig 1 have positive BLC and positive CLC, which implies that they contribute to both clean accuracy and backdoor behavior. Instead of simply labeling a filter as backdoored or clean, we suggest a fine-grained approach with thresholds $\tau\geq 0,\epsilon\geq 0$ to categorize filters into backdoor, clean, hybrid and redundant filters. We follow the assumption mentioned in Section 3.1 that the backdoor and clean parameters are separated at the neuron level. A filter is composed of a large number of neurons, as it commonly contains multiple channels and kernel weights. Backdoor filters primarily consist of backdoor neurons that are crucial for the backdoor task but contribute minimally to the clean task, while clean filters exhibit the opposite behavior. Hybrid filters are composed of both backdoor neurons and clean neurons; hence, pruning them increases both the backdoor loss and the clean loss. Redundant filters contain unimportant neurons and can be pruned without significantly affecting the overall performance of the model, as is widely observed in model compression studies [15, 11].
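The four-way categorization can be sketched as a simple thresholding rule. The text does not state how $\tau$ and $\epsilon$ are applied, so the rule below is an assumption: $\tau$ thresholds BLC, $\epsilon$ thresholds CLC, and a filter counts as contributing to a task when its loss change strictly exceeds the threshold. The function name `categorize_filter` is hypothetical.

```python
def categorize_filter(clc, blc, tau=0.0, eps=0.0):
    """Assign a filter to one of four groups from its saliency values.

    Assumed rule (not specified in the text): BLC > tau means the filter
    matters for the backdoor task, CLC > eps means it matters for the clean task.
    """
    contributes_backdoor = blc > tau
    contributes_clean = clc > eps
    if contributes_backdoor and contributes_clean:
        return "hybrid"      # first quadrant: pruning hurts both tasks
    if contributes_backdoor:
        return "backdoor"    # pruning hurts only the backdoor task
    if contributes_clean:
        return "clean"       # pruning hurts only the clean task
    return "redundant"       # pruning hurts neither task noticeably
```

With positive thresholds, filters whose loss changes are small but nonzero fall into the redundant group, which is one way to absorb estimation noise in CLC and BLC.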

The correlation between neuron magnitude and saliency.

As weight decay is widely applied to reduce the complexity of DNNs, neurons of less importance tend to have smaller magnitudes. Based on this assumption, magnitude-based model compression methods [15, 11] use the $l_{p}$-norm of filters as a statistical indicator of their contribution to the final prediction of the model. Since CLC measures the importance of filters for model performance, we assume a positive correlation between neuron magnitude and the CLC in the clean parameter space $\theta_{c}$. Denoting the correlation by $C:\mathbb{R}\rightarrow\mathbb{R}$, every clean filter $\theta^{(i,j)}$ satisfies $\textrm{CLC}(\theta,i,j)=C(\|\theta^{(i,j)}\|_{p})$. Since the backdoored model is trained on both the backdoor task and the clean task, each filter $\theta^{(i,j)}$ can be decomposed into $\theta^{(i,j)}=\theta^{(i,j)}_{c}\cup\theta^{(i,j)}_{b}$. Its contribution to the clean task satisfies:

\textrm{CLC}(\theta,i,j)=C(\|\theta^{(i,j)}_{c}\|_{p})\leq C(\|\theta^{(i,j)}\|_{p}),

(4)

which indicates that the $l_{p}$-norm of a backdoor or hybrid filter does not correspond to its actual contribution to the clean task, because it contains additional parameters used to trigger the backdoor. This observation is consistent with the results in Fig 1, where the majority of backdoor filters have larger $l_{p}$-norms than clean filters with the same CLC value.
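The inequality in Eq. (4) follows in one step under two assumptions that the text uses implicitly and that are stated here explicitly: the clean and backdoor weights of a filter are disjoint sets of entries, and $C$ is non-decreasing. A short derivation for finite $p\geq 1$:

```latex
% Assumptions: \theta^{(i,j)}_{c} and \theta^{(i,j)}_{b} are disjoint sets of weights,
% 1 \le p < \infty, and C is non-decreasing.
\|\theta^{(i,j)}\|_{p}^{p}
  = \|\theta^{(i,j)}_{c}\|_{p}^{p} + \|\theta^{(i,j)}_{b}\|_{p}^{p}
  \geq \|\theta^{(i,j)}_{c}\|_{p}^{p}
\;\Longrightarrow\;
\|\theta^{(i,j)}_{c}\|_{p} \leq \|\theta^{(i,j)}\|_{p}
\;\Longrightarrow\;
C\big(\|\theta^{(i,j)}_{c}\|_{p}\big) \leq C\big(\|\theta^{(i,j)}\|_{p}\big).
```

Equality holds only when the filter carries no backdoor weights, so a strict gap between a filter's norm and the norm predicted by its CLC is the signal that the method looks for.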

4 Methodology

4.1 Basic settings

Assumptions.

Our research is based on two fundamental assumptions about the neurons of the backdoored DNN: 1) A filter can be associated with both the backdoor task and the clean task, in which case it is composed of both clean and backdoor neurons. 2) The model is trained with weight decay or other parameter regularization techniques, thus the $l_{p}$-norm of a filter is positively correlated with its overall contribution to clean accuracy and backdoor behavior.

Defense setting.

We adopt a typical defense setting where the defender has downloaded a backdoored model from an untrustworthy third party without knowledge of the attack or training data. We assume a small amount of clean data $\mathcal{D}_{d}$ is available for defense, which can be collected from the Internet or carefully selected from the training data. The defender aims to first determine if the model is backdoored, then remove the backdoor behavior from the infected model with minimum degradation to its clean accuracy.

4.2 Magnitude-guided Optimization

A number of recent works [34, 18, 2, 39] on backdoor mitigation have employed a min-max optimization process to expose backdoor neurons. However, most of these works have not explicitly included the magnitude of neurons in their optimization objectives. More analyses can be found in Appendix A.1. Based on our analysis of how neuron magnitude works in backdoor defense approaches, we consider three optimization objectives to control the magnitude of neurons: weight penalty, clean suppression and clean preserving.

Weight Penalty.

As discussed in Section 3.3, the $l_{p}$-norms of clean filters approximately follow a positive correlation with their contribution to clean accuracy, while backdoor or hybrid filters contain backdoor neurons and have larger $l_{p}$-norms that deviate from the positive correlation. The deviation inspires us to develop a strategy to prune filters with large $l_{p}$-norms while preserving (or recovering) the clean accuracy. For each filter, we apply a mask $m\in[0,1]$, which acts on the magnitude of each filter without changing the specific weight. For model $F$ with $L$ layers, the collection of all masks is denoted by $\mathcal{M}=\{\boldsymbol{m}_{i}\}_{i=1}^{L}$, where $\boldsymbol{m}_{i}\in[0,1]^{c_{i}}$ is a vector of masks for the $i$-th layer with $c_{i}$ filters. The masked network $F_{\mathcal{M}}$ has the same architecture as $F$ with weight matrices of all the convolutional layers set to $\boldsymbol{m}_{i}\odot\theta^{(i)}$. To expose the backdoor filters with large $l_{p}$-norms and minimal impact on the clean loss, we formulate our problem as follows:

\min_{\boldsymbol{m}_{i}\in[0,1]^{c_{i}}}\Big[\mathbb{E}_{(\boldsymbol{x},y)\in\mathcal{D}_{d}}\mathcal{L}(F_{\mathcal{M}}(\boldsymbol{x};\theta),y)+\lambda\sum_{i=1}^{L}\|\boldsymbol{w}_{i}\odot\boldsymbol{m}_{i}\|_{1}\Big],

(5)

where $\odot$ denotes the Hadamard product, $\boldsymbol{w}_{i}\in\mathbb{R}^{c_{i}}$ is the vector of $l_{2}$-norms of filters of the $i$-th convolutional layer, and $\lambda>0$ is a hyperparameter balancing the loss and the weight penalty term.
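A minimal sketch of the penalty term in Eq. (5) and of one projected gradient step on the masks is given below. The function names are hypothetical, and the per-mask gradient of the classification loss (`loss_grads`) is assumed to be supplied by the caller, for example from an automatic differentiation framework; only the penalty gradient and the projection onto $[0,1]$ are computed here.

```python
def weight_penalty(masks, filter_norms, lam):
    """lam * sum_i ||w_i * m_i||_1 over per-layer lists of masks and filter l2-norms."""
    return lam * sum(
        abs(w * m)
        for layer_masks, layer_norms in zip(masks, filter_norms)
        for m, w in zip(layer_masks, layer_norms)
    )


def penalized_mask_step(masks, filter_norms, loss_grads, lam, lr, low=0.0, high=1.0):
    """One gradient step on the masks for Eq. (5), followed by clipping to [low, high].

    loss_grads[i][j] is the gradient of the clean loss w.r.t. mask j of layer i
    (assumed to come from the caller).
    """
    new_masks = []
    for layer_masks, layer_norms, layer_grads in zip(masks, filter_norms, loss_grads):
        updated = []
        for m, w, g in zip(layer_masks, layer_norms, layer_grads):
            # Masks and norms are non-negative, so d/dm |w * m| = w.
            m_new = m - lr * (g + lam * w)
            updated.append(min(high, max(low, m_new)))
        new_masks.append(updated)
    return new_masks
```

Because the penalty gradient is proportional to the filter norm, a large-norm filter whose mask has little effect on the clean loss is driven toward zero faster than a small-norm one, which is the behavior the weight penalty objective is designed to produce.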

Clean Suppression.

The clean suppression objective is designed to reduce the magnitude of most clean neurons and expose the backdoor behavior of the infected model:

\max_{\theta}\Big[\mathbb{E}_{(\boldsymbol{x},y)\in\mathcal{D}_{d}}\mathcal{L}(F(\boldsymbol{x};\theta),y)-\mu\|\theta\|_{2}^{2}\Big],

(6)

where $\mu>0$ is the weight decay hyperparameter. Since suppressing clean neurons both increases the classification loss and reduces the $l_{2}$ regularization term, the magnitudes of clean neurons decrease at a faster rate than those of backdoor and hybrid neurons during the optimization.
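One gradient ascent step on the objective in Eq. (6) can be sketched as follows. The function name `clean_suppression_step` is hypothetical, parameters are flattened into a plain list for simplicity, and the gradient of the classification loss is assumed to be provided by the caller.

```python
def clean_suppression_step(theta, loss_grad, lr, mu):
    """Ascent step on L(theta) - mu * ||theta||_2^2.

    The gradient of the objective is loss_grad - 2 * mu * theta, so each weight
    moves up the loss gradient while being shrunk toward zero.
    """
    return [t + lr * (g - 2.0 * mu * t) for t, g in zip(theta, loss_grad)]
```

When the loss gradient of a weight is zero, the step reduces to multiplying that weight by $(1-2\eta\mu)$ for step size $\eta$, i.e. pure shrinkage. Weights for which shrinking also raises the clean loss receive both contributions in the same direction, which matches the faster decay of clean neurons described above.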

Clean Preserving.

In contrast to clean suppression, the clean preserving objective is designed to preserve an important subset of clean neurons. This includes most clean filters and the part of the hybrid filters that is critical for clean accuracy. Similarly to the weight penalty process, we initialize masks $\boldsymbol{m}_{i}\in[1,2]^{c_{i}}$ for each layer and optimize them to increase the magnitude of filters as much as possible while retaining the clean loss:

\min_{\boldsymbol{m}_{i}\in[1,2]^{c_{i}}}\Big[\mathbb{E}_{(\boldsymbol{x},y)\in\mathcal{D}_{d}}\mathcal{L}(F_{\mathcal{M}}(\boldsymbol{x};\theta),y)-\lambda\sum_{i=1}^{L}\|\boldsymbol{w}_{i}\odot\boldsymbol{m}_{i}\|_{1}\Big].

(7)

4.3 Proposed Method

Backdoor Mitigation with MNP.

MNP aims to prune filters that deviate from the magnitude-saliency correlation, as these filters may contain backdoor neurons. The weight penalty objective is designed to achieve this purpose. Applying the weight penalty process directly to backdoored models is enough to defend against classical attacks including BadNets [10] and WaNet [25]. However, we have empirically found that weight penalty alone is unable to thoroughly remove the backdoors injected by some advanced attacks like DFST [6] or LIRA [7], as backdoor filters produced by these attacks may have smaller magnitudes, making it difficult to distinguish them from redundant filters. Therefore, we adopt the clean suppression objective to reduce the magnitude of clean neurons in the backdoored model, thereby exposing the low-magnitude backdoor filters. The combination of the clean suppression and weight penalty processes is enough to purify most backdoored models. To further reduce the compromise in model performance, we conduct the clean preserving process on the original backdoored model to preserve critical clean neurons, which may be mistakenly pruned and cause degradation of clean accuracy in other backdoor mitigation methods. During this process, the masks of the backdoor and redundant filters remain almost unchanged, while the masks of the subset of clean and hybrid filters that are critical for clean accuracy increase significantly. MNP adds the masks obtained from the clean preserving process to the masks obtained from the weight penalty process, and prunes filters with lower mask values, thus better balancing clean accuracy and backdoor mitigation performance. More discussion about the mechanism of MNP can be found in Appendix A.2.
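The final pruning decision, which adds the two sets of masks and removes the filters with the lowest combined values, can be sketched as below. The function name `prune_by_combined_mask` is hypothetical, and selecting a fixed number of filters (rather than passing the threshold $\epsilon$ directly) is an assumption that mirrors a threshold adjusted to prune a given number of filters.

```python
def prune_by_combined_mask(penalty_masks, preserve_masks, num_prune):
    """Prune the num_prune filters with the smallest combined mask value.

    penalty_masks: per-layer masks in [0, 1] from the weight penalty process.
    preserve_masks: per-layer masks in [1, 2] from the clean preserving process.
    Returns per-layer binary keep masks and the set of pruned (layer, filter) pairs.
    """
    scores = []
    for layer, (m1, m2) in enumerate(zip(penalty_masks, preserve_masks)):
        for j, (a, b) in enumerate(zip(m1, m2)):
            scores.append((a + b, layer, j))
    scores.sort()  # lowest combined mask value first
    pruned = {(layer, j) for _, layer, j in scores[:num_prune]}
    keep = [
        [0.0 if (layer, j) in pruned else 1.0 for j in range(len(m1))]
        for layer, m1 in enumerate(penalty_masks)
    ]
    return keep, pruned
```

Multiplying each filter's weights by its entry in `keep` zeroes the pruned filters and leaves the remaining weights unchanged.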

Backdoor Detection with MNP.

MNP detects backdoored models by measuring the magnitude-saliency correlation of neurons across the model. For each filter, the squared $l_{2}$-norm of its weights $\|\theta^{(i,j)}\|_{2}^{2}$ and $\textrm{CLC}(\theta,i,j)$ are chosen as the magnitude and the saliency metrics, respectively. Spearman's rank correlation coefficient $\rho(F,\theta)$ between $\|\theta^{(i,j)}\|_{2}^{2}$ and $\textrm{CLC}(\theta,i,j)$ measures the strength of the magnitude-saliency correlation of model $F$. We compare the correlations of the original model $F(\cdot;\theta)$ and the suppressed model $F(\cdot;\theta_{s})$ obtained from the clean suppression process. In clean models, the suppression process reduces the magnitude of clean neurons at approximately equal rates, so the magnitude-saliency correlation is maintained, while the correlation of backdoored models is significantly weakened. Given a threshold $\delta\in[0,1]$ (not to be confused with the poisoning function $\delta$ of Section 3.1), the set of backdoored models can be defined as follows:

\mathcal{B}_{\delta}=\left\{F(\cdot;\theta):\frac{\rho(F,\theta_{s})}{\rho(F,\theta)}\leq\delta\right\}

(8)

Note that we have omitted the approximation of CLC and the normalization of the magnitude and saliency metrics. The detailed detection method can be found in Appendix A.3.
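The detection rule of Eq. (8) can be sketched with a small, dependency-free implementation of Spearman's rank correlation. The helper names are hypothetical, and the sketch takes per-filter magnitudes and CLC values as plain lists; it leaves out the CLC approximation and the normalization mentioned above, and it assumes the original correlation is positive and neither input is constant.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5  # undefined if either input is constant


def spearman(magnitudes, saliencies):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(magnitudes), average_ranks(saliencies))


def is_backdoored(rho_original, rho_suppressed, delta=0.2):
    """Eq. (8): flag the model when the correlation ratio drops to delta or below.
    Assumes rho_original > 0."""
    return rho_suppressed / rho_original <= delta
```

Here `rho_original` would be `spearman` evaluated on the original parameters $\theta$ and `rho_suppressed` the same quantity on $\theta_{s}$ after clean suppression.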

Algorithm 1 Magnitude-based Pruning for Backdoor Defense (MNP)

Input: Defense data $\mathcal{D}_{d}$, backdoored model $F$ with parameters $\theta$, learning rates $\eta_{1}>0,\eta_{2}>0$, hyperparameters $\lambda$, $\mu$, max iteration number $T$, detection threshold $\delta$, pruning threshold $\epsilon\in[1,3]$
1: Initialize $\mathcal{M}_{1}=\{[1]^{c_{i}}\}^{L}_{i=1}$, $\mathcal{M}_{2}=\{[1]^{c_{i}}\}^{L}_{i=1}$, $\theta_{s}=\theta$
2: for $t=0,...,T-1$ do
3: Sample a mini-batch $\mathcal{B}=\{(\boldsymbol{x_{i}},y_{i})\}_{i=1}^{b}\subset\mathcal{D}_{d}$
4: $\theta_{s}\leftarrow\theta_{s}+\eta_{1}\nabla_{\theta_{s}}\big[\mathcal{L}(F(\boldsymbol{x_{i}};\theta_{s}),y_{i})-\mu\|\theta_{s}\|_{2}^{2}\big]$
5: end for
6: Compute the Spearman's rank correlation coefficients $\rho(F,\theta),\rho(F,\theta_{s})$
7: repeat
8: Sample a mini-batch $\{(\boldsymbol{x_{i}},y_{i})\}_{i=1}^{b}\subset\mathcal{D}_{d}$
9: $\mathcal{M}_{1}\leftarrow\mathcal{M}_{1}-\eta_{2}\nabla_{\mathcal{M}_{1}}\big[\mathcal{L}(F_{\mathcal{M}_{1}}(\boldsymbol{x_{i}};\theta_{s}),y_{i})+\lambda\sum_{i=1}^{L}\|\boldsymbol{w}_{i}\odot\boldsymbol{m}_{i}\|_{1}\big]$
10: $\mathcal{M}_{2}\leftarrow\mathcal{M}_{2}-\eta_{2}\nabla_{\mathcal{M}_{2}}\big[\mathcal{L}(F_{\mathcal{M}_{2}}(\boldsymbol{x_{i}};\theta),y_{i})-\lambda\sum_{i=1}^{L}\|\boldsymbol{w}_{i}\odot\boldsymbol{m}_{i}\|_{1}\big]$
11: $\mathcal{M}_{1}\leftarrow\max(0,\min(\mathcal{M}_{1},1))$
12: $\mathcal{M}_{2}\leftarrow\max(1,\min(\mathcal{M}_{2},2))$
13: until training converged
14: $\hat{\theta}=\theta\odot\mathbb{I}\left((\mathcal{M}_{1}+\mathcal{M}_{2})>\epsilon\right)$
Output: Purified model $F$ with parameters $\hat{\theta}$, detection result $\mathbb{I}\left(\frac{\rho(F,\theta_{s})}{\rho(F,\theta)}>\delta\right)$

5 Experiments

5.1 Experimental Setup

Attack Setup.

We evaluate MNP against 10 challenging attacks. These include 3 static attacks: BadNets [10], Trojan [20] and Blend [5]; 2 clean-label attacks: CL [30] and SIG [1]; 2 dynamic attacks: IAB [24] and WaNet [25]; 2 feature-space attacks: FC [27] and DFST [6]; and 1 adaptive attack: LIRA [7]. Default settings from the original papers and open-source code are applied for most attacks, including the backdoor trigger pattern and size. The backdoor label of all attacks is set to class 0, with a default poisoning rate of 10%. Attacks are performed on CIFAR-10 with ResNet18 and on a 12-class subset of ImageNet with ResNet34. For training, Stochastic Gradient Descent (SGD) is used with an initial learning rate of 0.1, weight decay 5e-4 and momentum 0.9, with batch size 128 for 200 epochs on CIFAR-10 and batch size 64 for 300 epochs on the ImageNet subset. A cosine scheduler is employed to adjust the learning rate.

Defense Setup.

We compare MNP with a total of 10 backdoor defense methods. These include 4 pruning-based backdoor mitigation methods: FP [19], ANP [34], CLP [37] and RNP [18], 3 non-pruning mitigation methods: NC [31], NAD [16] and I-BAU [35], as well as 4 detection methods: NC, AC [3], SC [29] and STRIP [8]. All the defenses share limited access to 1% benign training data except for CLP. Hyperparameters for these defenses are adjusted based on open-source codes to obtain best performance against different attacks. For MNP, we set hyperparameter $\mu$ as 0.001, $\lambda$ as 0.0005, suppression epochs $T$ as 20, learning rates $\eta_{1}$ as 0.01 and $\eta_{2}$ as 0.1. The detection threshold $\delta$ is set as $0.2$ , and the pruning threshold $\epsilon$ is dynamically adjusted to prune 30 neurons for each backdoor model.

Evaluation Metric.

We adopt two metrics for evaluating backdoor mitigation performance: 1) Clean Accuracy (CA), which is the model's accuracy on clean test data; 2) Attack Success Rate (ASR), which is the model's accuracy on backdoored test data with respect to the backdoor target label. For detection performance, we adopt 3) Detection Rate (DR), which is the accuracy of the defense in identifying backdoored models.
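The three metrics can be sketched as simple ratios over lists of predictions. The function names are hypothetical, and excluding test samples whose true label already equals the target class from the ASR computation is an assumption that reflects common practice; the text does not specify it.

```python
def clean_accuracy(predictions, labels):
    """CA (%): share of clean test samples classified correctly."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def attack_success_rate(triggered_predictions, true_labels, target_class=0):
    """ASR (%): share of triggered samples classified as the target class.

    Samples whose true label is already the target class are skipped
    (an assumption, since they would count as successes trivially).
    """
    kept = [p for p, y in zip(triggered_predictions, true_labels) if y != target_class]
    return 100.0 * sum(p == target_class for p in kept) / len(kept)


def detection_rate(flagged_as_backdoored, is_actually_backdoored):
    """DR (%): share of models whose backdoored/clean status is identified correctly."""
    correct = sum(f == t for f, t in zip(flagged_as_backdoored, is_actually_backdoored))
    return 100.0 * correct / len(is_actually_backdoored)
```

A successful mitigation keeps `clean_accuracy` close to that of the backdoored model while driving `attack_success_rate` toward zero.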

Table 1: Comparison with the state-of-the-art defenses on CIFAR-10 dataset with 1% benign data on ResNet18 (%).

Attack	Backdoored	FP	ANP	CLP	RNP	NC	NAD	I-BAU	MNP(Ours)
Attack	CA/ ASR	CA/ ASR	CA/ ASR	CA/ ASR	CA/ ASR	CA/ ASR	CA/ ASR	CA/ ASR	CA/ ASR
BadNets	92.43 / 100.00	81.45 / 25.31	90.27 / 1.12	91.54 / 1.34	91.09 / 0.54	89.32 / 5.54	89.57 / 1.10	90.19 / 12.73	92.11 / 0.47
Trojan	92.68 / 100.00	82.63 / 62.57	90.78 / 1.31	91.16 / 2.87	91.95 / 2.03	90.91 / 52.72	86.73 / 5.74	90.35 / 10.55	92.24 / 0.92
Blend	92.15 / 99.99	83.26 / 76.44	90.82 / 0.90	90.61 / 1.67	91.53 / 1.33	91.84 / 84.31	89.68 / 13.24	89.94 / 2.24	91.79 / 0.87
CL	91.55 / 98.93	81.38 / 36.42	89.96 / 5.47	89.51 / 1.54	90.05 / 0.75	90.13 / 5.66	86.74 / 15.18	87.75 / 20.12	91.18 / 0.62
SIG	93.52 / 99.14	88.17 / 23.56	91.57 / 5.39	90.14 / 10.35	90.63 / 0.89	90.50 / 90.15	91.37 / 3.46	86.71 / 25.62	93.50 / 1.32
IAB	94.60 / 99.66	85.42 / 38.95	93.67 / 1.52	93.20 / 4.71	93.15 / 2.34	94.19 / 97.98	89.94 / 12.16	87.68 / 18.34	94.45 / 0.48
WaNet	92.12 / 98.81	80.53 / 69.74	91.06 / 8.89	90.58 / 6.43	91.60 / 4.02	91.03 / 96.50	83.32 / 13.18	88.91 / 25.48	91.75 / 3.64
FC	93.61 / 100.00	88.92 / 98.04	85.95 / 77.42	80.21 / 65.87	89.39 / 1.55	92.41 / 99.78	90.14 / 30.37	85.79 / 18.22	91.23 / 1.87
DFST	95.50 / 100.00	85.53 / 80.76	91.24 / 19.80	87.45 / 58.82	92.15 / 25.67	93.51 / 99.02	87.43 / 15.70	85.22 / 26.84	94.79 / 20.40
LIRA	91.20 / 97.78	86.64 / 90.52	84.17 / 21.25	80.38 / 60.56	89.76 / 18.62	90.08 / 97.30	86.52 / 30.15	84.33 / 57.09	89.29 / 9.72
Average	94.42 / 98.83	84.39 / 60.23	89.95 / 14.31	88.68 / 21.41	91.39 / 7.60	91.46 / 72.90	88.14 / 14.03	88.69 / 21.72	92.23 / 4.03

Table 2: Comparison with the state-of-the-art defenses on ImageNet subset dataset with 1% benign data on ResNet34 (%).

Attack	Backdoored	FP	ANP	CLP	RNP	NC	NAD	I-BAU	MNP(Ours)
Attack	CA/ ASR	CA/ ASR	CA/ ASR	CA/ ASR	CA/ ASR	CA/ ASR	CA/ ASR	CA/ ASR	CA/ ASR
BadNets	91.89 / 100.00	81.43 / 93.98	89.18 / 6.34	85.52 / 9.81	90.16 / 1.54	83.80 / 10.27	81.45 / 9.83	80.72 / 20.95	92.21 / 0.87
Trojan	91.50 / 99.87	81.17 / 90.93	90.31 / 1.78	85.98 / 7.51	89.06 / 1.51	85.53 / 80.98	83.05 / 5.74	84.36 / 10.55	90.03 / 1.08
Blend	89.24 / 100.00	79.05 / 84.41	82.18 / 9.20	82.69 / 3.31	87.20 / 4.37	83.86 / 97.93	79.03 / 12.71	80.21 / 19.35	89.04 / 2.56
CL	88.91 / 90.32	81.44 / 84.73	80.75 / 12.81	84.30 / 8.07	87.13 / 5.05	80.66 / 30.21	83.17 / 24.76	84.06 / 11.57	86.92 / 3.41
WaNet	88.69 / 95.48	82.17 / 90.32	81.62 / 14.73	79.14 / 21.56	84.91 / 13.68	83.25 / 94.56	81.40 / 33.15	80.82 / 44.68	86.85 / 8.42
FC	87.58 / 94.36	79.45 / 87.42	73.81 / 53.69	75.33 / 77.92	82.49 / 10.32	86.60 / 92.68	81.65 / 56.40	79.36 / 38.35	87.04 / 10.63
Average	89.64 / 96.67	80.79 / 88.63	82.98 / 16.43	82.16 / 21.36	86.83 / 6.08	83.95 / 67.77	81.63 / 23.77	87.53 / 24.24	88.68 / 4.10

5.2 Main Defense Results

Results on CIFAR-10.

Table 1 presents the defense performance of 8 backdoor mitigation methods against 10 backdoor attacks on CIFAR-10. MNP outperforms other defense methods by cutting the average ASR down to 4.03% with a slight drop on CA (2.19% on average). In comparison, the state-of-the-art methods ANP, RNP, NAD and I-BAU reduce the average ASR to 14.31%, 7.60%, 14.03%, and 21.72%, respectively. An important observation is that each defense method has its limits. For instance, NAD achieves the lowest ASR (15.70%) on DFST but is outperformed by MNP on most other attacks. MNP shows weakness against SIG and FC, yet it has the best overall performance and highest clean accuracy in most settings. Nearly all defenses struggle against DFST and LIRA, suggesting the need for more sophisticated mechanisms to counter advanced attacks.

Results on ImageNet Subset.

Table 2 presents the defense performance of 8 backdoor mitigation methods against 6 backdoor attacks on the ImageNet subset. Note that some attacks are not conducted on the ImageNet subset because they are difficult to reproduce on high-resolution data. Pruning larger models trained on high-resolution data poses a greater challenge due to the difficulty of locating backdoor neurons. Nevertheless, MNP can effectively defend against the evaluated attacks and still demonstrates an advantage in maintaining clean accuracy. For example, MNP reduces the ASR of BadNets to 0.87% while even slightly improving the clean accuracy. MNP also outperforms the other defenses in most settings, except for a slightly lower clean accuracy on Trojan compared to ANP (90.03% vs. 90.31%) and a slightly higher ASR on FC compared to RNP (10.63% vs. 10.32%). Overall, MNP better balances clean accuracy and ASR, as it cuts the average ASR down to 4.10% with a decline of less than 1% in CA.

Detection Performance.

The clean suppression process of MNP can be used to detect backdoor models by the strength of the magnitude-saliency correlation; the experimental results are reported in Table 4 in Appendix B. MNP detects backdoor models generated by six attacks with an average DR of 98.00%, outperforming NC (51.33%), AC (89.75%), SC (83.08%), and STRIP (80.67%).

5.3 Ablation Studies

Impact of the Defense Data Size.

In this part, we evaluate the impact of defense data size on the performance of MNP with a backdoored ResNet18 attacked by BadNets. We use 0.1% (50), 0.5% (250), 1% (500) and 5% (2500) images from the CIFAR-10 training set for defense, respectively. Results in Fig. 4 show that MNP achieves better backdoor mitigation performance as the defense data size increases. In some cases, MNP needs only 0.1% of clean samples to reduce the ASR of several attacks to < 5%, highlighting its potential to identify backdoor neurons in few-shot settings. The impact of defense data size is more pronounced on CA. As the number of clean samples grows, the distribution of the defense set gets closer to that of the original dataset, allowing MNP to preserve clean neurons more accurately. Overall, MNP can effectively eliminate backdoor neurons under few-shot settings, and 1% of clean samples is sufficient for MNP to limit the degradation of CA to < 3%.

Table 3: Defense performance of MNP against BadNets attack on CIFAR-10 with different poisoning rates and A2O/A2A settings

Poisoning Ratio $\rightarrow$		0.5%(250)		1%(500)		5%(2500)		10%(5000)		20%(10000)
ATTACK $\downarrow$		No Defense	MNP	No Defense	MNP	No Defense	MNP	No Defense	MNP	No Defense	MNP
BadNets-A2O	CA	92.35	85.76	92.47	88.29	92.60	92.15	92.43	92.11	89.97	88.31
BadNets-A2O	ASR	99.05	10.40	98.93	3.26	100.00	0.39	100.00	0.47	100.00	1.28
BadNets-A2A	CA	93.05	92.02	92.97	92.16	92.44	91.89	92.65	92.07	86.31	84.12
BadNets-A2A	ASR	63.51	44.38	79.52	8.75	90.45	0.78	91.38	0.97	94.56	0.44

Performance against Different Poisoning Rates and All-to-all Attacks.

We conducted experiments on MNP with different poisoning rates on CIFAR-10. We also tested the effectiveness of MNP against the all-to-all (A2A) attack, where the target label of a backdoored sample is set to its original label plus one ( $S(y)=y+1$ ). Experimental results are shown in Table 3. For attacks with poisoning rates $\geq$ 5%, MNP reduces the ASR to approximately 1% without causing a significant CA decrease. For attacks with poisoning rates $\leq$ 1%, MNP still provides some protection: at a 1% poisoning rate the ASR drops to 3.26% (A2O) and 8.75% (A2A), although at the lowest poisoning rate it remains at 10.40% and 44.38%, respectively. MNP also defends effectively against all-to-all attacks with poisoning rates $\geq$ 5%, reducing the ASR to below 1% with a CA degradation of roughly 2% or less.
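The all-to-all label mapping and the ASR metric can be illustrated with a minimal Python sketch. The modulo wrap-around for the last class, the function names, and the exact ASR definition (share of triggered samples classified as their attack target) are assumptions made for illustration and are not taken from the experimental code.

```python
from typing import Callable, Sequence

NUM_CLASSES = 10  # CIFAR-10 has 10 classes


def all_to_all_target(y: int, num_classes: int = NUM_CLASSES) -> int:
    """Target label S(y) = y + 1; the modulo wrap-around is an assumption
    so that the last class maps back to class 0."""
    return (y + 1) % num_classes


def attack_success_rate(
    preds_on_triggered: Sequence[int],
    true_labels: Sequence[int],
    target_fn: Callable[[int], int] = all_to_all_target,
) -> float:
    """Percentage of triggered samples predicted as their attack target
    (assumed ASR definition)."""
    hits = sum(p == target_fn(y) for p, y in zip(preds_on_triggered, true_labels))
    return 100.0 * hits / len(true_labels)


# Hypothetical predictions of a backdoored model on five triggered samples.
labels = [0, 1, 2, 9, 3]
preds = [1, 2, 3, 0, 5]  # the last sample is not pushed to its target (4)
asr = attack_success_rate(preds, labels)
```

In this toy case four of the five triggered samples reach their target label, so `asr` is 80.0.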

Impact of hyperparameters.

The most critical hyperparameters in MNP are $\lambda$ and $\mu$ , as higher values of $\lambda$ or $\mu$ cause the optimization process to be dominated by the magnitude of neurons rather than the classification loss of the model. We kept other hyperparameters the same as in Section 5.1 and adjusted $\mu$ to values of $0$ , $10^{-4}$ , $10^{-3}$ , $10^{-2}$ , and $10^{-1}$ , and $\lambda$ to values of $0$ , $5\times 10^{-5}$ , $5\times 10^{-4}$ , $5\times 10^{-3}$ , and $5\times 10^{-2}$ . The results in Fig. 4 indicate that MNP achieves the best defense performance when $\mu$ and $\lambda$ are set to suitably small values. A larger $\mu$ simultaneously reduces the magnitude of both backdoor and clean neurons and makes the clean suppression process ineffective. Similarly, a larger $\lambda$ makes MNP ignore the classification loss, which prevents it from effectively preserving clean neurons. Overall, setting $\lambda$ and $\mu$ to smaller values helps maintain stable performance, while appropriately larger values of $\lambda$ and $\mu$ help achieve the best performance.

6 Conclusion

This paper proposes Magnitude-based Neuron Pruning (MNP), a novel method to detect and mitigate backdoor attacks. The core idea of MNP is to manipulate the magnitude of backdoor and clean neurons through three processes named clean suppression, weight penalty and clean preserving. Clean suppression reduces the magnitude of clean neurons to expose and identify backdoor behavior. Weight penalty eliminates neurons with large magnitude but little contribution to the clean task, i.e., potential backdoor neurons. Clean preserving aims to increase the magnitude of critical clean neurons to prevent them from being pruned. The empirical success of MNP across a range of backdoor attack scenarios highlights the potential of neuron magnitude for backdoor defense.

References

Barni et al. [2019] Mauro Barni, Kassem Kallas, and Benedetta Tondi. A new backdoor attack in cnns by training set corruption without label poisoning. In 2019 IEEE International Conference on Image Processing (ICIP), pages 101–105. IEEE, 2019.
Chai and Chen [2022] Shuwen Chai and Jinghui Chen. One-shot Neural Backdoor Erasing via Adversarial Weight Masking. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 22285–22299. Curran Associates, Inc., 2022.
Chen et al. [2018] Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Taesung Lee, Ian Molloy, and Biplav Srivastava. Detecting Backdoor Attacks on Deep Neural Networks by Activation Clustering, November 2018.
Chen et al. [2019] Huili Chen, Cheng Fu, Jishen Zhao, and Farinaz Koushanfar. Deepinspect: A black-box Trojan detection and mitigation framework for deep neural networks. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI’19, pages 4658–4664. AAAI Press, 2019. ISBN 978-0-9992411-4-1.
Chen et al. [2017] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning. CoRR, abs/1712.05526, 2017.
Cheng et al. [2021] Siyuan Cheng, Yingqi Liu, Shiqing Ma, and Xiangyu Zhang. Deep feature space trojan attack of neural networks by controlled detoxification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1148–1156, 2021.
Doan et al. [2021] Khoa D Doan, Yingjie Lao, Weijie Zhao, and Ping Li. LIRA: Learnable, imperceptible and robust backdoor attacks. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 11946–11956, 2021.
Gao et al. [2020] Yansong Gao, Chang Xu, Derui Wang, Shiping Chen, Damith C. Ranasinghe, and Surya Nepal. STRIP: A Defence Against Trojan Attacks on Deep Neural Networks, January 2020.
Garg et al. [2020] Siddhant Garg, Adarsh Kumar, Vibhor Goel, and Yingyu Liang. Can Adversarial Weight Perturbations Inject Neural Backdoors. In Mathieu d’Aquin, Stefan Dietze, Claudia Hauff, Edward Curry, and Philippe Cudré-Mauroux, editors, CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, pages 2029–2032. ACM, 2020. doi: 10.1145/3340531.3412130.
Gu et al. [2017] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733, 2017.
He et al. [2018] Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft Filter Pruning for Accelerating Deep Convolutional Neural Networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. International Joint Conferences on Artificial Intelligence Organization, 2018.
Hu et al. [2021] Xiaoling Hu, Xiao Lin, Michael Cogswell, Yi Yao, Susmit Jha, and Chao Chen. Trigger Hunting with a Topological Prior for Trojan Detection. In International Conference on Learning Representations, 2021.
Kolouri et al. [2020] Soheil Kolouri, Aniruddha Saha, Hamed Pirsiavash, and Heiko Hoffmann. Universal litmus patterns: Revealing backdoor attacks in CNNs. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 298–307. Computer Vision Foundation / IEEE, 2020. doi: 10.1109/CVPR42600.2020.00038.
LeCun et al. [1989] Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. In D. Touretzky, editor, Advances in Neural Information Processing Systems, volume 2. Morgan-Kaufmann, 1989.
Li et al. [2016] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning Filters for Efficient ConvNets. CoRR, abs/1608.08710, 2016.
Li et al. [2020] Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks. In International Conference on Learning Representations, 2020.
Li et al. [2021] Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. Anti-Backdoor Learning: Training Clean Models on Poisoned Data. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, Virtual, pages 14900–14912, 2021.
Li et al. [2023] Yige Li, Xixiang Lyu, Xingjun Ma, Nodens Koren, Lingjuan Lyu, Bo Li, and Yu-Gang Jiang. Reconstructive Neuron Pruning for Backdoor Defense. In Proceedings of the 40th International Conference on Machine Learning, ICML’23, Honolulu, Hawaii, USA, 2023. JMLR.org.
Liu et al. [2018a] Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Fine-pruning: Defending against backdooring attacks on deep neural networks. In International Symposium on Research in Attacks, Intrusions, and Defenses, pages 273–294. Springer, 2018a.
Liu et al. [2018b] Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning Attack on Neural Networks. In 25th Annual Network and Distributed System Security Symposium, NDSS 2018, San Diego, California, USA, February 18-21, 2018. The Internet Society, 2018b.
Liu et al. [2019] Yingqi Liu, Wen-Chuan Lee, Guanhong Tao, Shiqing Ma, Yousra Aafer, and Xiangyu Zhang. ABS: Scanning Neural Networks for Back-doors by Artificial Brain Stimulation. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pages 1265–1282, London United Kingdom, November 2019. ACM. ISBN 978-1-4503-6747-9. doi: 10.1145/3319535.3363216.
Molchanov et al. [2017] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
Molchanov et al. [2019] Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance estimation for neural network pruning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 11264–11272. Computer Vision Foundation / IEEE, 2019. doi: 10.1109/CVPR.2019.01152.
Nguyen and Tran [2020] Tuan Anh Nguyen and Anh Tuan Tran. Input-Aware Dynamic Backdoor Attack. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, Virtual, 2020.
Nguyen and Tran [2021] Tuan Anh Nguyen and Anh Tuan Tran. WaNet - Imperceptible Warping-based Backdoor Attack. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
Qi et al. [2022] Xiangyu Qi, Tinghao Xie, Ruizhe Pan, Jifeng Zhu, Yong Yang, and Kai Bu. Towards Practical Deployment-Stage Backdoor Attack on Deep Neural Networks. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13337–13347, New Orleans, LA, USA, June 2022. IEEE. ISBN 978-1-66546-946-3. doi: 10.1109/CVPR52688.2022.01299.
Shafahi et al. [2018] Ali Shafahi, W. Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Dumitras, and Tom Goldstein. Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 6106–6116, 2018.
Tao et al. [2022] Guanhong Tao, Guangyu Shen, Yingqi Liu, Shengwei An, Qiuling Xu, Shiqing Ma, Pan Li, and Xiangyu Zhang. Better Trigger Inversion Optimization in Backdoor Scanning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13368–13378, June 2022.
Tran et al. [2018] Brandon Tran, Jerry Li, and Aleksander Madry. Spectral signatures in backdoor attacks. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 8011–8021, 2018.
Turner et al. [2019] Alexander Turner, Dimitris Tsipras, and Aleksander Madry. Clean-label backdoor attacks. 2019.
Wang et al. [2019] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. In 2019 IEEE Symposium on Security and Privacy (SP), pages 707–723. IEEE Computer Society, 2019.
Wang et al. [2022a] Zhenting Wang, Hailun Ding, Juan Zhai, and Shiqing Ma. Training with More Confidence: Mitigating Injected and Natural Backdoors During Training. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022a.
Wang et al. [2022b] Zhenting Wang, Kai Mei, Hailun Ding, Juan Zhai, and Shiqing Ma. Rethinking the Reverse-engineering of Trojan Triggers. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 9738–9753. Curran Associates, Inc., 2022b.
Wu and Wang [2021] Dongxian Wu and Yisen Wang. Adversarial Neuron Pruning Purifies Backdoored Deep Models. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 16913–16925. Curran Associates, Inc., 2021.
Zeng et al. [2022] Yi Zeng, Si Chen, Won Park, Zhuoqing Mao, Ming Jin, and Ruoxi Jia. Adversarial Unlearning of Backdoors via Implicit Hypergradient. In International Conference on Learning Representations, 2022.
Zhao et al. [2022] Zhendong Zhao, Xiaojun Chen, Yuexin Xuan, Ye Dong, Dakui Wang, and Kaitai Liang. DEFEAT: Deep Hidden Feature Backdoor Attacks by Imperceptible Perturbation and Latent Representation Constraints. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 15192–15201. IEEE, 2022. doi: 10.1109/CVPR52688.2022.01478.
Zheng et al. [2022a] Runkai Zheng, Rongjun Tang, Jianze Li, and Li Liu. Data-free backdoor removal based on channel lipschitzness. In European Conference on Computer Vision, pages 175–191. Springer, 2022a.
Zheng et al. [2022b] Runkai Zheng, Rongjun Tang, Jianze Li, and Li Liu. Pre-activation Distributions Expose Backdoor Neurons. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022b.
Zhu et al. [2023] Mingli Zhu, Shaokui Wei, Li Shen, Yanbo Fan, and Baoyuan Wu. Enhancing Fine-Tuning based Backdoor Defense with Sharpness-Aware Minimization. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4443–4454, Paris, France, October 2023. IEEE. ISBN 9798350307184. doi: 10.1109/ICCV51070.2023.00412.

Appendix A Theoretical Analysis and Mechanism of MNP

A.1 Min-Max Optimization for Backdoor Defense

A number of recent works [34, 18, 2, 39] on backdoor mitigation have employed a min-max optimization process to expose backdoor neurons. This paradigm can be reinterpreted from the perspective of neuron magnitude.

ANP [34], one of the state-of-the-art backdoor mitigation methods, adversarially perturbs the magnitude of each filter to maximize the clean loss of the model. The adversarial perturbation may increase the magnitude of the backdoor and hybrid filters, thus amplifying the backdoor activation and triggering the backdoor behavior. ANP then recovers the model performance by optimizing a mask for each filter. During this process, the masks of the filters contributing most to the backdoor behavior are reduced to zero, so these filters can be identified and pruned.

Another state-of-the-art method, RNP [18], unlearns the model to maximize the clean loss. The difference is that the unlearning process is conducted at the neuron level. During unlearning, the magnitude of clean neurons is reduced, and the magnitude of backdoor neurons may also increase. RNP then applies a similar filter-level recovery process to restore the clean accuracy and expose the backdoor filters. RNP is empirically more effective than ANP, as the unlearning process is likely to reduce the magnitude of most clean neurons, so that the magnitude of all backdoor neurons increases in relative terms. Since the clean neurons are neutralized, some hybrid filters may also transform into backdoor filters. In this way, the majority of the backdoor-associated filters are exposed. In contrast, the adversarial perturbation in ANP only amplifies the magnitude of a subset of backdoor filters, so some low-contribution backdoor filters and hybrid filters are generally ignored.

A.2 More Understanding of MNP

Why clean suppression is needed before weight penalty.

We assume that backdoor or hybrid filters contain backdoor neurons and have larger $l_{p}$ -norms that deviate from the magnitude-saliency correlation. We find this assumption holds strongly for highly regularized models, and directly conducting the weight penalty process on these models is enough to erase the injected backdoor. However, for less regularized models or models that have not fully converged, the magnitude-saliency correlation is less pronounced than in highly regularized models. In this case, the weight penalty process alone is not enough to thoroughly remove the injected backdoor. By applying the clean suppression process before weight penalty, we can reduce the magnitude of most clean neurons and expose more backdoor-associated filters, leading to better backdoor mitigation performance.

Why clean suppression rather than adversarial perturbation.

As discussed in Section A.1, adversarial perturbation, whether conducted at the filter level or the neuron level, can only amplify the magnitude of a subset of high-contribution backdoor filters, since the classification loss can be effectively increased by perturbing a small number of backdoor neurons. In contrast, the clean suppression process reduces the magnitude of most clean neurons, so the magnitude of all backdoor neurons increases in relative terms and more backdoor-associated filters can be exposed.
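The relative-magnitude argument can be made concrete with a small toy calculation. The sketch below is only an illustration: it models clean suppression as a uniform shrink factor on clean-filter norms (a simplification of the actual unlearning process), and all norms and the factor are hypothetical values.

```python
import math
from typing import List


def backdoor_share(clean_norms: List[float], backdoor_norm: float, suppression: float = 1.0) -> float:
    """Layer-wise l2-normalized magnitude of one backdoor filter after the
    clean-filter norms are scaled by `suppression` (1.0 means no suppression)."""
    scaled = [suppression * n for n in clean_norms]
    layer_norm = math.sqrt(sum(n * n for n in scaled) + backdoor_norm ** 2)
    return backdoor_norm / layer_norm


clean = [2.0, 1.5, 1.0]  # hypothetical clean-filter norms in one layer
backdoor = 1.8           # hypothetical backdoor-filter norm

before = backdoor_share(clean, backdoor)                  # no suppression
after = backdoor_share(clean, backdoor, suppression=0.5)  # clean norms halved
```

With these toy numbers the normalized magnitude of the backdoor filter rises from about 0.56 to about 0.80, even though its own norm is unchanged.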

How can clean preserving help in reducing compromise of clean accuracy.

Backdoor neurons exist across different layers of the infected model. Some hybrid filters (for example, filters in the first convolution layer) may contain clean neurons critical for the model performance, which may be neutralized by the clean suppression process. Although the weight penalty process can expose most backdoor and hybrid filters, it cannot completely recover the clean accuracy. In other words, some hybrid filters that are critical for the clean task may be pruned. To reduce the compromise of clean accuracy, we apply the clean preserving process to preserve critical clean neurons. The clean preserving process is performed at the filter level. During the process, the masks of the backdoor and redundant filters remain almost unchanged, while the masks of a subset of clean and hybrid filters that are critical for the clean accuracy increase significantly. We add the masks obtained from the clean preserving process to the masks obtained from the weight penalty process, and prune filters with lower mask values, thus better balancing clean accuracy and backdoor mitigation performance. The pruning strategy of MNP differs from most previous methods [34, 18], which prune filters directly by their contribution to the backdoor task. For example, if a hybrid filter contributes significantly to both the clean task and the backdoor task, it may be directly pruned by these methods. In contrast, in MNP, as the clean preserving process increases the mask of that filter, the priority of pruning it is lower than that of pruning filters whose removal reduces the backdoor contribution without affecting the clean accuracy, i.e., filters consisting purely of backdoor neurons.
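The mask combination step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the mask values, the assumption that masks lie in [0, 1], and the pruning threshold are all hypothetical.

```python
from typing import List, Tuple


def select_filters_to_prune(
    penalty_masks: List[float],
    preserving_masks: List[float],
    threshold: float = 0.5,  # hypothetical pruning threshold
) -> Tuple[List[float], List[int]]:
    """Add the per-filter masks from the two processes and return the combined
    masks together with the indices of filters whose combined mask is low."""
    combined = [p + c for p, c in zip(penalty_masks, preserving_masks)]
    pruned = [j for j, m in enumerate(combined) if m < threshold]
    return combined, pruned


# Hypothetical masks for four filters in one layer:
# 0 = clean, 1 = purely backdoor, 2 = hybrid but critical for the clean task, 3 = clean.
penalty = [0.90, 0.05, 0.10, 0.80]     # after weight penalty
preserving = [0.10, 0.00, 0.70, 0.05]  # after clean preserving

combined, pruned = select_filters_to_prune(penalty, preserving)
```

Here only filter 1 is selected for pruning. Filter 2 would have been pruned on its weight penalty mask alone, but the clean preserving mask lifts it above the threshold.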

A.3 Detect Backdoor Models with MNP

Based on the assumptions in Section 4.1, the magnitude of a neuron is positively correlated with its contribution to the prediction results. In benign models, the magnitude-CLC correlation is approximately equal to the magnitude-contribution correlation. In backdoor models, the magnitude-CLC correlation should be weaker than the actual magnitude-contribution correlation, because CLC only partially reflects the contribution of backdoor and hybrid neurons, as mentioned in Section 3.3. Intuitively, we can detect backdoor models by measuring the strength of the correlation between neuron magnitude and saliency. However, directly computing CLC is computationally expensive, because it requires pruning each filter $\theta^{(i,j)}$ and measuring the change in loss. The clean loss $\mathcal{L}_{cl}$ can be denoted as follows:

$$\mathcal{L}_{cl}(\theta)=\mathbb{E}_{(\boldsymbol{x},y)\in\mathcal{D}_{t}}\mathcal{L}(F(\boldsymbol{x};\theta),y) \tag{9}$$

We can approximate CLC in the vicinity of $\theta$ by the first-order Taylor expansion of the clean loss:

$$\mathcal{L}_{cl}(\theta)=\mathcal{L}_{cl}(\theta|{\theta^{(i,j)}=0})+\frac{\delta\mathcal{L}_{cl}}{\delta\theta^{(i,j)}}\theta^{(i,j)}+O(\|\theta^{(i,j)}\|_{p}^{2}) \tag{10}$$

We can neglect the first-order remainder to avoid computational difficulties, since the widely-used ReLU activation function encourages a smaller second-order term. By substituting Eq. 10 into Eq. 2 and ignoring the remainder, we have:

$$\widetilde{\textrm{CLC}}(\theta,i,j)=\mathcal{L}_{cl}(\theta)-\frac{\delta\mathcal{L}_{cl}}{\delta\theta^{(i,j)}}\theta^{(i,j)}-\mathcal{L}_{cl}(\theta)=-\frac{\delta\mathcal{L}_{cl}}{\delta\theta^{(i,j)}}\theta^{(i,j)}, \tag{11}$$

which is computationally friendly, as the gradient $\frac{\delta\mathcal{L}_{cl}}{\delta\theta^{(i,j)}}$ can be easily computed through backpropagation. Another problem is that the scale of neuron magnitude and saliency varies across different layers of DNNs. Thus we apply a simple layer-wise $l_{2}$ -normalization to conduct rescaling across layers. The magnitude and saliency metrics for each filter are defined as follows:

$$m_{ij}=\frac{\|\theta^{(i,j)}\|_{p}}{\sqrt{\sum_{j}(\|\theta^{(i,j)}\|_{p})^{2}}} \tag{12}$$
$$s_{ij}=\frac{|\widetilde{\textrm{CLC}}(\theta,i,j)|}{\sqrt{\sum_{j}\widetilde{\textrm{CLC}}(\theta,i,j)^{2}}} \tag{13}$$

Note that it is necessary to use the absolute value of CLC as the saliency metric, since the signed CLC cannot reflect the uncertainty of model prediction brought by pruning a single filter, as discussed in model compression studies [23, 22]. For each model $F$ , we compute Spearman's rank correlation coefficient $\rho_{F}$ between the magnitude metric $m_{ij}$ and the saliency metric $s_{ij}$ . Under the same training settings, the coefficient for backdoor models is significantly lower than that for clean models.
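The detection statistic can be sketched in a few lines of standard-library Python. The sketch assumes that the per-filter norms and the first-order CLC values of Eq. 11 have already been extracted from the model. The function names and the toy inputs are hypothetical, and the rank correlation is a plain reimplementation rather than the code used in the experiments.

```python
import math
from typing import List, Sequence


def l2_normalize(values: Sequence[float]) -> List[float]:
    """Layer-wise l2 normalization used in Eq. 12 and Eq. 13."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values] if norm > 0 else list(values)


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in range(pos, end + 1):
            ranks[order[k]] = (pos + end) / 2.0 + 1.0
        pos = end + 1
    return ranks


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        return float("nan")  # correlation is undefined for constant inputs
    return cov / math.sqrt(var_a * var_b)


def magnitude_saliency_correlation(
    layer_norms: List[List[float]], layer_clc: List[List[float]]
) -> float:
    """layer_norms[i][j] is the l_p norm of filter (i, j); layer_clc[i][j] is its
    first-order CLC estimate, i.e. minus gradient times weight (Eq. 11)."""
    m: List[float] = []
    s: List[float] = []
    for norms, clc in zip(layer_norms, layer_clc):
        m.extend(l2_normalize(norms))                  # Eq. 12
        s.extend(l2_normalize([abs(c) for c in clc]))  # Eq. 13
    return spearman(m, s)


# Toy layer in which the largest-norm filter has the lowest saliency.
rho = magnitude_saliency_correlation([[1.0, 2.0, 3.0, 10.0]], [[0.1, -0.2, 0.3, 0.01]])
```

For this toy layer `rho` is -0.2, whereas a layer whose saliency grows monotonically with the filter norm gives a coefficient of 1.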

However, the defender lacks information about the training process and cannot compare the untrustworthy model with the corresponding benign model. We instead tune the model with the clean suppression objective to perturb the magnitude of clean neurons. In clean models, the unlearning process reduces the magnitude of clean neurons at approximately equal rates, so the magnitude-saliency correlation is maintained, whereas the correlation in backdoored models is significantly weakened. The idea is that the unlearning process affects backdoor and clean neurons differently, making the hybrid and backdoor filters deviate further from the magnitude-saliency correlation, which is also implied in [18].

A.4 Limitations of MNP

Despite the promising results of our Magnitude-based Neuron Pruning (MNP) method, there are several limitations that warrant attention:

1. Clean data acquisition. Although MNP requires only a modest amount of clean data to achieve effective backdoor defense, the availability of clean data remains a significant limitation. Without access to a reliable source of clean data, the application of our method may be restricted. Future work should focus on enhancing MNP to operate with even less clean data or on developing techniques that can identify and utilize clean data within a poisoned dataset.
2. Defense against low poisoning rate attacks. MNP shows robust performance against a variety of backdoor attacks but faces challenges when the poisoning rate is exceedingly low (e.g., $\leq 1\%$ ). In such cases, backdoor neurons blend more easily with clean neurons, making them harder to detect and prune accurately. Addressing this limitation will involve refining our detection algorithms to be more sensitive to subtle anomalies in neuron behavior.
3. Theoretical guarantees against advanced attacks. Our study is grounded in observations and regularization assumptions, which means that we cannot theoretically guarantee that MNP can defend against all sophisticated attacks. The evolving nature of adversarial strategies necessitates continuous research to strengthen the theoretical underpinnings of our method and ensure it remains effective against future threats.

Appendix B Additional Experimental Results

Table 4: Comparison of the detection rate (DR, %) of MNP and 4 existing detection methods against 6 existing attacks

Defense	BadNets	Trojan	Blend	CL	IAB	WaNet	Avg
NC	100.00	100.00	20.50	52.50	9.50	25.50	51.33
AC	100.00	100.00	91.50	84.50	75.50	87.00	89.75
SC	100.00	100.00	95.50	55.00	57.50	90.50	83.08
STRIP	100.00	100.00	83.50	50.00	50.50	100.00	80.67
MNP(Ours)	100.00	100.00	100.00	95.50	92.50	100.00	98.00
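As a quick consistency check, the Avg column of Table 4 can be recomputed as the unweighted mean of the six per-attack detection rates. The snippet below is a small sketch that only reuses the values printed in the table; treating Avg as a plain mean is an assumption that the numbers happen to confirm.

```python
# Per-attack detection rates (%) copied from Table 4, in the order
# BadNets, Trojan, Blend, CL, IAB, WaNet.
detection_rates = {
    "NC": [100.00, 100.00, 20.50, 52.50, 9.50, 25.50],
    "AC": [100.00, 100.00, 91.50, 84.50, 75.50, 87.00],
    "SC": [100.00, 100.00, 95.50, 55.00, 57.50, 90.50],
    "STRIP": [100.00, 100.00, 83.50, 50.00, 50.50, 100.00],
    "MNP": [100.00, 100.00, 100.00, 95.50, 92.50, 100.00],
}

# Unweighted mean over the six attacks, rounded to two decimals as in the table.
averages = {name: round(sum(v) / len(v), 2) for name, v in detection_rates.items()}
best = max(averages, key=averages.get)
```

The recomputed means match the reported Avg column, and `best` is `"MNP"`.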

B.1 Run-time Analysis

We record the running time of MNP on an RTX 3090 Ti GPU with 500 CIFAR-10 samples and a ResNet18 model attacked by BadNets. MNP takes 1 minute and 27 seconds to reduce the ASR to 0.47%, which is on the same order as FP (30 seconds), ANP (45 seconds) and RNP (1 minute and 11 seconds). The computing cost is acceptable compared to retraining the model from scratch (more than 1 hour).