Certified $\ell_2$ Attribution Robustness via Uniformly Smoothed Attributions

Fan Wang   Adams Wai-Kin Kong
School of Computer Science and Engineering
Nanyang Technological University
Abstract

Model attribution is a popular tool to explain the rationales behind model predictions. However, recent work suggests that attributions are vulnerable to minute perturbations, which can be added to input samples to fool the attributions while maintaining the prediction outputs. Although empirical studies have shown positive performance via adversarial training, an effective certified defense method is still needed to understand the robustness of attributions. In this work, we propose a uniform smoothing technique that augments the vanilla attributions with noise uniformly sampled from a certain space. We prove that, for all perturbations within the attack region, the cosine similarity between the uniformly smoothed attributions of the perturbed sample and the unperturbed sample is guaranteed to be lower bounded. We also derive alternative formulations of the certification that are equivalent to the original one and provide the maximum size of perturbation or the minimum smoothing radius such that the attribution cannot be perturbed. We evaluate the proposed method on three datasets and show that it can effectively protect the attributions from attacks, regardless of the network architecture, the training scheme and the size of the dataset.

1 Introduction

The development and wider use of deep learning models in various security-sensitive applications, such as autonomous driving, medical diagnosis, and legal judgments, have raised discussions about the trustworthiness of these models. The lack of explainability of deep learning models, especially the recently popular large language models [7], has been one of the main concerns. Regulators have started to require explainability of AI models in some applications [20], and the explainability of AI models has been one of the main focuses of the research community [15]. Model attributions, as one of the important tools to explain the rationales behind model predictions, have been used to understand the decision-making process of the models. For example, medical practitioners use the explanations generated by attribution methods to assist them in making important medical decisions [1, 22, 16]. Similarly, in autonomous vehicles, explainability helps to deal with potential liability and responsibility gaps [2, 8], and people naturally require confirmation of safety-critical decisions. However, since these users often lack expertise in machine learning and the technical details of attributions, there is a risk that the attributions are manipulated without the users noticing. Attackers may specifically target attributions to mislead investigators, propagate false narratives, or evade detection, potentially leading to serious consequences. Consequently, practitioners could lose trust in these methods and refuse to use them. Thus, a trustworthy application requires not only the predictions made by the AI models, but also the explanations produced by attributions. However, attributions have also been shown recently to be vulnerable to small perturbations [19]. Similar to adversarial attacks, attribution attacks generate perturbations that can be added to input samples. These perturbations distort the attributions while leaving the prediction outputs unchanged. This gives a false sense of security to practitioners who blindly trust the attributions. An effective defense method is urgently needed to protect the attributions from such attacks.

Unlike adversarial defense, which has been extensively investigated to mitigate the harm of adversarial attacks using both empirical [32, 3, 9] and certified [11, 47, 48, 28] defense methods, attribution defense has been neglected. Almost all attribution protection works focus on adversarially training the model by augmenting the training data with manipulated samples to improve the robustness of attributions [5, 10, 23, 37, 45]. Only the recent work by Wang & Kong [46] attempted to derive a practical upper bound for the worst-case attribution deviation after the samples are attacked, but it suffers from strict assumptions and heavy computation. Overall, none of the previous works is capable of providing a certification that ensures the attribution attack will not change the attribution by more than a given threshold. In this work, we focus on the smoothed version of attributions and seek to provide a theoretical guarantee that the attributions are robust to any type of perturbation within an $\ell_2$ attack budget.

Based on the previously defined formulation of attribution robustness [46], given a network $f$, its attribution function $g$ and perturbation $\bm{\delta}\in\mathbb{R}^{d}$, the attribution robustness is defined as the optimal value that maximizes the worst-case attribution difference $D(\cdot,\cdot)$ under the provided attack budget,

$$\max_{\bm{\delta}}\quad D(g(\bm{x}),g(\bm{x}+\bm{\delta}))\quad\text{s.t.}\quad\|\bm{\delta}\|\leq\epsilon. \tag{1}$$

However, the aforementioned study only provides an approximate method to solve the optimization problem under the strict assumption that the network is locally linear and, as a result, is unable to generalize to modern neural networks. To give a complete certification of the problem, we follow the formulation and attempt to find an effective upper bound of the attribution difference or, equivalently, a lower bound of the attribution similarity, which can be applied to any network. We propose to use uniformly smoothed attribution, which is a smoothed version of the original attribution, and show that, for all perturbations within the allowable attack budget, the cosine similarity between the perturbed and unperturbed uniformly smoothed attributions can be certified to be lower bounded. The contributions of this paper can be summarized as follows:

  • We provide a theoretical guarantee that demonstrates the robustness of the uniformly smoothed attribution to any perturbation within the allowable region. The robustness is measured by the similarity between the perturbed and unperturbed smoothed attributions. The method can be applied to any neural network and can be efficiently scaled to larger images. To the best of the authors' knowledge, this is the first work that provides a theoretical guarantee for attribution robustness.

  • We present alternative formulations of the certification that are equivalent to the original one and practical to implement. The alternative formulations determine the maximum size of perturbation, or the minimum radius of smoothing, ensuring that the attribution remains within a given tolerance.

  • We demonstrate that uniform smoothing can protect the attribution against $\ell_2$ attacks. More importantly, we evaluate the proposed method on the well-bounded integrated gradients and show that it can be effectively implemented and can successfully protect and certify the attributions against $\ell_2$ attacks.

The rest of this paper is organized as follows. In Section 2, we review the related works. In Section 3 and Section 4, we introduce the uniformly smoothed attribution and show that it can be certified against attribution attacks. In Section 5, we present experimental results and evaluate the proposed method. Finally, we conclude this paper in Section 6.

2 Related works

2.1 Attribution methods

Attribution methods study the importance of each input feature and measure how much each feature contributes to the model prediction. One popular attribution approach is the gradient-based method. Based on the property that the gradient measures the rate of change, gradient-based methods measure feature importance by weighting the gradients in different ways. Examples of gradient-based methods include saliency [38], integrated gradients (IG) [43], and full-gradient [41]. Other attribution methods include occlusion [38], which measures the importance of each input feature by occluding the feature and measuring the change of the model prediction, layer-wise relevance propagation (LRP) [4], which propagates the output relevance to the input layer, and SHAP-related methods [31, 42, 26]. An important property of attribution methods is the axiom of completeness, $\sum_{i}g_{i}(\bm{x})=f_{j}(\bm{x})$, which relates the attribution to the prediction score. Note that the gradient-based attribution methods are upper-bounded since the gradients of the model output with respect to the input are upper-bounded.
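To make the completeness axiom concrete, the following is a minimal sketch of integrated gradients on a toy function. The function `toy_f`, its analytic gradient `toy_grad`, the all-zero baseline and the midpoint-rule discretization are illustrative assumptions, not part of the paper. With a baseline whose output is zero, the attributions should sum to the function value.

```python
def integrated_gradients(grad_f, x, baseline, n_steps=100):
    """Approximate IG along the straight path from baseline to x (midpoint rule)."""
    diff = [xi - bi for xi, bi in zip(x, baseline)]
    avg_grad = [0.0] * len(x)
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [bi + alpha * di for bi, di in zip(baseline, diff)]
        grad = grad_f(point)
        avg_grad = [a + g / n_steps for a, g in zip(avg_grad, grad)]
    # IG_i = (x_i - baseline_i) * average gradient along the path
    return [di * gi for di, gi in zip(diff, avg_grad)]


# Hypothetical toy model and its analytic gradient (assumptions for illustration).
def toy_f(x):
    return x[0] * x[1] + x[2] ** 2


def toy_grad(x):
    return [x[1], x[0], 2.0 * x[2]]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_grad, x, baseline)
# Completeness: the attributions sum to toy_f(x) - toy_f(baseline).
print(attributions, sum(attributions), toy_f(x) - toy_f(baseline))
```

For a real network the gradient would come from automatic differentiation rather than a hand-written function.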

2.2 Attribution attacks and defenses

Ghorbani et al. [19] first pointed out that attributions can be fragile to iterative attribution attacks, and Dombrowski et al. [14] extended the attack to a targeted one, in which attributions can be changed purposely into any preset pattern. Similar to adversarial attacks, attribution attacks maximize a loss function that measures the difference between the original attributions and the target attributions. In addition, the attribution attacks are constrained not to alter the classification results. To defend against attribution attacks, adversarial training [32] approaches have been adopted. Chen et al. [10] and Boopathy et al. [5] minimize the $\ell_p$-norm differences between perturbed and original attributions, and Ivankay et al. [23] consider Pearson's correlation coefficient. Wang & Kong [45] suggest using cosine similarity, emphasizing that attributions should be compared in direction rather than magnitude. All of the above defenses improve the attribution robustness empirically, but none of them is able to provide a certification that the attributions will not change under any perturbation within the allowable attack region.

[Figure 1 panels: (a) Original image; (b) IG; (c) Gaussian smoothed IG; (d) Uniformly smoothed IG]

Figure 1: (left) Examples of attributions (zoom in for better visibility). We choose to show the integrated gradients (IG) and its corresponding smoothing results. For Gaussian smoothing, the noise level is set to $\sigma=0.2$, and for uniformly smoothed IG, an $\ell_2$ ball with radius $\sqrt{3}\sigma$ is used. (right) A 2D illustration of the volumes of $\mathcal{B}(\bm{x};r)$ and $\mathcal{B}(\tilde{\bm{x}};r)$, as well as the relationship between $h(\bm{x})$ and $h(\tilde{\bm{x}})$. Here $h(\bm{x})$ is the original attribution, and $a\bm{v}$ represents the magnitude and direction of the translation of $h(\bm{x})$ after the sample is perturbed. $V_{U}$ in Theorem 1 is the volume of the shaded region in the figure, and $V_{\mathcal{S}}$ is the volume of each individual ball. When $h(\bm{x})$ is fixed, the lower bound of the cosine similarity between $h(\bm{x})$ and $h(\bm{x})+a\bm{v}$ can be derived as a function of the volumes.

2.3 Randomized smoothing

The smoothing technique has been popular for improving certified adversarial robustness [30, 28, 11, 48]. The smoothed classifiers take a batch of inputs that are randomly sampled from the neighbourhood of the original inputs under a certain distribution $\mu$ and make decisions based on the most likely outputs, i.e., $\operatorname*{arg\,max}_{y}\mathbb{P}_{\bm{\eta}\sim\mu}[F(\bm{x}+\bm{\eta})=y]$. They provide a certification of the maximum radius such that no perturbation within the radius can alter the classification label. Cohen et al. [11] certify the $\ell_2$ attack based on the Neyman-Pearson lemma, and the result is alternatively proved by Salman et al. [36] using explicit Lipschitz constants. Lecuyer et al. [28] and Teng et al. [44] consider Laplacian smoothing for the $\ell_1$ attack. Yang et al. [48] derived a similar result in $\ell_\infty$, though the radius becomes small when the dimension of the data gets large. Kumar & Goldstein [25] applied randomized smoothing to structured outputs, to which attributions belong, but the specific bound for attributions is too loose to be meaningful. Thus, there are no existing works that provide defense and valid certifications for attributions using randomized smoothing, due to the difficulty of defining the attribution robustness and of computing the attribution gradient.
In this work, an effective method to formulate the smoothed attribution robustness as a simple optimization problem is proposed to provide certifications against attribution attacks.
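As a rough illustration of the smoothed-classifier rule above, the sketch below estimates the most likely label by Monte Carlo voting. The Gaussian noise, the function name `smoothed_predict` and the toy base classifier are assumptions made only for this example; it performs prediction and does not compute any certified radius.

```python
import random
from collections import Counter


def smoothed_predict(base_classifier, x, sigma, n_samples=500, seed=0):
    """Majority vote of base_classifier over noisy copies of x (Gaussian noise assumed)."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[base_classifier(noisy)] += 1
    # Empirical estimate of argmax_y P[F(x + eta) = y]
    return votes.most_common(1)[0][0]


# Hypothetical two-class base classifier, used only for illustration.
def toy_classifier(x):
    return 1 if sum(x) > 0 else 0


print(smoothed_predict(toy_classifier, [2.0, 2.0], sigma=0.5))
```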

3 Uniformly smoothed attribution

Consider a classifier $f:\mathbb{R}^{d}\rightarrow[0,1]^{c}$ that maps the input $\bm{x}\in\mathbb{R}^{d}$ to the softmax output $y\in[0,1]$, and its attribution function $g(\bm{x}):\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$. The smoothed attribution of $f$ is a new attribution $h$ constructed by taking the mean of the attributions on $\bm{x}+\bm{\eta}$, where $\bm{\eta}$ is randomly drawn from a density $\mu$, i.e., the smoothed attribution $h$ can be defined as follows:

$$h(\bm{x})=\mathbb{E}_{\bm{\eta}\sim\mu}[g(\bm{x}+\bm{\eta})]. \tag{2}$$

To construct the smoothed attribution, the density $\mu$ can be chosen arbitrarily, and the smoothed attributions provide visually sharpened gradient-based attributions (see Figure 1 (left)). Smilkov et al. [40] choose $\mu$ to be a multivariate Gaussian distribution on input gradients to weaken the visually noisy attribution. In this work, we aim at certifying the attribution robustness under $\ell_2$ attack; thus we choose $\mu$ to be a uniform distribution on a $d$-dimensional closed space $\mathcal{S}$ centered at $\bm{0}$, specifically the $\ell_2$-norm ball of radius $r$, $\mathcal{B}(\bm{0};r)=\left\{\bm{y}:\|\bm{y}\|_{2}\leq r,\bm{y}\in\mathbb{R}^{d}\right\}$. It can be seen that smoothing under this setting also provides high attribution quality (Figure 1(d)). To quantitatively evaluate the effectiveness of uniformly smoothed attribution, we further evaluate its performance using GridPG introduced by Rao et al. [34], which quantifies the significance of individual features in terms of positive contributions or influences. The GridPG values of IG, uniformly smoothed IG and Gaussian smoothed IG for $5{,}000$ randomly selected ImageNet examples are $0.4021$, $0.4093$ and $0.4110$, respectively, which suggests that the uniformly smoothed attributions achieve GridPG performance comparable to that of the Gaussian smoothed attributions, as well as the original non-smoothed attributions.
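The definition in Eq. (2) with a uniform density on $\mathcal{B}(\bm{0};r)$ can be approximated by Monte Carlo sampling. The following is a minimal sketch under stated assumptions: the helper names, the sample count and the toy attribution function are hypothetical, and the sampling recipe (Gaussian direction, radius $r\,U^{1/d}$) is a standard way to draw uniformly from an $\ell_2$ ball rather than something prescribed by the paper.

```python
import math
import random


def sample_uniform_l2_ball(d, r, rng):
    """Draw one point uniformly from the l2-ball B(0; r) in R^d."""
    # A normalized standard Gaussian vector is uniform on the unit sphere.
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in direction))
    # Volume grows like radius^d, so the radius is distributed as r * U^(1/d).
    radius = r * rng.random() ** (1.0 / d)
    return [radius * c / norm for c in direction]


def smoothed_attribution(g, x, r, n_samples=1000, seed=0):
    """Monte Carlo estimate of h(x) = E[g(x + eta)] with eta uniform on B(0; r)."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n_samples):
        eta = sample_uniform_l2_ball(len(x), r, rng)
        attr = g([xi + ei for xi, ei in zip(x, eta)])
        total = [t + a for t, a in zip(total, attr)]
    return [t / n_samples for t in total]


# Hypothetical attribution: the gradient of ||x||^2 (a stand-in for a real method).
def toy_attribution(x):
    return [2.0 * xi for xi in x]


x = [1.0, -2.0, 0.5, 3.0]
print(smoothed_attribution(toy_attribution, x, r=0.5, n_samples=2000))
```

In practice `g` would be an attribution method such as IG evaluated on a network, and the number of samples trades estimation noise against cost.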

4 Certifying the cosine similarity of smoothed attributions

We now consider $\mathcal{S}$ as an $\ell_2$-norm ball for ease of analysing the certification. Adapted from the formulation in Eq. (1), the robustness of attribution is defined as the minimum possible attribution similarity when a natural image is perturbed by attribution attacks. Cosine similarity is chosen to measure the similarity as it has been shown to be the most suitable alternative to the non-differentiable Kendall's rank correlation, which is the most common evaluation index [45]. Suppose that an $\ell_2$ attribution attack is performed on the input sample $\bm{x}$ with maximum allowable perturbation $\epsilon$, i.e., $\tilde{\bm{x}}=\bm{x}+\bm{\delta}$, where $\|\bm{\delta}\|_{2}\leq\epsilon$. For all $\|\bm{\delta}\|_{2}\leq\epsilon$, we want to find the minimum value of $\cos(h(\bm{x}),h(\tilde{\bm{x}}))$, i.e., the lower bound of cosine similarity for attributions. Thus, the optimization problem we are interested in is formulated as follows:

$$\min_{\bm{\delta}}\quad\cos(h(\bm{x}),h(\bm{x}+\bm{\delta}))\quad\text{s.t.}\quad\|\bm{\delta}\|_{2}\leq\epsilon. \tag{3}$$

That is, we will show that, given an arbitrary sample point $\bm{x}$, for all perturbed samples $\tilde{\bm{x}}$, the cosine similarity between the original smoothed attribution and the corresponding perturbed smoothed attribution is guaranteed to be lower bounded by the optimum of (3). Alternatively, from a more practical perspective, given a threshold $T$ for the cosine similarity, we want to know the maximum size of perturbation, or the minimum smoothing radius, such that no perturbation inside would cause the attribution difference to exceed the threshold.

However, although optimizing the cosine similarity for attribution is an intuitive way to find the lower bound, it is difficult in this problem to study the cosine function directly. Moreover, it is intractable to optimize the cosine similarity with respect to a vector $\bm{\delta}$. To address this issue, we first reformulate the problem into an optimization over two scalars and then solve the alternative problem to obtain the lower bound of cosine similarity. All the proofs and derivations of theorems, lemmas and corollaries are provided in Appendix A.

4.1 One-dimensional reformulation

We note that the cosine similarity of attributions is in fact an inner product of their normalized vectors. Besides, we also observe that $h(\bm{x})$ is the mean of $g(\bm{x}+\bm{\eta})$ with respect to $\bm{\eta}$, which is, by definition, equivalent to the integral of $g$ weighted by the density of the uniform distribution over $\mathcal{B}(\bm{x};r)$. More importantly, for the perturbed example $\tilde{\bm{x}}$, the weighting region is $\mathcal{B}(\tilde{\bm{x}};r)$, which is a translated version of $\mathcal{B}(\bm{x};r)$, and the two regions intersect when the distance between their centers is smaller than twice the radius $r$. Therefore, given the input sample $\bm{x}$ and its attribution $h(\bm{x})$, we can rewrite the minimization of cosine similarity as follows:

$$\min_{a,\bm{v}}\quad\frac{h(\bm{x})^{T}}{\|h(\bm{x})\|}\left(\frac{h(\bm{x})+a\bm{v}}{\|h(\bm{x})+a\bm{v}\|}\right) \tag{4}$$

where $a$ represents the magnitude of the translation and the unit vector $\bm{v}$ is the direction of translation as shown in Figure 1 (right). It can be shown that the magnitude $a$ is constrained by a constant related to the intersecting volume and the property of the attribution function itself.

In a high-dimensional case, for a fixed cosine similarity value, a given $h(\bm{x})$ and $h(\tilde{\bm{x}})$, in fact, form a spherical cone. Thus, we can decompose the directional unit vector $\bm{v}$ into $\bm{v}=\cos\theta\,\bm{v}_{\parallel}+\sin\theta\,\bm{v}_{\perp}$, where $\bm{v}_{\perp}$ is perpendicular to $h(\bm{x})$ and $\bm{v}_{\parallel}$ is parallel to $h(\bm{x})$. Then, we have

$$\min_{a,\theta}\quad\frac{\|h(\bm{x})\|+a\cos\theta}{\sqrt{(\|h(\bm{x})\|+a\cos\theta)^{2}+(a\sin\theta)^{2}}}, \tag{5}$$

where $0\leq\theta\leq 2\pi$.

Since $a$ is a scalar representing the magnitude of the translation from $h(\bm{x})$ to $h(\tilde{\bm{x}})$, it can be shown that the magnitude of $a$ is upper bounded by the magnitude of the gradient weighted by the ratio of volume change during the translation process. Specifically, $a\leq MV_{U}/V_{\mathcal{S}}$, where $V_{\mathcal{S}}$ is the volume of the $\ell_2$-ball $\mathcal{B}(\bm{0};r)$, $V_{U}$ is the volume of the union of the two sampling spaces centered at $\bm{x}$ and $\tilde{\bm{x}}$ minus their intersection, and $M$ is a constant that depends on the upper bound of $g$. We notice that the gradient-based attribution is a function of the input gradient, $\nabla f(\bm{x})$, which is bounded for Lipschitz continuous networks (see Lemma 1 in Appendix A); thus, the upper bound can be derived separately for different attribution functions.

4.2 Lower bound of cosine similarity for bounded attributions

We have now reformulated the optimization with respect to the vector $\bm{v}$ into a simpler one with respect to the scalar values $a$ and $\theta$. By solving the alternative problem, our result shows that the smoothed attribution is robust within the following half-angle of a spherical cone.

Theorem 1.

Let $g:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ be an upper-bounded attribution function, and $\bm{\eta}\stackrel{U}{\sim}\mathcal{B}(\bm{0};r)$. Let $h$ be the smoothed version of $g$ as defined in (2). Then, for all $\tilde{\bm{x}}\in\left\{\bm{x}+\bm{\delta}\mid\|\bm{\delta}\|_{2}\leq\epsilon\right\}$, we have $\cos\left(h(\bm{x}),h(\tilde{\bm{x}})\right)\geq T$, where

$$T=\frac{\|h(\bm{x})\|_{2}}{\sqrt{\|h(\bm{x})\|_{2}^{2}+M^{2}V_{U}^{2}/V^{2}_{\mathcal{S}}}} \tag{6}$$

Here, $M$ is the upper bound of $g$, $V_{\mathcal{S}}$ is the volume of the $\ell_2$-ball $\mathcal{B}(\bm{0};r)$, and $V_{U}$ is the volume of the union of the two sampling spaces centered at $\bm{x}$ and $\tilde{\bm{x}}$ minus their intersection.

The entire proof can be found in Appendix A. The theorem points out that the lower bound of cosine similarity is related to the smoothing space around the input samples and the maximum allowable perturbation size. Moreover, when the smoothing space is an $\ell_2$-norm ball, the above result can be derived by directly computing the two volumes $V_{U}$ and $V_{\mathcal{S}}$, which can be explicitly calculated by

$$V_{U}=2V_{\mathcal{S}}\times\left(1-I_{(2rh-h^{2})/r^{2}}\left(\frac{d+1}{2},\frac{1}{2}\right)\right) \tag{7}$$

where $h=r-\epsilon/2\geq 0$ and $I_{x}(a,b)$ is the regularized incomplete beta function, i.e., the cumulative distribution function of the beta distribution [29]. Note that in (7) the symbol $h$ denotes this scalar length rather than the smoothed attribution.
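As a numerical illustration of Eqs. (6) and (7), the sketch below evaluates the ratio $V_U/V_{\mathcal{S}}$ and the resulting bound $T$. It is only a sketch under stated assumptions: the function names are hypothetical, and instead of calling a special-function library it uses the identity $I_{1-c^2}\left(\frac{d+1}{2},\frac{1}{2}\right)=\int_c^1 (1-t^2)^{(d-1)/2}dt \,/\, \int_0^1 (1-t^2)^{(d-1)/2}dt$ with $c=\epsilon/(2r)$, evaluated by Simpson's rule. If SciPy is available, `scipy.special.betainc(a, b, x)` computes $I_x(a,b)$ directly.

```python
import math


def _simpson(func, lo, hi, n=20000):
    """Composite Simpson rule on [lo, hi] with n (even) sub-intervals."""
    step = (hi - lo) / n
    total = func(lo) + func(hi)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * func(lo + i * step)
    return total * step / 3.0


def symmetric_difference_ratio(d, r, eps):
    """V_U / V_S for two l2-balls of radius r in R^d whose centers are eps apart."""
    if r <= 0 or not 0.0 <= eps <= 2.0 * r:
        raise ValueError("need r > 0 and 0 <= eps <= 2r, i.e. r - eps/2 >= 0")
    c = eps / (2.0 * r)  # normalized half-distance between the two centers
    p = (d - 1) / 2.0

    def weight(t):
        return max(0.0, 1.0 - t * t) ** p

    # V_U / V_S = 2 * (1 - I) = 2 * int_0^c w / int_0^1 w
    return 2.0 * _simpson(weight, 0.0, c) / _simpson(weight, 0.0, 1.0)


def cosine_lower_bound(h_norm, M, d, r, eps):
    """Eq. (6): T = ||h|| / sqrt(||h||^2 + M^2 (V_U / V_S)^2)."""
    ratio = symmetric_difference_ratio(d, r, eps)
    return h_norm / math.sqrt(h_norm ** 2 + (M * ratio) ** 2)


# Illustrative values only; h_norm and M are not taken from the paper.
for radius in (0.5, 1.0, 2.0):
    print(radius, cosine_lower_bound(h_norm=1.0, M=1.0, d=10, r=radius, eps=0.5))
```

For very high-dimensional inputs the integrand is sharply peaked near zero, so the number of Simpson sub-intervals may need to be increased, or a library implementation of the incomplete beta function used instead.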

We observe the following properties of the above theorem.

  1. Unlike previous attribution robustness works, such as Dombrowski et al. [14], Singh et al. [39], Boopathy et al. [5] and Wang & Kong [45], which require the networks to be twice-differentiable and need to change the ReLU activation into Softplus, the proposed result does not assume anything about the classifier. Thus, it can be safely applied to any neural network and any architecture.

  2. The lower bound depends on the radius of smoothing, $r$. The bound becomes larger as the radius of smoothing grows. In the extreme case when $r$ tends to infinity, the smoothing spaces of the two samples completely overlap, which corresponds to the same smoothed attribution, and their cosine similarity becomes $1$.

  3. At the same time, the lower bound decreases when the attack budget $\epsilon$ increases. Besides, when $\epsilon$ increases beyond the constraint that $h=r-\epsilon/2\geq 0$ and tends to infinity, the distance between the two attributions becomes larger and their cosine similarity tends to $0$ in high-dimensional space, which makes the lower bound trivial.

  4. The proposed method can be scaled to datasets with large images. The lower bound is efficient to compute since only the smoothed attribution of the given sample is needed. In contrast, the previous work that approximately estimates the attribution robustness [46] requires the computation of the input Hessian and the corresponding eigenvalues and eigenvectors, which becomes intractable for larger images on modern neural networks.

Table 1: Comparison of top-k and Kendall's rank correlation between $\ell_2$ perturbed and non-perturbed attributions on standard and robust models using smoothed and non-smoothed attributions.

| Attribution | Metric | Standard | IG-NORM [10] | TRADES [50] | IGR [45] |
|---|---|---|---|---|---|
| SM | top-k | 0.3449 | 0.6575 | 0.6450 | 0.8354 |
| SM | Kendall | 0.1496 | 0.4709 | 0.4642 | 0.7553 |
| SmoothSM | top-k | 0.3853 | 0.6261 | 0.6082 | 0.8363 |
| SmoothSM | Kendall | 0.1670 | 0.4238 | 0.4119 | 0.7568 |
| IG | top-k | 0.4742 | 0.7075 | 0.6821 | 0.8402 |
| IG | Kendall | 0.1744 | 0.5098 | 0.5030 | 0.7839 |
| SmoothIG | top-k | 0.5302 | 0.6730 | 0.6528 | 0.8460 |
| SmoothIG | Kendall | 0.3819 | 0.4494 | 0.4533 | 0.7612 |

4.3 Alternative formulations of the attribution robustness

The previous section formulates the robustness of smoothed attribution in terms of the smallest cosine similarity between the original and perturbed smoothed attribution, when the attack budget and the smoothing radius are fixed. In some scenarios, practitioners want to formulate the robustness in different ways. For example, we may want to find the maximum allowable perturbation $\epsilon$ such that the cosine similarity between the original and perturbed smoothed attribution is guaranteed to be greater than a predefined threshold $T$. On the other hand, one can also obtain the minimum smoothing radius needed such that the desired attribution robustness is achieved within the allowable attack region. The following corollary provides these alternative formulations of the attribution robustness.

Corollary 1.

Let $g: \mathbb{R}^{d} \rightarrow \mathbb{R}^{d}$ be a bounded attribution function, and $\bm{\eta} \stackrel{U}{\sim} \mathcal{B}(\bm{x}; r)$. Let $h$ be the smoothed version of $g$ as defined in (2).

  (i) Given a predefined threshold $T \in [0,1]$, then for all $\|\bm{\delta}\|_2 \leq \epsilon$, we have $\cos(h(\bm{x}), h(\bm{x}+\bm{\delta})) \geq T$, where

$$\epsilon = 2r\sqrt{1 - I^{-1}_{Z}\left(\frac{d+1}{2}, \frac{1}{2}\right)}. \tag{8}$$

  (ii) Given a predefined threshold $T \in [0,1]$ and the maximum perturbation size $\epsilon \geq 0$, the smoothed attribution satisfies $\cos(h(\bm{x}), h(\tilde{\bm{x}})) \geq T$ for all $\tilde{\bm{x}} \in \left\{\bm{x}+\bm{\delta} \,|\, \|\bm{\delta}\|_2 \leq \epsilon\right\}$ when $r \geq R$, where

$$R = \frac{\epsilon}{2}\left(1 - I^{-1}_{Z}\left(\frac{d+1}{2}, \frac{1}{2}\right)\right)^{-\frac{1}{2}}. \tag{9}$$

$I^{-1}_{z}(a,b)$ is the inverse of the regularized incomplete beta function, and $Z$ is defined as

$$Z = 1 - \frac{\|h(\bm{x})\|_2}{2M}\left(\frac{1}{T^{2}} - 1\right) \tag{10}$$

The derivation of the Corollary can be found in Appendix A. We notice that, in these formulations, when the smoothing radius is larger, the maximum allowable perturbation is also larger, which allows stronger attacks while keeping the attribution similarity within a controllable range. Similarly, when the maximum allowable perturbation is larger, the minimum smoothing radius is also larger, which means that the attribution similarity can only be maintained with a larger smoothing radius.
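The two formulations of Corollary 1 can be evaluated directly once $Z$ is known. The following is a minimal sketch, assuming SciPy's `betaincinv` as the inverse $I^{-1}_{Z}(a,b)$. The function names are hypothetical, and `h_norm` and `M` stand for $\|h(\bm{x})\|_2$ and the constant $M$ of Eqn. (10), which are taken as given inputs here.

```python
import math

from scipy.special import betainc, betaincinv


def threshold_to_z(h_norm: float, M: float, T: float) -> float:
    """Eqn. (10): Z = 1 - ||h(x)||_2 / (2M) * (1/T^2 - 1)."""
    if not 0.0 < T <= 1.0:
        raise ValueError("T must lie in (0, 1] to evaluate 1/T^2")
    return 1.0 - h_norm / (2.0 * M) * (1.0 / T**2 - 1.0)


def _one_minus_inv_beta(d: int, Z: float) -> float:
    """Compute 1 - I^{-1}_Z((d+1)/2, 1/2); Z must be a valid probability."""
    if not 0.0 <= Z <= 1.0:
        raise ValueError("this sketch only handles Z in [0, 1]")
    return 1.0 - float(betaincinv((d + 1) / 2.0, 0.5, Z))


def max_epsilon(r: float, d: int, Z: float) -> float:
    """Eqn. (8): largest perturbation size certified for radius r."""
    return 2.0 * r * math.sqrt(_one_minus_inv_beta(d, Z))


def min_radius(eps: float, d: int, Z: float) -> float:
    """Eqn. (9): smallest smoothing radius certified for budget eps."""
    s = _one_minus_inv_beta(d, Z)
    if s == 0.0:  # degenerate case: no finite radius unless eps == 0
        return 0.0 if eps == 0.0 else math.inf
    return eps / 2.0 / math.sqrt(s)


if __name__ == "__main__":
    Z = threshold_to_z(h_norm=1.0, M=1.0, T=0.9)  # illustrative values only
    eps = max_epsilon(r=2.0, d=10, Z=Z)
    print(eps, min_radius(eps, d=10, Z=Z))
```

Eqns. (8) and (9) are inverses of each other for fixed $Z$ and $d$, so feeding the output of `max_epsilon` into `min_radius` recovers the original radius. This is a convenient sanity check for any implementation.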

5 Experiments and results

Table 2: The theoretical lower bound ($T$ in Eqn. (6)) for cosine similarity evaluated on baseline models using MNIST. Note that the bound is not achievable for $r = 0.5$ when $\epsilon = 1.0$, since the radius must be greater than $\epsilon/2$.

| $\epsilon$ | Model | $r = 0.5$ | $r = 1.0$ | $r = 1.5$ | $r = 2.0$ | $r = 2.5$ | $r = 3.0$ | $r = 3.5$ |
|---|---|---|---|---|---|---|---|---|
| 0.5 | Standard | 0.3002 | 0.3141 | 0.3385 | 0.3732 | 0.4144 | 0.4600 | 0.5057 |
| 0.5 | IG-NORM | 0.4038 | 0.4189 | 0.4432 | 0.4729 | 0.5055 | 0.5466 | 0.5909 |
| 0.5 | IGR | 0.4145 | 0.4269 | 0.4482 | 0.4792 | 0.5208 | 0.5748 | 0.6392 |
| 1.0 | Standard | / | 0.3034 | 0.3092 | 0.3264 | 0.3716 | 0.3990 | 0.4178 |
| 1.0 | IG-NORM | / | 0.3650 | 0.3822 | 0.3892 | 0.4220 | 0.4974 | 0.5365 |
| 1.0 | IGR | / | 0.3834 | 0.4025 | 0.4558 | 0.4914 | 0.5237 | 0.5358 |

In this section, we evaluate the effectiveness of uniformly smoothed attribution. Following previous work on attribution robustness, we use the $\ell_2$ attribution attack adapted from the Iterative Feature Importance Attacks (IFIA) by Ghorbani et al. [19]. We first show that the uniformly smoothed attributions are less likely to be perturbed than non-smoothed attributions when attacked by IFIA. We then present the results of certification using the proposed lower bound of cosine similarity. The experiments are conducted on baseline models including adversarially robust models [32, 50], attributionally robust models [10, 39, 23, 45], as well as non-robust models trained for standard classification tasks. Following those baseline models, the method is tested on the validation sets of MNIST [27] using a small convolutional network, on CIFAR-10 [24] using a ResNet-18 [21], and on ImageNet [35] using a ResNet-50 [21]. More details of the experiments are described in the Appendix. All experiments are run on an NVIDIA GeForce RTX 3090. (Source code will be released later.)

To empirically compute the uniformly smoothed attribution for every sample, $N$ points are sampled uniformly from the $d$-dimensional ball $\mathcal{B}(\bm{0}; r)$ and added to the input sample. To do so, the sampling technique introduced by Box & Muller [6] is applied. The integration in Eqn. (2) is then estimated using Monte Carlo integration, i.e., $\hat{h}(\bm{x}) = \frac{1}{N}\sum_{i=1}^{N} g(\bm{x}+\bm{\eta}_i)$, where $\bm{\eta}_i \stackrel{U}{\sim} \mathcal{B}(\bm{0}; r)$. For large $N$, the estimator $\hat{h}(\bm{x})$ almost surely converges to $h(\bm{x})$ [18]; hence the convergence of $\hat{T}$ to $T$ can be obtained (see Appendix C.1 for details). Unless otherwise stated, we choose the number of samples $N$ to be $100{,}000$ to compute the proposed lower bound, and $N^{\ast} = 300$ for the uniformly smoothed attribution being attacked in all experiments.
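The Monte Carlo estimator above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: points are drawn with the standard construction of a normalized Gaussian direction scaled by a radius $r U^{1/d}$, using NumPy's generator for the normal deviates (the Box-Muller transform [6] is one way to produce them). The names `sample_uniform_ball` and `smoothed_attribution` are hypothetical, and `g` is any user-supplied attribution function.

```python
import numpy as np


def sample_uniform_ball(n, d, r, rng):
    """Draw n points uniformly from the d-dim l2 ball of radius r around 0."""
    directions = rng.standard_normal((n, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # A radius distributed as r * U^(1/d) makes the density uniform in volume.
    radii = r * rng.random(n) ** (1.0 / d)
    return directions * radii[:, None]


def smoothed_attribution(g, x, r, n, rng, batch=1024):
    """Monte Carlo estimate h_hat(x) = (1/N) * sum_i g(x + eta_i)."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    done = 0
    while done < n:
        m = min(batch, n - done)  # sample in batches to bound memory use
        etas = sample_uniform_ball(m, x.size, r, rng).reshape((m,) + x.shape)
        for eta in etas:
            total += g(x + eta)
        done += m
    return total / n


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = np.arange(25.0).reshape(5, 5) / 10.0
    toy_g = lambda z: A @ z  # toy linear "attribution", for illustration only
    x = np.ones(5)
    print(smoothed_attribution(toy_g, x, r=1.0, n=20000, rng=rng))
    print(toy_g(x))  # for a linear g the two should be close
```

For a linear `g` the smoothed value equals `g(x)` in expectation because the noise has zero mean, which gives a quick check that the sampler is unbiased.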

5.1 Evaluation of the robustness of uniformly smoothed attribution

Table 3: Theoretical lower bounds evaluated on the non-robust model and ImageNet using different radii $r$ and attack sizes $\epsilon$. Note that the method is only applicable when the radius $r$ is greater than $\epsilon/2$.

| | $r = 0.5$ | $r = 1.0$ | $r = 1.5$ | $r = 2.0$ | $r = 2.5$ | $r = 3.0$ | $r = 3.5$ |
|---|---|---|---|---|---|---|---|
| $\epsilon = 0.5$ | 0.2612 | 0.2594 | 0.2704 | 0.2716 | 0.2892 | 0.2973 | 0.3029 |
| $\epsilon = 1.0$ | / | 0.1803 | 0.1992 | 0.1746 | 0.1904 | 0.2127 | 0.2502 |
| $\epsilon = 2.0$ | / | / | 0.1753 | 0.1852 | 0.2044 | 0.2015 | 0.2045 |

We first conduct an experiment to verify that the uniformly smoothed attribution itself is more robust than the original attribution. Uniform smoothing over the $\ell_2$ ball with radius $0.5$ is applied to the saliency map (SM) [38] and integrated gradients (IG) [43] and evaluated on CIFAR-10. The resultant attributions are denoted by SmoothSM and SmoothIG, respectively. We then attack the attributions using the $\ell_2$ IFIA attack and evaluate the robustness using Kendall's rank correlation and top-k intersection [19]. The experiments are evaluated on both a non-robust model (Standard) and robust models (IG-NORM, TRADES, IGR). Note that IFIA is performed directly on SmoothSM and SmoothIG, instead of their non-smoothed counterparts. Since the PGD-like attribution attack requires the derivative of the attribution to determine the direction of gradient descent, double backpropagation is needed. Thus, it is necessary to replace the ReLU activation by the twice-differentiable Softplus during the attack [14]. The results are shown in Table 1.
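The two evaluation metrics can be sketched as below. This is an illustrative implementation, not the evaluation code used in the experiments: ranking features by absolute attribution value and normalizing the intersection size by $k$ are assumptions, the function names are hypothetical, and SciPy's `kendalltau` is used for Kendall's rank correlation.

```python
import numpy as np
from scipy.stats import kendalltau


def topk_intersection(a, b, k):
    """Fraction of the k largest-magnitude features shared by a and b."""
    top_a = np.argsort(-np.abs(np.ravel(a)))[:k]
    top_b = np.argsort(-np.abs(np.ravel(b)))[:k]
    return len(np.intersect1d(top_a, top_b)) / k


def kendall_rank_correlation(a, b):
    """Kendall's tau between two attribution maps, flattened to vectors."""
    tau, _ = kendalltau(np.ravel(a), np.ravel(b))
    return float(tau)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    original = rng.random(100)
    perturbed = original + 0.05 * rng.standard_normal(100)  # toy perturbation
    print(topk_intersection(original, perturbed, k=10))
    print(kendall_rank_correlation(original, perturbed))
```

Both metrics depend only on the ordering of features, so they equal $1$ when the attack leaves the ranking unchanged, even if the attribution magnitudes differ.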

We observe that for the non-robust model, both SmoothSM and SmoothIG perform better than their non-smoothed counterparts in both metrics, which shows that the uniformly smoothed attribution itself is more resistant to attribution attacks. For models that are specifically trained to defend against attribution attacks using heuristic methods adapted from adversarial training, e.g., IG-NORM, TRADES and IGR, the smoothed attributions show comparable robustness to the non-smoothed attributions. Moreover, we also notice that SmoothIG performs better than SmoothSM, especially for the non-robust model. This can be attributed to the fact that IG satisfies the axiom of completeness, which ensures that the sum of IG is upper-bounded by the model output. In addition, we observe that the smoothing technique does not always enhance the robustness of attribution, as measured by top-k and Kendall's rank correlation. It is worth noting that Yeh et al. [49] argued that randomized smoothing can reduce attribution sensitivities and consequently improve robustness. The disparity in our findings stems from the fact that we specifically evaluated Kendall's rank correlation, whereas their study defined sensitivity based on the $\ell_2$-norm distance, which was subsequently deemed inappropriate for evaluating attribution robustness by Wang & Kong [45].

5.2 Evaluation of the certification of uniformly smoothed attributions

In this section, the lower bound of the cosine similarity between the original and attacked attribution is reported. We use integrated gradients as an example since it is well-bounded due to the axiom of completeness; the technique can also be applied to any other gradient-based attribution.

Table 2 reports the theoretical bound evaluated on MNIST, computed using Theorem 1. We include the non-robust model and two attributionally robust models and compute the bound for different pairs of $\ell_2$ radius $r$ and attack size $\epsilon$. The lower bounds are validated against actual $\ell_2$ attacks. Specifically, each input sample is attacked 20 times and the cosine similarities between the resulting perturbed attributions and the original attributions are examined. In total, 200,000 attacked images are tested for each parameter pair and none of the resulting cosine similarities violates (i.e., falls below) the theoretical lower bound. Moreover, we also observe that the bound becomes tighter when the $\ell_2$ radius $r$ increases and when the attack size $\epsilon$ decreases. Besides, since IGR is more robust than IG-NORM [45], we observe that the lower bound is also a valid measure of the robustness of the models.

In Figure 2, we show the gap between the theoretical lower bound and the empirical cosine similarity between the original and perturbed attributions. The results are evaluated on CIFAR-10 using IGR. Out of 10,000 testing samples, the 200 with the smallest gaps are chosen for each pair of $r$ and $\epsilon$, and the gaps are sorted for better visualization. We notice that the gaps between the theoretical bound and empirical cosine similarity are positive and small, which shows the validity and the tightness of the proposed bound.
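The selection of the smallest gaps described above can be sketched as follows. The sign convention (empirical cosine similarity minus lower bound, so that a positive gap means the bound holds) is an assumption made for this illustration, and the function names and the synthetic inputs are hypothetical.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine similarity between two attribution maps, flattened to vectors."""
    u = np.ravel(np.asarray(u, dtype=float))
    v = np.ravel(np.asarray(v, dtype=float))
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def smallest_gaps(empirical_cos, lower_bounds, k=200):
    """Return the k smallest gaps (empirical - bound), sorted ascending."""
    gaps = np.asarray(empirical_cos, dtype=float) - np.asarray(lower_bounds, dtype=float)
    return np.sort(gaps)[:k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for per-sample bounds and measured similarities.
    bounds = rng.uniform(0.3, 0.5, size=10_000)
    measured = bounds + rng.uniform(0.0, 0.4, size=10_000)
    gaps = smallest_gaps(measured, bounds, k=200)
    print(gaps.min() >= 0.0, gaps[:5])
```

A negative entry in the returned array would indicate a sample whose empirical similarity fell below its certified bound.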

In Table 3, we also include the theoretical lower bound evaluated on ImageNet to show that the proposed method is also applicable to large-scale datasets. Since the current attribution attacks and attribution defense methods do not scale to large-scale datasets, we only include the non-robust model. Since our method does not rely on the second-order derivative of the output with respect to the input, it can be scaled to ImageNet-size datasets. For the experiments on ResNet-50, each certification for a single sample takes around 15 seconds. We observe that the reported bounds are also consistent with our theoretical findings.

6 Conclusion and limitations

Figure 2: The gap between theoretical bounds and empirical cosine similarity between original and perturbed attribution evaluated on CIFAR-10 using IGR. Panels: (a) $r = 1.5$; (b) $r = 2.0$; (c) $r = 2.5$.

In this paper, we attempt to use the uniformly smoothed attribution to certify the attribution robustness evaluated by cosine similarity. The smoothed attribution is constructed by taking the mean of the attributions computed from input samples augmented by noises uniformly sampled from an $\ell_2$ ball. It is proved that the cosine similarity between the original and perturbed smoothed attribution is lower-bounded based on a geometric formulation related to the volume of the hyperspherical cap. Alternative formulations are provided to find the maximum allowable size of perturbations and the minimum radius of smoothing in order to maintain the attribution robustness. The method works on bounded gradient-based attribution methods for all convolutional neural networks and is scalable to large datasets. We empirically demonstrate that the method can be used to certify the attribution robustness, using the well-bounded integrated gradients, and the state-of-the-art attributional robust models on MNIST, CIFAR-10 and ImageNet.

7 Limitations and broader impacts

The method in this paper can be applied to any convolutional neural network and any bounded attribution method. Although the existence of an upper bound has been shown for all gradient-based methods, in some extreme cases when the upper bound for a certain attribution is trivial, i.e., an extremely large value, the proposed lower bound for attribution robustness also becomes trivial. In future work, we will investigate the upper bound for other bounded attribution methods and provide the corresponding lower bounds for attribution robustness.

Our work attempts to draw the attention of the community to the need for a guarantee of attribution robustness. With the increasingly large number of applications of deep learning, the transparency and trustworthiness of neural networks are crucial for users to understand the outcomes and to avoid any abuse of the techniques. While the study of the security of networks could reveal potential risks that may be misused, we believe this work has more positive impacts on the community and can encourage the development of more trustworthy deep learning applications.

References

  • Antoniadi et al. [2021] Anna Markella Antoniadi, Yuhan Du, Yasmine Guendouz, Lan Wei, Claudia Mazo, Brett A Becker, and Catherine Mooney. Current challenges and future opportunities for xai in machine learning-based clinical decision support systems: a systematic review. Applied Sciences, 11(11):5088, 2021.
  • Atakishiyev et al. [2021] Shahin Atakishiyev, Mohammad Salameh, Hengshuai Yao, and Randy Goebel. Explainable artificial intelligence for autonomous driving: A comprehensive overview and field guide for future research directions. arXiv preprint arXiv:2112.11561, 2021.
  • Athalye et al. [2018] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International conference on machine learning, pp.  274–283. PMLR, 2018.
  • Bach et al. [2015] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
  • Boopathy et al. [2020] Akhilan Boopathy, Sijia Liu, Gaoyuan Zhang, Cynthia Liu, Pin-Yu Chen, Shiyu Chang, and Luca Daniel. Proper network interpretability helps adversarial robustness in classification. In International Conference on Machine Learning, pp.  1014–1023. PMLR, 2020.
  • Box & Muller [1958] George EP Box and Mervin E Muller. A note on the generation of random normal deviates. The annals of mathematical statistics, 29(2):610–611, 1958.
  • Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Burton et al. [2020] Simon Burton, Ibrahim Habli, Tom Lawton, John McDermid, Phillip Morgan, and Zoe Porter. Mind the gaps: Assuring the safety of autonomous systems from an engineering, ethical, and legal perspective. Artificial Intelligence, 279:103201, 2020.
  • Carlini & Wagner [2017] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp.  39–57. IEEE Computer Society, 2017.
  • Chen et al. [2019] Jiefeng Chen, Xi Wu, Vaibhav Rastogi, Yingyu Liang, and Somesh Jha. Robust attribution regularization. In Advances in Neural Information Processing Systems, pp.  14300–14310, 2019.
  • Cohen et al. [2019] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pp.  1310–1320. PMLR, 2019.
  • Das & Geisler [2021] Abhranil Das and Wilson S Geisler. A method to integrate and classify normal distributions. Journal of Vision, 21(10):1–1, 2021.
  • Davies [1980] Robert B Davies. The distribution of a linear combination of $\chi^2$ random variables. Journal of the Royal Statistical Society Series C: Applied Statistics, 29(3):323–333, 1980.
  • Dombrowski et al. [2019] Ann-Kathrin Dombrowski, Maximillian Alber, Christopher Anders, Marcel Ackermann, Klaus-Robert Müller, and Pan Kessel. Explanations can be manipulated and geometry is to blame. In Advances in Neural Information Processing Systems, pp.  13589–13600, 2019.
  • Doshi-Velez & Kim [2017] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
  • Du et al. [2022] Yuhan Du, Anthony R Rafferty, Fionnuala M McAuliffe, Lan Wei, and Catherine Mooney. An explainable machine learning-based clinical decision support system for prediction of gestational diabetes mellitus. Scientific Reports, 12(1):1170, 2022.
  • Duchesne & De Micheaux [2010] Pierre Duchesne and Pierre Lafaye De Micheaux. Computing the distribution of quadratic forms: Further comparisons between the liu–tang–zhang approximation and exact methods. Computational Statistics & Data Analysis, 54(4):858–862, 2010.
  • Feller [1991] William Feller. An introduction to probability theory and its applications, Volume 2, volume 81. John Wiley & Sons, 1991.
  • Ghorbani et al. [2019] Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp.  3681–3688, 2019.
  • Goodman & Flaxman [2017] Bryce Goodman and Seth Flaxman. European union regulations on algorithmic decision-making and a “right to explanation”. AI magazine, 38(3):50–57, 2017.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Hrinivich et al. [2023] William Thomas Hrinivich, Tonghe Wang, and Chunhao Wang. Interpretable and explainable machine learning models in oncology. Frontiers in Oncology, 13:1184428, 2023.
  • Ivankay et al. [2020] Adam Ivankay, Ivan Girardi, Chiara Marchiori, and Pascal Frossard. Far: A general framework for attributional robustness. arXiv preprint arXiv:2010.07393, 2020.
  • Krizhevsky [2009] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
  • Kumar & Goldstein [2021] Aounon Kumar and Tom Goldstein. Center smoothing: Certified robustness for networks with structured outputs. Advances in Neural Information Processing Systems, 34:5560–5575, 2021.
  • Kwon & Zou [2022] Yongchan Kwon and James Y Zou. Weightedshap: analyzing and improving shapley based feature attributions. Advances in Neural Information Processing Systems, 35:34363–34376, 2022.
  • LeCun et al. [2010] Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
  • Lecuyer et al. [2019] Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy (SP), pp.  656–672. IEEE, 2019.
  • Li [2010] Shengqiao Li. Concise formulas for the area and volume of a hyperspherical cap. Asian Journal of Mathematics & Statistics, 4(1):66–70, 2010.
  • Liu et al. [2018] Xuanqing Liu, Minhao Cheng, Huan Zhang, and Cho-Jui Hsieh. Towards robust neural networks via random self-ensemble. In Proceedings of the European Conference on Computer Vision (ECCV), pp.  369–385, 2018.
  • Lundberg & Lee [2017] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30, 2017.
  • Madry et al. [2018] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJzIBfZAb.
  • Paulavičius & Žilinskas [2006] Remigijus Paulavičius and Julius Žilinskas. Analysis of different norms and corresponding lipschitz constants for global optimization. Technological and Economic Development of Economy, 12(4):301–306, 2006.
  • Rao et al. [2022] Sukrut Rao, Moritz Böhle, and Bernt Schiele. Towards better understanding attribution methods. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10223–10232, 2022.
  • Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.
  • Salman et al. [2019] Hadi Salman, Jerry Li, Ilya Razenshteyn, Pengchuan Zhang, Huan Zhang, Sebastien Bubeck, and Greg Yang. Provably robust deep learning via adversarially trained smoothed classifiers. Advances in Neural Information Processing Systems, 32, 2019.
  • Sarkar et al. [2021] Anindya Sarkar, Anirban Sarkar, and Vineeth N Balasubramanian. Enhanced regularizers for attributional robustness. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp.  2532–2540, 2021.
  • Simonyan et al. [2014] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Workshop Track Proceedings, 2014.
  • Singh et al. [2020] Mayank Singh, Nupur Kumari, Puneet Mangla, Abhishek Sinha, Vineeth N Balasubramanian, and Balaji Krishnamurthy. Attributional robustness training using input-gradient spatial alignment. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16, pp.  515–533. Springer, 2020.
  • Smilkov et al. [2017] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
  • Srinivas & Fleuret [2019] Suraj Srinivas and François Fleuret. Full-gradient representation for neural network visualization. Advances in neural information processing systems, 32, 2019.
  • Sundararajan & Najmi [2020] Mukund Sundararajan and Amir Najmi. The many shapley values for model explanation. In International conference on machine learning, pp.  9269–9278. PMLR, 2020.
  • Sundararajan et al. [2017] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, pp.  3319–3328. PMLR, 2017.
  • Teng et al. [2020] Jiaye Teng, Guang-He Lee, and Yang Yuan. $\ell_1$ adversarial robustness certificates: a randomized smoothing approach, 2020. URL https://openreview.net/forum?id=H1lQIgrFDS.
  • Wang & Kong [2022] Fan Wang and Adams Wai-Kin Kong. Exploiting the relationship between kendall’s rank correlation and cosine similarity for attribution protection. Advances in Neural Information Processing Systems, 35:20580–20591, 2022.
  • Wang & Kong [2023] Fan Wang and Adams Wai-Kin Kong. A practical upper bound for the worst-case attribution deviations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  24616–24625, 2023.
  • Wong & Kolter [2018] Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pp.  5286–5295. PMLR, 2018.
  • Yang et al. [2020] Greg Yang, Tony Duan, J Edward Hu, Hadi Salman, Ilya Razenshteyn, and Jerry Li. Randomized smoothing of all shapes and sizes. In International Conference on Machine Learning, pp.  10693–10705. PMLR, 2020.
  • Yeh et al. [2019] Chih-Kuan Yeh, Cheng-Yu Hsieh, Arun Suggala, David I Inouye, and Pradeep K Ravikumar. On the (in) fidelity and sensitivity of explanations. Advances in Neural Information Processing Systems, 32, 2019.
  • Zhang et al. [2019] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pp.  7472–7482. PMLR, 2019.

Appendix A Proofs

Lemma 1.

(Paulavičius & Žilinskas [33]) For an $L$-Lipschitz function $f: \mathbb{R}^{d} \rightarrow \mathbb{R}$,

$$|f(\bm{x}) - f(\bm{y})| \leq L\|\bm{x} - \bm{y}\|_{q} \tag{11}$$

where $L = \max\left\{\|\nabla f(\bm{x})\|_{p} : \bm{x} \in S\right\}$ is the Lipschitz constant. Thus, $\|\nabla f(\bm{x})\|_{p} \leq L$.

Proof.

Refer to Paulavičius & Žilinskas [33] for the proof. ∎

See Theorem 1.

Proof.

As defined in Eqn. (2)

h(𝒙)=𝔼𝜼(𝟎;r)[g(𝒙+𝜼)]=1V𝒮𝜼(𝟎;r)g(𝒙+𝜼)𝑑𝜼𝒙subscript𝔼similar-to𝜼0𝑟delimited-[]𝑔𝒙𝜼1subscript𝑉𝒮subscriptsimilar-to𝜼0𝑟𝑔𝒙𝜼differential-d𝜼h(\bm{x})=\mathbb{E}_{\bm{\eta}\sim\mathcal{B}(\bm{0};r)}[g(\bm{x}+\bm{\eta})]% =\frac{1}{V_{\mathcal{S}}}\int_{\bm{\eta}\sim\mathcal{B}(\bm{0};r)}g(\bm{x}+% \bm{\eta})d\bm{\eta}italic_h ( bold_italic_x ) = blackboard_E start_POSTSUBSCRIPT bold_italic_η ∼ caligraphic_B ( bold_0 ; italic_r ) end_POSTSUBSCRIPT [ italic_g ( bold_italic_x + bold_italic_η ) ] = divide start_ARG 1 end_ARG start_ARG italic_V start_POSTSUBSCRIPT caligraphic_S end_POSTSUBSCRIPT end_ARG ∫ start_POSTSUBSCRIPT bold_italic_η ∼ caligraphic_B ( bold_0 ; italic_r ) end_POSTSUBSCRIPT italic_g ( bold_italic_x + bold_italic_η ) italic_d bold_italic_η (12)

where V𝒮subscript𝑉𝒮V_{\mathcal{S}}italic_V start_POSTSUBSCRIPT caligraphic_S end_POSTSUBSCRIPT is the volume of the psubscript𝑝\ell_{p}roman_ℓ start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT-ball with radius r𝑟ritalic_r. Similarly, let 𝒙~=𝒙+𝜹~𝒙𝒙𝜹\tilde{\bm{x}}=\bm{x}+\bm{\delta}over~ start_ARG bold_italic_x end_ARG = bold_italic_x + bold_italic_δ, where 𝜹d𝜹superscript𝑑\bm{\delta}\in\mathbb{R}^{d}bold_italic_δ ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT is a vector and 𝜹2ϵsubscriptnorm𝜹2italic-ϵ\|\bm{\delta}\|_{2}\leq\epsilon∥ bold_italic_δ ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ≤ italic_ϵ. Then, we have

h(𝒙~)=1V𝒮𝜼(𝟎;r)g(𝒙~+𝜼)𝑑𝜼~𝒙1subscript𝑉𝒮subscriptsimilar-to𝜼0𝑟𝑔~𝒙𝜼differential-d𝜼h(\tilde{\bm{x}})=\frac{1}{V_{\mathcal{S}}}\int_{\bm{\eta}\sim\mathcal{B}(\bm{% 0};r)}g(\tilde{\bm{x}}+\bm{\eta})d\bm{\eta}italic_h ( over~ start_ARG bold_italic_x end_ARG ) = divide start_ARG 1 end_ARG start_ARG italic_V start_POSTSUBSCRIPT caligraphic_S end_POSTSUBSCRIPT end_ARG ∫ start_POSTSUBSCRIPT bold_italic_η ∼ caligraphic_B ( bold_0 ; italic_r ) end_POSTSUBSCRIPT italic_g ( over~ start_ARG bold_italic_x end_ARG + bold_italic_η ) italic_d bold_italic_η (13)

We note that when 𝜼(𝟎;r)similar-to𝜼0𝑟\bm{\eta}\sim\mathcal{B}(\bm{0};r)bold_italic_η ∼ caligraphic_B ( bold_0 ; italic_r ), 𝒙+𝜼(𝒙;r)similar-to𝒙𝜼𝒙𝑟\bm{x}+\bm{\eta}\sim\mathcal{B}(\bm{x};r)bold_italic_x + bold_italic_η ∼ caligraphic_B ( bold_italic_x ; italic_r ) and 𝒙~+𝜼(𝒙~;r)similar-to~𝒙𝜼~𝒙𝑟\tilde{\bm{x}}+\bm{\eta}\sim\mathcal{B}(\tilde{\bm{x}};r)over~ start_ARG bold_italic_x end_ARG + bold_italic_η ∼ caligraphic_B ( over~ start_ARG bold_italic_x end_ARG ; italic_r ). We then rewrite h(𝒙)𝒙h(\bm{x})italic_h ( bold_italic_x ) and h(𝒙~)~𝒙h(\tilde{\bm{x}})italic_h ( over~ start_ARG bold_italic_x end_ARG ) as follows:

$$h(\bm{x})=\underbrace{\frac{1}{V_{\mathcal{S}}}\int_{\bm{x}\sim\mathcal{B}(\bm{x};r)\setminus\mathcal{B}(\tilde{\bm{x}};r)}g(\bm{x})\,d\bm{x}}_{R_{1}}+\underbrace{\frac{1}{V_{\mathcal{S}}}\int_{\bm{x}\sim\mathcal{B}(\tilde{\bm{x}};r)\cap\mathcal{B}(\bm{x};r)}g(\bm{x})\,d\bm{x}}_{R_{2}} \tag{14}$$

and

$$h(\tilde{\bm{x}})=\underbrace{\frac{1}{V_{\mathcal{S}}}\int_{\bm{x}\sim\mathcal{B}(\tilde{\bm{x}};r)\cap\mathcal{B}(\bm{x};r)}g(\bm{x})\,d\bm{x}}_{R_{2}}+\underbrace{\frac{1}{V_{\mathcal{S}}}\int_{\bm{x}\sim\mathcal{B}(\tilde{\bm{x}};r)\setminus\mathcal{B}(\bm{x};r)}g(\bm{x})\,d\bm{x}}_{R_{3}} \tag{15}$$

Hence,

$$h(\tilde{\bm{x}})=h(\bm{x})-R_{1}+R_{3} \tag{16}$$

Denote $a\bm{v}=R_{3}-R_{1}$, where $\bm{v}$ is a unit vector in the same direction as $R_{3}-R_{1}$ and $a=\|R_{3}-R_{1}\|_{2}$ is a scalar equal to the magnitude of $R_{3}-R_{1}$. Then, we have

$$\cos(h(\bm{x}),h(\tilde{\bm{x}}))=\frac{h(\bm{x})^{\top}}{\|h(\bm{x})\|_{2}}\left(\frac{h(\bm{x})+a\bm{v}}{\|h(\bm{x})+a\bm{v}\|_{2}}\right) \tag{17}$$

Note that the attribution $g(\bm{x})$ is bounded in norm, i.e., $\|g(\bm{x})\|_{2}\leq M$ for some constant $M$. Thus, we can derive that

\begin{align}
a &= \|R_{3}-R_{1}\|_{2} \tag{18}\\
&= \left\|\frac{1}{V_{\mathcal{S}}}\left(\int_{\bm{x}\sim\mathcal{B}(\tilde{\bm{x}};r)\setminus\mathcal{B}(\bm{x};r)}g(\bm{x})\,d\bm{x}-\int_{\bm{x}\sim\mathcal{B}(\bm{x};r)\setminus\mathcal{B}(\tilde{\bm{x}};r)}g(\bm{x})\,d\bm{x}\right)\right\|_{2} \tag{19}\\
&\leq \frac{1}{V_{\mathcal{S}}}\left(\left\|\int_{\bm{x}\sim\mathcal{B}(\tilde{\bm{x}};r)\setminus\mathcal{B}(\bm{x};r)}g(\bm{x})\,d\bm{x}\right\|_{2}+\left\|\int_{\bm{x}\sim\mathcal{B}(\bm{x};r)\setminus\mathcal{B}(\tilde{\bm{x}};r)}g(\bm{x})\,d\bm{x}\right\|_{2}\right) \tag{20}\\
&\leq \frac{1}{V_{\mathcal{S}}}\left(\int_{\bm{x}\sim\mathcal{B}(\tilde{\bm{x}};r)\setminus\mathcal{B}(\bm{x};r)}\left\|g(\bm{x})\right\|_{2}d\bm{x}+\int_{\bm{x}\sim\mathcal{B}(\bm{x};r)\setminus\mathcal{B}(\tilde{\bm{x}};r)}\left\|g(\bm{x})\right\|_{2}d\bm{x}\right) \tag{21}\\
&\leq \frac{1}{V_{\mathcal{S}}}\left(\int_{\bm{x}\sim\mathcal{B}(\tilde{\bm{x}};r)\setminus\mathcal{B}(\bm{x};r)}M\,d\bm{x}+\int_{\bm{x}\sim\mathcal{B}(\bm{x};r)\setminus\mathcal{B}(\tilde{\bm{x}};r)}M\,d\bm{x}\right) \tag{22}\\
&= M\times\frac{V_{(\mathcal{B}(\bm{x};r)\setminus\mathcal{B}(\tilde{\bm{x}};r))\cup(\mathcal{B}(\tilde{\bm{x}};r)\setminus\mathcal{B}(\bm{x};r))}}{V_{\mathcal{S}}}=M\frac{V_{U}}{V_{\mathcal{S}}} \tag{23}
\end{align}
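The ratio $V_{U}/V_{\mathcal{S}}$ in Eq. (23) is the volume of the symmetric difference of the two balls relative to the volume of one ball. As a rough sanity check, it can be estimated by sampling. The sketch below assumes $\ell_{2}$-balls and uses hypothetical helper names (`sample_uniform_l2_ball`, `symmetric_difference_ratio`) that are not part of the paper; it is an illustration, not the authors' implementation.

```python
import math
import random


def sample_uniform_l2_ball(center, r, rng):
    """Draw one point uniformly from the l2-ball B(center; r)."""
    d = len(center)
    # A standard Gaussian vector has a uniformly distributed direction.
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(u * u for u in direction))
    # Scaling the radius by U^(1/d) makes the point uniform in volume.
    radius = r * rng.random() ** (1.0 / d)
    return [c + radius * u / norm for c, u in zip(center, direction)]


def symmetric_difference_ratio(x, x_tilde, r, n_samples=20000, seed=0):
    """Monte Carlo estimate of V_U / V_S for two l2-balls of radius r."""
    rng = random.Random(seed)
    outside = 0
    for _ in range(n_samples):
        z = sample_uniform_l2_ball(x, r, rng)
        dist_sq = sum((zi - ti) ** 2 for zi, ti in zip(z, x_tilde))
        if dist_sq > r * r:
            outside += 1
    # Both one-sided differences have the same volume by symmetry, so
    # V_U / V_S is twice the fraction of B(x; r) lying outside B(x_tilde; r).
    return 2.0 * outside / n_samples
```

In one dimension the balls are intervals, so for $\epsilon<2r$ the ratio should be close to $\epsilon/r$, which gives a quick way to check the estimator.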

Thus, the lower bound of $\cos(h(\bm{x}),h(\tilde{\bm{x}}))$ can be found by solving the following optimization problem, where $\|\cdot\|$ here and in the following content denotes the $\ell_{2}$-norm unless otherwise specified.

\begin{align}
\min_{\bm{v}}\quad & \frac{h(\bm{x})^{\top}}{\|h(\bm{x})\|}\left(\frac{h(\bm{x})+a\bm{v}}{\|h(\bm{x})+a\bm{v}\|}\right) \tag{24}\\
\text{s.t.}\quad & \|\bm{v}\|=1 \notag\\
& a\leq M\frac{V_{U}}{V_{\mathcal{S}}} \notag
\end{align}

Since $h(\bm{x})$ and $h(\tilde{\bm{x}})$ form a spherical cone, we can decompose $\bm{v}$ as $\bm{v}=\cos\theta\,\bm{v}_{\parallel}+\sin\theta\,\bm{v}_{\perp}$, where $\bm{v}_{\parallel}$ and $\bm{v}_{\perp}$ are two orthogonal unit vectors such that $h^{\top}(\bm{x})\bm{v}_{\perp}=0$ and $\bm{v}_{\parallel}=h(\bm{x})/\|h(\bm{x})\|$. Then, the optimization problem can be rewritten as

\begin{align}
& \min\quad\bm{v}_{\parallel}^{\top}\left(\frac{h(\bm{x})+a(\cos\theta\,\bm{v}_{\parallel}+\sin\theta\,\bm{v}_{\perp})}{\|h(\bm{x})+a(\cos\theta\,\bm{v}_{\parallel}+\sin\theta\,\bm{v}_{\perp})\|}\right) \tag{25}\\
\Rightarrow\ & \min\quad\bm{v}_{\parallel}^{\top}\left(\frac{\|h(\bm{x})\|\bm{v}_{\parallel}+a(\cos\theta\,\bm{v}_{\parallel}+\sin\theta\,\bm{v}_{\perp})}{\big\|\|h(\bm{x})\|\bm{v}_{\parallel}+a(\cos\theta\,\bm{v}_{\parallel}+\sin\theta\,\bm{v}_{\perp})\big\|}\right) \tag{26}\\
\Rightarrow\ & \min\quad\frac{(\|h(\bm{x})\|+a\cos\theta)\bm{v}_{\parallel}^{\top}\bm{v}_{\parallel}+a\sin\theta\,\bm{v}_{\parallel}^{\top}\bm{v}_{\perp}}{\sqrt{(\|h(\bm{x})\|+a\cos\theta)^{2}\bm{v}_{\parallel}^{\top}\bm{v}_{\parallel}+(a\sin\theta)^{2}\bm{v}_{\perp}^{\top}\bm{v}_{\perp}}} \tag{27}\\
\Rightarrow\ & \min\quad\frac{\|h(\bm{x})\|+a\cos\theta}{\sqrt{(\|h(\bm{x})\|+a\cos\theta)^{2}+(a\sin\theta)^{2}}} \tag{28}
\end{align}

Since $h(\bm{x})$ is known for a given sample, the optimization problem can be written as follows by taking $\|h(\bm{x})\|=c$:

\begin{align}
\min\quad & \frac{c+a\cos\theta}{\sqrt{(c+a\cos\theta)^{2}+(a\sin\theta)^{2}}} \tag{29}\\
\text{s.t.}\quad & a\leq M\frac{V_{U}}{V_{\mathcal{S}}} \notag
\end{align}

We now consider the Lagrange function of the optimization problem:

$$\mathcal{L}(a,\theta,\lambda)=\frac{c+a\cos\theta}{\sqrt{(c+a\cos\theta)^{2}+(a\sin\theta)^{2}}}-\lambda\left(a-M\frac{V_{U}}{V_{\mathcal{S}}}\right) \tag{30}$$

Taking the derivative of $\mathcal{L}$ with respect to $a$ and $\theta$ and setting them to zero, we have

$$\frac{\partial}{\partial a}\mathcal{L}=\frac{1}{T^{2}}\left(T\cos\theta-\frac{1}{T}\left(c\cos\theta+2a\right)\times(c+a\cos\theta)\right)-\lambda=0 \tag{31}$$

and

$$\frac{\partial}{\partial\theta}\mathcal{L}=\frac{1}{T^{2}}\left(-a\sin\theta\cdot T+\frac{1}{T}\left(c^{2}a\sin\theta+ca^{2}\sin\theta\cos\theta\right)\right)=0 \tag{32}$$

where $T=\sqrt{(c+a\cos\theta)^{2}+(a\sin\theta)^{2}}$. Solving the above equations, we have

$$\cos\theta=0\quad\text{or}\quad a=0 \tag{33}$$

where $a=0$ attains the maximum and $\cos\theta=0$ attains the minimum. Therefore, the lower bound of $\cos(h(\bm{x}),h(\tilde{\bm{x}}))$ is

$$\cos(h(\bm{x}),h(\tilde{\bm{x}}))\geq\frac{c}{\sqrt{c^{2}+\left(M\frac{V_{U}}{V_{\mathcal{S}}}\right)^{2}}}=\frac{\|h(\bm{x})\|}{\sqrt{\|h(\bm{x})\|^{2}+(MV_{U}/V_{\mathcal{S}})^{2}}} \tag{34}$$
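To make Eq. (34) concrete, the following minimal sketch evaluates the right-hand side for given values of $\|h(\bm{x})\|$, $M$ and $V_{U}/V_{\mathcal{S}}$. The function name `cosine_lower_bound` and its argument names are illustrative assumptions, not identifiers from the paper.

```python
import math


def cosine_lower_bound(h_norm, M, volume_ratio):
    """Evaluate the right-hand side of Eq. (34).

    h_norm       -- l2-norm of the smoothed attribution h(x)
    M            -- upper bound on the l2-norm of the attribution g
    volume_ratio -- V_U / V_S, symmetric-difference volume over ball volume
    """
    if h_norm <= 0:
        raise ValueError("h_norm must be positive")
    a_max = M * volume_ratio  # upper bound on a from Eq. (23)
    return h_norm / math.sqrt(h_norm ** 2 + a_max ** 2)
```

For instance, with $\|h(\bm{x})\|=3$, $M=2$ and $V_{U}/V_{\mathcal{S}}=2$, the bound is $3/\sqrt{9+16}=0.6$; when the two balls coincide ($V_{U}=0$) the bound is 1.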

See Corollary 1.

Proof.

Corollary 1 can be obtained by fixing $T$ and taking $r$ as unknown, and fixing $T$ and taking $\epsilon$ as unknown, respectively. We can first derive that

$$I_{(2rh-h^{2})/r^{2}}\left(\frac{d+1}{2},\frac{1}{2}\right)=1-\frac{\|h(\bm{x})\|_{2}}{2M}\sqrt{\frac{1}{T^{2}}-1}=Z \tag{35}$$

Using the inverse of the regularized incomplete beta function, i.e., $x=I^{-1}_{y}(a,b)$, and $h=r-\epsilon/2$, we have

$$I^{-1}_{Z}\left(\frac{d+1}{2},\frac{1}{2}\right)=(2rh-h^{2})/r^{2}=1-\frac{\epsilon^{2}}{4r^{2}} \tag{36}$$

The results in Corollary 1 then follow by solving Eq. (36) for $r$ and $\epsilon$, respectively. ∎
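As a worked rearrangement of Eq. (36), write $x^{\star}=I^{-1}_{Z}\left(\frac{d+1}{2},\frac{1}{2}\right)$ (the symbol $x^{\star}$ is introduced here only for illustration) and treat $Z$ as a fixed, known quantity. Under that assumption, elementary algebra gives:

```latex
\begin{align*}
1-\frac{\epsilon^{2}}{4r^{2}} = x^{\star}
\;&\Longleftrightarrow\;
\epsilon^{2} = 4r^{2}\,(1-x^{\star}) \\
\;&\Longrightarrow\;
\epsilon = 2r\sqrt{1-x^{\star}}
\quad\text{and}\quad
r = \frac{\epsilon}{2\sqrt{1-x^{\star}}}
\qquad (0 \le x^{\star} < 1,\ \epsilon, r > 0).
\end{align*}
```

This is only the algebraic step; the exact statement and conditions are those of Corollary 1 in the main text.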

Appendix B Implementation details

In the experiments, we implemented the $\ell_{2}$ attribution attack adapted from Ghorbani et al. [19]. The attack uses the top-$k$ intersection as its loss function. Following previous works, we choose $k=100$ for MNIST and $k=1000$ for CIFAR-10. The PGD-like attack runs for 200 iterations with a step size of 0.1. As mentioned in the main content, we do not implement the attack on ImageNet since attribution attacks do not scale to large images. In the remainder of this section, we provide more details of the evaluations in the experiments.

B.1 Attribution methods

We used saliency maps (SM) and integrated gradients (IG) in the evaluation sections. These two methods are defined as follows:

  • Saliency maps: $\text{SM}(\bm{x})=\frac{\partial f(\bm{x})}{\partial\bm{x}}$.

  • Integrated gradients: $\text{IG}(\bm{x})=(\bm{x}-\bm{x}^{\prime})\times\int_{\alpha=0}^{1}\frac{\partial f(\bm{x}^{\prime}+\alpha(\bm{x}-\bm{x}^{\prime}))}{\partial\bm{x}}d\alpha$.

The SmoothSM and SmoothIG are the smoothed versions of SM and IG, respectively.
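The two definitions above can be illustrated on a toy function. The sketch below approximates the IG integral with a midpoint Riemann sum; the gradient callable `grad_f`, the toy function $f(\bm{z})=\sum_{i}z_{i}^{2}$ and the number of steps are assumptions chosen for illustration, and real experiments would obtain gradients from the network through automatic differentiation.

```python
def integrated_gradients(grad_f, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of IG(x) with baseline x'."""
    d = len(x)
    avg_grad = [0.0] * d
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(d):
            avg_grad[i] += g[i] / steps
    # Element-wise product of (x - x') with the path-averaged gradient.
    return [(xi - b) * gi for xi, b, gi in zip(x, baseline, avg_grad)]


def toy_grad(z):
    """Gradient of the hypothetical toy function f(z) = sum_i z_i^2."""
    return [2.0 * zi for zi in z]


x = [1.0, 2.0]
saliency = toy_grad(x)  # SM(x) is simply the gradient at x
ig = integrated_gradients(toy_grad, x, baseline=[0.0, 0.0])
```

For this toy function the attributions sum to $f(\bm{x})-f(\bm{x}^{\prime})$, which is the completeness property of integrated gradients and a convenient check on the approximation.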

B.2 Evaluation metrics

Given the original attribution $g(\bm{x})$ and the perturbed attribution $g(\tilde{\bm{x}})$, we use top-$k$ intersection, Kendall’s rank correlation [19] and cosine similarity [45] to evaluate their differences.

  • Top-$k$ intersection measures the proportion of the $k$ largest features that overlap between $g(\bm{x})$ and $g(\tilde{\bm{x}})$.

  • Kendall’s rank correlation measures the proportion of pairs of features that have the same order in $g(\bm{x})$ and $g(\tilde{\bm{x}})$: $\frac{2}{d(d-1)}\sum_{i=1}^{d}\sum_{j=i+1}^{d}\mathbf{1}_{\{g(\bm{x})_{i}>g(\bm{x})_{j}\}}\mathbf{1}_{\{g(\tilde{\bm{x}})_{i}>g(\tilde{\bm{x}})_{j}\}}$.

  • Cosine similarity measures the cosine of the angle between $g(\bm{x})$ and $g(\tilde{\bm{x}})$: $\frac{g(\bm{x})^{\top}g(\tilde{\bm{x}})}{\|g(\bm{x})\|\|g(\tilde{\bm{x}})\|}$.
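The three metrics can be sketched in a few lines for attributions stored as flat lists. The function names are illustrative. Two assumptions are made: "largest" is taken by signed value rather than magnitude, and the pairwise metric follows the prose description by counting a pair as agreeing when its order matches in either direction, whereas the displayed indicator product counts only pairs with $g_{i}>g_{j}$ in both attributions.

```python
import math


def top_k_intersection(g, g_tilde, k):
    """Fraction of the k largest features shared by both attributions."""
    top = set(sorted(range(len(g)), key=lambda i: g[i], reverse=True)[:k])
    top_tilde = set(
        sorted(range(len(g_tilde)), key=lambda i: g_tilde[i], reverse=True)[:k]
    )
    return len(top & top_tilde) / k


def same_order_fraction(g, g_tilde):
    """Fraction of feature pairs (i < j) ordered the same way in both."""
    d = len(g)
    agree = 0
    for i in range(d):
        for j in range(i + 1, d):
            if (g[i] - g[j]) * (g_tilde[i] - g_tilde[j]) > 0:
                agree += 1
    return 2.0 * agree / (d * (d - 1))


def cosine_similarity(g, g_tilde):
    """Cosine of the angle between the two attribution vectors."""
    dot = sum(a * b for a, b in zip(g, g_tilde))
    norm_g = math.sqrt(sum(a * a for a in g))
    norm_t = math.sqrt(sum(b * b for b in g_tilde))
    return dot / (norm_g * norm_t)
```

The pairwise loop is quadratic in $d$, so for image-sized attributions a library routine for rank correlation would likely be preferable.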

B.3 Baseline methods

We compare against the following adversarially and attributionally robust models:

IG-NORM [10]

$$\text{CE}(f(\bm{x}),y)+\lambda\max_{\tilde{\bm{x}}\in\mathcal{B}_{\varepsilon}(\bm{x})}\|\text{IG}(\bm{x},\tilde{\bm{x}})\|_{1} \tag{37}$$

TRADES [50]

$$\text{CE}(f(\tilde{\bm{x}}),y)+\beta\,\text{KL}(f(\bm{x})\|f(\tilde{\bm{x}})) \tag{38}$$

IGR [45]

$$\text{CE}(f(\tilde{\bm{x}}),y)+\beta\,\text{KL}(f(\bm{x})\|f(\tilde{\bm{x}}))+\lambda\left(1-\cos(\text{IG}(\bm{x}),\text{IG}(\tilde{\bm{x}}))\right) \tag{39}$$

Here CE denotes the cross-entropy loss and KL denotes the Kullback-Leibler divergence.
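As a rough illustration of how the TRADES and IGR objectives in Eqs. (38) and (39) combine their terms, the sketch below evaluates them for a single sample. It assumes that $f$ outputs a probability vector and that the adversarial input and both IG vectors have already been computed; all function and argument names are hypothetical. The IG-NORM objective in Eq. (37) is omitted because it requires an inner maximization over $\tilde{\bm{x}}$.

```python
import math


def cross_entropy(probs, y):
    """CE for one sample, given class probabilities and the true label y."""
    return -math.log(probs[y])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def trades_loss(p_clean, p_adv, y, beta):
    """Eq. (38): CE on the perturbed input plus a KL consistency term."""
    return cross_entropy(p_adv, y) + beta * kl_divergence(p_clean, p_adv)


def igr_loss(p_clean, p_adv, y, ig_clean, ig_adv, beta, lam):
    """Eq. (39): TRADES loss plus a cosine penalty between IG vectors."""
    return trades_loss(p_clean, p_adv, y, beta) + lam * (1.0 - cosine(ig_clean, ig_adv))
```

When the clean and perturbed predictions coincide, the KL term vanishes and the IGR loss differs from the cross-entropy only through the attribution penalty.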

Appendix C Additional experiments

C.1 Test on Monte Carlo estimation

Note that the bound given by Theorem 1 is deterministic. In this section, we provide a probabilistic bound for the attribution robustness. Specifically, we want to find the value of $t$ such that $\Pr(T\leq t)=1-\alpha$, where $T$ is defined in Eqn. (6) and $\alpha$ is the significance level. Recall that $T$ is defined as follows:

$$T=\frac{\|h(\bm{x})\|_{2}}{\sqrt{\|h(\bm{x})\|_{2}^{2}+c}} \tag{40}$$

where $c=M^{2}V_{U}^{2}/V_{\mathcal{S}}^{2}$. If we denote $Q=\|h(\bm{x})\|_{2}$, then we have

$$\Pr(T\leq t)=\Pr\left(\frac{Q}{\sqrt{Q^{2}+c}}\leq t\right)=\Pr\left(Q^{2}\leq\frac{ct^{2}}{1-t^{2}}\right) \tag{41}$$
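The second equality in Eq. (41) follows from a short rearrangement, spelled out here under the assumptions $Q\geq 0$, $c>0$ and $0<t<1$ (so both sides of the first inequality are non-negative and squaring preserves the order):

```latex
\begin{align*}
\frac{Q}{\sqrt{Q^{2}+c}} \le t
\;&\Longleftrightarrow\;
Q^{2} \le t^{2}\,(Q^{2}+c) \\
\;&\Longleftrightarrow\;
Q^{2}\,(1-t^{2}) \le c\,t^{2} \\
\;&\Longleftrightarrow\;
Q^{2} \le \frac{c\,t^{2}}{1-t^{2}}.
\end{align*}
```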

Note that we use Monte Carlo integration to compute the integral in $h(\bm{x})$, which estimates $h(\bm{x})$ by sampling $\bm{\eta}$ from $\mathcal{B}$, i.e.,

$$\hat{h}(\bm{x})=\frac{1}{N}\sum_{i=1}^{N}g(\bm{x}+\bm{\eta}_{i}),\quad\bm{\eta}_{i}\sim\mathcal{B}. \tag{42}$$
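
The estimator in Eqn. (42) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes that $\mathcal{B}$ is the uniform distribution over an $\ell_2$ ball of radius `radius`, and the attribution function `g`, the weights `w`, and the helper names are hypothetical stand-ins. The per-coordinate sample variance divided by $N$ is returned as well, since it is what the diagonal covariance of the estimator is built from.

```python
import numpy as np

def sample_uniform_l2_ball(n, dim, radius, rng):
    """Draw n points uniformly from the l2 ball of the given radius."""
    directions = rng.standard_normal((n, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Radii are distributed as radius * U^(1/dim) so that volume is uniform.
    radii = radius * rng.random(n) ** (1.0 / dim)
    return directions * radii[:, None]

def smoothed_attribution(g, x, radius, n_samples, rng):
    """Monte Carlo estimate of h(x) and the per-coordinate variance / N."""
    etas = sample_uniform_l2_ball(n_samples, x.size, radius, rng)
    values = np.stack([g(x + eta) for eta in etas])
    h_hat = values.mean(axis=0)
    d_diag = values.var(axis=0, ddof=1) / n_samples
    return h_hat, d_diag

# Toy attribution (hypothetical): elementwise product of input and fixed weights.
rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5])
g = lambda z: w * z
x = np.array([0.3, 0.1, -0.4])
h_hat, d_diag = smoothed_attribution(g, x, radius=1.0, n_samples=20000, rng=rng)
print(h_hat, d_diag)
```

For this linear toy `g`, the exact smoothed attribution equals `w * x` because the noise has zero mean, so the estimate can be compared against a known value.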

Note that $\hat{h}(\bm{x})$ is an unbiased estimator of $h(\bm{x})$, i.e. $\mathbb{E}[\hat{h}(\bm{x})]=h(\bm{x})$. The estimator almost surely converges to $h(\bm{x})$ as $N\rightarrow\infty$, i.e. $\lim_{N\rightarrow\infty}\hat{h}(\bm{x})=h(\bm{x})$ almost surely. By the Central Limit Theorem, the estimator $\hat{h}(\bm{x})$ has the following asymptotic distribution,

$$\hat{h}(\bm{x})\stackrel{a.s.}{\sim}\mathcal{N}(h(\bm{x}),D), \tag{43}$$

where the covariance matrix $D=\text{diag}(\sigma_{ii}^{2}/N)$ can be estimated by the empirical variances of $g(\bm{x}+\bm{\eta}_{i})$. Thus, the quadratic form $Q^{2}=\|h(\bm{x})\|_{2}^{2}$, when evaluated with the estimator $\hat{h}(\bm{x})$, can be seen as generalized chi-square distributed. We can derive the cumulative distribution function of the Monte Carlo estimator $T_{MC}$ at $t$ as the cumulative distribution function of the generalized chi-square distribution at $\frac{ct^{2}}{1-t^{2}}$, i.e.,

$$Pr(T_{MC}\leq t)=F\left(\frac{ct^{2}}{1-t^{2}}\right), \tag{44}$$

where $F$ is the cumulative distribution function of the generalized chi-square distribution constructed from the quadratic form of a Gaussian random variable with mean $h(\bm{x})$ and covariance $D$ [13, 12]. In this work, we use the R package CompQuadForm [17] to compute the cumulative distribution function. For any fixed image sample $\bm{x}$, we can verify that $t_{2}-t_{1}$ is close to $0$ when $Pr(t_{1}\leq T_{MC}\leq t_{2})=1-\alpha$ by solving Eqn. (45) below. For small $\alpha=0.01$ and the number of samples $N=100{,}000$, we found that the values of $t_{2}-t_{1}$ are on the scale of $10^{-4}$ for MNIST and CIFAR-10, and $10^{-3}$ for ImageNet, calculated by choosing $10{,}000$ samples from each dataset. This confirms that the error introduced by the Monte Carlo integration is minute and that the probabilistic bound is close to the deterministic bound.

$$F\left(\frac{ct_{2}^{2}}{1-t_{2}^{2}}\right)=1-\alpha/2\quad\text{and}\quad F\left(\frac{ct_{1}^{2}}{1-t_{1}^{2}}\right)=\alpha/2. \tag{45}$$
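
Solving Eqn. (45) amounts to finding the $\alpha/2$ and $1-\alpha/2$ quantiles $q$ of $Q^{2}$ and inverting $q=ct^{2}/(1-t^{2})$, which gives $t=\sqrt{q/(q+c)}$. The sketch below illustrates this in Python. It is not the procedure used in the paper: instead of the exact generalized chi-square CDF from CompQuadForm, it approximates the quantiles by simulating the Gaussian quadratic form, and the function name `t_interval` and all numbers are hypothetical.

```python
import numpy as np

def t_interval(mean, d_diag, c, alpha=0.01, n_draws=20000, seed=0):
    """Approximate (t1, t2) of Eqn. (45) by simulating Q^2 = ||z||^2
    with z ~ N(mean, diag(d_diag)) and mapping quantiles q to sqrt(q / (q + c))."""
    rng = np.random.default_rng(seed)
    z = mean + np.sqrt(d_diag) * rng.standard_normal((n_draws, mean.size))
    q_sq = np.sum(z ** 2, axis=1)
    q_lo, q_hi = np.quantile(q_sq, [alpha / 2, 1 - alpha / 2])
    to_t = lambda q: float(np.sqrt(q / (q + c)))
    return to_t(q_lo), to_t(q_hi)

# Hypothetical values, for illustration only.
mean = np.array([0.3, -0.2, 0.1])   # stands in for h(x)
d_diag = np.full(3, 1e-6)           # stands in for the diagonal of D
c = 0.05
t1, t2 = t_interval(mean, d_diag, c)
print(t1, t2, t2 - t1)
```

With a small variance, as produced by a large $N$, the interval $[t_{1},t_{2}]$ collapses around the value of $T$ computed from the mean, which is the behaviour the paragraph above reports.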

Figure 3: Additional visualization of the attribution maps: (a) original image, (b) IG, (c) Gaussian smoothed IG, and (d) uniformly smoothed IG.

C.2 Additional visualization of the uniformly smoothed attributions

In Figure 1 (left), we showed that the uniformly smoothed attributions are of comparable quality to the original attributions. More examples are provided in Figure 3 to illustrate the quality of the uniformly smoothed attributions.

C.3 Evaluation of Center Smoothing [25] on attributions

Table 4: Evaluation of center smoothing on attributions

| $\epsilon_{1}$ | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 |
|---|---|---|---|---|---|
| SmoothSM | 1.207 | 1.729 | 1.843 | 1.907 | 1.998 |

To compare the performance with center smoothing [25], we also implemented that method to evaluate the certification of attributions. Specifically, we compute the bound for SmoothSM on IG-NORM using MNIST, and follow the same setting by choosing $h=1$ and $\epsilon_{1}=0.1,0.2,\cdots,0.5$. Cosine similarity cannot be used directly in the method since it does not satisfy the triangle inequality. Following the relaxation method in Sec. 4 of Kumar & Goldstein [25], a multiplier $\gamma=2$ is added. In addition, we use $1-\cos\theta$ as a distance metric in place of the similarity metric. The results are shown in Table 4. It can be observed that the upper bound for $1-\cos\theta$ is greater than $1$ for all choices of $\epsilon_{1}$. Such a bound is trivially satisfied, since we only consider $\cos\theta\in[0,1]$ and hence $1-\cos\theta\leq 1$. Thus, the upper bound provided in the aforementioned work can be too loose in our setting.
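
The need for the multiplier $\gamma=2$ can be seen from a small numerical example. The sketch below uses three hypothetical 2D vectors at angles of 0, 45 and 90 degrees. It shows that $1-\cos\theta$ violates the triangle inequality for this triple, while the relaxed inequality with $\gamma=2$ holds for it. It illustrates only this one triple and is not part of the evaluation.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(theta) between two nonzero vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical vectors at 0, 45 and 90 degrees.
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 1.0])

d_ac = cosine_distance(a, c)
d_ab_bc = cosine_distance(a, b) + cosine_distance(b, c)

violates_triangle = d_ac > d_ab_bc          # plain triangle inequality fails
holds_with_gamma = d_ac <= 2.0 * d_ab_bc    # relaxed version with gamma = 2
print(d_ac, d_ab_bc, violates_triangle, holds_with_gamma)
```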

C.4 Evaluation of alternative formulations

In Section 4.3, we introduced two alternative formulations of the proposed method that can be applied in specific scenarios. In this section, we provide additional information to report the experiments on these two formulations.

Table 5: Maximum allowable perturbation size for different thresholds ($T=0.8$ and $T=0.9$) under various choices of $\ell_{2}$ smoothing radius $r$, evaluated on MNIST.

| Threshold | Model | $r=0.5$ | $r=1.0$ | $r=1.5$ | $r=2.0$ | $r=2.5$ | $r=3.0$ | $r=3.5$ |
|---|---|---|---|---|---|---|---|---|
| $T=0.9$ | Standard | 0.0389 | 0.0951 | 0.1550 | 0.2164 | 0.2783 | 0.3404 | 0.4029 |
| $T=0.9$ | IG-NORM | 0.0394 | 0.0957 | 0.1557 | 0.2170 | 0.2790 | 0.3420 | 0.4067 |
| $T=0.9$ | IGR | 0.0390 | 0.0952 | 0.1552 | 0.2174 | 0.2818 | 0.3477 | 0.4163 |
| $T=0.8$ | Standard | 0.0447 | 0.1051 | 0.1691 | 0.2345 | 0.3004 | 0.3664 | 0.4329 |
| $T=0.8$ | IG-NORM | 0.0448 | 0.1052 | 0.1692 | 0.2354 | 0.3037 | 0.3733 | 0.4456 |
| $T=0.8$ | IGR | 0.0452 | 0.1057 | 0.1697 | 0.2350 | 0.3010 | 0.3680 | 0.4365 |

Table 6: Maximum allowable perturbation size for different thresholds ($T=0.8$ and $T=0.9$) under various choices of $\ell_{2}$ smoothing radius $r$, evaluated on CIFAR-10.

| Threshold | Model | $r=0.5$ | $r=1.0$ | $r=1.5$ | $r=2.0$ | $r=2.5$ | $r=3.0$ | $r=3.5$ |
|---|---|---|---|---|---|---|---|---|
| $T=0.9$ | Standard | 0.0086 | 0.0469 | 0.0885 | 0.1322 | 0.1773 | 0.2222 | 0.2683 |
| $T=0.9$ | IG-NORM | 0.0323 | 0.0705 | 0.1104 | 0.1510 | 0.1923 | 0.2337 | 0.2749 |
| $T=0.9$ | IGR | 0.0167 | 0.0545 | 0.1032 | 0.1586 | 0.2150 | 0.2588 | 0.2805 |
| $T=0.8$ | Standard | 0.0128 | 0.0522 | 0.0951 | 0.1402 | 0.1866 | 0.2330 | 0.2868 |
| $T=0.8$ | IG-NORM | 0.0343 | 0.0742 | 0.1157 | 0.1580 | 0.2009 | 0.2439 | 0.2867 |
| $T=0.8$ | IGR | 0.0237 | 0.0693 | 0.1258 | 0.1861 | 0.2546 | 0.3090 | 0.3559 |

Table 7: Maximum allowable perturbation size for different thresholds ($T=0.8$ and $T=0.9$) under various choices of $\ell_{2}$ smoothing radius $r$, evaluated on ImageNet.

| Threshold | $r=0.5$ | $r=1.0$ | $r=1.5$ | $r=2.0$ | $r=2.5$ | $r=3.0$ | $r=3.5$ |
|---|---|---|---|---|---|---|---|
| $T=0.9$ | 0.0046 | 0.0100 | 0.0152 | 0.0295 | 0.0494 | 0.0628 | 0.0768 |
| $T=0.8$ | 0.0058 | 0.0127 | 0.0196 | 0.0369 | 0.0618 | 0.0820 | 0.1040 |

Table 8: Empirical cosine similarity between original and perturbed smoothed attributions under various choices of $\ell_{2}$ smoothing radius $r$, with the perturbation size computed in Table 5 ($T=0.8$).

| Model | $r=0.5$ | $r=1.0$ | $r=1.5$ | $r=2.0$ | $r=2.5$ | $r=3.0$ | $r=3.5$ |
|---|---|---|---|---|---|---|---|
| Standard | 0.8636 | 0.8522 | 0.8347 | 0.8127 | 0.8477 | 0.8603 | 0.8310 |
| IG-NORM | 0.8308 | 0.8181 | 0.8504 | 0.8728 | 0.8502 | 0.8193 | 0.8199 |
| IGR | 0.8231 | 0.8800 | 0.8720 | 0.8603 | 0.8362 | 0.8135 | 0.8567 |

Table 9: Minimum smoothing radius required to achieve the threshold ($T=0.8$ and $T=0.9$) under various choices of $\ell_{2}$ perturbation size $\epsilon$. IG-NORM and IGR are omitted for ImageNet (marked "/") since they are not scalable to ImageNet.

| Threshold | Model | MNIST, $\epsilon=0.5$ | MNIST, $\epsilon=1.0$ | CIFAR-10, $\epsilon=0.5$ | CIFAR-10, $\epsilon=1.0$ | ImageNet, $\epsilon=0.5$ | ImageNet, $\epsilon=1.0$ |
|---|---|---|---|---|---|---|---|
| $T=0.9$ | Standard | 5.1902 | 5.8752 | 5.9752 | 7.9504 | 74.6272 | 149.2544 |
| $T=0.9$ | IG-NORM | 5.1189 | 5.7699 | 5.6860 | 7.3720 | / | / |
| $T=0.9$ | IGR | 5.0265 | 5.6623 | 5.2895 | 6.5790 | / | / |
| $T=0.8$ | Standard | 3.8927 | 4.4064 | 5.7082 | 7.4164 | 48.2095 | 96.4190 |
| $T=0.8$ | IG-NORM | 3.8392 | 4.3274 | 5.4875 | 6.9750 | / | / |
| $T=0.8$ | IGR | 3.7699 | 4.2468 | 5.0287 | 6.0573 | / | / |

In Tables 5, 6 and 7, which correspond to MNIST, CIFAR-10 and ImageNet, respectively, we report the computed values of the maximum allowable perturbation size. Under this size constraint, no attack against uniformly smoothed IG of a given radius can find an example for which the cosine similarity between clean and perturbed attributions falls below the given threshold ($T=0.8$ and $T=0.9$). The results are consistent with our theory. For a larger smoothing radius, the maximum allowable perturbation size is also larger. When the threshold requirement is stricter, the maximum allowable perturbation size is smaller, which means that only weaker attacks can be tolerated. The method is also scalable to ImageNet, where the computation takes around 15 seconds per sample. Moreover, we applied attribution attacks using the same radius and the maximum perturbation size $\epsilon$ computed using Eqn. (8). As in the experiments in Section 5, we performed 20 attacks on each sample. We found that, for all of the 200,000 attacked samples, the cosine similarities between clean and perturbed attributions were higher than the given threshold, suggesting that the computed bound is valid (see Table 8).
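
The trends described above can be checked directly against the tabulated numbers. The snippet below is only a sanity check on reported values: the Standard rows of Table 5 (MNIST) and all entries of Table 8. It does not recompute any bound, and the variable names are illustrative.

```python
import numpy as np

# Standard model on MNIST, copied from Table 5 (columns: r = 0.5, ..., 3.5).
eps_t09 = np.array([0.0389, 0.0951, 0.1550, 0.2164, 0.2783, 0.3404, 0.4029])
eps_t08 = np.array([0.0447, 0.1051, 0.1691, 0.2345, 0.3004, 0.3664, 0.4329])

# Larger smoothing radius -> larger allowable perturbation size.
grows_with_radius = bool(np.all(np.diff(eps_t09) > 0) and np.all(np.diff(eps_t08) > 0))
# Stricter threshold (T = 0.9) -> smaller allowable perturbation size.
stricter_is_smaller = bool(np.all(eps_t09 < eps_t08))

# Empirical cosine similarities copied from Table 8 (T = 0.8).
table8 = np.array([
    [0.8636, 0.8522, 0.8347, 0.8127, 0.8477, 0.8603, 0.8310],  # Standard
    [0.8308, 0.8181, 0.8504, 0.8728, 0.8502, 0.8193, 0.8199],  # IG-NORM
    [0.8231, 0.8800, 0.8720, 0.8603, 0.8362, 0.8135, 0.8567],  # IGR
])
all_above_threshold = bool(np.all(table8 >= 0.8))
print(grows_with_radius, stricter_is_smaller, all_above_threshold)
```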

We also evaluate the third formulation, which gives the minimum smoothing radius required such that, within the given perturbation size, the cosine similarity between original and perturbed smoothed attributions is larger than the given threshold. The computed minimum smoothing radius is reported in Table 9. Similarly, we observe that the minimum smoothing radius is larger when the threshold requirement is stricter and when the attack is stronger. This is also consistent with our theory. We also notice that the radius for ImageNet is extremely large, which indicates that ImageNet is difficult to defend under such strict threshold requirements.