From Feature Visualization to Visual Circuits:
Effect of Adversarial Model Manipulation

Geraldin Nanfack1,2, Michael Eickenberg3, Eugene Belilovsky1,2
1Concordia University 2Mila – Quebec AI Institute  
3Flatiron Institute
{geraldin.nanfack, eugene.belilovsky}@concordia.ca  [email protected]
Abstract

Understanding the inner workings of large-scale deep neural networks is challenging yet crucial in several high-stakes applications. Mechanistic interpretability is an emergent field that tackles this challenge, often by identifying human-understandable subgraphs in deep neural networks known as circuits. In vision-pretrained models, these subgraphs are usually interpreted by visualizing their node features through a popular technique called feature visualization. Recent works have analyzed the stability of different feature visualization types under the adversarial model manipulation framework. This paper starts by addressing limitations in existing works by proposing a novel attack called ProxPulse that simultaneously manipulates the two types of feature visualizations. Surprisingly, when analyzing these attacks under the umbrella of visual circuits, we find that visual circuits show some robustness to ProxPulse. We, therefore, introduce a new attack based on ProxPulse that unveils the manipulability of visual circuits, shedding light on their lack of robustness. The effectiveness of these attacks is validated using pre-trained AlexNet and ResNet-50 models on ImageNet.

1 Introduction

Large Deep Neural Networks (DNNs) trained on vast amounts of data are becoming increasingly important and are increasingly deployed in the real world. In several high-stakes applications such as autonomous driving, understanding the inner workings of these trained DNNs is crucial for assuring the safety and reliability of these systems (Rudner and Toner, 2021; Wäschle et al., 2022). Inspired by neuroscience (Hubel and Wiesel, 1962; Olah et al., 2017), one classical approach relies on activation maximization methods (Zeiler and Fergus, 2014; Olah et al., 2017), where the top images (real or synthetic) that most activate a neuron are used to interpret the neuron’s behavior. A recently popular direction for interpretability that often builds on activation maximization is mechanistic interpretability. Mechanistic interpretability is an emergent field, which seeks to discover human-understandable algorithms stored in model weights (Wang et al., 2022). The discovery of these meaningful algorithms makes it possible to reverse-engineer the behavior of neural networks (Conmy et al., 2023) and to edit factual knowledge in large-scale models (Meng et al., 2022). Most of the research in mechanistic interpretability analyzes the functionality of DNNs by considering them as computational graphs that can be decomposed into interpretable subgraphs known as circuits. In pre-trained vision models, the emergence of circuits that implement meaningful algorithms, such as curve detectors and dog head detectors (Olah et al., 2020), has been demonstrated. These circuits can be built by manually inspecting neurons and hierarchically grouping them according to feature visualization, which consists in finding, through activation maximization, either images from the training set or synthetic, optimization-based images (Olah et al., 2017). Circuits can also be discovered using structured pruning (Hamblin et al., 2022).

Although activation maximization purports to provide the interpreter with a description of the behavior of the neuron, recent work has cast some doubt on the reliability of these interpretations (Nanfack et al., 2024; Geirhos et al., 2023; Bareeva et al., 2024). Notably, these works show that models can be subtly perturbed (or “attacked”) to completely change the interpretation given by either synthetic or natural (i.e., from the training set) images. This suggests that these interpretations might not be completely reliable. The existing works on model manipulation, however, have two limitations that we focus on. (1) None of the existing attacks has been shown to manipulate both synthetic and natural visualizations simultaneously, as illustrated in Tab. 1. Indeed, Nanfack et al. (2024) show “attacks” in the context of natural images, while Geirhos et al. (2023) and Bareeva et al. (2024) only attack synthetic images, each attack only showing a difference in its target domain. (2) The effect of such attacks on circuits, and hence the reliability of circuit-based interpretations, has not been studied in the literature. In this paper, we analyze the robustness and stability of visual circuits through the same setting of adversarial model manipulation. Since feature visualization is a key component of visual circuits, we begin our analysis with it and summarize our contributions as follows. We first (i) propose a novel attack on activation maximization that can simultaneously change interpretations of both synthetic and natural image visualizations. We subsequently turn to analyzing the effect of our attack on the circuit-based interpretation, surprisingly (ii) finding that a class of circuits derived from structured pruning can be highly robust to our proposed attack when it is made on the output of the circuit. We then turn our attention to directly manipulating the circuit, proposing the first model manipulation attack on entire circuits. 
We find that (iii) visual circuits discovered by structured pruning can also be manipulated through our novel attack, shedding light on the lack of stability of these interpretability techniques.

2 Related Work

Method                | Manipulates: Synth. Vis. | Nat. Vis. | Circuit
Geirhos et al. (2023) | ✓                        | ×         | ×
Nanfack et al. (2024) | ×                        | ✓         | ×
Bareeva et al. (2024) | ✓                        | ×         | ×
ProxPulse (ours)      | ✓                        | ✓         | ×
CircuitBreaker (ours) | ✓                        | ✓         | ✓
Table 1: Existing attacks on feature visualization. Our methods are able to manipulate synthetic and natural visualizations as well as visual circuits. The symbol ✓ indicates that the row approach has been demonstrated to effectively deceive the interpretation derived from the column technique.

Mechanistic Interpretability.

Mechanistic interpretability is an emergent area in the interpretability of large-scale DNNs, which tackles the problem of discovering meaningful algorithms stored in model weights (Wang et al., 2022). Works in mechanistic interpretability either focus on individual neurons or on sparse connections of neurons called circuits. Individual neurons are often interpreted through techniques such as feature visualization (Zimmermann et al., 2021; Olah et al., 2017; Bau et al., 2020; Zimmermann et al., 2023), which is designed to interpret individual neurons by visualizing their top activating inputs. This can be applied to several modalities such as image (Olah et al., 2017) and text (Dai et al., 2022) using top-activating prompts. Works that build mechanistic interpretations using circuits have become popular due to the discovery of several meaningful subgraphs such as those for curve detectors (Olah et al., 2020) in vision models and indirect object identification in large language models (Conmy et al., 2023). While most of the studies manually build circuits, there have been recent proposals to automate the discovery of circuits for language models (Conmy et al., 2023) using edge attribution scores, and for vision models (Hamblin et al., 2022) using structured pruning. This paper focuses on feature visualization and circuits for vision models and we adopt this latter work to build visual circuits.
Manipulating Interpretability. Evaluating interpretability is difficult due to the absence of ground truth. There is a recent trend in assessing the reliability of interpretability techniques through the lens of stability, which aims to evaluate how the interpretability results change under reasonable input and model manipulation (Heo et al., 2019; Yu, 2013). The motivation for examining the robustness of interpretability methods within the context of model manipulation stems from the “universality” assumption (Olah et al., 2020; Chughtai et al., 2023), which suggests that model interpretations are similar for similarly performing networks of the same architecture. Some works study the lack of robustness of feature attribution methods under input and adversarial model manipulations (Heo et al., 2019; Adebayo et al., 2018; Dombrowski et al., 2019), and other works use these instabilities to fool assessments of model fairness (Aïvodji et al., 2021; Anders et al., 2020). This paper does not focus on feature attribution methods. Instead, it examines the manipulability of feature visualization and visual circuits, for which two recent studies are closely related. The first, by Geirhos et al. (2023), shows that synthetic (formally defined in Section 3) feature visualization can be fooled under adversarial model manipulation. The key idea of their method is to add orthogonal weights to the original ones such that activations of natural inputs (training data) remain the same, thus preserving model accuracy, while the orthogonal weights allow fooling synthetic feature visualization. The second, by Nanfack et al. (2024), introduces an optimization framework that manipulates the result of natural feature visualization (i.e., top activating inputs from the training set), and further observes the potential decorrelation between natural and synthetic feature visualization. In this paper, we go beyond these two works and propose a more complete manipulation, which we call ProxPulse. 
ProxPulse simultaneously fools both natural and synthetic feature visualization. However, when analyzing ProxPulse from the circuit perspective, we observe that ProxPulse fails to fool circuits, leading us to propose a new manipulation for visual circuits, which has not been studied before.

3 Notations and Background

We consider a classification problem with a dataset denoted by $\mathcal{D}=\{(\bm{x}_{i},y_{i})\}_{i=1}^{N}$, where $\bm{x}_{i}\in\mathbb{R}^{d}$ is the input and $y_{i}\in\{1,\dots,K\}$ is its class label. Let $f(\cdot;\bm{\theta})$ denote a DNN, and let $f^{(l)}(\bm{x};\bm{\theta})$ denote the activation maps of $\bm{x}$ on the $l$-th layer, which can be decomposed into $J$ single activation maps $f^{(l,j)}(\bm{x};\bm{\theta})$. In particular, if the $l$-th layer is a 2D-convolutional layer, $f^{(l,j)}(\bm{x};\bm{\theta})$ will be a matrix. Feature visualization is a method designed to interpret the inner workings of individual units. It is the result of the activation maximization (Mahendran and Vedaldi, 2015; Yosinski et al., 2015) defined by

$$\bm{x}^{*}\in\arg\max_{\bm{x}\in\mathcal{X}}f^{(l,j)}(\bm{x};\bm{\theta}), \tag{1}$$

where $\mathcal{X}$ can be the training set $\mathcal{X}=\mathcal{D}$ or a continuous space $\mathcal{X}\subset\mathbb{R}^{d}$, and $(l,j)$ is the pair of layer $l$ and neuron $j$. When $\mathcal{X}\subset\mathbb{R}^{d}$, following Zimmermann et al. (2021), we call $\bm{x}^{*}$ the synthetic feature visualization. On the other hand, when $\mathcal{X}$ is $\mathcal{D}$, $\bm{x}^{*}$ are top-activating images from the training set, and we refer to this result as the natural (or top-$k$) feature visualization, as opposed to the synthetic one. While feature visualization methods may reveal understandable features such as edge detectors in early layers (Olah et al., 2020), they are not directly equipped with tools to know how individual neurons are connected to form more complex features.
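To make the two regimes of Eq. 1 concrete, the following is a minimal sketch on a toy problem, not the procedure used in the paper. The function `unit_activation` is a hypothetical stand-in for $f^{(l,j)}$ on 2-D inputs, the four-point `dataset` is invented for illustration, and the synthetic variant uses plain gradient ascent with finite-difference gradients rather than any regularized image optimizer.

```python
import math

def unit_activation(x):
    """Toy stand-in for f^(l,j): a smooth bump that peaks at (1.0, -0.5)."""
    return math.exp(-((x[0] - 1.0) ** 2 + (x[1] + 0.5) ** 2))

def natural_visualization(dataset, k):
    """Top-k activating inputs from a finite dataset (the case X = D in Eq. 1)."""
    return sorted(dataset, key=unit_activation, reverse=True)[:k]

def numerical_gradient(fn, x, h=1e-5):
    """Central finite-difference gradient of fn at x."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((fn(x_plus) - fn(x_minus)) / (2 * h))
    return grad

def synthetic_visualization(x0, steps=500, lr=0.1):
    """Gradient ascent over a continuous input space (the case X subset of R^d)."""
    x = list(x0)
    for _ in range(steps):
        grad = numerical_gradient(unit_activation, x)
        x = [xi + lr * gi for xi, gi in zip(x, grad)]
    return x

# Hypothetical four-point "training set".
dataset = [(0.0, 0.0), (1.0, 0.0), (0.9, -0.4), (-1.0, 2.0)]
top2 = natural_visualization(dataset, k=2)
x_syn = synthetic_visualization((0.5, 0.0))
```

In this toy setting the natural visualization can only return points that exist in the dataset, while the synthetic one is free to move to the maximizer itself, which is why the two can disagree.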

Mechanistic interpretability is purposely designed to find potentially human-understandable sub-algorithms by decomposing the computational graph into subgraphs known as circuits. Hamblin et al. (2022) automated the discovery of visual circuits. They find visual circuits through structured pruning. Formally, given a feature map index $j$ from a conv layer of index $l$ (we call the pair $(l,j)$ the circuit head) and a sparsity level $\tau$, the corresponding $\tau$-circuit is the computational graph, with parameters $\hat{\bm{\theta}}$, which approximates $f^{(l,j)}(\cdot;\bm{\theta})$ through

$$\arg\min_{\hat{\bm{\theta}}}\frac{1}{N}\sum_{i=1}^{N}\left\|f^{(l,j)}(\bm{x}_{i};\hat{\bm{\theta}})-f^{(l,j)}(\bm{x}_{i};\bm{\theta})\right\|\quad\text{s.t.}\quad\|\hat{\bm{\theta}}\|_{0}\leq\tau,\text{ and }\hat{\bm{\theta}}_{l}\in\{\bm{\theta}_{l},0\}. \tag{2}$$

In practice, Hamblin et al. (2022) adopts structured pruning (i.e., pruning per group of parameters) with convolutional kernels. This is done by computing kernel attribution scores, e.g., using SNIP (Lee et al., 2018; Hamblin et al., 2022),

$$\text{Attr}\left(\bm{\theta}_{(l^{\prime},k)};f^{(l,j)},\bm{x}\right)=\frac{1}{K_{w}K_{h}}\sum_{p=1}^{K_{w}}\sum_{q=1}^{K_{h}}\left|w_{p,q}\frac{\partial f^{(l,j)}(\bm{x};\bm{\theta})}{\partial w_{p,q}}\right|, \tag{3}$$

where $K_{w},K_{h}$ are the spatial dimensions of the kernel with index $k$ in a preceding layer of index $l^{\prime}\leq l$, and $w_{p,q}$ are the weight parameters of that kernel. Once these attribution scores are computed, they are sorted, and the top kernels are retained according to the sparsity level $\tau$ to compute the circuit. Following Hamblin et al. (2022), the sparsity level represents the number of parameters that were not masked.
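The scoring and selection steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of Hamblin et al. (2022): the gradients are assumed to be given (in practice they would come from automatic differentiation), scores are computed for a single input rather than aggregated over a dataset, and the greedy budgeted selection in `select_circuit` is one plausible reading of "top kernels are retained according to the sparsity level". All names and numbers are hypothetical.

```python
def kernel_attribution(weights, grads):
    """Eq. 3 for one kernel: mean of |w * df/dw| over its Kw x Kh entries."""
    total, count = 0.0, 0
    for w_row, g_row in zip(weights, grads):
        for w, g in zip(w_row, g_row):
            total += abs(w * g)
            count += 1
    return total / count

def select_circuit(kernels, tau):
    """Keep the highest-scoring kernels while at most tau parameters are unmasked.

    `kernels` maps a (layer, kernel_index) key to a (weights, grads) pair,
    each given as a nested list of shape Kw x Kh.
    """
    scores = {key: kernel_attribution(w, g) for key, (w, g) in kernels.items()}
    kept, n_params = [], 0
    for key in sorted(scores, key=scores.get, reverse=True):
        size = sum(len(row) for row in kernels[key][0])
        if n_params + size > tau:
            break
        kept.append(key)
        n_params += size
    return kept, scores

# Hypothetical 2x2 kernels with made-up weights and gradients.
kernels = {
    ("conv1", 0): ([[1.0, -2.0], [0.5, 0.0]], [[0.1, 0.2], [0.0, 3.0]]),
    ("conv1", 1): ([[0.1, 0.1], [0.1, 0.1]], [[0.1, 0.1], [0.1, 0.1]]),
    ("conv2", 0): ([[2.0, 2.0], [2.0, 2.0]], [[1.0, -1.0], [1.0, -1.0]]),
}
kept, scores = select_circuit(kernels, tau=8)
```

Note that a weight with a large gradient but zero magnitude (the last entry of `("conv1", 0)`) contributes nothing to the score, since the attribution multiplies weight and gradient.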

4 Methods

We analyze the manipulability of feature visualization and visual circuits under adversarial model manipulation, which consists in fine-tuning a pre-trained model with specifically designed loss functions. To do so, we adopt a framework similar to that used by Heo et al. (2019) and Nanfack et al. (2024), which is framed as the following optimization problem:

$$\min_{\bm{\theta}}\left(\alpha\mathcal{L}_{\text{F}}(\mathcal{D}_{\text{fool}};\bm{\theta})+(1-\alpha)\mathcal{L}_{\text{M}}(\mathcal{D};\bm{\theta},\bm{\theta}_{\text{initial}})\right), \tag{4}$$

where $\mathcal{D}_{\text{fool}}$ is the data used to manipulate the interpretation technique, $\bm{\theta}$ are the parameters of the updated model $f(\cdot;\bm{\theta})$, $\mathcal{L}_{\text{M}}$ is the loss that aims to maintain the initial performance of the model $f(\cdot;\bm{\theta}_{\text{initial}})$, and $\mathcal{L}_{\text{F}}$ is the fooling loss. In practice, $\mathcal{L}_{\text{M}}(\mathcal{D};\bm{\theta},\bm{\theta}_{\text{initial}})=\mathcal{L}_{\text{CE}}(f(\cdot;\bm{\theta}_{\text{initial}})\,\|\,f(\cdot;\bm{\theta}))$ (Hinton et al., 2015) is the cross entropy loss between the original model outputs and the finetuned model outputs on training data $\mathcal{D}$, and the fooling loss $\mathcal{L}_{\text{F}}$ is provided in the following sections.
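As a small numerical illustration of Eq. 4, the sketch below combines a given fooling-loss value with a maintain term for a single example. It assumes the maintain term is the cross entropy between the softmax outputs of the initial and finetuned models, which is one reading of the description above; the function names and the logits are hypothetical, and a real implementation would average over minibatches of $\mathcal{D}$.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def maintain_loss(initial_logits, finetuned_logits):
    """Cross entropy between the initial and finetuned output distributions."""
    p = softmax(initial_logits)    # targets from the frozen initial model
    q = softmax(finetuned_logits)  # predictions of the model being finetuned
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def manipulation_objective(fooling_loss, initial_logits, finetuned_logits, alpha):
    """Eq. 4 for one example: alpha * L_F + (1 - alpha) * L_M."""
    l_m = maintain_loss(initial_logits, finetuned_logits)
    return alpha * fooling_loss + (1.0 - alpha) * l_m

# Hypothetical values: the maintain term grows when the outputs drift apart.
unchanged = maintain_loss([2.0, 0.0], [2.0, 0.0])
drifted = maintain_loss([2.0, 0.0], [0.0, 2.0])
```

The weight `alpha` trades off how aggressively the interpretation is fooled against how closely the finetuned model tracks the original outputs.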

4.1 Manipulation of Feature Visualization

This section introduces a fooling loss that aims to manipulate both natural and synthetic feature visualizations, focusing on all the channels indexed by $j$ of a particular layer of index $l$. For brevity, we omit $l$ in the fooling loss $\mathcal{L}_{\text{F}}$. We start by observing that fooling the result of feature visualization involves the creation of a local region in the input space, reachable by gradient ascent, and with high values of activations. To ensure the creation of such a region, we use the $\rho$-ball $B(\bm{x}^{*},\rho)$ (using the $l_{2}$ norm) centered on the image target $\bm{x}^{*}\in\mathcal{D}_{\text{fool}}$, which excludes initial synthetic images when manipulating feature visualization results. This $\rho$-ball $B(\bm{x}^{*},\rho)$ is used to contain the new feature visualization results. We therefore propose a fooling objective that aims to push up the smallest activations of images in $B(\bm{x}^{*},\rho)$. We denote this fooling loss ProxPulse (referring to proximity in the $\rho$-ball and the pulsating effect on activations) and express it as

$$\mathcal{L}_{\text{F}}(\mathcal{D}_{\text{fool}};\bm{\theta})=\sum_{j,\,\bm{x}^{*}\in\mathcal{D}_{\text{fool}}}\max_{\|\bm{x}-\bm{x}^{*}\|\leq\rho}\ell_{j}(\bm{x};\bm{\theta})=\sum_{j,\,\bm{x}^{*}\in\mathcal{D}_{\text{fool}}}\max_{\|\bm{x}-\bm{x}^{*}\|\leq\rho}\log\left(1+C/\|f^{(l,j)}(\bm{x};\bm{\theta})\|^{2}_{2}\right), \tag{5}$$

where $C$ is a very high constant, the indexes $j$ refer to channel or unit indexes of the layer index $l$ whose feature visualizations are being fooled, and $\max_{\|\bm{x}-\bm{x}^{*}\|\leq\rho}\ell_{j}(\bm{x};\bm{\theta})$ refers to the cost over the worst activations (per channel) in the neighborhood of the fooling image target $\bm{x}^{*}$. Finetuning the model with the ProxPulse loss in the framework defined in Eq. 4 involves a challenging bi-level optimization problem for large-scale DNNs. Inspired by sharpness-aware minimization problems (Foret et al., 2020), which also require minimizing the worst empirical risk in a neighborhood, we derive an efficient approximation of $\mathcal{L}_{\text{F}}(\mathcal{D}_{\text{fool}};\bm{\theta})$, expressed as

$$\mathcal{L}_{\text{F}}(\mathcal{D}_{\text{fool}};\bm{\theta})\approx\sum_{j,\,\bm{x}^{*}\in\mathcal{D}_{\text{fool}}}\ell_{j}\Big(\bm{x}^{*}+\epsilon(\bm{x}^{*});\bm{\theta}\Big), \tag{6}$$

where $\epsilon(\bm{x}^{*})=\rho\,\frac{\nabla_{\bm{x}}\ell_{j}(\bm{x}^{*};\bm{\theta})}{\|\nabla_{\bm{x}}\ell_{j}(\bm{x}^{*};\bm{\theta})\|}$. See App. A.3 for more details.
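The one-step approximation in Eq. 6 can be illustrated on a toy unit. This is a sketch under explicit assumptions rather than the authors' code: `activation_sq_norm` is a hypothetical squared linear response standing in for $\|f^{(l,j)}(\bm{x};\bm{\theta})\|_2^2$, the values of `C` and `RHO` are arbitrary, and the input gradient is taken by finite differences instead of backpropagation.

```python
import math

C = 1e6    # the "very high constant" of Eq. 5; this value is an arbitrary choice
RHO = 0.1  # radius of the l2 ball around the target; also an arbitrary choice

def activation_sq_norm(x, theta):
    """Toy stand-in for ||f^(l,j)(x; theta)||_2^2: a squared linear response."""
    response = sum(t * xi for t, xi in zip(theta, x))
    return response ** 2 + 1e-12  # small offset avoids division by zero

def ell(x, theta):
    """Per-channel cost log(1 + C / ||f||^2): large when the activation is small."""
    return math.log(1.0 + C / activation_sq_norm(x, theta))

def input_gradient(x, theta, h=1e-6):
    """Central finite-difference gradient of ell with respect to the input x."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((ell(x_plus, theta) - ell(x_minus, theta)) / (2 * h))
    return grad

def proxpulse_term(x_star, theta, rho=RHO):
    """One summand of Eq. 6: ell evaluated at x* + rho * grad / ||grad||."""
    g = input_gradient(x_star, theta)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0
    x_worst = [xi + rho * gi / norm for xi, gi in zip(x_star, g)]
    return ell(x_worst, theta), x_worst

x_star, theta = [1.0, 2.0], [1.0, 1.0]
loss, x_worst = proxpulse_term(x_star, theta)
```

The perturbed point `x_worst` lies on the boundary of the $\rho$-ball in the direction that most increases the cost, i.e., toward lower activation, so minimizing `loss` over `theta` raises the weakest activations near the target instead of only the activation at the target itself.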

4.2 Manipulation of Visual Circuits

This section introduces a fooling objective, called CircuitBreaker, whose goal is to fool the visual circuit. For a DNN’s circuit head with a layer-channel pair (l,j)𝑙𝑗(l,j)( italic_l , italic_j ), CircuitBreaker aims to (i) preserve the feature visualization of the circuit head to maintain circuit functionality and (ii) deceive the attribution scores of the circuit discovery method. We propose the following objective

$$\begin{aligned}\mathcal{L}_{\text{F}}(\{\bm{x}^{*}\},\mathcal{D};\bm{\theta})={}&\ell_{j}\Big(\bm{x}^{*}+\epsilon(\bm{x}^{*});\bm{\theta}\Big)\\&+\beta\sum_{i\leq N}\sum_{l^{\prime}<l}\sum_{k\neq\hat{k},\,\hat{k}\in\text{topInit}(l^{\prime})}\left[\text{Attr}\left(\bm{\theta}_{(l^{\prime},\hat{k})};f^{(l,j)},\bm{x}_{i}\right)-\text{Attr}\left(\bm{\theta}_{(l^{\prime},k)};f^{(l,j)},\bm{x}_{i}\right)\right]_{+},\end{aligned} \tag{7}$$

where $\bm{x}_{i}$ are training images, $[\cdot]_{+}=\max(\cdot,0)$, $\bm{x}^{*}$ is the initial synthetic feature visualization for the circuit head $(l,j)$ (channel index $j$ of the layer index $l$), and $\text{topInit}(l^{\prime})$ is the set of top kernel indexes of the layer index $l^{\prime}$, according to their initial attribution scores on the (initial) circuit with head $(l,j)$. From this CircuitBreaker loss, we observe that its first component is the ProxPulse loss $\ell_{j}\Big(\bm{x}^{*}+\epsilon(\bm{x}^{*});\bm{\theta}\Big)$, applied only on the channel index $j$ of layer $l$. As defined in Sec. 4.1, it aims to maintain the initial feature visualization of the circuit head $(l,j)$. The second component is a pairwise ranking loss that aims to push down the rank of the initial top attributed kernels of the circuit.
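The pairwise ranking component of Eq. 7 can be sketched for a single image and a single layer as below. This is a literal, illustrative reading of the formula with hypothetical names and made-up attribution scores; the full loss additionally sums over training images and over all layers $l^{\prime}<l$, scales the result by $\beta$, and adds the ProxPulse term.

```python
def ranking_term(attr, top_init):
    """Hinge-style ranking term of Eq. 7 for one image and one layer.

    attr: dict mapping kernel index -> current attribution score.
    top_init: kernel indexes that were top-ranked before the manipulation.
    """
    loss = 0.0
    for k_hat in top_init:
        for k, score in attr.items():
            if k == k_hat:
                continue
            # Penalize only while the initially-top kernel still outranks kernel k.
            loss += max(attr[k_hat] - score, 0.0)
    return loss

# Hypothetical scores before and after the manipulation; kernel 0 was initially top.
before = {0: 3.0, 1: 1.0, 2: 0.5}
after = {0: 0.2, 1: 1.0, 2: 0.5}
loss_before = ranking_term(before, top_init=[0])
loss_after = ranking_term(after, top_init=[0])
```

The term reaches zero once every initially top kernel scores no higher than any other kernel in its layer, at which point a pruning procedure driven by these scores would no longer prefer the original circuit's kernels.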

Figure 1: Illustration of the manipulability of both natural and synthetic feature visualization using ProxPulse on conv5 and conv4 of AlexNet. The first row (resp. second row) shows the initial (resp. final) natural feature visualization and the initial (resp. final) synthetic feature visualizations. In each image title, we report the corresponding metrics, which evaluate the change in top activating inputs. One can observe that both natural and synthetic feature visualizations have completely changed, with the synthetic ones changing to very similar images. Observe that, as intended, conv4 synthetic images are different from those of conv5, although the same target images have been used for $\mathcal{D}_{\text{fool}}$.

5 Experimental Evaluation

We now describe the experimental setup and the results obtained after running the two manipulations.

The setup is inspired by the works of Nanfack et al. (2024) and Hamblin et al. (2022). For all experiments, we use the ImageNet (Deng et al., 2009) dataset as the training set $\mathcal{D}$. We use the pre-trained networks AlexNet (Krizhevsky et al., 2012), ResNet-50 (He et al., 2016), DenseNet-201 (Huang et al., 2017) (in App. A.11) and ResNet-152 (He et al., 2016) (in App. A.11) from PyTorch (Paszke et al., 2019).

Hyperparameters. We use the Adam optimizer with a minibatch size of 256 and a learning rate of 1e-4 for both ProxPulse and CircuitBreaker. More details on hyperparameters can be found in App. A.2.

Metrics. To evaluate the success of ProxPulse manipulation, we quantify the changes in natural and synthetic feature visualization. For natural feature visualization, we use the metrics adopted by Nanfack et al. (2024), which are: (i) the Kendall-$\tau$ rank correlation computed on ranks of images based on their initial and final (after fine-tuning) activations, and (ii) the CLIP-$\delta$ score, which quantifies the semantic change in top activating images. For the synthetic feature visualization, we compute the pairwise cosine similarities between the CLIP embeddings (Oikarinen and Weng, 2022) of initial synthetic images, which we compare against pairwise similarities between final synthetic ones.

To assess CircuitBreaker (see Sec. 5.3), we use Pearson correlation, Kendall-$\tau$, and CLIP similarities.
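As an illustration of the rank-correlation metric used throughout this section, the following is a minimal sketch of the Kendall-$\tau$ coefficient (the tau-a variant, which ignores ties). The function name and the toy activation lists are assumptions for illustration; in practice a library routine such as `scipy.stats.kendalltau` would typically be used instead.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    assert len(a) == len(b) and len(a) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy activations of five images on one channel, before and after fine-tuning.
initial = [0.9, 0.7, 0.5, 0.3, 0.1]
same_order = [9.0, 7.0, 5.0, 3.0, 1.0]       # ranking unchanged
reversed_order = [0.1, 0.3, 0.5, 0.7, 0.9]   # ranking fully inverted

tau_same = kendall_tau(initial, same_order)
tau_reversed = kendall_tau(initial, reversed_order)
print(tau_same, tau_reversed)
```

A value near 1 means the ranking of images (or of kernel attribution scores) is preserved, while values near 0 or below indicate that the ranking has been substantially altered.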

Channel Notation. Before presenting the results, inspired by Olah et al. (2020) and Hamblin et al. (2022), we use the concise notation layerName:channelIndex to refer to the pair (layerName, channelIndex). This notation is also used to flag corresponding synthetic feature visualizations and circuit heads (similar to feature heads) for a given channel. In the PyTorch AlexNet model, features.0, features.3, features.6, features.8 and features.10 refer respectively to conv1, conv2, conv3, conv4 and conv5.
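The notation above can be handled programmatically with a small helper. The sketch below uses only the layer-name correspondence stated in the text; the function names `parse_channel` and `to_conv_name` are hypothetical.

```python
# Layer-name correspondence for the PyTorch AlexNet model, as stated in the text.
ALEXNET_LAYERS = {
    "features.0": "conv1",
    "features.3": "conv2",
    "features.6": "conv3",
    "features.8": "conv4",
    "features.10": "conv5",
}


def parse_channel(tag):
    """Split 'layerName:channelIndex' into the pair (layerName, channelIndex)."""
    layer, _, index = tag.rpartition(":")
    if not layer or not index.isdigit():
        raise ValueError(f"expected 'layerName:channelIndex', got {tag!r}")
    return layer, int(index)


def to_conv_name(tag):
    """Rewrite a tag such as 'features.10:2' using the convN layer names."""
    layer, index = parse_channel(tag)
    return f"{ALEXNET_LAYERS.get(layer, layer)}:{index}"


print(parse_channel("conv5:37"))
print(to_conv_name("features.10:2"))
```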

5.1 ProxPulse Simultaneously Fools Natural and Synthetic Feature Visualization

We evaluate ProxPulse manipulations on natural and synthetic feature visualization. The ProxPulse objective increases the lowest-valued activations of images in the $\rho$-ball of target images in $\mathcal{D}_{\text{fool}}$. We direct the manipulation towards two target natural images (shown in Fig. 8 of the appendix). As motivated in Nanfack et al. (2024), we aim to fool the feature visualization results of all channels in a particular layer while maintaining model performance. Fig. 1 shows the results (for three randomly chosen channels) obtained after applying ProxPulse to the conv4 and conv5 layers of AlexNet, respectively. For both layers, it can be observed that natural and synthetic feature visualizations were completely changed, thus modifying any interpretation using these techniques. Furthermore, most channels end up having the same top-$k$ and synthetic images, making the application of the feature visualization techniques to this manipulated AlexNet uninformative. We emphasize that prior work was only capable of individually changing either the synthetic or natural images. Ablation results on ResNet-50 and DenseNet-201 are available in the appendix (Fig. 13 and Fig. 29). Ablations on the choice and number of target images in $\mathcal{D}_{\text{fool}}$ can be found in Fig. 31 (App. A.11) and Fig. 15 (App. A.6).

Natural feature visualization. From Tab. 2, on layer conv5, the Kendall-$\tau$ is relatively high, indicating that ProxPulse made only minor modifications to the channel behavior. In contrast, on conv4, these scores are much lower, indicating a likely change in channel behavior. On both layers, the CLIP-$\delta$ scores (which measure the semantic change in the top-$k$ images) remain relatively high (in comparison to those observed in Nanfack et al. (2024)). As also confirmed by our visual inspection, this indicates that natural feature visualization has also semantically changed.

Figure 2: Histogram of pairwise cosine similarities between CLIP features of non-noisy synthetic images before (initial) and after (final) ProxPulse.

| Model | Layer/Attack | CLIP-$\delta$ $\uparrow$ | Kend-$\tau$ $\downarrow$ | Acc. (%) | CLIP.S. $\uparrow$ |
|---|---|---|---|---|---|
| AlexNet | Conv5/Push-Up# | 0.150 | 0.654 | 56.3 | 0.911 |
| AlexNet | Conv5/Push-Down# | 0.249 | 0.530 | 56.2 | 0.872 |
| AlexNet | Conv5/ProxPulse | 0.364 | 0.746 | 56.0 | 0.983 |
| AlexNet | Conv4/Push-Down# | 0.205 | 0.548 | 56.2 | 0.870 |
| AlexNet | Conv4/ProxPulse | 0.282 | -0.276 | 55.7 | 0.988 |
| ResNet-50 | L1.0.conv2/ProxPulse | 0.126 | -0.377 | 79.62 | 0.975 |

Table 2: Average (over channels) metrics for ProxPulse manipulations and baselines. The symbol # refers to baseline methods in Nanfack et al. (2024). Kend-$\tau$ is the abbreviation of the Kendall-$\tau$ score whereas CLIP.S. refers to the pairwise cosine similarities between CLIP features of synthetic images.

Synthetic feature visualization. In Fig. 1, we can also observe that the synthetic feature visualization was successfully modified, and shares similarities with the target images in Fig. 8 (Appendix). Fig. 11 and Fig. 9 (Appendix) further show that almost every synthetic image in a layer has changed completely to one single pattern (see further illustrations in App. A.4). We also quantitatively evaluate the change in synthetic feature visualization by measuring the pairwise similarity between CLIP features of the initial synthetic images. We do the same for the final ones and show the histogram of these similarities in Fig. 2. As seen in Fig. 2, there is a clear shift between the distributions of pairwise similarity before and after ProxPulse. In particular, after ProxPulse, the distribution mass of pairwise similarity between synthetic images is much more concentrated around the mode than before. This confirms that the non-noisy synthetic images are very similar to each other after the attack. This can be further inspected in App. A.4.
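The pairwise-similarity computation described above can be sketched as follows. The three-dimensional vectors stand in for CLIP embeddings of synthetic images and are made-up toy values; the function names are hypothetical, and no CLIP model is actually called.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_cosine(embeddings):
    """Cosine similarity for every unordered pair of embeddings."""
    return [cosine(u, v) for u, v in combinations(embeddings, 2)]


def mean(values):
    return sum(values) / len(values)


# Toy stand-ins for embeddings of the synthetic images of three channels.
initial = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # diverse images
final = [[1.0, 0.1, 0.0], [1.0, 0.0, 0.1], [1.0, 0.1, 0.1]]     # near-identical images

sims_initial = pairwise_cosine(initial)
sims_final = pairwise_cosine(final)
print(mean(sims_initial), mean(sims_final))
```

A histogram of `sims_final` concentrated near 1 corresponds to the situation reported in Fig. 2, where the synthetic images of different channels collapse to similar patterns.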

Accuracy preservation. We report the accuracy of models fine-tuned with ProxPulse in Table 2. We observe that the accuracy drop of fine-tuned models is less than $1\%$, meaning that the fine-tuned model and the initial model share practically the same level of performance for ImageNet classification.

We finally do an ablation on ResNet-50 in App. A.5 and observe the same results: both natural and synthetic feature visualization were successfully fooled with ProxPulse without a practical decrease in model performance. Additionally, we computed the accuracy per class to ensure that any potential drop in accuracy was not targeted at specific classes only. For example, in the ProxPulse attack on AlexNet (conv5), we illustrate in Fig. 35 the per-class accuracy drop and observe that the drop is distributed (though not uniformly) across most classes, rather than being concentrated on just a few.
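The per-class check mentioned above amounts to comparing class-wise accuracies of the initial and fine-tuned models. A minimal sketch follows, with hypothetical function names and toy labels and predictions rather than ImageNet results.

```python
from collections import defaultdict


def per_class_accuracy(labels, predictions):
    """Accuracy computed separately for each ground-truth class."""
    correct, total = defaultdict(int), defaultdict(int)
    for y, p in zip(labels, predictions):
        total[y] += 1
        correct[y] += int(y == p)
    return {c: correct[c] / total[c] for c in total}


def accuracy_drop(labels, preds_initial, preds_final):
    """Per-class accuracy of the initial model minus that of the final model."""
    acc_initial = per_class_accuracy(labels, preds_initial)
    acc_final = per_class_accuracy(labels, preds_final)
    return {c: acc_initial[c] - acc_final[c] for c in acc_initial}


# Toy example with three classes and two samples per class.
labels = [0, 0, 1, 1, 2, 2]
preds_initial = [0, 0, 1, 1, 2, 0]
preds_final = [0, 1, 1, 1, 2, 0]

drop = accuracy_drop(labels, preds_initial, preds_final)
print(drop)
```

Inspecting how the values of `drop` are spread over classes is one way to verify that a small overall accuracy loss is not concentrated on a few classes.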

5.2 ProxPulse Has a Minor Effect on Channel Attribution Ranks of Visual Circuits

We analyze the ProxPulse attack through the lens of visual circuits (Section 3 presents how visual circuits are discovered) to gain more insight into this fooling mechanism. Fig. 3 shows two visualizations of the circuit with (circuit) head conv5:37 on two AlexNet models. The first one is the PyTorch pre-trained AlexNet while the second one is the manipulated version with ProxPulse applied to simultaneously fool natural and synthetic feature visualizations of conv5 (as explained in Section 4.1). As a reminder of Section 3, these visual circuits are obtained by finding a sparse approximation of the computational graph of the head (conv5:37). This is done using kernel attribution scores. Our visualization follows Olah et al. (2020) and Hamblin et al. (2022), where nodes or channels are visualized through their synthetic feature visualization. In addition, we show only the top 4 nodes and weight edge transparency by attribution value, i.e., darker edges indicate stronger importance on the visual circuit. These circuits are used by related work (Olah et al., 2017; Hamblin et al., 2022) to interpret the functional behavior of the circuit head.

Figure 3: Illustration of the non-effectiveness of ProxPulse to manipulate the circuit. We show two visual circuits drawn for circuit head conv5:37 on pre-trained AlexNet (left) and on the fine-tuned AlexNet with ProxPulse (right) on conv5. We observe that most of the channels (at least two per layer, see surrounded ones) on the circuit were not removed by ProxPulse, even though some of them (e.g., channel conv5:151) have visually changed.
Figure 4: Visual circuit with sparsity 0.3 for conv5:37 after fine-tuning with ProxPulse on AlexNet. We observe that the final synthetic feature visualization of the circuit head with sparsity 0.3 is similar to the initial one (in Fig. 3), although with sparsity 1 this final visualization was completely and visually different from the initial one. Reducing the sparsity has therefore removed the change in feature visualization, as can be seen by the absence of the patterns added by ProxPulse in the right circuit of Fig. 3.

A closer look at Fig. 3 shows that although, as intended, the synthetic feature visualization of conv5:37 has completely changed (colors and textures), most of the initial circuit channels are still present in the circuit derived from the manipulated model. Notably, at least one-half of the channels per layer (before conv5) from the initial circuit are still present in the final circuit, most of them with a feature visualization similar to the initial one (see conv1:37, conv1:20, conv2:107, conv2:12, etc.). However, we also observe that, for some of the channels that are still present in the final circuit, such as conv3:151, the final synthetic visualization looks very similar to the changed synthetic visualization of the circuit head, despite these channels not having the strongest connection to the circuit's head. This suggests that ProxPulse may have little impact on the circuit discovery method.

To look deeper into the effect of ProxPulse on the visual circuit, we reduce the sparsity from 1 to 0.3 and rebuild, in Fig. 4, the right-side visual circuit of Fig. 3 together with its feature visualizations. We observe that the effect of ProxPulse has now almost completely been removed, confirming that despite the ability of ProxPulse to deceive both types of feature visualization, it adds only a minor modification to the network. Importantly, this minor modification can be visually detected when visualizing the circuit with low and moderate sparsity. We ran a similar experiment for circuits of conv4 (see App. A.7 and Fig. 16).

To provide a more quantitative analysis of the ineffectiveness of ProxPulse at deceiving visual circuits, we compute the Kendall-$\tau$ rank correlation between (i) kernel attribution scores on the initial model and (ii) kernel attribution scores on the final (fine-tuned) model with ProxPulse on conv5. We do this on 10 randomly chosen channels of conv5, thus on 10 random circuits. We plot the mean with error bars in Fig. 17 (appendix) and observe that the final ranks are strongly correlated with the original ones. This further illustrates the limited impact of ProxPulse on the circuit discovery method. These results suggest that circuits may be robust to manipulation. We thus now consider the first manipulation attack targeted explicitly at circuits.

(a) With initial model.
(b) After CircuitBreaker.
Figure 5: Effectiveness of CircuitBreaker to manipulate visual circuits on conv5 of AlexNet. We observe that the circuit visualization is severely distorted while the network outputs change minimally.
(a) With initial model.
(b) After CircuitBreaker.
Figure 6: Effectiveness of CircuitBreaker to manipulate visual circuits on conv4 of AlexNet.
(a) Pearson correlation between activations on circuits (with pruning) for the (i) considered model and (ii) the initial model without pruning.
(b) Rank correlation between kernel attribution scores for circuits on (i) the initial model and (ii) the fine-tuned model with CircuitBreaker.
(c) Similarity ratio with CircuitBreaker.
Figure 7: Results obtained when fooling circuits with heads on conv3, conv4 and conv5 of AlexNet.

5.3 Manipulation of the Circuit through CircuitBreaker

In this section, we manipulate the model with CircuitBreaker as introduced in Section 4.2. As a refresher, the goal of the CircuitBreaker attack mechanism is to fine-tune the pre-trained model so as to maintain its initial performance and fool the interpretations of visual circuits (initial rankings of top channels and their synthetic feature visualization), while also preserving the functionality of the circuit head. In the following, we present the results obtained with CircuitBreaker on visual circuits for conv3 (features.6), conv4 (features.8), and conv5 (features.10) of AlexNet. Note that circuits (10 heads on each layer) are attacked independently from each other (an ablation for simultaneous manipulation can be found in App. A.12) and that the visualized circuits use a sparsity of 0.6, which preserves the behavior of the circuit heads well (see Fig. 7(a)). An ablation study on the visualizations with various sparsity levels is provided in App. A.9, and an ablation on the model (ResNet-50) in App. A.10.

Visual Inspection. We start by visually inspecting the results after CircuitBreaker. Fig. 5, Fig. 6 and Fig. 19 (appendix) show the results obtained after fooling attempts using CircuitBreaker on three different circuits (three different experiments). On each of the three circuits (Sec. 3 presents how visual circuits are discovered), when we compare the final circuit against the initial one, we observe that the final synthetic feature visualization of the circuit head stays visually similar to the initial one. The similarity is less pronounced on the circuit for features.10:2, but even in this case the final visualization still shares the circular contour. This is due to the ProxPulse component in CircuitBreaker (see Section 4.2). We also observe, on the three circuits, little overlap between the channel indices of the initial circuit and those of the final one, which is the effect of the ranking loss in CircuitBreaker. We finally observe that, in terms of the semantics of the composition of synthetic feature visualizations on the circuits, both initial and final circuits seem plausible. In particular, let us zoom in on the less obvious one in Fig. 5. An analysis of the initial circuit may roughly indicate that this circuit detects patterns related to circular objects with a (vertical) axis (see e.g., Fig. 1 and the annotations of this unit from Hernandez et al. (2022)). On the layers preceding features.10, we can see synthetic feature visualizations that visually seem to be dedicated to these circular contours (e.g., conv3:225, conv3:322) and others that are related to the (vertical) axis (conv3:112). As said above, the similarity between the synthetic feature visualizations of the final (i.e., after CircuitBreaker) circuit is less pronounced, but the circular contour pattern can still be observed. Indeed, this circular pattern has been amplified when looking at the final synthetic feature visualizations.

Quantitative Assessment. The above analysis was a visual inspection of the manipulability of visual circuits under the CircuitBreaker attack. Here we quantify its success using four criteria (CT).
CT1: Functional behavior. First, to measure the preservation of the functional behavior of the circuit head, inspired by Hamblin et al. (2022), we measure the Pearson correlation (on a large subset of the training set) between (i) activations of training images on the circuit head and (ii) activations of the same training images on the circuit head with sparsity 1. Higher correlations indicate better preservation of the functional behavior of the circuit head. Fig. 7(a) reports these correlation scores for several sparsity levels and for the different layers from which circuit heads are taken. It can be observed from the dotted lines (fooled circuits) that the functional behavior of fooled circuits is preserved in the same way as that of the unfooled ones (bold lines), especially for moderate to high levels of sparsity ($> 0.5$).
CT2: Sanity check of accuracy. Second, as done in Section 4.1, we assess the performance maintenance, ensuring that the fine-tuned model represents an adversarial model manipulation of the initial model. We measure the performance of all fine-tuned models and report it in Fig. 27 (appendix). The figure illustrates that our fine-tuning with CircuitBreaker maintains the predictive performance of AlexNet on ImageNet, whose accuracy is $56.52\%$.
CT3: Correlation of attribution scores. Third, as done in Sec. 5.2, we measure the rank correlation between kernel attribution scores from (i) the initial model and (ii) the final model, which is the model fine-tuned with CircuitBreaker. A lower rank correlation indicates a larger change in the circuit, because these ranks are the ones used for circuit discovery. Fig. 7(b) shows these rank correlations. It can be seen from this figure that the final ranks of kernel attribution scores are weakly correlated with the initial ones, except those of the circuit head's layers, which is reasonable.
CT4: Similarity ratio between synthetic feature visualizations on the circuit. Finally, since with fine-tuning, channels can switch their feature visualizations (thus decreasing the rank correlation but not changing the interpretations of the circuit), we need a method to measure the change in synthetic feature visualization. Inspired by the phenomenon called whack-a-mole in Nanfack et al. (2024), we use a similarity ratio computed using CLIP (Oikarinen and Weng, 2022) features. This similarity ratio is computed as follows. Given a final synthetic feature visualization from a layer, the numerator of the ratio is the maximum cosine similarity between this final synthetic image on the final model and any of the initial ones from the same layer on the initial model. The denominator is the cosine similarity between this final synthetic image on the final model and the synthetic image from the same channel on the initial model. Intuitively, the ratio quantifies the change in synthetic visualization (initial vs. final) relative to the initial synthetic visualization (using the final top channels). Fig. 7(c) shows this similarity ratio per layer on different circuit heads. We observe that most values are lower than one, suggesting that synthetic feature visualization has changed. It is also important to observe that the similarity ratio of the circuit head (see the ending point of each curve) collapses to one, which means that in general, there is negligible change in the synthetic feature visualizations of the circuit head.
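The CT1 measurement can be illustrated on a toy model. In the sketch below, the circuit head is a hypothetical linear unit (a weighted sum of four input kernels), pruning keeps the top fraction of kernels by attribution score, and sparsity denotes the fraction of kernels kept, so that sparsity 1 is the unpruned head. All names, weights and inputs are assumptions for illustration; this is not the circuit discovery procedure of Hamblin et al. (2022).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def top_kernels(attributions, sparsity):
    """Indexes of the top `sparsity` fraction of kernels by attribution score."""
    n_keep = max(1, round(sparsity * len(attributions)))
    order = sorted(range(len(attributions)),
                   key=lambda k: attributions[k], reverse=True)
    return order[:n_keep]


def head_activation(inputs, weights, keep):
    """Toy circuit head: weighted sum over the kept kernels only."""
    return sum(weights[k] * inputs[k] for k in keep)


weights = [2.0, 1.0, 0.1, 0.05]            # hypothetical kernel weights
attributions = [abs(w) for w in weights]   # stand-in attribution scores
images = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1], [2, 1, 1, 0]]

full = [head_activation(x, weights, top_kernels(attributions, 1.0)) for x in images]
pruned = [head_activation(x, weights, top_kernels(attributions, 0.5)) for x in images]

r_half = pearson(pruned, full)
r_full = pearson(full, full)
print(r_half, r_full)
```

In this toy setting, pruning the two weakest kernels barely changes the head's responses, so the correlation with the unpruned head stays close to one; this is the sense in which a sparse circuit is said to preserve the functional behavior of its head.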

6 Conclusion and Limitations

This paper proposes a manipulation technique called ProxPulse that addresses the limitations of previous works by showing that both types of feature visualizations can be simultaneously manipulated. However, when analyzing ProxPulse within the framework of circuits, which are key components in mechanistic interpretability, we discover that circuits show some robustness against ProxPulse manipulations. We therefore introduce another attack that reveals the manipulability of circuits. We provide experimental evidence of the effectiveness of these attacks using a variety of correlation and similarity metrics. Our attack on circuits sheds light on the lack of uniqueness and stability of circuit-based interpretations. We also observe a decrease in manipulation success when trying to attack several circuits simultaneously without degradation in accuracy. Finally, further studies are needed to provide defense mechanisms and robust circuit discovery methods.

References

  • Adebayo et al. [2018] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. Advances in neural information processing systems, 31, 2018.
  • Aïvodji et al. [2021] Ulrich Aïvodji, Hiromi Arai, Sébastien Gambs, and Satoshi Hara. Characterizing the risk of fairwashing. Advances in Neural Information Processing Systems, 34:14822–14834, 2021.
  • Anders et al. [2020] Christopher Anders, Plamen Pasliev, Ann-Kathrin Dombrowski, Klaus-Robert Müller, and Pan Kessel. Fairwashing explanations with off-manifold detergent. In International Conference on Machine Learning, pages 314–323. PMLR, 2020.
  • Bareeva et al. [2024] Dilyara Bareeva, Marina M-C Höhne, Alexander Warnecke, Lukas Pirch, Klaus-Robert Müller, Konrad Rieck, and Kirill Bykov. Manipulating feature visualizations with gradient slingshots. arXiv preprint arXiv:2401.06122, 2024.
  • Bau et al. [2020] David Bau, Jun-Yan Zhu, Hendrik Strobelt, Agata Lapedriza, Bolei Zhou, and Antonio Torralba. Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences, 117(48):30071–30078, 2020.
  • Chughtai et al. [2023] Bilal Chughtai, Lawrence Chan, and Neel Nanda. A toy model of universality: Reverse engineering how networks learn group operations. In International Conference on Machine Learning, pages 6243–6267. PMLR, 2023.
  • Conmy et al. [2023] Arthur Conmy, Augustine N Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. arXiv preprint arXiv:2304.14997, 2023.
  • Dai et al. [2022] Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8493–8502, 2022.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  • Dombrowski et al. [2019] Ann-Kathrin Dombrowski, Maximillian Alber, Christopher Anders, Marcel Ackermann, Klaus-Robert Müller, and Pan Kessel. Explanations can be manipulated and geometry is to blame. Advances in neural information processing systems, 32, 2019.
  • Foret et al. [2020] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2020.
  • Geirhos et al. [2023] Robert Geirhos, Roland S Zimmermann, Blair Bilodeau, Wieland Brendel, and Been Kim. Don’t trust your eyes: on the (un) reliability of feature visualizations. arXiv preprint arXiv:2306.04719, 2023.
  • Hamblin et al. [2022] Chris Hamblin, Talia Konkle, and George Alvarez. Pruning for interpretable, feature-preserving circuits in cnns. arXiv preprint arXiv:2206.01627, 2022.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Heo et al. [2019] Juyeon Heo, Sunghwan Joo, and Taesup Moon. Fooling neural network interpretations via adversarial model manipulation. Advances in Neural Information Processing Systems, 32, 2019.
  • Hernandez et al. [2022] Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob Andreas. Natural language descriptions of deep visual features. In International Conference on Learning Representations, 2022.
  • Hinton et al. [2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. stat, 1050:9, 2015.
  • Huang et al. [2017] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • Hubel and Wiesel [1962] David H Hubel and Torsten N Wiesel. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of physiology, 160(1):106, 1962.
  • Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
  • Lee et al. [2018] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. Snip: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2018.
  • Mahendran and Vedaldi [2015] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5188–5196, 2015.
  • Meng et al. [2022] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems, 35:17359–17372, 2022.
  • Nanfack et al. [2024] Geraldin Nanfack, Alexander Fulleringer, Jonathan Marty, Michael Eickenberg, and Eugene Belilovsky. Adversarial attacks on the interpretation of neuron activation maximization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4315–4324, 2024.
  • Oikarinen and Weng [2022] Tuomas Oikarinen and Tsui-Wei Weng. Clip-dissect: Automatic description of neuron representations in deep vision networks. arXiv preprint arXiv:2204.10965, 2022.
  • Olah et al. [2017] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2017. https://distill.pub/2017/feature-visualization.
  • Olah et al. [2020] Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 5(3):e00024–001, 2020.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • Rony et al. [2019] Jérôme Rony, Luiz G Hafemann, Luiz S Oliveira, Ismail Ben Ayed, Robert Sabourin, and Eric Granger. Decoupling direction and norm for efficient gradient-based l2 adversarial attacks and defenses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4322–4330, 2019.
  • Rudner and Toner [2021] Tim GJ Rudner and Helen Toner. Key concepts in ai safety: an overview. Comput. Secur. J. doi, 10:20190040, 2021.
  • Wang et al. [2022] Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. In The Eleventh International Conference on Learning Representations, 2022.
  • Wäschle et al. [2022] Moritz Wäschle, Florian Thaler, Axel Berres, Florian Pölzlbauer, and Albert Albers. A review on ai safety in highly automated driving. Frontiers in Artificial Intelligence, 5:952773, 2022.
  • Yosinski et al. [2015] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.
  • Yu [2013] Bin Yu. Stability. Bernoulli, pages 1484–1500, 2013.
  • Zeiler and Fergus [2014] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
  • Zimmermann et al. [2021] Roland S Zimmermann, Judy Borowski, Robert Geirhos, Matthias Bethge, Thomas Wallis, and Wieland Brendel. How well do feature visualizations support causal understanding of cnn activations? Advances in Neural Information Processing Systems, 34:11730–11744, 2021.
  • Zimmermann et al. [2023] Roland S Zimmermann, Thomas Klein, and Wieland Brendel. Scale alone does not improve mechanistic interpretability in vision models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

Appendix A Supplemental Material

A.1 Broader Impact

Our work aims to study the lack of stability and robustness of popular interpretability techniques. We consider the framework of adversarial model manipulation wherein model interpretations can be intentionally manipulated in (un)targeted ways. Demonstrating this manipulability, unfortunately, highlights the risk of individuals exploiting this knowledge to deploy models whose interpretations are obfuscated. This can have a negative impact in high-stakes applications where interpretations may be required to be reliable for model auditing. However, we believe that acknowledging and understanding these risks is a crucial first step in addressing vulnerabilities of interpretability techniques.

A.2 Further Experimental Details

We were inspired by the experimental setups of Nanfack et al. [2024] and Hamblin et al. [2022] to choose the models and the hyperparameters for visual circuit discovery. The choice of the model and most experimental settings follow Nanfack et al. [2024], while the circuit discovery method and its hyperparameters were taken from Hamblin et al. [2022], using their source code. The hyperparameters of our method, specifically the values of $\rho$ and $C$, were inspired by the adversarial robustness literature (with the $l_2$ norm). In particular, we set $\rho = 0.02 \approx 5/255$, inspired by the adversarial literature [Rony et al., 2019], and $C = 10^{6}$, which is $\approx 10^{3}$ times higher than the empirically observed activations of initial synthetic images (in Eq. 5, $C$ controls the magnitude of activations in the manipulated synthetic images). We set the hyperparameters $\alpha = 0.1$ and $\beta = 0.01$ such that the fooling loss and the maintain loss have similar scales. For the CircuitBreaker manipulation, we push down the ranks of the top-50 channels for each layer preceding the circuit head.

To run our experiments, we use a computer equipped with an NVIDIA GeForce RTX 3090 GPU. Each of our attacks runs for fewer than 5 epochs and requires two forward passes per batch, to estimate the attack loss and the maintain loss.

A.3 Derivation of the Loss Function

This section derives the expression of the ProxPulse loss. Drawing inspiration from Foret et al. [2020], we derive the expression of Eq. 6 by first writing,

$$
\mathcal{L}_{\text{F}}(\mathcal{D}_{\text{fool}};\bm{\theta}) = \sum_{j,\,\bm{x}^{*}\in\mathcal{D}_{\text{fool}}} \max_{\|\bm{x}-\bm{x}^{*}\|\leq\rho} \ell_{j}(\bm{x};\bm{\theta}). \tag{8}
$$

Second, given that,

$$
\begin{split}
\arg\max_{\|\bm{x}-\bm{x}^{*}\|\leq\rho} \ell_{j}(\bm{x};\bm{\theta})
&= \arg\max_{\|\bm{\epsilon}\|\leq\rho} \ell_{j}(\bm{x}^{*}+\bm{\epsilon};\bm{\theta}) \\
&\approx \arg\max_{\|\bm{\epsilon}\|\leq\rho} \ell_{j}(\bm{x}^{*};\bm{\theta}) + \bm{\epsilon}^{T}\nabla_{\bm{x}}\ell_{j}(\bm{x}^{*};\bm{\theta}) \quad \text{(using first-order Taylor expansion)} \\
&= \arg\max_{\|\bm{\epsilon}\|\leq\rho} \bm{\epsilon}^{T}\nabla_{\bm{x}}\ell_{j}(\bm{x}^{*};\bm{\theta}) \\
&= \rho\,\frac{\nabla_{\bm{x}}\ell_{j}(\bm{x}^{*};\bm{\theta})}{\|\nabla_{\bm{x}}\ell_{j}(\bm{x}^{*};\bm{\theta})\|}.
\end{split}
\tag{9}
$$

Finally, plugging this approximation into Eq. 8 recovers Eq. 6.
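
The last step of Eq. 9 follows from the Cauchy-Schwarz inequality: over the ball of radius ρ, a linear function of the perturbation is maximized by the rescaled gradient direction. The following is a minimal numerical sketch of that step only, not of the full attack. The gradient values and the helper names (`linearized_ball_argmax`, `random_point_in_ball`) are hypothetical and chosen purely for illustration.

```python
import math
import random


def dot(u, v):
    """Euclidean inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def linearized_ball_argmax(grad, rho):
    """Return rho * grad / ||grad||, the maximizer of eps^T grad over ||eps|| <= rho."""
    norm = math.sqrt(dot(grad, grad))
    if norm == 0.0:
        # With a zero gradient every eps gives the same value; return no perturbation.
        return [0.0 for _ in grad]
    return [rho * g / norm for g in grad]


def random_point_in_ball(dim, rho, rng):
    """Sample some point with norm <= rho (not uniform; only used as a sanity check)."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(dot(direction, direction))
    radius = rho * rng.random()
    return [radius * d / norm for d in direction]


# Hypothetical gradient of ell_j at x*, and a hypothetical radius rho.
grad = [0.3, -1.2, 0.5]
rho = 0.1

eps_star = linearized_ball_argmax(grad, rho)
best_value = dot(eps_star, grad)  # equals rho * ||grad||

# No sampled perturbation inside the ball should beat the closed-form solution.
rng = random.Random(0)
assert all(
    dot(random_point_in_ball(len(grad), rho, rng), grad) <= best_value + 1e-12
    for _ in range(1000)
)
```

The check only covers the linearized objective; how well the closed form approximates the true maximizer of ℓ_j depends on how accurate the first-order expansion is within the radius ρ.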

Figure 8: Target images ($\mathcal{D}_{\text{fool}}$) for ProxPulse, taken from the ImageNet-21k dataset.
Figure 9: Synthetic images after ProxPulse on conv5 of AlexNet.
Figure 10: Initial synthetic images of conv5 of AlexNet.
Figure 11: Synthetic images after ProxPulse on conv4 of AlexNet.
Figure 12: Initial synthetic images of conv4 of AlexNet.

A.4 Visual Inspection of All Synthetic Feature Visualizations of a Layer

Fig. 10 and Fig. 9 respectively show all synthetic feature visualizations generated on layer conv5 of the initial model (i.e., before ProxPulse) and of the final model (i.e., after ProxPulse). Fig. 12 and Fig. 11 show the same for layer conv4. Except for the noisy visualizations (which are sometimes simply the random initialization), all synthetic images have been replaced with visually similar ones. Note that the appearance of noisy images is orthogonal to our manipulation, because the initial synthetic feature visualizations also contain noisy images (see Fig. 10 and Fig. 12).

A.5 Results for ProxPulse on ResNet-50

We repeat the experiments of Section 5.1 with a different model to demonstrate that the ProxPulse attack also works on other architectures. More specifically, we run the same experiment on layer1.0.conv2 of ResNet-50 and report the results in Fig. 13 and Fig. 14. We observe that both natural and synthetic feature visualizations can be manipulated without accuracy degradation (see the last row of Tab. 2, which confirms that the model fine-tuned with ProxPulse reaches the same level of accuracy as the initial ResNet-50, namely 80.3%).

Figure 13: Illustration of the manipulability of both natural and synthetic feature visualizations on layer1.0.conv2 of ResNet-50.
Figure 14: Histogram of pairwise cosine similarities between CLIP features of non-noisy synthetic feature visualizations before (red) and after (blue) the ProxPulse manipulation. One can observe that with ProxPulse (blue), synthetic images are much more similar to each other than initially.
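
As an illustration of the statistic summarized in Fig. 14, the sketch below computes all pairwise cosine similarities between a set of feature vectors and bins them into a histogram. It assumes the CLIP features have already been extracted; the toy vectors, the function names, and the binning choices are hypothetical and do not come from the paper.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_cosine_similarities(features):
    """Cosine similarity for every unordered pair of feature vectors."""
    return [cosine_similarity(u, v) for u, v in combinations(features, 2)]


def histogram(values, num_bins=10, low=-1.0, high=1.0):
    """Count values per equal-width bin on [low, high]; out-of-range values are clamped."""
    counts = [0] * num_bins
    width = (high - low) / num_bins
    for value in values:
        index = int((value - low) / width)
        index = max(0, min(index, num_bins - 1))
        counts[index] += 1
    return counts


# Toy stand-ins for image embeddings (NOT real CLIP features).
features_before = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.1], [0.3, 0.2, 1.0]]
features_after = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.0, 0.0, 0.1]]

print(histogram(pairwise_cosine_similarities(features_before)))
print(histogram(pairwise_cosine_similarities(features_after)))
```

With real embeddings, a histogram concentrated near 1 after the manipulation would correspond to the observation that the manipulated synthetic images are much more similar to each other.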

A.6 Ablation for the Use of a Single Target in ProxPulse Manipulation

This section motivates the use of two target images in ProxPulse by ablating the attack down to a single target image. Fig. 15 shows that, with one target image, some of the final synthetic images are not substantially changed, which motivates the use of two target images.

Figure 15: Final synthetic images with one target image.

A.7 Further Experiments on the Non-Effectiveness of ProxPulse to Attack Visual Circuits

Fig. 16 and Fig. 18 further illustrate the ineffectiveness of ProxPulse at attacking visual circuits. It can be observed from Fig. 16 that at least one-half of the channels stay on the circuit after ProxPulse. We also observe from Fig. 18 that the rank correlations between the kernel attribution scores used for circuit discovery remain high, suggesting that ProxPulse has little impact on the circuit discovery method.
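
To make the rank-correlation comparison concrete, the sketch below computes a Kendall rank correlation between two lists of attribution scores, one for the initial model and one for the fine-tuned model. The source does not say which Kendall variant is used; this sketch assumes the tau-b variant, which corrects for ties. The score values and the function name are hypothetical.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall tau-b rank correlation between two equally long score lists."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x = tied_y = 0
    for (x_i, y_i), (x_j, y_j) in combinations(list(zip(x, y)), 2):
        dx = x_i - x_j
        dy = y_i - y_j
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        product = dx * dy
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denominator = math.sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    if denominator == 0:
        raise ValueError("rank correlation is undefined when one list is constant")
    return (concordant - discordant) / denominator


# Hypothetical kernel attribution scores for one layer, before and after fine-tuning.
initial_scores = [0.90, 0.40, 0.75, 0.10, 0.55]
final_scores = [0.85, 0.35, 0.80, 0.20, 0.50]
print(kendall_tau_b(initial_scores, final_scores))
```

A value close to 1 means that the two models rank the kernels almost identically, which is the situation the text describes for ProxPulse.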

(a) Initial circuit with sparsity: 1.
(b) Final circuit with sparsity: 1.
(c) Final circuit with sparsity: 0.3.
(d) Final circuit with sparsity: 0.1.
Figure 16: Illustration of the ineffectiveness of ProxPulse at manipulating the circuit. We show visual circuits for circuit head conv4:2 of AlexNet before and after the ProxPulse manipulation, at three different sparsity levels. Although the synthetic feature visualization of the circuit head has completely changed, the circuit is almost unchanged, since at least one-half of the channels per layer stay on the circuit after the ProxPulse manipulation. Another observation is that reducing the sparsity reduces the effect of the ProxPulse manipulation, confirming that ProxPulse makes only a minor modification to the network.
Figure 17: Correlation of attribution scores between the initial and the final (fine-tuned with ProxPulse) model. We plot the average over 10 randomly chosen (heads of) circuits from conv5. We observe that the ProxPulse manipulation does not fool the attribution scores used for circuit discovery.
Figure 18: Correlation of attribution scores between the initial and the final (fine-tuned with ProxPulse) model. We plot the average over 10 randomly chosen (heads of) circuits from features.8 (conv4). We observe that the ProxPulse manipulation does not fool the attribution scores used for circuit discovery, as the rank correlations are still high.

A.8 Further Visualizations of CircuitBreaker on AlexNet

(a) With initial model.
(b) After CircuitBreaker.
Figure 19: Illustration of the effectiveness of CircuitBreaker to manipulate visual circuits on features.8 (conv4) of AlexNet.

Fig. 20, Fig. 21 and Fig. 22 allow a visual inspection of how effectively CircuitBreaker fools the initial circuit. We observe that a non-negligible component of the feature head is preserved, while most of the initially top-attributed channels are removed from the circuit after CircuitBreaker.

(a) Initial circuit.
(b) Final circuit.
Figure 20: Illustration of the effectiveness of CircuitBreaker to manipulate the circuit on conv5 of AlexNet.
(a) Initial circuit.
(b) Final circuit.
Figure 21: Illustration of the effectiveness of CircuitBreaker to manipulate the circuit.
(a) Initial circuit.
(b) Final circuit.
Figure 22: Illustration of the effectiveness of CircuitBreaker to manipulate the circuit.

A.9 Ablation for Sparsity for CircuitBreaker

Fig. 23, Fig. 24 and Fig. 25 show different circuits with different sparsity levels. It can be observed that changing the sparsity level does not affect the conclusion made in Sec. 5.3.
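
The role of the sparsity level can be sketched as follows, under the assumption (consistent with the figure captions of this appendix, where sparsity 1 denotes no pruning) that a sparsity level s keeps the top fraction s of channels ranked by attribution score. The scores, the rounding rule, and the function names are hypothetical and are not taken from the circuit discovery method itself.

```python
import math


def circuit_channels(scores, sparsity):
    """Indices of the top ceil(sparsity * n) channels ranked by attribution score."""
    if not 0.0 < sparsity <= 1.0:
        raise ValueError("sparsity must lie in (0, 1]")
    keep = max(1, math.ceil(sparsity * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:keep])


def retained_fraction(initial_scores, final_scores, sparsity):
    """Fraction of the initial circuit's channels still present in the final circuit."""
    initial = circuit_channels(initial_scores, sparsity)
    final = circuit_channels(final_scores, sparsity)
    return len(initial & final) / len(initial)


# Hypothetical attribution scores of one layer before and after a manipulation.
initial_scores = [0.10, 0.90, 0.50, 0.30, 0.70, 0.20]
final_scores = [0.80, 0.15, 0.60, 0.25, 0.05, 0.90]

for sparsity in (1.0, 0.5):
    print(sparsity, retained_fraction(initial_scores, final_scores, sparsity))
```

A retained fraction near 1 indicates that the circuit is essentially unchanged at that sparsity level, whereas a small value indicates that most of the initially top-attributed channels have left the circuit.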

(a) Initial circuit.
(b) Final circuit.
Figure 23: Illustration of the effectiveness of CircuitBreaker to manipulate the circuit: ablation on the sparsity level.
(a) Initial circuit.
(b) Final circuit.
Figure 24: Illustration of the effectiveness of CircuitBreaker to manipulate the circuit: ablation on the sparsity level.
(a) Initial circuit.
(b) Final circuit.
Figure 25: Illustration of the effectiveness of CircuitBreaker to manipulate the circuit: ablation on the sparsity level.
Figure 26: Similarity ratio on synthetic feature visualization: ablation on the sparsity level.
Figure 27: Final accuracy after fine-tuning with CircuitBreaker on AlexNet. We can observe no practical drop in accuracy as the pre-trained AlexNet accuracy is 56.52%.

A.10 Results for CircuitBreaker on ResNet-50

(a) Initial circuit.
(b) Final circuit.
Figure 28: Illustration of the effectiveness of CircuitBreaker to manipulate the circuit: ablation on the sparsity level.

Fig. 28 shows ablation results on visual circuits for the ResNet-50 model, with a circuit head on layer1.0.conv2. It can be observed that the synthetic visualization of the final circuit head shares some similarities with the initial one. However, the preceding channels differ substantially after CircuitBreaker.

A.11 Additional Results on DenseNet-201 and ResNet-152 for ProxPulse Attack

We present additional results for the ProxPulse attack on DenseNet-201 in Fig. 29 and on ResNet-152 in Fig. 31. We can see that both types of feature visualizations (natural and synthetic images) are simultaneously manipulated, and that these visualizations share some visual similarity with the target images.

Figure 29: Illustration of the manipulability of both natural and synthetic feature visualizations using ProxPulse on Block_3_Layer_12_conv2 of DenseNet-201. The manipulated model has an accuracy of 76.52% (vs. 76.9% for the initial model): the drop in accuracy is less than 0.4%. The first row (resp. second row) shows the initial (resp. final) natural feature visualizations and the initial (resp. final) synthetic feature visualizations. In the image titles, we report the corresponding metrics used to evaluate the change in top-activating inputs. One can observe that both natural and synthetic feature visualizations have completely changed, and that the final synthetic ones are very similar to each other. Target images are shown in Fig. 30.
Figure 30: Target images ($\mathcal{D}_{\text{fool}}$) for ProxPulse on ResNet-152: New York and Vienna images taken from the Wikipedia and Cntraveller websites.
Figure 31: Illustration of the manipulability of both natural and synthetic feature visualizations using ProxPulse on Layer_4_2_conv2 of ResNet-152. The manipulated model has an accuracy of 82.27% (vs. 82.284% for the initial model): the drop in accuracy is less than 0.1%. The first row (resp. second row) shows the initial (resp. final) natural feature visualizations and the initial (resp. final) synthetic feature visualizations. In the image titles, we report the corresponding metrics used to evaluate the change in top-activating inputs. One can observe that both natural and synthetic feature visualizations have completely changed, and that the final synthetic ones are very similar to each other (except for channel 0). Target images are shown in Fig. 30.

A.12 Additional Results on Simultaneously Fooling Several Circuits with Feature Heads on features.8 (conv4) of AlexNet

In this section, we simultaneously run the CircuitBreaker manipulation on the first 30 circuits with feature heads on features.8 (conv4) of AlexNet.

Following the criteria evaluated in Section 5.3 of the paper, we make the following observations, which are similar to the results reported there. First, Fig. 32(a) again shows high functional preservation at moderate to high sparsity levels.

Second, we computed the accuracy of the final (perturbed) model, which was 55.83% (a drop of less than 0.7%, as the initial accuracy of AlexNet is 56.52%), indicating that the final model has a similar performance to the initial model.

Third, from Fig. 32(b), we observe that the Kendall-$\tau$ rank correlations for layers before the feature heads are around 0.6, which indicates that our manipulation has indeed decreased the correlation between the attribution scores used for circuit discovery. However, we note that the manipulation was less effective than in the results reported in the paper (independent manipulation).

Fourth, as seen in Fig. 32(c), we observe that the similarity ratio is usually less than 1. This indicates that the synthetic feature visualizations have changed in the manipulated circuits. Note that a similarity ratio equal to 1 on the feature heads means that their synthetic feature visualizations have almost not changed.

Finally, Fig. 33 and Fig. 34 depict two of the simultaneously manipulated circuits. While the first circuit (Fig. 33) has undergone some changes (most easily seen by comparing layer by layer, in particular features.3), the second one (Fig. 34) has changed only marginally.
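
The functional-preservation criterion reported in Fig. 32(a) is a Pearson correlation between activations. The sketch below shows how such a correlation could be computed from two lists of feature-head activations recorded on the same inputs, one from the unpruned initial model and one from a pruned circuit. The activation values and the function name are hypothetical.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    if variance_x == 0.0 or variance_y == 0.0:
        raise ValueError("correlation is undefined for constant inputs")
    return covariance / math.sqrt(variance_x * variance_y)


# Hypothetical feature-head activations on the same five inputs.
full_model_activations = [0.2, 1.5, 0.9, 3.1, 2.4]
pruned_circuit_activations = [0.1, 1.4, 1.1, 2.8, 2.5]
print(pearson_correlation(full_model_activations, pruned_circuit_activations))
```

A correlation close to 1 means that the pruned circuit reproduces the behavior of the feature head on those inputs, which is the sense of "functional preservation" used above.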

(a) Pearson correlation between activations on circuits (with pruning) for (i) the considered model and (ii) the initial model without structured pruning, i.e., with sparsity 1.
(b) Rank correlation between kernel attribution scores for circuits on (i) the initial model and (ii) the model fine-tuned with CircuitBreaker.
(c) Similarity ratio with CircuitBreaker.
Figure 32: Results obtained when simultaneously fooling 30 circuits with heads on features.8 (conv4) of AlexNet.
(a) With initial model.
(b) After CircuitBreaker.
Figure 33: Illustration of the effectiveness of CircuitBreaker to manipulate visual circuits on features.8 (conv4) of AlexNet. We observe that the circuit visualization is severely distorted while the network outputs change minimally.
(a) With initial model.
(b) After CircuitBreaker.
Figure 34: Illustration of the effectiveness of CircuitBreaker to manipulate visual circuits on features.8 (conv4) of AlexNet. We observe that the circuit visualization is severely distorted while the network outputs change minimally.
Figure 35: Accuracy drop per class. We observe a significant drop in only a few classes.