\correspondingauthor

AtP ${}^{*}$ : An efficient and scalable method for localizing LLM behaviour to components

János Kramár Google DeepMind Tom Lieberum Google DeepMind Rohin Shah Google DeepMind Neel Nanda Google DeepMind

Abstract

Activation Patching is a method of directly computing causal attributions of behavior to model components. However, applying it exhaustively requires a sweep with cost scaling linearly in the number of model components, which can be prohibitively expensive for SoTA Large Language Models (LLMs). We investigate Attribution Patching (AtP) (Nanda, 2022), a fast gradient-based approximation to Activation Patching and find two classes of failure modes of AtP which lead to significant false negatives.
We propose a variant of AtP called AtP ${}^{*}$ , with two changes to address these failure modes while retaining scalability. We present the first systematic study of AtP and alternative methods for faster activation patching and show that AtP significantly outperforms all other investigated methods, with AtP ${}^{*}$ providing further significant improvement. Finally, we provide a method to bound the probability of remaining false negatives of AtP ${}^{*}$ estimates.

1 Introduction

As LLMs become ubiquitous and integrated into numerous digital applications, it’s an increasingly pressing research problem to understand the internal mechanisms that underlie their behaviour – this is the problem of mechanistic interpretability. A fundamental subproblem is to causally attribute particular behaviours to individual parts of the transformer forward pass, corresponding to specific components (such as attention heads, neurons, layer contributions, or residual streams), often at specific positions in the input token sequence. This is important because in numerous case studies of complex behaviours, they are found to be driven by sparse subgraphs within the model (Olsson et al., 2022; Wang et al., 2022; Meng et al., 2023).

A classic form of causal attribution uses zero-ablation, or knock-out, where a component is deleted and we see if this negatively affects a model’s output – a negative effect implies the component was causally important. More recent work has generalised this to replacing a component’s activations with samples from some baseline distribution (with zero-ablation being a special case where activations are resampled to be zero). We focus on the popular and widely used method of Activation Patching (also known as causal mediation analysis) (Geiger et al., 2022; Meng et al., 2023; Chan et al., 2022) where the baseline distribution is a component’s activations on some corrupted input, such as an alternate string with a different answer (Pearl, 2001; Robins and Greenland, 1992).

Given a causal attribution method, it is common to sweep across all model components, directly evaluating the effect of intervening on each of them via resampling (Meng et al., 2023). However, when working with SoTA models it can be expensive to attribute behaviour especially to small components (e.g. heads or neurons) – each intervention requires a separate forward pass, and so the number of forward passes can easily climb into the millions or billions. For example, on a prompt of length 1024, there are $2.7\cdot 10^{9}$ neuron nodes in Chinchilla 70B (Hoffmann et al., 2022).

We propose to accelerate this process by using Attribution Patching (AtP) (Nanda, 2022), a faster, approximate, causal attribution method, as a prefiltering step: after running AtP, we iterate through the nodes in decreasing order of absolute value of the AtP estimate, then use Activation Patching to more reliably evaluate these nodes and filter out false positives – we call this verification. We typically care about a small set of top contributing nodes, so verification is far cheaper than iterating over all nodes.

Our contributions:

•
We investigate the performance of AtP, finding two classes of failure modes which produce false negatives. We propose a variant of AtP called AtP ${}^{*}$ , with two changes to address these failure modes while retaining scalability:
- –
  
  When patching queries and keys, recomputing the attention softmax and using a gradient based approximation from then on, as gradients are a poor approximation to saturated attention.
- –
  
  Using dropout on the backwards pass to fix brittle false negatives, where significant positive and negative effects cancel out.
•

We introduce several alternative methods to approximate Activation Patching as baselines to AtP which outperform brute force Activation Patching.
•

We present the first systematic study of AtP and these alternatives and show that AtP significantly outperforms all other investigated methods, with AtP ${}^{*}$ providing further significant improvement.
•

To estimate the residual error of AtP ${}^{*}$ and statistically bound the sizes of any remaining false negatives we provide a diagnostic method, based on using AtP to filter out high impact nodes, and then patching random subsets of the remainder. Good diagnostics mean that practitioners may still gauge whether AtP is reliable in relevant domains without the costs of exhaustive verification.

Finally, we provide some guidance in Section 5.4 on how to successfully perform causal attribution in practice and what attribution methods are likely to be useful and under what circumstances.

Refer to caption — (a) MLP neurons, on CITY-PP.

2 Background

2.1 Problem Statement

Our goal is to identify the contributions to model behavior by individual model components. We first formalize model components, then formalize model behaviour, and finally state the contribution problem in causal language. While we state the formalism in terms of a decoder-only transformer language model (Vaswani et al., 2017; Radford et al., 2018), and conduct all our experiments on models of that class, the formalism is also straightforwardly applicable to other model classes.

Model components.

We are given a model $\mathcal{M}:X\rightarrow\mathbb{R}^{V}$ that maps a prompt (token sequence) $x\in X:=\{1,\ldots,V\}^{T}$ to output logits over a set of $V$ tokens, aiming to predict the next token in the sequence. We will view the model $\mathcal{M}$ as a computational graph $(N,E)$ where the node set $N$ is the set of model components, and a directed edge $e=(n_{1},n_{2})\in E$ is present iff the output of $n_{1}$ is a direct input into the computation of $n_{2}$ . We will use $n(x)$ to represent the activation (intermediate computation result) of $n$ when computing $\mathcal{M}(x)$ .

The choice of $N$ determines how fine-grained the attribution will be. For example, for transformer models, we could have a relatively coarse-grained attribution where each layer is considered a single node. In this paper we will primarily consider more fine-grained attributions that are more expensive to compute (see Section 4 for details); we revisit this issue in Section 5.

Model behaviour.

Following past work (Geiger et al., 2022; Chan et al., 2022; Wang et al., 2022), we assume a distribution $\mathcal{D}$ over pairs of inputs $x^{\text{clean}},x^{\text{noise}}$ , where $x^{\text{clean}}$ is a prompt on which the behaviour occurs, and $x^{\text{noise}}$ is a reference prompt which we use as a source of noise to intervene with¹¹1This precludes interventions which use activation values that are never actually realized, such as zero-ablation or mean ablation. An alternative formulation via distributions of activation values is also possible.. We are also given a metric²²2Common metrics in language models are next token prediction loss, difference in log prob between a correct and incorrect next token, probability of the correct next token, etc. $\mathcal{L}:\mathbb{R}^{V}\rightarrow\mathbb{R}$ , which quantifies the behaviour of interest.

Contribution of a component.

Similarly to the work referenced above we define the contribution $c(n)$ of a node $n$ to the model’s behaviour as the counterfactual absolute³³3The sign of the impact may be of interest, but in this work we’ll focus on the magnitude, as a measure of causal importance. expected impact of replacing that node on the clean prompt with its value on the reference prompt $x^{\text{noise}}$ .

Using do-calculus notation (Pearl, 2000) this can be expressed as $c(n):=|\mathcal{I}(n)|$ , where

\displaystyle\mathcal{I}(n)

\displaystyle:=\mathbb{E}_{(x^{\text{clean}},x^{\text{noise}})\sim\mathcal{D}}% \left[\mathcal{I}(n;x^{\text{clean}},x^{\text{noise}})\right],

(1)

where we define the intervention effect $\mathcal{I}$ for $x^{\text{clean}},x^{\text{noise}}$ as

\displaystyle\mathcal{I}(n;x^{\text{clean}},x^{\text{noise}})

\displaystyle:=\mathcal{L}(\mathcal{M}(x^{\text{clean}}\mid\operatorname{do}(n% \leftarrow n(x^{\text{noise}}))))-\mathcal{L}(\mathcal{M}(x^{\text{clean}})).

(2)

Note that the need to average the effect across a distribution adds a potentially large multiplicative factor to the cost of computing $c(n)$ , further motivating this work.

We can also intervene on a set of nodes $\eta=\{n_{i}\}$ . To do so, we overwrite the values of all nodes in $\eta$ with their values from a reference prompt. Abusing notation, we write $\eta(x)$ as the set of activations of the nodes in $\eta$ , when computing $\mathcal{M}(x)$ .

\displaystyle\mathcal{I}(\eta;x^{\text{clean}},x^{\text{noise}})

\displaystyle:=\mathcal{L}(\mathcal{M}(x^{\text{clean}}\mid\operatorname{do}(% \eta\leftarrow\eta(x^{\text{noise}}))))-\mathcal{L}(\mathcal{M}(x^{\text{clean% }}))

(3)

We note that it is also valid to define contribution as the expected impact of replacing a node on the reference prompt with its value on the clean prompt, also known as denoising or knock-in. We follow Chan et al. (2022); Wang et al. (2022) in using noising, however denoising is also widely used in the literature (Meng et al., 2023; Lieberum et al., 2023). We briefly consider how this choice affects AtP in Section 5.2.

2.2 Attribution Patching

On state of the art models, computing $c(n)$ for all $n$ can be prohibitively expensive as there may be billions or more nodes. Furthermore, to compute this value precisely requires evaluating it on all prompt pairs, thus the runtime cost of Equation 1 for each $n$ scales with the size of the support of $\mathcal{D}$ .

We thus turn to a fast approximation of Equation 1. As suggested by Nanda (2022); Figurnov et al. (2016); Molchanov et al. (2017), we can make a first-order Taylor expansion to $\mathcal{I}(n;x^{\text{clean}},x^{\text{noise}})$ around $n(x^{\text{noise}})\approx n(x^{\text{clean}})$ :

\displaystyle\hat{\mathcal{I}}_{\text{AtP}}(n;x^{\text{clean}},x^{\text{noise}})

\displaystyle:=(n(x^{\text{noise}})-n(x^{\text{clean}}))^{\intercal}\frac{% \partial\mathcal{L}(\mathcal{M}(x^{\text{clean}}))}{\partial n}\Big{|}_{n=n(x^% {\text{clean}})}

(4)

Then, similarly to Syed et al. (2023), we apply this to a distribution by taking the absolute value inside the expectation in Equation 1 rather than outside; this decreases the chance that estimates across prompt pairs with positive and negative effects might erroneously lead to a significantly smaller estimate. (We briefly explore the amount of cancellation behaviour in the true effect distribution in Section B.2.) As a result, we get an estimate

\displaystyle\hat{c}_{\text{AtP}}(n)

\displaystyle:=\mathbb{E}_{x^{\text{clean}},x^{\text{noise}}}\left[\left|\hat{% \mathcal{I}}_{\text{AtP}}(n;x^{\text{clean}},x^{\text{noise}})\right|\right].

(5)

This procedure is also called Attribution Patching (Nanda, 2022) or AtP. AtP requires two forward passes and one backward pass to compute an estimate score for all nodes on a given prompt pair, and so provides a very significant speedup over brute force activation patching.

3 Methods

We now describe some failure modes of AtP and address them, yielding an improved method AtP*. We then discuss some alternative methods for estimating $c(n)$ , to put AtP(*)’s performance in context. Finally we discuss how to combine Subsampling, one such alternative method described in Section 3.3, and AtP* to give a diagnostic to statistically test whether AtP* may have missed important false negatives.

3.1 AtP improvements

We identify two common classes of false negatives occurring when using AtP.

The first failure mode occurs when the preactivation on $x^{\text{clean}}$ is in a flat region of the activation function (e.g. produces a saturated attention weight), but the preactivation on $x^{\text{noise}}$ is not in that region. As is apparent from Equation 4, AtP uses a linear approximation to the ground truth in Equation 1, so if the non-linear function is badly approximated by the local gradient, AtP ceases to be accurate – see Figure 3 for an illustration and Figure 4 which denotes in color the maximal difference in attention observed between prompt pairs, suggesting that this failure mode occurs in practice.

Another, unrelated failure mode occurs due to cancellation between direct and indirect effects: roughly, if the total effect (on some prompt pair) is a sum of direct and indirect effects (Pearl, 2001) $\mathcal{I}(n)=\mathcal{I}^{\text{direct}}(n)+\mathcal{I}^{\text{indirect}}(n)$ , and these are close to cancelling, then a small multiplicative approximation error in $\hat{\mathcal{I}}_{\text{AtP}}^{\text{indirect}}(n)$ , due to non-linearities such as GELU and softmax, can accidentally cause $|\hat{\mathcal{I}}_{\text{AtP}}^{\text{direct}}(n)+\hat{\mathcal{I}}_{\text{% AtP}}^{\text{indirect}}(n)|$ to be orders of magnitude smaller than $|\mathcal{I}(n)|$ .

3.1.1 False negatives from attention saturation

AtP relies on the gradient at each activation being reflective of the true behaviour of the function with respect to intervention at that activation. In some cases, though, a node may immediately feed into a non-linearity whose effect may not be adequately predicted by the gradient; for example, attention key and query nodes feeding into the attention softmax non-linearity. To showcase this, we plot the true rank of each node’s effect against its rank assigned by AtP in Figure 4 (left). The plot shows that there are many pronounced false negatives (below the dashed line), especially among keys and queries.

Normal activation patching for queries and keys involves changing a query or key and then re-running the rest of the model, kee** all else the same. AtP takes a linear approximation to the entire rest of the model rather than re-running it. We propose explicitly re-computing the first step of the rest of the model, i.e. the attention softmax, and then taking a linear approximation to the rest. Formally, for attention key and query nodes, instead of using the gradient on those nodes directly, we take the difference in attention weight caused by that key or query, multiplied by the gradient on the attention weights themselves. This requires finding the change in attention weights from each key and query patch — but that can be done efficiently using (for all keys and queries in total) less compute than two transformer forward passes. This correction avoids the problem of saturated attention, while otherwise retaining the performance of AtP.

Queries

For the queries, we can easily compute the adjusted effect by running the model on $x^{\text{noise}}$ and caching the noise queries. We then run the model on $x^{\text{clean}}$ and cache the attention keys and weights. Finally, we compute the attention weights that result from combining all the keys from the $x^{\text{clean}}$ forward pass with the queries from the $x^{\text{noise}}$ forward pass. This costs approximately as much as the unperturbed attention computation of the transformer forward pass. For each query node $n$ we refer to the resulting weight vector as $\operatorname{attn}(n)_{\text{patch}}$ , in contrast with the weights $\operatorname{attn}(n)(x^{\text{clean}})$ from the clean forward pass. The improved attribution estimate for $n$ is then

	$\displaystyle\hat{\mathcal{I}}^{Q}_{\text{AtPfix}}(n;x^{\text{clean}},x^{\text% {noise}}):={}$	$\displaystyle\sum_{k}\hat{\mathcal{I}}_{\text{AtP}}(\text{attn}(n)_{k};x^{% \text{clean}},x^{\text{noise}})$		(6)
	$\displaystyle={}$	$\displaystyle(\operatorname{attn}(n)_{\text{patch}}-\operatorname{attn}(n)(x^{% \text{clean}}))^{\intercal}\frac{\partial\mathcal{L}(\mathcal{M}(x^{\text{% clean}}))}{\partial\operatorname{attn}(n)}\Big{\|}_{\operatorname{attn}(n)=% \operatorname{attn}(n)(x^{\text{clean}})}$		(7)

Keys

For the keys we first describe a simple but inefficient method. We again run the model on $x^{\text{noise}}$ , caching the noise keys. We also run it on $x^{\text{clean}}$ , caching the clean queries and attention probabilities. Let key nodes for a single attention head be $n^{k}_{1},\dots,n^{k}_{T}$ and let $\operatorname{queries}(n^{k}_{t})=\{n^{q}_{1},\dots,n^{q}_{T}\}$ be the set of query nodes for the same head as node $n^{k}_{t}$ . We then define

	$\displaystyle\operatorname{attn}_{\text{patch}}^{t}(n^{q})$	$\displaystyle:=\operatorname{attn}(n^{q})(x^{\text{clean}}\mid\operatorname{do% }(n^{k}_{t}\leftarrow n^{k}_{t}(x^{\text{noise}})))$		(8)
	$\displaystyle\Delta_{t}\operatorname{attn}(n^{q})$	$\displaystyle:=\operatorname{attn}_{\text{patch}}^{t}(n^{q})-\operatorname{% attn}(n^{q})(x^{\text{clean}})$		(9)

The improved attribution estimate for $n^{k}_{t}$ is then

\displaystyle\hat{\mathcal{I}}_{\text{AtPfix}}^{K}(n^{k}_{t};x^{\text{clean}},% x^{\text{noise}})

\displaystyle:=\sum_{n^{q}\in\operatorname{queries}(n^{k}_{t})}\Delta_{t}% \operatorname{attn}(n^{q})^{\intercal}\frac{\partial\mathcal{L}(\mathcal{M}(x^% {\text{clean}}))}{\partial\operatorname{attn}(n^{q})}\Big{|}_{\operatorname{% attn}(n^{q})=\operatorname{attn}(n^{q})(x^{\text{clean}})}

(10)

However, the procedure we just described is costly to execute as it requires $\operatorname{O}(T^{3})$ flops to naively compute Equation 9 for all $T$ keys. In Section A.2.1 we describe a more efficient variant that takes no more compute than the forward pass attention computation itself (requiring $\operatorname{O}(T^{2})$ flops). Since Equation 6 is also cheaper to compute than a forward pass, the full QK fix requires less than two transformer forward passes (since the latter also includes MLP computations).

For attention nodes we show the effects of applying the query and key fixes in Figure 4 (middle). We observe that the propagation of Q/K effects has a major impact on reducing the false negative rate.

3.1.2 False negatives from cancellation

This form of cancellation occurs when the backpropagated gradient from indirect effects is combined with the gradient from the direct effect. We propose a way to modify the backpropagation within the attribution patching to reduce this issue. If we artificially zero out the gradient at a downstream layer that contributes to the indirect effect, the cancellation is disrupted. (This is also equivalent to patching in clean activations at the outputs of the layer.) Thus we propose to do this iteratively, swee** across the layers. Any node whose effect does not route through the layer being gradient-zeroed will have its estimate unaffected.

We call this method GradDrop. For every layer $\ell\in\{1,\ldots,L\}$ in the model, GradDrop computes an AtP estimate for all nodes, where gradients on the residual contribution from $\ell$ are set to 0, including the propagation to earlier layers. This provides a different estimate for all nodes, for each layer that was dropped. We call the so-modified gradient $\frac{\partial\mathcal{L}^{\ell}}{\partial n}=\frac{\partial\mathcal{L}}{% \partial n}(\mathcal{M}(x^{\text{clean}}\mid\operatorname{do}(n^{\text{out}}_{% \ell}\leftarrow n^{\text{out}}_{\ell}(x^{\text{clean}}))))$ when drop** layer $\ell$ , where $n^{\text{out}}_{\ell}$ is the contribution to the residual stream across all positions. Using $\frac{\partial\mathcal{L}^{\ell}}{\partial n}$ in place of $\frac{\partial\mathcal{L}^{\ell}}{\partial n}$ in the AtP formula produces an estimate $\hat{\mathcal{I}}_{\text{AtP+GD}_{\ell}}(n)$ . Then, the estimates are aggregated by averaging their absolute values, and then scaling by $\frac{L}{L-1}$ to avoid changing the direct-effect path’s contribution (which is otherwise zeroed out when drop** the layer the node is in).

\displaystyle\hat{c}_{\text{AtP+GD}}(n)

\displaystyle:=\mathbb{E}_{x^{\text{clean}},x^{\text{noise}}}\left[\frac{1}{L-% 1}\sum_{\ell=1}^{L}\left|\hat{\mathcal{I}}_{\text{AtP+GD}_{\ell}}(n;x^{\text{% clean}},x^{\text{noise}})\right|\right]

(11)

Note that the forward passes required for computing $\hat{\mathcal{I}}_{\text{AtP+GD}_{\ell}}(n;x^{\text{clean}},x^{\text{noise}})$ don’t depend on $\ell$ , so the extra compute needed for GradDrop is $L$ backwards passes from the same intermediate activations on a clean forward pass. This is also the case with the QK fix: the corrected attributions $\hat{\mathcal{I}}_{\text{AtPfix}}$ are dot products with the attention weight gradients, so the only thing that needs to be recomputed for $\hat{\mathcal{I}}_{\text{AtPfix+GD}_{\ell}}(n)$ is the modified gradient $\frac{\partial\mathcal{L}^{\ell}}{\partial\operatorname{attn}(n)}$ . Thus, computing Equation 11 takes $L$ backwards passes⁴⁴4This can be reduced to $(L+1)/2$ by reusing intermediate results. on top of the costs for AtP.

We show the result of applying GradDrop on attention nodes in Figure 4 (right) and on MLP nodes in Figure 5. In Figure 5, we show the true effect magnitude rank against the AtP+GradDrop rank, while highlighting nodes which improved drastically by applying GradDrop. We give some arguments and intuitions on the benefit of GradDrop in Section A.2.2.

Direct Effect Ratio

To provide some evidence that the observed false negatives are due to cancellation, we compute the ratio between the direct effect $c^{\text{direct}}(n)$ and the total effect $c(n)$ . A higher direct effect ratio indicates more cancellation. We observe that the most significant false negatives corrected by GradDrop in Figure 5 (highlighted) have high direct effect ratios of $5.35$ , $12.2$ , and $0$ (no direct effect) , while the median direct effect ratio of all nodes is $0$ (if counting all nodes) or $0.77$ (if only counting nodes that have direct effect). Note that direct effect ratio is only applicable to nodes which in fact have a direct connection to the output, and not e.g. to MLP nodes at non-final token positions, since all disconnected nodes have a direct effect of 0 by definition.

3.2 Diagnostics

Despite the improvements we have proposed in Section 3.1, there is no guarantee that AtP* produces no false negatives. Thus, it is desirable to obtain an upper confidence bound on the effect size of nodes that might be missed by AtP*, i.e. that aren’t in the top $K$ AtP* estimates, for some $K$ . Let the top $K$ nodes be $\text{Top}^{K}_{AtP*}$ . It so happens that we can use subset sampling to obtain such a bound.

As described in Algorithm 1 and Section 3.3, the subset sampling algorithm returns summary statistics: $\bar{i}^{n}_{\pm}$ , $s^{n}_{\pm}$ and $\text{count}^{n}_{\pm}$ for each node $n$ : the average effect size $\bar{i}^{n}_{\pm}$ of a subset conditional on the node being contained in that subset ( $+$ ) or not ( $-$ ), the sample standard deviations $s^{n}_{\pm}$ , and the sample sizes $\text{count}^{n}_{\pm}$ . Given these, consider a null hypothesis⁵⁵5This is an unconventional form of $H_{0}$ – typically a null hypothesis will say that an effect is insignificant. However, the framework of statistical hypothesis testing is based on determining whether the data let us reject the null hypothesis, and in this case the hypothesis we want to reject is the presence, rather than the absence, of a significant false negative. $H_{0}^{n}$ that $|\mathcal{I}(n)|\geq\theta$ , for some threshold $\theta$ , versus the alternative hypothesis $H_{1}^{n}$ that $|\mathcal{I}(n)|<\theta$ . We use a one-sided Welch’s t-test⁶⁶6This relies on the populations being approximately unbiased and normally distributed, and not skewed. This tended to be true on inspection, and it’s what the additivity assumption (see Section 3.3) predicts for a single prompt pair — but a nonparametric bootstrap test may be more reliable, at the cost of additional compute. to test this hypothesis; the general practice with a compound null hypothesis is to select the simple sub-hypothesis that gives the greatest $p$ -value, so to be conservative, the simple null hypothesis is that $\mathcal{I}(n)=\theta\operatorname{sign}(\bar{i}^{n}_{+}-\bar{i}^{n}_{-})$ , giving a test statistic of $t^{n}=(\theta-|\bar{i}^{n}_{+}-\bar{i}^{n}_{-}|)/s^{n}_{\text{Welch}}$ , which gives a $p$ -value of $p^{n}=\mathbb{P}_{T\sim t_{\nu^{n}_{\text{Welch}}}}(T>t^{n})$ .

To get a combined conclusion across all nodes in $N\setminus\text{Top}^{K}_{AtP*}$ , let’s consider the hypothesis $H_{0}=\bigvee_{n\in N\setminus\text{Top}^{K}_{AtP*}}H_{0}^{n}$ that any of those nodes has true effect $|\mathcal{I}(n)|>\theta$ . Since this is also a compound null hypothesis, $\max_{n}p^{n}$ is the corresponding $p$ -value. Then, to find an upper confidence bound with specified confidence level $1-p$ , we invert this procedure to find the lowest $\theta$ for which we still have at least that level of confidence. We repeat this for various settings of the sample size $m$ in Algorithm 1. The exact algorithm is described in Section A.3.

In Figure 6, we report the upper confidence bounds at confidence levels 90%, 99%, 99.9% from running Algorithm 1 with a given $m$ (right subplots), as well as the number of nodes that have a true contribution $c(n)$ greater than $\theta$ (left subplots).

3.3 Baselines

Iterative

The most straightforward method is to directly do Activation Patching to find the true effect $c(n)$ of each node, in some uninformed random order. This is necessarily inefficient.

However, if we are scaling to a distribution, it is possible to improve on this, by alternating between phases of (i) for each unverified node, picking a not-yet-measured prompt pair on which to patch it, (ii) ranking the not-yet-verified nodes by the average observed patch effect magnitudes, taking the top $|N|/|\mathcal{D}|$ nodes, and verifying them. This balances the computational expenditure on the two tasks, and allows us to find large nodes sooner, at least as long as their large effect shows up on many prompt pairs.

Our remaining baseline methods rely on an approximate node additivity assumption: that when intervening on a set of nodes $\eta$ , the measured effect $\mathcal{I}(\eta;x^{\text{clean}},x^{\text{noise}})$ is approximately equal to $\sum_{n\in\eta}\mathcal{I}(n;x^{\text{clean}},x^{\text{noise}})$ .

Subsampling

Under the approximate node additivity assumption, we can construct an approximately unbiased estimator of $c(n)$ . We select the sets $\eta_{k}$ to contain each node independently with some probability $p$ , and additionally sample prompt pairs $x^{\text{clean}}_{k},x^{\text{noise}}_{k}\sim\mathcal{D}$ . For any node $n$ , and sets of nodes $\eta_{k}\subset N$ , let $\eta^{+}(n)$ be the collection of all those that contain $n$ , and $\eta^{-}(n)$ be the collection of those that don’t contain $n$ ; we’ll write these node sets as $\eta^{+}_{k}(n)$ and $\eta^{-}_{k}(n)$ , and the corresponding prompt pairs as ${x^{\text{clean}}_{k}}^{+}(n),{x^{\text{noise}}_{k}}^{+}(n)$ and ${x^{\text{clean}}_{k}}^{-}(n),{x^{\text{noise}}_{k}}^{-}(n)$ . The subsampling (or subset sampling) estimator is then given by

	$\displaystyle\hat{\mathcal{I}}_{\text{SS}}(n)$	$\displaystyle:={\frac{1}{\|\eta^{+}(n)\|}\sum_{k=1}^{\|\eta^{+}(n)\|}\mathcal{I}(% \eta^{+}_{k}(n);{x^{\text{clean}}_{k}}^{+}(n),{x^{\text{noise}}_{k}}^{+}(n))-% \frac{1}{\|\eta^{-}(n)\|}\sum_{k=1}^{\|\eta^{-}(n)\|}\mathcal{I}(\eta^{-}_{k}(n);{% x^{\text{clean}}_{k}}^{-}(n),{x^{\text{noise}}_{k}}^{-}(n))}$		(12)
	$\displaystyle\hat{c}_{\text{SS}}(n)$	$\displaystyle:=\|\hat{\mathcal{I}}_{\text{SS}}(n)\|$		(13)

The estimator $\hat{\mathcal{I}}_{\text{SS}}(n)$ is unbiased if there are no interaction effects, and has a small bias proportional to $p$ under a simple interaction model (see Section A.1.1 for proof).

In practice, we compute all the estimates $\hat{c}_{\text{SS}}(n)$ by sampling a binary mask over all nodes from i.i.d. Bernoulli ${}^{|N|}(p)$ – each binary mask can be identified with a node set $\eta$ . In Algorithm 1, we describe how to compute summary statistics related to Equation 13 efficiently for all nodes $n\in N$ . The means $\bar{i}^{\pm}$ are enough to compute $\hat{c}_{\text{SS}}(n)$ , while other summary statistics are involved in bounding the magnitude of a false negative (cf. Section 3.2). (Note, $\text{count}^{\pm}_{n}$ is just an alternate notation for $|\eta^{\pm}(n)|$ .)

Algorithm 1 Subsampling

p\in(0,1)

, model

\mathcal{M}

, metric

\mathcal{L}

, prompt pair distribution

\mathcal{D}

, num samples

m

\text{count}^{\pm}

\text{runSum}^{\pm}

\text{runSquaredSum}^{\pm}

\leftarrow 0^{|N|}

\triangleright

Init counts and running sums to 0 vectors

3:for

i\leftarrow 1\textrm{ to }m

x^{\text{clean}},x^{\text{noise}}\sim\mathcal{D}

\text{mask}^{+}\leftarrow\text{Bernoulli}^{|N|}(p)

\triangleright

Sample binary mask for patching

\text{mask}^{-}\leftarrow 1-\text{mask}^{+}

i\leftarrow\mathcal{I}(\{n\in N:\text{mask}^{+}_{n}=1\};x^{\text{clean}},x^{% \text{noise}})

\triangleright

\eta^{+}=\{n\in N:\text{mask}^{+}_{n}=1\}

\text{count}^{\pm}\,\leftarrow\,\text{count}^{\pm}+\text{mask}^{\pm}

\text{runSum}^{\pm}\,\leftarrow\,\text{runSum}^{\pm}+i\cdot\text{mask}^{\pm}

10:

\text{runSquaredSum}^{\pm}\,\leftarrow\,\text{runSquaredSum}^{\pm}+i^{2}\cdot% \text{mask}^{\pm}

11:

\bar{i}^{\pm}\leftarrow\text{runSum}^{\pm}/\text{count}^{\pm}

12:

s^{\pm}\leftarrow\sqrt{(\text{runSquaredSum}^{\pm}-(\bar{i}^{\pm})^{2})/(\text% {count}^{\pm}-1)}

13:return

\text{count}^{\pm}

\bar{i}^{\pm}

s^{\pm}

\triangleright

If diagnostics are not required,

\bar{i}^{\pm}

is sufficient.

Blocks & Hierarchical

Instead of sampling each $\eta$ independently, we can group nodes into fixed “blocks” $\eta$ of some size, and patch each block to find its aggregated contribution $c(\eta)$ ; we can then traverse the nodes, starting with high-contribution blocks and proceeding from there.

There is a tradeoff in terms of the block size: using large blocks increases the compute required to traverse a high-contribution block, but using small blocks increases the compute required to finish traversing all of the blocks. We refer to the fixed block size setting as Blocks. Another way to handle this tradeoff is to add recursion: the blocks can be grouped into higher-level blocks, and so forth. We call this method Hierarchical.

We present results from both methods in our comparison plots, but relegate details to Section A.1.2. Relative to subsampling, these grou**-based methods have the disadvantage that on distributions, their cost scales linearly with size of $\mathcal{D}$ ’s support, in addition to scaling with the number of nodes⁷⁷7AtP* also scales linearly in the same way, but with far fewer forward passes per prompt pair..

4 Experiments

4.1 Setup

Nodes

When attributing model behavior to components, an important choice is the partition of the model’s computational graph into units of analysis or ‘nodes’ $N\ni n$ (cf. Section 2.1). We investigate two settings for the choice of $N$ , AttentionNodes and NeuronNodes. For NeuronNodes, each MLP neuron⁸⁸8We use the neuron post-activation for the node; this makes no difference when causally intervening, but for AtP it’s beneficial, because it makes the $n\mapsto\mathcal{L}(n)$ function more linear. is a separate node. For AttentionNodes, we consider the query, key, and value vector for each head as distinct nodes, as well as the pre-linear per-head attention output⁹⁹9We include the output node because it provides additional information about what function an attention head is serving, particularly in the case where its queries have negligible patch effects relative to its keys and/or values. This may happen as a result of choosing $x^{\text{clean}},\,x^{\text{noise}}$ such that the query does not differ across the prompts.. We also refer to these units as ‘sites’. For each site, we consider each copy of that site at different token positions as a separate node. As a result, we can identify each node $n\in N$ with a pair $(T,S)$ from the product TokenPosition $\times$ Site. Since our two settings for $N$ are using a different level of granularity and are expected to have different per-node effect magnitudes, we present results on them separately.

Models

We investigate transformer language models from the Pythia suite (Biderman et al., 2023) of sizes between 410M and 12B parameters. This allows us to demonstrate that our methods are applicable across scale. Our cost-of-verified-recall plots in Figures 1, 7 and 8 refer to Pythia-12B. Results for other model sizes are presented via the relative-cost (cf. Section 4.2) plots in the main body Figure 9 and disaggregated via cost-of-verified recall in Section B.3.

Effect Metric $\mathcal{L}$

All reported results use the negative log probability¹⁰¹⁰10Another popular metric is the difference in logits between the clean and noise target. As opposed to the negative logprob, the logit difference is linear in the final logits and thus might favor AtP. A downside of logit difference is that it is sensitive to the noise target, which may not be meaningful if there are multiple plausible completions, such as in IOI. as their loss function $\mathcal{L}$ . We compute $\mathcal{L}$ relative to targets from the clean prompt $x^{\text{clean}}$ . We briefly explore other metrics in Section B.4.

4.2 Measuring Effectiveness and Efficiency

Cost of verified recall

As mentioned in the introduction, we’re primarily interested in finding the largest-effect nodes – see Appendix D for the distribution of $c(n)$ across models and distributions. Once we have obtained node estimates via a given method, it is relatively cheap to directly measure true effects of top nodes one at a time; we refer to this as “verification”. Incorporating this into our methodology, we find that false positives are typically not a big issue; they are simply revealed during verification. In contrast, false negatives are not so easy to remedy without verifying all nodes, which is what we were trying to avoid.

We compare methods on the basis of total compute cost (in # of forward passes) to verify the $K$ nodes with biggest true effect magnitude, for varying $K$ . The procedure being measured is to first compute estimates (incurring an estimation cost), and then sweep through nodes in decreasing order of estimated magnitude, measuring their individual effects $c(n)$ (i.e. verifying them), and incurring a verification cost. Then the total cost is the sum of these two costs.

Inverse-rank-weighted geometric mean cost

Sometimes we find it useful to summarize the method performance with a scalar; this is useful for comparing methods at a glance across different settings (e.g. model sizes, as in Figure 2), or for selecting hyperparameters (cf. Section B.5). The cost of verified recall of the top $K$ nodes is of interest for $K$ at varying orders of magnitude. In order to avoid the performance metric being dominated by small or large $K$ , we assign similar total weight to different orders of magnitude: we use a weighted average with weight $1/K$ for the cost of the top $K$ nodes. Similarly, since the costs themselves may have different orders of magnitude, we average them on a log scale – i.e., we take a geometric mean.

This metric is also proportional to the area under the curve in plots like Figure 1. To produce a more understandable result, we always report it relative to (i.e. divided by) the oracle verification cost on the same metric; the diagonal line is the oracle, with relative cost 1. We refer to this as the IRWRGM (inverse-rank-weighted relative geometric mean) cost, or the relative cost.

Note that the preference of the individual practitioner may be different such that this metric is no longer accurately measuring the important rank regime. For example, AtP* pays a notable upfront cost relative to AtP or AtP+QKfix, which sets it at a disadvantage when it doesn’t manage to find additional false negatives; but this may or may not be practically significant. To understand the performance in more detail we advise to refer to the cost of verified recall plots, like Figure 1 (or many more in Section B.3).

4.3 Single Prompt Pairs versus Distributions

We focus many of our experiments on single prompt pairs. This is primarily because it’s easier to set up and get ground truth data. It’s also a simpler setting in which to investigate the question, and one that’s more universally applicable, since a distribution to generalize to is not always available.

Clean single prompt pairs

As a starting point we report results on single prompt pairs which we expect to have relatively clean circuitry¹¹¹¹11Formally, these represent prompt distributions via the delta distribution $p(x^{\text{clean}},x^{\text{noise}})=\delta_{x^{\text{clean}}_{1},x^{\text{% noise}}_{1}}(x^{\text{clean}},x^{\text{noise}})$ where $x^{\text{clean}}_{1},x^{\text{noise}}_{1}$ is the singular prompt pair.. All singular prompt pairs are shown in Table 1. IOI-PP is chosen to resemble an instance from the indirect object identification (IOI) task (Wang et al., 2022), a task predominantly involving attention heads. CITY-PP is chosen to elicit factual recall which previous research suggests involves early MLPs and a small number of late attention heads (Meng et al., 2023; Geva et al., 2023; Nanda et al., 2023). The country/city combinations were chosen such that Pythia-410M achieved low loss on both $x^{\text{clean}}$ and $x^{\text{noise}}$ and such that all places were represented by a single token.

Identifier	Clean Prompt	Noise Source Prompt
CITY-PP	BOSCity:␣Barcelona\n Country:␣Spain	BOSCity:␣Bei**g\n Country:␣China
IOI-PP	BOSWhen␣Michael␣and␣Jessica ␣went␣to␣the␣bar,␣Michael ␣gave␣a␣drink␣to␣Jessica	BOSWhen␣Michael␣and␣Jessica ␣went␣to␣the␣bar,␣Ashley ␣gave␣a␣drink␣to␣Michael
RAND-PP	BOSHer␣biggest␣worry␣was␣the ␣festival␣might␣suffer␣and ␣people␣might␣erroneously␣think	BOSalso␣think␣that␣there ␣should␣be␣the␣same␣rules ␣or␣regulations␣when␣it

Table 1: Clean and noise source prompts for singular prompt pair distributions. Vertical lines denote tokenization boundaries. All prompts are preceded by the BOS (beginning of sequence) token. The last token is not part of the input. The last token of the clean prompt is used as the target in

\mathcal{L}

We show the cost of verified 100% recall for various methods in Figure 1, where we focus on NeuronNodes for CITY-PP and AttentionNodes for IOI-PP. Exhaustive results for smaller Pythia models are shown in Section B.3. Figure 2 shows the aggregated relative costs for all models on CITY-PP and IOI-PP.

Instead of applying the strict criterion of recalling all important nodes, we can also relax this constraint. In Figure 7, we show the cost of verified 90% recall in the two clean prompt pair settings.

Random prompt pair

The previous prompt pairs may in fact be the best-case scenarios: the interventions they create will be fairly localized to a specific circuit, and this may make it easy for AtP to approximate the contributions. It may thus be informative to see how the methods generalize to settings where the interventions are less surgical. To do this, we also report results in Figure 8 (top) and Figure 9 on a random prompt pair chosen from a non-copyright-protected section of The Pile (Gao et al., 2020) which we refer to as RAND-PP. The prompt pair was chosen such that Pythia-410M still achieved low loss on both prompts.

We find that AtP/AtP* is only somewhat less effective here; this provides tentative evidence that the strong performance of AtP/AtP* isn’t reliant on the clean prompt using a particularly crisp circuit, or on the noise prompt being a precise control.

Distributions

Causal attribution is often of most interest when evaluated across a distribution, as laid out in Section 2. Of the methods, AtP, AtP*, and Subsampling scale reasonably to distributions; the former 2 because they’re inexpensive so running them $|\mathcal{D}|$ times is not prohibitive, and Subsampling because it intrinsically averages across the distribution and thus becomes proportionally cheaper relative to the verification via activation patching. In addition, having a distribution enables a more performant Iterative method, as described in Section 3.3.

We present a comparison of these methods on 2 distributional settings. The first is a reduced version of IOI (Wang et al., 2022) on 6 names, resulting in $6\times 5\times 4=120$ prompt pairs, where we evaluate AttentionNodes. The other distribution prompts the model to output an indefinite article ‘ a’ or ‘ an’, where we evaluate NeuronNodes. See Section B.1 for details on constructing these distributions. Results are shown in Figure 8 for Pythia 12B, and in Figure 9 across models. The results show that AtP continues to perform well, especially with the QK fix; in addition, the cancellation failure mode tends to be sensitive to the particular input prompt pair, and as a result, averaging across a distribution diminishes the benefit of GradDrops.

An implication of Subsampling scaling well to this setting is that diagnostics may give reasonable confidence in not missing false negatives with much less overhead than in the single-prompt-pair case; this is illustrated in Figure 6.

5 Discussion

5.1 Limitations

Prompt pair distributions

We only considered a small set of prompt pair distributions, which often were limited to a single prompt pair, since evaluating the ground truth can be quite costly. While we aimed to evaluate on distributions that are reasonably representative, our results may not generalize to other distributions.

Choice of Nodes $N$

In the NeuronNodes setting, we took MLP neurons as our fundamental unit of analysis. However, there is mounting evidence (Bricken et al., 2023) that the decomposition of signals into neuron contributions does not correspond directly to a semantically meaningful decomposition. Instead, achieving such a decomposition seems to require finding the right set of directions in neuron activation space (Bricken et al., 2023; Gurnee et al., 2023) – which we viewed as being out of scope for this paper. In Section 5.2 we further discuss the applicability of AtP to sparse autoencoders, a method of finding these decompositions.

More generally, we only considered relatively fine-grained nodes, because this is a case where very exhaustive verification is prohibitively expensive, justifying the need for an approximate, fast method. Nanda (2022) speculate that AtP may perform worse on coarser components like full layers or entire residual streams, as a larger change may have more of a non-linear effect. There may still be benefit in speeding up such an analysis, particularly if the context length is long – our alternative methods may have something to offer here, though we leave investigation of this to future work.

It is popular in the literature to do Activation Patching with these larger components, with short contexts – this doesn’t pose a performance issue, and so our work would not provide any benefit here.

Caveats of $c(n)$ as importance measure

In this work we took the ground truth of activation patching, as defined in Equation 1, as our evaluation target. As discussed by McGrath et al. (2023), Equation 1 often significantly disagrees with a different evaluation target, the “direct effect”, by putting lower weight on some contributions when later components would shift their behaviour to compensate for the earlier patched component. In the worst case this could be seen as producing additional false negatives not accounted for by our metrics. To some degree this is likely to be mitigated by the GradDrop formula in Eq. 11, which will include a term drop** out the effect of that downstream shift.

However, it is also questionable whether we need to concern ourselves with finding high-direct-effect nodes. For example, direct effect is easy to efficiently compute for all nodes, as explored by nostalgebraist (2020) – so there is no need for fast approximations like AtP if direct effect is the quantity of interest. This ease of computation is no free lunch, though, because direct effect is also more limited as a tool for finding causally important nodes: it would not be able to locate any nodes that contribute only instrumentally to the circuit rather than producing its output. For example, there is no direct effect from nodes at non-final token positions. We discuss the direct effect further in Section 3.1.2 and Section A.2.2.

Another nuance of our ground–truth definition occurs in the distributional setting. Some nodes may have a real and significant effect, but only on a single clean prompt (e.g. they only respond to a particular name in IOI¹²¹²12We did observe this particular behavior in a few instances. or object in A-AN). Since the effect is averaged over the distribution, the ground truth will not assign these nodes large causal importance. Depending on the goal of the practitioner this may or may not be desirable.

Effect size versus rank estimation

When evaluating the performance of various estimators, we focused on evaluating the relative rank of estimates, since our main goal was to identify important components (with effect size only instrumentally useful to this end), and we assumed a further verification step of the nodes with highest estimated effects one at a time, in contexts where knowing effect size is important. Thus, we do not present evidence about how closely the estimated effect magnitudes from AtP or AtP* match the ground truth. Similarly, we did not assess the prevalence of false positives in our analysis, because they can be filtered out via the verification process. Finally, we did not compare to past manual interpretability work to check whether our methods find the same nodes to be causally important as discovered by human researchers, as done in prior work (Conmy et al., 2023; Syed et al., 2023).

Other LLMs

While we think it likely that our results on the Pythia model family (Biderman et al., 2023) will transfer to other LLM families, we cannot rule out qualitatively different behavior without further evidence, especially on SotA–scale models or models that significantly deviate from the standard decoder-only transformer architecture.

5.2 Extensions/Variants

Edge Patching

While we focus on computing the effects of individual nodes, edge activation patching can give more fine-grained information about which paths in the computational graph matter. However, it suffers from an even larger blowup in number of forward passes if done naively. Fortunately, AtP is easy to generalize to estimating the effects of edges between nodes (Nanda, 2022; Syed et al., 2023), while AtP* may provide further improvement. We discuss edge-AtP, and how to efficiently carry over the insights from AtP*, in Section C.2.

Coarser nodes $N$

We focused on fine-grained attribution, rather than full layers or sliding windows (Meng et al., 2023; Geva et al., 2023). In the latter case there’s less computational blowup to resolve, but for long contexts there may still be benefit in considering speedups like ours; on the other hand, they may be less linear, thus favouring other methods over AtP*. We leave investigation of this to future work.

Layer normalization

Nanda (2022) observed that AtP’s approximation to layer normalization may be a worse approximation when it comes to patching larger/coarser nodes: on average the patched and clean activations are likely to have similar norm, but may not have high cosine-similarity. They recommend treating the denominator in layer normalization as fixed, e.g. using a stop-gradient operator in the implementation. In Section C.1 we explore the effect of this, and illustrate the behaviour of this alternative form of AtP. It seems likely that this variant would indeed produce better results particularly when patching residual-stream nodes – but we leave empirical investigation of this to future work.

Denoising

Denoising (Meng et al., 2023; Lieberum et al., 2023) is a different use case for patching, which may produce moderately different results: the difference is that each forward pass is run on $x^{\text{noise}}$ with the activation to patch taken from $x^{\text{clean}}$ — colloquially, this tests whether the patched activation is sufficient to recover model performance on $x^{\text{clean}}$ , rather than necessary. We provide some preliminary evidence to the effect of this choice in Section B.4 but leave a more thorough investigation to future work.

Other forms of ablation

Further, in some settings it may be of interest to do mean-ablation, or even zero-ablation, and our tweaks remain applicable there; the random-prompt-pair result suggests AtP* isn’t overly sensitive to the noise distribution, so we speculate the results are likely to carry over.

5.3 Applications

Automated Circuit Finding

A natural application of the methods we discussed in this work is the automatic identification and localization of sparse subgraphs or ‘circuits’ (Cammarata et al., 2020). A variant of this was already discussed in concurrent work by Syed et al. (2023) who combined edge attribution patching with the ACDC algorithm (Conmy et al., 2023). As we mentioned in the edge patching discussion, AtP* can be generalized to edge attribution patching, which may bring additional benefit for automated circuit discovery.

Another approach is to learn a (probabilistic) mask over nodes, similar to Louizos et al. (2018); Cao et al. (2021), where the probability scales with the currently estimated node contribution $c(n)$ . For that approach, a fast method to estimate all node effects given the current mask probabilities could prove vital.

Sparse Autoencoders

Recently there has been increased interest by the community in using sparse autoencoders (SAEs) to construct disentangled sparse representations with potentially more semantic coherence than transformer-native units such as neurons (Cunningham et al., 2023; Bricken et al., 2023). SAEs usually have a lot more nodes than the corresponding transformer block they are applied to. This could pose a larger problem in terms of the activation patching effects, making the speedup of AtP* more valuable. However, due to the sparseness of the SAE, on a given forward pass the effect of most features will be zero. For example, some successful SAEs by Bricken et al. (2023) have 10-20 active features for 500 neurons for a given token position, which reduces the number of nodes by 20-50x relative to the MLP setting, increasing the scale at which existing iterative methods remain practical. It is still an open research question, however, what degree of sparsity is feasible with tolerable reconstruction error for practically relevant or SOTA–scale models, where the methods discussed in this work may become more important again.

Steering LLMs

AtP* could be used to discover single nodes in the model that can be leveraged for targeted inference time interventions to control the model’s behavior. In contrast to previous work (Li et al., 2023; Turner et al., 2023; Zou et al., 2023) it might provide more localized interventions with less impact on the rest of the model’s computation. One potential exciting direction would be to use AtP* (or other gradient-based approximations) to see which sparse autoencoder features, if activated, would have a significant effect.

5.4 Recommendation

Our results suggest that if a practitioner is trying to do fast causal attribution, there are 2 main factors to consider: (i) the desired granularity of localization, and (ii) the confidence vs compute tradeoff.

Regarding (i), the desired granularity, smaller components (e.g. MLP neurons or attention heads) are more numerous but more linear, likely yielding better results from gradient-based methods like AtP. We are less sure AtP will be a good approximation if patching layers or sliding windows of layers, and in this case practitioners may want to do normal patching. If the number of forward passes required remains prohibitive (e.g. a long context times many layers, when doing per token $\times$ layer patching), our other baselines may be useful. For a single prompt pair we particularly recommend trying Blocks, as it’s easy to make sense of; for a distribution we recommend Subsampling because it scales better to many prompt pairs.

Regarding (ii), the confidence vs compute tradeoff, depending on the application, it may be desirable to run AtP as an activation patching prefilter followed by running the diagnostic to increase confidence. On the other hand, if false negatives aren’t a big concern then it may be preferable to skip the diagnostic – and if false positives aren’t either, then in certain cases practitioners may want to skip activation patching verification entirely. In addition, if the prompt pair distribution does not adequately highlight the specific circuit/behaviour of interest, this may also limit what can be learned from any localization methods.

If AtP is appropriate, our results suggest the best variant to use is probably AtP* for single prompt pairs, AtP+QKFix for AttentionNodes on distributions, and AtP for NeuronNodes (or other sites that aren’t immediately before a nonlinearity) on distributions.

Of course, these recommendations are best-substantiated in settings similar to those we studied: focused prompt pairs / distribution, attention node or neuron sites, nodewise attribution, measuring cross-entropy loss on the clean-prompt next token. If departing from these assumptions we recommend looking before you leap.

6 Related work

Localization and Mediation Analysis

This work is concerned with identifying the effect of all (important) nodes in a causal graph (Pearl, 2000), in the specific case where the graph represents a language model’s computation. A key method for finding important intermediate nodes in a causal graph is intervening on those nodes and observing the effect, which was first discussed under the name of causal mediation analysis by Robins and Greenland (1992); Pearl (2001).

Activation Patching

In recent years there has been increasing success at applying the ideas of causal mediation analysis to identify causally important nodes in deep neural networks, in particular via the method of activation patching, where the output of a model component is intervened on. This technique has been widely used by the community and successfully applied in a range of contexts (Olsson et al., 2022; Vig et al., 2020; Soulos et al., 2020; Meng et al., 2023; Wang et al., 2022; Hase et al., 2023; Lieberum et al., 2023; Conmy et al., 2023; Hanna et al., 2023; Geva et al., 2023; Huang et al., 2023; Tigges et al., 2023; Merullo et al., 2023; McDougall et al., 2023; Goldowsky-Dill et al., 2023; Stolfo et al., 2023; Feng and Steinhardt, 2023; Hendel et al., 2023; Todd et al., 2023; Cunningham et al., 2023; Finlayson et al., 2021; Nanda et al., 2023).

Chan et al. (2022) introduce causal scrubbing, a generalized algorithm to verify a hypothesis about the internal mechanism underlying a model’s behavior, and detail their motivation behind performing noising and resample ablation rather than denoising or using mean or zero ablation – they interpret the hypothesis as implying the computation is invariant to some large set of perturbations, so their starting-point is the clean unperturbed forward pass.¹³¹³13Our motivation for focusing on noising rather than denoising was a closely related one – we were motivated by automated circuit discovery, where gradually noising more and more of the model is the basic methodology for both of the approaches discussed in Section 5.3.

Another line of research concerning formalizing causal abstractions focuses on finding and verifying high-level causal abstractions of low-level variables (Geiger et al., 2020, 2021, 2022, 2023). See Jenner et al. (2022) for more details on how these different frameworks agree and differ. In contrast to those works, we are chiefly concerned with identifying the important low-level variables in the computational graph and are not investigating their semantics or potential grou**s of lower-level into higher-level variables.

In addition to causal mediation analysis, intervening on node activations in the model forward pass has also been studied as a way of steering models towards desirable behavior (Rimsky et al., 2023; Zou et al., 2023; Turner et al., 2023; Jorgensen et al., 2023; Li et al., 2023; Belrose et al., 2023).

Attribution Patching / Gradient-based Masking

While we use the resample–ablation variant of AtP as formulated in Nanda (2022), similar formulations have been used in the past to successfully prune deep neural networks (Figurnov et al., 2016; Molchanov et al., 2017; Michel et al., 2019), or even identify causally important nodes for interpretability (Cao et al., 2021). Concurrent work by Syed et al. (2023) also demonstrates AtP can help with automatically finding causally important circuits in a way that agrees with previous manual circuit identification work. In contrast to Syed et al. (2023), we provide further analysis of AtP’s failure modes, give improvements in the form of AtP ${}^{*}$ , and evaluate both methods as well as several baselines on a suite of larger models against a ground truth that is independent of human researchers’ judgement.

7 Conclusion

In this paper, we have explored the use of attribution patching for node patch effect evaluation. We have compared attribution patching with alternatives and augmentations, characterized its failure modes, and presented reliability diagnostics. We have also discussed the implications of our contributions for other settings in which patching can be of interest, such as circuit discovery, edge localization, coarse-grained localization, and causal abstraction.

Our results show that AtP* can be a more reliable and scalable approach to node patch effect evaluation than alternatives. However, it is important to be aware of the failure modes of attribution patching, such as cancellation and saturation. We explored these in some detail, and provided mitigations, as well as recommendations for diagnostics to ensure that the results are reliable.

We believe that our work makes an important contribution to the field of mechanistic interpretability and will help to advance the development of more reliable and scalable methods for understanding the behavior of deep neural networks.

8 Author Contributions

János Kramár was research lead, and Tom Lieberum was also a core contributor – both were highly involved in most aspects of the project. Rohin Shah and Neel Nanda served as advisors and gave feedback and guidance throughout.

References

Belrose et al. (2023) N. Belrose, D. Schneider-Joseph, S. Ravfogel, R. Cotterell, E. Raff, and S. Biderman. Leace: Perfect linear concept erasure in closed form. arXiv preprint arXiv:2306.03819, 2023.
Biderman et al. (2023) S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, A. Skowron, L. Sutawika, and O. van der Wal. Pythia: A suite for analyzing large language models across training and scaling. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 2397–2430. PMLR, 2023. URL https://proceedings.mlr.press/v202/biderman23a.html.
Bricken et al. (2023) T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.
Cammarata et al. (2020) N. Cammarata, S. Carter, G. Goh, C. Olah, M. Petrov, L. Schubert, C. Voss, B. Egan, and S. K. Lim. Thread: Circuits. Distill, 2020. 10.23915/distill.00024. https://distill.pub/2020/circuits.
Cao et al. (2021) N. D. Cao, L. Schmid, D. Hupkes, and I. Titov. Sparse interventions in language models with differentiable masking, 2021.
Chan et al. (2022) L. Chan, A. Garriga-Alonso, N. Goldwosky-Dill, R. Greenblatt, J. Nitishinskaya, A. Radhakrishnan, B. Shlegeris, and N. Thomas. Causal scrubbing, a method for rigorously testing interpretability hypotheses. AI Alignment Forum, 2022. https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing.
Conmy et al. (2023) A. Conmy, A. N. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability, 2023.
Cunningham et al. (2023) H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey. Sparse autoencoders find highly interpretable features in language models, 2023.
Feng and Steinhardt (2023) J. Feng and J. Steinhardt. How do language models bind entities in context?, 2023.
Figurnov et al. (2016) M. Figurnov, A. Ibraimova, D. P. Vetrov, and P. Kohli. Perforatedcnns: Acceleration through elimination of redundant convolutions. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/f0e52b27a7a5d6a1a87373dffa53dbe5-Paper.pdf.
Finlayson et al. (2021) M. Finlayson, A. Mueller, S. Gehrmann, S. Shieber, T. Linzen, and Y. Belinkov. Causal analysis of syntactic agreement mechanisms in neural language models. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1828–1843, Online, Aug. 2021. Association for Computational Linguistics. 10.18653/v1/2021.acl-long.144. URL https://aclanthology.org/2021.acl-long.144.
Gao et al. (2020) L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, S. Presser, and C. Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
Geiger et al. (2020) A. Geiger, K. Richardson, and C. Potts. Neural natural language inference models partially embed theories of lexical entailment and negation, 2020.
Geiger et al. (2021) A. Geiger, H. Lu, T. Icard, and C. Potts. Causal abstractions of neural networks, 2021.
Geiger et al. (2022) A. Geiger, Z. Wu, H. Lu, J. Rozner, E. Kreiss, T. Icard, N. D. Goodman, and C. Potts. Inducing causal structure for interpretable neural networks, 2022.
Geiger et al. (2023) A. Geiger, C. Potts, and T. Icard. Causal abstraction for faithful model interpretation, 2023.
Geva et al. (2023) M. Geva, J. Bastings, K. Filippova, and A. Globerson. Dissecting recall of factual associations in auto-regressive language models, 2023.
Goldowsky-Dill et al. (2023) N. Goldowsky-Dill, C. MacLeod, L. Sato, and A. Arora. Localizing model behavior with path patching, 2023.
Gurnee et al. (2023) W. Gurnee, N. Nanda, M. Pauly, K. Harvey, D. Troitskii, and D. Bertsimas. Finding neurons in a haystack: Case studies with sparse probing, 2023.
Hanna et al. (2023) M. Hanna, O. Liu, and A. Variengien. How does gpt-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model, 2023.
Hase et al. (2023) P. Hase, M. Bansal, B. Kim, and A. Ghandeharioun. Does localization inform editing? surprising differences in causality-based localization vs. knowledge editing in language models, 2023.
Hendel et al. (2023) R. Hendel, M. Geva, and A. Globerson. In-context learning creates task vectors, 2023.
Hoffmann et al. (2022) J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, O. Vinyals, J. Rae, and L. Sifre. An empirical analysis of compute-optimal large language model training. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 30016–30030. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/c1e2faff6f588870935f114ebe04a3e5-Paper-Conference.pdf.
Huang et al. (2023) J. Huang, A. Geiger, K. D’Oosterlinck, Z. Wu, and C. Potts. Rigorously assessing natural language explanations of neurons, 2023.
Jenner et al. (2022) E. Jenner, A. Garriga-Alonso, and E. Zverev. A comparison of causal scrubbing, causal abstractions, and related methods. AI Alignment Forum, 2022. https://www.alignmentforum.org/posts/uLMWMeBG3ruoBRhMW/a-comparison-of-causal-scrubbing-causal-abstractions-and.
Jorgensen et al. (2023) O. Jorgensen, D. Cope, N. Schoots, and M. Shanahan. Improving activation steering in language models with mean-centring, 2023.
Li et al. (2023) K. Li, O. Patel, F. Viégas, H. Pfister, and M. Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model, 2023.
Lieberum et al. (2023) T. Lieberum, M. Rahtz, J. Kramár, N. Nanda, G. Irving, R. Shah, and V. Mikulik. Does circuit analysis interpretability scale? evidence from multiple choice capabilities in chinchilla, 2023.
Louizos et al. (2018) C. Louizos, M. Welling, and D. P. Kingma. Learning sparse neural networks through $l_{0}$ regularization, 2018.
McDougall et al. (2023) C. McDougall, A. Conmy, C. Rushing, T. McGrath, and N. Nanda. Copy suppression: Comprehensively understanding an attention head, 2023.
McGrath et al. (2023) T. McGrath, M. Rahtz, J. Kramár, V. Mikulik, and S. Legg. The hydra effect: Emergent self-repair in language model computations, 2023.
Meng et al. (2023) K. Meng, D. Bau, A. Andonian, and Y. Belinkov. Locating and editing factual associations in gpt, 2023.
Merullo et al. (2023) J. Merullo, C. Eickhoff, and E. Pavlick. Circuit component reuse across tasks in transformer language models, 2023.
Michel et al. (2019) P. Michel, O. Levy, and G. Neubig. Are sixteen heads really better than one? In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/2c601ad9d2ff9bc8b282670cdd54f69f-Paper.pdf.
Molchanov et al. (2017) P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz. Pruning convolutional neural networks for resource efficient inference. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=SJGCiw5gl.
Nanda (2022) N. Nanda. Attribution patching: Activation patching at industrial scale. 2022. URL https://www.neelnanda.io/mechanistic-interpretability/attribution-patching.
Nanda et al. (2023) N. Nanda, S. Rajamanoharan, J. Kramár, and R. Shah. Fact finding: Attempting to reverse-engineer factual recall on the neuron level, Dec 2023. URL https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall.
nostalgebraist (2020) nostalgebraist. interpreting gpt: the logit lens. 2020. URL https://www.alignmentforum.org/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens.
Olsson et al. (2022) C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, S. Johnston, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah. In-context learning and induction heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.
Pearl (2000) J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.
Pearl (2001) J. Pearl. Direct and indirect effects, 2001.
Radford et al. (2018) A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training, 2018.
Rimsky et al. (2023) N. Rimsky, N. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. M. Turner. Steering llama 2 via contrastive activation addition, 2023.
Robins and Greenland (1992) J. M. Robins and S. Greenland. Identifiability and exchangeability for direct and indirect effects. Epidemiology, 3:143–155, 1992. URL https://api.semanticscholar.org/CorpusID:10757981.
Soulos et al. (2020) P. Soulos, R. T. McCoy, T. Linzen, and P. Smolensky. Discovering the compositional structure of vector representations with role learning networks. In A. Alishahi, Y. Belinkov, G. Chrupała, D. Hupkes, Y. Pinter, and H. Sajjad, editors, Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 238–254, Online, Nov. 2020. Association for Computational Linguistics. 10.18653/v1/2020.blackboxnlp-1.23. URL https://aclanthology.org/2020.blackboxnlp-1.23.
Stolfo et al. (2023) A. Stolfo, Y. Belinkov, and M. Sachan. A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis, 2023.
Syed et al. (2023) A. Syed, C. Rager, and A. Conmy. Attribution patching outperforms automated circuit discovery, 2023.
Tigges et al. (2023) C. Tigges, O. J. Hollinsworth, A. Geiger, and N. Nanda. Linear representations of sentiment in large language models, 2023.
Todd et al. (2023) E. Todd, M. L. Li, A. S. Sharma, A. Mueller, B. C. Wallace, and D. Bau. Function vectors in large language models, 2023.
Turner et al. (2023) A. M. Turner, L. Thiergart, D. Udell, G. Leech, U. Mini, and M. MacDiarmid. Activation addition: Steering language models without optimization, 2023.
Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need, 2017.
Veit et al. (2016) A. Veit, M. J. Wilber, and S. Belongie. Residual networks behave like ensembles of relatively shallow networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/37bc2f75bf1bcfe8450a1a41c200364c-Paper.pdf.
Vig et al. (2020) J. Vig, S. Gehrmann, Y. Belinkov, S. Qian, D. Nevo, Y. Singer, and S. Shieber. Investigating gender bias in language models using causal mediation analysis. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 12388–12401. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/92650b2e92217715fe312e6fa7b90d82-Paper.pdf.
Wang et al. (2022) K. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small, 2022.
Welch (1947) B. L. Welch. The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika, 34(1-2):28–35, 01 1947. ISSN 0006-3444. 10.1093/biomet/34.1-2.28. URL https://doi.org/10.1093/biomet/34.1-2.28.
Zou et al. (2023) A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A.-K. Dombrowski, S. Goel, N. Li, M. J. Byun, Z. Wang, A. Mallen, S. Basart, S. Koyejo, D. Song, M. Fredrikson, J. Z. Kolter, and D. Hendrycks. Representation engineering: A top-down approach to ai transparency, 2023.

Appendix A Method details

A.1 Baselines

A.1.1 Properties of Subsampling

Here we prove that the subsampling estimator $\hat{\mathcal{I}}_{\text{SS}}(n)$ from Section 3.3 is unbiased in the case of no interaction effects. Furthermore, assuming a simple interaction model, we show the bias of $\hat{\mathcal{I}}_{\text{SS}}(n)$ is $p$ times the total interaction effect of $n$ with other nodes. We assume a pairwise interaction model. That is, given a set of nodes $\eta$ , we have

\displaystyle\mathcal{I}(\eta;x)

\displaystyle=\sum_{n\in\eta}\mathcal{I}(n;x)+\sum_{\begin{subarray}{c}n,n^{% \prime}\in\eta\\ n\neq n\end{subarray}}\sigma_{n,n^{\prime}}(x)

(16)

with fixed constants $\sigma_{n,n^{\prime}}(x)\in\mathbb{R}$ for each prompt pair $x\in\operatorname{support}(\mathcal{D})$ . Let $\sigma_{n,n^{\prime}}=\mathbb{E}_{x\sim\mathcal{D}}\left[\sigma_{n,n^{\prime}}% (x)\right]$ .

Let $p$ be the probability of including each node in a given $\eta$ and let $M$ be the number of node masks sampled from $\operatorname{Bernoulli}^{|N|}(p)$ and prompt pairs $x$ sampled from $\mathcal{D}$ . Then,

$\displaystyle\mathbb{E}\left[\hat{\mathcal{I}}_{\text{SS}}(n)\right]$	$\displaystyle=\mathbb{E}\left[\frac{1}{\|\eta^{+}(n)\|}\sum_{k=1}^{\|\eta^{+}(n)\|% }\mathcal{I}(\eta^{+}_{k}(n);x_{k}^{+})-\frac{1}{\|\eta^{-}(n)\|}\sum_{k=1}^{\|% \eta^{-}(n)\|}\mathcal{I}(\eta^{-}_{k}(n);x_{k}^{-})\right]$	(17a)
	$\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\frac{1}{\|\eta^{+}(n)\|}\sum_{k=1% }^{\|\eta^{+}(n)\|}\mathcal{I}(\eta^{+}_{k}(n);x_{k}^{+})-\frac{1}{\|\eta^{-}(n)\|% }\sum_{k=1}^{\|\eta^{-}(n)\|}\mathcal{I}(\eta^{-}_{k}(n);x_{k}^{-})\middle\|\|\eta% ^{+}(n)\|\right]\right]$	(17b)
	$\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\frac{\|\eta^{+}(n)\|}{\|\eta^{+}(n% )\|}\mathbb{E}\left[\mathcal{I}(\eta_{1};x_{1})\middle\|n\in\eta_{1}\right]-% \frac{\|\eta^{-}(n)\|}{\|\eta^{-}(n)\|}\mathbb{E}\left[\mathcal{I}(\eta_{1};x_{1})% \middle\|n\not\in\eta_{1}\right]\middle\|\|\eta^{+}(n)\|\right]\right]$	(17c)
	$\displaystyle=\mathbb{E}\left[\mathcal{I}(\eta_{1};x_{1})\middle\|n\in\eta_{1}% \right]-\mathbb{E}\left[\mathcal{I}(\eta_{1};x_{1})\middle\|n\not\in\eta_{1}\right]$	(17d)
	$\displaystyle=c(n)+\mathbb{E}\left[\sum_{n^{\prime}\neq n}\mathbb{1}[n^{\prime% }\in\eta_{1}]\left(c(n^{\prime})+\sigma_{nn^{\prime}}+\frac{1}{2}\sum_{n^{% \prime\prime}\not\in\{n^{\prime},n\}}\mathbb{1}[n^{\prime}\in\eta_{1}]\sigma_{% n^{\prime}n^{\prime\prime}}\middle\|n\in\eta_{1}\right)\right]$	(17e)
	$\displaystyle\quad-\mathbb{E}\left[\sum_{n^{\prime}\neq n}\mathbb{1}[n^{\prime% }\in\eta_{1}]\left(c(n^{\prime})+\frac{1}{2}\sum_{n^{\prime\prime}\not\in\{n^{% \prime},n\}}\mathbb{1}[n^{\prime}\in\eta_{1}]\sigma_{n^{\prime}n^{\prime\prime% }}\right)\middle\|n\not\in\eta_{1}\right]$	(17f)
	$\displaystyle=c(n)+p\sum_{n^{\prime}\neq n}\sigma_{nn^{\prime}}$	(17g)

In Equation 17g, we observe that if the interaction terms $\sigma_{nn^{\prime}}$ are all zero, the estimator is unbiased. Otherwise, the bias scales both with the sum of interaction effects and with $p$ , as expected.

A.1.2 Pseudocode for Blocks and Hierarchical baselines

In Algorithm 2 we detail the Blocks baseline algorithm. As explained in Section 3.3, it comes with a tradeoff in its “block size” hyperparameter $B$ : a small block size requires a lot of time to evaluate all the blocks, while a large block size means many irrelevant nodes to evaluate in each high-contribution block.

Algorithm 2 Blocks algorithm for causal attribution.

1:block size

B

, compute budget

M

, nodes

N=\{n_{i}\}

, prompts

x^{\text{clean}},\,x^{\text{noise}}

, intervention function

\tilde{\mathcal{I}}:\eta\mapsto\mathcal{I}(\eta;x^{\text{clean}},x^{\text{% noise}})

\mathrm{numBlocks}\leftarrow\lceil|N|/B\rceil

\pi\leftarrow\operatorname{shuffle}\left(\left\{\left\lfloor\mathrm{numBlocks}% \cdot iB/|N|\right\rfloor\mid i\in\{0,\dots,|N|-1\}\right\}\right)

\triangleright

Assign each node to a block.

4:for

i\leftarrow 0\textrm{ to numBlocks}-1

\textrm{blockContribution}[i]\leftarrow|\tilde{\mathcal{I}}(\pi^{-1}(\{i\}))|

\triangleright

\pi^{-1}(\{i\}):=\{n:\,\pi(n)=i\mid n\in N\})

\mathrm{spentBudget}\leftarrow M-\mathrm{numBlocks}

\mathrm{topNodeContribs}\leftarrow\operatorname{CreateEmptyDictionary}()

8:for all

i\in\{0\textrm{ to numBlocks}-1\}

in decreasing order of

\textrm{blockContribution}[i]

9: for all

n\in\pi^{-1}(\{i\})

\triangleright

Eval all nodes in block.

10: if

\mathrm{spentBudget}<M

then

11:

\mathrm{topNodeContribs}[n]\leftarrow\mid\tilde{\mathcal{I}}(\{n\})|

12:

\mathrm{spentBudget}\leftarrow\mathrm{spentBudget}+1

13: else

14: return topNodeContribs

15:return topNodeContribs

The Hierarchical baseline algorithm aims to resolve this tradeoff, by using small blocks, but grouped into superblocks so it’s not necessary to traverse all the small blocks before finding the key nodes. In Algorithm 3 we detail the hierarchical algorithm in its iterative form, corresponding to batch size 1.

One aspect that might be surprising is that on line 22, we ensure a subblock is never added to the priority queue with higher priority than its ancestor superblocks. The reason for doing this is that in practice we use batched inference rather than patching a single block at a time, so depending on the batch size, we do evaluate blocks that aren’t the highest-priority unevaluated blocks, and this might impose a significant delay in when some blocks are evaluated. In order to reduce this dependence on the batch size hyperparameter, line 22 ensures that every block is evaluated at most $L$ batches later than it would be with batch size 1.

Algorithm 3 Hierarchical algorithm for causal attribution, in iterative form. In practice we do additional batching rather than evaluating a single block at a time on line 15.

1:branching factor

B

, num levels

L

, compute budget

M

, nodes

N=\{n_{i}\}

, intervention function

\mathcal{I}

\mathrm{numTopLevelBlocks}\leftarrow\lceil|N|/B^{L}\rceil

\pi\leftarrow\operatorname{shuffle}\left(\left\{\left\lfloor\mathrm{% numTopLevelBlocks}\cdot iB^{L}/|N|\right\rfloor\middle|i\in\{0,\dots,|N|-1\}% \right\}\right)

4:for all

n_{i}\in N

(d_{L-1},d_{L-2},\dots,d_{0})\leftarrow\text{zero-padded final }L

base-

B

digits of

\pi_{i}

\mathrm{address}(n_{i})=(\lfloor\pi_{i}/B^{L}\rfloor,d_{L-1},\dots,d_{0})

Q\leftarrow\operatorname{CreateEmptyPriorityQueue}()

8:for

i\leftarrow 0\textrm{ to numTopLevelBlocks}-1

\operatorname{PriorityQueueInsert}(Q,[i],\infty)

10:

\mathrm{spentBudget}\leftarrow 0

11:

\mathrm{topNodeContribs}\leftarrow\operatorname{CreateEmptyDictionary}()

12:repeat

13:

(\mathrm{addressPrefix},\mathrm{priority})\leftarrow\operatorname{% PriorityQueuePop}(Q)

14:

\mathrm{blockNodes}\leftarrow\left\{n\in N\middle|\operatorname{StartsWith}(% \mathrm{address}(n),\mathrm{addressPrefix})\right\}

15:

\mathrm{blockContribution}\leftarrow|\mathcal{I}\left(\mathrm{blockNodes}% \right)|

16:

\mathrm{spentBudget}\leftarrow\mathrm{spentBudget}+1

17: if

\mathrm{blockNodes}=\{n\}

for some

n\in N

then

18:

\mathrm{topNodeContribs}[n]\leftarrow\mathrm{blockContribution}

19: else

20: for

i\leftarrow 0\textrm{ to }B-1

21: if

\{n\in\mathrm{blockNodes}|\operatorname{StartsWith}(\mathrm{address}(n),% \mathrm{addressPrefix}+[i]\}\not=\emptyset

then

22:

\operatorname{PriorityQueueInsert}(Q,\mathrm{addressPrefix}+[i],\min(\mathrm{% blockContribution},\mathrm{priority}))

23:until

\mathrm{spentBudget}=M

\operatorname{PriorityQueueEmpty}(Q)

24:return topNodeContribs

A.2 AtP improvements

A.2.1 Pseudocode for corrected AtP on attention keys

As described in Section 3.1.1, computing Equation 10 naïvely for all nodes requires $\operatorname{O}(T^{3})$ flops at each attention head and prompt pair. Here we give a more efficient algorithm running in $\operatorname{O}(T^{2})$ . In addition to keys, queries and attention probabilities, we now also cache attention logits (pre-softmax scaled key-query dot products).

We define $\operatorname{attnLogits}_{\text{patch}}^{t}(n^{q})$ and $\Delta_{t}\operatorname{attnLogits}(n^{q})$ analogously to Equations 8 and 9. For brevity we can also define $\operatorname{attnLogits}_{\text{patch}}(n^{q})_{t}:=\operatorname{attnLogits}% ^{t}_{\text{patch}}(n^{q})_{t}$ and $\Delta\operatorname{attnLogits}(n^{q})_{t}:=\Delta_{t}\operatorname{attnLogits% }(n^{q})_{t}$ , since the aim with this algorithm is to avoid having to separately compute effects of $\operatorname{do}(n^{k}_{t}\leftarrow n^{k}_{t}(x^{\text{noise}}))$ on any other component of $\operatorname{attnLogits}$ than the one for key node $n^{k}_{t}$ .

Note that, for a key $n^{k}_{t}$ at position $t$ in the sequence, the proportions of the non- $t$ components of $\operatorname{attn}(n^{q})_{t}$ do not change when $\operatorname{attnLogits}(n^{q})_{t}$ is changed, so $\Delta_{t}\operatorname{attn}(n^{q})$ is actually $\mathrm{onehot}(t)-\operatorname{attn}(n^{q})$ multiplied by some scalar $s_{t}$ ; specifically, to get the right attention weight on $n^{k}_{t}$ , the scalar must be $s_{t}:=\frac{\Delta\operatorname{attn}(n^{q})_{t}}{1-\operatorname{attn}(n^{q}% )_{t}}$ . Additionally, we have $\log\left(\frac{\operatorname{attn}_{\text{patch}}^{t}(n^{q})_{t}}{1-% \operatorname{attn}_{\text{patch}}^{t}(n^{q})_{t}}\right)=\log\left(\frac{% \operatorname{attn}(n^{q})_{t}}{1-\operatorname{attn}(n^{q})_{t}}\right)+% \Delta\operatorname{attnLogits}(n^{q})_{t}$ ; note that the logodds function $p\mapsto\log\left(\frac{p}{1-p}\right)$ is the inverse of the sigmoid function, so $\operatorname{attn}_{\text{patch}}^{t}(n^{q})=\operatorname{\sigma}\left(\log% \left(\frac{\operatorname{attn}_{\text{patch}}^{t}(n^{q})_{t}}{1-\operatorname% {attn}_{\text{patch}}^{t}(n^{q})_{t}}\right)\right)$ . Putting this together, we can compute all $\operatorname{attnLogits}_{\text{patch}}(n^{q})$ by combining all keys from the $x^{\text{noise}}$ forward pass with all queries from the $x^{\text{clean}}$ forward pass, and proceed to compute $\Delta\operatorname{attnLogits}(n^{q})$ , and all $\Delta_{t}\operatorname{attn}(n^{q})_{t}$ , and thus all $\hat{\mathcal{I}}_{\text{{AtPfix}}}^{K}(n_{t};x^{\text{clean}},x^{\text{noise}})$ , using $\operatorname{O}(T^{2})$ flops per attention head.

Algorithm 4 computes the contribution of some query node $n^{q}$ and prompt pair $x^{\text{clean}},x^{\text{noise}}$ to the corrected AtP estimates $\hat{c}_{\text{AtPfix}}^{K}(n^{k}_{t})$ for key nodes $n^{k}_{1},\dots,n^{k}_{T}$ from a single attention head, using $O(T)$ flops, while avoiding numerical overflows. We reuse the notation $\operatorname{attn}(n^{q})$ , $\operatorname{attn}_{\text{patch}}^{t}(n^{q})$ , $\Delta_{t}\operatorname{attn}(n^{q})$ , $\operatorname{attnLogits}(n^{q})$ , $\operatorname{attnLogits}_{\text{patch}}(n^{q})$ , and $s_{t}$ from Section 3.1.1, leaving the prompt pair implicit.

Algorithm 4 AtP correction for attention keys

\mathbf{a}:=\operatorname{attnLogits}(n^{q})

\mathbf{a}^{\text{patch}}:=\operatorname{attnLogits}_{\text{patch}}(n^{q})

\mathbf{g}:=\frac{\partial\mathcal{L}(\mathcal{M}(x^{\text{clean}}))}{\partial% \operatorname{attn}(n^{q})}

t^{*}\leftarrow\operatorname{argmax}_{t}(a_{t})

\ell\leftarrow\mathbf{a}-a_{t^{*}}-\log\left(\sum_{t}e^{a_{t}-a_{t^{*}}}\right)

\triangleright

Clean log attn weights,

\ell=\log(\operatorname{attn}(n^{q}))

\mathbf{d}\leftarrow\ell-\log(1-e^{\ell})

\triangleright

Clean logodds,

d_{t}=\log\left(\frac{\operatorname{attn}(n^{q})_{t}}{1-\operatorname{attn}(n^% {q})_{t}}\right)

d_{t^{*}}\leftarrow a_{t^{*}}-\max_{t\not=t^{*}}a_{t}-\log\left(\sum_{t^{% \prime}\not=t^{*}}e^{a_{t^{\prime}}-\max_{t\not=t^{*}}a_{t}}\right)

\triangleright

Adjust

\mathbf{d}

; more stable for

a_{t^{*}}\gg\max_{t\not=t^{*}}a_{t}

\ell^{\text{patch}}\leftarrow\operatorname{logsigmoid}(\mathbf{d}+\mathbf{a}^{% \text{patch}}-\mathbf{a})

\triangleright

Patched log attn weights,

\ell^{\text{patch}}_{t}=\log(\operatorname{attn}_{\text{patch}}^{t}(n^{q})_{t})

\Delta\ell\leftarrow\ell^{\text{patch}}-\ell

\triangleright

\Delta\ell_{t}=\log\left(\frac{\operatorname{attn}_{\text{patch}}^{t}(n^{q})_{% t}}{\operatorname{attn}(n^{q})_{t}}\right)

b\leftarrow\operatorname{softmax}(\mathbf{a})^{\intercal}\mathbf{g}

\triangleright

b=\operatorname{attn}(n^{q})^{\intercal}\mathbf{g}

9:for

t\leftarrow 1\textrm{ to }T

10:

\triangleright

Compute scaling factor

s_{t}:=\frac{\Delta_{t}\operatorname{attn}(n^{q})_{t}}{1-\operatorname{attn}(n% ^{q})_{t}}

11: if

\ell^{\text{patch}}_{t}>\ell_{t}

then

\triangleright

Avoid overflow when

\ell^{\text{patch}}_{t}\gg\ell_{t}

12:

s_{t}\leftarrow e^{d_{t}+\Delta\ell_{t}+\log(1-e^{-\Delta\ell_{t}})}

\triangleright

s_{t}=\frac{\operatorname{attn}(n^{q})_{t}}{1-\operatorname{attn}(n^{q})_{t}}% \frac{\operatorname{attn}_{\text{patch}}^{t}(n^{q})_{t}}{\operatorname{attn}(n% ^{q})_{t}}\left(1-\frac{\operatorname{attn}(n^{q})_{t}}{\operatorname{attn}_{% \text{patch}}^{t}(n^{q})_{t}}\right)

13: else

\triangleright

Avoid overflow when

\ell^{\text{patch}}_{t}\ll\ell_{t}

14:

s_{t}\leftarrow-e^{d_{t}+\log(1-e^{\Delta\ell_{t}})}

\triangleright

s_{t}=-\frac{\operatorname{attn}(n^{q})_{t}}{1-\operatorname{attn}(n^{q})_{t}}% \left(1-\frac{\operatorname{attn}_{\text{patch}}^{t}(n^{q})_{t}}{\operatorname% {attn}(n^{q})_{t}}\right)

15:

r_{t}\leftarrow s_{t}(g_{t}-b)

\triangleright

r_{t}=s_{t}(\mathrm{onehot}(t)-\operatorname{attn}(n^{q}))^{\intercal}\mathbf{% g}=\Delta_{t}\operatorname{attn}(n^{q})\cdot\frac{\partial\mathcal{L}(\mathcal% {M}(x^{\text{clean}}))}{\partial\operatorname{attn}(n^{q})}

16:return

\mathbf{r}

The corrected AtP estimates $\hat{c}_{\text{AtPfix}}^{K}(n^{k}_{t})$ can then be computed using Equation 10; in other words, by summing the returned $r_{t}$ from Algorithm 4 over queries $n^{q}$ for this attention head, and averaging over $x^{\text{clean}},x^{\text{noise}}\sim\mathcal{D}$ .

A.2.2 Properties of GradDrop

In Section 3.1.2 we introduced GradDrop to address an AtP failure mode arising from cancellation between direct and indirect effects: roughly, if the total effect (on some prompt pair) is $\mathcal{I}(n)=\mathcal{I}^{\text{direct}}(n)+\mathcal{I}^{\text{indirect}}(n)$ , and these are close to cancelling, then a small multiplicative approximation error in $\hat{\mathcal{I}}_{\text{AtP}}^{\text{indirect}}(n)$ , due to nonlinearities, can accidentally cause $|\hat{\mathcal{I}}_{\text{AtP}}^{\text{direct}}(n)+\hat{\mathcal{I}}_{\text{% AtP}}^{\text{indirect}}(n)|$ to be orders of magnitude smaller than $|\mathcal{I}(n)|$ .

To address this failure mode with an improved estimator $\hat{c}_{\text{AtP+GD}}(n)$ , there’s 3 desiderata for GradDrop:

1.

$\hat{c}_{\text{AtP+GD}}(n)$ shouldn’t be much smaller than $\hat{c}_{\text{AtP}}(n)$ , because that would risk creating more false negatives.
2.

$\hat{c}_{\text{AtP+GD}}(n)$ should usually not be much larger than $\hat{c}_{\text{AtP}}(n)$ , because that would create false positives, which also slows down verification and can effectively create false negatives at a given budget.
3.

If $\hat{c}_{\text{AtP}}(n)$ is suffering from the cancellation failure mode, then $\hat{c}_{\text{AtP+GD}}(n)$ should be significantly larger than $\hat{c}_{\text{AtP}}(n)$ .

Let’s recall how GradDrop was defined in Section 3.1.2, using a virtual node $n_{\ell}^{\text{out}}$ to represent the residual-stream contributions of layer $\ell$ :

	$\displaystyle\hat{c}_{\text{AtP+GD}}(n):={}$	$\displaystyle\mathbb{E}_{x^{\text{clean}},x^{\text{noise}}}\left[\frac{1}{L-1}% \sum_{\ell=1}^{L}\left\|\hat{\mathcal{I}}_{\text{AtP+GD}_{\ell}}(n;x^{\text{% clean}},x^{\text{noise}})\right\|\right]$
	$\displaystyle={}$	$\displaystyle\mathbb{E}_{x^{\text{clean}},x^{\text{noise}}}\left[\frac{1}{L-1}% \sum_{\ell=1}^{L}\left\|(n(x^{\text{noise}})-n(x^{\text{clean}}))^{\intercal}% \frac{\partial\mathcal{L}^{\ell}}{\partial n}\right\|\right]$
	$\displaystyle={}$	$\displaystyle\mathbb{E}_{x^{\text{clean}},x^{\text{noise}}}\left[\frac{1}{L-1}% \sum_{\ell=1}^{L}\left\|(n(x^{\text{noise}})-n(x^{\text{clean}}))^{\intercal}% \frac{\partial\mathcal{L}}{\partial n}(\mathcal{M}(x^{\text{clean}}\mid% \operatorname{do}(n^{\text{out}}_{\ell}\leftarrow n^{\text{out}}_{\ell}(x^{% \text{clean}}))))\right\|\right]$

To better understand the behaviour of GradDrop, let’s look more carefully at the gradient $\frac{\partial\mathcal{L}}{\partial n}$ . The total gradient $\frac{\partial\mathcal{L}}{\partial n}$ can be expressed as a sum of all path gradients from the node $n$ to the output. Each path is characterized by the set of layers $s$ it goes through (in contrast to routing via the skip connection). We write the gradient along a path $s$ as $\frac{\partial\mathcal{L}_{s}}{\partial n}$ .

Let $\mathcal{S}$ be the set of all subsets of layers after the layer $n$ is in. For example, the direct-effect path is given by $\emptyset\in\mathcal{S}$ . Then the total gradient can be expressed as

\displaystyle\frac{\partial\mathcal{L}}{\partial n}

\displaystyle=\sum_{s\in\mathcal{S}}\frac{\partial\mathcal{L}_{s}}{\partial n}.

(18)

We can analogously define $\hat{\mathcal{I}}_{\text{AtP}}^{s}(n)=(n(x^{\text{noise}})-n(x^{\text{clean}})% )^{\intercal}\frac{\partial\mathcal{L}_{s}}{\partial n}$ , and break down $\hat{\mathcal{I}}_{\text{AtP}}(n)=\sum_{s\in\mathcal{S}}\hat{\mathcal{I}}_{% \text{AtP}}^{s}(n)$ . The effect of doing GradDrop at some layer $\ell$ is then to drop all terms $\hat{\mathcal{I}}_{\text{AtP}}^{s}(n)$ with $\ell\in s$ : in other words,

\displaystyle\hat{\mathcal{I}}_{\text{AtP+GD}_{\ell}}(n)

\displaystyle=\sum_{\begin{subarray}{c}s\in\mathcal{S}\\ \ell\not\in s\end{subarray}}\hat{\mathcal{I}}_{\text{AtP}}^{s}(n).

(21)

Now we’ll use this understanding to discuss the 3 desiderata.

Firstly, most node effects are approximately independent of most layers (see e.g. Veit et al. (2016)); for any layer $\ell$ that $n$ ’s effect is independent of, we’ll have $\hat{\mathcal{I}}_{\text{AtP+GD}_{\ell}}(n)=\hat{\mathcal{I}}_{\text{AtP}}(n)$ . Letting $K$ be the set of downstream layers that matter, this guarantees $\frac{1}{L-1}\sum_{\ell=1}^{L}\left|\hat{\mathcal{I}}_{\text{AtP+GD}_{\ell}}(n% ;x^{\text{clean}},x^{\text{noise}})\right|\geq\frac{L-|K|-1}{L-1}\left|\hat{% \mathcal{I}}_{\text{AtP}}(n;x^{\text{clean}},x^{\text{noise}})\right|$ , which meets the first desideratum.

Regarding the second desideratum: for each $\ell$ we have $\left|\hat{\mathcal{I}}_{\text{AtP+GD}_{\ell}}(n)\right|\leq\sum_{s\in\mathcal% {S}}\left|\hat{\mathcal{I}}_{\text{AtP}}^{s}(n)\right|$ , so overall we have $\frac{1}{L-1}\sum_{\ell=1}^{L}\left|\hat{\mathcal{I}}_{\text{AtP+GD}_{\ell}}(n% )\right|\leq\frac{L-|K|-1}{L-1}\left|\hat{\mathcal{I}}_{\text{AtP}}(n)\right|+% \frac{|K|}{L-1}\sum_{s\in\mathcal{S}}\left|\hat{\mathcal{I}}_{\text{AtP}}^{s}(% n)\right|$ . For the RHS to be much larger (e.g. $\alpha$ times larger) than $\left|\sum_{s\in\mathcal{S}}\hat{\mathcal{I}}_{\text{AtP}}^{s}(n)\right|=|\hat% {\mathcal{I}}_{\text{AtP}}(n)|$ , there must be quite a lot of cancellation between different paths, enough so that $\sum_{s\in\mathcal{S}}\left|\hat{\mathcal{I}}_{\text{AtP}}^{s}(n)\right|\geq% \frac{(L-1)\alpha}{|K|}\left|\sum_{s\in\mathcal{S}}\hat{\mathcal{I}}_{\text{% AtP}}^{s}(n)\right|$ . This is possible, but seems generally unlikely for e.g. $\alpha>3$ .

Now let’s consider the third desideratum, i.e. suppose $n$ is a cancellation false negative, with $|\hat{\mathcal{I}}_{\text{AtP}}(n)|\ll|\mathcal{I}(n)|\ll|\mathcal{I}^{\text{% direct}}(n)|\approx|\hat{\mathcal{I}}_{\text{AtP}}^{\text{direct}}(n)|$ . Then, $\left|\sum_{s\in\mathcal{S}\setminus\emptyset}\hat{\mathcal{I}}_{\text{AtP}}^{% s}(n)\right|=\left|\hat{\mathcal{I}}_{\text{AtP}}(n)-\hat{\mathcal{I}}_{\text{% AtP}}^{\text{direct}}(n)\right|\gg|\mathcal{I}(n)|$ . The summands in $\sum_{s\in\mathcal{S}\setminus\emptyset}\hat{\mathcal{I}}_{\text{AtP}}^{s}(n)$ are the union of the summands in $\sum_{\begin{subarray}{c}s\in\mathcal{S}\\ \ell\in s\end{subarray}}\hat{\mathcal{I}}_{\text{AtP}}^{s}(n)=\hat{\mathcal{I}% }_{\text{AtP}}(n)-\hat{\mathcal{I}}_{\text{AtP+GD}_{\ell}}(n)$ across layers $\ell$ .

It’s then possible but intuitively unlikely that $\sum_{\ell}\left|\hat{\mathcal{I}}_{\text{AtP}}(n)-\hat{\mathcal{I}}_{\text{% AtP+GD}_{\ell}}(n)\right|$ would be much smaller than $\left|\hat{\mathcal{I}}_{\text{AtP}}(n)-\hat{\mathcal{I}}_{\text{AtP}}^{\text{% direct}}(n)\right|$ . Suppose the ratio is $\alpha$ , i.e. suppose $\sum_{\ell}\left|\hat{\mathcal{I}}_{\text{AtP}}(n)-\hat{\mathcal{I}}_{\text{% AtP+GD}_{\ell}}(n)\right|=\alpha\left|\hat{\mathcal{I}}_{\text{AtP}}(n)-\hat{% \mathcal{I}}_{\text{AtP}}^{\text{direct}}(n)\right|$ . For example, if all indirect effects use paths of length 1 then the union is a disjoint union, so $\sum_{\ell}\left|\hat{\mathcal{I}}_{\text{AtP}}(n)-\hat{\mathcal{I}}_{\text{% AtP+GD}_{\ell}}(n)\right|\geq\left|\sum_{\ell}\left(\hat{\mathcal{I}}_{\text{% AtP}}(n)-\hat{\mathcal{I}}_{\text{AtP+GD}_{\ell}}(n)\right)\right|=\left|\hat{% \mathcal{I}}_{\text{AtP}}(n)-\hat{\mathcal{I}}_{\text{AtP}}^{\text{direct}}(n)\right|$ , so $\alpha\geq 1$ . Now:

$\displaystyle\sum_{\ell\in K}\left\|\hat{\mathcal{I}}_{\text{AtP+GD}_{\ell}}(n)\right\|$	$\displaystyle\geq\sum_{\ell\in K}\left\|\hat{\mathcal{I}}_{\text{AtP}}(n)-\hat{% \mathcal{I}}_{\text{AtP+GD}_{\ell}}(n)\right\|-\|K\|\left\|\hat{\mathcal{I}}_{% \text{AtP}}(n)\right\|$	(22)
	$\displaystyle=\alpha\left\|\hat{\mathcal{I}}_{\text{AtP}}(n)-\hat{\mathcal{I}}_% {\text{AtP}}^{\text{direct}}(n)\right\|-\|K\|\left\|\hat{\mathcal{I}}_{\text{AtP}}% (n)\right\|$	(23)
	$\displaystyle\geq\alpha\left\|\hat{\mathcal{I}}_{\text{AtP}}^{\text{direct}}(n)% \right\|-(\|K\|+\alpha)\left\|\hat{\mathcal{I}}_{\text{AtP}}(n)\right\|$	(24)
$\displaystyle\therefore\frac{1}{L-1}\sum_{\ell=1}^{L}\left\|\hat{\mathcal{I}}_{% \text{AtP+GD}_{\ell}}(n)\right\|$	$\displaystyle=\frac{1}{L-1}\sum_{\ell\in K}\left\|\hat{\mathcal{I}}_{\text{AtP+% GD}_{\ell}}(n)\right\|+\frac{L-\|K\|-1}{L-1}\left\|\hat{\mathcal{I}}_{\text{AtP}}(% n)\right\|$	(25)
	$\displaystyle\geq\frac{\alpha}{L-1}\left\|\hat{\mathcal{I}}_{\text{AtP}}^{\text% {direct}}(n)\right\|+\frac{L-2\|K\|-1-\alpha}{L-1}\left\|\hat{\mathcal{I}}_{\text{% AtP}}(n)\right\|$	(26)

And the RHS is an improvement over $\left|\hat{\mathcal{I}}_{\text{AtP}}(n)\right|$ so long as $\alpha\left|\hat{\mathcal{I}}_{\text{AtP}}^{\text{direct}}(n)\right|>(2|K|+% \alpha)\left|\hat{\mathcal{I}}_{\text{AtP}}(n)\right|$ , which is likely given the assumptions.

Ultimately, though, the desiderata are validated by the experiments, which consistently show GradDrops either decreasing or leaving untouched the number of false negatives, and thus improving performance apart from the initial upfront cost of the extra backwards passes.

A.3 Algorithm for computing diagnostics

Given summary statistics $\bar{i}_{\pm}$ , $s_{\pm}$ and $\text{count}_{\pm}$ for every node $n$ , obtained from Algorithm 1, and a threshold $\theta>0$ we can use Welch’s $t$ -test Welch (1947) to test the hypothesis that $|\bar{i}_{+}-\bar{i}_{-}|\geq\theta$ . Concretely we compute the $t$ -statistic via

	$\displaystyle s_{\bar{i}_{\pm}}$	$\displaystyle=\frac{s_{\pm}}{\sqrt{\text{count}_{\pm}}}$		(28)
	$\displaystyle t$	$\displaystyle=\frac{\theta-\|\bar{i}_{+}-\bar{i}_{-}\|}{\sqrt{s_{\bar{i}_{+}}^{2% }+s_{\bar{i}_{-}}^{2}}}.$		(29)

The effective degrees of freedom $\nu$ can be approximated with the Welch–Satterthwaite equation

\displaystyle\nu_{\text{Welch}}=\frac{\left(\frac{s_{+}^{2}}{\text{count}_{+}}% +\frac{s_{-}^{2}}{\text{count}_{-}}\right)^{2}}{\frac{s_{+}^{4}}{\text{count}^% {2}_{+}(\text{count}_{+}-1)}+\frac{s_{-}^{4}}{\text{count}^{2}_{-}(\text{count% }_{-}-1)}}

(30)

We then compute the probability ( $p$ -value) of obtaining a $t$ at least as large as observed, using the cumulative distribution function of Student’s $t\Big{(}x;\nu_{\text{Welch}}\Big{)}$ at the appropriate points. We take the max of the individual $p$ -values of all nodes to obtain an aggregate upper bound. Finally, we use binary search to find the largest threshold $\theta$ that still has an aggregate $p$ -value smaller than a given target $p$ value. We show multiple such diagnostic curves in Section B.3, for different confidence levels ( $1-p_{\text{target}}$ ).

Appendix B Experiments

B.1 Prompt Distributions

B.1.1 IOI

We use the following prompt template:

BOSWhen␣[A]␣and␣[B]␣went␣to␣the␣bar,␣[A/C]␣gave␣a␣drink␣to␣[B/A]

Each clean prompt $x^{\text{clean}}$ uses two names A and B with completion B, while a noise prompt $x^{\text{noise}}$ uses names A, B, and C with completion A. We construct all possible such assignments where names are chosen from the set of {Michael, Jessica, Ashley, Joshua, David, Sarah}, resulting in 120 prompt pairs.

B.1.2 A-AN

We use the following prompt template to induce the prediction of an indefinite article.

BOSI␣want␣one␣pear.␣Can␣you␣pick␣up␣a␣pear␣for␣me?
␣I␣want␣one␣orange.␣Can␣you␣pick␣up␣an␣orange␣for␣me?
␣I␣want␣one␣[OBJECT].␣Can␣you␣pick␣up␣[a/an]

We found that zero shot performance of small models was relatively low, but performance improved drastically when providing a single example of each case. Model performance was sensitive to the ordering of the two examples but was better than random in all cases. The magnitude and sign of the impact of the few-shot ordering was inconsistent.

Clean prompts $x^{\text{clean}}$ contain objects inducing ‘␣a’, one of {boat, coat, drum, horn, map, pipe, screw, stamp, tent, wall}. Noise prompts $x^{\text{noise}}$ contain objects inducing ‘␣an’, one of {apple, ant, axe, award, elephant, egg, orange, oven, onion, umbrella}. This results in a total of 100 prompt pairs.

B.2 Cancellation across a distribution

As mention in Section 2, we average the magnitudes of effects across a distribution, rather than taking the magnitude of the average effect. We do this because cancellation of effects is happening frequently across a distribution, which, together with imprecise estimates, could lead to significant false negatives. A proper ablation study to quantify this effect exactly is beyond the scope of this work. In Figure 10, we show the degree of cancellation across the IOI distribution for various model sizes. For this we define the Cancellation Ratio of node $n$ as

\displaystyle 1-\frac{\left|\sum_{x^{\text{clean}},x^{\text{noise}}}\mathcal{I% }(n;x^{\text{clean}},x^{\text{noise}})\right|}{\sum_{x^{\text{clean}},x^{\text% {noise}}}\left|\mathcal{I}(n;x^{\text{clean}},x^{\text{noise}})\right|}.

B.3 Additional detailed results

We show the diagnostic measurements for Pythia-12B across all investigated distributions in Figure 11(b), and cost of verified 100% recall curves for all models and settings in Figures 12(c) and 13(c).

B.4 Metrics

In this paper we focus on the difference in loss (negative log probability) as the metric $\mathcal{L}$ . We provide some evidence that AtP( ${}^{*}$ ) is not sensitive to the choice of $\mathcal{L}$ . For Pythia-12B, on IOI-PP and IOI, we show the rank scatter plots in Figure 14 for three different metrics.

For IOI, we also show that performance of AtP ${}^{*}$ looks notably worse when effects are evaluated via denoising instead of noising (cf. Section 2.1). As of now we do not have a satisfactory explanation for this observation.

B.5 Hyperparameter selection

The iterative baseline, and the AtP-based methods, have no hyperparameters. In general, we used 5 random seeds for each hyperparameter setting, and selected the setting that produced the lowest IRWRGM cost (see Section 4.2).

For Subsampling, the two hyperparameters are the Bernoulli sampling probability $p$ , and the number of samples to collect before verifying nodes in decreasing order of $\hat{c}_{\text{SS}}$ . $p$ was chosen from {0.01, 0.03}¹⁴¹⁴14We observed early on that larger values of $p$ were consistently underperforming. We leave it to future work to investigate more granular and smaller values for $p$ .. The number of steps was chosen among power-of-2 numbers of batches, where the batch size depended on the setting.

For Blocks, we swept across block sizes 2, 6, 20, 60, 250. For Hierarchical, we used a branching factor of $B=3$ , because of the following heuristic argument. If all but one node had zero effect, then discovering that node would be a matter of iterating through the hierarchy levels. We’d have number of levels $\log_{B}|N|$ , and at each level, $B$ forward passes would be required to find which lower-level block the special node is in – and thus the cost of finding the node would be $B\log_{B}|N|=\frac{B}{\log B}\log|N|$ . $\frac{B}{\log B}$ is minimized at $B=e$ , or at $B=3$ if $B$ must be an integer. The other hyperparameter is the number of levels; we swept this from 2 to 12.

Appendix C AtP variants

C.1 Residual-site AtP and Layer normalization

Let’s consider the behaviour of AtP on sites that contain much or all of the total signal in the residual stream, such as residual-stream sites. Nanda (2022) described a concern about this behaviour: that linear approximation of the layer normalization would do poorly if the patched value is significantly different than the clean one, but with a similar norm. The proposed modification to AtP to account for this was to hold the scaling factors (in the denominators) fixed when computing the backwards pass. Here we’ll present an analysis of how this modification would affect the approximation error of AtP. (Empirical investigation of this issue is beyond the scope of this paper.)

Concretely, let the node under consideration be $n$ , with clean and alternate values $n^{\mathrm{clean}}$ and $n^{\mathrm{noise}}$ ; and for simplicity, let’s assume the model does nothing more than an unparametrized RMSNorm $\mathcal{M}(n):=n/|n|$ . Let’s now consider how well $\mathcal{M}(n^{\mathrm{noise}})$ is approximated, both by its first-order approximation $\hat{\mathcal{M}}_{\text{AtP}}(n^{\mathrm{noise}}):=\mathcal{M}(n^{\mathrm{% clean}})+\mathcal{M}(n^{\mathrm{clean}})^{\perp}(n^{\mathrm{noise}}-n^{\mathrm% {clean}})$ where $\mathcal{M}(n^{\mathrm{clean}})^{\perp}=I-\mathcal{M}(n^{\mathrm{clean}})% \mathcal{M}(n^{\mathrm{clean}})^{\intercal}$ is the projection to the hyperplane orthogonal to $\mathcal{M}(n^{\mathrm{clean}})$ , and by the variant that fixes the denominator: $\hat{\mathcal{M}}_{\text{AtP+frozenLN}}(n^{\mathrm{noise}}):=n^{\mathrm{noise}% }/|n^{\mathrm{clean}}|$ .

To quantify the error in the above, we’ll measure the error $\epsilon$ in terms of Euclidean distance. Let’s also assume, without loss of generality, that $|n^{\mathrm{clean}}|=1$ . Geometrically, then, $\mathcal{M}(n)$ is a projection onto the unit hypersphere, $\mathcal{M}_{\text{AtP}}(n)$ is a projection onto the tangent hyperplane at $n^{\mathrm{clean}}$ , and $\mathcal{M}_{\text{AtP+frozenLN}}$ is the identity function.

Now, let’s define orthogonal coordinates $(x,y)$ on the plane spanned by $n^{\mathrm{clean}},n^{\mathrm{noise}}$ , such that $n^{\mathrm{clean}}$ is mapped to $(1,0)$ and $n^{\mathrm{noise}}$ is mapped to $(x,y)$ , with $y\geq 0$ . Then, $\epsilon_{\text{AtP}}:=\left|\hat{\mathcal{M}}(n^{\mathrm{noise}})-\mathcal{M}% (n^{\mathrm{noise}})\right|=\sqrt{2+y^{2}-2\frac{x+y^{2}}{\sqrt{x^{2}+y^{2}}}}$ , while $\epsilon_{\text{AtP+frozenLN}}:=\left|\hat{\mathcal{M}}_{\mathrm{fix}}(n^{% \mathrm{noise}})-\mathcal{M}(n^{\mathrm{noise}})\right|=\left|\sqrt{x^{2}+y^{2% }}-1\right|$ .

Plotting the error in Figure 15, we can see that, as might be expected, freezing the layer norm denominators helps whenever $n^{\mathrm{noise}}$ indeed has the same norm as $n^{\mathrm{clean}}$ , and (barring weird cases with $x>1$ ) whenever the cosine-similarity is less than $\frac{1}{2}$ ; but largely hurts if $n^{\mathrm{noise}}$ is close to $n^{\mathrm{clean}}$ . This illustrates that, while freezing the denominators will generally be unhelpful when patch distances are small relative to the full residual signal (as with almost all nodes considered in this paper), it will likely be helpful in a different setting of patching residual streams, which could be quite unaligned but have similar norm.

C.2 Edge AtP and AtP*

Here we will investigate edge attribution patching, and how the cost scales if we use GradDrop and/or QK fix. (For this section we’ll focus on a single prompt pair.)

First, let’s review what edge attribution patching is trying to approximate, and how it works.

C.2.1 Edge intervention effects

Given nodes $n_{1},n_{2}$ where $n_{1}$ is upstream of $n_{2}$ , if we were to patch in an alternate value for $n_{1}$ , this could impact $n_{2}$ in a complicated nonlinear way. As discussed in 3.1.2, because LLMs have a residual stream, the “direct effect” can be understood as the one holding all other possible intermediate nodes between $n_{1}$ and $n_{2}$ fixed – and it’s a relatively simple function, composed of transforming the alternate value $n_{1}(x^{\text{noise}})$ to a residual stream contribution $r_{\text{out},{\ell_{1}}}(x^{\text{clean}}|\operatorname{do}(n_{1}\leftarrow n% _{1}(x^{\text{noise}})))$ , then carrying it along the residual stream to an input $r_{\text{in},{\ell_{2}}}=r_{\text{in},{\ell_{2}}}(x^{\text{clean}})+(r_{\text{% out},{\ell_{1}}}-r_{\text{out},{\ell_{1}}}(x^{\text{clean}}))$ , and transforming that into a value $n_{2}^{\text{direct}}$ .

In the above, $\ell_{1}$ and $\ell_{2}$ are the semilayers containing $n_{1}$ and $n_{2}$ , respectively. Let’s define $\mathbf{n}_{(\ell_{1},\ell_{2})}$ to be the set of non-residual nodes between semilayers $\ell_{1}$ and $\ell_{2}$ . Then, we can define the resulting $n_{2}^{\text{direct}}$ as:

n_{2}^{\text{direct}^{\ell_{1}}}(x^{\text{clean}}|\operatorname{do}(n_{1}% \leftarrow n_{1}(x^{\text{noise}}))):=n_{2}(x^{\text{clean}}|\operatorname{do}% (n_{1}\leftarrow n_{1}(x^{\text{noise}})),\operatorname{do}(\mathbf{n}_{(\ell_% {1},\ell_{2})}\leftarrow\mathbf{n}_{(\ell_{1},\ell_{2})}(x^{\text{clean}}))).

The residual-stream input $r_{\text{in},{\ell_{2}}}^{\text{direct}^{\ell_{1}}}(x^{\text{clean}}|% \operatorname{do}(n_{1}\leftarrow n_{1}(x^{\text{noise}})))$ is defined similarly.

Finally, $n_{2}$ itself isn’t enough to compute the metric $\mathcal{L}$ – for that we also need to let the forward pass $\mathcal{M}(x^{\text{clean}})$ run using the modified $n_{2}^{\text{direct}^{\ell_{1}}}(x^{\text{clean}}|\operatorname{do}(n_{1}% \leftarrow n_{1}(x^{\text{noise}})))$ , while removing all other effects of $n_{1}$ (i.e. not patching it).

Writing this out, we have edge intervention effect

	$\displaystyle\mathcal{I}(n_{1}\rightarrow n_{2};x^{\text{clean}},x^{\text{% noise}})$	$\displaystyle:=\mathcal{L}(\mathcal{M}(x^{\text{clean}}\|\operatorname{do}(n_{2% }\leftarrow n_{2}^{\text{direct}^{\ell_{1}}}(x^{\text{clean}}\|\operatorname{do% }(n_{1}\leftarrow n_{1}(x^{\text{noise}}))))))$
		$\displaystyle\mathrel{\phantom{:=}}-\mathcal{L}(\mathcal{M}(x^{\text{clean}})).$		(31)

C.2.2 Nodes and Edges

Let’s briefly consider what edges we’d want to be evaluating this on. In Section 4.1, we were able to conveniently separate attention nodes from MLP neurons, knowing that to handle both kinds of nodes, we’d just need to be able handle each kind of node on its own, and then combine the results. For edge interventions this of course isn’t true, because edges can go from MLP neurons to attention nodes, and vice versa. For the purposes of this section, we’ll assume that the node set $N$ contains the attention nodes, and for MLPs either a node per layer (as in Syed et al. (2023)), or a node per neuron (as in the NeuronNodes setting).

Regarding the edges, the MLP nodes can reasonably be connected with any upstream or downstream node, but this isn’t true for the attention nodes, which have more of a structure amongst themselves: the key, query, and value nodes for an attention head can only affect downstream nodes via the attention output nodes for that head, and vice versa. As a result, on edges between different semilayers, upstream attention nodes must be attention head outputs, and downstream attention nodes must be keys, queries, or values. In addition, there are some within-attention-head edges, connecting each query node to the output node in the same position, and each key and value node to output nodes in causally affectable positions.

C.2.3 Edge AtP

As with node activation patching, the edge intervention effect $\mathcal{I}(n_{1}\rightarrow n_{2};x^{\text{clean}},x^{\text{noise}})$ is costly to evaluate directly for every edge, since a forward pass is required each time. However, as with AtP, we can apply first-order approximations: we define

$\displaystyle\hat{\mathcal{I}}_{\text{AtP}}(n_{1}\rightarrow n_{2};x^{\text{% clean}},x^{\text{noise}})$	$\displaystyle:=\left(\Delta r_{n_{1}}^{\text{AtP}}(x^{\text{clean}},x^{\text{% noise}})\right)^{\intercal}\nabla_{r_{n_{2}}}^{\text{AtP}}\mathcal{L}(\mathcal% {M}(x^{\text{clean}})),$	(32)
$\displaystyle\text{where }\Delta r_{n_{1}}^{\text{AtP}}(x^{\text{clean}},x^{% \text{noise}})$	$\displaystyle:=\operatorname{Jac}_{n_{1}}(r_{\text{out},{\ell_{1}}})(n_{1}(x^{% \text{clean}}))(n_{1}(x^{\text{noise}})-n_{1}(x^{\text{clean}}))$	(33)
$\displaystyle\text{and }\nabla_{r_{n_{2}}}^{\text{AtP}}\mathcal{L}(\mathcal{M}% (x^{\text{clean}}))$	$\displaystyle:=\left(\operatorname{Jac}_{r_{\text{in},{\ell_{2}}}}(n_{2})(r_{% \text{in},{\ell_{2}}}(x^{\text{clean}}))\right)^{\intercal}\nabla_{n_{2}}(% \mathcal{L}(\mathcal{M}(x^{\text{clean}})))(n_{2}(x^{\text{clean}})),$	(34)

and this is a close approximation when $n_{1}(x^{\text{noise}})\approx n_{1}(x^{\text{clean}})$ .

A key benefit of this decomposition is that the first term depends only on $n_{1}$ , and the second term depends only on $n_{2}$ ; and they’re both easy to compute from a forward and backward pass on $x^{\text{clean}}$ and a forward pass on $x^{\text{noise}}$ , just like AtP itself.

Then, to complete the edge-AtP evaluation, what remains computationally is to evaluate all the dot products between nodes in different semilayers, at each token position. This requires $d_{\mathrm{resid}}T(1-\frac{1}{L})|N|^{2}/2$ multiplications in total¹⁵¹⁵15This formula omits edges within a single layer, for simplicity – but those are a small minority., where $L$ is the number of layers, $T$ is the number of tokens, and $|N|$ is the total number of nodes. This cost exceeds the cost of computing all $\Delta r_{n_{1}}^{\text{AtP}}(x^{\text{clean}},x^{\text{noise}})$ and $\nabla_{r_{n_{2}}}^{\text{AtP}}\mathcal{L}(\mathcal{M}(x^{\text{clean}}))$ on Pythia 2.8B even with a single node per MLP layer; if we look at a larger model, or especially if we consider single-neuron nodes even for small models, the gap grows significantly.

Due to this observation, we’ll focus our attention on the quadratic part of the compute cost, pertaining to two nodes rather than just one – i.e. the number of multiplications in computing all $(\Delta r_{n_{1}}^{\text{AtP}}(x^{\text{clean}},x^{\text{noise}}))^{\intercal}% \nabla_{r_{n_{2}}}^{\text{AtP}}\mathcal{L}(\mathcal{M}(x^{\text{clean}}))$ . Notably, we’ll also exclude within-attention-head edges from the “quadratic cost”: these edges, from some key, query, or value node to an attention output node can be handled by minor variations of the nodewise AtP or AtP* methods for the corresponding key, query, or value node.

C.2.4 MLPs

There are a couple of issues that can come up around the MLP nodes. One is that, similarly to the attention saturation issue described in Section 3.1.1, the linear approximation to the MLP may be fairly bad in some cases, creating significant false negatives if $n_{2}$ is an MLP node. Another issue is that if we use single-neuron nodes, then those are very numerous, making the $d_{\mathrm{resid}}$ -dimensional dot product per edge quite costly.

MLP saturation and fix

Just as clean activations that saturate the attention probability may have small gradients that lead to strongly underestimated effects, the same is true of the MLP nonlinearity. A similar fix is applicable: instead of using a linear approximation to the function from $n_{1}$ to $n_{2}$ , we can linearly approximate the function from $n_{1}$ to the preactivation $n_{2,\text{pre}}$ , and then recompute $n_{2}$ using that, before multiplying by the gradient.

This kind of rearrangement, where the gradient-delta-activation dot product is computed in $d_{n_{2}}$ dimensions rather than $d_{\mathrm{resid}}$ , will come up again – we’ll call it the factored form of AtP.

If the nodes are neurons then the factored form requires no change to the number of multiplications; however, if they’re MLP layers then there’s a large increase in cost, by a factor of $d_{\mathrm{neurons}}$ . This increase is mitigated by two factors: one is that this is a small minority of edges, outnumbered by the number of edges ending in attention nodes by $3\times(\text{\# heads per layer})$ ; the other is the potential for parameter sharing.

Neuron edges and parameter sharing

A useful observation is that each edge, across different token¹⁶¹⁶16Also across different batch entries, if we do this on more than one prompt pair. positions, reuses the same parameter matrices in $\operatorname{Jac}_{n_{1}}(r_{\text{out},{\ell_{1}}})(n_{1}(x^{\text{clean}}))$ and $\operatorname{Jac}_{r_{\text{in},{\ell_{2}}}}(n_{2})(r_{\text{in},{\ell_{2}}}(% x^{\text{clean}}))$ . Indeed, setting aside the MLP activation function, the only other nonlinearity in those functions is a layer normalization; if we freeze the scaling factor at its clean value as in Section C.1, the Jacobians are equal to the product of the corresponding parameter matrices, divided by the clean scaling factor.

Thus if we premultiply the parameter matrices then we eliminate the need to do so at each token, which reduces the per-token quadratic cost by $d_{\mathrm{resid}}$ (i.e. to a scalar multiplication) for neuron-neuron edges, or by $d_{\mathrm{resid}}/d_{\mathrm{site}}$ (i.e. to a $d_{\mathrm{site}}$ -dimensional dot product) for edges between neurons and some attention site.

It’s worth noting, though, that these premultiplied parameter matrices (or, indeed, the edge-AtP estimates if we use neuron sites) will in total be many times (specifically, $(L-1)\frac{d_{\mathrm{neurons}}}{4d_{\mathrm{resid}}}$ times) larger than the MLP weights themselves, so storage may need to be considered carefully. It may be worth considering ways to only find the largest estimates, or the estimates over some threshold, rather than full estimates for all edges.

C.2.5 Edge AtP* costs

Let’s now consider how to adapt the AtP* proposals from Section 3.1 to this setting. We’ve already seen that the MLP fix, which is similarly motivated to the QK fix, has negligible cost in the neuron-nodes case, but comes with a $d_{\mathrm{neurons}}/d_{\mathrm{resid}}$ overhead in quadratic cost in the case of using an MLP layer per node, at least on edges into those MLP nodes. We’ll consider the MLP fix to be part of edge-AtP*. Now let’s investigate the two corrections in regular AtP*: GradDrops, and the QK fix.

GradDrops

GradDrops works by replacing the single backward pass in the AtP formula with $L$ backward passes; this in effect means $L$ values for the multiplicand $\nabla_{r_{n_{2}}}^{\text{AtP}}\mathcal{L}(\mathcal{M}(x^{\text{clean}}))$ , so this is a multiplicative factor of $L$ on the quadratic cost (though in fact some of these will be duplicates, and taking this into account lets us drive the multiplicative factor down to $(L+1)/2$ ). Notably this works equally well with “factored AtP”, as used for neuron edges; and in particular, if $n_{2}$ is a neuron, the gradients can easily be combined and shared across $n_{1}$ s, eliminating the $(L+1)/2$ quadratic-cost overhead.

However, the motivation for GradDrops was to account for multiple paths whose effects may cancel; in the edge-interventions setting, these can already be discovered in a different way (by identifying the responsible edges out of $n_{2}$ ), so the benefit of GradDrops is lessened. At the same time, the cost remains substantial. Thus, we’ll omit GradDrops from our recommended procedure edge-AtP*.

QK fix

The QK fix applies to the $\nabla_{n_{2}}(\mathcal{L}(\mathcal{M}(x^{\text{clean}})))(n_{2}(x^{\text{% clean}}))$ term, i.e. to replacing the linear approximation to the softmax with a correct calculation to the change in softmax, for each different input $\Delta r_{n_{1}}^{\text{AtP}}(x^{\text{clean}},x^{\text{noise}})$ . As in Section 3.1.1, there’s the simpler case of accounting for $n_{2}$ s that are query nodes, and the more complicated case of $n_{2}$ s that are key nodes using Algorithm 4 – but these are both cheap to do after computing the $\Delta\operatorname{attnLogits}$ corresponding to $n_{2}$ .

The “factored AtP” way is to matrix-multiply $\Delta r_{n_{1}}^{\text{AtP}}(x^{\text{clean}},x^{\text{noise}})$ with key or query weights and with the clean queries or keys, respectively. This means instead of the $d_{\mathrm{resid}}$ multiplications required for each edge $n_{1}\rightarrow n_{2}$ with AtP, we need $d_{\mathrm{resid}}d_{\mathrm{key}}+Td_{\mathrm{key}}$ multiplications (which, thanks to the causal mask, can be reduced to an average of $d_{\mathrm{key}}(d_{\mathrm{resid}}+(T+1)/2)$ ).

The “unfactored” option is to stay in the $r_{\text{in},{\ell_{2}}}$ space: pre-multiply the clean queries or keys with the respective key or query weight matrices, and then take the dot product of $\Delta r_{n_{1}}^{\text{AtP}}(x^{\text{clean}},x^{\text{noise}})$ with each one. This way, the quadratic part of the compute cost contains $d_{\mathrm{resid}}(T+1)/2$ multiplications; this will be more efficient for short sequence lengths.

This means that for edges into key and query nodes, the overhead of doing AtP+QKfix on the quadratic cost is a multiplicative factor of $\min\left(\frac{T+1}{2},d_{\mathrm{key}}\left(1+\frac{T+1}{2d_{\mathrm{resid}}% }\right)\right)$ .

QK fix + GradDrops

If the QK fix is being combined with GradDrops, then the first multiplication by the $d_{\mathrm{resid}}\times d_{\mathrm{key}}$ matrix can be shared between the different gradients; so the overhead on the quadratic cost of QKfix + GradDrops for edges into queries and keys, using the factored method, is $d_{\mathrm{key}}\left(1+\frac{(T+1)(L+1)}{4d_{\mathrm{resid}}}\right)$ .

C.3 Conclusion

Considering all the above possibilities, it’s not obvious where the best tradeoff is between correctness and compute cost in all situations. In Table 2 we provide formulas measuring the number of multiplications in the quadratic cost for each kind of edge, across the variations we’ve mentioned. In Figure 16 we plug in the 4 sizes of Pythia model used elsewhere in the paper, such as Figure 2, to enable numerical comparison.

AtP variant	O $\rightarrow$ V	O $\rightarrow$ Q,K	O $\rightarrow$ MLP	MLP $\rightarrow$ V	MLP $\rightarrow$ Q,K	MLP $\rightarrow$ MLP
MLP layers	$DH^{2}$	$2DH^{2}$	$DH$	$DH$	$2DH$	$D$
QKfix	$DH^{2}$	$(T+1)DH^{2}$	$DH$	$DH$	$(T+1)DH$	$D$
QKfix+GD	$\frac{L+1}{2}DH^{2}$	$\frac{(L+1)(T+1)}{2}DH^{2}$	$\frac{L+1}{2}DH$	$\frac{L+1}{2}DH$	$\frac{(L+1)(T+1)}{2}DH$	$\frac{L+1}{2}D$
AtP*	$DH^{2}$	$(T+1)DH^{2}$	$VNH$	$DH$	$(T+1)DH$	$ND$
AtP*+GD	$\frac{L+1}{2}DH^{2}$	$\frac{(L+1)(T+1)}{2}DH^{2}$	$VNH$	$\frac{L+1}{2}DH$	$\frac{(L+1)(T+1)}{2}DH$	$ND$
QKfix (long)	$DH^{2}$	$(2D+T+1)KH^{2}$	$DH$	$DH$	$(2D+T+1)KH$	$D$
QKfix+GD	$\frac{L+1}{2}DH^{2}$	$\frac{L+1}{2}(2D+T+1)KH^{2}$	$\frac{L+1}{2}DH$	$\frac{L+1}{2}DH$	$\frac{L+1}{2}(2D+T+1)KH$	$\frac{L+1}{2}D$
ATP*	$DH^{2}$	$(2D+T+1)KH^{2}$	$VNH$	$DH$	$(2D+T+1)KH$	$ND$
AtP*+GD	$\frac{L+1}{2}DH^{2}$	$\frac{L+1}{2}(2D+T+1)KH^{2}$	$VNH$	$\frac{L+1}{2}DH$	$\frac{L+1}{2}(2D+T+1)KH$	$ND$
Neurons	$DH^{2}$	$2DH^{2}$	$VNH$	$VNH$	$2KNH$	$N^{2}$
MLPfix	$DH^{2}$	$2DH^{2}$	$VNH$	$VNH$	$2KNH$	$N^{2}$
AtP*	$DH^{2}$	$(T+1)DH^{2}$	$VNH$	$VNH$	$(T+1)KNH$	$N^{2}$
AtP*+GD	$\frac{L+1}{2}DH^{2}$	$\frac{L+1}{2}(T+1)DH^{2}$	$VNH$	$\frac{L+1}{2}VNH$	$\frac{(L+1)(T+1)}{2}KNH$	$N^{2}$
ATP* (long)	$DH^{2}$	$(2D+T+1)KH^{2}$	$VNH$	$VNH$	$(T+1)KNH$	$N^{2}$
AtP*+GD	$\frac{L+1}{2}DH^{2}$	$\frac{L+1}{2}(2D+T+1)KH^{2}$	$VNH$	$\frac{L+1}{2}VNH$	$\frac{(L+1)(T+1)}{2}KNH$	$N^{2}$

Table 2: Per-token per-layer-pair total quadratic cost of each kind of between-layers edge, across edge-AtP variants. For brevity, we omit the layer-pair

\binom{L}{2}

factor that would otherwise be in every cell, and use

D:=d_{\mathrm{resid}},H:=\text{\# heads per layer},K:=d_{\mathrm{key}},V:=d_{% \mathrm{value}},N:=d_{\mathrm{neurons}}

Appendix D Distribution of true effects

In Figure 17, we show the distribution of $c(n)$ across models and distributions.

$\displaystyle\sum_{\ell\in K}\left\|\hat{\mathcal{I}}_{\text{AtP+GD}_{\ell}}(n)\right\|$	$\displaystyle\geq\sum_{\ell\in K}\left\|\hat{\mathcal{I}}_{\text{AtP}}(n)-\hat{% \mathcal{I}}_{\text{AtP+GD}_{\ell}}(n)\right\|-\|K\|\left\|\hat{\mathcal{I}}_{% \text{AtP}}(n)\right\|$	(22)
	$\displaystyle=\alpha\left\|\hat{\mathcal{I}}_{\text{AtP}}(n)-\hat{\mathcal{I}}_% {\text{AtP}}^{\text{direct}}(n)\right\|-\|K\|\left\|\hat{\mathcal{I}}_{\text{AtP}}% (n)\right\|$	(23)
	$\displaystyle\geq\alpha\left\|\hat{\mathcal{I}}_{\text{AtP}}^{\text{direct}}(n)% \right\|-(\|K\|+\alpha)\left\|\hat{\mathcal{I}}_{\text{AtP}}(n)\right\|$	(24)
$\displaystyle\therefore\frac{1}{L-1}\sum_{\ell=1}^{L}\left\|\hat{\mathcal{I}}_{% \text{AtP+GD}_{\ell}}(n)\right\|$	$\displaystyle=\frac{1}{L-1}\sum_{\ell\in K}\left\|\hat{\mathcal{I}}_{\text{AtP+% GD}_{\ell}}(n)\right\|+\frac{L-\|K\|-1}{L-1}\left\|\hat{\mathcal{I}}_{\text{AtP}}(% n)\right\|$	(25)
	$\displaystyle\geq\frac{\alpha}{L-1}\left\|\hat{\mathcal{I}}_{\text{AtP}}^{\text% {direct}}(n)\right\|+\frac{L-2\|K\|-1-\alpha}{L-1}\left\|\hat{\mathcal{I}}_{\text{% AtP}}(n)\right\|$	(26)

AtP*{}^{*}start_FLOATSUPERSCRIPT * end_FLOATSUPERSCRIPT: An efficient and scalable method for localizing LLM behaviour to components

Abstract

1 Introduction

Our contributions:

2 Background

2.1 Problem Statement

Model components.

Model behaviour.

Contribution of a component.

2.2 Attribution Patching

3 Methods

3.1 AtP improvements

3.1.1 False negatives from attention saturation

Queries

Keys

3.1.2 False negatives from cancellation

Direct Effect Ratio

3.2 Diagnostics

3.3 Baselines

Iterative

Subsampling

Blocks & Hierarchical

4 Experiments

4.1 Setup

Nodes

Models

Effect Metric ℒℒ\mathcal{L}caligraphic_L

4.2 Measuring Effectiveness and Efficiency

Cost of verified recall

Inverse-rank-weighted geometric mean cost

4.3 Single Prompt Pairs versus Distributions

Clean single prompt pairs

Random prompt pair

Distributions

5 Discussion

5.1 Limitations

Prompt pair distributions

Choice of Nodes N𝑁Nitalic_N

Caveats of c⁢(n)𝑐𝑛c(n)italic_c ( italic_n ) as importance measure

Effect size versus rank estimation

Other LLMs

5.2 Extensions/Variants

Edge Patching

Coarser nodes N𝑁Nitalic_N

Layer normalization

Denoising

Other forms of ablation

5.3 Applications

Automated Circuit Finding

Sparse Autoencoders

Steering LLMs

5.4 Recommendation

6 Related work

Localization and Mediation Analysis

Activation Patching

Attribution Patching / Gradient-based Masking

7 Conclusion

8 Author Contributions

References

Appendix A Method details

A.1 Baselines

A.1.1 Properties of Subsampling

A.1.2 Pseudocode for Blocks and Hierarchical baselines

A.2 AtP improvements

A.2.1 Pseudocode for corrected AtP on attention keys

A.2.2 Properties of GradDrop

A.3 Algorithm for computing diagnostics

Appendix B Experiments

B.1 Prompt Distributions

B.1.1 IOI

B.1.2 A-AN

B.2 Cancellation across a distribution

B.3 Additional detailed results

B.4 Metrics

B.5 Hyperparameter selection

Appendix C AtP variants

C.1 Residual-site AtP and Layer normalization

C.2 Edge AtP and AtP*

C.2.1 Edge intervention effects

C.2.2 Nodes and Edges

AtP ${}^{*}$ : An efficient and scalable method for localizing LLM behaviour to components

Effect Metric $\mathcal{L}$

Choice of Nodes $N$

Caveats of $c(n)$ as importance measure

Coarser nodes $N$