A Primer on the Inner Workings of
Transformer-based Language Models

Javier Ferrando¹, Gabriele Sarti², Arianna Bisazza², Marta R. Costa-jussà³
¹Universitat Politècnica de Catalunya, ²CLCG, University of Groningen, ³FAIR, Meta
Correspondence to: [email protected].
Abstract

The rapid progress of research aimed at interpreting the inner workings of advanced language models has highlighted a need for contextualizing the insights gained from years of work in this area. This primer provides a concise technical introduction to the current techniques used to interpret the inner workings of Transformer-based language models, focusing on the generative decoder-only architecture. We conclude by presenting a comprehensive overview of the known internal mechanisms implemented by these models, uncovering connections across popular approaches and active research directions in this area.

Figure 1: Survey overview. Section 2 introduces the Transformer language model and its components. Section 3 and Section 4 present interpretability techniques used to analyze models’ inner workings. Finally, Section 5 presents known inner workings of Transformer language models.

1 Introduction

The development of powerful Transformer-based language models (LMs; Radford et al., 2019; Brown et al., 2020; Hoffmann et al., 2022; Chowdhery et al., 2023) and their widespread use underscore the importance of research devoted to understanding their inner mechanisms. Gaining a deeper understanding of these mechanisms in highly capable AI systems has important implications for ensuring the safety and fairness of such systems, mitigating their biases and errors in critical settings, and ultimately driving model improvements (Wei et al., 2022; Costa-jussà et al., 2023). As a result, the natural language processing (NLP) community has witnessed a notable increase in research focused on interpretability in language models, leading to new insights into their internal functioning.

Existing surveys present a wide variety of techniques adopted by Explainable AI analyses (Räuker et al., 2023) and their applications in NLP (Madsen et al., 2022; Lyu et al., 2024). While previous NLP interpretability surveys primarily focused on encoder-based models like BERT (Devlin et al., 2019; Rogers et al., 2021), the success of decoder-only Transformers (Radford et al., 2018) prompted further developments in the analysis of these powerful generative models, with concurrent work surveying trends in interpretability research and their relation to AI safety (Bereska & Gavves, 2024). By contrast, this work provides a concise, in-depth technical introduction to relevant techniques used in LM interpretability research, focusing on insights derived from models’ inner workings and drawing connections between different areas of interpretability research. Moreover, throughout this work, we employ a unified notation to introduce model components, interpretability methods, and insights from surveyed works, shedding light on the assumptions and motivations behind specific method designs. We categorize LM interpretability approaches surveyed in this work along two dimensions: i) localizing the inputs or model components responsible for a particular prediction (Section 3); and ii) decoding information stored in learned representations¹ to understand its usage across network components (Section 4). Finally, Section 5 provides an exhaustive list of insights into the inner workings of Transformer-based LMs, and Section 6 provides an overview of useful tools to conduct interpretability analyses on these models.

¹ In this work we use representations and activations interchangeably, and we refer to the fundamental unit of information encoded in model activations as features, representing human-interpretable input properties.

2 The Components of a Transformer Language Model

Auto-regressive language models assign probabilities to sequences of tokens. Using the probability chain rule, we can decompose the probability distribution over a sequence $\mathbf{t} = \langle t_1, t_2, \ldots, t_n \rangle$ into a product of conditional distributions:

$$
P(t_1, \ldots, t_n) = P(t_1) \prod_{i=1}^{n-1} P(t_{i+1} \mid t_1, \ldots, t_i). \tag{1}
$$
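The factorization in Equation 1 can be illustrated with a minimal sketch. The three-token vocabulary and the `next_token_probs` function below are hypothetical stand-ins for a trained LM, not part of any surveyed model; the point is only that multiplying properly normalized conditionals yields a valid distribution over whole sequences.

```python
import itertools
import math

VOCAB = ["a", "b", "c"]  # hypothetical three-token vocabulary


def next_token_probs(prefix):
    """Toy conditional P(t_{i+1} | t_1, ..., t_i); a stand-in for an LM.

    The rule (assumed for illustration only) doubles the weight of the
    token that repeats the last one; the empty prefix gives a uniform P(t_1).
    """
    weights = [2.0 if prefix and tok == prefix[-1] else 1.0 for tok in VOCAB]
    total = sum(weights)
    return {tok: w / total for tok, w in zip(VOCAB, weights)}


def sequence_log_prob(tokens):
    """Log of Equation 1: sum of the conditional log-probabilities."""
    log_p = 0.0
    for i, tok in enumerate(tokens):
        log_p += math.log(next_token_probs(tokens[:i])[tok])
    return log_p


# Summing the joint probability over every length-3 sequence recovers 1,
# because each conditional distribution is itself normalized.
total_mass = sum(
    math.exp(sequence_log_prob(seq))
    for seq in itertools.product(VOCAB, repeat=3)
)
```

Working in log space, as `sequence_log_prob` does, is the usual choice in practice since products of many small probabilities underflow quickly.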

Such distributions can be parametrized using a neural network optimized to maximize the likelihood of a corpus used for training (Bengio et al., 2003). In recent years, the Transformer architecture by Vaswani et al. (2017) was widely adopted for this purpose thanks to its expressivity and its scalability (Kaplan et al., 2020). While several variants of the original Transformer were proposed, we focus here on the decoder-only architecture (also known as GPT-like) due to its success and popularity.² A decoder-only model $f$ has $L$ layers, and operates on a sequence of embeddings $\mathbf{x} = \langle \bm{x}_1, \bm{x}_2, \ldots, \bm{x}_n \rangle$ representing the tokens $\mathbf{t} = \langle t_1, t_2, \ldots, t_n \rangle$. Each embedding $\bm{x} \in \mathbb{R}^{d}$ is a row vector corresponding to a row of the embedding matrix $\bm{W}_E \in \mathbb{R}^{|\mathcal{V}| \times d}$, where $\mathcal{V}$ is the model vocabulary. Intermediate layer representations, for instance, at position $i$ and layer $l$, are referred to as $\bm{x}^{l}_{i}$.³ By $\bm{X} \in \mathbb{R}^{n \times d}$ we represent the sequence $\mathbf{x}$ as a matrix with embeddings stacked as rows. Likewise, for intermediate representations, $\bm{X}^{l}_{\leq i}$ is the layer $l$ representation matrix up to position $i$. Appendix A provides a summary of the notation used in this work.

² Most of the insights presented in this work remain relevant for encoder-only and encoder-decoder models.

³ Note that $\bm{x}^{0}_{i} = \bm{x}_{i}$.
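As a small illustration of this notation, the sketch below builds the matrix $\bm{X}$ by selecting rows of an embedding matrix. The sizes, the random `W_E`, and the token ids are assumptions chosen for illustration; a real model would use its learned embedding matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10, 4                    # hypothetical |V| and model dimension d
W_E = rng.normal(size=(vocab_size, d))   # random stand-in for the embedding matrix

token_ids = np.array([3, 1, 3, 7])       # hypothetical token ids for t_1, ..., t_n
X = W_E[token_ids]                       # n x d matrix, one embedding per row

# Equivalent view: one-hot row vectors multiplied by W_E select the same rows.
one_hot = np.eye(vocab_size)[token_ids]
X_via_matmul = one_hot @ W_E
```

Repeated tokens (here the id 3 at positions 1 and 3) receive identical rows in `X`; any position-dependent information has to be added by later components.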

Following recent literature regarding interpretability in Transformers, we present the architecture adopting the residual stream perspective (Elhage et al., 2021a). In this view, each input embedding gets updated via vector additions from the attention (Section 2.1.2) and feed-forward blocks (Section 2.1.3), producing residual stream states (or intermediate representations). The final layer residual stream state is then projected into the vocabulary space via the unembedding matrix $\bm{W}_U \in \mathbb{R}^{d \times |\mathcal{V}|}$ (Section 2.2), and normalized via the softmax function to obtain the probability distribution over the vocabulary from which a new token is sampled.

2.1 The Transformer Layer

In this section, we present the components of the Transformer layer in the order in which they are computed.

2.1.1 Layer Normalization

Layer normalization (LayerNorm) is a common operation used to stabilize the training process of deep neural networks (Ba et al., 2016). Although early Transformer models implemented LayerNorm at the output of each block, modern models consistently normalize the input of each block (Xiong et al., 2020; Takase et al., 2023). Given a representation $\bm{z}$, LayerNorm computes $\left(\frac{\bm{z} - \mu(\bm{z})}{\sigma(\bm{z})}\right) \odot \gamma + \beta$, where $\mu$ and $\sigma$ calculate the mean and standard deviation, and $\gamma \in \mathbb{R}^{d}$ and $\beta \in \mathbb{R}^{d}$ refer to the learned element-wise scale and bias, respectively. Layer normalization can be interpreted geometrically by visualizing the mean subtraction operation as a projection of input representations onto a hyperplane defined by the normal vector $[1, 1, \ldots, 1] \in \mathbb{R}^{d}$, and the following scaling to $\sqrt{d}$ norm as a mapping of the resulting representations to a hypersphere (Brody et al., 2023). Kobayashi et al. (2021) note that LayerNorm can be treated as an affine transformation $\bm{z}\bm{L} + \beta$, as long as $\sigma(\bm{z})$ is considered a constant (Appendix B). In this view, the matrix $\bm{L}$ computes the centering and scaling operations. Furthermore, the weights of the affine transformation can be folded into the following linear layer (Appendix C), simplifying the analysis.
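Both the geometric reading and the affine reading of LayerNorm can be checked numerically. The sketch below is a minimal illustration under stated assumptions: it uses the population standard deviation with no epsilon term, random values for $\gamma$ and $\beta$, and one possible way of writing $\bm{L}$, namely $\frac{1}{\sigma(\bm{z})}\left(\bm{I} - \frac{1}{d}\mathbf{1}^{\intercal}\mathbf{1}\right)\mathrm{diag}(\gamma)$, which we assume is compatible with the construction referenced in Appendix B.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
z = rng.normal(size=d)       # a single representation (row vector)
gamma = rng.normal(size=d)   # random stand-in for the learned scale
beta = rng.normal(size=d)    # random stand-in for the learned bias


def layer_norm(z, gamma, beta):
    """LayerNorm without an epsilon term (an assumption made for clarity)."""
    return (z - z.mean()) / z.std() * gamma + beta


# Geometric view: centering projects z onto the hyperplane orthogonal to
# [1, ..., 1]; dividing by sigma then rescales the result to norm sqrt(d).
centered = z - z.mean()
normalized = centered / z.std()

# Affine view: treat sigma(z) as a constant and collect centering and
# scaling into a single d x d matrix L, so that LayerNorm(z) = z L + beta.
centering = np.eye(d) - np.ones((d, d)) / d
L = centering @ np.diag(gamma) / z.std()
affine_out = z @ L + beta
```

Note that `L` depends on the input through `z.std()`, which is why the affine view only holds when the standard deviation is held fixed.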

We note that current LMs such as Llama 2 (Touvron et al., 2023) adopt an alternative layer normalization procedure, RMSNorm (Zhang & Sennrich, 2019), where the centering operation is removed, and scaling is performed using the root mean square (RMS) statistic.
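A minimal sketch of RMSNorm follows, assuming a learned gain `gamma` and a small `eps` added inside the square root for numerical stability; the exact placement and value of `eps` vary across implementations and are assumptions here.

```python
import numpy as np


def rms_norm(z, gamma, eps=1e-6):
    """Scale z by its root mean square; no centering and no bias."""
    rms = np.sqrt(np.mean(z ** 2) + eps)
    return z / rms * gamma


rng = np.random.default_rng(0)
d = 8
z = rng.normal(size=d)
gamma = np.ones(d)           # unit gain, chosen for illustration

out = rms_norm(z, gamma, eps=0.0)

# For an input that is already zero-mean, the RMS equals the standard
# deviation, so RMSNorm matches the scaling step of LayerNorm.
z_centered = z - z.mean()
out_centered = rms_norm(z_centered, gamma, eps=0.0)
```

With unit gain and `eps=0.0`, the output has norm $\sqrt{d}$, as in LayerNorm, but it is not constrained to the hyperplane orthogonal to $[1, 1, \ldots, 1]$.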

2.1.2 Attention Block

Attention is a key mechanism that allows Transformers to contextualize token representations at each layer. The attention block is composed of multiple attention heads. At a decoding step $i$, each attention head reads from residual streams across previous positions ($\leq i$), decides which positions to attend to, gathers information from those, and finally writes it to the current residual stream. We adopt the rearrangement proposed by Kobayashi et al. (2021) and Elhage et al. (2021a) to simplify the analysis of residual stream contributions.⁴

⁴ The original implementation considers a concatenation of each attention head output before projecting into the weight matrix $\bm{W}^{l}_{O} \in \mathbb{R}^{H \cdot d_h \times d}$. By splitting $\bm{W}^{l}_{O}$ into per-head weight matrices $\bm{W}^{l,h}_{O} \in \mathbb{R}^{d_h \times d}$, matrices $\bm{W}^{l,h}_{V}$ and $\bm{W}^{l,h}_{O}$ can be joined in a single matrix $\bm{W}^{l,h}_{OV}$.

In particular, every attention head computes

$$
\begin{aligned}
\text{Attn}^{l,h}(\bm{X}^{l-1}_{\leq i}) &= \sum_{j \leq i} a^{l,h}_{i,j} \underbrace{\bm{x}^{l-1}_{j} \bm{W}^{l,h}_{V}}_{\text{value vector}} \bm{W}^{l,h}_{O} \\
&= \sum_{j \leq i} a^{l,h}_{i,j} \bm{x}^{l-1}_{j} \bm{W}^{l,h}_{OV},
\end{aligned} \tag{2}
$$

where the learnable weight matrices $\bm{W}^{l,h}_{V} \in \mathbb{R}^{d \times d_h}$ and $\bm{W}^{l,h}_{O} \in \mathbb{R}^{d_h \times d}$ are combined into the OV matrix $\bm{W}^{l,h}_{V} \bm{W}^{l,h}_{O} = \bm{W}^{l,h}_{OV} \in \mathbb{R}^{d \times d}$, also referred to as the OV (output-value) circuit. The attention weights for every key ($\leq i$) given the current query ($i$) are obtained as:

$$
\begin{aligned}
\bm{a}^{l,h}_{i} &= \text{softmax}\left(\frac{\overbrace{\bm{x}^{l-1}_{i} \bm{W}^{l,h}_{Q}}^{\text{query vector}} \; \underbrace{(\bm{X}^{l-1}_{\leq i} \bm{W}^{l,h}_{K})^{\intercal}}_{\text{key vector}}}{\sqrt{d_k}}\right) \\
&= \text{softmax}\left(\frac{\bm{x}^{l-1}_{i} \bm{W}^{h}_{QK} {\bm{X}^{l-1}_{\leq i}}^{\intercal}}{\sqrt{d_k}}\right),
\end{aligned} \tag{3}
$$

with $\bm{W}^{l,h}_{Q} \in \mathbb{R}^{d \times d_h}$ and $\bm{W}^{l,h}_{K} \in \mathbb{R}^{d \times d_h}$ combining as the QK (query-key) circuit $\bm{W}^{h}_{Q} {\bm{W}^{h}_{K}}^{\intercal} = \bm{W}^{h}_{QK} \in \mathbb{R}^{d \times d}$. The decomposition introduced in Equations 2 and 3 enables a view of QK and OV circuits as units responsible for reading from and writing to the residual stream, respectively. The attention block output is the sum of the individual attention head outputs, which is subsequently added back into the residual stream:

$$
\text{Attn}^{l}(\bm{X}^{l-1}_{\leq i}) = \sum_{h=1}^{H} \text{Attn}^{l,h}(\bm{X}^{l-1}_{\leq i}), \tag{4}
$$

$$
\bm{x}^{\text{mid},l}_{i} = \bm{x}^{l-1}_{i} + \text{Attn}^{l}(\bm{X}^{l-1}_{\leq i}). \tag{5}
$$
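The rearrangement in Equations 2 to 5 can be verified numerically against the concatenation-based formulation. The sketch below is an illustration under stated assumptions: all weights are random, the sizes are arbitrary, layer normalization is omitted, and the scaling factor uses $d_k = d_h$. The helper `head_output` is a name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, H = 5, 16, 4               # hypothetical sequence length, width, heads
d_h = d // H                     # per-head dimension (used as d_k below)

X = rng.normal(size=(n, d))      # X^{l-1}: residual stream states as rows
W_Q, W_K, W_V = (rng.normal(size=(H, d, d_h)) / np.sqrt(d) for _ in range(3))
W_O = rng.normal(size=(H, d_h, d)) / np.sqrt(d_h)


def softmax(scores):
    e = np.exp(scores - scores.max())   # subtract max for numerical stability
    return e / e.sum()


def head_output(h, i):
    """Equations 2 and 3 for head h at (0-indexed) position i."""
    X_le_i = X[: i + 1]                  # causal prefix, positions <= i
    W_QK = W_Q[h] @ W_K[h].T             # d x d QK circuit
    W_OV = W_V[h] @ W_O[h]               # d x d OV circuit
    a = softmax(X[i] @ W_QK @ X_le_i.T / np.sqrt(d_h))
    return a @ (X_le_i @ W_OV), a


i = n - 1
attn_sum = sum(head_output(h, i)[0] for h in range(H))   # Equation 4
x_mid = X[i] + attn_sum                                   # Equation 5

# Original formulation: concatenate the per-head outputs, then apply the
# stacked output projection of shape (H * d_h) x d.
heads = []
for h in range(H):
    q = X[i] @ W_Q[h]
    K = X[: i + 1] @ W_K[h]
    V = X[: i + 1] @ W_V[h]
    heads.append(softmax(q @ K.T / np.sqrt(d_h)) @ V)
concat_out = np.concatenate(heads) @ W_O.reshape(H * d_h, d)
```

The two formulations agree because multiplying a concatenated vector by a row-stacked matrix is the same as summing the per-head products. In practice the $d \times d$ matrices `W_QK` and `W_OV` are rarely materialized for computation, since they have rank at most $d_h$; they are mainly useful as objects of analysis.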
Figure 2: Unrolled Transformer LM with expanded views of the Attention and Feedforward network blocks, including model weights (gray) and residual stream states (green). Based on figures from (Ferrando & Voita, 2024; Voita et al., 2023).

2.1.3 Feedforward Network Block

The feedforward network (FFN) in the Transformer block is composed of two learnable weight matrices⁵: $\bm{W}^{l}_{\text{in}} \in \mathbb{R}^{d \times d_{\text{ffn}}}$ and $\bm{W}^{l}_{\text{out}} \in \mathbb{R}^{d_{\text{ffn}} \times d}$. $\bm{W}^{l}_{\text{in}}$ reads from the residual stream state $\bm{x}^{\text{mid},l}_{i}$, and its result is passed through an element-wise non-linear activation function $g$, producing the neuron activations.

⁵ We omit bias terms, which are included in the original Transformer architecture, following the practice of recent models such as Llama (Touvron et al., 2023), PaLM (Chowdhery et al., 2023) and OLMo (Groeneveld et al., 2024), which also exclude biases from attention matrices.

These get transformed by $\bm{W}^{l}_{\text{out}}$ to produce the output $\text{FFN}^{l}(\bm{x}^{\text{mid},l}_{i})$, which is then added back to the residual stream:

$$
\text{FFN}^{l}(\bm{x}^{\text{mid},l}_{i}) = g(\bm{x}^{\text{mid},l}_{i} \bm{W}^{l}_{\text{in}}) \bm{W}^{l}_{\text{out}}. \tag{6}
$$

$$
\bm{x}^{l}_{i} = \bm{x}^{\text{mid},l}_{i} + \text{FFN}^{l}(\bm{x}^{\text{mid},l}_{i}). \tag{7}
$$

The computation described in Equation 6 was equated to key-value memory retrieval (Geva et al., 2021), with keys ($\bm{w}^{l}_{\text{in}}$) stored in columns of $\bm{W}^{l}_{\text{in}}$ acting as pattern detectors over the input sequence (Figure 2 right) and values ($\bm{w}^{l}_{\text{out}}$), the rows of $\bm{W}^{l}_{\text{out}}$, being upweighted by each neuron activation. We use the term “neuron” to refer to each value after an element-wise non-linearity, and use “unit” or “dimension” for individual values in any other representation. Given that the output of the FFN is a linear combination of $\bm{w}^{l}_{\text{out}}$ values, Equation 6 can be rewritten following the key-value perspective:

\begin{align}
\text{FFN}^{l}(\bm{x}^{\text{mid},l}_{i}) &= \sum_{u=1}^{d_{\text{ffn}}} g_u(\bm{x}^{\text{mid},l}_{i} \bm{w}^{l}_{\text{in}_u}) \bm{w}^{l}_{\text{out}_u} \tag{8} \\
&= \sum_{u=1}^{d_{\text{ffn}}} n^{l}_{u} \bm{w}^{l}_{\text{out}_u}, \tag{9}
\end{align}

with $\bm{n}^{l} \in \mathbb{R}^{d_{\text{ffn}}}$ being the vector of neuron activations, and $n^{l}_{u}$ the $u$-th neuron activation value.
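The equivalence between the matrix form (Equation 6) and the key-value form (Equations 8 and 9) can be checked with a minimal sketch. The weights are random, the sizes are arbitrary, and ReLU is assumed as the activation $g$ purely for illustration; actual models use a variety of activation functions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ffn = 8, 32                          # hypothetical model and FFN widths
W_in = rng.normal(size=(d, d_ffn))        # keys are the columns of W_in
W_out = rng.normal(size=(d_ffn, d))       # values are the rows of W_out
x_mid = rng.normal(size=d)                # residual state after attention


def g(v):
    return np.maximum(v, 0.0)             # ReLU, an assumed choice of g


# Equation 6: matrix form.
ffn_out = g(x_mid @ W_in) @ W_out

# Equations 8 and 9: each neuron activation n_u upweights its value vector.
neurons = g(x_mid @ W_in)                 # n^l, one activation per neuron
kv_out = np.zeros(d)
for u in range(d_ffn):
    key_u = W_in[:, u]                    # w_in_u
    value_u = W_out[u]                    # w_out_u
    kv_out += g(x_mid @ key_u) * value_u

x_out = x_mid + ffn_out                   # Equation 7: residual update
```

With ReLU, neurons whose key has a non-positive dot product with `x_mid` contribute nothing, so only the values of the active neurons are written to the residual stream.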

The elementwise nonlinearity inside FFNs creates a privileged basis (Elhage et al., 2022b), which encourages features to align with basis directions. For instance, given a linear network $f(\bm{x}) = \bm{x}\bm{W}_1\bm{W}_2$, the representations extracted from its first layer, $\bm{x}\bm{W}_1$, are rotationally invariant, since we can rotate them by an orthogonal matrix $\bm{O}$, giving $\bm{x}\bm{W}_1\bm{O}$, and invert the rotation while leaving the output of the network unchanged, $f(\bm{x}) = \bm{x}\bm{W}_1\bm{O}\bm{O}^{-1}\bm{W}_2$ (Brown et al., 2023). However, applying an elementwise nonlinear function to the output of the first layer breaks the rotational invariance of the representations, making the standard basis dimensions (neurons) more likely to be independently meaningful, and therefore better suited for interpretability analysis.
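The argument above can be made concrete with a small hand-picked example. The matrices, the input, and the 45-degree rotation below are assumptions chosen so that the arithmetic is easy to follow; ReLU stands in for the elementwise nonlinearity.

```python
import numpy as np

x = np.array([0.5, -1.0])
W1 = np.array([[2.0, 0.0],
               [0.0, 1.0]])
W2 = np.array([[1.0],
               [2.0]])

# Orthogonal matrix O: a rotation by 45 degrees, so that O^{-1} = O^T.
c = np.sqrt(0.5)
O = np.array([[c, -c],
              [c, c]])


def relu(v):
    return np.maximum(v, 0.0)


# Linear network: rotating the hidden representation and undoing the
# rotation before the second layer leaves the output unchanged.
linear_out = x @ W1 @ W2
rotated_linear_out = (x @ W1 @ O) @ (O.T @ W2)

# With an elementwise nonlinearity between the layers, the same rotation
# changes which coordinates are clipped, and hence the output.
nonlinear_out = relu(x @ W1) @ W2
rotated_nonlinear_out = relu(x @ W1 @ O) @ (O.T @ W2)
```

Here the hidden representation is `[1, -1]`. In the standard basis ReLU zeroes only the second coordinate, whereas after the rotation both coordinates are non-positive and the whole vector is clipped, so the two nonlinear outputs differ.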

2.2 Prediction Head and Transformer Decompositions

The prediction head of a Transformer consists of an unembedding matrix $\bm{W}_U \in \mathbb{R}^{d \times |\mathcal{V}|}$, sometimes accompanied by a bias. The last residual stream state is transformed by this linear map, which converts the representation into a vector of next-token logits; the softmax function then turns these logits into a probability distribution.
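A minimal sketch of the prediction head, assuming toy dimensions, random weights, and no bias term (all illustrative choices). It also shows that adding a constant to every logit leaves the softmax output unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size = 8, 5  # toy sizes (assumption)

x_last = rng.normal(size=d)              # last residual stream state (toy values)
W_U = rng.normal(size=(d, vocab_size))   # unembedding matrix, shape (d, |V|)


def softmax(z):
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


logits = x_last @ W_U                    # unnormalized next-token scores
probs = softmax(logits)                  # next-token probability distribution
probs_shifted = softmax(logits + 3.7)    # shifting all logits gives the same distribution
```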

Prediction as a sum of component outputs.

The residual stream view shows that every model component interacts with it through addition (Mickus et al., 2022). Thus, the unnormalized scores (logits) are obtained via a linear projection of the summed component outputs. Due to the properties of linear transformations, we can rearrange the traditional forward pass formulation so that each model component contributes directly to the output logits:

$$
\begin{aligned}
f(\mathbf{x}) &= \bm{x}^L_n \bm{W}_U \\
&= \Big( \sum_{l=1}^{L}\sum_{h=1}^{H} \text{Attn}^{l,h}(\bm{X}^{l-1}_{\leq n}) + \sum_{l=1}^{L} \text{FFN}^{l}(\bm{x}^{\text{mid},l}_n) + \bm{x}_n \Big) \bm{W}_U \\
&= \sum_{l=1}^{L}\sum_{h=1}^{H} \underbrace{\text{Attn}^{l,h}(\bm{X}^{l-1}_{\leq n})\, \bm{W}_U}_{\text{Attention head logits update}} + \sum_{l=1}^{L} \underbrace{\text{FFN}^{l}(\bm{x}^{\text{mid},l}_n)\, \bm{W}_U}_{\text{FFN logits update}} + \bm{x}_n \bm{W}_U.
\end{aligned}
\tag{10}
$$

This decomposition plays an important role when localizing components responsible for a prediction (Section 3) since it allows us to measure the direct contribution of every component to the logits of the predicted token (Section 3.2.1).
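The rearrangement in Equation 10 can be illustrated with a small numerical sketch. Random vectors stand in for the component outputs at the last position, so this only demonstrates the linearity of the final projection; in a real model each output depends on the preceding ones, and a final normalization layer (ignored here, as in Equation 10) may be present. All names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
L, H, d, vocab_size = 2, 3, 8, 10  # toy sizes (assumption)

x_n = rng.normal(size=d)                 # input embedding at position n
attn_out = rng.normal(size=(L, H, d))    # stand-ins for Attn^{l,h}(X^{l-1}_{<=n})
ffn_out = rng.normal(size=(L, d))        # stand-ins for FFN^l(x^{mid,l}_n)
W_U = rng.normal(size=(d, vocab_size))

# Traditional formulation: sum everything into the residual stream, then unembed.
x_final = x_n + attn_out.sum(axis=(0, 1)) + ffn_out.sum(axis=0)
logits = x_final @ W_U

# Rearranged formulation: unembed every component first, then sum the logit updates.
attn_logits = attn_out @ W_U             # shape (L, H, |V|)
ffn_logits = ffn_out @ W_U               # shape (L, |V|)
embed_logits = x_n @ W_U                 # shape (|V|,)
logits_from_parts = attn_logits.sum(axis=(0, 1)) + ffn_logits.sum(axis=0) + embed_logits
```

Each slice `attn_logits[l, h]` or `ffn_logits[l]` is the per-component logits update highlighted in Equation 10.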

Figure 3: Forward pass decomposition in a simplified Transformer LM. The direct path (red), full OV circuits (yellow) and virtual attention heads (grey) expressed in Equation 11 are highlighted.

Prediction as an ensemble of shallow networks forward passes.

Residual networks work as ensembles of shallow networks (Veit et al., 2016), where each subnetwork defines a path in the computational graph. Let us consider a two-layer attention-only Transformer, where each attention head consists of just an OV matrix: $f(\bm{x}) = \bm{x}^1 + \bm{W}_{OV}^2(\bm{x}^1)$, with $\bm{x}^1 = \bm{x} + \bm{W}_{OV}^1(\bm{x})$. Projecting this output through the unembedding matrix $\bm{W}_U$ and expanding the products, we can decompose the forward pass (Figure 3) as

$$
f(\bm{x}) = \underbrace{\bm{x}\bm{W}_U}_{\text{Direct path}} + \underbrace{\bm{x}\bm{W}_{OV}^1\bm{W}_U}_{\text{Full OV circuit}} + \underbrace{\bm{x}\bm{W}_{OV}^1\bm{W}_{OV}^2\bm{W}_U}_{\text{Virtual attention heads (V-composition)}} + \underbrace{\bm{x}\bm{W}_{OV}^2\bm{W}_U}_{\text{Full OV circuit}}. \tag{11}
$$

The first term in Equation 11, linking the input embedding to the unembedding matrix, is referred to as the direct path (first path in Figure 3). The paths traversing a single OV matrix are instead named full OV circuits (second and fourth path in Figure 3). Often, full OV circuits are written as $\bm{W}_E\bm{W}_{OV}\bm{W}_U \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$, stacking as rows the logits effect of each input embedding through the circuit. Lastly, the path involving both attention heads is referred to as virtual attention heads doing V-composition, since the sequential writing and reading of the two heads is seen as OV matrices composing together. Elhage et al. (2021a) propose measuring the amount of composition as $\lVert \bm{W}_{OV}^1\bm{W}_{OV}^2 \rVert_F \,/\, \big( \lVert \bm{W}_{OV}^1 \rVert_F \lVert \bm{W}_{OV}^2 \rVert_F \big)$. Q-composition and K-composition, i.e. compositions of $\bm{W}_Q$ and $\bm{W}_K$ with the $\bm{W}_{OV}$ output of previous layers, can also be found in full Transformer models.
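As a sanity check of Equation 11, the following sketch compares a layer-by-layer forward pass of the simplified two-layer model against the sum of its four paths, and computes the Frobenius-norm composition ratio. The random weights, the sizes, and the helper name `composition_score` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size = 6, 9  # toy sizes (assumption)

x = rng.normal(size=d)
W_OV1 = rng.normal(size=(d, d)) / np.sqrt(d)
W_OV2 = rng.normal(size=(d, d)) / np.sqrt(d)
W_U = rng.normal(size=(d, vocab_size))

# Layer-by-layer forward pass, followed by the unembedding.
x1 = x + x @ W_OV1
x2 = x1 + x1 @ W_OV2
logits = x2 @ W_U

# The four paths of Equation 11.
paths = {
    "direct path": x @ W_U,
    "full OV circuit (head 1)": x @ W_OV1 @ W_U,
    "virtual attention head": x @ W_OV1 @ W_OV2 @ W_U,
    "full OV circuit (head 2)": x @ W_OV2 @ W_U,
}
logits_from_paths = sum(paths.values())


def composition_score(A, B):
    """Ratio ||A B||_F / (||A||_F ||B||_F); hypothetical helper name."""
    return np.linalg.norm(A @ B) / (np.linalg.norm(A) * np.linalg.norm(B))


v_composition = composition_score(W_OV1, W_OV2)
```

Because the Frobenius norm is submultiplicative, the ratio always lies between 0 and 1.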

3 Behavior Localization

Understanding the inner workings of language models implies localizing which elements in the forward pass (input elements, representations, and model components) are responsible for a specific prediction (commonly referred to as local explanation in the interpretability literature; Lipton, 2018). In this section, we present two different types of methods that allow localizing model behavior: input attribution (Section 3.1) and model component attribution (Section 3.2).

3.1 Input Attribution

Input attribution methods are commonly used to localize model behavior by estimating the contribution of input elements (in the case of LMs, tokens) in defining model predictions. We refer readers to Madsen et al. (2022) for a broader overview of post-hoc input attribution methods with a focus on classification tasks in NLP.

Gradient-based input attribution.

For neural network models like LMs, gradient information is frequently used as a natural metric for attribution purposes (Simonyan et al., 2014; Li et al., 2016; Ding & Koehn, 2021). Gradient-based attribution in this context involves a first-order Taylor expansion of a Transformer at a point $\mathbf{x}$, expressed as $\nabla f(\mathbf{x}) \cdot \mathbf{x} + \bm{b}$. The resulting gradient $\nabla f_w(\mathbf{x}) = (\text{grad}\; f_w)(\mathbf{x}) \in \mathbb{R}^{n \times d}$ intuitively captures the sensitivity of the model to each element in the input when predicting token $w$ (vocabulary logits or probability scores are commonly used as differentiation targets; Bastings et al., 2022). While attribution scores are computed for every dimension of input token embeddings, they are generally aggregated at a token level to obtain a more intuitive overview of the influence of individual tokens. This is commonly done by taking the $L^p$ norm of the gradient vector w.r.t. the $i$-th input embedding:

$$
A^{\text{Grad}}_{f_w(\mathbf{x}) \leftarrow t_i} = \left\lVert \nabla_{\bm{x}_i} f_w(\mathbf{x}) \right\rVert_p. \tag{12}
$$

By taking the dot product between the gradient vector and the input embedding, $\nabla_{\bm{x}_i} f_w(\mathbf{x}) \cdot \bm{x}_i$, known as the gradient $\times$ input method (Denil et al., 2015), this sensitivity can be converted to an importance estimate. However, these approaches are known to exhibit gradient saturation and shattering issues (Shrikumar et al., 2017; Balduzzi et al., 2017). This fact prompted the introduction of methods such as integrated gradients (Sundararajan et al., 2017) and SmoothGrad (Smilkov et al., 2017) to filter noisy gradient information. For example, integrated gradients approximates the integral of gradients along the straight-line path between a baseline input $\tilde{\mathbf{x}}$ and the input $\mathbf{x}$: $(\bm{x}_i - \tilde{\bm{x}}_i) \int_0^1 \nabla_{\bm{x}_i} f_w(\tilde{\mathbf{x}} + \alpha(\mathbf{x} - \tilde{\mathbf{x}}))\, d\alpha$, and subsequent adaptations were proposed to accommodate the discreteness of textual inputs (Sanyal & Ren, 2021; Enguehard, 2023). Finally, approaches based on Layer-wise Relevance Propagation (LRP) (Bach et al., 2015) have been widely applied to study Transformer-based LMs (Voita et al., 2021; Chefer et al., 2021; Ali et al., 2022; Achtibat et al., 2024). These methods use custom rules for gradient propagation to decompose component contributions at every layer, ensuring their sum remains constant throughout the network.
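The three gradient-based scores discussed in this paragraph (gradient norm, gradient $\times$ input, and integrated gradients) can be sketched on a toy differentiable function. The scoring function `f_w` below is a hypothetical stand-in for an LM logit with a hand-derived gradient, the baseline is assumed to be all zeros, and the integral is approximated with a midpoint Riemann sum; none of these choices comes from the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_h = 4, 5, 7  # tokens, embedding size, hidden size (toy values)

W1 = rng.normal(size=(d, d_h))
u_w = rng.normal(size=d_h)  # plays the role of the unembedding column of token w


def f_w(x):
    """Toy score for token w: sum over positions of tanh(x_i W1) . u_w."""
    return float(np.tanh(x @ W1).sum(axis=0) @ u_w)


def grad_f_w(x):
    """Analytic gradient of f_w w.r.t. every input embedding, shape (n, d)."""
    h = np.tanh(x @ W1)                       # (n, d_h)
    return ((1.0 - h**2) * u_w) @ W1.T        # (n, d)


x = rng.normal(size=(n, d))
baseline = np.zeros_like(x)                   # assumed all-zeros baseline

g = grad_f_w(x)
grad_norm = np.linalg.norm(g, ord=2, axis=1)  # Equation 12 with p = 2
grad_x_input = (g * x).sum(axis=1)            # gradient x input, one score per token

# Integrated gradients: average the gradient along the straight-line path
# from the baseline to x (midpoint rule), then scale by (x - baseline).
steps = 200
alphas = (np.arange(steps) + 0.5) / steps
avg_grad = np.mean([grad_f_w(baseline + a * (x - baseline)) for a in alphas], axis=0)
integrated_grads = ((x - baseline) * avg_grad).sum(axis=1)
```

A useful check is the completeness property of integrated gradients: the token scores should sum approximately to `f_w(x) - f_w(baseline)`.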

Figure 4: Three approaches to compute inter-token contributions ($c_{i,j}$) towards context mixing in attention heads. Relying only on attention weights overlooks the magnitude of the vectors they operate on. This limitation can be addressed by accounting for the norm of the value-weighted or output-value-weighted vectors ($\bm{x}_j'$). Finally, distance-based analysis estimates the contribution of weighted vectors from their proximity to the attention output.

Perturbation-based input attribution.

Another popular family of approaches estimates input importance by adding noise or ablating input elements and measuring the resulting impact on model predictions (Li et al., 2017). For instance, the input token at position $i$ can be removed, and the resulting probability difference $f_w(\mathbf{x}) - f_w(\mathbf{x}_{-\bm{x}_i})$ can be used as an estimate of its importance. If the logit or probability given to $w$ does not change, we conclude that the $i$-th token has no influence. A multitude of perturbation-based attribution methods exist in the literature, including those based on interpretable local surrogate models, such as LIME (Ribeiro et al., 2016), and those derived from game theory, like SHAP (Shapley, 1953; Lundberg & Lee, 2017). Notably, new perturbation-based approaches were proposed to leverage linguistic structures (Amara et al., 2024; Zhao & Shan, 2024) and Transformer components (Deiseroth et al., 2023; Mohebbi et al., 2023) for attribution purposes. These methods relate directly to causal interventions discussed in Section 3.2.2. We refer readers to Covert et al. (2021) for a unified perspective on perturbation-based input attribution.
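The token-removal example above can be sketched as a leave-one-out loop. The function `next_token_probs` is a hypothetical toy stand-in for a language model (mean pooling followed by an unembedding and a softmax), chosen only because it accepts inputs of any length; it is not a real LM API.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_h, vocab_size = 5, 4, 6, 8  # toy sizes (assumption)

W1 = rng.normal(size=(d, d_h))
W_U = rng.normal(size=(d_h, vocab_size))


def next_token_probs(x):
    """Toy stand-in for an LM: mean-pool token features, unembed, softmax."""
    pooled = np.tanh(x @ W1).mean(axis=0)
    logits = pooled @ W_U
    e = np.exp(logits - logits.max())
    return e / e.sum()


def leave_one_out(x, w):
    """f_w(x) - f_w(x without token i), for every position i."""
    full = next_token_probs(x)[w]
    return np.array([
        full - next_token_probs(np.delete(x, i, axis=0))[w]
        for i in range(x.shape[0])
    ])


x = rng.normal(size=(n, d))
w = int(np.argmax(next_token_probs(x)))  # attribute the predicted token
scores = leave_one_out(x, w)             # positive: removing token i lowers p(w)
```

Each score requires one additional forward pass, which illustrates why such methods become costly for long inputs or fine-grained input elements.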

Context mixing for input attribution.

While raw model internals such as attention weights were generally considered to provide unfaithful explanations of model behavior (Jain & Wallace, 2019; Bastings & Filippova, 2020), recent methods have proposed alternatives to attention weights for measuring intermediate token-wise attributions. Some of these alternatives include the use of the norm of value-weighted vectors (Kobayashi et al., 2020) and output-value-weighted vectors (Kobayashi et al., 2021), or the use of vectors’ distances to estimate contributions (Ferrando et al., 2022b) (Figure 4 provides a visual description). A common strategy among such approaches involves aggregating intermediate per-layer attributions reflecting context mixing patterns (Brunner et al., 2020) using techniques such as attention rollout (Abnar & Zuidema, 2020), resulting in input attribution scores (Ferrando et al., 2022b; Modarressi et al., 2022; Mohebbi et al., 2023). (The attention flow method is seldom used due to its computational inefficiency, despite its theoretical guarantees; Ethayarajh & Jurafsky, 2021.) Such context mixing approaches have shown strong faithfulness compared to gradient and perturbation-based methods on classification benchmarks such as ERASER (DeYoung et al., 2020). However, rollout aggregation has recently been criticized due to its simplistic assumptions, and recent research has attempted to fully expand the linear decomposition of the model output presented in Section 2.2 (Modarressi et al., 2023; Yang et al., 2023; Oh & Schuler, 2023) as a sum of linear transformations of the input tokens, linearizing the FFN block (Kobayashi et al., 2024).
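A simplified sketch of two ideas from this paragraph: norm-based per-layer contributions and their aggregation with a rollout-style product. It assumes a single attention head per layer, random weights, no layer normalization or FFN, and an equal 0.5/0.5 mix between each layer's contribution matrix and the identity to account for the residual connection. These are illustrative simplifications rather than the exact procedures of the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_layers = 4, 6, 3  # toy sizes (assumption)


def causal_softmax(scores):
    """Row-wise softmax restricted to positions j <= i."""
    mask = np.tril(np.ones(scores.shape, dtype=bool))
    masked = np.where(mask, scores, -np.inf)
    e = np.exp(masked - masked.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def norm_based_contributions(A, X, W_V, W_O):
    """c[i, j] = || a_ij * (x_j W_V W_O) ||_2 (output-value-weighted norm)."""
    transformed = X @ W_V @ W_O
    return A * np.linalg.norm(transformed, axis=1)[None, :]


def rollout(per_layer):
    """Normalize rows, mix with the identity, and multiply across layers."""
    joint = np.eye(per_layer[0].shape[0])
    for C in per_layer:
        C = C / C.sum(axis=1, keepdims=True)
        C = 0.5 * C + 0.5 * np.eye(C.shape[0])
        joint = C @ joint
    return joint


X = rng.normal(size=(n, d))
contribs = []
for layer in range(n_layers):
    W_Q, W_K, W_V, W_O = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))
    A = causal_softmax((X @ W_Q) @ (X @ W_K).T / np.sqrt(d))
    contribs.append(norm_based_contributions(A, X, W_V, W_O))
    X = X + A @ X @ W_V @ W_O   # residual update feeding the next layer

# Row i: estimated contribution of each input token to the final state at position i.
input_attribution = rollout(contribs)
```

Unlike raw attention weights, the per-layer matrices here depend on the magnitude of the transformed vectors, so a highly attended token with a near-zero transformed vector receives a small contribution.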

Contrastive input attribution.

An important limitation of input attribution methods for interpreting language models is that attributed output tokens belong to a large vocabulary space, often having semantically equivalent tokens competing for probability mass in next-word prediction (Holtzman et al., 2021). In this context, attribution scores are likely to misrepresent several overlapping factors, such as grammatical correctness and semantic appropriateness, driving the model prediction. Recent work addresses this issue by proposing a contrastive formulation of such methods, producing counterfactual explanations for why the model predicts token $w$ instead of an alternative token $o$ (Yin & Neubig, 2022). As an example, Yin & Neubig (2022) extend the vanilla gradient method of Equation 12 to provide contrastive explanations (ContGrad):

$$
A^{\text{ContGrad}}_{f_{w \neg o}(\mathbf{x}) \leftarrow t_i} = \left\lVert \nabla_{\bm{x}_i} \big( f_w(\mathbf{x}) - f_o(\mathbf{x}) \big) \right\rVert_p. \tag{13}
$$
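A toy comparison of the gradient norm of Equation 12 and its contrastive counterpart in Equation 13. The function `logits` is a hypothetical differentiable stand-in for an LM with a hand-derived gradient, and the token indices `w` and `o` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_h, vocab_size = 4, 5, 7, 6  # toy sizes (assumption)

W1 = rng.normal(size=(d, d_h))
W_U = rng.normal(size=(d_h, vocab_size))


def logits(x):
    """Toy vocabulary logits: sum over positions of tanh(x_i W1), then unembed."""
    return np.tanh(x @ W1).sum(axis=0) @ W_U


def grad_logit(x, direction):
    """Gradient w.r.t. x of logits(x) @ direction, shape (n, d)."""
    h = np.tanh(x @ W1)
    return ((1.0 - h**2) * (W_U @ direction)) @ W1.T


x = rng.normal(size=(n, d))
w, o = 0, 1                                   # target and contrast tokens (arbitrary)
e_w, e_o = np.eye(vocab_size)[w], np.eye(vocab_size)[o]

grad_attr = np.linalg.norm(grad_logit(x, e_w), axis=1)             # Equation 12
cont_grad_attr = np.linalg.norm(grad_logit(x, e_w - e_o), axis=1)  # Equation 13
```

Since differentiation is linear, the contrastive gradient equals the difference between the gradients of the two logits, so input directions that raise both tokens equally cancel out.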

Limitations of input attribution methods.

While input attribution methods are commonly used to debug failure cases and identify biases in models’ predictions (McCoy et al., 2019), popular approaches were shown to be insensitive to variations in the model and data generating process (Adebayo et al., 2018; Sixt et al., 2020), to disagree with each other’s predictions (Atanasova et al., 2020; Crabbé & van der Schaar, 2023; Anonymous, 2024) and to show limited capacity in detecting unseen spurious correlations (Adebayo et al., 2020; 2022). Importantly, popular methods such as SHAP and Integrated Gradients were found to be provably unreliable at predicting counterfactual model behavior in realistic settings (Bilodeau et al., 2024). Apart from theoretical limitations, perturbation-based approaches also suffer from out-of-distribution predictions induced by unrealistic noised or ablated inputs, and from the high computational cost of targeted ablations for granular input elements.

Training data attribution.

Another dimension of input attribution involves the identification of influential training examples driving specific model predictions at inference time (Koh & Liang, 2017). These approaches are commonly referred to as training data attribution (TDA) or instance attribution methods and were applied to identify data artifacts (Han et al., 2020; Pezeshkpour et al., 2022) and sources of biases in language models’ predictions (Brunet et al., 2019), with recent approaches proposing to perform TDA via training run simulations (Guu et al., 2023; Liu et al., 2024). While the applicability of established TDA methods was called into question (Akyurek et al., 2022), especially due to their inefficiency, recent work in this area has produced more efficient methods that can be applied to large generative models at scale (Park et al., 2023b; Grosse et al., 2023; Kwon et al., 2024). We refer readers to Hammoudeh & Lowd (2022) for further details on TDA methods.

3.2 Model Component Importance

Early studies on the importance of Transformer LM components highlighted a high degree of sparsity in model capabilities. This means, for example, that removing even a significant fraction of the attention heads in a model may not deteriorate its downstream performance (Michel et al., 2019; Voita et al., 2019b). These results motivated a new line of research studying how various components in an LM contribute to its wide array of capabilities.

3.2.1 Logit Attribution

Figure 5: Direct Logit Attributions (DLA) on output token $w$. (a) DLA of an attention head $\text{Attn}^{l,h}$, (b) DLA of an intermediate representation $\bm{x}_1^{l-1}$ via an attention head, (c) DLA of an FFN block, and (d) DLA of a single neuron.

Let us call $f^c(\mathbf{x})$ the output representation of a model component $c$ (attention head or FFN) at a particular layer for the last token position $n$. The decomposition presented in Section 2.2 allows us to measure the direct logit attribution (DLA, Figure 5) of each model component for the output token $w \in \mathcal{V}$ (note that the softmax function is shift-invariant, and therefore the logit scores have no absolute scale):

$$
A^{\text{DLA}}_{f_w(\mathbf{x}) \leftarrow c} = f^c(\mathbf{x})\, \bm{W}_{U[:,w]}, \tag{14}
$$

where $\bm{W}_{U[:,w]}$ is the $w$-th column of $\bm{W}_U$, i.e. the unembedding vector of token $w$. In practical terms, the DLA for a component $c$ expresses the contribution of $c$ to the logit of the predicted token, using the linearity of the model’s components described in Section 2.2.

Geva et al. (2022b) exploit the fact that the FFN block update is a linear combination of the rows of $\bm{W}_{\text{out}}$ weighted by the neuron activation values (Equation 8). Thus, it is possible to measure the DLA of each neuron as:

$$
A^{\text{DLA}}_{f_w(\mathbf{x}) \leftarrow n^l_u} = n^l_u\, \bm{w}^l_{\text{out}_u} \bm{W}_{U[:,w]}. \tag{15}
$$
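A small numerical sketch of Equation 15, assuming a bias-free FFN with a ReLU activation and random weights (illustrative assumptions; the variable names are hypothetical). It checks that the per-neuron attributions add up to the DLA of the whole FFN block from Equation 14.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ffn, vocab_size = 6, 12, 10  # toy sizes (assumption)

x_mid = rng.normal(size=d)                # residual state entering the FFN
W_in = rng.normal(size=(d, d_ffn))
W_out = rng.normal(size=(d_ffn, d))
W_U = rng.normal(size=(d, vocab_size))
w = 3                                     # index of the output token (arbitrary)

neurons = np.maximum(x_mid @ W_in, 0.0)   # neuron activations n^l (ReLU assumed)
ffn_out = neurons @ W_out                 # FFN update: sum_u n_u * W_out[u]

ffn_dla = ffn_out @ W_U[:, w]                   # Equation 14 with c = FFN block
neuron_dla = neurons * (W_out @ W_U[:, w])      # Equation 15, one value per neuron
```

Inactive neurons (zero activation) receive zero attribution, and sorting `neuron_dla` ranks neurons by how much they directly promote or suppress token `w`.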

Similarly, Ferrando et al. (2023) make use of the decomposition of an attention head as a weighted sum of residual stream transformations (Section 2.1.2) and propose assessing the DLA of each path involving the attention head:

$$
A^{\text{DLA}}_{f_w(\mathbf{x}) \xleftarrow{h} \bm{x}^{l-1}_j} = a^{l,h}_{n,j}\, \bm{x}^{l-1}_j \bm{W}_{OV}^{l,h} \bm{W}_{U[:,w]}. \tag{16}
$$
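Equation 16 can be sketched in the same way. The example assumes a single bias-free attention head, random weights, and randomly drawn attention weights for the last position (all illustrative). It checks that the per-source-token attributions sum to the DLA of the head as a whole.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, vocab_size = 5, 6, 10  # toy sizes (assumption)

X_prev = rng.normal(size=(n, d))              # rows are x_j^{l-1}, j = 1..n
W_OV = rng.normal(size=(d, d)) / np.sqrt(d)
W_U = rng.normal(size=(d, vocab_size))
w = 2                                         # index of the output token (arbitrary)

# Attention weights a_{n,j} of the last position over all positions (toy values).
scores = rng.normal(size=n)
a_n = np.exp(scores - scores.max())
a_n /= a_n.sum()

head_out = a_n @ X_prev @ W_OV                # head output at position n
head_dla = head_out @ W_U[:, w]               # Equation 14 with c = attention head

path_dla = a_n * (X_prev @ W_OV @ W_U[:, w])  # Equation 16, one value per source token j
```

The vector `path_dla` indicates which earlier positions the head relies on to promote token `w`, combining how much each position is attended to with what its transformed representation writes into the logit.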

The Logit difference (LD) (Wang et al., 2023a) is the difference in logits between two tokens, fw(𝐱)fo(𝐱)subscript𝑓𝑤𝐱subscript𝑓𝑜𝐱{f_{w}(\mathbf{x})-f_{o}(\mathbf{x})}italic_f start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ( bold_x ) - italic_f start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT ( bold_x ). DLA can be extended to measure direct logit difference attribution (DLDA):

Afw¬o(𝐱)cDLDA=fc(𝐱)𝑾U[:,w]fc(𝐱)𝑾U[:,o].subscriptsuperscript𝐴DLDAsubscript𝑓𝑤𝑜𝐱𝑐superscript𝑓𝑐𝐱subscript𝑾𝑈:𝑤superscript𝑓𝑐𝐱subscript𝑾𝑈:𝑜A^{\text{DLDA}}_{f_{w\neg o}(\mathbf{x})\leftarrow c}=f^{c}(\mathbf{x}){\bm{W}% }_{U[:,w]}-f^{c}(\mathbf{x}){\bm{W}}_{U[:,o]}.italic_A start_POSTSUPERSCRIPT DLDA end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_w ¬ italic_o end_POSTSUBSCRIPT ( bold_x ) ← italic_c end_POSTSUBSCRIPT = italic_f start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ( bold_x ) bold_italic_W start_POSTSUBSCRIPT italic_U [ : , italic_w ] end_POSTSUBSCRIPT - italic_f start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ( bold_x ) bold_italic_W start_POSTSUBSCRIPT italic_U [ : , italic_o ] end_POSTSUBSCRIPT . (17)

including its neuron and head-specific variants of Equation 15 and Equation 16. Similarly to the contrastive attribution framework described in Section 3.1, a positive DLDA value suggests that c𝑐citalic_c promotes token w𝑤witalic_w more than token o𝑜oitalic_o.
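As a minimal sketch of DLA and DLDA, the snippet below attributes a logit to three component outputs written to the last-position residual stream. The component names, dimensions, and random values are illustrative assumptions, and the final layer normalization is omitted so that the logits are exactly linear in the component outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 5

# Hypothetical component outputs f^c(x) at the last position (random stand-ins).
components = {
    "attn_head": rng.normal(size=d_model),
    "ffn": rng.normal(size=d_model),
    "embedding": rng.normal(size=d_model),
}
W_U = rng.normal(size=(d_model, vocab))  # unembedding matrix

def dla(f_c, W_U, w):
    """Direct logit attribution of a component output f_c to token w."""
    return f_c @ W_U[:, w]

def dlda(f_c, W_U, w, o):
    """Direct logit difference attribution: DLA for w minus DLA for o."""
    return dla(f_c, W_U, w) - dla(f_c, W_U, o)

w, o = 2, 4
residual = sum(components.values())  # final residual stream state
logits = residual @ W_U              # assumption: no final layer norm

dla_scores = {name: dla(f_c, W_U, w) for name, f_c in components.items()}
dlda_scores = {name: dlda(f_c, W_U, w, o) for name, f_c in components.items()}
```

Under these assumptions the DLA scores sum to the logit of `w`, and the DLDA scores sum to the logit difference between `w` and `o`, which is what makes the decomposition readable component by component.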

3.2.2 Causal Interventions

We can view the computations of a Transformer-based LM as a causal model (Geiger et al., 2021; McGrath et al., 2023), and use causality tools (Pearl, 2009; Vig et al., 2020) to shed light on the contribution to the prediction of each model component $c \in \mathcal{C}$ across different positions. The causal model can be seen as a directed acyclic graph (DAG), where nodes are model computations and edges are activations. We can intervene in the model by changing some node's value $f^c(\mathbf{x})$, computed by a model component in the forward pass on target input $\mathbf{x}$, to another value $\tilde{\bm{h}}$, which is referred to as activation patching (Figure 6). [Footnote 10: Alternatively, we can patch residual stream states $f^l(\mathbf{x})$.] [Footnote 11: Activation patching is also referred to in the literature as Causal Mediation Analysis (Vig et al., 2020), Causal Tracing (Meng et al., 2022), and Interchange Interventions (Geiger et al., 2020; 2021).] We can express this intervention using the do-operator (Pearl, 2009) as $f(\mathbf{x} \mid \text{do}(f^c(\mathbf{x}) = \tilde{\bm{h}}))$. We then measure how much the prediction changes after patching:

$$A^{\text{Patch}}_{f(\mathbf{x}) \leftarrow c} = \text{diff}\big(f(\mathbf{x}),\, f(\mathbf{x} \mid \text{do}(f^c(\mathbf{x}) = \tilde{\bm{h}}))\big). \tag{18}$$

Popular choices for the $\text{diff}(\cdot, \cdot)$ function include KL divergence and logit/probability difference (Zhang & Nanda, 2024). The patched activation $\tilde{\bm{h}}$ can originate from various sources. A common approach is to create a counterfactual dataset with distribution $P_{\text{patch}}$, where some input signals regarding the task are inverted. This approach leads to two distinct types of intervention:

  • Resample intervention, where the patched activation is obtained from a single example of $P_{\text{patch}}$, i.e. $\tilde{\bm{h}} = f^c(\tilde{\mathbf{x}}),\ \tilde{\mathbf{x}} \sim P_{\text{patch}}$ (Heimersheim & Janiak, 2023; Hanna et al., 2023; Conmy et al., 2023). [Footnote 12: Commonly named ablation in the literature. We use the more neutral term intervention here since activations are not actually ablated, but rather replaced.]

  • Mean intervention, where the average of activations of multiple $P_{\text{patch}}$ examples is used for patching, i.e. $\tilde{\bm{h}} = \mathbb{E}_{\tilde{\mathbf{x}} \sim P_{\text{patch}}}[f^c(\tilde{\mathbf{x}})]$ (Wang et al., 2023a).

Figure 6: Activation (resample) patching. The FFN output activation from the forward pass with a source input $\tilde{\mathbf{x}}$ (left) is placed in the forward pass with target input $\mathbf{x}$ (right), making the prediction flip from “Italy” to “France”.

Alternatively, other sources of patching activations include:

  • Zero intervention, where the activation is substituted by a null vector, i.e. $\tilde{\bm{h}} = \mathbf{0}$ (Olsson et al., 2022; Mohebbi et al., 2023).

  • Noise intervention, where the new activation is obtained by running the model on a perturbed input, e.g. $\tilde{\bm{h}} = f^c(\mathbf{x} + \epsilon),\ \epsilon \sim \mathcal{N}(0, \sigma^2)$ (Meng et al., 2022).

An important factor to consider when designing causal intervention experiments is the ecological validity of the setup, since zero and noise ablation could lead the model away from the natural distribution of activations and ultimately undermine the validity of the component analysis (Chan et al., 2022; Zhang & Nanda, 2024).
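The four sources of patched activations can be compared side by side on a toy model. The sketch below is an illustration under strong assumptions: the "model" is a single nonlinear component added to a residual input and then unembedded, the samples standing in for $P_{\text{patch}}$ are random vectors, and all names (`component`, `forward`, `patch_inputs`) are hypothetical. KL divergence plays the role of $\text{diff}(\cdot, \cdot)$ in Equation 18.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 6, 4
W_c = rng.normal(size=(d, d))
W_U = rng.normal(size=(d, vocab))

def component(x):
    """Toy model component f^c."""
    return np.tanh(x @ W_c)

def forward(x, patch=None):
    """Toy model: residual input plus one component output, then unembedding.
    If `patch` is given, it replaces the component activation (the do-operation)."""
    h = component(x) if patch is None else patch
    return (x + h) @ W_U

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl_div(p_logits, q_logits):
    """KL(p || q) between the output distributions of two runs."""
    p, q = softmax(p_logits), softmax(q_logits)
    return float(np.sum(p * (np.log(p) - np.log(q))))

x = rng.normal(size=d)                   # target input
patch_inputs = rng.normal(size=(32, d))  # random stand-ins for samples of P_patch

patches = {
    "resample": component(patch_inputs[0]),
    "mean": component(patch_inputs).mean(axis=0),
    "zero": np.zeros(d),
    "noise": component(x + rng.normal(scale=0.5, size=d)),
}

clean_logits = forward(x)
effects = {name: kl_div(clean_logits, forward(x, patch=h))
           for name, h in patches.items()}
```

Each entry of `effects` is the patching attribution of the component under one choice of $\tilde{\bm{h}}$. Patching the component with its own clean activation gives an effect of zero, which is a useful sanity check for any patching implementation.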

Following the distinction of Kramár et al. (2024), we note that the activation patching methods presented above adopt a noising setup, since the patching is performed during the forward pass with the clean/target input, i.e. $f(\mathbf{x} \mid \text{do}(f^c(\mathbf{x}) = \tilde{\bm{h}}))$ (Wang et al., 2023a; Hanna et al., 2023). Alternatively, the same interventions can be performed in a denoising setup, where the patch $\tilde{\bm{h}}$ is taken from the clean/target run and applied over the patched run on the source/corrupted input, i.e. $f(\tilde{\mathbf{x}} \mid \text{do}(f^c(\tilde{\mathbf{x}}) = \tilde{\bm{h}}))$ (Meng et al., 2022; Lieberum et al., 2023). We refer readers to Heimersheim & Nanda (2024) for a comprehensive overview of metrics, good practices and pitfalls of activation patching.
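The difference between the two setups is only in which run receives the patch. The following sketch, again on a hypothetical toy model with a nonlinearity after the patched component, computes a logit-difference effect in both directions. The inputs, token indices, and function names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, vocab = 6, 4
W_c = rng.normal(size=(d, d))
W_U = rng.normal(size=(d, vocab))

def component(x):
    return np.tanh(x @ W_c)

def forward(x, patch=None):
    """Toy model with a nonlinearity downstream of the patched component."""
    h = component(x) if patch is None else patch
    return np.tanh(x + h) @ W_U

def logit_diff(logits, w, o):
    return float(logits[w] - logits[o])

x_clean = rng.normal(size=d)    # clean/target input
x_corrupt = rng.normal(size=d)  # source/corrupted input
w, o = 0, 1                     # hypothetical correct and incorrect token ids

# Noising: run on the clean input, patch in the activation from the corrupted run.
noising = (logit_diff(forward(x_clean), w, o)
           - logit_diff(forward(x_clean, patch=component(x_corrupt)), w, o))

# Denoising: run on the corrupted input, patch in the activation from the clean run.
denoising = (logit_diff(forward(x_corrupt, patch=component(x_clean)), w, o)
             - logit_diff(forward(x_corrupt), w, o))
```

Noising asks how much of the clean behavior is lost when the component is corrupted, while denoising asks how much is restored when the component alone is cleaned. The two values generally differ once components interact nonlinearly.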

Other forms of causal interventions use differentiable binary masking on subsets of units or neurons of intermediate representations (De Cao et al., 2020; Csordás et al., 2021; De Cao et al., 2022), or on entire attention head outputs (Voita et al., 2019b; Michel et al., 2019), which can be cast as a form of zero ablation.

Subspace Activation Patching.

It is hypothesized that models encode features as linear subspaces of the representation space (Section 4.2). Geiger et al. (2023b) proposed distributed interchange interventions (DII), which aim to intervene only on these subspaces. [Footnote 13: Subspace causal interventions were also used as causal probes by Guerner et al. (2023).] DII allows for a fine-grained intervention, rather than relying on patching full representations. Formally, assuming a model component $c$ takes values in $\mathbb{R}^d$, we seek a linear subspace $U \subset \mathbb{R}^d$ such that replacing the orthogonal projection of $f^c(\mathbf{x})$ on $U$ with that of $f^c(\tilde{\mathbf{x}})$ substitutes the feature of interest present in $f^c(\mathbf{x})$ with that in $f^c(\tilde{\mathbf{x}})$. Following the do-operation notation for the intervention process, $f(\mathbf{x} \mid \text{do}(f^c(\mathbf{x}) = \tilde{\bm{h}}))$, the patched activation is computed as follows:

$$\tilde{\bm{h}} = \underbrace{f^c(\mathbf{x}) - f^c(\mathbf{x}) \bm{U}^{\intercal} \bm{U}}_{\text{proj}_{U^{\perp}} f^c(\mathbf{x})} + f^c(\tilde{\mathbf{x}}) \bm{U}^{\intercal} \bm{U} \tag{19}$$

where $\bm{U} \in \mathbb{R}^{n \times d}$ is a matrix with orthonormal rows that form a basis for $U$. If the feature is encoded as a direction (Figure 7 left), i.e. in a 1-dimensional subspace, then the patched activation becomes

$$\tilde{\bm{h}} = \underbrace{f^c(\mathbf{x}) - f^c(\mathbf{x}) \bm{u}^{\intercal} \bm{u}}_{\text{proj}_{\bm{u}^{\perp}} f^c(\mathbf{x})} + f^c(\tilde{\mathbf{x}}) \bm{u}^{\intercal} \bm{u}. \tag{20}$$

In Equation 20, $\tilde{\bm{h}}$ is the patched activation, $f^c(\mathbf{x}) \bm{u}^{\intercal} \bm{u}$ is the target projection, the underbraced term is the target activation with its projection subtracted, and $f^c(\tilde{\mathbf{x}}) \bm{u}^{\intercal} \bm{u}$ is the source projection.
Figure 7: Left: Distributed Interchange Interventions (subspace activation patching) on a 1-dimensional subspace (direction) $\bm{u}$. Right: Single-layer Transformer. Path patching replaces the edges of different paths connecting two nodes (sender and receiver) representing model components. For instance, we can measure the direct effect of the attention head on the output $f(\mathbf{x})$ or the indirect effect of the attention head on the output $f(\mathbf{x})$ via the FFN.
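Equation 19 can be sketched in a few lines. In the snippet below the subspace basis is random rather than learned (in DAS it would be found by gradient descent), the activations are random vectors, and the function name `subspace_patch` is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 2

# Rows of U: an orthonormal basis of an n-dimensional subspace of R^d.
# Assumption: a random subspace; a real analysis would learn or estimate it.
Q, _ = np.linalg.qr(rng.normal(size=(d, n)))  # Q has orthonormal columns, shape (d, n)
U = Q.T                                       # shape (n, d), orthonormal rows

def subspace_patch(h_target, h_source, U):
    """Equation 19: keep the target's part orthogonal to the subspace,
    and take the source's part inside the subspace."""
    proj = U.T @ U  # (d, d) orthogonal projector onto the subspace
    return h_target - h_target @ proj + h_source @ proj

h_target = rng.normal(size=d)  # stand-in for f^c(x)
h_source = rng.normal(size=d)  # stand-in for f^c(x~)
h_patched = subspace_patch(h_target, h_source, U)
```

After patching, the coordinates of `h_patched` inside the subspace equal those of the source activation, while everything orthogonal to the subspace is left as in the target run. The 1-dimensional case of Equation 20 corresponds to `n = 1`.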

3.2.3 Circuits Analysis

The Mechanistic Interpretability (MI) subfield focuses on reverse-engineering neural networks into human-understandable algorithms (Olah, 2022). Recent studies in MI aim to uncover the existence of circuits, which are a subset of model components (subgraphs) interacting together to solve a task (Cammarata et al., 2020). Activation patching, logit attribution, and attention pattern analysis are common techniques for circuit discovery (Wang et al., 2023a; Stolfo et al., 2023a; b; Heimersheim & Janiak, 2023; Geva et al., 2023; Hanna et al., 2023).

Edge and path patching.

Activation patching propagates the effect of the intervention throughout the network by recomputing the activations of components after the patched location (Figure 6). The changes in the model output (Equation 18) allow estimating the total effect of the model component on the prediction. However, circuit discovery also requires identifying important interactions between components. For this purpose, edge patching exploits the fact that every model component input is the sum of the output of previous components in its residual stream (Section 2.2), and considers edges directly connecting pairs of model components’ nodes (Figure 7 right). Path patching generalizes the edge patching approach to multiple edges (Wang et al., 2023a; Goldowsky-Dill et al., 2023), allowing for a more fine-grained analysis. For example, using the forward pass decomposition into shallow networks described in Equation 11, we could visualize the single-layer Transformer of Figure 7 (right) as being composed as

$$f(\mathbf{x}) = \underbrace{\text{Attn}(\bm{X}_{\leq n})}_{\text{Attn direct path to logits}} \bm{W}_u + \text{FFN}\big(\underbrace{\text{Attn}(\bm{X}_{\leq n})}_{\text{Attn indirect path to logits via FFN}} + \bm{x}_n\big) \bm{W}_u + \bm{x}_n \bm{W}_u, \tag{21}$$

where each copy of the sender node $\text{Attn}^L(\bm{X}^{L-1}_{\leq n})$ is relative to a single path. In this example, patching separately each of the sender node copies (Goldowsky-Dill et al., 2023) allows us to estimate direct and indirect effects (Pearl, 2001; Vig et al., 2020) of $\text{Attn}^L(\bm{X}^{L-1}_{\leq n})$ on the output logits $f(\mathbf{x})$. In general, we can apply path patching to any path in the network and measure composition between heads, FFNs, or the effects of these components on the logits.
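The single-layer decomposition above can be turned into a small path patching sketch. The attention block here is replaced by a linear map on a single position, which is an assumption made only to keep the example short, and the keyword arguments `attn_direct` and `attn_to_ffn` are hypothetical names for the two copies of the sender node.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 6, 4
W_attn = rng.normal(size=(d, d)) / np.sqrt(d)
W_ffn = rng.normal(size=(d, d)) / np.sqrt(d)
W_u = rng.normal(size=(d, vocab))

def attn(x):
    # Stand-in for Attn(X_{<=n}): a single-position linear map, not real attention.
    return x @ W_attn

def ffn(h):
    return np.maximum(h, 0.0) @ W_ffn

def forward(x, attn_direct=None, attn_to_ffn=None):
    """Single-layer model with one copy of the attention output per path."""
    a = attn(x)
    a_direct = a if attn_direct is None else attn_direct  # edge Attn -> logits
    a_ffn = a if attn_to_ffn is None else attn_to_ffn     # edge Attn -> FFN
    return a_direct @ W_u + ffn(a_ffn + x) @ W_u + x @ W_u

x, x_src = rng.normal(size=d), rng.normal(size=d)
a_src = attn(x_src)  # sender activation from the source run

clean = forward(x)
direct_effect = clean - forward(x, attn_direct=a_src)
indirect_effect = clean - forward(x, attn_to_ffn=a_src)
total_effect = clean - forward(x, attn_direct=a_src, attn_to_ffn=a_src)
```

In this one-layer toy the two paths meet only at the logits, so the direct and indirect effects add up to the total effect obtained by patching both copies at once. In deeper models with more interacting paths this additivity should not be expected.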

Limitations of circuit analysis with causal interventions.

Circuit analysis based on causal intervention methods presents several shortcomings:

  1. It demands significant effort to design the input templates for the task to evaluate, along with the counterfactual dataset, i.e. defining $P_{\text{patch}}$.

  2. Isolating important subgraphs after obtaining component importance estimates requires human inspection and domain knowledge.

  3. It has been shown that interventions can produce second-order effects in the behavior of downstream components (Makelov et al., 2024, see Wu et al., 2024d for discussion), in some settings even eliciting compensatory behavior akin to self-repair (McGrath et al., 2023; Rushing & Nanda, 2024). This phenomenon can make it difficult to draw conclusions about the role of each component.

Overcoming the limitations.

Conmy et al. (2023) propose an Automatic Circuit Discovery (ACDC) algorithm to automate the process of circuit identification (Limitation 2) by iteratively removing edges from the computational graph. However, this process requires a large number of forward passes (one per patched element), which becomes impractical when studying large models (Lieberum et al., 2023). A valid alternative to patching involves gradient-based methods, which have been extended beyond input attribution to compute the importance of intermediate model components (Leino et al., 2018; Shrikumar et al., 2018; Dhamdhere et al., 2019). For instance, given the token prediction $w$, to calculate the attribution of an intermediate layer $l$, denoted as $f^l(\mathbf{x})$, the gradient $\nabla f_w(f^l(\mathbf{x}))$ is computed. Sarti et al. (2023) extend the contrastive gradient attribution formulation of Equation 13 to locate components contributing to the prediction of the correct continuation over the wrong one using a single forward and backward pass. Nanda (2023) and Syed et al. (2023) propose Edge Attribution Patching (EAP), consisting of a linear approximation of the pre- and post-patching prediction difference (Equation 18) to estimate the importance of each edge in the computational graph. The key advantage of this method is that it requires only two forward passes and one backward pass to obtain attribution scores for every edge in the graph. Hanna et al. (2024) propose combining EAP with Integrated Gradients (EAP-IG) and show improved faithfulness of the extracted circuits, a method also used by Marks et al. (2024) to identify sparse feature circuits. Further work on Attribution Patching by Kramár et al. (2024) finds two settings leading to false negatives in the linear approximation of activation patching, and proposes AtP*, a more robust method that preserves good computational efficiency. Recently, Ferrando & Voita (2024) propose finding relevant subnetworks, which they name information flow routes, using a patch-free context mixing approach that requires only a single forward pass, avoiding the dependence on counterfactual examples and the risk of self-repair interference during the analysis.
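The linear approximation behind attribution patching can be illustrated on a single activation. The sketch below is not the EAP algorithm itself, which scores edges of a full computational graph. It only shows the first-order estimate $(\tilde{\bm{h}} - \bm{h}) \cdot \nabla_{\bm{h}}\, m(\bm{h})$ of a patching effect for a hypothetical scalar metric $m$ whose gradient is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 5
W1 = rng.normal(size=(d, k))
w2 = rng.normal(size=k)

def metric(h):
    """Toy scalar metric (e.g. a logit difference) as a function of one activation h."""
    return float(np.tanh(h @ W1) @ w2)

def metric_grad(h):
    """Analytic gradient of `metric` w.r.t. h (what a backward pass would provide)."""
    return W1 @ (w2 * (1.0 - np.tanh(h @ W1) ** 2))

h_clean = rng.normal(size=d)
# Assumption: the patched activation is close to the clean one, so the
# linearization is accurate; large patches can make the estimate unreliable.
h_patch = h_clean + 1e-3 * rng.normal(size=d)

# Exact effect: needs a separate forward pass for every patched element.
exact_effect = metric(h_patch) - metric(h_clean)
# First-order estimate: reuses one gradient for every candidate patch.
approx_effect = float((h_patch - h_clean) @ metric_grad(h_clean))
```

The estimate needs the clean activations, the patched activations, and one gradient, which is where the cost of two forward passes and one backward pass comes from. The approximation degrades when the metric is strongly nonlinear between the clean and patched activations.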

Causal Abstraction.

Another line of research deals with finding interpretable high-level causal abstractions in lower-level neural networks (Geiger et al., 2021; 2022; 2023a). These methods involve a computationally expensive search and assume high-level variables align with groups of units or neurons. To overcome these limitations, Geiger et al. (2023b) propose distributed alignment search (DAS), which performs distributed interchange interventions (DII, Section 3.2.2) on non-basis-aligned subspaces of the low-level representation space found via gradient descent. [Footnote 14: Alternatively, Lepori et al. (2023) propose employing circuit discovery approaches for this purpose.] DAS interventions have been shown to be effective in finding features with causal influence in targeted syntactic evaluation (Arora et al., 2024), and in isolating the causal effect of individual attributes of entities (Huang et al., 2024a). Recently, learned edits on subspaces of intermediate representations during the forward pass have been proposed as an efficient and effective alternative to weight-based parameter-efficient fine-tuning (PEFT) approaches (Wu et al., 2024b). A DAS variant named Boundless DAS has been used to search for interpretable causal structure in large language models (Wu et al., 2023b). In this context, Causal Proxy Models (CPMs) were proposed as interpretable proxies trained to mimic the predictions of lower-level models and simulate their counterfactual behavior after targeted interventions (Wu et al., 2023a).

4 Information Decoding

Fully understanding a model prediction entails localizing the relevant parts of the model, but also comprehending what information is being extracted and processed by each of these components. For example, if the grammatical gender of nouns is assumed to be relevant for the task of coreference resolution in a given language, information decoding methods could look at whether and how a model performing this task encodes noun gender. A natural way to approach decoding the information in the network is in terms of the features that are represented in it. While there is no universally agreed-upon definition of a feature, it is typically described as a human-interpretable property of the input, which can also be referred to as a concept (Kim et al., 2018). [Footnote 15: Although we have evidence that models learn human-interpretable features even in instances that exceed human performance (McGrath et al., 2022), Olah (2022) argues that the definition of feature should include properties that are not human-interpretable.]

Figure 8: A binary probe trained to predict the input sentiment with positive ($\mathcal{P}$) and negative ($\mathcal{N}$) sentences. Binary linear classifier probes work within a 1-dimensional subspace (direction $\bm{u}$) in the representation space.

4.1 Probing

Probes, introduced concurrently in NLP by Köhn (2015) and Gupta et al. (2015) and in computer vision by Alain & Bengio (2016), serve as tools to analyze the internal representations of neural networks. Generally, they take the form of supervised models trained to predict input properties from the representations, aiming to assess how much information about the property is encoded in them. Formally, the probing classifier $p: f^l(\mathbf{x}) \mapsto z$ maps intermediate representations to some input features (labels) $z$, which can be, for instance, a part-of-speech tag (Belinkov et al., 2017), or semantic and syntactic information (Peters et al., 2018). For example, for a binary probe seeking to decode the amount of input sentiment information within an intermediate representation (Figure 8), we build two sets, $\{f^l(\mathbf{x}) : \mathbf{x} \in \mathcal{P}\}$ and $\{f^l(\mathbf{x}) : \mathbf{x} \in \mathcal{N}\}$, with the representations obtained when providing positive and negative sentiment sentences, respectively. After training the classifier, we evaluate its accuracy on a held-out set.
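The binary probing setup can be sketched with a logistic regression probe trained by plain gradient descent. The "representations" below are synthetic: the two classes are Gaussian clouds shifted along a hidden direction, which is an assumption standing in for real $f^l(\mathbf{x})$ activations of positive and negative sentences. The helper `train_logreg` and all hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_per_class = 16, 200

# Synthetic stand-ins for f^l(x): classes are shifted along a hidden direction.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
pos = rng.normal(size=(n_per_class, d)) + 2.0 * true_dir  # "positive" set P
neg = rng.normal(size=(n_per_class, d)) - 2.0 * true_dir  # "negative" set N
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])

# Shuffle and hold out 20% of the examples for evaluation.
perm = rng.permutation(len(y))
X, y = X[perm], y[perm]
split = int(0.8 * len(y))
X_train, y_train, X_test, y_test = X[:split], y[:split], X[split:], y[split:]

def train_logreg(X, y, lr=0.1, steps=500):
    """Linear probe: logistic regression fitted with batch gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

w, b = train_logreg(X_train, y_train)
accuracy = np.mean(((X_test @ w + b) > 0) == y_test)  # held-out accuracy
probe_direction = w / np.linalg.norm(w)               # direction u of Figure 8
```

The normalized weight vector is the probe direction, i.e. the 1-dimensional subspace in which the linear probe operates. High held-out accuracy here only reflects how the synthetic data was built, and says nothing about any real model.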

Although performance on the probing task is interpreted as evidence for the amount of information encoded in the representations, there exists a tension between the ability of the probe to evaluate the information encoded and the probe learning the task itself (Belinkov, 2022). Several works propose using baselines to contextualize the performance of a probe. Hewitt & Liang (2019) use control tasks by randomizing the probing dataset, while Pimentel et al. (2020) propose measuring the information gain after applying control functions on the internal representations. Voita & Titov (2020) suggest evaluating the quality of the probe together with the “amount of effort” required to achieve the quality. This is done by measuring the minimum description length of the code required to transmit labels $z$ given representations $f^l(\mathbf{x})$. We refer the reader to Belinkov & Glass (2019) and Belinkov (2022) for broader coverage of probing methods.

Probing techniques have been widely applied to analyze Transformers in NLP. Although probes are still being used to study decoder-only models (CH-Wang et al., 2023; Zou et al., 2023; Burns et al., 2023; MacDiarmid et al., 2024), a significant portion of the research in this area has focused on BERT (Devlin et al., 2019) and its variants, leading to several BERTology analyses (Rogers et al., 2021). Probing has provided evidence of the existence of syntactic information within BERT representations (Tenney et al., 2019b; Lin et al., 2019; Liu et al., 2019), from which even full parse trees can be recovered with good precision (Hewitt & Manning, 2019). Additionally, some studies have analyzed where syntactic information is stored across the residual stream, suggesting a hierarchical encoding of language information, with part-of-speech, constituents, and dependencies being represented earlier in the network than semantic roles and coreferents, matching traditional handcrafted NLP pipelines (Tenney et al., 2019a). Rogers et al. (2021) summarize results on BERT in detail. Importantly, highly accurate probes indicate a correlation between input representations and labels, but do not provide evidence that the model is using the encoded information for its predictions (Hupkes et al., 2018; Belinkov & Glass, 2019; Elazar et al., 2021).

4.2 Linear Representation Hypothesis and Sparse Autoencoders

Linear Representation Hypothesis.

The linear representation hypothesis states that features are encoded as linear subspaces of the representation space (see Park et al. (2023a) for a formal discussion). Mikolov et al. (2013) were the first to show that Word2Vec word embeddings capture linear syntactic/semantic word relationships. For example, adding the difference between the word representations of “Madrid” and “Spain”, $f(\text{``Madrid''}) - f(\text{``Spain''})$, to the “France” representation, $f(\text{``France''})$, would result in a vector close to $f(\text{``Paris''})$. This presumes that the vector $f(\text{``Madrid''}) - f(\text{``Spain''})$ can be considered as the direction of the abstract capital_of feature. Instances of interpretable neurons (Radford et al., 2017; Voita et al., 2023; Bau et al., 2020), i.e. neurons that fire consistently for specific input features (either monosemantic or polysemantic), also exemplify features represented as directions in the neuron space. Recent work suggests the linearity of concepts in representation space is largely driven by the next-word-prediction training objective and inductive biases in gradient descent optimization (Jiang et al., 2024).
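The analogy arithmetic can be illustrated with a toy sketch. The embeddings below are synthetic, not Word2Vec vectors: we assume, purely for illustration, that each capital equals its country plus one shared offset (`capital_of`) and a little noise. The helper `nearest` is likewise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Assumption: a single shared offset encodes the country -> capital relation.
capital_of = rng.normal(size=d)
emb = {"Spain": rng.normal(size=d), "France": rng.normal(size=d)}
emb["Madrid"] = emb["Spain"] + capital_of + 0.05 * rng.normal(size=d)
emb["Paris"] = emb["France"] + capital_of + 0.05 * rng.normal(size=d)


def nearest(query, exclude=()):
    """Return the word whose embedding has the highest cosine similarity to query."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    candidates = {w: v for w, v in emb.items() if w not in exclude}
    return max(candidates, key=lambda w: cos(query, candidates[w]))


# Estimate the relation direction from one (country, capital) pair ...
direction = emb["Madrid"] - emb["Spain"]
# ... and apply it to another country, excluding the query word itself.
prediction = nearest(emb["France"] + direction, exclude=("France",))
```

In this construction the offset transfers across countries by design; whether a real embedding space behaves this way is exactly what the linear representation hypothesis asserts.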

Erasing Features with Linear Interventions

Feature directions can be found in LMs using linear classifiers (linear probes, Section 4.1). These models learn a hyperplane that separates representations associated with a particular feature from the rest. The normal vector to that hyperplane, the probe direction $\bm{u} \in \mathbb{R}^{d}$, can be considered the direction representing the underlying feature (Figure 8). For instance, the sensitivity of model predictions to a feature can be computed as the directional derivative of the model in the direction $\bm{u}$, $\nabla f(f^{l}(\mathbf{x})) \cdot \bm{u}$, treating the model as a function of the intermediate activation (Kim et al., 2018). This linear feature representation was exploited by Ravfogel et al. (2020; 2022); Belrose et al. (2023b) to erase concepts, preventing linear classifiers from detecting them in the representation space. Linear concept erasure was shown to mitigate bias (Ravfogel et al., 2020) and to induce a large increase in perplexity after removing part-of-speech information (Belrose et al., 2023b). In the presence of class labels, linear erasure models can be adapted to ensure the removal of all linear information regarding class identity (Singh et al., 2024c; Belrose, 2023). Finally, Elazar et al. (2021) exploit linear erasure to address the correlational nature of probing classifiers, validating the influence of probed properties on model predictions.
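A minimal sketch of the idea, under stated assumptions: the representations are synthetic, the probe is a least-squares linear classifier, and erasure is a single orthogonal projection along the probe direction. This is a simplification and not the procedure of any of the cited erasure methods; in particular, a probe retrained on the projected data may still recover residual information, which is one motivation for more elaborate approaches.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 8

# Synthetic stand-in for layer representations: a binary feature z shifts
# points along the first coordinate (an assumption made for illustration).
z = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
X[:, 0] += 3.0 * (2 * z - 1)

# Least-squares linear probe on targets in {-1, +1}, with a bias column.
A = np.hstack([X, np.ones((n, 1))])
w_full, *_ = np.linalg.lstsq(A, 2.0 * z - 1.0, rcond=None)
w, b = w_full[:-1], w_full[-1]

# Probe direction: unit normal of the separating hyperplane.
u = w / np.linalg.norm(w)

# Erase the feature by removing each representation's component along u.
X_erased = X - np.outer(X @ u, u)


def accuracy(X_eval):
    """Accuracy of the fixed probe (w, b) on the given representations."""
    return float(np.mean(((X_eval @ w + b) > 0) == (z == 1)))


acc_before = accuracy(X)
acc_after = accuracy(X_erased)  # the probe now sees only its bias term
```

After projection, the original probe outputs a constant score, so its accuracy falls to roughly the frequency of one class.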

Steering Generation with Linear Interventions

As mentioned in Section 4.1, a fundamental problem of probing lies in its correlational, rather than causal, nature. Recent work (Nanda et al., 2023b; Zou et al., 2023) shows the effectiveness of linear interventions on language models using directions identified by a probe. For instance, adding negative multiples of the sentiment direction ($\bm{u}$) to the residual stream, i.e. $\bm{x}^{l'} \leftarrow \bm{x}^{l} - \alpha\bm{u}$, is sufficient to generate a text matching the opposite sentiment label (Tigges et al., 2023). This simple procedure is named activation addition (Turner et al., 2023). Other unsupervised methods for computing feature directions include Principal Component Analysis (Tigges et al., 2023), K-Means (Zou et al., 2023), and difference-in-means (Marks & Tegmark, 2023). For instance, Arditi et al. (2024) use the difference-in-means vector between residual streams on harmful and harmless instructions to find a “refusal direction” in LMs with safety fine-tuning (Bai et al., 2022). Projecting out this direction from every model component output, i.e. $f^{c'} \leftarrow f^{c} - f^{c}\bm{u}^{\intercal}\bm{u}$, leads the model to bypass refusal. Recent studies identify distributed alignment search (Section 3.2.3) as the best performing method for causal intervention across mathematical reasoning and linguistic plausibility benchmarks (Tigges et al., 2023; Arora et al., 2024; Huang et al., 2024a), and leverage it for efficient inference-time interventions aimed at improving task-specific model performance (Wu et al., 2024b). Finally, the MiMic framework (Singh et al., 2024c) was recently proposed to craft optimal steering vectors, exploiting insights from linear erasure methods and class labels from the data distribution. We note that the effectiveness of steering approaches involving linear interventions was recently observed to extend to non-Transformer LMs (Paulo et al., 2024).
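The two interventions described in this paragraph, activation addition and projecting out a direction, reduce to a few lines of linear algebra. The sketch below uses synthetic activations in place of real residual streams; the function names `difference_in_means`, `steer`, and `ablate` are illustrative choices rather than names from any cited work, and the direction is normalized to unit length so that the projection formula applies as written.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Hypothetical residual-stream activations collected on two contrasting
# prompt sets (e.g. positive vs. negative sentiment), shape (n, d).
feature = np.zeros(d)
feature[2] = 1.0
acts_pos = rng.normal(size=(100, d)) + 2.0 * feature
acts_neg = rng.normal(size=(100, d)) - 2.0 * feature


def difference_in_means(a, b):
    """Unit-norm difference between the mean activations of the two sets."""
    diff = a.mean(axis=0) - b.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(x, u, alpha):
    """Activation addition with a negative multiple: x' = x - alpha * u."""
    return x - alpha * u


def ablate(f_c, u):
    """Project out u from a batch of outputs (n, d): f' = f - (f u^T) u."""
    return f_c - np.outer(f_c @ u, u)


u = difference_in_means(acts_pos, acts_neg)
x = acts_pos[:5]
x_steered = steer(x, u, alpha=4.0)
x_ablated = ablate(x, u)
```

Steering shifts the component along `u` by a fixed amount, whereas ablation sets that component to zero regardless of its original value.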

Polysemanticity and Superposition.

A representation produced by a model layer is a vector that lies in a $d$-dimensional space. Neurons are the special subset of representation units right after an element-wise non-linearity (Section 2.1.3). Although previous work has identified neurons in models corresponding to interpretable features, in most cases they respond to apparently unrelated inputs, i.e. they are polysemantic. Two main reasons can explain polysemanticity. Firstly, features can be represented as linear combinations of the standard basis vectors of the neuron space (Figure 9 left (a)), not corresponding to the basis elements themselves. Therefore, each feature is represented across many individual neurons, which is known as distributed representations (Smolensky, 1986; Olah, 2023). Secondly, given the extensive capabilities and long-tail knowledge demonstrated by large language models, it has been hypothesized that models could encode more features than they have dimensions, a phenomenon called superposition (Figure 9 left (c)) (Arora et al., 2018; Olah et al., 2020b). Elhage et al. (2022b) showed on toy models trained on synthetic datasets that superposition happens when forcing sparsity on features, i.e. making them less frequent on the training data. Recently, Gurnee et al. (2023) have provided evidence of superposition in the early layers of a Transformer language model, using sparse linear probes.
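To make superposition concrete, the following toy sketch places three feature directions in a 2-dimensional space, 120 degrees apart. The geometry and the ReLU readout are illustrative assumptions in the spirit of toy-model studies, not a reproduction of any cited experiment.

```python
import numpy as np

# Three unit-norm feature directions in 2 dimensions, 120 degrees apart.
angles = np.deg2rad([90.0, 210.0, 330.0])
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (3 features, 2 dims)

# Off-diagonal entries of the Gram matrix measure interference between
# features; with more features than dimensions they cannot all be zero.
gram = W @ W.T


def encode(features):
    """Map feature activations (3,) to the 2-dimensional hidden space."""
    return features @ W


def decode(hidden):
    """Read each feature out with a dot product, then clip with a ReLU."""
    return np.maximum(hidden @ W.T, 0.0)


sparse_input = np.array([0.0, 1.0, 0.0])  # a single active feature
dense_input = np.array([1.0, 1.0, 0.0])   # two features active at once

recovered_sparse = decode(encode(sparse_input))
recovered_dense = decode(encode(dense_input))
```

With one active feature the negative interference is removed by the ReLU and the input is recovered; with two active features the readout is distorted. Each hidden coordinate (a "neuron" in this toy) also responds to more than one feature, which is the polysemanticity discussed above.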

Figure 9: Left: Feature directions in a 2-dimensional space. (a) features as directions not aligned with the standard basis, we observe polysemanticity. (b) features aligned with the standard basis, monosemanticity. (c) more features than dimensions (superposition), hence features can’t align with the standard basis and polysemanticity is inevitable. Right: Sparse autoencoder (SAE) trained to reconstruct a model’s internal representations $\bm{z}$. Interpretable SAE features are found in rows of $\bm{W}_{\text{dec}}$. Biases are omitted for the sake of clarity.
Sparse Autoencoders (SAEs).

A possible strategy to disentangle features in superposition involves finding an overcomplete feature basis via dictionary learning (Olshausen & Field, 1997). Autoencoders with sparsity regularization, also known as sparse autoencoders (SAEs), can be used for dictionary learning by optimizing them to reconstruct internal representations $\bm{z} \in \mathbb{R}^{d}$ of a neural network exhibiting superposition while simultaneously promoting feature sparsity. That is, we obtain a reconstruction $\bm{z} = \text{SAE}(\bm{z}) + \epsilon$, where $\epsilon$ is the SAE error term. Sharkey et al. (2022); Bricken et al. (2023); Cunningham et al. (2023) propose training SAEs (see Figure 9 right) of the form

$$\text{SAE}(\bm{z}) = \underbrace{\text{ReLU}\bigl((\bm{z} - \bm{b}_{\text{dec}})\bm{W}_{\text{enc}} + \bm{b}_{\text{enc}}\bigr)}_{\text{SAE feature activations } h(\bm{z})} \, \underbrace{\bm{W}_{\text{dec}}}_{\text{Dictionary of SAE features}} + \bm{b}_{\text{dec}} \tag{22}$$

on language models’ representations, with a loss defined as

$$\mathcal{L}(\bm{z}) = \underbrace{\left\lVert \bm{z} - \text{SAE}(\bm{z}) \right\rVert_{2}^{2}}_{\text{Reconstruction loss term}} + \underbrace{\alpha \left\lVert h(\bm{z}) \right\rVert_{1}}_{\text{Sparsity loss term}}. \tag{23}$$

By inducing sparsity on the latent representation of SAE feature activations $h(\bm{z}) = \text{ReLU}\bigl((\bm{z} - \bm{b}_{\text{dec}})\bm{W}_{\text{enc}} + \bm{b}_{\text{enc}}\bigr) \in \mathbb{R}^{m}$ and setting $m > d$, we can approximate $\bm{z}$ as a sparse linear combination of the rows of the learned dictionary $\bm{W}_{\text{dec}} \in \mathbb{R}^{m \times d}$, from which we can extract interpretable and monosemantic SAE features. [Footnote 16: We present in Appendix D some SAE implementation details that are currently debated.] Since the output weights of each SAE feature interact linearly with the residual stream, we can measure their direct effect on the logits (Section 3.2.1) and their composition with later layers’ components (Section 2.2; He et al., 2024). An initial assessment of reconstruction errors ($\epsilon$) in SAEs trained on LM activations highlighted their systematic nature, driving a shift in next token prediction probabilities much higher than random noise (Gurnee, 2024). Marks et al. (2024) also found these errors account for 1–15% of the variance of $\bm{z}$. While this finding might undermine the faithfulness of component analyses relying on SAE features, Marks et al. (2024) propose an adaptation of the causal model framework outlined in Section 3.2.2 aiming to incorporate SAE features and errors as nodes of the computational graph. Using edge attribution patching (Section 3.2.3), they recover sparse feature circuits providing more intuitive overviews of features driving model predictions.
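The SAE forward pass and loss of Equations 22 and 23 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: parameters are randomly initialized rather than trained, the sizes `d` and `m` and the sparsity coefficient `alpha` are arbitrary, and the loss is averaged over a batch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 16  # m > d: overcomplete dictionary

# Randomly initialized parameters (a trained SAE would learn these).
W_enc = rng.normal(scale=0.1, size=(d, m))
W_dec = rng.normal(scale=0.1, size=(m, d))
b_enc = np.zeros(m)
b_dec = np.zeros(d)


def sae_features(z):
    """Feature activations h(z) = ReLU((z - b_dec) W_enc + b_enc)."""
    return np.maximum((z - b_dec) @ W_enc + b_enc, 0.0)


def sae(z):
    """Reconstruction SAE(z) = h(z) W_dec + b_dec (Equation 22)."""
    return sae_features(z) @ W_dec + b_dec


def sae_loss(z, alpha=1e-3):
    """Squared L2 reconstruction error plus alpha * L1 norm of h(z)
    (Equation 23), averaged over the batch dimension."""
    reconstruction = np.sum((z - sae(z)) ** 2, axis=-1)
    sparsity = alpha * np.sum(np.abs(sae_features(z)), axis=-1)
    return float(np.mean(reconstruction + sparsity))


z = rng.normal(size=(32, d))   # stand-in for model representations
h = sae_features(z)            # (32, m) nonnegative feature activations
loss = sae_loss(z)
error = z - sae(z)             # the SAE error term epsilon
```

Each row of `W_dec` plays the role of one dictionary feature, and `h` gives the coefficients of the sparse linear combination that approximates `z`.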

SAEs Evaluation.

The goal of SAEs is to learn sparse reconstructions of representations. To assess the quality of a trained SAE in achieving this, it is common to compute the Pareto frontier of two metrics on an evaluation set (Bricken et al., 2023). These metrics are:

  • The L0 norm of the feature activations vector $h(\bm{z})$, which measures how many features are “alive” given an input. This metric is averaged across the evaluation set, $\mathbb{E}_{\mathbf{z} \sim \mathcal{D}} \left\lVert h(\bm{z}) \right\rVert_{0}$.

  • The loss recovered, which reflects the percentage of the original cross-entropy loss of the LM across a dataset when substituting the original representations with the SAE reconstructions.

A summary statistic proposed by Bricken et al. (2023) is the feature density histogram. Feature density is the proportion of tokens in a dataset where an SAE feature has a non-zero value. By looking at the distribution of feature densities we can distinguish if the SAE learnt features that are too dense (activate too often) or too sparse (activate too rarely). Finally, the degree of interpretability of sparse features can be estimated based on their direct logit attribution and maximally activating examples (see Figure 10 left; we introduce these concepts in Section 4.3). This process can be done manually or automated, using an LLM to produce natural language explanations of SAE features.
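The activation-based statistics above can be computed directly from a matrix of SAE feature activations. The sketch below uses synthetic activations with an assumed per-feature firing probability; the loss recovered metric is omitted because it requires running the underlying LM. Plotting densities on a log scale is a common choice and is an assumption here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, m = 1000, 64

# Hypothetical feature activations h(z) for n_tokens tokens: each feature
# fires with its own probability and takes a positive value when it fires.
fire_prob = 10 ** rng.uniform(-3.0, -0.5, size=m)
mask = rng.random((n_tokens, m)) < fire_prob
H = np.where(mask, rng.random((n_tokens, m)) + 0.1, 0.0)


def mean_l0(H):
    """Average number of non-zero features per token (the mean L0 norm)."""
    return float(np.mean(np.count_nonzero(H, axis=1)))


def feature_density(H):
    """Fraction of tokens on which each feature is non-zero, shape (m,)."""
    return np.mean(H != 0, axis=0)


l0 = mean_l0(H)
density = feature_density(H)

# Histogram of log10 densities; features that never fire are clipped to a
# small floor so that the logarithm is defined.
log_density = np.log10(np.maximum(density, 1e-10))
counts, bin_edges = np.histogram(log_density, bins=20)
```

Note that the mean L0 equals the sum of the feature densities, so the two statistics describe the same sparsity from the per-token and per-feature points of view.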

Figure 10: Left: SAE feature visualization on Neuronpedia (Lin & Bloom, 2024). It shows the promoted/suppressed tokens, feature density, logits distribution, and maximally activating examples of a name mover feature found in GPT-2 Small (Kissane et al., 2024b). Right: Gated Sparse Autoencoder with encoder weight sharing. Biases are omitted for the sake of clarity.
Gated SAEs (GSAEs).

The sparsity penalty used in SAE training promotes smaller feature activations, biasing the reconstruction process towards smaller norms. This phenomenon is known as shrinkage (Tibshirani, 1996; Wright & Sharkey, 2024). Rajamanoharan et al. (2024) address this issue by proposing Gated Sparse Autoencoders (GSAEs) and a complementary loss function. GSAEs are inspired by Gated Linear Units (Dauphin et al., 2017; Shazeer, 2020) and employ a gated ReLU encoder to decouple feature magnitude estimation from feature detection (Figure 10 right):

$$\text{GSAE}(\bm{z}) = \underbrace{\underbrace{\mathds{1}\bigl[\bigl((\bm{z} - \bm{b}_{\text{dec}})\bm{W}_{\text{gate}} + \bm{b}_{\text{gate}}\bigr) > 0\bigr]}_{\text{GSAE features' gate}} \odot \underbrace{\text{ReLU}\bigl((\bm{z} - \bm{b}_{\text{dec}})\bm{W}_{\text{mag}} + \bm{b}_{\text{mag}}\bigr)}_{\text{GSAE feature activations' magnitude}}}_{h(\bm{z})} \bm{W}_{\text{dec}} + \bm{b}_{\text{dec}}, \tag{24}$$

where $\mathds{1}$ is the step function. The features’ gate and activation magnitudes are computed by sharing weight matrices, $\bm{W}_{\text{mag}[i,j]} = \bm{W}_{\text{gate}[i,j]} \, e^{\bm{r}_{[j]}}$, where $\bm{r} \in \mathbb{R}^{m}$ is a learned rescaling vector; thus, $\bm{W}_{\text{gate}}$ can be considered the encoder matrix $\bm{W}_{\text{enc}}$ (Figure 10). Rajamanoharan et al. (2024) show GSAE is a Pareto improvement over the standard SAE architecture on a range of models, scaling GSAEs up to Gemma 7B (Gemma Team et al., 2024).
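A forward pass matching Equation 24 and the weight-sharing scheme can be sketched as follows. Parameters are random placeholders rather than trained values, and the complementary training loss of the original method is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 16

# Random placeholder parameters (assumptions; a trained GSAE learns these).
W_gate = rng.normal(scale=0.5, size=(d, m))
r = rng.normal(scale=0.1, size=m)      # rescaling vector, one entry per feature
W_mag = W_gate * np.exp(r)             # W_mag[i, j] = W_gate[i, j] * exp(r[j])
b_gate = rng.normal(scale=0.1, size=m)
b_mag = rng.normal(scale=0.1, size=m)
W_dec = rng.normal(scale=0.5, size=(m, d))
b_dec = np.zeros(d)


def gsae_features(z):
    """h(z): a binary gate (feature detection) times a ReLU magnitude."""
    centered = z - b_dec
    gate = (centered @ W_gate + b_gate > 0).astype(z.dtype)
    magnitude = np.maximum(centered @ W_mag + b_mag, 0.0)
    return gate * magnitude


def gsae(z):
    """GSAE(z) = h(z) W_dec + b_dec (Equation 24)."""
    return gsae_features(z) @ W_dec + b_dec


z = rng.normal(size=(8, d))
h = gsae_features(z)
gate_pre = (z - b_dec) @ W_gate + b_gate  # pre-activations of the gate
reconstruction = gsae(z)
```

Because the gate decides whether a feature is active and a separate bias sets how large it is, a sparsity penalty applied to the gate path need not shrink the magnitudes.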

4.3 Decoding in Vocabulary Space

The model engages with the vocabulary in two primary ways: firstly, through a set of input tokens facilitated by the embedding matrix $\bm{W}_{E}$, and secondly, by interacting with the output space via the unembedding matrix $\bm{W}_{U}$. Hence, and due to its interpretable nature, a sensible way to approach decoding the information within models’ representations is via vocabulary tokens.

Decoding intermediate representations.

The logit lens (nostalgebraist, 2020) proposes projecting intermediate residual stream states $\bm{x}^{l}$ by $\bm{W}_{U}$. The logit lens can also be interpreted as the prediction the model would make if skipping all later layers, and can be used to analyze how the model refines the prediction throughout the forward pass (Jastrzębski et al., 2018). This technique has proven effective in analyzing encoder representations in encoder-decoder models (Langedijk et al., 2023). However, the logit lens can fail to elicit plausible predictions in some particular models (Belrose et al., 2023a). This phenomenon has inspired researchers to train translators, which are functions applied to the intermediate representations prior to the unembedding projection. Din et al. (2023) suggest using linear mappings, while Belrose et al. (2023a) propose affine transformations (tuned lens). Translators have also been trained on the outputs of attention heads, resulting in the attention lens (Sakarvadia et al., 2023). More generally, we can also think of $\bm{W}_{U}$ as the weights learned by a probe whose classes are the subwords in the vocabulary (Section 4.1), and inspect at any point in the network the amount of information encoded about any subword.
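A minimal sketch of the logit lens and of an affine translator, using random matrices in place of a real model. The residual states, sizes, and translator parameters `A` and `b` are all assumptions for illustration; the sketch also omits the final layer normalization that practical implementations typically apply before unembedding.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size, n_layers = 16, 50, 4

W_U = rng.normal(size=(d, vocab_size))  # stand-in unembedding matrix
# Hypothetical residual stream states x^l at one position, one row per layer.
residual_states = rng.normal(size=(n_layers + 1, d))


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def logit_lens(x, W_U):
    """Project intermediate states directly onto the vocabulary."""
    return softmax(x @ W_U)


def translator_lens(x, W_U, A, b):
    """Apply an affine translator (x A + b) before the unembedding."""
    return softmax((x @ A + b) @ W_U)


probs = logit_lens(residual_states, W_U)   # (n_layers + 1, vocab_size)
top_tokens = probs.argmax(axis=-1)         # most likely token id per layer

# With an identity translator, the translated lens reduces to the logit lens.
same = translator_lens(residual_states, W_U, np.eye(d), np.zeros(d))
```

In a trained translator, `A` and `b` would be fit per layer so that the decoded distribution matches the model's final output more closely.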

Patchscopes.

Patchscopes (Ghandeharioun et al., 2024) is a framework that generalizes patching to decode information from intermediate representations. [Footnote 17: A concurrent similar approach is presented by Chen et al. (2024c).] Recall from Section 3.2.2 that patching an activation into a forward pass, $f(\mathbf{x} \mid \text{do}(f^{c}(\mathbf{x}) = \tilde{\bm{h}}))$, serves to evaluate the output change with respect to the original clean run $f(\mathbf{x})$. A patchscope defines a function $m(\tilde{\bm{h}})$ acting on the patched representation, a target model $f^{*}$ for the patched run, which can differ from the original $f$, a target prompt $\mathbf{x}^{*}$, and a target model component $c^{*}$ that can be at a different position and layer. It then evaluates $f^{*}(\mathbf{x}^{*} \mid \text{do}(f^{c^{*}}(\mathbf{x}^{*}) = m(\tilde{\bm{h}})))$, either by inspecting the output logits or probabilities, or by generating a natural language explanation from it. The choice of $f^{*}$, $m$, $\mathbf{x}^{*}$, and $c^{*}$ defines the type of information to extract from $\tilde{\bm{h}}$ independently of the original context, allowing a higher expressivity. For instance, the future lens (Pal et al., 2023), used to decode future tokens from intermediate representations, can be considered as a patchscope where $f^{*} = f$, $c^{*} = c$ and $\mathbf{x}^{*}$ is a learned prompt.
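The structure of a patchscope can be sketched on a toy residual network. Everything here is a stand-in: the "model" is a stack of random residual layers acting on a single vector, and the names `forward` and `patchscope` are hypothetical. Because the toy has a single position, patching overwrites the whole state; in a real Transformer the other positions of the target prompt would still shape the output.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 8, 4
layers = [rng.normal(scale=0.3, size=(d, d)) for _ in range(n_layers)]


def forward(x, weights, patch=None):
    """Run a toy residual network and return the state after each layer.
    `patch=(layer_index, vector)` overwrites the state after that layer."""
    states = []
    for i, W in enumerate(weights):
        x = x + np.tanh(x @ W)
        if patch is not None and patch[0] == i:
            x = patch[1]
        states.append(x)
    return states


def patchscope(source_x, source_layer, target_x, target_layer,
               source_weights, target_weights, mapping=lambda h: h):
    """Read h~ from the source run, apply the mapping m, patch it into the
    target run at the target layer, and return the target model's output."""
    h_tilde = forward(source_x, source_weights)[source_layer]
    patched = forward(target_x, target_weights,
                      patch=(target_layer, mapping(h_tilde)))
    return patched[-1]


x = rng.normal(size=d)        # source input
x_other = rng.normal(size=d)  # a different source input
x_star = rng.normal(size=d)   # target input

clean = forward(x, layers)[-1]
# Same model, input, and layer with the identity mapping: a no-op patch.
noop = patchscope(x, 1, x, 1, layers, layers)
# Different source layer and target layer, as the framework allows.
cross = patchscope(x, 0, x_star, 2, layers, layers)
```

Standard activation patching corresponds to the special case where the target model and component match the source and the mapping is the identity.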

Decoding model weights.

As seen in previous sections, $\bm{W}^{h}_{OV}$, $\bm{W}_{\text{out}}$, $\bm{W}_{\text{in}}$ and $\bm{W}_{\text{QK}}$ interact linearly with the residual streams. Dar et al. (2023) suggest analyzing matrix weights in vocabulary space by projecting them by $\bm{W}_{U}$, and find that some weight matrices interact with tokens with related semantic meanings. Millidge & Black (2022) propose to factorize these matrices via the singular value decomposition (SVD). In the “thin” SVD, a matrix is factorized as $\bm{W} = \bm{U}\mathbf{\Sigma}\bm{V}^{\intercal}$, with $\bm{U} \in \mathbb{R}^{d \times r}$, $\mathbf{\Sigma} \in \mathbb{R}^{r \times r}$, $\bm{V}^{\intercal} \in \mathbb{R}^{r \times d}$, and $r = \text{rk}(\bm{W})$, the rank of $\bm{W}$. The right singular vectors associated with the largest singular values (rows of $\bm{V}^{\intercal}$) represent the directions along which a linear transformation stretches the most. [Footnote 18: Note that we left multiply by $\bm{W}$.] Then, multiplying $\bm{z}$ by $\bm{W}$ (Figure 11 left) can be expressed as

$$\bm{z}\bm{W} = (\bm{z}\bm{U}\mathbf{\Sigma})\bm{V}^{\intercal} = \sum_{i=1}^{r} (\bm{z}\bm{u}_{i}\sigma_{i})\bm{v}_{i}^{\intercal}, \tag{25}$$

where ${\bm{u}}_{i}\in\mathbb{R}^{d\times 1}$ can be seen as a key that is compared to the query (${\bm{z}}$) via dot product, weighting the right singular vector ${\bm{v}}_{i}^{\intercal}$ (McDougall, 2023; Molina, 2023), similar to Equation 8. By projecting the top right singular vectors onto the vocabulary space via the unembedding matrix (${\bm{v}}_{i}^{\intercal}{\bm{W}}_{U}$), we reveal the tokens the matrix primarily interacts with (Figure 11 right). Alternatively, we can use the SVD to find a low-rank approximation $\widehat{{\bm{W}}}(k)=\sum_{i=1}^{k}({\bm{u}}_{i}\sigma_{i}){\bm{v}}_{i}^{\intercal}$, where $\text{rk}(\widehat{{\bm{W}}}(k))=k<r$, and study the model predictions after substituting the original matrix with $\widehat{{\bm{W}}}(k)$ (Sharma et al., 2024b). Katz et al. (2024) propose extending the projection of weight matrices (Dar et al., 2023) to the backward pass. Specifically, the backward lens projects the gradient matrices of the FFNs to study how new information is stored in their weights.
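The decomposition in Equation 25 and the low-rank approximation can be checked numerically. The following is a minimal NumPy sketch on a random matrix; the sizes, the matrix `W`, and the helper `low_rank` are illustrative assumptions, not weights or code from any of the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 6, 4
W = rng.normal(size=(d_in, d_out))   # hypothetical weight matrix
z = rng.normal(size=(1, d_in))       # row-vector internal representation

# Thin SVD: W = U @ diag(S) @ Vt, singular values sorted in descending order.
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# Equation 25: (z @ U) holds the query-key dot products z u_i; each one,
# scaled by sigma_i, weights the corresponding right singular vector v_i^T.
key_scores = (z @ U) * S             # shape (1, r)
zW_from_svd = key_scores @ Vt        # sum_i (z u_i sigma_i) v_i^T


def low_rank(k):
    """Rank-k approximation W_hat(k) built from the top-k singular triplets."""
    return (U[:, :k] * S[:k]) @ Vt[:k, :]


W_hat = low_rank(2)
```

Swapping `W` for `W_hat` in a forward pass is the kind of intervention described above. With a real model one would also project rows of `Vt` through the unembedding matrix to read off the associated tokens.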

Figure 11: Left: Multiplication of an internal representation ${\bm{z}}$ by the SVD of a matrix ${\bm{W}}$. Right: The top right singular vector ${\bm{v}}_{1}^{\intercal}$ represents the direction along which the transformation stretches the most, revealing the tokens the matrix primarily interacts with when projected onto the vocabulary space. The input representation ${\bm{z}}$ and the associated left singular vector ${\bm{u}}_{1}$ act as a query and a key respectively, with ${\bm{v}}_{1}^{\intercal}$ being the value associated with the key.
Logit spectroscopy.

Cancedda (2024) proposes an extension of the logit lens, logit spectroscopy, which allows a fine-grained decoding of the information in internal representations via the unembedding matrix (${\bm{W}}_{U}$). Logit spectroscopy splits the right singular matrix of ${\bm{W}}_{U}$ into $N$ bands: $\{{\bm{V}}_{u,1}^{\intercal},\ldots,{\bm{V}}_{u,N}^{\intercal}\}$, where ${\bm{V}}_{u,1}^{\intercal}$ and ${\bm{V}}_{u,N}^{\intercal}$ each contain a set of singular vectors, the former associated with the largest singular values and the latter with the smallest. If we consider the concatenation of matrices associated with different bands, e.g. from the $j$-th to the $k$-th band, we form a matrix ${\bm{V}}_{u,j:k}^{\intercal}$ whose rows span a linear subspace of the vocabulary space. We can use the operator $\Phi_{u,j:k}={\bm{V}}_{u,j:k}{\bm{V}}_{u,j:k}^{\intercal}$ to evaluate the orthogonal projection ${\bm{z}}\Phi_{u,j:k}$ of representations ${\bm{z}}$ onto different subspaces. Alternatively, we can suppress the projection from the representation, i.e. ${\bm{z}}^{\prime}\leftarrow{\bm{z}}-{\bm{z}}\Phi_{u,j:k}$, leaving its orthogonal component with respect to the subspace. Similarly, bands of singular vectors of the embedding matrix can be considered in the analysis.
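The band projection and suppression operations can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the matrix `M` is random and is oriented so that its right singular vectors have the same dimensionality as ${\bm{z}}$ (which is what makes ${\bm{z}}\Phi$ well defined here), the bands are equally sized, and `band_projector` is a hypothetical helper rather than the implementation of Cancedda (2024).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_rows = 8, 20
# Hypothetical matrix, oriented so that its right singular vectors live in the
# same d-dimensional space as z (an assumption of this sketch).
M = rng.normal(size=(n_rows, d))
z = rng.normal(size=(1, d))

# Rows of Vt are right singular vectors, sorted by decreasing singular value.
_, S, Vt = np.linalg.svd(M, full_matrices=False)

n_bands = 4
bands = np.array_split(np.arange(Vt.shape[0]), n_bands)  # band 0: largest values


def band_projector(j, k):
    """Phi_{j:k} = V_{j:k} V_{j:k}^T for bands j..k (inclusive, 0-indexed)."""
    idx = np.concatenate(bands[j:k + 1])
    V_band = Vt[idx].T            # columns are the selected singular vectors
    return V_band @ V_band.T


Phi = band_projector(2, 3)        # the two bands with the smallest singular values
z_proj = z @ Phi                  # component of z inside the band subspace
z_supp = z - z_proj               # z with that component suppressed
```

Because the singular vectors are orthonormal, `Phi` is idempotent and `z_supp` has no remaining component in the selected bands.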

Maximally-activating inputs.

The features encoded in model neurons or representation units have been widely studied by considering the inputs that maximally activate them (Zhou et al., 2015; Zeiler & Fergus, 2014). In image models this can be done either by generating synthesized inputs (Nguyen et al., 2016), e.g. via gradient descent (Simonyan et al., 2014), or by selecting examples from an existing dataset. The latter approach has been used in language models to explain the features that units (Dalvi et al., 2019) and neurons (Nanda, 2022b) respond to. However, Bolukbasi et al. (2021) warn that relying only on maximally activating dataset examples can result in “interpretability illusions”, as different activation ranges may lead to varying interpretations. Maximally-activating inputs can produce out-of-distribution behaviors, and were recently employed to craft jailbreak attacks aimed at eliciting unacceptable model predictions (Chowdhury et al., 2024), for example by crafting maximally-inappropriate inputs for red-teaming purposes (Wichers et al., 2024).

Natural language explanations from LMs.

Modern LMs can be prompted to provide plausible-sounding justifications for their own or other LMs’ predictions. This can be seen as an edge case of information decoding in which the predictor itself is used as a zero-shot explainer. A notable example is the work by Bills et al. (2023), where GPT-4 is prompted to describe shared features in sets of examples producing high activations for specific neurons across GPT-2 XL. Subsequent work by Huang et al. (2023) shows that neurons identified by Bills et al. (2023) do not have a causal influence over the concepts highlighted in the generated explanation, underscoring a lack of faithfulness in such an approach. Additional investigations into the consistency between input attribution and self-explanations in language models highlighted the tendency of LMs to produce explanations that are very plausible according to human intuition, but unfaithful to model inner workings (Atanasova et al., 2023; Parcalabescu & Frank, 2023; Turpin et al., 2023; Lanham et al., 2023; Madsen et al., 2024; Agarwal et al., 2024).

5 Discovered Inner Behaviors

The techniques presented in Sections 3 and 4 have equipped us with essential tools to understand the behavior of language models. In the following sections, we provide an overview of the internal mechanisms that have been discovered within Transformer LMs.

5.1 Attention Block

As seen in Section 2.1.2, each attention head consists of a QK (query-key) circuit and an OV (output-value) circuit. The QK circuit computes the attention weights, determining the positions that need to be attended, while the OV circuit moves (and transforms) the information from the attended position into the current residual stream. A substantial body of research has been dedicated to analyzing the attention weight patterns formed by QK circuits (Clark et al., 2019; Kovaleva et al., 2019; Voita et al., 2019b), fueling a debate on whether these weights serve as explanations (Bibal et al., 2022). However, our understanding of the specific features encoded in the subspaces employed by circuit operations is still limited. Here, we categorize known behaviors of attention heads into two groups: those having intelligible attention patterns, and those with meaningful QK and OV circuits.

5.1.1 Attention heads with interpretable attention weights patterns

Positional heads.

Clark et al. (2019) showed that some BERT heads attend mostly to specific positions relative to the token being processed: the token itself, the previous token, or the next position. A similar pattern is also observed in encoders of neural machine translation models (Voita et al., 2019b; Raganato & Tiedemann, 2018). Previous token heads are an essential part of the induction mechanism (Section 5.1.2), and have been shown to be necessary for circuits in GPT2-Small (Wang et al., 2023a). Their main role has been associated with copying previous token information to the following residual stream, for example to concatenate two-token names (Nanda et al., 2023c). Ferrando & Voita (2024) show previous token heads are important across several textual domains.

Subword joiner heads.

First discovered in machine translation encoders (Correia et al., 2019), subword joiner heads have also been observed in large language models (Ferrando & Voita, 2024). These heads attend exclusively to previous tokens that are subwords belonging to the same word as the currently processed token.

Syntactic heads.

Some attention heads attend to tokens having syntactic roles with respect to the processed token significantly more than a random baseline (Clark et al., 2019; Htut et al., 2019). In particular, certain heads specialize in given dependency relation types such as obj, nsubj, advmod, and amod. Chen et al. (2024a) show these heads appear suddenly during the training of masked language models, playing a crucial role in the subsequent development of linguistic abilities.

Duplicate token heads.

Duplicate token heads attend to previous occurrences of the same token in the context of the current token. Wang et al. (2023a) hypothesize that, in the IOI task (Section 5.4), these heads copy the position of the previous occurrence to the current position.

5.1.2 Attention heads with interpretable QK and OV circuits

Copying heads.

Several attention heads in Transformer LMs have OV matrices that exhibit copying behavior. Elhage et al. (2021a) propose using the number of positive real eigenvalues of the full OV circuit matrix 𝑾E𝑾OV𝑾Usubscript𝑾𝐸subscript𝑾𝑂𝑉subscript𝑾𝑈{\bm{W}}_{E}{\bm{W}}_{OV}{\bm{W}}_{U}bold_italic_W start_POSTSUBSCRIPT italic_E end_POSTSUBSCRIPT bold_italic_W start_POSTSUBSCRIPT italic_O italic_V end_POSTSUBSCRIPT bold_italic_W start_POSTSUBSCRIPT italic_U end_POSTSUBSCRIPT as a summary statistic for detecting copying heads. Positive eigenvalues mean that there exists a linear combination of tokens contributing to an increase in the linear combination of logits of the same tokens.
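As an illustration of this summary statistic, the sketch below computes the fraction of eigenvalues of ${\bm{W}}_{E}{\bm{W}}_{OV}{\bm{W}}_{U}$ with positive real part. It relies on the fact that the nonzero eigenvalues of a product $AB$ coincide with those of $BA$, so only a $d \times d$ matrix needs to be decomposed. The function name, the random weights, and the tied embeddings are assumptions made for the toy example, and the exact statistic used by Elhage et al. (2021a) may differ.

```python
import numpy as np


def positive_eigenvalue_fraction(W_E, W_OV, W_U):
    """Fraction of nonzero eigenvalues of W_E W_OV W_U with positive real part.

    The nonzero eigenvalues of A @ B equal those of B @ A, so the small
    (d x d) matrix W_OV W_U W_E is decomposed instead of the (|V| x |V|) one.
    """
    small = W_OV @ W_U @ W_E
    eigvals = np.linalg.eigvals(small)
    nonzero = eigvals[np.abs(eigvals) > 1e-9]
    return float(np.mean(nonzero.real > 0))


rng = np.random.default_rng(0)
vocab, d = 30, 5
W_E = rng.normal(size=(vocab, d))           # hypothetical embedding matrix
W_U = W_E.T                                 # tied embeddings, for illustration only

# An identity OV map copies token information; its negation does the opposite.
copy_score = positive_eigenvalue_fraction(W_E, np.eye(d), W_U)
anti_score = positive_eigenvalue_fraction(W_E, -np.eye(d), W_U)
```

In this toy setting the identity OV map yields only positive eigenvalues and its negation only negative ones; real heads fall somewhere in between.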

Figure 12: Left: Induction mechanism. An early previous token head writes information of A into B’s residual stream via ${\bm{W}}_{OV}$. This information (shown in red) gets read by the ${\bm{W}}_{QK}$ matrix of a downstream induction head (K-composition), which uses it to attend to B and copy its information, increasing the likelihood of B for the next token prediction. Right: Copy suppression mechanism. The copy suppression head detects that a token in the context (A) is being confidently predicted at the current residual stream, for instance, thanks to a previous copying head (shown in red). The copy suppression head attends to it and suppresses its prediction, improving model calibration. Other components are hidden for the sake of clarity.
Induction heads.

An induction mechanism (Figure 12 left) that allows language models to complete patterns was first discovered by Elhage et al. (2021a) and further studied by Olsson et al. (2022). [Footnote 19: We follow the mechanistic formulation by Elhage et al. (2021a). See Variengien (2023) for a discussion.] This mechanism involves two heads in different layers composing together: a previous token head (PTH) and an induction head. The induction mechanism learns to increase the likelihood of token B given the sequence A B … A, irrespective of what A and B are. To do so, a PTH in an early layer copies information from the first instance of token A to the residual stream of B, specifically by writing in the subspace the QK circuit of the induction head reads from (K-composition). This makes the induction head at the last position attend to token B, and subsequently, its copying OV circuit increases the logit score of B. Olsson et al. (2022) demonstrate that the OV and QK circuits of the induction head can perform fuzzy versions of copying and prefix matching, giving rise to patterns of the kind $\texttt{A}^{*}\,\texttt{B}^{*} \ldots \texttt{A} \rightarrow \texttt{B}$, where A and $\texttt{A}^{*}$, and B and $\texttt{B}^{*}$, are semantically related (e.g. the same words in different languages). Overall, induction heads have been shown to appear broadly in Transformer LMs (Nanda, 2022a), with those operating at an n-gram level being identified as important drivers of in-context learning (Akyürek et al., 2024). Recent work showed that these heads display both complementary and redundant behaviors, likely shaped by competitive dynamics during optimization (see Section 5.4; Singh et al., 2024a). Relatedly, redundancy was also observed in the connections between early-layer PTHs and subsequent induction heads. Finally, the emergence rate of induction heads is impacted by the diversity of in-context tokens, with higher diversity in attended and copied tokens delaying the formation of the two respective sub-mechanisms (Singh et al., 2024a).
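The input-output behavior of the induction mechanism can be mimicked with a purely symbolic sketch. The function below is an idealized stand-in, not a Transformer: it assumes exact token matching (no fuzzy matching) and a single previous token head, and the name `induction_prediction` is hypothetical.

```python
def induction_prediction(tokens):
    """Return the token an idealized induction mechanism would promote.

    Step 1 (previous token head): each position stores its preceding token.
    Step 2 (induction head): the last position attends to the most recent
    position whose stored previous token matches the current token, and
    copies the token found there.
    """
    prev_token = [None] + tokens[:-1]      # what the PTH writes at each position
    current = tokens[-1]
    for pos in range(len(tokens) - 1, 0, -1):
        if prev_token[pos] == current:     # prefix match enabled by K-composition
            return tokens[pos]             # copying OV circuit promotes this token
    return None                            # no earlier occurrence to complete from


prediction = induction_prediction(["A", "B", "C", "A"])
```

Here the last token A matches the previous-token information stored at B, so the sketch returns B. A sequence with no repeated token returns `None`.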

Copy suppression heads.

Copy suppression heads, discovered in GPT2-Small (McDougall et al., 2023), reduce the logit score of the token they attend to, but only if it appears in the context and the current residual stream is confidently predicting it (Figure 12 right). This mechanism was shown to improve overall model calibration by avoiding naive copying in many contexts (e.g. copying “love” in “All’s fair in love and ___”). The OV circuit of a copy suppression head can copy-suppress almost all of the tokens in the model’s vocabulary when attended to. This behavior is confirmed by analyzing the “effective QK circuit” of GPT2-Small, ${\bm{W}}_{U}{\bm{W}}_{QK}\text{FFN}^{1}({\bm{W}}_{E})$, in which the key input is the $\text{FFN}^{1}$ output of every token and the query input is the unembedding of any token; in the resulting matrix, the diagonal elements rank higher. Copy suppression is also linked to the self-repair mechanism, since ablating an essential component deactivates the suppression behavior, compensating for the ablation.

Successor heads.

Given an input token belonging to an element in an ordinal sequence (e.g. “one”, “Monday”, or “January”), the ‘effective OV circuit’ of the successor heads, $\text{FFN}^{1}({\bm{W}}_{E}){\bm{W}}_{OV}{\bm{W}}_{U}$, increases the logits of tokens corresponding to the next elements in the sequence (e.g. “two”, “Tuesday”, “February”). Specifically, Gould et al. (2024) show the output of the first FFN block represents a common ‘numerical structure’ on which the successor head acts. Gould et al. (2024) find these heads in Pythia (Biderman et al., 2023), GPT2 (Radford et al., 2019) and Llama 2 (Touvron et al., 2023) models.

5.1.3 Other noteworthy attention properties

Domain specialization.

The attention heads previously described serve specific functions in aiding the model to predict the next token. However, the degree of specialization of components across different domains and tasks remains unclear. Ferrando & Voita (2024), Chughtai et al. (2024), and Lv et al. (2024) identify some specialized heads that contribute only within specific input domains, such as non-English contexts, coding sequences, or specific topics. An analysis of the top singular vectors of their OV matrices (Section 4.3) reveals that these heads mainly promote tokens related to the semantics of the input they participate in.

Figure 13: Attention sink mechanism. An attention head at a token C attends to the BOS token. Its OV circuit squeezes the BOS residual stream representation, resulting in a negligible update that leaves the residual stream of C unchanged. Cancedda (2024) suggests that early FFNs in Llama 2 write into a “dark subspace” (in red) in the BOS residual stream that allows later heads to exploit this behavior. Gurnee et al. (2024) find specific neurons in previous layer FFNs of GPT-2 that control the extent to which the query attends to BOS (in orange).
Attention sinks.

Early investigations into BERT (Kovaleva et al., 2019) revealed that most attention heads exhibit “vertical” attention patterns, mainly focusing on special (CLS, SEP) and punctuation tokens. Clark et al. (2019) hypothesized that a head may attend to special tokens when its specialized function is not applicable (no-op hypothesis). Kobayashi et al. (2020) showed that the norms of the value vectors (Section 3.1) associated with special tokens, periods, and commas tend to be small, canceling out the effect of large attention weights and thereby supporting the no-op hypothesis. Furthermore, it was shown that attention to the end-of-sequence token in MT models is used to ignore the contribution of the source sentence (Ferrando & Costa-jussà, 2021), which is useful when predicting some function words such as the particle “off” in “She turned off the lights.” In auto-regressive LMs, these patterns are observed mainly on the beginning-of-sentence (BOS) token (Figure 13), although other tokens play the same role (Ferrando & Voita, 2024). According to Xiao et al. (2023), allowing attention mass on the BOS token is necessary for streaming generation, and performance degrades when the BOS is omitted. Using logit spectroscopy (Section 4.3), Cancedda (2024) finds that early FFNs in Llama 2 write into the residual stream of BOS information that is relevant for the attention sink mechanism to occur in later layers. These FFNs write into the linear subspace spanned by the right singular vectors of the unembedding matrix with the lowest associated singular values. Cancedda (2024) refers to this as a dark subspace due to its low interference with next token prediction, and finds a significant correlation between the average attention received by a token and the existence of these dark signals in its residual stream. These dark signals manifest as massive activation values acting as fixed biases (Sun et al., 2024), a crucial prerequisite for the attention sink mechanism to take place (Puccetti et al., 2022; Bondarenko et al., 2023). On the other hand, specific neurons in the FFN of the layer before the attention head have been found to control the extent to which tokens attend to BOS (Gurnee et al., 2024; Figure 13).
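The no-op reading of attention sinks can be illustrated with the norm-based view of attention, in which a token's contribution is measured by the norm of its attention-weighted value vector rather than by the attention weight alone. The numbers below are made up for illustration and do not come from any model.

```python
import numpy as np

# Hypothetical attention weights of one head at the current position over
# three context tokens: [BOS, "the", "cat"].
attn = np.array([0.90, 0.04, 0.06])

# Hypothetical transformed value vectors; the BOS one has a tiny norm.
values = np.array([
    [0.01, -0.02, 0.01, 0.00],   # BOS
    [1.50, -0.70, 0.30, 2.00],   # "the"
    [-1.00, 2.20, 0.90, -0.40],  # "cat"
])

# Norm-based contribution of each token: ||alpha_j * v_j||.
contributions = attn * np.linalg.norm(values, axis=1)

# Update written by the head into the residual stream of the current token.
head_output = attn @ values
```

Although BOS receives 90% of the attention mass, its contribution is the smallest of the three, so attending to it behaves approximately like skipping the update.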

Features in attention heads.

Sparse autoencoders have been trained on the outputs of the attention layers to better understand the features computed by each head. Results presented in Kissane et al. (2024a) show that, on a two-layer Transformer, a large number of features (76%) are non-dead, with the majority of them being interpretable (82%). Three specific features are studied in detail. The “board by induction” feature promotes the token board, and is present on the output of an induction head (Section 5.1.2), being part of the induction features family. The “in questions starting with Which” feature is instead part of the local context features, promoting the prediction of ? when Which appears in the context. Lastly, “in texts related to pets” is an example of a high-level context feature that activates for almost the entire context, with its related head attending to pet-related context tokens. Notably, Kissane et al. (2024a) detect the presence of non-induction features in the output of induction heads, providing evidence of attention head polysemanticity, initially observed by Heimersheim & Janiak (2023). Further investigations (Kissane et al., 2024b) reveal that the same three feature families appear in GPT2-Small, as well as successor features, name mover features, suppression features and duplicate token features associated with heads matching their respective behaviors. Krzyzanowski et al. (2024) conduct a finer-grained analysis of features in GPT2-Small attention heads, focusing on the top 10 features in each head and concluding that most heads perform multiple tasks, with only around 10% of them being monosemantic. Their findings point out that early layers (0-3) mainly focus on shallow syntactic features, with the following layers encoding increasingly complex syntactic features. Middle layers (5-6) contain the least interpretable features, while later layers (7-10) encode complex abstract features like time and distance relationships and high-level context concepts. The heads in the last attention block show mostly grammatical adjustments and bigram completions.

5.2 Feedforward Network Block

The dimensions in the FFN activation space (neurons), following the non-linearity, are more likely to be independently meaningful (Section 2.1.3), and have therefore been the object of study of recent interpretability works.

Neuron’s input behavior.

The behavior of neurons in language models has been extensively studied, with examinations focusing on either their input or output behavior. In the context of input behavior analysis, Voita et al. (2023) show neurons firing exclusively on specific position ranges. Other discoveries include skill neurons, whose activations are correlated with the task of the input prompt (Wang et al., 2022), and concept-specific neurons (Suau et al., 2020; 2022; Gurnee et al., 2023), whose response can be used to predict the presence of a concept in the provided context, such as whether it is Python code, French (Gurnee et al., 2023), or German (Quirke et al., 2023) text. Neurons responding to other linguistic and grammatical features have also been found (Bau et al., 2019; Durrani et al., 2023).

Neuron’s output behavior.

Regarding the output behavior of neurons, Dai et al. (2022) use the Integrated Gradients method (Section 3.1) to attribute factual next-word predictions to FFN neurons, finding knowledge neurons. The key-value memory perspective of FFNs (Section 2.1.3) offers a way to understand neurons’ weights. Specifically, using the direct logit attribution method (Section 3.2.1) we can measure a neuron’s effect on the logits. Geva et al. (2022b) show that some neurons promote the prediction of tokens associated with particular semantic and syntactic concepts. Ferrando et al. (2023) illustrate that a small set of neurons in later layers is responsible for making linguistically acceptable predictions, such as predicting the correct number of the verb, in agreement with the subject. Gurnee & Tegmark (2024) find neurons that interact with directions in the residual stream that are similar to the space and time feature directions extracted from probes. Tang et al. (2024) show language-specific neurons are key for multilingual generation, demonstrating one can steer the language of the model output by causally intervening on them. Finally, neurons suppressing improbable continuations, e.g. the repetition of the last token in the sequence, have recently been identified (Voita et al., 2023; Gurnee et al., 2024).
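Under the key-value memory view, the direct effect of a single neuron on the logits is its activation times its output weight row projected through the unembedding matrix. The sketch below uses random weights and ignores the final layer normalization; both are simplifying assumptions, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ffn, vocab = 8, 16, 10
W_out = rng.normal(size=(d_ffn, d_model))   # FFN values, one row per neuron
W_U = rng.normal(size=(d_model, vocab))     # unembedding matrix
activations = rng.normal(size=d_ffn)        # post-nonlinearity neuron activations


def neuron_logit_effect(i):
    """Direct effect of neuron i on the logits: a_i * (W_out[i] @ W_U)."""
    return activations[i] * (W_out[i] @ W_U)


# Per-neuron effects sum to the direct logit contribution of the whole FFN.
per_neuron = np.stack([neuron_logit_effect(i) for i in range(d_ffn)])
ffn_logits = (activations @ W_out) @ W_U

# Tokens most promoted and most suppressed by neuron 0 in this toy setup.
promoted = int(per_neuron[0].argmax())
suppressed = int(per_neuron[0].argmin())
```

Because the path from the FFN output to the logits is linear in this simplification, the decomposition over neurons is exact.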

Polysemantic neurons.

Recent work highlighted the presence of polysemantic neurons within language models. Notably, most early layer neurons specialize in sets of n-grams, functioning as n-gram detectors (Voita et al., 2023), with the majority of neurons firing on a large number of n-grams. Gurnee et al. (2023) suggest superposition appears in these early layers, and via sparse probing they find sparse combinations of neurons whose added activation values disentangle the detection of specific n-grams, such as the compound word “social security”, from other bigrams containing only one of the two terms. Even though polysemanticity and superposition arise in early layers, several dead neurons were observed in OPT models (Voita et al., 2023). [Footnote 20: OPT models use ReLU activation functions, allowing for zero activation values (Zhang et al., 2022).] Furthermore, Elhage et al. (2022a) hypothesize that models internally perform “de-/re-tokenization”, where neurons in early layers respond to multi-token words or compound words, mapping tokens to a more semantically meaningful representation (detokenization). In contrast, in the last layers, neurons aggregate contextual representations back into single tokens (re-tokenization) to produce the next-token prediction.

Universality of neurons.

Whether different models learn similar features remains an open question (Olah et al., 2020b). For instance, various computer vision models were found to learn Gabor filters in early layers (Olah et al., 2020a). In a recent study, Gurnee et al. (2024) investigated whether neurons respond to features similarly across different models. Their analysis used the pairwise correlation of neuron activations across GPT2 models trained from different random initializations as a proxy measure, revealing a subset of 1-5% of neurons activating on the same inputs. As expected, within the cluster of universal neurons there is a higher degree of monosemanticity. This group includes alphabet neurons, which activate in response to tokens representing individual letters and on tokens that start with the letter, supporting the re-tokenization hypothesis. Additionally, there are previous token neurons that fire based on the preceding token, as well as unigram, position, semantic, and syntax neurons. In terms of output behavior, universal neurons include attention (de-)activation neurons, responsible for controlling the amount of attention given to the BOS token by a subsequent attention head, and thus setting it as a no-op (Section 5.1.3). Lastly, Gurnee et al. (2024) hypothesize that some neurons act as entropy neurons, modulating the model’s uncertainty over the next token prediction.
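The correlation-based proxy can be sketched as follows: record activations of two models on the same tokens, compute the Pearson correlation between every pair of neurons across models, and keep the best match for each neuron. The synthetic activations and the helper name are assumptions for illustration; this is not the code of Gurnee et al. (2024).

```python
import numpy as np


def max_cross_model_correlation(acts_a, acts_b):
    """For every neuron in model A, the highest Pearson correlation with any neuron in model B.

    acts_a: (n_tokens, n_neurons_a) and acts_b: (n_tokens, n_neurons_b),
    both recorded on the same token sequence.
    """
    a = (acts_a - acts_a.mean(0)) / acts_a.std(0)
    b = (acts_b - acts_b.mean(0)) / acts_b.std(0)
    corr = a.T @ b / acts_a.shape[0]       # (n_neurons_a, n_neurons_b)
    return corr.max(axis=1)


rng = np.random.default_rng(0)
acts_a = rng.normal(size=(500, 4))         # synthetic activations of model A
acts_b = rng.normal(size=(500, 6))         # synthetic activations of model B
acts_b[:, 3] = 2.0 * acts_a[:, 1] + 0.5    # neuron 1 of A has a rescaled twin in B

scores = max_cross_model_correlation(acts_a, acts_b)
```

Neuron 1 of model A obtains a score close to one, while the remaining neurons, which have no counterpart, stay near zero. Thresholding such scores is one way to select candidate universal neurons.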

High-level structure of the role of neurons.

It has been suggested that the overall arrangement of neurons in language models mirrors the organization described in neuroscience (Elhage et al., 2022a). Early layer neurons exhibit similarities to sensory neurons, responding to shallow patterns of the input, mostly focusing on n-grams. Moving into the middle layers, activation tends to occur around more high-level concepts (Bricken et al., 2023; Gurnee et al., 2023). An example of this is the neuron identified in Elhage et al. (2022a), which represents numbers only when they refer to a number of people. Finally, later layers’ neurons bear a resemblance to motor neurons in the sense that they produce changes in the distribution of the next-token prediction, either by promoting or suppressing sets of tokens.

Features in Feedforward Networks.

SAEs are able to identify significantly more interpretable features than the model’s neurons themselves (Bricken et al., 2023), as noted both by human and automated analyses in one-layer transformers. The features detected by SAEs trained to reconstruct FFN activations (Bricken et al., 2023) appear to split into increasingly more fine-grained distinctions of the feature as more dimensions (dictionary entries) are added, demonstrating that 512 neurons can encode tens of thousands of features. Examples of features found by Bricken et al. (2023) include those firing in the presence of Arabic or Hebrew scripts and promoting tokens in those scripts, and features responding to DNA sequences or base64 strings.

5.3 Residual Stream

Figure 14: Example of the local context “President” feature in a Sparse Autoencoder (SAE) trained to reconstruct the second layer residual stream of GPT2-Small (Bloom, 2024).

We can think of the residual stream as the main communication channel in a Transformer. The “direct path” (Section 2.2) connecting the input embedding with the unembedding matrix, ${\bm{x}}{\bm{W}}_{U}$, does not move information between positions and mainly models bigram statistics (Elhage et al., 2021a), while the last biases in the network, localized in the prediction head, are shown to shift predictions according to word frequency, promoting high-frequency tokens (Kobayashi et al., 2023). However, alternative paths involve the interaction between components, which write into linear subspaces (Elhage et al., 2021a) that can be read by downstream components, or directly by the prediction head, potentially doing more complex computations. Heimersheim & Turner (2023) observed that the norm of the residual stream grows exponentially along the layers over the forward pass of multiple Transformer LMs (Millidge & Winsor, 2023; Merrill et al., 2021). A similar growth rate appears in the norm of the output matrices writing into the residual stream, ${\bm{W}}_{O}$ and ${\bm{W}}_{\text{out}}$, unlike input matrices (${\bm{W}}_{Q}$, ${\bm{W}}_{K}$, ${\bm{W}}_{V}$ and ${\bm{W}}_{\text{in}}$), which maintain constant norms along the layers. It is hypothesized that some components perform memory management to remove information stored in the residual stream. For instance, there are attention heads with OV matrices with negative eigenvalues attending to the current position, and FFN neurons whose input and output weights have large negative cosine similarity (Elhage et al., 2021a), meaning that they write a vector (FFN value) in the opposite direction to the direction they read from (FFN key). Notably, Gurnee et al. (2024) find that these neurons activate very frequently. Dao et al. (2023) evaluate a small Transformer LM and provide convincing evidence of multiple attention heads removing the information written by a first layer head.
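Candidate memory-management neurons can be located by comparing each neuron's input weights (FFN key) with its output weights (FFN value). The sketch below assumes `W_in` has shape `(d_model, d_ffn)` and `W_out` has shape `(d_ffn, d_model)`, uses random weights, and plants one erasing neuron by hand; none of this comes from a real model.

```python
import numpy as np


def in_out_cosine(W_in, W_out):
    """Cosine similarity between each neuron's key (column of W_in) and value (row of W_out)."""
    keys = W_in.T                                   # (d_ffn, d_model)
    num = np.sum(keys * W_out, axis=1)
    den = np.linalg.norm(keys, axis=1) * np.linalg.norm(W_out, axis=1)
    return num / den


rng = np.random.default_rng(0)
d_model, d_ffn = 16, 5
W_in = rng.normal(size=(d_model, d_ffn))
W_out = rng.normal(size=(d_ffn, d_model))
W_out[2] = -3.0 * W_in[:, 2]     # neuron 2 writes opposite to the direction it reads

cos = in_out_cosine(W_in, W_out)
erasing_neuron = int(cos.argmin())
```

A cosine similarity near minus one flags a neuron that, when active, subtracts from the residual stream the very direction that triggered it.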

Outlier dimensions (Kovaleva et al., 2021; Luo et al., 2021) have been identified within the residual stream. These rogue dimensions exhibit large magnitudes relative to others and are associated with the generation of anisotropic representations (Ethayarajh, 2019; Timkey & van Schijndel, 2021). Anisotropy means that the residual stream states of random pairs of tokens tend to point towards the same direction, i.e. the expected cosine similarity is close to one. Furthermore, ablating outlier dimensions has been shown to significantly decrease downstream performance (Kovaleva et al., 2021), suggesting they encode task-specific knowledge (Rudman et al., 2023). The magnitudes of these outliers have been shown to increase with model size (Dettmers et al., 2022), posing challenges for the quantization of large language models. The presence of rogue dimensions has been hypothesized to stem from optimizer choices (Elhage et al., 2023), with higher levels of regularization reducing their magnitudes (Ahmadian et al., 2023). Puccetti et al. (2022) identified a high correlation between the magnitude of the outlier dimensions found in token representations and their training frequency. They concluded that these dimensions contribute to enabling the model to focus on special tokens, which is known to be associated with “no-op” attention updates (Bondarenko et al., 2023) (see attention sinks in Section 5.1.3). In Vision Transformers, high-norm residual stream states have been identified as aggregators of global image information, appearing in patches with highly redundant information, such as those composing the image background (Darcet et al., 2024).
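The link between a rogue dimension and anisotropy can be reproduced on synthetic data. In the sketch below, anisotropy is estimated as the average cosine similarity over all pairs of states; the Gaussian states, the chosen dimension, and the offset of 100 are arbitrary assumptions for illustration.

```python
import numpy as np


def mean_pairwise_cosine(X):
    """Average cosine similarity over all distinct ordered pairs of rows in X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    n = X.shape[0]
    return (sims.sum() - n) / (n * (n - 1))   # drop the diagonal of ones


rng = np.random.default_rng(0)
states = rng.normal(size=(200, 64))      # hypothetical isotropic residual states
rogue = states.copy()
rogue[:, 7] += 100.0                     # one outlier dimension shared by every token

iso_score = mean_pairwise_cosine(states)
rogue_score = mean_pairwise_cosine(rogue)
```

The isotropic states have an average cosine similarity near zero, whereas a single large shared dimension pushes it close to one, which is the behavior described above.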

The specific features encoded within the residual stream at various layers remain uncertain, yet sparse autoencoders offer a promising avenue for improving our understanding. Recently, SAEs have been trained to reconstruct residual stream states in small language models such as GPT2-Small (Cunningham et al., 2023; Bloom & Lin, 2024; Bloom, 2024), showing highly interpretable features (Figure 14). Since residual stream states gather information about the sum of previous components’ outputs, inspecting SAE features can illuminate the process by which they are added or transformed during the forward pass. Given the type of features intermediate FFNs and attention heads interact with, we also expect the residual stream at middle layers to encode highly abstract features. Tigges et al. (2023) provide some preliminary evidence by showing that causally intervening on the residual stream in middle layers is more effective in flipping the sentiment of the output token, suggesting that the latent representation of sentiment is most prominent in the middle layers. Bloom & Lin (2024) study the features learned by an SAE in layer 8 of the 12-layer model GPT2-Small. Based on their output behavior via the logit lens (Section 4.3), the authors first find local context features promoting small sets of tokens. Secondly, they highlight the presence of partition features, which promote and suppress two distinct sets of tokens. For instance, a partition feature might promote tokens starting with capital letters and suppress those starting with lowercase letters. Finally, akin to suppression neurons (Voita et al., 2023; Gurnee et al., 2024), they note the presence of suppression features aimed at reducing the likelihood of specific sets of tokens. In line with these findings, recent studies have shown that language models create vectors representing functions or tasks given in-context examples (Hendel et al., 2023; Todd et al., 2024), which are found in intermediate layers.
In the next section, we provide a deeper overview of the interaction between different components and the resulting behavior that emerges.
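
To make the notion of a partition feature concrete, the sketch below projects a feature direction onto a toy unembedding matrix, in the spirit of the logit lens, and splits the vocabulary into promoted and suppressed tokens. The four-token vocabulary, the unembedding vectors and the feature direction are all hypothetical, and the final layer normalization is ignored for simplicity.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical unembedding vectors (one per token, d_model = 3).
unembed = {
    "The": [1.0, 0.2, 0.0],
    "Paris": [0.8, -0.1, 0.3],
    "the": [-0.9, 0.1, 0.1],
    "cat": [-0.7, 0.0, -0.2],
}

# Hypothetical SAE decoder direction for one feature.
feature_direction = [1.0, 0.0, 0.0]

def logit_effects(direction, unembedding):
    """Direct effect on each token logit of adding `direction` to the residual stream."""
    return {token: dot(direction, vec) for token, vec in unembedding.items()}

effects = logit_effects(feature_direction, unembed)
promoted = sorted(t for t, e in effects.items() if e > 0)
suppressed = sorted(t for t, e in effects.items() if e < 0)

print("promoted:", promoted)
print("suppressed:", suppressed)
```

In this toy example the feature promotes the capitalized tokens and suppresses the lowercase ones, mirroring the capitalization example given above.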

5.4 Emergent Multi-component Behaviors

In previous sections we presented some of the different mechanisms that attention heads and FFNs implement, as well as an overview of the properties of the residual stream. However, in order to explain the remarkable performance of Transformers, we also need to account for the interactions between the different components (Wen et al., 2023; Cammarata et al., 2020).

Evidence of multi-component behavior.

The induction mechanism presented in Section 5.1.2 is a clear example of two components (attention heads) composing together to complete a pattern. Recent evidence suggests that multiple attention heads work together to create “function” or “task” vectors describing the task when given in-context examples (Hendel et al., 2023; Todd et al., 2024). Intervening in the residual stream with those vectors can produce outputs in accordance with the encoded task on novel zero-shot prompts. Variengien & Winsor (2023) study in-context retrieval tasks involving answering a request where the answer can be found in the context. The authors identify a high-level mechanism that is universal across subtasks and models. Specifically, middle layers process the request, followed by a retrieval step of the entity from the context done by attention heads at later layers.

Figure 15: Simplified version of the IOI circuit in GPT2 Small discovered by Wang et al. (2023a).

Additionally, Neo et al. (2024); Yu & Ananiadou (2024) reveal that individual neurons within downstream FFNs activate according to the output of previous attention heads, interacting in specific contexts. However, the most compelling evidence of particular behaviors emerging from the interaction between multiple components is found in the circuit analysis literature (Wang et al. (2023a); Stolfo et al. (2023b); Heimersheim & Janiak (2023); Geva et al. (2023); Hanna et al. (2023), among others). As an illustration, we present the circuit found in GPT2 Small for the Indirect Object Identification (IOI) task (Wang et al., 2023a), depicted in Figure 15. In the IOI task the model is given inputs of the type “When Mary and John went to the store, John gave a drink to ___”. The initial clause introduces two names (Mary and John), followed by a secondary clause where the two people exchange an item. The correct prediction is the name not appearing in the second clause, referred to as the Indirect Object (Mary). The circuit found in GPT2 Small mainly includes:

  • Duplicity signaling: duplicate token heads at position S2, and an induction mechanism involving previous token heads at S1+1 signal the duplicity of S (John). This information is read by S-Inhibition heads at the last position, which write in the residual stream a token signal, indicating that S is repeated, and a position signal of the S1 token.

  • Name copying: name mover heads in later layers copy information from names they attend to in the context to the last residual stream. However, the signals of the previous layers S-Inhibition heads modify the query of name mover heads so that the duplicated name (in S1 and S2) is less attended, favouring the copying of the Indirect Object (IO) and therefore, pushing its prediction.
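
The effect of S-Inhibition on name mover attention can be caricatured with a two-token softmax. In the sketch below, the attention scores and the size of the inhibition term are hypothetical numbers chosen for illustration; the actual circuit modifies the query of name mover heads through the residual stream rather than subtracting a scalar.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

names = ["Mary", "John"]   # IO and S in the example prompt
base_scores = [2.0, 2.0]   # assumed: without inhibition both names are equally attended
inhibition = [0.0, 3.0]    # assumed: S-Inhibition lowers the score of the duplicated name

attn_without = softmax(base_scores)
attn_with = softmax([s - i for s, i in zip(base_scores, inhibition)])

# A copying head pushes the name it attends to most.
predicted = names[attn_with.index(max(attn_with))]

print("attention without S-Inhibition:", [round(a, 3) for a in attn_without])
print("attention with S-Inhibition:   ", [round(a, 3) for a in attn_with])
print("copied name:", predicted)
```

Without the inhibition signal the head has no reason to prefer either name; with it, most of the attention mass moves to the Indirect Object.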

Besides, Wang et al. (2023a) discovered Negative mover heads, which are instances of copy suppression heads (Section 5.1.2) downweighing the probability of the IO. While the IOI is an attention-centric circuit, examples of circuits involving both FFNs and attention heads are also present. For instance, Hanna et al. (2023) reverse-engineered the GPT2-Small circuit for the greater-than task, which involves sentences like The war lasted from the year 1814 to the year 18__, where the model must predict a year greater than 1814. The authors demonstrate that downstream FFNs compute a valid year by reading from previous attention heads, which attend to the event’s initial date.

Generality of circuits.

Prakash et al. (2024) show that the functionality of the circuit components remains consistent after fine-tuning and benefits of fine-tuning are largely derived from an improved ability of circuit components to encode important task-relevant information rather than an overall functional rearrangement. Fine-tuned activations are also found to be compatible with the base model despite no explicit tuning constraints, suggesting the process produces minimal changes in the overall representation space. The findings of Prakash et al. (2024) are additionally supported by Jain et al. (2024) in controlled settings. While a common critique of mechanistic interpretability work is the limited scope of identified circuits, Merullo et al. (2024) show that low-level findings about specific heads and higher-level findings about general algorithms implemented by Transformer models can generalize across tasks, suggesting that large language models could be explained as functions of few task-general sparse components. The results of Merullo et al. (2024) also suggest that circuits are not exclusive, i.e. the same model components might be part of several circuits. Other studied dimensions of discovered circuits include their faithfulness (Hanna et al., 2024) and their completeness (Wang et al., 2023a).

Grokking as Emergence of Task-specific Circuits.

Transformer models were observed to converge to different algorithmic solutions for tasks at hand (Zhong et al., 2023). Nanda et al. (2023a) provide convincing evidence on the relation between circuit emergence and grokking, i.e. the sudden emergence of near-perfect generalization capabilities for simple symbol manipulation tasks at late stages of model training (Power et al., 2022). Merrill et al. (2023) suggest the grokking phase transition can be seen as the emergence of a sparse circuit with generalization capabilities, replacing a dense subnetwork with low generalization capacity. According to Varma et al. (2023), this happens because dense memorizing circuits are inefficient for compressing large datasets. In contrast, generalizing circuits have a larger fixed cost but better per-example efficiency, hence being preferred in large-scale training. Huang et al. (2024b) connect the learning dynamics converging to grokking to the double descent phenomenon (Loog et al., 2020). According to this view, the emergence of specialized attention heads might be seen as a mild grokking-related phenomenon (Olsson et al., 2022; Bietti et al., 2023).

5.4.1 Factuality and hallucinations in model predictions

Intrinsic views on hallucinatory behavior.

The generation of factually incorrect or nonsensical outputs is considered a significant limitation in the practical usage of language models (Ji et al., 2023; Minaee et al., 2024). While some techniques for detecting hallucinated content rely on quantifying the uncertainty of model predictions (Varshney et al., 2023), most alternative approaches engage with model internal representations. Approaches for detecting hallucinations directly from the representations include training probes and analyzing the properties of the representations leading to hallucinations. CH-Wang et al. (2023) and Azaria & Mitchell (2023) find probing classifiers predictive of the model’s output truthfulness, achieving the highest accuracy using middle- and last-layer representations. Zou et al. (2023) and Li et al. (2023a) find “truthfulness” directions with causal influence on the model outputs, i.e. intervening in the internal representations with the found directions enhances the output truthfulness. Li et al. (2023a) locate these causal directions in the activations of specific attention heads. Chen et al. (2024b) use the eigenvalues of the covariance matrix of responses’ representations to measure the semantic consistency in embedding space across layers, while Chen et al. (2024d) observe that logit lens (Section 4.3) scores of the predicted attribute (answer) in higher-layer representations of context tokens are informative of the answer correctness.
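
The idea of a “truthfulness” direction that both separates representations and can be used for intervention is sketched below with a difference-of-means estimate. The three-dimensional activations are made-up toy data, and the difference of class means is one simple choice of direction among several; this is an illustration under those assumptions rather than the procedure of any specific cited work.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical activations collected on true and false statements.
true_acts = [[1.0, 0.2, 0.1], [0.8, -0.1, 0.0], [1.2, 0.1, -0.1]]
false_acts = [[-1.0, 0.1, 0.0], [-0.8, 0.0, 0.2], [-1.1, -0.2, 0.1]]

# Direction estimate: difference between class means, normalized to unit length.
difference = [t - f for t, f in zip(mean_vector(true_acts), mean_vector(false_acts))]
norm = math.sqrt(dot(difference, difference))
direction = [d / norm for d in difference]

def truth_score(activation):
    """Projection on the direction, usable as a simple linear probe score."""
    return dot(activation, direction)

def steer(activation, alpha):
    """Intervention: shift the activation along the direction by alpha."""
    return [a + alpha * d for a, d in zip(activation, direction)]

original = false_acts[0]
steered = steer(original, alpha=2.0)
print(f"score before: {truth_score(original):.3f}, after: {truth_score(steered):.3f}")
```

Because the direction has unit norm, shifting by alpha raises the probe score by exactly alpha; whether such a shift changes model outputs is an empirical question that the cited works address with causal experiments.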

A related area of research with overlapping goals is that of hallucination detection in machine translation (MT). An MT model is considered to hallucinate if its output contains partially or fully detached content from the source sentence (Guerreiro et al., 2023b). Prediction probabilities of the generated sequence and attention distributions have been used to detect potential errors (Fomicheva et al., 2020) and model hallucinations (Guerreiro et al., 2023a; b). Recently, methods measuring the amount of contribution from the source sentence tokens (Ferrando et al., 2022a) were found to perform on par with external methods based on semantic similarity across several categories of model hallucinations (Dale et al., 2023a; b). Detection methods show complementary performance across hallucination categories, and simple aggregation strategies for internals-based detectors outperform methods relying on external semantic similarity or quality estimation modules (Himmi et al., 2024).

The underlying mechanisms involved in the prediction of hallucinated content for LLMs remain largely unexplained. Most of the research in this area focuses on studying the ability of language models to recall facts, which we discuss in the next section.

Recall of factual associations.
Figure 16: Simplified version of the factual recall circuit.

Recent research has delved into the internal mechanisms through which language models recall factual information, which is directly related to the hallucination problem in LLMs. A common methodology involves studying tuples $(s, r, a)$, where $s$ is a subject, $r$ a relation, and $a$ an attribute. The model is prompted to predict the attribute given the subject and relation. For instance, given the prompt: “LeBron James plays the sport of”, the model is expected to predict basketball. Meng et al. (2022) and Geva et al. (2023) make use of causal interventions (Section 3.2.2) to localize a mechanism responsible for recalling factual knowledge within the language model. Early-middle FFNs located in the last subject token add information about the subject into its residual stream. On the other hand, information from the relation passes into the last token residual stream via early attention heads. Finally, later-layer attention heads extract the right attribute from the last subject residual stream. Yuksekgonul et al. (2024) find that, in similar settings, attention to relevant tokens in the prompt correlates with the LLM’s factual correctness. Importantly, the division of responsibilities between lower and upper layers was also observed in attention-less models based on the Mamba architecture (Gu & Dao, 2023; Sharma et al., 2024a). While this might be motivated by implicit context-mixing akin to Transformers’ causal self-attention (Ali et al., 2024), it suggests the organization of these mechanisms might be driven by the language modeling optimization process rather than architectural constraints.

Subsequent research has moved from localizing model behavior to studying the computations performed to solve this task. Hernandez et al. (2024) show that attributes of entities can be linearly decoded from the enriched subject residual stream, while Chughtai et al. (2024) investigate how attention heads’ OV circuits effectively decode the attributes, proposing an additive mechanism. More precisely, using the direct logit attribution by each token via the attention head (Equation 16) they identify subject heads responsible for extracting attributes from the subject independently from the relation (not attending to it), as well as relation heads that promote attributes without being causally dependent on the subject. Additionally, a group of mixed heads generally favor the correct attribute and depend on both the subject and relation. The combination of the different heads’ outputs, each proposing different sets of attributes, together with the action of some downstream FFNs resolve the correct prediction (Figure 16). Nanda et al. (2023c) provide a detailed explanation of the subject enrichment phase by studying names of athletes as subjects. They suggest that the first layers’ attention heads concatenate the athlete’s name on the final name token residual stream through addition, and subsequent FFNs map the obtained athlete’s name representation into a linear representation of the athlete’s sport that can be easily linearly extracted by the downstream attribute extraction heads.
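
The additive mechanism can be illustrated by summing per-head logit contributions for the LeBron James example. The candidate tokens and all numeric contributions below are invented for illustration; only the qualitative roles of the subject, relation and mixed heads follow the description above.

```python
# Hypothetical direct logit contributions of three head types.
subject_head = {"Los Angeles": 2.0, "basketball": 1.8, "football": 0.1}    # attributes of the subject
relation_head = {"football": 1.2, "basketball": 1.0, "Los Angeles": -0.5}  # attributes fitting the relation
mixed_head = {"basketball": 0.8, "football": 0.0, "Los Angeles": 0.0}      # depends on both

def combine(*contributions):
    """Sum the logit contributions of several components token by token."""
    totals = {}
    for contribution in contributions:
        for token, value in contribution.items():
            totals[token] = totals.get(token, 0.0) + value
    return totals

def top_token(logits):
    return max(logits, key=logits.get)

totals = combine(subject_head, relation_head, mixed_head)
print("subject head alone: ", top_token(subject_head))
print("relation head alone:", top_token(relation_head))
print("all heads combined: ", top_token(totals))
```

In this toy configuration neither the subject head nor the relation head selects the right attribute alone, while their sum does, which is the intuition behind describing the mechanism as additive.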

Merullo et al. (2023) report that for solving relational tasks, such as predicting a country’s capital given in-context examples, middle layers prepare the argument, e.g. Poland, of a $\texttt{get\_capital}()$ function that is applied downstream via an FFN update, yielding $\texttt{get\_capital}(\text{Poland}) = \text{Warsaw}$. Further research replicates Merullo et al. (2023)’s analysis in zero-shot settings (Lv et al., 2024) and finds specific attention heads “passing” the argument from the context (Poland), but also promoting the capital cities (Warsaw). Downstream FFNs “activate” relevant attention heads in the previous layer and add a vector guiding the residual stream toward the correct capital direction.

Recent works aim to shed light on how the model engages in factual recall vs. grounding. Following the aforementioned (subject, relation, attribute) structure of facts, an answer is considered to be grounded if the attribute is consistent with the information in the context of the prompt. Given prompts of the type “The capital of Poland is London. Q: What is the capital of Poland? A:___”, Yu et al. (2023a) find in-context heads and memory heads by using the difference logit attribution (Section 3.2.1, Equation 17) of attention heads. These heads favor, respectively, the in-context answer London and the memorized answer Warsaw, showing a “competition” between mechanisms (Ortu et al., 2024). Furthermore, upweighting the output of each head type reveals a bias towards one of the two answers. Similar to the in-context heads, Variengien & Winsor (2023) show that a set of downstream attention heads retrieve the correct answer (an attribute) from the context via copying, preceded by a processing of the request (a question) in middle layers. Wu et al. (2024a) study this type of head in arbitrarily long contexts, coining the term retrieval heads, and show that they are crucial for solving the Needle-in-a-Haystack tests (Kamradt, 2023). Monea et al. (2024) complement the findings of Yu et al. (2023a) and Meng et al. (2022) and show that FFNs in the last token of the subject have higher contributions on ungrounded (memorized) answers as opposed to grounded answers, while suggesting that grounding could be a more distributed process lacking a specific localization. Haviv et al. (2023) show that the recall of “memorized” idioms largely depends on the updates of the FFNs in early layers, providing further evidence of their role as a storage of memorized information. This is further observed in the study of memorized paragraphs, with lower layers exhibiting larger gradient flow (Stoehr et al., 2024). On the other hand, Sharma et al. (2024b) show that substituting the original FFN matrices with lower-rank approximations (Section 4.3) leads to improvements in model performance, especially in later layers of the model. They show that, in the factual recall task, the components with smaller singular values encode the correct semantic type of the answer but the wrong answer, thus their removal benefits the accuracy. To conclude, we draw a connection with a decoding strategy (DoLa) proposed to improve the factuality of language models (Chuang et al., 2024). DoLa contrastively compares the logit-lens next-token distributions between an early layer and a later layer (Li et al., 2023b), promoting tokens that undergo a larger probability change, suggesting that the factual knowledge injection is done in a distributed manner across the network.
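
The layer-contrast step behind DoLa can be sketched as a log-ratio between a late-layer and an early-layer next-token distribution, restricted to tokens that are plausible under the late layer. The vocabulary, the logits, the fixed choice of layers and the threshold alpha = 0.1 below are assumptions made for illustration, and details of the published method, such as how the early layer is chosen, are omitted.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def contrast_layers(late_logits, early_logits, alpha=0.1):
    """Score tokens by log p_late - log p_early, keeping only tokens whose
    late-layer probability is at least alpha times the maximum."""
    late = log_softmax(late_logits)
    early = log_softmax(early_logits)
    threshold = math.log(alpha) + max(late)
    return [
        l - e if l >= threshold else float("-inf")
        for l, e in zip(late, early)
    ]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical logit-lens outputs for the next token at two layers.
vocab = ["the", "Warsaw", "Krakow", "banana"]
early_logits = [3.0, 1.0, 1.0, 0.0]
late_logits = [3.2, 3.0, 1.5, -3.0]

scores = contrast_layers(late_logits, early_logits)
greedy_token = vocab[argmax(late_logits)]
contrast_token = vocab[argmax(scores)]
print("late-layer greedy choice:", greedy_token)
print("contrastive choice:      ", contrast_token)
```

In this toy case the function word is already likely at the early layer, so the contrast favors the token whose probability grew the most between layers, while the plausibility filter prevents very unlikely tokens from being selected.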

Factuality issues and model editing.

Factual information encoded in LMs might be incorrect from the start, or become obsolete over time. Moreover, inconsistencies have been observed when recalling factual knowledge in multilingual and cross-lingual settings (Fierro & Søgaard, 2022; Qi et al., 2023), or when factual associations are elicited using less common formulations (Berglund et al., 2023). This sparked the interest in developing model editing approaches able to perform targeted updates on model factual associations with minimal impact on other capabilities. While early approaches proposed edits based on external modules trained for knowledge editing (De Cao et al., 2021; Mitchell et al., 2022a; b), recent methods employ causal interventions (Section 3.2.2) to localize knowledge neurons (Dai et al., 2022) and FFNs in one or more layers (Meng et al., 2022; 2023), informed by factual recall mechanisms described in the previous paragraph. However, model editing approaches still present several challenges, summarized by Yao et al. (2023) and Li et al. (2024), including the risks of catastrophic forgetting (Gupta et al., 2024a; b) and downstream performance loss (Gu et al., 2024). Importantly, Hase et al. (2023) show that effective localization does not always result in improved editing results, and that distributed edits across different model sections can result in similar editing accuracy. Steerable-by-design architectures such as the Backpack Transformer (Hewitt et al., 2023) were recently proposed as possible alternatives to localization-driven methods, exploiting the linearity of component contributions (Section 4.2) as an inductive bias to enhance controllability. We refer readers to Wang et al. (2023b) for further insights on LM editing.
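
The core idea of a localized edit on an FFN weight matrix can be illustrated with a rank-one update that remaps a single key to a new value while leaving orthogonal keys untouched. The matrix, the subject key and the target value below are hypothetical, and the sketch is a simplified stand-in: published editing methods differ in how they select the key, the target value and the weighting of the update.

```python
def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def rank_one_edit(matrix, key, target):
    """Return W' = W + (target - W key) key^T / (key . key), so that W' key = target."""
    residual = [t - c for t, c in zip(target, matvec(matrix, key))]
    key_norm_sq = sum(k * k for k in key)
    return [
        [w + r * k / key_norm_sq for w, k in zip(row, key)]
        for row, r in zip(matrix, residual)
    ]

# Hypothetical FFN output matrix (2 x 3) and keys.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
subject_key = [1.0, 1.0, 0.0]   # key associated with the fact to edit
new_value = [5.0, -2.0]         # desired output encoding the new attribute
other_key = [1.0, -1.0, 3.0]    # unrelated key, orthogonal to subject_key

W_edited = rank_one_edit(W, subject_key, new_value)
print("edited key output:   ", matvec(W_edited, subject_key))
print("unrelated key before:", matvec(W, other_key))
print("unrelated key after: ", matvec(W_edited, other_key))
```

The update changes the output for the edited key exactly and leaves keys orthogonal to it unaffected; in practice keys are rarely orthogonal, which is one source of the side effects discussed above.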

6 LM Interpretability Tools

Several open-source software libraries were introduced to facilitate interpretability studies on Transformer-based LMs. In this section, we briefly summarize the most notable ones and highlight their main points of strength.

Input attribution tools.

Captum (Kokhlikyan et al., 2020) is a library in the Pytorch ecosystem providing access to several gradient and perturbation-based input attribution methods for any Pytorch-based model. It notably supports training data attribution methods (Section 3.1), and recently added several utilities for simplifying attribution analyses of generative LMs (Miglani et al., 2023). Several Captum-based tools provide convenient APIs for input attribution of Transformers-based models: Transformers Interpret (Pierse, 2021), ferret (Attanasio et al., 2023) and Ecco (Alammar, 2021) are mainly centered around language classification tasks, while Inseq (Sarti et al., 2023) is focused specifically on generative LMs and supports advanced approaches for contrastive context attribution (Sarti et al., 2024) as well as context mixing evaluation (Section 3.1). SHAP (Lundberg & Lee, 2017) is a popular toolkit mainly centered on perturbation-based input attribution methods and model-agnostic explanations for various data modalities. The Saliency (PAIR Team, 2023) library provides framework-agnostic implementations for mainly gradient-based input attribution methods. LIT (Tenney et al., 2020) is a framework-agnostic tool providing a convenient set of utilities and an intuitive interface for interpretability studies spanning input attribution, concept-based explanations and counterfactual behavior evaluation. It notably includes a visual tool for debugging complex LLM prompts (Tenney et al., 2024).

Component importance analysis tools.

Tools supporting work on circuit discovery and causal interventions play a fundamental role in mechanistic studies, balancing the complexity and model-specific nature of intervention-based methods with a broad support for various pre-trained LM architectures. TransformerLens (Nanda & Bloom, 2022) is a Pytorch-based toolkit to conduct mechanistic interpretability analyses of generative language models inspired by the closed-source Garçon library (Elhage et al., 2021b). The library reimplements popular Transformer LM architectures, preserving compatibility with the popular transformers library (Wolf et al., 2020) while also providing utilities such as hook points around model activations and attention head decomposition to facilitate custom interventions. NNsight (Fiotto-Kaufman, 2024) provides a Pytorch-compatible interface for interpretability analyses. Its usage is not restricted to Transformer models, but it provides utilities to streamline the usage of transformers checkpoints. Its main peculiarity is the ability to compile an intervention graph that can be processed through delayed execution, enabling the extraction of arbitrary internal information from large LMs hosted on remote servers. Pyvene (Wu et al., 2024c) is a Pytorch-based library supporting complex intervention schemes, such as trainable (Geiger et al., 2023a) and mid-training interventions (Geiger et al., 2022), alongside various model categories beyond Transformers. Notably, it supports the serialization of intervention schemes to simplify analyses and promotes reusability. Several tools are currently used for the development of SAEs (Section 4.2), providing overlapping sets of features. For example, SAELens (Bloom & Channin, 2024) supports advanced visualization of SAE features, while dictionary-learning (Marks & Mueller, 2023) is an actively developed tool built on top of NNsight, supporting various experimental features to address SAEs’ weaknesses. Finally, sparse-autoencoder (Cooney, 2023) provides a standard TransformerLens-compatible SAE implementation.
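
The hook-point pattern shared by several of these tools can be conveyed with a small standalone example. The `HookedPipeline` class below is a hypothetical toy written for this illustration and does not reproduce the API of TransformerLens, NNsight or Pyvene; it only shows how named activations can be cached and overwritten during a forward pass.

```python
class HookedPipeline:
    """Toy stand-in for a model exposing named hook points."""

    def __init__(self, layers):
        self.layers = layers  # list of (name, function) pairs applied in order

    def run(self, x, hooks=None):
        hooks = hooks or {}
        cache = {}
        for name, layer_fn in self.layers:
            x = layer_fn(x)
            if name in hooks:
                x = hooks[name](x)  # an intervention may replace the activation
            cache[name] = x         # store the (possibly modified) activation
        return x, cache

# Hypothetical three-step "model" operating on lists of floats.
model = HookedPipeline([
    ("embed", lambda x: [float(v) for v in x]),
    ("attn_out", lambda x: [v + 1.0 for v in x]),
    ("mlp_out", lambda x: [2.0 * v for v in x]),
])

def zero_ablate(activation):
    """Intervention that replaces an activation with zeros."""
    return [0.0 for _ in activation]

clean_out, clean_cache = model.run([1, 2, 3])
ablated_out, _ = model.run([1, 2, 3], hooks={"attn_out": zero_ablate})

print("cached attn_out:", clean_cache["attn_out"])
print("clean output:   ", clean_out)
print("ablated output: ", ablated_out)
```

Comparing the clean and ablated outputs is the basic building block of the intervention-based analyses that the libraries above implement for real architectures.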

Tools for visualizing model internals.

Several tools such as BERTViz (Vig, 2019), exBERT (Hoover et al., 2020) and InterpreT (Lal et al., 2021) were developed to visualize attention weights and activations in Transformers-based LMs. LM-Debugger (Geva et al., 2022a) is a toolkit to inspect intermediate representation updates through the lens of logit attribution (Section 3.2.1), while VISIT (Katz & Belinkov, 2023), Ecco (Alammar, 2021) and Tuned Lens (Belrose et al., 2023a; https://github.com/AlignmentResearch/tuned-lens) simplify the application of naive and learned vocabulary projections to inspect the evolution of predictions across model layers. CircuitsVis (Cooney, 2022) provides reusable Python bindings for front-end components that can be used to visualize Transformers internals and predictions, and was adopted by various interpretability tools. Penzai (Daniel Johnson, 2024) is a JAX library supporting rich visualizations of pytree data structures, including LM weights and activations. LM-TT (Tufanov et al., 2024) allows inspecting the information flow in a forward pass, facilitating the examination of the contributions of individual attention heads and feed-forward neurons. TDB (Mossing et al., 2024) is a visual interface to interpret neuron activations in LMs supporting automated interpretability techniques and SAEs. Neuronpedia (Lin & Bloom, 2024; https://neuronpedia.org) provides an open repository for visualizing activation of SAE features trained on LM residual stream states (Section 4.2). Notably, it includes a gamified experience to facilitate the annotation of human-interpretable concepts in SAE feature space. Lastly, sae-vis (McDougall & Bloom, 2024) is a SAELens-compatible library to produce feature-centric and prompt-centric interactive visualizations of SAE features.

Other notable interpretability-related tools.

The “Restricted Access Sequence Processing Language” (RASP; Weiss et al., 2021) is a sequence processing language providing a human-readable model for transformer computations. Tracr (Lindner et al., 2023) is a compiler converting RASP programs into decoder-only Transformer weights, automating the creation of small Transformer models implementing specific desired behaviors. RASP and Tracr were adopted for promoting interpretable behaviors via constrained optimization (Friedman et al., 2023) and validating the effectiveness of circuit discovery techniques (Conmy et al., 2023). Pyreft (Wu et al., 2024b; https://github.com/stanfordnlp/pyreft) is a toolkit based on Pyvene for fine-tuning and sharing trainable interventions (Section 3.2.3, Causal abstraction) aimed at optimizing LM performance on selected tasks, in a similar but more targeted and efficient way than parameter-efficient fine-tuning methods (PEFT, Han et al., 2024). Going beyond the textual modality, ViT Prisma (Joseph, 2023) is a toolkit to conduct mechanistic interpretability analyses on vision and multimodal models. Finally, MAIA (Shaham et al., 2024) is a multimodal language model augmented with tool use to automate common interpretability workflows such as neuron explanations, example synthesis and counterfactual editing.

7 Conclusion and Future Directions

In this paper, we have offered an overview of the existing interpretability methods useful for understanding Transformer-based language models, and have presented the insights they have led to. Although the focus of this work is on practical methods and findings, we acknowledge theoretical studies related to the interpretability of Transformers, such as investigations explaining in-context learning (Akyürek et al., 2023; Von Oswald et al., 2023; Xie et al., 2022), explorations of Transformers through the lens of data compression and representation learning (Yu et al., 2023b; Voita et al., 2019a), the study of Transformers’ learning dynamics (Tian et al., 2024; 2023; Tarzanagh et al., 2024), or the analyses on their generalization properties on algorithmic tasks (Nogueira et al., 2021; Anil et al., 2022; Zhou et al., 2024).

Looking forward, we believe that the ultimate test for insights collected in years of interpretability work remains their applicability in debugging and improving the safety and reliability of future models, providing developers and users with better tools to interact with them and understand the factors influencing their predictions (Longo et al., 2024). To ensure such requirements are met, future developments in interpretability research will be faced with the challenging task of moving from functionally-grounded evaluations (i.e. no human evaluation, only toy settings) to actionable insights and benefits for real-world tasks (Doshi-Velez & Kim, 2017). From an analytical standpoint, this involves moving from methods and analyses operating in model component space to human-interpretable space, i.e. from model components to features and natural language explanations, as suggested by Singh et al. (2024b), while still faithfully reflecting model behaviors (Siegel et al., 2024). Directions we deem promising in this area involve the usage of LMs as verbalizers (Feldhus et al., 2023; Bills et al., 2023; Wang et al., 2024; Chen et al., 2024c) for scaling input and component attribution analyses, especially when paired with verification mechanisms to ensure counterfactual consistency (Avitan et al., 2024), and circuit discovery methods leveraging interpretable features to enable interventions motivated by human-understandable concepts (Marks et al., 2024). More accessible insights might also unlock gains in model performance and efficiency, translating interpretability-driven insights into downstream task improvements (Wu et al., 2024b).
Importantly, interdisciplinary research grounded in the technical developments we summarize in this survey will play a key role in broadening the scope of interpretability analyses to account for the perceptual and interactive dimensions of model explanations from a human perspective (Liao et al., 2020; Dhanorkar et al., 2021; Vasconcelos et al., 2023). Ultimately, we believe that ensuring open and convenient access to the internals of advanced LMs will remain a fundamental prerequisite for future progress in this area (Bau et al., 2023; Casper et al., 2024; Hudson et al., 2024).

Acknowledgements

Javier Ferrando is supported by the Spanish Ministerio de Ciencia e Innovación through the project PID2019-107579RB-I00 / AEI / 10.13039/501100011033. Gabriele Sarti and Arianna Bisazza acknowledge the support of the Dutch Research Council (NWO) as part of the project InDeep (NWA.1292.19.399).

References

  • Abnar & Zuidema (2020) S. Abnar and W. Zuidema. Quantifying attention flow in transformers. In D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  4190–4197, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.385. URL https://aclanthology.org/2020.acl-main.385.
  • Achtibat et al. (2024) R. Achtibat, S. M. V. Hatefi, M. Dreyer, A. Jain, T. Wiegand, S. Lapuschkin, and W. Samek. AttnLRP: Attention-aware layer-wise relevance propagation for transformers, 2024.
  • Adebayo et al. (2018) J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim. Sanity checks for saliency maps. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pp.  9525–9536, Red Hook, NY, USA, 2018. Curran Associates Inc.
  • Adebayo et al. (2020) J. Adebayo, M. Muelly, I. Liccardi, and B. Kim. Debugging tests for model explanations. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.
  • Adebayo et al. (2022) J. Adebayo, M. Muelly, H. Abelson, and B. Kim. Post hoc explanations may be ineffective for detecting unknown spurious correlation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=xNOVfCCvDpM.
  • Agarwal et al. (2024) C. Agarwal, S. H. Tanneru, and H. Lakkaraju. Faithfulness vs. plausibility: On the (un)reliability of explanations from large language models. ArXiv, abs/2402.04614, 2024. URL https://api.semanticscholar.org/CorpusID:267523276.
  • Ahmadian et al. (2023) A. Ahmadian, S. Dash, H. Chen, B. Venkitesh, Z. S. Gou, P. Blunsom, A. Üstün, and S. Hooker. Intriguing properties of quantization at scale. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=IYe8j7Gy8f.
  • Akyurek et al. (2022) E. Akyurek, T. Bolukbasi, F. Liu, B. Xiong, I. Tenney, J. Andreas, and K. Guu. Towards tracing knowledge in language models back to the training data. In Y. Goldberg, Z. Kozareva, and Y. Zhang (eds.), Findings of the Association for Computational Linguistics: EMNLP 2022, pp.  2429–2446, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-emnlp.180. URL https://aclanthology.org/2022.findings-emnlp.180.
  • Akyürek et al. (2023) E. Akyürek, D. Schuurmans, J. Andreas, T. Ma, and D. Zhou. What learning algorithm is in-context learning? investigations with linear models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=0g0X4H8yN4I.
  • Akyürek et al. (2024) E. Akyürek, B. Wang, Y. Kim, and J. Andreas. In-context language learning: Architectures and algorithms, 2024.
  • Alain & Bengio (2016) G. Alain and Y. Bengio. Understanding intermediate layers using linear classifier probes. Arxiv, 2016. URL https://arxiv.org/abs/1610.01644.
  • Alammar (2021) J. Alammar. Ecco: An open source library for the explainability of transformer language models. In H. Ji, J. C. Park, and R. Xia (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pp.  249–257, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-demo.30. URL https://aclanthology.org/2021.acl-demo.30.
  • Ali et al. (2022) A. Ali, T. Schnake, O. Eberle, G. Montavon, K.-R. Müller, and L. Wolf. XAI for transformers: Better explanations through conservative propagation. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  435–451. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/ali22a.html.
  • Ali et al. (2024) A. Ali, I. Zimerman, and L. Wolf. The hidden attention of mamba models, 2024.
  • Amara et al. (2024) K. Amara, R. Sevastjanova, and M. El-Assady. Syntaxshap: Syntax-aware explainability method for text generation. ArXiv, abs/2402.09259, 2024. URL https://api.semanticscholar.org/CorpusID:267657673.
  • Anil et al. (2022) C. Anil, Y. Wu, A. J. Andreassen, A. Lewkowycz, V. Misra, V. V. Ramasesh, A. Slone, G. Gur-Ari, E. Dyer, and B. Neyshabur. Exploring length generalization in large language models. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=zSkYVeX7bC4.
  • Anonymous (2024) Anonymous. The disagreement problem in explainable machine learning: A practitioner’s perspective. Submitted to Transactions on Machine Learning Research, 2024. URL https://openreview.net/forum?id=jESY2WTZCe. Under review.
  • Arditi et al. (2024) A. Arditi, O. Balcells, A. Syed, W. Gurnee, and N. Nanda. Refusal in llms is mediated by a single direction. Alignment Forum, 2024. URL https://alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction.
  • Arora et al. (2024) A. Arora, D. Jurafsky, and C. Potts. Causalgym: Benchmarking causal interpretability methods on linguistic tasks, 2024. URL https://arxiv.org/abs/2402.12560.
  • Arora et al. (2018) S. Arora, Y. Li, Y. Liang, T. Ma, and A. Risteski. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495, 2018. doi: 10.1162/tacl_a_00034. URL https://aclanthology.org/Q18-1034.
  • Atanasova et al. (2020) P. Atanasova, J. G. Simonsen, C. Lioma, and I. Augenstein. A diagnostic study of explainability techniques for text classification. In B. Webber, T. Cohn, Y. He, and Y. Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  3256–3274, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.263. URL https://aclanthology.org/2020.emnlp-main.263.
  • Atanasova et al. (2023) P. Atanasova, O.-M. Camburu, C. Lioma, T. Lukasiewicz, J. G. Simonsen, and I. Augenstein. Faithfulness tests for natural language explanations. In A. Rogers, J. Boyd-Graber, and N. Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.  283–294, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-short.25. URL https://aclanthology.org/2023.acl-short.25.
  • Attanasio et al. (2023) G. Attanasio, E. Pastor, C. Di Bonaventura, and D. Nozza. ferret: a framework for benchmarking explainers on transformers. In D. Croce and L. Soldaini (eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pp.  256–266, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.eacl-demo.29. URL https://aclanthology.org/2023.eacl-demo.29.
  • Avitan et al. (2024) M. Avitan, R. Cotterell, Y. Goldberg, and S. Ravfogel. What changed? converting representational interventions to natural language. Arxiv, 2024. URL https://arxiv.org/abs/2402.11355.
  • Azaria & Mitchell (2023) A. Azaria and T. Mitchell. The internal state of an LLM knows when it’s lying. In H. Bouamor, J. Pino, and K. Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp.  967–976, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.68. URL https://aclanthology.org/2023.findings-emnlp.68.
  • Ba et al. (2016) J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. Arxiv, 2016. URL https://arxiv.org/abs/1607.06450.
  • Bach et al. (2015) S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7):1–46, 07 2015. doi: 10.1371/journal.pone.0130140. URL https://doi.org/10.1371/journal.pone.0130140.
  • Bai et al. (2022) Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan. Training a helpful and harmless assistant with reinforcement learning from human feedback. ArXiv, 2022.
  • Balduzzi et al. (2017) D. Balduzzi, M. Frean, L. Leary, J. P. Lewis, K. W.-D. Ma, and B. McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? In D. Precup and Y. W. Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp.  342–350. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/balduzzi17b.html.
  • Bastings & Filippova (2020) J. Bastings and K. Filippova. The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? In A. Alishahi, Y. Belinkov, G. Chrupała, D. Hupkes, Y. Pinter, and H. Sajjad (eds.), Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pp.  149–155, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.blackboxnlp-1.14. URL https://aclanthology.org/2020.blackboxnlp-1.14.
  • Bastings et al. (2022) J. Bastings, S. Ebert, P. Zablotskaia, A. Sandholm, and K. Filippova. “will you find these shortcuts?” a protocol for evaluating the faithfulness of input salience methods for text classification. In Y. Goldberg, Z. Kozareva, and Y. Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  976–991, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.64. URL https://aclanthology.org/2022.emnlp-main.64.
  • Bau et al. (2019) A. Bau, Y. Belinkov, H. Sajjad, N. Durrani, F. Dalvi, and J. Glass. Identifying and controlling important neurons in neural machine translation. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1z-PsR5KX.
  • Bau et al. (2020) D. Bau, J.-Y. Zhu, H. Strobelt, A. Lapedriza, B. Zhou, and A. Torralba. Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences, 117(48):30071–30078, 2020. doi: 10.1073/pnas.1907375117. URL https://www.pnas.org/doi/abs/10.1073/pnas.1907375117.
  • Bau et al. (2023) D. Bau, B. C. Wallace, A. Guha, J. Bell, and C. Brodley. National deep inference facility for very large language models (ndif). United States National Science Foundation, 2023.
  • Belinkov (2022) Y. Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, March 2022. doi: 10.1162/coli_a_00422. URL https://aclanthology.org/2022.cl-1.7.
  • Belinkov & Glass (2019) Y. Belinkov and J. Glass. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72, 2019. doi: 10.1162/tacl_a_00254. URL https://aclanthology.org/Q19-1004.
  • Belinkov et al. (2017) Y. Belinkov, N. Durrani, F. Dalvi, H. Sajjad, and J. Glass. What do neural machine translation models learn about morphology? In R. Barzilay and M.-Y. Kan (eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  861–872, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1080. URL https://aclanthology.org/P17-1080.
  • Belrose (2023) N. Belrose. Least-squares concept erasure with oracle concept labels. EleutherAI Blog, 2023. URL https://blog.eleuther.ai/oracle-leace/.
  • Belrose et al. (2023a) N. Belrose, Z. Furman, L. Smith, D. Halawi, I. Ostrovsky, L. McKinney, S. Biderman, and J. Steinhardt. Eliciting latent predictions from transformers with the tuned lens. Arxiv, 2023a. URL https://arxiv.org/abs/2303.08112.
  • Belrose et al. (2023b) N. Belrose, D. Schneider-Joseph, S. Ravfogel, R. Cotterell, E. Raff, and S. Biderman. LEACE: Perfect linear concept erasure in closed form. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b. URL https://openreview.net/forum?id=awIpKpwTwF.
  • Bengio et al. (2003) Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, March 2003. ISSN 1532-4435.
  • Bereska & Gavves (2024) L. Bereska and E. Gavves. Mechanistic interpretability for ai safety – a review. ArXiv, 2024. URL https://arxiv.org/abs/2404.14082.
  • Berglund et al. (2023) L. Berglund, M. Tong, M. Kaufmann, M. Balesni, A. C. Stickland, T. Korbak, and O. Evans. The reversal curse: Llms trained on "a is b" fail to learn "b is a". ArXiv, abs/2309.12288, 2023. URL https://api.semanticscholar.org/CorpusID:262083829.
  • Bibal et al. (2022) A. Bibal, R. Cardon, D. Alfter, R. Wilkens, X. Wang, T. François, and P. Watrin. Is attention explanation? an introduction to the debate. In S. Muresan, P. Nakov, and A. Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  3889–3900, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.269. URL https://aclanthology.org/2022.acl-long.269.
  • Biderman et al. (2023) S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, A. Skowron, L. Sutawika, and O. Van Der Wal. Pythia: a suite for analyzing large language models across training and scaling. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
  • Bietti et al. (2023) A. Bietti, V. Cabannes, D. Bouchacourt, H. Jegou, and L. Bottou. Birth of a transformer: A memory viewpoint. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp.  1560–1588. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/0561738a239a995c8cd2ef0e50cfa4fd-Paper-Conference.pdf.
  • Bills et al. (2023) S. Bills, N. Cammarata, D. Mossing, H. Tillman, L. Gao, G. Goh, I. Sutskever, J. Leike, J. Wu, and W. Saunders. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023.
  • Bilodeau et al. (2024) B. Bilodeau, N. Jaques, P. W. Koh, and B. Kim. Impossibility theorems for feature attribution. Proceedings of the National Academy of Sciences, 121(2):e2304406120, 2024. doi: 10.1073/pnas.2304406120. URL https://www.pnas.org/doi/abs/10.1073/pnas.2304406120.
  • Bloom (2024) J. Bloom. Open source sparse autoencoders for all residual stream layers of GPT2 small. AI Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream.
  • Bloom & Channin (2024) J. Bloom and D. Channin. Saelens. GitHub repository, 2024. URL https://github.com/jbloomAus/SAELens.
  • Bloom & Lin (2024) J. Bloom and J. Lin. Understanding SAE features with the logit lens. AI Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/qykrYY6rXXM7EEs8Q/understanding-sae-features-with-the-logit-lens.
  • Bolukbasi et al. (2021) T. Bolukbasi, A. Pearce, A. Yuan, A. Coenen, E. Reif, F. Viégas, and M. Wattenberg. An interpretability illusion for bert, 2021.
  • Bondarenko et al. (2023) Y. Bondarenko, M. Nagel, and T. Blankevoort. Quantizable transformers: Removing outliers by helping attention heads do nothing. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=sbusw6LD41.
  • Bricken et al. (2023) T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/monosemantic-features/index.html.
  • Brody et al. (2023) S. Brody, U. Alon, and E. Yahav. On the expressivity role of LayerNorm in transformers’ attention. In A. Rogers, J. Boyd-Graber, and N. Okazaki (eds.), Findings of the Association for Computational Linguistics: ACL 2023, pp.  14211–14221, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.895. URL https://aclanthology.org/2023.findings-acl.895.
  • Brown et al. (2023) D. Brown, N. Vyas, and Y. Bansal. On privileged and convergent bases in neural network representations, 2023.
  • Brown et al. (2020) T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
  • Brunet et al. (2019) M.-E. Brunet, C. Alkalay-Houlihan, A. Anderson, and R. Zemel. Understanding the origins of bias in word embeddings. In K. Chaudhuri and R. Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp.  803–811. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/brunet19a.html.
  • Brunner et al. (2020) G. Brunner, Y. Liu, D. Pascual, O. Richter, M. Ciaramita, and R. Wattenhofer. On identifiability in transformers. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BJg1f6EFDB.
  • Burns et al. (2023) C. Burns, H. Ye, D. Klein, and J. Steinhardt. Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=ETKGuby0hcs.
  • Cammarata et al. (2020) N. Cammarata, S. Carter, G. Goh, C. Olah, M. Petrov, L. Schubert, C. Voss, B. Egan, and S. K. Lim. Thread: Circuits. Distill, 2020. doi: 10.23915/distill.00024. URL https://distill.pub/2020/circuits.
  • Cancedda (2024) N. Cancedda. Spectral filters, dark signals, and attention sinks, 2024. URL https://arxiv.org/abs/2402.09221.
  • Casper et al. (2024) S. Casper, C. Ezell, C. Siegmann, N. Kolt, T. L. Curtis, B. Bucknall, A. A. Haupt, K. Wei, J. Scheurer, M. Hobbhahn, L. Sharkey, S. Krishna, M. von Hagen, S. Alberti, A. Chan, Q. Sun, M. Gerovitch, D. Bau, M. Tegmark, D. Krueger, and D. Hadfield-Menell. Black-box access is insufficient for rigorous ai audits. ArXiv, abs/2401.14446, 2024. URL https://api.semanticscholar.org/CorpusID:267301601.
  • CH-Wang et al. (2023) S. CH-Wang, B. V. Durme, J. Eisner, and C. Kedzie. Do androids know they’re only dreaming of electric sheep?, 2023. URL https://arxiv.org/abs/2312.17249v1.
  • Chan et al. (2022) L. Chan, A. Garriga-Alonso, N. Goldowsky-Dill, R. Greenblatt, J. Nitishinskaya, A. Radhakrishnan, B. Shlegeris, and N. Thomas. Causal scrubbing, a method for rigorously testing interpretability hypotheses. AI Alignment Forum, 2022. URL https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing.
  • Chefer et al. (2021) H. Chefer, S. Gur, and L. Wolf. Transformer interpretability beyond attention visualization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  782–791, June 2021.
  • Chen et al. (2024a) A. Chen, R. Shwartz-Ziv, K. Cho, M. L. Leavitt, and N. Saphra. Sudden drops in the loss: Syntax acquisition, phase transitions, and simplicity bias in MLMs. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=MO5PiKHELW.
  • Chen et al. (2024b) C. Chen, K. Liu, Z. Chen, Y. Gu, Y. Wu, M. Tao, Z. Fu, and J. Ye. INSIDE: LLMs’ internal states retain the power of hallucination detection. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=Zj12nzlQbz.
  • Chen et al. (2024c) H. Chen, C. Vondrick, and C. Mao. Selfie: Self-interpretation of large language model embeddings, 2024c. URL https://arxiv.org/abs/2403.10949.
  • Chen et al. (2024d) S. Chen, M. Xiong, J. Liu, Z. Wu, T. Xiao, S. Gao, and J. He. In-context sharpness as alerts: An inner representation perspective for hallucination mitigation, 2024d.
  • Chowdhery et al. (2023) A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, and N. Fiedel. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023. URL http://jmlr.org/papers/v24/22-1144.html.
  • Chowdhury et al. (2024) A. G. Chowdhury, M. M. Islam, V. Kumar, F. H. Shezan, V. Kumar, V. Jain, and A. Chadha. Breaking down the defenses: A comparative survey of attacks on large language models, 2024.
  • Chuang et al. (2024) Y.-S. Chuang, Y. Xie, H. Luo, Y. Kim, J. R. Glass, and P. He. Dola: Decoding by contrasting layers improves factuality in large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Th6NyL07na.
  • Chughtai et al. (2024) B. Chughtai, A. Cooney, and N. Nanda. Summing up the facts: Additive mechanisms behind factual recall in llms, 2024. URL https://arxiv.org/abs/2402.07321.
  • Clark et al. (2019) K. Clark, U. Khandelwal, O. Levy, and C. D. Manning. What does BERT look at? an analysis of BERT’s attention. In T. Linzen, G. Chrupała, Y. Belinkov, and D. Hupkes (eds.), Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp.  276–286, Florence, Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-4828. URL https://aclanthology.org/W19-4828.
  • Conerly et al. (2024) T. Conerly, A. Templeton, T. Bricken, J. Marcus, and T. Henighan. Circuits updates - april 2024. update on how we train saes. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/april-update/index.html.
  • Conmy et al. (2023) A. Conmy, A. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp.  16318–16352. Curran Associates, Inc., 2023. URL https://papers.nips.cc/paper_files/paper/2023/hash/34e1dbe95d34d7ebaf99b9bcaeb5b2be-Abstract-Conference.html.
  • Cooney (2022) A. Cooney. CircuitsVis, December 2022. URL https://github.com/alan-cooney/CircuitsVis.
  • Cooney (2023) A. Cooney. Sparse autoencoder. GitHub repository, 2023. URL https://github.com/ai-safety-foundation/sparse_autoencoder.
  • Correia et al. (2019) G. M. Correia, V. Niculae, and A. F. T. Martins. Adaptively sparse transformers. In K. Inui, J. Jiang, V. Ng, and X. Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  2174–2184, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1223. URL https://aclanthology.org/D19-1223.
  • Costa-jussà et al. (2023) M. Costa-jussà, E. Smith, C. Ropers, D. Licht, J. Maillard, J. Ferrando, and C. Escolano. Toxicity in multilingual machine translation at scale. In H. Bouamor, J. Pino, and K. Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp.  9570–9586, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.642. URL https://aclanthology.org/2023.findings-emnlp.642.
  • Covert et al. (2021) I. Covert, S. Lundberg, and S.-I. Lee. Explaining by removing: A unified framework for model explanation. Journal of Machine Learning Research, 22(209):1–90, 2021. URL http://jmlr.org/papers/v22/20-1316.html.
  • Crabbé & van der Schaar (2023) J. Crabbé and M. van der Schaar. Evaluating the robustness of interpretability methods through explanation invariance and equivariance. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=5UwnKSgY6u.
  • Csordás et al. (2021) R. Csordás, S. van Steenkiste, and J. Schmidhuber. Are neural nets modular? inspecting functional modularity through differentiable weight masks. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=7uVcpu-gMD.
  • Cunningham et al. (2023) H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey. Sparse autoencoders find highly interpretable features in language models. Arxiv, 2023. URL https://arxiv.org/abs/2309.08600.
  • Dai et al. (2022) D. Dai, L. Dong, Y. Hao, Z. Sui, B. Chang, and F. Wei. Knowledge neurons in pretrained transformers. In S. Muresan, P. Nakov, and A. Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  8493–8502, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.581. URL https://aclanthology.org/2022.acl-long.581.
  • Dale et al. (2023a) D. Dale, E. Voita, L. Barrault, and M. R. Costa-jussà. Detecting and mitigating hallucinations in machine translation: Model internal workings alone do well, sentence similarity even better. In A. Rogers, J. Boyd-Graber, and N. Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  36–50, Toronto, Canada, July 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.3. URL https://aclanthology.org/2023.acl-long.3.
  • Dale et al. (2023b) D. Dale, E. Voita, J. Lam, P. Hansanti, C. Ropers, E. Kalbassi, C. Gao, L. Barrault, and M. Costa-jussà. HalOmi: A manually annotated benchmark for multilingual hallucination and omission detection in machine translation. In H. Bouamor, J. Pino, and K. Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  638–653, Singapore, December 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.42. URL https://aclanthology.org/2023.emnlp-main.42.
  • Dalvi et al. (2019) F. Dalvi, N. Durrani, H. Sajjad, Y. Belinkov, A. Bau, and J. Glass. What is one grain of sand in the desert? analyzing individual neurons in deep nlp models. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’19/IAAI’19/EAAI’19. AAAI Press, 2019. ISBN 978-1-57735-809-1. doi: 10.1609/aaai.v33i01.33016309. URL https://doi.org/10.1609/aaai.v33i01.33016309.
  • Daniel Johnson (2024) P. A. Daniel Johnson. Penzai. GitHub repository, 2024. URL https://github.com/google-deepmind/penzai.
  • Dao et al. (2023) J. Dao, Y.-T. Lau, C. Rager, and J. Janiak. An adversarial example for direct logit attribution: Memory management in gelu-4l, 2023.
  • Dar et al. (2023) G. Dar, M. Geva, A. Gupta, and J. Berant. Analyzing transformers in embedding space. In A. Rogers, J. Boyd-Graber, and N. Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  16124–16170, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.893. URL https://aclanthology.org/2023.acl-long.893.
  • Darcet et al. (2024) T. Darcet, M. Oquab, J. Mairal, and P. Bojanowski. Vision transformers need registers. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=2dnO3LLiJ1.
  • Dauphin et al. (2017) Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp.  933–941. JMLR.org, 2017.
  • De Cao et al. (2020) N. De Cao, M. S. Schlichtkrull, W. Aziz, and I. Titov. How do decisions emerge across layers in neural models? interpretation with differentiable masking. In B. Webber, T. Cohn, Y. He, and Y. Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  3243–3255, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.262. URL https://aclanthology.org/2020.emnlp-main.262.
  • De Cao et al. (2021) N. De Cao, W. Aziz, and I. Titov. Editing factual knowledge in language models. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  6491–6506, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.522. URL https://aclanthology.org/2021.emnlp-main.522.
  • De Cao et al. (2022) N. De Cao, L. Schmid, D. Hupkes, and I. Titov. Sparse interventions in language models with differentiable masking. In J. Bastings, Y. Belinkov, Y. Elazar, D. Hupkes, N. Saphra, and S. Wiegreffe (eds.), Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pp.  16–27, Abu Dhabi, United Arab Emirates (Hybrid), December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.blackboxnlp-1.2. URL https://aclanthology.org/2022.blackboxnlp-1.2.
  • Deiseroth et al. (2023) B. Deiseroth, M. Deb, S. Weinbach, M. Brack, P. Schramowski, and K. Kersting. Atman: Understanding transformer predictions through memory efficient attention manipulation. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp.  63437–63460. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/c83bc020a020cdeb966ed10804619664-Paper-Conference.pdf.
  • Denil et al. (2015) M. Denil, A. Demiraj, and N. de Freitas. Extraction of salient sentences from labelled documents. Arxiv, 2015. URL https://arxiv.org/abs/1412.6815.
  • Dettmers et al. (2022) T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer. Gpt3.int8(): 8-bit matrix multiplication for transformers at scale. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp.  30318–30332. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/c3ba4962c05c49636d4c6206a97e9c8a-Paper-Conference.pdf.
  • Devlin et al. (2019) J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, and T. Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
  • DeYoung et al. (2020) J. DeYoung, S. Jain, N. F. Rajani, E. Lehman, C. Xiong, R. Socher, and B. C. Wallace. ERASER: A benchmark to evaluate rationalized NLP models. In D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  4443–4458, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.408. URL https://aclanthology.org/2020.acl-main.408.
  • Dhamdhere et al. (2019) K. Dhamdhere, M. Sundararajan, and Q. Yan. How important is a neuron. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SylKoo0cKm.
  • Dhanorkar et al. (2021) S. Dhanorkar, C. T. Wolf, K. Qian, A. Xu, L. Popa, and Y. Li. Who needs to know what, when?: Broadening the explainable ai (xai) design space by looking at explanations across the ai lifecycle. In Proceedings of the 2021 ACM Designing Interactive Systems Conference, DIS ’21, pp.  1591–1602, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450384766. doi: 10.1145/3461778.3462131. URL https://doi.org/10.1145/3461778.3462131.
  • Din et al. (2023) A. Y. Din, T. Karidi, L. Choshen, and M. Geva. Jump to conclusions: Short-cutting transformers with linear transformations. Arxiv, 2023. URL https://arxiv.longhoe.net/abs/2303.09435.
  • Ding & Koehn (2021) S. Ding and P. Koehn. Evaluating saliency methods for neural language models. In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  5034–5052, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.399. URL https://aclanthology.org/2021.naacl-main.399.
  • Doshi-Velez & Kim (2017) F. Doshi-Velez and B. Kim. Towards a rigorous science of interpretable machine learning, 2017. URL https://arxiv.longhoe.net/abs/1702.08608.
  • Durrani et al. (2023) N. Durrani, F. Dalvi, and H. Sajjad. Discovering salient neurons in deep nlp models. Journal of Machine Learning Research, 24(362):1–40, 2023. URL http://jmlr.org/papers/v24/23-0074.html.
  • Elazar et al. (2021) Y. Elazar, S. Ravfogel, A. Jacovi, and Y. Goldberg. Amnesic probing: Behavioral explanation with amnesic counterfactuals. Transactions of the Association for Computational Linguistics, 9:160–175, March 2021. ISSN 2307-387X. doi: 10.1162/tacl_a_00359. URL https://doi.org/10.1162/tacl_a_00359.
  • Elhage et al. (2021a) N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, N. DasSarma, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021a. URL https://transformer-circuits.pub/2021/framework/index.html.
  • Elhage et al. (2021b) N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, N. DasSarma, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah. Garcon. Transformer Circuits Thread, 2021b. URL https://transformer-circuits.pub/2021/garcon/index.html.
  • Elhage et al. (2022a) N. Elhage, T. Hume, C. Olsson, N. Nanda, T. Henighan, S. Johnston, S. ElShowk, N. Joseph, N. DasSarma, B. Mann, D. Hernandez, A. Askell, K. Ndousse, A. Jones, D. Drain, A. Chen, Y. Bai, D. Ganguli, L. Lovitt, Z. Hatfield-Dodds, J. Kernion, T. Conerly, S. Kravec, S. Fort, S. Kadavath, J. Jacobson, E. Tran-Johnson, J. Kaplan, J. Clark, T. Brown, S. McCandlish, D. Amodei, and C. Olah. Softmax linear units. Transformer Circuits Thread, 2022a. URL https://transformer-circuits.pub/2022/solu/index.html.
  • Elhage et al. (2022b) N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, R. Grosse, S. McCandlish, J. Kaplan, D. Amodei, M. Wattenberg, and C. Olah. Toy models of superposition. Transformer Circuits Thread, 2022b. URL https://transformer-circuits.pub/2022/toy_model/index.html.
  • Elhage et al. (2023) N. Elhage, R. Lasenby, and C. Olah. Privileged bases in the transformer residual stream. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/privileged-basis/index.html.
  • Enguehard (2023) J. Enguehard. Sequential integrated gradients: a simple but effective method for explaining language models. In A. Rogers, J. Boyd-Graber, and N. Okazaki (eds.), Findings of the Association for Computational Linguistics: ACL 2023, pp.  7555–7565, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.477. URL https://aclanthology.org/2023.findings-acl.477.
  • Ethayarajh (2019) K. Ethayarajh. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In K. Inui, J. Jiang, V. Ng, and X. Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  55–65, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1006. URL https://aclanthology.org/D19-1006.
  • Ethayarajh & Jurafsky (2021) K. Ethayarajh and D. Jurafsky. Attention flows are shapley value explanations. In C. Zong, F. Xia, W. Li, and R. Navigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp.  49–54, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-short.8. URL https://aclanthology.org/2021.acl-short.8.
  • Feldhus et al. (2023) N. Feldhus, L. Hennig, M. D. Nasert, C. Ebert, R. Schwarzenberg, and S. Möller. Saliency map verbalization: Comparing feature importance representations from model-free and instruction-based methods. In B. Dalvi Mishra, G. Durrett, P. Jansen, D. Neves Ribeiro, and J. Wei (eds.), Proceedings of the 1st Workshop on Natural Language Reasoning and Structured Explanations (NLRSE), pp.  30–46, Toronto, Canada, June 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.nlrse-1.4. URL https://aclanthology.org/2023.nlrse-1.4.
  • Ferrando & Costa-jussà (2021) J. Ferrando and M. R. Costa-jussà. Attention weights in transformer NMT fail aligning words between sequences but largely explain model predictions. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih (eds.), Findings of the Association for Computational Linguistics: EMNLP 2021, pp.  434–443, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-emnlp.39. URL https://aclanthology.org/2021.findings-emnlp.39.
  • Ferrando & Voita (2024) J. Ferrando and E. Voita. Information flow routes: Automatically interpreting language models at scale. Arxiv, 2024. URL https://arxiv.longhoe.net/abs/2403.00824.
  • Ferrando et al. (2022a) J. Ferrando, G. I. Gállego, B. Alastruey, C. Escolano, and M. R. Costa-jussà. Towards opening the black box of neural machine translation: Source and target interpretations of the transformer. In Y. Goldberg, Z. Kozareva, and Y. Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  8756–8769, Abu Dhabi, United Arab Emirates, December 2022a. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.599. URL https://aclanthology.org/2022.emnlp-main.599.
  • Ferrando et al. (2022b) J. Ferrando, G. I. Gállego, and M. R. Costa-jussà. Measuring the mixing of contextual information in the transformer. In Y. Goldberg, Z. Kozareva, and Y. Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  8698–8714, Abu Dhabi, United Arab Emirates, December 2022b. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.595. URL https://aclanthology.org/2022.emnlp-main.595.
  • Ferrando et al. (2023) J. Ferrando, G. I. Gállego, I. Tsiamas, and M. R. Costa-jussà. Explaining how transformers use context to build predictions. In A. Rogers, J. Boyd-Graber, and N. Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  5486–5513, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.301. URL https://aclanthology.org/2023.acl-long.301.
  • Fierro & Søgaard (2022) C. Fierro and A. Søgaard. Factual consistency of multilingual pretrained language models. In S. Muresan, P. Nakov, and A. Villavicencio (eds.), Findings of the Association for Computational Linguistics: ACL 2022, pp.  3046–3052, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.240. URL https://aclanthology.org/2022.findings-acl.240.
  • Fiotto-Kaufman (2024) J. Fiotto-Kaufman. nnsight: The package for interpreting and manipulating the internals of deep learned models, 2024. URL https://github.com/JadenFiotto-Kaufman/nnsight.
  • Fomicheva et al. (2020) M. Fomicheva, S. Sun, L. Yankovskaya, F. Blain, F. Guzmán, M. Fishel, N. Aletras, V. Chaudhary, and L. Specia. Unsupervised quality estimation for neural machine translation. Transactions of the Association for Computational Linguistics, 8:539–555, 2020. doi: 10.1162/tacl_a_00330. URL https://aclanthology.org/2020.tacl-1.35.
  • Friedman et al. (2023) D. Friedman, A. Wettig, and D. Chen. Learning transformer programs. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36. Curran Associates, Inc., 2023. URL https://openreview.net/forum?id=Pe9WxkN8Ff.
  • Geiger et al. (2020) A. Geiger, K. Richardson, and C. Potts. Neural natural language inference models partially embed theories of lexical entailment and negation. In A. Alishahi, Y. Belinkov, G. Chrupała, D. Hupkes, Y. Pinter, and H. Sajjad (eds.), Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pp.  163–173, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.blackboxnlp-1.16. URL https://aclanthology.org/2020.blackboxnlp-1.16.
  • Geiger et al. (2021) A. Geiger, H. Lu, T. Icard, and C. Potts. Causal abstractions of neural networks. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp.  9574–9586. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/4f5c422f4d49a5a807eda27434231040-Paper.pdf.
  • Geiger et al. (2022) A. Geiger, Z. Wu, H. Lu, J. Rozner, E. Kreiss, T. Icard, N. D. Goodman, and C. Potts. Inducing causal structure for interpretable neural networks, 2022. URL https://arxiv.longhoe.net/abs/2112.00826.
  • Geiger et al. (2023a) A. Geiger, C. Potts, and T. Icard. Causal abstraction for faithful model interpretation, 2023a. URL https://arxiv.longhoe.net/abs/2301.04709.
  • Geiger et al. (2023b) A. Geiger, Z. Wu, C. Potts, T. Icard, and N. D. Goodman. Finding alignments between interpretable causal variables and distributed neural representations, 2023b. URL https://arxiv.longhoe.net/abs/2303.02536.
  • Gemma Team et al. (2024) Gemma Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, P. Tafti, L. Hussenot, P. G. Sessa, A. Chowdhery, A. Roberts, A. Barua, A. Botev, A. Castro-Ros, A. Slone, A. Héliou, A. Tacchetti, A. Bulanova, A. Paterson, B. Tsai, B. Shahriari, C. L. Lan, C. A. Choquette-Choo, C. Crepy, D. Cer, D. Ippolito, D. Reid, E. Buchatskaya, E. Ni, E. Noland, G. Yan, G. Tucker, G.-C. Muraru, G. Rozhdestvenskiy, H. Michalewski, I. Tenney, I. Grishchenko, J. Austin, J. Keeling, J. Labanowski, J.-B. Lespiau, J. Stanway, J. Brennan, J. Chen, J. Ferret, J. Chiu, J. Mao-Jones, K. Lee, K. Yu, K. Millican, L. L. Sjoesund, L. Lee, L. Dixon, M. Reid, M. Mikuła, M. Wirth, M. Sharman, N. Chinaev, N. Thain, O. Bachem, O. Chang, O. Wahltinez, P. Bailey, P. Michel, P. Yotov, R. Chaabouni, R. Comanescu, R. Jana, R. Anil, R. McIlroy, R. Liu, R. Mullins, S. L. Smith, S. Borgeaud, S. Girgin, S. Douglas, S. Pandya, S. Shakeri, S. De, T. Klimenko, T. Hennigan, V. Feinberg, W. Stokowiec, Y. hui Chen, Z. Ahmed, Z. Gong, T. Warkentin, L. Peran, M. Giang, C. Farabet, O. Vinyals, J. Dean, K. Kavukcuoglu, D. Hassabis, Z. Ghahramani, D. Eck, J. Barral, F. Pereira, E. Collins, A. Joulin, N. Fiedel, E. Senter, A. Andreev, and K. Kenealy. Gemma: Open models based on gemini research and technology. ArXiv, 2024. URL https://arxiv.longhoe.net/abs/2403.08295.
  • Geva et al. (2021) M. Geva, R. Schuster, J. Berant, and O. Levy. Transformer feed-forward layers are key-value memories. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  5484–5495, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.446. URL https://aclanthology.org/2021.emnlp-main.446.
  • Geva et al. (2022a) M. Geva, A. Caciularu, G. Dar, P. Roit, S. Sadde, M. Shlain, B. Tamir, and Y. Goldberg. LM-debugger: An interactive tool for inspection and intervention in transformer-based language models. In W. Che and E. Shutova (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp.  12–21, Abu Dhabi, UAE, December 2022a. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-demos.2. URL https://aclanthology.org/2022.emnlp-demos.2.
  • Geva et al. (2022b) M. Geva, A. Caciularu, K. Wang, and Y. Goldberg. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Y. Goldberg, Z. Kozareva, and Y. Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  30–45, Abu Dhabi, United Arab Emirates, December 2022b. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.3. URL https://aclanthology.org/2022.emnlp-main.3.
  • Geva et al. (2023) M. Geva, J. Bastings, K. Filippova, and A. Globerson. Dissecting recall of factual associations in auto-regressive language models. In H. Bouamor, J. Pino, and K. Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  12216–12235, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.751. URL https://aclanthology.org/2023.emnlp-main.751.
  • Ghandeharioun et al. (2024) A. Ghandeharioun, A. Caciularu, A. Pearce, L. Dixon, and M. Geva. Patchscopes: A unifying framework for inspecting hidden representations of language models. Arxiv, 2024. URL https://arxiv.longhoe.net/abs/2401.06102v2.
  • Goldowsky-Dill et al. (2023) N. Goldowsky-Dill, C. MacLeod, L. Sato, and A. Arora. Localizing model behavior with path patching. Arxiv, 2023. URL https://arxiv.longhoe.net/abs/2304.05969.
  • Gould et al. (2024) R. Gould, E. Ong, G. Ogden, and A. Conmy. Successor heads: Recurring, interpretable attention heads in the wild. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=kvcbV8KQsi.
  • Groeneveld et al. (2024) D. Groeneveld, I. Beltagy, P. Walsh, A. Bhagia, R. Kinney, O. Tafjord, A. H. Jha, H. Ivison, I. Magnusson, Y. Wang, S. Arora, D. Atkinson, R. Authur, K. R. Chandu, A. Cohan, J. Dumas, Y. Elazar, Y. Gu, J. Hessel, T. Khot, W. Merrill, J. Morrison, N. Muennighoff, A. Naik, C. Nam, M. E. Peters, V. Pyatkin, A. Ravichander, D. Schwenk, S. Shah, W. Smith, E. Strubell, N. Subramani, M. Wortsman, P. Dasigi, N. Lambert, K. Richardson, L. Zettlemoyer, J. Dodge, K. Lo, L. Soldaini, N. A. Smith, and H. Hajishirzi. Olmo: Accelerating the science of language models, 2024.
  • Grosse et al. (2023) R. B. Grosse, J. Bae, C. Anil, N. Elhage, A. Tamkin, A. Tajdini, B. Steiner, D. Li, E. Durmus, E. Perez, E. Hubinger, K. Lukošiūtė, K. Nguyen, N. Joseph, S. McCandlish, J. Kaplan, and S. Bowman. Studying large language model generalization with influence functions. ArXiv, abs/2308.03296, 2023. URL https://api.semanticscholar.org/CorpusID:260682872.
  • Gu & Dao (2023) A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2023.
  • Gu et al. (2024) J.-C. Gu, H. Xu, J.-Y. Ma, P. Lu, Z.-H. Ling, K.-W. Chang, and N. Peng. Model editing can hurt general abilities of large language models. ArXiv, abs/2401.04700, 2024. URL https://api.semanticscholar.org/CorpusID:266899568.
  • Guerner et al. (2023) C. Guerner, A. Svete, T. Liu, A. Warstadt, and R. Cotterell. A geometric notion of causal probing, 2023.
  • Guerreiro et al. (2023a) N. M. Guerreiro, P. Colombo, P. Piantanida, and A. Martins. Optimal transport for unsupervised hallucination detection in neural machine translation. In A. Rogers, J. Boyd-Graber, and N. Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  13766–13784, Toronto, Canada, July 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.770. URL https://aclanthology.org/2023.acl-long.770.
  • Guerreiro et al. (2023b) N. M. Guerreiro, E. Voita, and A. Martins. Looking for a needle in a haystack: A comprehensive study of hallucinations in neural machine translation. In A. Vlachos and I. Augenstein (eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp.  1059–1075, Dubrovnik, Croatia, May 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.eacl-main.75. URL https://aclanthology.org/2023.eacl-main.75.
  • Gupta et al. (2015) A. Gupta, G. Boleda, M. Baroni, and S. Padó. Distributional vectors encode referential attributes. In L. Màrquez, C. Callison-Burch, and J. Su (eds.), Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp.  12–21, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-1002. URL https://aclanthology.org/D15-1002.
  • Gupta et al. (2024a) A. Gupta, A. Rao, and G. K. Anumanchipalli. Model editing at scale leads to gradual and catastrophic forgetting. ArXiv, abs/2401.07453, 2024a. URL https://api.semanticscholar.org/CorpusID:266999650.
  • Gupta et al. (2024b) A. Gupta, D. Sajnani, and G. Anumanchipalli. A unified framework for model editing, 2024b.
  • Gurnee (2024) W. Gurnee. Sae reconstruction errors are (empirically) pathological. AI Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/rZPiuFxESMxCDHe4B/sae-reconstruction-errors-are-empirically-pathological.
  • Gurnee & Tegmark (2024) W. Gurnee and M. Tegmark. Language models represent space and time. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=jE8xbmvFin.
  • Gurnee et al. (2023) W. Gurnee, N. Nanda, M. Pauly, K. Harvey, D. Troitskii, and D. Bertsimas. Finding neurons in a haystack: Case studies with sparse probing. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=JYs1R9IMJr.
  • Gurnee et al. (2024) W. Gurnee, T. Horsley, Z. C. Guo, T. R. Kheirkhah, Q. Sun, W. Hathaway, N. Nanda, and D. Bertsimas. Universal neurons in gpt2 language models, 2024.
  • Guu et al. (2023) K. Guu, A. Webson, E. Pavlick, L. Dixon, I. Tenney, and T. Bolukbasi. Simfluence: Modeling the influence of individual training examples by simulating training runs. Arxiv, 2023. URL https://arxiv.longhoe.net/abs/2303.08114.
  • Hammoudeh & Lowd (2022) Z. Hammoudeh and D. Lowd. Training data influence analysis and estimation: A survey, 2022.
  • Han et al. (2020) X. Han, B. C. Wallace, and Y. Tsvetkov. Explaining black box predictions and unveiling data artifacts through influence functions. In D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  5553–5563, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.492. URL https://aclanthology.org/2020.acl-main.492.
  • Han et al. (2024) Z. Han, C. Gao, J. Liu, J. Zhang, and S. Q. Zhang. Parameter-efficient fine-tuning for large models: A comprehensive survey, 2024.
  • Hanna et al. (2023) M. Hanna, O. Liu, and A. Variengien. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp.  76033–76060. Curran Associates, Inc., 2023. URL https://papers.nips.cc/paper_files/paper/2023/hash/efbba7719cc5172d175240f24be11280-Abstract-Conference.html.
  • Hanna et al. (2024) M. Hanna, S. Pezzelle, and Y. Belinkov. Have faith in faithfulness: Going beyond circuit overlap when finding model mechanisms, 2024. URL https://arxiv.longhoe.net/abs/2403.17806.
  • Hase et al. (2023) P. Hase, M. Bansal, B. Kim, and A. Ghandeharioun. Does localization inform editing? surprising differences in causality-based localization vs. knowledge editing in language models. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp.  17643–17668. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/3927bbdcf0e8d1fa8aa23c26f358a281-Paper-Conference.pdf.
  • Haviv et al. (2023) A. Haviv, I. Cohen, J. Gidron, R. Schuster, Y. Goldberg, and M. Geva. Understanding transformer memorization recall through idioms. In A. Vlachos and I. Augenstein (eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp.  248–264, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.eacl-main.19. URL https://aclanthology.org/2023.eacl-main.19.
  • He et al. (2024) Z. He, X. Ge, Q. Tang, T. Sun, Q. Cheng, and X. Qiu. Dictionary learning improves patch-free circuit discovery in mechanistic interpretability: A case study on othello-gpt. ArXiv, abs/2402.12201, 2024. URL https://api.semanticscholar.org/CorpusID:267751496.
  • Heimersheim & Janiak (2023) S. Heimersheim and J. Janiak. A circuit for python docstrings in a 4-layer attention-only transformer. AI Alignment Forum, 2023. URL https://www.alignmentforum.org/posts/u6KXXmKFbXfWzoAXn/a-circuit-for-python-docstrings-in-a-4-layer-attention-only.
  • Heimersheim & Nanda (2024) S. Heimersheim and N. Nanda. How to use and interpret activation patching. Arxiv, 2024. URL https://arxiv.longhoe.net/abs/2404.15255.
  • Heimersheim & Turner (2023) S. Heimersheim and A. Turner. Residual stream norms grow exponentially over the forward pass. AI Alignment Forum, 2023. URL https://www.alignmentforum.org/posts/8mizBCm3dyc432nK8/residual-stream-norms-grow-exponentially-over-the-forward.
  • Hendel et al. (2023) R. Hendel, M. Geva, and A. Globerson. In-context learning creates task vectors. Arxiv, 2023. URL https://arxiv.longhoe.net/abs/2310.15916.
  • Hernandez et al. (2024) E. Hernandez, A. S. Sharma, T. Haklay, K. Meng, M. Wattenberg, J. Andreas, Y. Belinkov, and D. Bau. Linearity of relation decoding in transformer language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=w7LU2s14kE.
  • Hewitt & Liang (2019) J. Hewitt and P. Liang. Designing and interpreting probes with control tasks. In K. Inui, J. Jiang, V. Ng, and X. Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  2733–2743, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1275. URL https://aclanthology.org/D19-1275.
  • Hewitt & Manning (2019) J. Hewitt and C. D. Manning. A structural probe for finding syntax in word representations. In J. Burstein, C. Doran, and T. Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  4129–4138, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1419. URL https://aclanthology.org/N19-1419.
  • Hewitt et al. (2023) J. Hewitt, J. Thickstun, C. Manning, and P. Liang. Backpack language models. In A. Rogers, J. Boyd-Graber, and N. Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  9103–9125, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.506. URL https://aclanthology.org/2023.acl-long.506.
  • Himmi et al. (2024) A. Himmi, G. Staerman, M. Picot, P. Colombo, and N. M. Guerreiro. Enhanced hallucination detection in neural machine translation through simple detector aggregation, 2024.
  • Hoffmann et al. (2022) J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, O. Vinyals, J. Rae, and L. Sifre. An empirical analysis of compute-optimal large language model training. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp.  30016–30030. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/c1e2faff6f588870935f114ebe04a3e5-Abstract-Conference.html.
  • Holtzman et al. (2021) A. Holtzman, P. West, V. Shwartz, Y. Choi, and L. Zettlemoyer. Surface form competition: Why the highest probability answer isn’t always right. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  7038–7051, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.564. URL https://aclanthology.org/2021.emnlp-main.564.
  • Hoover et al. (2020) B. Hoover, H. Strobelt, and S. Gehrmann. exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformer Models. In A. Celikyilmaz and T.-H. Wen (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp.  187–196, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-demos.22. URL https://aclanthology.org/2020.acl-demos.22.
  • Htut et al. (2019) P. M. Htut, J. Phang, S. Bordia, and S. R. Bowman. Do attention heads in bert track syntactic dependencies? Arxiv, 2019. URL https://arxiv.longhoe.net/abs/1911.12246.
  • Huang et al. (2023) J. Huang, A. Geiger, K. D’Oosterlinck, Z. Wu, and C. Potts. Rigorously assessing natural language explanations of neurons. In Y. Belinkov, S. Hao, J. Jumelet, N. Kim, A. McCarthy, and H. Mohebbi (eds.), Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pp.  317–331, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.blackboxnlp-1.24. URL https://aclanthology.org/2023.blackboxnlp-1.24.
  • Huang et al. (2024a) J. Huang, Z. Wu, C. Potts, M. Geva, and A. Geiger. Ravel: Evaluating interpretability methods on disentangling language model representations, 2024a. URL https://arxiv.longhoe.net/abs/2402.17700.
  • Huang et al. (2024b) Y. Huang, S. Hu, X. Han, Z. Liu, and M. Sun. Unified view of grokking, double descent and emergent abilities: A perspective from circuits competition, 2024b.
  • Hudson et al. (2024) N. Hudson, J. G. Pauloski, M. Baughman, A. Kamatar, M. Sakarvadia, L. Ward, R. Chard, A. Bauer, M. Levental, W. Wang, W. Engler, O. P. Skelly, B. Blaiszik, R. Stevens, K. Chard, and I. Foster. Trillion parameter ai serving infrastructure for scientific discovery: A survey and vision. Arxiv, 2024. URL https://arxiv.longhoe.net/abs/2402.03480.
  • Hupkes et al. (2018) D. Hupkes, S. Veldhoen, and W. Zuidema. Visualisation and ‘diagnostic classifiers’ reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research, 61(1):907–926, 2018. ISSN 1076-9757.
  • Jain et al. (2024) S. Jain, R. Kirk, E. S. Lubana, R. P. Dick, H. Tanaka, T. Rocktäschel, E. Grefenstette, and D. Krueger. What happens when you fine-tuning your model? mechanistic analysis of procedurally generated tasks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=A0HKeKl4Nl.
  • Jain & Wallace (2019) S. Jain and B. C. Wallace. Attention is not Explanation. In J. Burstein, C. Doran, and T. Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  3543–3556, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1357. URL https://aclanthology.org/N19-1357.
  • Jastrzębski et al. (2018) S. Jastrzębski, D. Arpit, N. Ballas, V. Verma, T. Che, and Y. Bengio. Residual connections encourage iterative inference, 2018.
  • Jermyn & Templeton (2024) A. Jermyn and A. Templeton. Circuits updates - january 2024. ghost grads: An improvement on resampling. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/jan-update/index.html.
  • Ji et al. (2023) Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), March 2023. ISSN 0360-0300. doi: 10.1145/3571730. URL https://doi.org/10.1145/3571730.
  • Jiang et al. (2024) Y. Jiang, G. Rajendran, P. Ravikumar, B. Aragam, and V. Veitch. On the origins of linear representations in large language models, 2024.
  • Joseph (2023) S. Joseph. Vit prisma: A mechanistic interpretability library for vision transformers. GitHub repository, 2023. URL https://github.com/soniajoseph/vit-prisma.
  • Kamradt (2023) G. Kamradt. Needle in a haystack - pressure testing llms. GitHub repository, 2023. URL https://github.com/gkamradt/LLMTest_NeedleInAHaystack.
  • Kaplan et al. (2020) J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models, 2020. URL https://arxiv.longhoe.net/abs/2001.08361.
  • Katz & Belinkov (2023) S. Katz and Y. Belinkov. VISIT: Visualizing and interpreting the semantic information flow of transformers. In H. Bouamor, J. Pino, and K. Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp.  14094–14113, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.939. URL https://aclanthology.org/2023.findings-emnlp.939.
  • Katz et al. (2024) S. Katz, Y. Belinkov, M. Geva, and L. Wolf. Backward lens: Projecting language model gradients into the vocabulary space, 2024.
  • Kim et al. (2018) B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, and R. Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In J. Dy and A. Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp.  2668–2677. PMLR, 2018. URL https://proceedings.mlr.press/v80/kim18d.html.
  • Kissane et al. (2024a) C. Kissane, R. Krzyzanowski, A. Conmy, and N. Nanda. Sparse autoencoders work on attention layer outputs. AI Alignment Forum, 2024a. URL https://www.alignmentforum.org/posts/DtdzGwFh9dCfsekZZ.
  • Kissane et al. (2024b) C. Kissane, R. Krzyzanowski, A. Conmy, and N. Nanda. Attention saes scale to gpt-2 small. AI Alignment Forum, 2024b. URL https://www.alignmentforum.org/posts/FSTRedtjuHa4Gfdbr/attention-saes-scale-to-gpt-2-small.
  • Kobayashi et al. (2020) G. Kobayashi, T. Kuribayashi, S. Yokoi, and K. Inui. Attention is not only a weight: Analyzing transformers with vector norms. In B. Webber, T. Cohn, Y. He, and Y. Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  7057–7075, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.574. URL https://aclanthology.org/2020.emnlp-main.574.
  • Kobayashi et al. (2021) G. Kobayashi, T. Kuribayashi, S. Yokoi, and K. Inui. Incorporating Residual and Normalization Layers into Analysis of Masked Language Models. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  4547–4568, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.373. URL https://aclanthology.org/2021.emnlp-main.373.
  • Kobayashi et al. (2023) G. Kobayashi, T. Kuribayashi, S. Yokoi, and K. Inui. Transformer language models handle word frequency in prediction head. In A. Rogers, J. Boyd-Graber, and N. Okazaki (eds.), Findings of the Association for Computational Linguistics: ACL 2023, pp.  4523–4535, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.276. URL https://aclanthology.org/2023.findings-acl.276.
  • Kobayashi et al. (2024) G. Kobayashi, T. Kuribayashi, S. Yokoi, and K. Inui. Analyzing feed-forward blocks in transformers through the lens of attention map. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=mYWsyTuiRp.
  • Koh & Liang (2017) P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In D. Precup and Y. W. Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp.  1885–1894. PMLR, 06–11 Aug 2017. URL https://proceedings.mlr.press/v70/koh17a.html.
  • Köhn (2015) A. Köhn. What’s in an embedding? analyzing word embeddings through multilingual evaluation. In L. Màrquez, C. Callison-Burch, and J. Su (eds.), Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp.  2067–2073, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-1246. URL https://aclanthology.org/D15-1246.
  • Kokhlikyan et al. (2020) N. Kokhlikyan, V. Miglani, M. Martin, E. Wang, B. Alsallakh, J. Reynolds, A. Melnikov, N. Kliushkina, C. Araya, S. Yan, and O. Reblitz-Richardson. Captum: A unified and generic model interpretability library for pytorch. Arxiv, 2020. URL https://arxiv.org/abs/2009.07896.
  • Kovaleva et al. (2019) O. Kovaleva, A. Romanov, A. Rogers, and A. Rumshisky. Revealing the dark secrets of BERT. In K. Inui, J. Jiang, V. Ng, and X. Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  4365–4374, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1445. URL https://aclanthology.org/D19-1445.
  • Kovaleva et al. (2021) O. Kovaleva, S. Kulshreshtha, A. Rogers, and A. Rumshisky. BERT busters: Outlier dimensions that disrupt transformers. In C. Zong, F. Xia, W. Li, and R. Navigli (eds.), Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp.  3392–3405, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-acl.300. URL https://aclanthology.org/2021.findings-acl.300.
  • Kramár et al. (2024) J. Kramár, T. Lieberum, R. Shah, and N. Nanda. Atp*: An efficient and scalable method for localizing llm behaviour to components, 2024. URL https://arxiv.org/abs/2403.00745.
  • Krzyzanowski et al. (2024) R. Krzyzanowski, C. Kissane, A. Conmy, and N. Nanda. We inspected every head in GPT-2 small using saes so you don’t have to. AI Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/xmegeW5mqiBsvoaim/we-inspected-every-head-in-gpt-2-small-using-saes-so-you-don.
  • Kwon et al. (2024) Y. Kwon, E. Wu, K. Wu, and J. Zou. Datainf: Efficiently estimating data influence in loRA-tuned LLMs and diffusion models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=9m02ib92Wz.
  • Lal et al. (2021) V. Lal, A. Ma, E. Aflalo, P. Howard, A. Simoes, D. Korat, O. Pereg, G. Singer, and M. Wasserblat. InterpreT: An interactive visualization tool for interpreting transformers. In D. Gkatzia and D. Seddah (eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pp.  135–142, Online, April 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.eacl-demos.17. URL https://aclanthology.org/2021.eacl-demos.17.
  • Langedijk et al. (2023) A. Langedijk, H. Mohebbi, G. Sarti, W. Zuidema, and J. Jumelet. Decoderlens: Layerwise interpretation of encoder-decoder transformers. ArXiv, abs/2310.03686, 2023. URL https://api.semanticscholar.org/CorpusID:263671583.
  • Lanham et al. (2023) T. Lanham, A. Chen, A. Radhakrishnan, B. Steiner, C. E. Denison, D. Hernandez, D. Li, E. Durmus, E. Hubinger, J. Kernion, K. Lukošiūtė, K. Nguyen, N. Cheng, N. Joseph, N. Schiefer, O. Rausch, R. Larson, S. McCandlish, S. Kundu, S. Kadavath, S. Yang, T. Henighan, T. D. Maxwell, T. Telleen-Lawton, T. Hume, Z. Hatfield-Dodds, J. Kaplan, J. Brauner, S. Bowman, and E. Perez. Measuring faithfulness in chain-of-thought reasoning. ArXiv, abs/2307.13702, 2023. URL https://api.semanticscholar.org/CorpusID:259953372.
  • Leino et al. (2018) K. Leino, S. Sen, A. Datta, M. Fredrikson, and L. Li. Influence-directed explanations for deep convolutional networks. In 2018 IEEE International Test Conference (ITC), pp.  1–8, 2018. doi: 10.1109/TEST.2018.8624792.
  • Lepori et al. (2023) M. A. Lepori, T. Serre, and E. Pavlick. Uncovering intermediate variables in transformers using circuit probing, 2023.
  • Li et al. (2016) J. Li, X. Chen, E. Hovy, and D. Jurafsky. Visualizing and understanding neural models in NLP. In K. Knight, A. Nenkova, and O. Rambow (eds.), Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  681–691, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1082. URL https://aclanthology.org/N16-1082.
  • Li et al. (2017) J. Li, W. Monroe, and D. Jurafsky. Understanding neural networks through representation erasure, 2017.
  • Li et al. (2023a) K. Li, O. Patel, F. Viégas, H. Pfister, and M. Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a. URL https://openreview.net/forum?id=aLLuYpn83y.
  • Li et al. (2023b) X. L. Li, A. Holtzman, D. Fried, P. Liang, J. Eisner, T. Hashimoto, L. Zettlemoyer, and M. Lewis. Contrastive decoding: Open-ended text generation as optimization. In A. Rogers, J. Boyd-Graber, and N. Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  12286–12312, Toronto, Canada, July 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.687. URL https://aclanthology.org/2023.acl-long.687.
  • Li et al. (2024) Z. Li, N. Zhang, Y. Yao, M. Wang, X. Chen, and H. Chen. Unveiling the pitfalls of knowledge editing for large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=fNktD3ib16.
  • Liao et al. (2020) Q. V. Liao, D. Gruen, and S. Miller. Questioning the ai: Informing design practices for explainable ai user experiences. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI ’20, pp.  1–15, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450367080. doi: 10.1145/3313831.3376590. URL https://doi.org/10.1145/3313831.3376590.
  • Lieberum et al. (2023) T. Lieberum, M. Rahtz, J. Kramár, N. Nanda, G. Irving, R. Shah, and V. Mikulik. Does circuit analysis interpretability scale? evidence from multiple choice capabilities in chinchilla. Arxiv, 2023. URL https://arxiv.org/abs/2307.09458.
  • Lin & Bloom (2024) J. Lin and J. Bloom. Announcing neuronpedia: Platform for accelerating research into sparse autoencoders. AI Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/BaEQoxHhWPrkinmxd/announcing-neuronpedia-platform-for-accelerating-research.
  • Lin et al. (2019) Y. Lin, Y. C. Tan, and R. Frank. Open sesame: Getting inside BERT’s linguistic knowledge. In T. Linzen, G. Chrupała, Y. Belinkov, and D. Hupkes (eds.), Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp.  241–253, Florence, Italy, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-4825. URL https://aclanthology.org/W19-4825.
  • Lindner et al. (2023) D. Lindner, J. Kramar, S. Farquhar, M. Rahtz, T. McGrath, and V. Mikulik. Tracr: Compiled transformers as a laboratory for interpretability. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp.  37876–37899. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/771155abaae744e08576f1f3b4b7ac0d-Paper-Conference.pdf.
  • Lipton (2018) Z. C. Lipton. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue, 16(3):31–57, jun 2018. ISSN 1542-7730. doi: 10.1145/3236386.3241340. URL https://doi.org/10.1145/3236386.3241340.
  • Liu et al. (2019) N. F. Liu, M. Gardner, Y. Belinkov, M. E. Peters, and N. A. Smith. Linguistic knowledge and transferability of contextual representations. In J. Burstein, C. Doran, and T. Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  1073–1094, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1112. URL https://aclanthology.org/N19-1112.
  • Liu et al. (2024) Q. Liu, Y. Chai, S. Wang, Y. Sun, K. Wang, and H. Wu. On training data influence of gpt models. Arxiv, 2024. URL https://arxiv.org/abs/2404.07840.
  • Longo et al. (2024) L. Longo, M. Brcic, F. Cabitza, J. Choi, R. Confalonieri, J. D. Ser, R. Guidotti, Y. Hayashi, F. Herrera, A. Holzinger, R. Jiang, H. Khosravi, F. Lecue, G. Malgieri, A. Páez, W. Samek, J. Schneider, T. Speith, and S. Stumpf. Explainable artificial intelligence (xai) 2.0: A manifesto of open challenges and interdisciplinary research directions. Information Fusion, 106:102301, 2024. ISSN 1566-2535. doi: 10.1016/j.inffus.2024.102301. URL https://www.sciencedirect.com/science/article/pii/S1566253524000794.
  • Loog et al. (2020) M. Loog, T. Viering, A. Mey, J. H. Krijthe, and D. M. J. Tax. A brief prehistory of double descent. Proceedings of the National Academy of Sciences, 117(20):10625–10626, 2020. doi: 10.1073/pnas.2001875117. URL https://www.pnas.org/doi/abs/10.1073/pnas.2001875117.
  • Lundberg & Lee (2017) S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf.
  • Luo et al. (2021) Z. Luo, A. Kulmizev, and X. Mao. Positional artefacts propagate through masked language model embeddings. In C. Zong, F. Xia, W. Li, and R. Navigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.  5312–5327, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.413. URL https://aclanthology.org/2021.acl-long.413.
  • Lv et al. (2024) A. Lv, K. Zhang, Y. Chen, Y. Wang, L. Liu, J.-R. Wen, J. Xie, and R. Yan. Interpreting key mechanisms of factual recall in transformer-based language models. Computing Research Repository, arXiv:2403.19521, 2024. URL https://arxiv.org/abs/2403.19521.
  • Lyu et al. (2024) Q. Lyu, M. Apidianaki, and C. Callison-Burch. Towards Faithful Model Explanation in NLP: A Survey. Computational Linguistics, pp.  1–70, 01 2024. ISSN 0891-2017. doi: 10.1162/coli_a_00511. URL https://doi.org/10.1162/coli_a_00511.
  • MacDiarmid et al. (2024) M. MacDiarmid, T. Maxwell, N. Schiefer, J. Mu, J. Kaplan, D. Duvenaud, S. Bowman, A. Tamkin, E. Perez, M. Sharma, C. Denison, and E. Hubinger. Simple probes can catch sleeper agents. Anthropic, 2024. URL https://www.anthropic.com/news/probes-catch-sleeper-agents.
  • Madsen et al. (2022) A. Madsen, S. Reddy, and S. Chandar. Post-hoc interpretability for neural nlp: A survey. ACM Computing Surveys, 55(8), 2022. ISSN 0360-0300. doi: 10.1145/3546577. URL https://doi.org/10.1145/3546577.
  • Madsen et al. (2024) A. Madsen, S. Chandar, and S. Reddy. Are self-explanations from large language models faithful? ArXiv, abs/2401.07927, 2024. URL https://api.semanticscholar.org/CorpusID:266999774.
  • Makelov et al. (2024) A. Makelov, G. Lange, A. Geiger, and N. Nanda. Is this the subspace you are looking for? an interpretability illusion for subspace activation patching. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Ebt7JgMHv1.
  • Marks & Mueller (2023) S. Marks and A. Mueller. Dictionary learning. GitHub repository, 2023. URL https://github.com/saprmarks/dictionary_learning.
  • Marks & Tegmark (2023) S. Marks and M. Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets, 2023. URL https://arxiv.org/abs/2310.06824.
  • Marks et al. (2024) S. Marks, C. Rager, E. J. Michaud, Y. Belinkov, D. Bau, and A. Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. Computing Research Repository, arXiv:2403.19647, 2024. URL https://arxiv.org/abs/2403.19647.
  • McCoy et al. (2019) T. McCoy, E. Pavlick, and T. Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In A. Korhonen, D. Traum, and L. Màrquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  3428–3448, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1334. URL https://aclanthology.org/P19-1334.
  • McDougall (2023) C. McDougall. Six (and a half) intuitions for SVD. Callum McDougall Blog, 2023. URL https://www.perfectlynormal.co.uk/blog-kl-divergence.
  • McDougall & Bloom (2024) C. McDougall and J. Bloom. Sae-vis: Announcement post. LessWrong, 2024. URL https://www.lesswrong.com/posts/nAhy6ZquNY7AD3RkD/sae-vis-announcement-post-1.
  • McDougall et al. (2023) C. McDougall, A. Conmy, C. Rushing, T. McGrath, and N. Nanda. Copy suppression: Comprehensively understanding an attention head. Arxiv, 2023. URL https://arxiv.org/abs/2310.04625.
  • McGrath et al. (2022) T. McGrath, A. Kapishnikov, N. Tomašev, A. Pearce, M. Wattenberg, D. Hassabis, B. Kim, U. Paquet, and V. Kramnik. Acquisition of chess knowledge in alphazero. Proceedings of the National Academy of Sciences, 119(47):e2206625119, 2022. doi: 10.1073/pnas.2206625119. URL https://www.pnas.org/doi/abs/10.1073/pnas.2206625119.
  • McGrath et al. (2023) T. McGrath, M. Rahtz, J. Kramar, V. Mikulik, and S. Legg. The hydra effect: Emergent self-repair in language model computations. Arxiv, 2023. URL https://arxiv.org/abs/2307.15771.
  • Meng et al. (2022) K. Meng, D. Bau, A. Andonian, and Y. Belinkov. Locating and editing factual associations in GPT. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp.  17359–17372. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/6f1d43d5a82a37e89b0665b33bf3a182-Abstract-Conference.html.
  • Meng et al. (2023) K. Meng, A. S. Sharma, A. J. Andonian, Y. Belinkov, and D. Bau. Mass-editing memory in a transformer. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=MkbcAHIYgyS.
  • Merrill et al. (2021) W. Merrill, V. Ramanujan, Y. Goldberg, R. Schwartz, and N. A. Smith. Effects of parameter norm growth during transformer training: Inductive bias from gradient descent. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  1766–1781, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.133. URL https://aclanthology.org/2021.emnlp-main.133.
  • Merrill et al. (2023) W. Merrill, N. Tsilivis, and A. Shukla. A tale of two circuits: Grokking as competition of sparse and dense subnetworks. ArXiv, abs/2303.11873, 2023. URL https://api.semanticscholar.org/CorpusID:257636667.
  • Merullo et al. (2023) J. Merullo, C. Eickhoff, and E. Pavlick. A mechanism for solving relational tasks in transformer language models, 2023. URL https://arxiv.org/abs/2305.16130.
  • Merullo et al. (2024) J. Merullo, C. Eickhoff, and E. Pavlick. Circuit component reuse across tasks in transformer language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=fpoAYV6Wsk.
  • Michel et al. (2019) P. Michel, O. Levy, and G. Neubig. Are sixteen heads really better than one? In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/2c601ad9d2ff9bc8b282670cdd54f69f-Paper.pdf.
  • Mickus et al. (2022) T. Mickus, D. Paperno, and M. Constant. How to dissect a Muppet: The structure of transformer embedding spaces. Transactions of the Association for Computational Linguistics, 10:981–996, 2022. doi: 10.1162/tacl_a_00501. URL https://aclanthology.org/2022.tacl-1.57.
  • Miglani et al. (2023) V. Miglani, A. Yang, A. Markosyan, D. Garcia-Olano, and N. Kokhlikyan. Using captum to explain generative language models. In L. Tan, D. Milajevs, G. Chauhan, J. Gwinnup, and E. Rippeth (eds.), Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pp.  165–173, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.nlposs-1.19. URL https://aclanthology.org/2023.nlposs-1.19.
  • Mikolov et al. (2013) T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html.
  • Millidge & Black (2022) B. Millidge and S. Black. The singular value decompositions of transformer weight matrices are highly interpretable. AI Alignment Forum, 2022. URL https://www.alignmentforum.org/posts/mkbGjzxD8d8XqKHzA/the-singular-value-decompositions-of-transformer-weight.
  • Millidge & Winsor (2023) B. Millidge and E. Winsor. Basic facts about language model internals. AI Alignment Forum, 2023. URL https://www.alignmentforum.org/posts/PDLfpRwSynu73mxGw/basic-facts-about-language-model-internals-1.
  • Minaee et al. (2024) S. Minaee, T. Mikolov, N. Nikzad, M. Chenaghlu, R. Socher, X. Amatriain, and J. Gao. Large language models: A survey, 2024. URL https://arxiv.org/abs/2402.06196.
  • Mitchell et al. (2022a) E. Mitchell, C. Lin, A. Bosselut, C. Finn, and C. D. Manning. Fast model editing at scale. In International Conference on Learning Representations, 2022a. URL https://openreview.net/forum?id=0DcZxeWfOPt.
  • Mitchell et al. (2022b) E. Mitchell, C. Lin, A. Bosselut, C. D. Manning, and C. Finn. Memory-based model editing at scale. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  15817–15831. PMLR, 17–23 Jul 2022b. URL https://proceedings.mlr.press/v162/mitchell22a.html.
  • Modarressi et al. (2022) A. Modarressi, M. Fayyaz, Y. Yaghoobzadeh, and M. T. Pilehvar. GlobEnc: Quantifying global token attribution by incorporating the whole encoder layer in transformers. In M. Carpuat, M.-C. de Marneffe, and I. V. Meza Ruiz (eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  258–271, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.19. URL https://aclanthology.org/2022.naacl-main.19.
  • Modarressi et al. (2023) A. Modarressi, M. Fayyaz, E. Aghazadeh, Y. Yaghoobzadeh, and M. T. Pilehvar. DecompX: Explaining transformers decisions by propagating token decomposition. In A. Rogers, J. Boyd-Graber, and N. Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  2649–2664, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.149. URL https://aclanthology.org/2023.acl-long.149.
  • Mohebbi et al. (2023) H. Mohebbi, W. Zuidema, G. Chrupała, and A. Alishahi. Quantifying context mixing in transformers. In A. Vlachos and I. Augenstein (eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp.  3378–3400, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.eacl-main.245. URL https://aclanthology.org/2023.eacl-main.245.
  • Molina (2023) R. Molina. Traveling words: A geometric interpretation of transformers. Arxiv, 2023. URL https://arxiv.org/abs/2309.07315.
  • Monea et al. (2024) G. Monea, M. Peyrard, M. Josifoski, V. Chaudhary, J. Eisner, E. Kıcıman, H. Palangi, B. Patra, and R. West. A glitch in the matrix? locating and detecting language model grounding with fakepedia, 2024. URL https://arxiv.org/abs/2312.02073.
  • Mossing et al. (2024) D. Mossing, S. Bills, H. Tillman, T. Dupré la Tour, N. Cammarata, L. Gao, J. Achiam, C. Yeh, J. Leike, J. Wu, and W. Saunders. Transformer debugger. https://github.com/openai/transformer-debugger, 2024.
  • Nanda (2022a) N. Nanda. Induction mosaic. Neel Nanda Blog, 2022a. URL https://neelnanda.io/mosaic.
  • Nanda (2022b) N. Nanda. Neuroscope: A website for mechanistic interpretability of language models. Website, 2022b. URL https://neuroscope.io/.
  • Nanda (2023) N. Nanda. Attribution patching: Activation patching at industrial scale. https://www.neelnanda.io/mechanistic-interpretability/attribution-patching, 2023.
  • Nanda & Bloom (2022) N. Nanda and J. Bloom. Transformerlens. Github Repository, 2022. URL https://github.com/neelnanda-io/TransformerLens.
  • Nanda et al. (2023a) N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2023a. URL https://openreview.net/forum?id=9XFSbDPmdW.
  • Nanda et al. (2023b) N. Nanda, A. Lee, and M. Wattenberg. Emergent linear representations in world models of self-supervised sequence models. In Y. Belinkov, S. Hao, J. Jumelet, N. Kim, A. McCarthy, and H. Mohebbi (eds.), Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pp.  16–30, Singapore, December 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.blackboxnlp-1.2. URL https://aclanthology.org/2023.blackboxnlp-1.2.
  • Nanda et al. (2023c) N. Nanda, S. Rajamanoharan, J. Kramár, and R. Shah. Fact finding: Attempting to reverse-engineer factual recall on the neuron level. AI Alignment Forum, 2023c. URL https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall.
  • Neo et al. (2024) C. Neo, S. B. Cohen, and F. Barez. Interpreting context look-ups in transformers: Investigating attention-mlp interactions, 2024. URL https://arxiv.org/abs/2402.15055.
  • Nguyen et al. (2016) A. Nguyen, A. Dosovitskiy, J. Yosinski, T. Brox, and J. Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pp.  3395–3403, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819.
  • Nogueira et al. (2021) R. Nogueira, Z. Jiang, and J. Lin. Investigating the limitations of transformers with simple arithmetic tasks, 2021.
  • nostalgebraist (2020) nostalgebraist. Interpreting GPT: the logit lens. AI Alignment Forum, 2020. URL https://www.alignmentforum.org/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens.
  • Oh & Schuler (2023) B.-D. Oh and W. Schuler. Token-wise decomposition of autoregressive language model hidden states for analyzing model predictions. In A. Rogers, J. Boyd-Graber, and N. Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  10105–10117, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.562. URL https://aclanthology.org/2023.acl-long.562.
  • Olah (2022) C. Olah. Mechanistic interpretability, variables, and the importance of interpretable bases. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/2022/mech-interp-essay.
  • Olah (2023) C. Olah. Distributed representations: Composition & superposition. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/superposition-composition/index.html.
  • Olah et al. (2020a) C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, and S. Carter. An overview of early vision in inceptionv1. Distill, 2020a. doi: 10.23915/distill.00024.002. URL https://distill.pub/2020/circuits/early-vision.
  • Olah et al. (2020b) C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, and S. Carter. Zoom in: An introduction to circuits. Distill, 2020b. doi: 10.23915/distill.00024.001. URL https://distill.pub/2020/circuits/zoom-in.
  • Olshausen & Field (1997) B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision Research, 37(23):3311–3325, 1997. ISSN 0042-6989. doi: 10.1016/S0042-6989(97)00169-7. URL https://www.sciencedirect.com/science/article/pii/S0042698997001697.
  • Olsson et al. (2022) C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, S. Johnston, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah. In-context learning and induction heads. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.
  • Ortu et al. (2024) F. Ortu, Z. Jin, D. Doimo, M. Sachan, A. Cazzaniga, and B. Schölkopf. Competition of mechanisms: Tracing how language models handle facts and counterfactuals. Computing Research Repository, arXiv:2402.11655, 2024. URL https://arxiv.org/abs/2402.11655.
  • PAIR Team (2023) G. PAIR Team. Saliency: Framework-agnostic implementation for state-of-the-art saliency methods, 2023. URL https://github.com/PAIR-code/saliency.
  • Pal et al. (2023) K. Pal, J. Sun, A. Yuan, B. Wallace, and D. Bau. Future lens: Anticipating subsequent tokens from a single hidden state. In J. Jiang, D. Reitter, and S. Deng (eds.), Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL), pp.  548–560, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.conll-1.37. URL https://aclanthology.org/2023.conll-1.37.
  • Parcalabescu & Frank (2023) L. Parcalabescu and A. Frank. On measuring faithfulness or self-consistency of natural language explanations, 2023.
  • Park et al. (2023a) K. Park, Y. J. Choe, and V. Veitch. The linear representation hypothesis and the geometry of large language models. Arxiv, 2023a. URL https://arxiv.org/abs/2311.03658.
  • Park et al. (2023b) S. M. Park, K. Georgiev, A. Ilyas, G. Leclerc, and A. Mądry. Trak: attributing model behavior at scale. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023b.
  • Paulo et al. (2024) G. Paulo, T. Marshall, and N. Belrose. Does transformer interpretability transfer to rnns?, 2024.
  • Pearl (2001) J. Pearl. Direct and indirect effects. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, UAI’01, pp.  411–420, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1558608001.
  • Pearl (2009) J. Pearl. Causality. Cambridge University Press, 2 edition, 2009. doi: 10.1017/CBO9780511803161.
  • Peters et al. (2018) M. E. Peters, M. Neumann, L. Zettlemoyer, and W.-t. Yih. Dissecting contextual word embeddings: Architecture and representation. In E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp.  1499–1509, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1179. URL https://aclanthology.org/D18-1179.
  • Pezeshkpour et al. (2022) P. Pezeshkpour, S. Jain, S. Singh, and B. Wallace. Combining feature and instance attribution to detect artifacts. In S. Muresan, P. Nakov, and A. Villavicencio (eds.), Findings of the Association for Computational Linguistics: ACL 2022, pp.  1934–1946, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.153. URL https://aclanthology.org/2022.findings-acl.153.
  • Pierse (2021) C. Pierse. Transformers Interpret, February 2021. URL https://github.com/cdpierse/transformers-interpret.
  • Pimentel et al. (2020) T. Pimentel, J. Valvoda, R. H. Maudslay, R. Zmigrod, A. Williams, and R. Cotterell. Information-theoretic probing for linguistic structure. In D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  4609–4622, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.420. URL https://aclanthology.org/2020.acl-main.420.
  • Power et al. (2022) A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets, 2022.
  • Prakash et al. (2024) N. Prakash, T. R. Shaham, T. Haklay, Y. Belinkov, and D. Bau. Fine-tuning enhances existing mechanisms: A case study on entity tracking. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=8sKcAWOf2D.
  • Puccetti et al. (2022) G. Puccetti, A. Rogers, A. Drozd, and F. Dell’Orletta. Outlier dimensions that disrupt transformers are driven by frequency. In Y. Goldberg, Z. Kozareva, and Y. Zhang (eds.), Findings of the Association for Computational Linguistics: EMNLP 2022, pp.  1286–1304, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-emnlp.93. URL https://aclanthology.org/2022.findings-emnlp.93.
  • Qi et al. (2023) J. Qi, R. Fernández, and A. Bisazza. Cross-lingual consistency of factual knowledge in multilingual language models. In H. Bouamor, J. Pino, and K. Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  10650–10666, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.658. URL https://aclanthology.org/2023.emnlp-main.658.
  • Quirke et al. (2023) L. Quirke, L. Heindrich, W. Gurnee, and N. Nanda. Training dynamics of contextual n-grams in language models, 2023. URL https://arxiv.org/abs/2311.00863.
  • Radford et al. (2017) A. Radford, R. Jozefowicz, and I. Sutskever. Learning to generate reviews and discovering sentiment. Arxiv, 2017. URL https://arxiv.org/abs/1704.01444.
  • Radford et al. (2018) A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training. OpenAI Blog, 2018. URL https://openai.com/research/language-unsupervised.
  • Radford et al. (2019) A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 2019. URL https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
  • Raganato & Tiedemann (2018) A. Raganato and J. Tiedemann. An analysis of encoder representations in transformer-based machine translation. In T. Linzen, G. Chrupała, and A. Alishahi (eds.), Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp.  287–297, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5431. URL https://aclanthology.org/W18-5431.
  • Rajamanoharan (2024) S. Rajamanoharan. Progress update 1 from the gdm mech interp team. improving ghost grads. AI Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/C5KAZQib3bzzpeyrg/progress-update-1-from-the-gdm-mech-interp-team-full-update.
  • Rajamanoharan et al. (2024) S. Rajamanoharan, A. Conmy, L. Smith, T. Lieberum, V. Varma, J. Kramár, R. Shah, and N. Nanda. Improving dictionary learning with gated sparse autoencoders. ArXiv, 2024.
  • Ravfogel et al. (2020) S. Ravfogel, Y. Elazar, H. Gonen, M. Twiton, and Y. Goldberg. Null it out: Guarding protected attributes by iterative nullspace projection. In D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  7237–7256, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.647. URL https://aclanthology.org/2020.acl-main.647.
  • Ravfogel et al. (2022) S. Ravfogel, M. Twiton, Y. Goldberg, and R. D. Cotterell. Linear adversarial concept erasure. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  18400–18421. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/ravfogel22a.html.
  • Ribeiro et al. (2016) M. T. Ribeiro, S. Singh, and C. Guestrin. "why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pp.  1135–1144, 2016.
  • Rogers et al. (2021) A. Rogers, O. Kovaleva, and A. Rumshisky. A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics, 8:842–866, 01 2021. ISSN 2307-387X. doi: 10.1162/tacl_a_00349. URL https://doi.org/10.1162/tacl_a_00349.
  • Rudman et al. (2023) W. Rudman, C. Chen, and C. Eickhoff. Outlier dimensions encode task specific knowledge. In H. Bouamor, J. Pino, and K. Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  14596–14605, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.901. URL https://aclanthology.org/2023.emnlp-main.901.
  • Rushing & Nanda (2024) C. Rushing and N. Nanda. Explorations of self-repair in language models, 2024. URL https://arxiv.org/abs/2402.15390.
  • Räuker et al. (2023) T. Räuker, A. Ho, S. Casper, and D. Hadfield-Menell. Toward transparent ai: A survey on interpreting the inner structures of deep neural networks. Arxiv, 2023. URL https://arxiv.org/abs/2207.13243.
  • Sakarvadia et al. (2023) M. Sakarvadia, A. Khan, A. Ajith, D. Grzenda, N. Hudson, A. Bauer, K. Chard, and I. Foster. Attention lens: A tool for mechanistically interpreting the attention head information retrieval mechanism, 2023.
  • Sanyal & Ren (2021) S. Sanyal and X. Ren. Discretized integrated gradients for explaining language models. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  10285–10299, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.805. URL https://aclanthology.org/2021.emnlp-main.805.
  • Sarti et al. (2023) G. Sarti, N. Feldhus, L. Sickert, O. van der Wal, M. Nissim, and A. Bisazza. Inseq: An interpretability toolkit for sequence generation models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pp.  421–435, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-demo.40. URL https://aclanthology.org/2023.acl-demo.40.
  • Sarti et al. (2024) G. Sarti, G. Chrupała, M. Nissim, and A. Bisazza. Quantifying the plausibility of context reliance in neural machine translation. In The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, May 2024. OpenReview. URL https://openreview.net/forum?id=XTHfNGI3zT.
  • Shaham et al. (2024) T. R. Shaham, S. Schwettmann, F. Wang, A. Rajaram, E. Hernandez, J. Andreas, and A. Torralba. A multimodal automated interpretability agent. Arxiv, 2024. URL https://arxiv.org/abs/2404.14394.
  • Shapley (1953) L. S. Shapley. A value for n-person games. In H. W. Kuhn and A. W. Tucker (eds.), Contributions to the Theory of Games II, pp.  307–317. Princeton University Press, Princeton, 1953.
  • Sharkey et al. (2022) L. Sharkey, D. Braun, and B. Millidge. Taking features out of superposition with sparse autoencoders. AI Alignment Forum, 2022. URL https://www.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition.
  • Sharma et al. (2024a) A. S. Sharma, D. Atkinson, and D. Bau. Locating and editing factual associations in mamba, 2024a.
  • Sharma et al. (2024b) P. Sharma, J. T. Ash, and D. Misra. The truth is in there: Improving reasoning with layer-selective rank reduction. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=ozX92bu8VA.
  • Shazeer (2020) N. Shazeer. Glu variants improve transformer. ArXiv, 2020.
  • Shrikumar et al. (2017) A. Shrikumar, P. Greenside, and A. Kundaje. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp.  3145–3153. JMLR.org, 2017.
  • Shrikumar et al. (2018) A. Shrikumar, J. Su, and A. Kundaje. Computationally efficient measures of internal neuron importance. ArXiv, abs/1807.09946, 2018. URL https://api.semanticscholar.org/CorpusID:50787065.
  • Siegel et al. (2024) N. Y. Siegel, O.-M. Camburu, N. Heess, and M. Perez-Ortiz. The probabilities also matter: A more faithful metric for faithfulness of free-text explanations in large language models, 2024.
  • Simonyan et al. (2014) K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In The Second International Conference on Learning Representations, 2014. URL http://arxiv.org/abs/1312.6034.
  • Singh et al. (2024a) A. K. Singh, T. Moskovitz, F. Hill, S. C. Y. Chan, and A. M. Saxe. What needs to go right for an induction head? a mechanistic study of in-context learning circuits and their formation, 2024a.
  • Singh et al. (2024b) C. Singh, J. P. Inala, M. Galley, R. Caruana, and J. Gao. Rethinking interpretability in the era of large language models. ArXiv, abs/2402.01761, 2024b. URL https://api.semanticscholar.org/CorpusID:267412530.
  • Singh et al. (2024c) S. Singh, S. Ravfogel, J. Herzig, R. Aharoni, R. Cotterell, and P. Kumaraguru. Mimic: Minimally modified counterfactuals in the representation space. Arxiv, 2024c. URL https://arxiv.org/abs/2402.09631.
  • Sixt et al. (2020) L. Sixt, M. Granz, and T. Landgraf. When explanations lie: Why many modified BP attributions fail. In H. Daumé III and A. Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp.  9046–9057. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/sixt20a.html.
  • Smilkov et al. (2017) D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg. Smoothgrad: removing noise by adding noise, 2017.
  • Smolensky (1986) P. Smolensky. Neural and conceptual interpretation of PDP models, pp.  390–431. MIT Press, Cambridge, MA, USA, 1986. ISBN 0262631105.
  • Stoehr et al. (2024) N. Stoehr, M. Gordon, C. Zhang, and O. Lewis. Localizing paragraph memorization in language models, 2024. URL https://arxiv.org/abs/2403.19851.
  • Stolfo et al. (2023a) A. Stolfo, Y. Belinkov, and M. Sachan. A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. In H. Bouamor, J. Pino, and K. Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  7035–7052, Singapore, December 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.435. URL https://aclanthology.org/2023.emnlp-main.435.
  • Stolfo et al. (2023b) A. Stolfo, Y. Belinkov, and M. Sachan. Understanding arithmetic reasoning in language models using causal mediation analysis. Arxiv, 2023b. URL https://arxiv.org/abs/2305.15054.
  • Suau et al. (2020) X. Suau, L. Zappella, and N. Apostoloff. Finding experts in transformer models, 2020.
  • Suau et al. (2022) X. Suau, L. Zappella, and N. Apostoloff. Self-conditioning pre-trained language models. International Conference on Machine Learning, 2022. URL https://proceedings.mlr.press/v162/cuadros22a/cuadros22a.pdf.
  • Sun et al. (2024) M. Sun, X. Chen, J. Z. Kolter, and Z. Liu. Massive activations in large language models, 2024. URL https://arxiv.org/abs/2402.17762.
  • Sundararajan et al. (2017) M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp.  3319–3328. JMLR.org, 2017.
  • Syed et al. (2023) A. Syed, C. Rager, and A. Conmy. Attribution patching outperforms automated circuit discovery. Arxiv, 2023. URL https://arxiv.org/abs/2310.10348.
  • Takase et al. (2023) S. Takase, S. Kiyono, S. Kobayashi, and J. Suzuki. B2T connection: Serving stability and performance in deep transformers. In A. Rogers, J. Boyd-Graber, and N. Okazaki (eds.), Findings of the Association for Computational Linguistics: ACL 2023, pp.  3078–3095, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.192. URL https://aclanthology.org/2023.findings-acl.192.
  • Tang et al. (2024) T. Tang, W. Luo, H. Huang, D. Zhang, X. Wang, X. Zhao, F. Wei, and J.-R. Wen. Language-specific neurons: The key to multilingual capabilities in large language models, 2024. URL https://arxiv.org/abs/2402.16438.
  • Tarzanagh et al. (2024) D. A. Tarzanagh, Y. Li, C. Thrampoulidis, and S. Oymak. Transformers as support vector machines, 2024.
  • Templeton et al. (2024) A. Templeton, T. Conerly, J. Marcus, T. Henighan, A. Golubeva, and T. Bricken. Circuits updates - february 2024. update on dictionary learning improvements. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/feb-update/index.html.
  • Tenney et al. (2019a) I. Tenney, D. Das, and E. Pavlick. BERT rediscovers the classical NLP pipeline. In A. Korhonen, D. Traum, and L. Màrquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  4593–4601, Florence, Italy, July 2019a. Association for Computational Linguistics. doi: 10.18653/v1/P19-1452. URL https://aclanthology.org/P19-1452.
  • Tenney et al. (2019b) I. Tenney, P. Xia, B. Chen, A. Wang, A. Poliak, R. T. McCoy, N. Kim, B. V. Durme, S. Bowman, D. Das, and E. Pavlick. What do you learn from context? probing for sentence structure in contextualized word representations. In International Conference on Learning Representations, 2019b. URL https://openreview.net/forum?id=SJzSgnRcKX.
  • Tenney et al. (2020) I. Tenney, J. Wexler, J. Bastings, T. Bolukbasi, A. Coenen, S. Gehrmann, E. Jiang, M. Pushkarna, C. Radebaugh, E. Reif, and A. Yuan. The language interpretability tool: Extensible, interactive visualizations and analysis for NLP models. In Q. Liu and D. Schlangen (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp.  107–118, Online, October 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.15. URL https://aclanthology.org/2020.emnlp-demos.15.
  • Tenney et al. (2024) I. Tenney, R. Mullins, B. Du, S. Pandya, M. Kahng, and L. Dixon. Interactive prompt debugging with sequence salience. Arxiv, 2024. URL https://arxiv.org/abs/2404.07498.
  • Tian et al. (2023) Y. Tian, Y. Wang, B. Chen, and S. Du. Scan and snap: Understanding training dynamics and token composition in 1-layer transformer, 2023.
  • Tian et al. (2024) Y. Tian, Y. Wang, Z. Zhang, B. Chen, and S. S. Du. JoMA: Demystifying multilayer transformers via joint dynamics of MLP and attention. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=LbJqRGNYCf.
  • Tibshirani (1996) R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996. doi: 10.1111/j.2517-6161.1996.tb02080.x. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1996.tb02080.x.
  • Tigges et al. (2023) C. Tigges, O. J. Hollinsworth, A. Geiger, and N. Nanda. Linear representations of sentiment in large language models. Arxiv, 2023. URL https://arxiv.org/abs/2310.15154.
  • Timkey & van Schijndel (2021) W. Timkey and M. van Schijndel. All bark and no bite: Rogue dimensions in transformer language models obscure representational quality. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  4527–4546, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.372. URL https://aclanthology.org/2021.emnlp-main.372.
  • Todd et al. (2024) E. Todd, M. Li, A. S. Sharma, A. Mueller, B. C. Wallace, and D. Bau. LLMs represent contextual tasks as compact function vectors. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=AwyxtyMwaG.
  • Touvron et al. (2023) H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom. Llama 2: Open foundation and fine-tuned chat models. Arxiv, 2023. URL https://arxiv.org/abs/2307.09288.
  • Tufanov et al. (2024) I. Tufanov, K. Hambardzumyan, J. Ferrando, and E. Voita. Lm transparency tool: Interactive tool for analyzing transformer language models. Arxiv, 2024. URL https://arxiv.org/abs/2404.07004.
  • Turner et al. (2023) A. M. Turner, L. Thiergart, D. Udell, G. Leech, U. Mini, and M. MacDiarmid. Activation addition: Steering language models without optimization, 2023.
  • Turpin et al. (2023) M. Turpin, J. Michael, E. Perez, and S. Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. ArXiv, abs/2305.04388, 2023. URL https://api.semanticscholar.org/CorpusID:258556812.
  • Variengien (2023) A. Variengien. Some common confusion about induction heads. LessWrong, 2023. URL https://www.lesswrong.com/posts/nJqftacoQGKurJ6fv/some-common-confusion-about-induction-heads.
  • Variengien & Winsor (2023) A. Variengien and E. Winsor. Look before you leap: A universal emergent decomposition of retrieval tasks in language models, 2023. URL https://arxiv.org/abs/2312.10091.
  • Varma et al. (2023) V. Varma, R. Shah, Z. Kenton, J. Kramár, and R. Kumar. Explaining grokking through circuit efficiency. ArXiv, abs/2309.02390, 2023. URL https://api.semanticscholar.org/CorpusID:261557247.
  • Varshney et al. (2023) N. Varshney, W. Yao, H. Zhang, J. Chen, and D. Yu. A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation, 2023. URL https://arxiv.org/abs/2307.03987.
  • Vasconcelos et al. (2023) H. Vasconcelos, M. Jörke, M. Grunde-McLaughlin, T. Gerstenberg, M. S. Bernstein, and R. Krishna. Explanations can reduce overreliance on ai systems during decision-making. Proc. ACM Hum.-Comput. Interact., 7(CSCW1), apr 2023. doi: 10.1145/3579605. URL https://doi.org/10.1145/3579605.
  • Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
  • Veit et al. (2016) A. Veit, M. Wilber, and S. Belongie. Residual networks behave like ensembles of relatively shallow networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pp.  550–558, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819.
  • Vig (2019) J. Vig. A multiscale visualization of attention in the transformer model. In M. R. Costa-jussà and E. Alfonseca (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp.  37–42, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-3007. URL https://aclanthology.org/P19-3007.
  • Vig et al. (2020) J. Vig, S. Gehrmann, Y. Belinkov, S. Qian, D. Nevo, Y. Singer, and S. Shieber. Investigating gender bias in language models using causal mediation analysis. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  12388–12401. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/hash/92650b2e92217715fe312e6fa7b90d82-Abstract.html.
  • Voita & Titov (2020) E. Voita and I. Titov. Information-theoretic probing with minimum description length. In B. Webber, T. Cohn, Y. He, and Y. Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  183–196, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.14. URL https://aclanthology.org/2020.emnlp-main.14.
  • Voita et al. (2019a) E. Voita, R. Sennrich, and I. Titov. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. In K. Inui, J. Jiang, V. Ng, and X. Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  4396–4406, Hong Kong, China, November 2019a. Association for Computational Linguistics. doi: 10.18653/v1/D19-1448. URL https://aclanthology.org/D19-1448.
  • Voita et al. (2019b) E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In A. Korhonen, D. Traum, and L. Màrquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  5797–5808, Florence, Italy, July 2019b. Association for Computational Linguistics. doi: 10.18653/v1/P19-1580. URL https://aclanthology.org/P19-1580.
  • Voita et al. (2021) E. Voita, R. Sennrich, and I. Titov. Analyzing the source and target contributions to predictions in neural machine translation. In C. Zong, F. Xia, W. Li, and R. Navigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.  1126–1140, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.91. URL https://aclanthology.org/2021.acl-long.91.
  • Voita et al. (2023) E. Voita, J. Ferrando, and C. Nalmpantis. Neurons in large language models: Dead, n-gram, positional. Arxiv, 2023. URL https://arxiv.org/abs/2309.04827.
  • Von Oswald et al. (2023) J. Von Oswald, E. Niklasson, E. Randazzo, J. Sacramento, A. Mordvintsev, A. Zhmoginov, and M. Vladymyrov. Transformers learn in-context by gradient descent. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp.  35151–35174. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/von-oswald23a.html.
  • Wang et al. (2023a) K. R. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023a. URL https://openreview.net/forum?id=NpsVSN6o4ul.
  • Wang et al. (2024) Q. Wang, T. Anikina, N. Feldhus, J. van Genabith, L. Hennig, and S. Möller. Llmcheckup: Conversational examination of large language models via interpretability tools, 2024.
  • Wang et al. (2023b) S. Wang, Y. Zhu, H. Liu, Z. Zheng, C. Chen, and J. Li. Knowledge editing for large language models: A survey. ArXiv, abs/2310.16218, 2023b. URL https://api.semanticscholar.org/CorpusID:264487359.
  • Wang et al. (2022) X. Wang, K. Wen, Z. Zhang, L. Hou, Z. Liu, and J. Li. Finding skill neurons in pre-trained transformer-based language models. In Y. Goldberg, Z. Kozareva, and Y. Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  11132–11152, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.765. URL https://aclanthology.org/2022.emnlp-main.765.
  • Wei et al. (2022) D. Wei, R. Nair, A. Dhurandhar, K. R. Varshney, E. Daly, and M. Singh. On the safety of interpretable machine learning: A maximum deviation approach. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp.  9866–9880. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/402e12102d6ec3ea3df40ce1b23d423a-Paper-Conference.pdf.
  • Weiss et al. (2021) G. Weiss, Y. Goldberg, and E. Yahav. Thinking like transformers. In M. Meila and T. Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp.  11080–11090. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/weiss21a.html.
  • Wen et al. (2023) K. Wen, Y. Li, B. Liu, and A. Risteski. Transformers are uninterpretable with myopic methods: a case study with bounded dyck grammars. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp.  38723–38766. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/79ba1b827d3fc58e129d1cbfc8ff69f2-Paper-Conference.pdf.
  • Wichers et al. (2024) N. Wichers, C. Denison, and A. Beirami. Gradient-based language model red teaming, 2024.
  • Wolf et al. (2020) T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush. Transformers: State-of-the-art natural language processing. In Q. Liu and D. Schlangen (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp.  38–45, Online, October 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.6. URL https://aclanthology.org/2020.emnlp-demos.6.
  • Wright & Sharkey (2024) B. Wright and L. Sharkey. Addressing feature suppression in saes. AI Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/3JuSjTZyMzaSeTxKk/addressing-feature-suppression-in-saes.
  • Wu et al. (2024a) W. Wu, Y. Wang, G. Xiao, H. Peng, and Y. Fu. Retrieval head mechanistically explains long-context factuality. Arxiv, 2024a. URL https://arxiv.org/abs/2404.15574.
  • Wu et al. (2023a) Z. Wu, K. D’Oosterlinck, A. Geiger, A. Zur, and C. Potts. Causal proxy models for concept-based model explanations. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023a.
  • Wu et al. (2023b) Z. Wu, A. Geiger, T. Icard, C. Potts, and N. Goodman. Interpretability at scale: Identifying causal mechanisms in alpaca. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp.  78205–78226. Curran Associates, Inc., 2023b. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/f6a8b109d4d4fd64c75e94aaf85d9697-Paper-Conference.pdf.
  • Wu et al. (2024b) Z. Wu, A. Arora, Z. Wang, A. Geiger, D. Jurafsky, C. D. Manning, and C. Potts. Reft: Representation finetuning for language models, 2024b.
  • Wu et al. (2024c) Z. Wu, A. Geiger, A. Arora, J. Huang, Z. Wang, N. D. Goodman, C. D. Manning, and C. Potts. pyvene: A library for understanding and improving pytorch models via interventions, 2024c.
  • Wu et al. (2024d) Z. Wu, A. Geiger, J. Huang, A. Arora, T. Icard, C. Potts, and N. D. Goodman. A reply to makelov et al. (2023)’s "interpretability illusion" arguments, 2024d.
  • Xiao et al. (2023) G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis. Efficient streaming language models with attention sinks. Arxiv, 2023. URL https://arxiv.org/abs/2309.17453.
  • Xie et al. (2022) S. M. Xie, A. Raghunathan, P. Liang, and T. Ma. An explanation of in-context learning as implicit bayesian inference. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=RdJVFCHjUMI.
  • Xiong et al. (2020) R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T.-Y. Liu. On layer normalization in the transformer architecture. In Proceedings of the 37th International Conference on Machine Learning. JMLR.org, 2020.
  • Yang et al. (2023) S. Yang, S. Huang, W. Zou, J. Zhang, X. Dai, and J. Chen. Local interpretation of transformer based on linear decomposition. In A. Rogers, J. Boyd-Graber, and N. Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  10270–10287, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.572. URL https://aclanthology.org/2023.acl-long.572.
  • Yao et al. (2023) Y. Yao, P. Wang, B. Tian, S. Cheng, Z. Li, S. Deng, H. Chen, and N. Zhang. Editing large language models: Problems, methods, and opportunities. In H. Bouamor, J. Pino, and K. Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  10222–10240, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.632. URL https://aclanthology.org/2023.emnlp-main.632.
  • Yin & Neubig (2022) K. Yin and G. Neubig. Interpreting language models with contrastive explanations. In Y. Goldberg, Z. Kozareva, and Y. Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  184–198, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.14. URL https://aclanthology.org/2022.emnlp-main.14.
  • Yu et al. (2023a) Q. Yu, J. Merullo, and E. Pavlick. Characterizing mechanisms for factual recall in language models. In H. Bouamor, J. Pino, and K. Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  9924–9959, Singapore, December 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.615. URL https://aclanthology.org/2023.emnlp-main.615.
  • Yu et al. (2023b) Y. Yu, S. Buchanan, D. Pai, T. Chu, Z. Wu, S. Tong, B. D. Haeffele, and Y. Ma. White-box transformers via sparse rate reduction. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b. URL https://openreview.net/forum?id=THfl8hdVxH.
  • Yu & Ananiadou (2024) Z. Yu and S. Ananiadou. Locating factual knowledge in large language models: Exploring the residual stream and analyzing subvalues in vocabulary space, 2024. URL https://arxiv.org/abs/2312.12141.
  • Yuksekgonul et al. (2024) M. Yuksekgonul, V. Chandrasekaran, E. Jones, S. Gunasekar, R. Naik, H. Palangi, E. Kamar, and B. Nushi. Attention satisfies: A constraint-satisfaction lens on factual errors of language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=gfFVATffPd.
  • Zeiler & Fergus (2014) M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (eds.), Computer Vision – ECCV 2014, pp.  818–833, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10590-1.
  • Zhang & Sennrich (2019) B. Zhang and R. Sennrich. Root mean square layer normalization. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2019. Curran Associates Inc.
  • Zhang & Nanda (2024) F. Zhang and N. Nanda. Towards best practices of activation patching in language models: Metrics and methods. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Hf17y6u9BC.
  • Zhang et al. (2022) S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer. OPT: Open pre-trained transformer language models, 2022.
  • Zhao & Shan (2024) Z. Zhao and B. Shan. Reagent: A model-agnostic feature attribution method for generative language models, 2024.
  • Zhong et al. (2023) Z. Zhong, Z. Liu, M. Tegmark, and J. Andreas. The clock and the pizza: Two stories in mechanistic explanation of neural networks. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp.  27223–27250. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/56cbfbf49937a0873d451343ddc8c57d-Paper-Conference.pdf.
  • Zhou et al. (2015) B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene CNNs. In International Conference on Learning Representations (ICLR), 2015.
  • Zhou et al. (2024) H. Zhou, A. Bradley, E. Littwin, N. Razin, O. Saremi, J. M. Susskind, S. Bengio, and P. Nakkiran. What algorithms can transformers learn? a study in length generalization. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=AssIuHnmHX.
  • Zou et al. (2023) A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A.-K. Dombrowski, S. Goel, N. Li, M. J. Byun, Z. Wang, A. Mallen, S. Basart, S. Koyejo, D. Song, M. Fredrikson, J. Z. Kolter, and D. Hendrycks. Representation engineering: A top-down approach to AI transparency. arXiv, 2023. URL https://arxiv.longhoe.net/abs/2310.01405.

Appendix A Mathematical Notation

Notation    Definition
$n$    Sequence length
$\mathcal{V}$    Vocabulary
$\mathbf{t}=\langle t_{1},t_{2},\ldots,t_{n}\rangle$    Input sequence of tokens
$\mathbf{x}=\langle\bm{x}_{1},\bm{x}_{2},\ldots,\bm{x}_{n}\rangle$    Input sequence of token embeddings
$d$    Model dimension
$d_{h}$    Attention head dimension
$d_{\text{FFN}}$    FFN dimension
$H$    Number of heads
$L$    Number of layers
$\bm{x}^{l}_{i}\in\mathbb{R}^{d}$    Residual stream state at position $i$, layer $l$
$\bm{x}^{\text{mid},l}_{i}\in\mathbb{R}^{d}$    Residual stream state at position $i$, layer $l$, after the attention block
$f^{c}(\mathbf{x})\in\mathbb{R}^{d}$    Component $c$ output representation at the last position
$f^{l}(\mathbf{x})=\bm{x}^{l}_{n}\in\mathbb{R}^{d}$    Residual stream state at the last position, layer $l$
$\bm{A}^{l,h}\in\mathbb{R}^{n\times n}$    Attention matrix at layer $l$, head $h$
$\bm{W}_{Q}^{l,h},\bm{W}_{K}^{l,h},\bm{W}_{V}^{l,h}\in\mathbb{R}^{d\times d_{h}}$    Queries, keys and values weight matrices at layer $l$, head $h$
$\bm{W}_{O}^{l,h}\in\mathbb{R}^{d_{h}\times d}$    Output weight matrix at layer $l$, head $h$
$\bm{W}_{\text{in}}^{l}\in\mathbb{R}^{d\times d_{\text{FFN}}}$, $\bm{W}_{\text{out}}^{l}\in\mathbb{R}^{d_{\text{FFN}}\times d}$    FFN input and output weight matrices at layer $l$
$\bm{W}_{E}\in\mathbb{R}^{d\times|\mathcal{V}|}$ and $\bm{W}_{U}\in\mathbb{R}^{|\mathcal{V}|\times d}$    Embedding and unembedding matrices
Table 1: Notation and definitions of the main variables used in this work.

Appendix B Linearization of the LayerNorm

The LayerNorm operates over an input $\bm{z}$ as $\text{LN}(\bm{z})=\frac{\bm{z}-\mu(\bm{z})}{\sigma(\bm{z})}\odot\gamma+\beta$, where $\mu$ and $\sigma$ compute the mean and standard deviation of $\bm{z}$, and $\gamma$ and $\beta$ refer to the element-wise transformation and bias, respectively. Holding $\sigma(\bm{z})$ constant, the LayerNorm can be decomposed into $\bm{z}\mathbf{L}+\beta$, where $\mathbf{L}$ is a linear transformation. Since $\bm{z}$ is a row vector multiplying $\mathbf{L}$ from the left, the centering matrix acts first and the scaling by $\gamma$ second (here $d$ denotes the dimension of $\bm{z}$):

$$
\mathbf{L}:=\frac{1}{\sigma(\bm{z})}
\left[\begin{array}{cccc}
\frac{d-1}{d} & -\frac{1}{d} & \cdots & -\frac{1}{d}\\
-\frac{1}{d} & \frac{d-1}{d} & \cdots & -\frac{1}{d}\\
\cdots & \cdots & \cdots & \cdots\\
-\frac{1}{d} & -\frac{1}{d} & \cdots & \frac{d-1}{d}
\end{array}\right]
\left[\begin{array}{cccc}
\gamma_{1} & 0 & \cdots & 0\\
0 & \gamma_{2} & \cdots & 0\\
\cdots & \cdots & \cdots & \cdots\\
0 & 0 & \cdots & \gamma_{d}
\end{array}\right].
\tag{26}
$$
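As a quick numerical check of this decomposition, the following minimal sketch builds the matrix for a small random input and compares $\bm{z}\mathbf{L}+\beta$ against a direct LayerNorm computation. It uses only the Python standard library. The function names (`layer_norm`, `linearized_matrix`, `vec_mat`) are illustrative and do not come from the paper, and the sketch assumes a population standard deviation with no epsilon term, which practical implementations usually add.

```python
import math
import random

def layer_norm(z, gamma, beta):
    """Reference LayerNorm: (z - mean) / std * gamma + beta (assumed: no epsilon)."""
    d = len(z)
    mu = sum(z) / d
    sigma = math.sqrt(sum((v - mu) ** 2 for v in z) / d)
    return [(v - mu) / sigma * g + b for v, g, b in zip(z, gamma, beta)]

def linearized_matrix(z, gamma):
    """Build L so that a row vector z is first centered, then scaled by gamma / sigma(z)."""
    d = len(z)
    mu = sum(z) / d
    sigma = math.sqrt(sum((v - mu) ** 2 for v in z) / d)
    # Entry (i, j) of the centering matrix is delta_ij - 1/d; column j is scaled by gamma[j].
    return [[((1.0 if i == j else 0.0) - 1.0 / d) * gamma[j] / sigma
             for j in range(d)] for i in range(d)]

def vec_mat(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

random.seed(0)
d = 6
z = [random.gauss(0.0, 1.0) for _ in range(d)]
gamma = [random.uniform(0.5, 1.5) for _ in range(d)]
beta = [random.uniform(-0.5, 0.5) for _ in range(d)]

L = linearized_matrix(z, gamma)
linear_out = [a + b for a, b in zip(vec_mat(z, L), beta)]
reference = layer_norm(z, gamma, beta)
max_abs_err = max(abs(a - b) for a, b in zip(linear_out, reference))
print(max_abs_err)  # expected to be at floating-point precision
```

Note that the matrix is linear only for the given input: $\sigma(\bm{z})$ is recomputed from $\bm{z}$, so a different input yields a different $\mathbf{L}$.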

Appendix C Folding the LayerNorm

Any Transformer block reads from the residual stream by normalizing before applying a linear layer (with weights $\bm{W}$ and bias $\bm{b}$) to the resulting vector:

$$
\text{LN}(\bm{z})\bm{W}+\bm{b}
\tag{27}
$$

Following the decomposition in Equation 26 we can fold the weights of the LayerNorm into those of the subsequent linear layer as follows:

$$
\begin{aligned}
\text{LN}(\bm{z})\bm{W}+\bm{b} &= (\bm{z}\mathbf{L}+\beta)\bm{W}+\bm{b}\\
&= \bm{z}\mathbf{L}\bm{W}+\beta\bm{W}+\bm{b}\\
&= \bm{z}\bm{W}^{*}+\bm{b}^{*},
\end{aligned}
\tag{28}
$$

where the new weights and bias are $\bm{W}^{*}=\mathbf{L}\bm{W}$ and $\bm{b}^{*}=\beta\bm{W}+\bm{b}$, respectively.
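The folding step can be checked numerically with a minimal sketch like the one below, which compares the unfolded computation $\text{LN}(\bm{z})\bm{W}+\bm{b}$ with the folded one $\bm{z}\bm{W}^{*}+\bm{b}^{*}$ on random data. All helper names (`fold`, `ln_matrix`, and so on) are hypothetical, the dimensions are arbitrary, and the LayerNorm is assumed to use a population standard deviation without an epsilon term.

```python
import math
import random

def vec_mat(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def mat_mul(A, B):
    """Matrix product, computed row by row."""
    return [vec_mat(row, B) for row in A]

def sigma_of(z):
    """Population standard deviation of z (assumed: no epsilon)."""
    mu = sum(z) / len(z)
    return math.sqrt(sum((v - mu) ** 2 for v in z) / len(z))

def layer_norm(z, gamma, beta):
    mu, sigma = sum(z) / len(z), sigma_of(z)
    return [(v - mu) / sigma * g + b for v, g, b in zip(z, gamma, beta)]

def ln_matrix(z, gamma):
    """Linearized LayerNorm matrix for row vectors: center first, then scale."""
    d, sigma = len(z), sigma_of(z)
    return [[((1.0 if i == j else 0.0) - 1.0 / d) * gamma[j] / sigma
             for j in range(d)] for i in range(d)]

def fold(z, gamma, beta, W, b):
    """Return W* = L W and b* = beta W + b."""
    W_star = mat_mul(ln_matrix(z, gamma), W)
    b_star = [x + y for x, y in zip(vec_mat(beta, W), b)]
    return W_star, b_star

random.seed(0)
d, d_out = 5, 3
z = [random.gauss(0.0, 1.0) for _ in range(d)]
gamma = [random.uniform(0.5, 1.5) for _ in range(d)]
beta = [random.uniform(-0.5, 0.5) for _ in range(d)]
W = [[random.gauss(0.0, 1.0) for _ in range(d_out)] for _ in range(d)]
b = [random.gauss(0.0, 1.0) for _ in range(d_out)]

unfolded = [x + y for x, y in zip(vec_mat(layer_norm(z, gamma, beta), W), b)]
W_star, b_star = fold(z, gamma, beta, W, b)
folded = [x + y for x, y in zip(vec_mat(z, W_star), b_star)]
fold_err = max(abs(x - y) for x, y in zip(unfolded, folded))
print(fold_err)  # expected to be at floating-point precision
```

In this sketch $\bm{W}^{*}$ still depends on the input through $\sigma(\bm{z})$, whereas $\bm{b}^{*}$ depends only on the parameters and can be precomputed once.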

Appendix D Implementation details of SAEs

  • During training, a feature receives a zero gradient signal if it does not activate. When this occurs frequently, it can lead to a dead feature. Bricken et al. (2023) propose resampling these features by reinitializing their encoder and decoder weights periodically during training. An alternative approach to resampling is ghost gradients (Jermyn & Templeton, 2024), which add an auxiliary loss term that supplies a gradient signal to promote the reactivation of dead features. However, recent results have found this approach suboptimal (Rajamanoharan, 2024; Conerly et al., 2024).

  • Setting the $\beta_{1}$ parameter of Adam to 0 has been found to reduce the number of “dead” features in larger autoencoders (Templeton et al., 2024; Rajamanoharan et al., 2024). Yet, Conerly et al. (2024) rely on $\beta_{1}=0.9$.

  • Although the norm of the decoder’s rows (we consider the decoder weight matrix $\bm{W}_{\text{dec}}\in\mathbb{R}^{m\times d}$) was initially recommended to be equal to one (Bricken et al., 2023), recently released SAEs also consider an unconstrained norm setting (Conerly et al., 2024).
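To make two of the points above concrete, the sketch below shows one way dead features might be detected on a batch and how decoder rows can be rescaled to unit norm. It is a simplified illustration, not the procedure of any cited work: the resampling step simply draws a fresh random unit direction, whereas Bricken et al. (2023) describe a more elaborate reinitialization. All function names, dimensions, and the artificially large negative bias used to force a dead feature are assumptions made for the example.

```python
import math
import random

def encode(x, W_enc, b_enc):
    """ReLU encoder activations for one input x (W_enc has shape d x m)."""
    return [max(0.0, sum(x[i] * W_enc[i][j] for i in range(len(x))) + b_enc[j])
            for j in range(len(b_enc))]

def find_dead_features(batch, W_enc, b_enc):
    """Indices of features that never activate on any input in the batch."""
    fired = [False] * len(b_enc)
    for x in batch:
        for j, a in enumerate(encode(x, W_enc, b_enc)):
            fired[j] = fired[j] or a > 0.0
    return [j for j, f in enumerate(fired) if not f]

def unit_norm_rows(W_dec):
    """Rescale each decoder row (one row per feature, shape m x d) to unit L2 norm."""
    return [[v / math.sqrt(sum(u * u for u in row)) for v in row] for row in W_dec]

def resample_features(dead, W_enc, b_enc, W_dec, rng):
    """Simplified resampling (assumption): new random unit direction, zero bias."""
    d = len(W_enc)
    for j in dead:
        direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in direction))
        direction = [v / norm for v in direction]
        W_dec[j] = direction[:]            # decoder row for feature j
        for i in range(d):
            W_enc[i][j] = direction[i]     # matching encoder column
        b_enc[j] = 0.0

rng = random.Random(0)
d, m = 4, 8
W_enc = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(d)]
b_enc = [0.0] * m
b_enc[0] = -100.0  # force feature 0 to be dead for the demonstration
W_dec = unit_norm_rows([[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)])
batch = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(64)]

dead_before = find_dead_features(batch, W_enc, b_enc)
resample_features(dead_before, W_enc, b_enc, W_dec, rng)
dead_after = find_dead_features(batch, W_enc, b_enc)
row_norms = [math.sqrt(sum(v * v for v in row)) for row in W_dec]
```

In an unconstrained-norm setting, the call to `unit_norm_rows` would simply be omitted.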