Monitoring Latent World States in Language Models with Propositional Probes

Jiahai Feng  Stuart Russell  Jacob Steinhardt
UC Berkeley
Corresponding email: [email protected]
Abstract

Language models are susceptible to bias, sycophancy, backdoors, and other tendencies that lead to unfaithful responses to the input context. Interpreting internal states of language models could help monitor and correct unfaithful behavior. We hypothesize that language models represent their input contexts in a latent world model, and seek to extract this latent world state from the activations. We do so with propositional probes, which compositionally probe tokens for lexical information and bind them into logical propositions representing the world state. For example, given the input context “Greg is a nurse. Laura is a physicist.”, we decode the propositions $\mathsf{WorksAs(Greg, nurse)}$ and $\mathsf{WorksAs(Laura, physicist)}$ from the model’s activations. Key to this is identifying a binding subspace in which bound tokens have high similarity (Greg $\leftrightarrow$ nurse) but unbound ones do not (Greg $\not\leftrightarrow$ physicist). We validate propositional probes in a closed-world setting with finitely many predicates and properties. Despite being trained on simple templated contexts, propositional probes generalize to contexts rewritten as short stories and translated to Spanish. Moreover, we find that in three settings where language models respond unfaithfully to the input context (prompt injections, backdoor attacks, and gender bias), the decoded propositions remain faithful. This suggests that language models often encode a faithful world model but decode it unfaithfully, which motivates the search for better interpretability tools for monitoring LMs.

1 Introduction

Language models (LMs) appear to robustly understand their input contexts and flexibly respond to questions about them. Yet, they are susceptible to tendencies that lead to responses unfaithful to the input contexts: they are misled by irrelevant examples  [1, 2], subject to unintended tendencies  [3] and biases  [4, 5], and vulnerable to adversarial attacks on their prompts  [6] and training data  [7, 8].

Being able to interpret internal states of language models could help diagnose and correct language models when they become unfaithful [9, 10]. In particular, language models might internally represent the input context in a latent world model [11, 12]; when the LM is influenced by unfaithful tendencies, the latent world model could remain truthful even if the LM outputs falsehoods (Fig. 1). Thus, the state of the latent world model could provide more faithful information than the model outputs.

We extract the latent world state by introducing propositional probes, which probe the internal activations of the input context for logical propositions representing a world state. For example, the input context “The CEO and the nurse live in Thailand and France respectively.” would be represented by the propositions $\{\mathsf{LivesIn(CEO, Thailand)}, \mathsf{LivesIn(Nurse, France)}\}$ (Fig. 1).

The key difficulty in predicting propositions is binding entities in predicates: the space of propositions is exponentially large, so we need to exploit compositionality. Our propositional probes probe individual tokens for lexical information, and bind them together to form propositions (Fig. 2). To determine binding, we use a novel Hessian-based algorithm to identify the “binding subspace” [13], a subspace of activation space in which bound tokens have similar activations. We find that the identified binding subspace causally mediates [14, 15, 16] binding. In Sec. 4 we describe and evaluate the Hessian algorithm, and in Sec. 5 we discuss how propositional probes are constructed.

We validate propositional probes in a closed-world setting about people and their attributes with a finite number of predicates and properties. We generate synthetic training data by populating templates with random propositions. To evaluate their generalizability, we rewrite the templated data into short stories using GPT-3.5-turbo, and translate the short stories into Spanish. We find that propositional probes generalize to these complex settings (Sec. 5.3).

We then apply propositional probes to three situations where language models behave unfaithfully, and show that they reflect the input context more faithfully than the outputs of the language models (Sec. 6). Specifically, we study cases where the language model is influenced by prompt injections, backdoors, and gender bias. Our results indicate that propositional probes, if scaled sufficiently, could be applied to monitoring language models at inference time.

Figure 1: Left: LMs construct representations of their inputs. They condition on these internal activations to answer queries (Prompting), but we extract symbolic propositions from them (Propositional Probes). Right: LMs may exhibit tendencies towards unfaithfulness, which we hypothesize will influence final responses to prompts but not the early-stage activations detected by our probes.
Figure 2: Left: Name probes (blue) and country probes (green) classify activations into either a name/country or a null value. Right: Activations have a content component and a binding component, such that bound activations have similar binding components. We use this to compose across tokens.

2 Related Work

Probing   Probes have been shown to be effective at decoding lexical information from language model activations in many domains  [17], such as color  [18], gender  [19], and space  [20]. These insights inform how we construct constituent domain probes. In addition, propositional probes have also been studied in autoregressive models trained on Othello games  [12] and in language models fine-tuned on closed-world Alchemy and TextWorld environments [11]. Lastly, probes have been used to study representations of factuality [21, 22] with respect to static knowledge learned from training; in contrast we study faithfulness with respect to dynamic knowledge described in context.

Monitoring Language Models   Researchers have argued that probes that extract latent knowledge in machine learning systems can provide monitoring tools necessary for usability and safety  [9, 10]. To benchmark monitoring methods, researchers have proposed an anomaly detection framework [23, 24]: probes should be trained only on simple trusted data and tested on hard, untrusted data. Our evaluations are consistent with this principle.

Representations of binding   Researchers have long studied binding in connectionist models and human minds [25, 26, 27, 28]. In recent statistical language models, researchers have shown that representations of semantic roles and coreferences emerge from pretraining  [29, 30, 31]. We build on the “binding vectors” discovered in language model activations  [13, 32], which we describe later.

3 Preliminaries

Closed-world datasets   We consider a closed world about people and their attributes. Each person has a name, country of origin, occupation, and a food they like. We describe an instance of the world with a set of propositions, such as $\mathsf{LivesIn(Carol, Italy)}$. Each proposition consists of one of three predicates (“LivesIn”, “WorksAs”, “LikesToEat”), each of which binds a name to a country, occupation, or food respectively.

In greater detail, we construct four domains, denoted as $\mathcal{D}_0, \dots, \mathcal{D}_3$, which correspond to names, countries, occupations, and foods respectively. Each domain $\mathcal{D}_k$ is a set of $|\mathcal{D}_k|$ values, e.g. $\mathcal{D}_0$ is a set of names. Then, a proposition consists of a predicate (either “LivesIn”, “WorksAs”, or “LikesToEat”), a name $E \in \mathcal{D}_0$, and an attribute $A$ from $\mathcal{D}_1$, $\mathcal{D}_2$, or $\mathcal{D}_3$ respectively.

In our setting, the predicates correspond one-to-one to the attribute domains, which implies that a proposition can be determined by a value from the names domain and a value from one of the three attribute domains; the predicate can be inferred from the domain the attribute came from. Our method utilizes this observation. In addition, for brevity we sometimes drop the predicate, e.g. $\mathsf{(Carol, Italy)}$ instead of $\mathsf{LivesIn(Carol, Italy)}$.

To create the input context, we generate random propositions about 2 people. For each person, we generate a random name, country, occupation and food, and instantiate the three propositions about them. We make sure that the two people do not share any attribute. These propositions are formatted in a template to produce the synth dataset, rewritten using GPT-3.5-turbo into a short story (para), and then translated into Spanish (trans). See Fig. 3 for an example and Appendix A for details.
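The sampling step above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the value lists, the template wording, and the helper name `sample_context` are all hypothetical (the actual domains and templates are described in Appendix A).

```python
import random

# Hypothetical toy domains; the real value lists are given in Appendix A.
NAMES = ["Carol", "Greg", "Laura", "Samuel"]
COUNTRIES = ["Italy", "Thailand", "France", "Kenya"]
OCCUPATIONS = ["nurse", "physicist", "chef", "pilot"]
FOODS = ["pasta", "sushi", "tacos", "curry"]

# Hypothetical template, not the paper's wording.
TEMPLATE = "{name} lives in {country}. {name} works as a {occupation}. {name} likes to eat {food}."


def sample_context(rng):
    """Sample two people who share no name or attribute; return (propositions, text)."""
    # rng.sample draws without replacement, so the two people never share a value.
    names = rng.sample(NAMES, 2)
    countries = rng.sample(COUNTRIES, 2)
    occupations = rng.sample(OCCUPATIONS, 2)
    foods = rng.sample(FOODS, 2)

    propositions, sentences = [], []
    for name, country, occupation, food in zip(names, countries, occupations, foods):
        propositions += [
            ("LivesIn", name, country),
            ("WorksAs", name, occupation),
            ("LikesToEat", name, food),
        ]
        sentences.append(
            TEMPLATE.format(name=name, country=country, occupation=occupation, food=food)
        )
    return propositions, " ".join(sentences)


propositions, text = sample_context(random.Random(0))
```

Each call yields six propositions (three per person) together with the templated text that would form one synth example.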

Figure 3: Dataset creation: Sets of propositions about 2 people are generated randomly. Each set is formatted by a template (synth), rewritten into a story (para), and translated into Spanish (trans).

Domain Probes   For every domain, we train a domain probe that linearly classifies activations at individual token positions into either a value in the domain, or a null value, indicating that none of the values is represented. For example, Fig. 2 (left) shows the outputs of the name probe and the country probe. In greater detail, the language model converts a context with $S$ tokens, $T_0, \dots, T_{S-1}$, into internal activations $Z_0, \dots, Z_{S-1}$. For every domain $\mathcal{D}_k$, we train a probe $P_k$ that takes as input an activation $Z_s$ at any token position $s$, and outputs a value in $\mathcal{D}_k \cup \{\bot\}$. In Sec. 5.1 we describe how we use simple linear probes to implement domain probes.
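As a rough illustration of the interface of a domain probe $P_k$, the sketch below scores an activation against each value with a linear map and returns the null value when no score is high enough. The thresholding rule, the weight matrix `W`, and the function name `domain_probe` are assumptions made for this example; the construction actually used is the one described in Sec. 5.1.

```python
import numpy as np


def domain_probe(z, W, values, threshold):
    """Return the highest-scoring domain value, or None (the null value)."""
    scores = W @ z                      # one linear score per value in the domain
    best = int(np.argmax(scores))
    return values[best] if scores[best] > threshold else None


values = ["Italy", "France", "Thailand"]
W = np.eye(3, 5)                        # toy probe: 3 values, 5-dimensional activations
z_country = np.array([0.1, 2.0, 0.0, 0.3, -0.2])   # mostly aligned with "France"
z_other = np.zeros(5)                   # a token that represents no country

print(domain_probe(z_country, W, values, threshold=1.0))
print(domain_probe(z_other, W, values, threshold=1.0))
```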

Binding subspace   We use the binding subspace to compose values identified by domain probes to form propositions. Prior work  [13] suggested that activations can be linearly decomposed into two components: a content vector and a binding vector (Fig. 2, right); moreover, two activations are bound if and only if their binding vectors are similar. Additionally, they suggested that binding vectors may form a binding subspace, which we identify in this work using a Hessian-based algorithm (Sec. 4.1). We use the binding subspace to form a binding similarity metric between token activations that informs how domain probe results are bound together in propositions.

Models   Throughout the paper we use the Tulu-2 13 billion parameter model [33], an instruction-tuned version of the Llama 2 base model [34]. Our evaluations on Llama-2-13b-chat did not differ significantly. The models have $n_{\text{layers}} = 40$ layers and $d_{\text{model}} = 5120$ embedding dimensions.

4 Identifying the binding subspace

In this section, we present a Hessian-based algorithm for identifying the binding subspace (Sec. 4.1), which is loosely inspired by the use of the Jacobian in an unrelated task [35]. To evaluate the binding subspace, we perform quantitative causal interventions as well as qualitative analysis (Sec. 4.2).

4.1 Hessian-based algorithm

In this section, we first motivate the Hessian-based algorithm with a thought experiment. We then describe the algorithm, which produces a matrix $H$ that characterizes the binding subspace. Finally, we describe how the binding subspace is extracted from $H$, and how we use $H$ to construct a binding similarity metric $d(Z_x, Z_y)$ that measures how bound any two activations $Z_x, Z_y$ are.

Let us imagine there really is a binding subspace where two activations $Z_x$ and $Z_y$ are bound only if $Z_x^\top H Z_y$ is large for some low-rank $H$. Then, suppose we take two initially unbound activations $Z_x$ and $Z_y$ and perturb the first in direction $x$ and the latter in direction $y$. The binding strength should increase only if directions $x$ and $y$ align under $H$. In fact, if there is a measure of binding strength $F(x, y)$ that is bilinear in $x, y$, then $H$ is precisely the second derivative $\nabla_x \nabla_y F(x, y)$.
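To spell out the last step, suppose the bilinear measure takes the form $F(x, y) = x^\top H y$ (an added constant offset would not change the derivative). Differentiating once in each argument recovers $H$ entry by entry:

```latex
F(x, y) = x^\top H y = \sum_{i,j} x_i H_{ij} y_j
\quad\Longrightarrow\quad
\frac{\partial^2 F}{\partial x_i \, \partial y_j} = H_{ij},
\quad\text{i.e.}\quad
\nabla_x \nabla_y F(x, y) = H .
```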

To instantiate this, we need a way of creating unbound activations, and also of measuring binding strength. To create unbound activations, we construct activations that are initially bound and subtract away the binding information so that the binding vectors become indistinguishable. Specifically, we consider a context with two propositions $\{(E_0, A_0), (E_1, A_1)\}$, and let $Z_{E_i}$ and $Z_{A_i}$ for $i = 0, 1$ be the activations for $E_i$ and $A_i$ respectively. Prior work [13] showed we can decompose the activation $Z_{E_0}$ into a content part $f_{E_0}$ and a binding part $b_{E_0}$, and likewise for $E_1$, $A_0$, and $A_1$. We follow the averaging procedure in prior work (explained later) to estimate differences between the binding vectors, $\Delta_E \triangleq b_{E_1} - b_{E_0}$ and $\Delta_A \triangleq b_{A_1} - b_{A_0}$. Then, to erase binding information, we add $0.5\Delta_E$ and $0.5\Delta_A$ to $Z_{E_0}$ and $Z_{A_0}$, and subtract the same quantities from $Z_{E_1}$ and $Z_{A_1}$. This moves the binding vectors to their midpoint and so should cause them to be indistinguishable from each other.

To measure binding strength, we append to the context a query that asks for the attribute bound to either $E_0$ or $E_1$. For example, if the attributes are the countries of residence of the entities, we would ask “Which country does $E_0$/$E_1$ live in?”. We measure the probability assigned to the correct answer ($A_0$ and $A_1$ respectively), and take the average over the two queries as the measure of binding strength $F(x, y)$, where $x$ and $y$ are the perturbations added to the unbound entity and attribute activations. Therefore, when all activations are unbound ($x = y = 0$), the model takes a random guess between the two attributes, and so $F(0, 0) = 0.5$. If $x$ and $y$ are aligned under $H$, we expect $F(x, y) > 0.5$.

Thus, in sum, our overall method first estimates the differences in binding vectors $\Delta_E, \Delta_A$, uses them to erase binding information in a two-proposition context, and then looks at which directions would add the binding information back in. Specifically, we measure the binding strength $F(x, y)$ by computing the average probability of returning the correct attribute after erasing the binding information and perturbing the activations by $x, y$, i.e. after the interventions

$$Z_{E_0} \leftarrow Z_{E_0} + 0.5\Delta_E + x, \quad Z_{E_1} \leftarrow Z_{E_1} - 0.5\Delta_E - x,$$
$$Z_{A_0} \leftarrow Z_{A_0} + 0.5\Delta_A + y, \quad Z_{A_1} \leftarrow Z_{A_1} - 0.5\Delta_A - y.$$

The matrix $H$ is obtained as the second derivative $\nabla_x \nabla_y F(x, y)$. For tractability, we parameterize $x$ and $y$ so that they are shared across layers, i.e. $x, y \in \mathbb{R}^{d_{\text{model}}}$ instead of $\mathbb{R}^{d_{\text{model}} \times n_{\text{layers}}}$.
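The mixed second derivative at $x = y = 0$ can be illustrated on a toy stand-in for $F$. The sketch below is not the paper's implementation: `F_toy` is an assumed sigmoid of a bilinear form (chosen so that $F(0, 0) = 0.5$), `H_true` is a made-up low-rank matrix, and central finite differences are used only to keep the example dependency-light; with a real model one would more plausibly rely on automatic differentiation.

```python
import numpy as np


def mixed_hessian(F, d, h=1e-2):
    """Central finite-difference estimate of d^2 F / (dx_i dy_j) at x = y = 0."""
    H = np.zeros((d, d))
    basis = np.eye(d)
    for i in range(d):
        for j in range(d):
            x, y = h * basis[i], h * basis[j]
            H[i, j] = (F(x, y) - F(x, -y) - F(-x, y) + F(-x, -y)) / (4 * h**2)
    return H


# Toy stand-in for the binding-strength measure (an assumption, not the LM).
rng = np.random.default_rng(0)
d, rank = 6, 2
H_true = rng.normal(size=(d, rank)) @ rng.normal(size=(rank, d))


def F_toy(x, y):
    return 1.0 / (1.0 + np.exp(-(x @ H_true @ y)))   # equals 0.5 at x = y = 0


H_est = mixed_hessian(F_toy, d)
```

Because the sigmoid has slope 0.25 at zero, the estimate here equals `0.25 * H_true` up to numerical error, so the low-rank structure of `H_true` is preserved.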

To obtain the low-rank binding subspace from $H$, we take its singular value decomposition $H = U S V^\top$, where $S$ is a diagonal matrix with decreasing entries. We expect the binding subspace to be the top $k$-dimensional subspaces of $U$ and $V$, $U_{(k)}, V_{(k)} \in \mathbb{R}^{d_{\text{model}} \times k}$, for a relatively small value of $k$.

Binding similarity metric   To turn the binding subspace $U_{(k)}$ into a similarity metric between two activations $Z_s$ and $Z_t$, we take the activations $Z_s^{(l)}, Z_t^{(l)}$ at a certain layer $l$, project them into $U_{(k)}$, and compute their inner product under the metric induced by $S$:

$$d(Z_s, Z_t) \triangleq Z_s^{(l)\top} U_{(k)} S_{(k)}^2 U_{(k)}^\top Z_t^{(l)}. \tag{1}$$

We choose $l = 15$ for our models. We discuss this choice and other practical details in Appendix C.
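The subspace extraction and the similarity metric of Eq. (1) can be sketched together on synthetic data. Everything below the function is a toy assumption: a rank-2 matrix `H`, a basis `B` for a made-up binding subspace, and activations built as content plus a binding component with the content projected out of that subspace.

```python
import numpy as np


def binding_similarity(z_s, z_t, H, k):
    """d(Z_s, Z_t) = Z_s^T U_k S_k^2 U_k^T Z_t, following Eq. (1)."""
    U, S, _ = np.linalg.svd(H)          # singular values are returned in decreasing order
    U_k, S_k = U[:, :k], S[:k]
    return (z_s @ U_k) @ (S_k**2 * (U_k.T @ z_t))


# Toy setup (an assumption): a rank-2 H and activations = content + binding part.
rng = np.random.default_rng(0)
d, k = 8, 2
B, _ = np.linalg.qr(rng.normal(size=(d, k)))   # orthonormal basis of the toy binding subspace
H = B @ np.diag([2.0, 1.0]) @ B.T


def activation(binding_coords):
    content = rng.normal(size=d)
    content -= B @ (B.T @ content)              # keep content outside the binding subspace
    return content + B @ np.asarray(binding_coords)


z_E0, z_A0 = activation([3.0, 0.0]), activation([3.0, 0.0])      # one bound pair
z_E1, z_A1 = activation([-3.0, 0.0]), activation([-3.0, 0.0])    # the other bound pair

bound = binding_similarity(z_E0, z_A0, H, k)
unbound = binding_similarity(z_E0, z_A1, H, k)
```

In this toy construction the bound pair scores positive and the unbound pair scores negative, regardless of the random content vectors, because only the components inside the top-$k$ subspace enter the metric.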

Estimating $\Delta_E$ and $\Delta_A$   Following prior work [13], to estimate $\Delta_E = b_{E_1} - b_{E_0}$ we take the average difference $Z_{E_1} - Z_{E_0}$ across 200 contexts with different values of $E_0$ and $E_1$, with the hope that the content-dependent vectors $f_{E_0}, f_{E_1}$ cancel out (likewise for $\Delta_A$).
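The averaging estimate can be illustrated on synthetic activations. This sketch assumes, purely for illustration, that each activation is a fixed binding vector plus an independent zero-mean content vector; the specific vectors and dimensions are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_contexts = 16, 200

# Toy ground truth (an assumption): fixed binding vectors for the two entity slots.
b_E0 = np.zeros(d)
b_E0[0] = 3.0
b_E1 = np.zeros(d)
b_E1[0] = -3.0

# Each context draws fresh content vectors, standing in for different names per context.
Z_E0 = rng.normal(size=(n_contexts, d)) + b_E0
Z_E1 = rng.normal(size=(n_contexts, d)) + b_E1

# Averaging the per-context differences lets the content terms roughly cancel.
delta_E_hat = (Z_E1 - Z_E0).mean(axis=0)
delta_E_true = b_E1 - b_E0
cosine = delta_E_hat @ delta_E_true / (
    np.linalg.norm(delta_E_hat) * np.linalg.norm(delta_E_true)
)
```

Under these assumptions the estimate aligns closely with the true difference; how well the cancellation works on real activations depends on how content and binding actually decompose.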

4.2 Evaluations of the Hessian-based algorithm

In this section, we first show quantitatively that the Hessian-based algorithm provides a subspace that causally mediates binding, and that this subspace generalizes to contexts with three entities even though the Hessian was taken for two-entity contexts. We then qualitatively evaluate our binding subspace by plotting the binding similarity (Eq. 1) for a few input contexts.

Interchange interventions   To evaluate the claim that the $k$-dimensional subspace $U_{(k)}$ causally mediates binding for entities, we perform an interchange intervention [36] on this subspace at every layer. The idea is that if $U_{(k)}$ indeed carries the binding information but not any of the content information, swapping the activations in this subspace between two entities $E_0$ and $E_1$ ought to switch the bound pairs. Specifically, we perform the interventions across all layers $l = 0, \dots, 39$:

$$Z_{E_0}^{(l)} \leftarrow (I - P) Z_{E_0}^{(l)} + P Z_{E_1}^{(l)}, \tag{2}$$
$$Z_{E_1}^{(l)} \leftarrow (I - P) Z_{E_1}^{(l)} + P Z_{E_0}^{(l)}, \tag{3}$$

with $P = U_{(k)} U_{(k)}^\top$. If $U_{(k)}$ correctly captures binding information, then we expect the binding information to have swapped for $E_0$ and $E_1$.
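A single-layer version of the interchange intervention in Eq. (2) and (3) can be sketched as below. The random subspace and activations are placeholders, and the helper name `interchange` is hypothetical; in the setting above the same swap would be applied at every layer.

```python
import numpy as np


def interchange(z_a, z_b, U_k):
    """Swap the components of z_a and z_b inside span(U_k); leave the rest untouched."""
    P = U_k @ U_k.T                     # orthogonal projector, assuming orthonormal columns
    new_a = z_a - P @ z_a + P @ z_b     # (I - P) z_a + P z_b
    new_b = z_b - P @ z_b + P @ z_a     # (I - P) z_b + P z_a
    return new_a, new_b


rng = np.random.default_rng(0)
d, k = 8, 2
U_k, _ = np.linalg.qr(rng.normal(size=(d, k)))     # placeholder orthonormal subspace
z_E0, z_E1 = rng.normal(size=d), rng.normal(size=d)
new_E0, new_E1 = interchange(z_E0, z_E1, U_k)
```

Both updates are computed from the pre-intervention activations, so the swap is simultaneous rather than sequential.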

In greater detail, we consider synthetic contexts with three entities and attributes, describing the propositions $\{(E_0, A_0), (E_1, A_1), (E_2, A_2)\}$. For any pair $E_i, E_j$, $i \neq j$, we apply interchange interventions (Eq. (2), (3)) to swap binding information between $Z_{E_i}$ and $Z_{E_j}$. If the intervention succeeded, we expect $E_i$ to now bind to $A_j$ and $E_j$ to bind to $A_i$. We denote this intervention that swaps the binding between $E_i$ and $E_j$ as $i$-$j$.

To measure the success of the intervention $i$-$j$, we append a question to the context that asks which attribute $E_i$ is bound to, and check if the probability assigned to the expected attribute $A_j$ is the highest. We do the same for $E_j$, as well as the last entity that we do not intervene on. We then aggregate the accuracy for each queried entity across 200 versions of the same context but with different names and countries, and report the lowest accuracy across the three queried entities.

We apply the interchange intervention with different dimensions $k$ for entities using $U_{(k)}$, and for attributes using subspace $V_{(k)}$ with analogous interventions (Eq. (2), (3)) on $Z_{A_i}$ and $Z_{A_j}$ instead (Fig. 4). As a random baseline, we take the SVD of a random matrix instead of the Hessian. As an additional baseline, we implement Distributed Alignment Search (DAS) [37]. Specifically, we use gradient descent to find a fixed-dimensional subspace that enables interchange interventions between the two entities in two-entity contexts, for various choices of subspace dimension.

As a skyline, we evaluate the two-dimensional subspace spanned by the differences in binding vectors of the three entities. We obtain these difference vectors in the same way as $\Delta_E$ and $\Delta_A$ used in the computation of the Hessian: we take samples of $Z_{E_0}$, $Z_{E_1}$, and $Z_{E_2}$ across 200 contexts with different values of entities, average across the samples, and take differences between the three resultant mean activations. We consider this a skyline because this subspace is obtained from three entities’ binding vectors, whereas the Hessian and DAS are computed using contexts with only two entities.
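The construction of the skyline subspace can be sketched as follows: average the sampled activations per entity position, take differences of the means, and orthonormalize. This is a minimal, dependency-free sketch; the helper names (`mean_vector`, `gram_schmidt`, `skyline_subspace`) and the toy samples are hypothetical, and the choice of Gram-Schmidt for orthonormalization is an assumption made here for illustration. Only two of the three pairwise differences are used, since the third is a linear combination of the other two.

```python
import math


def mean_vector(samples):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def gram_schmidt(vectors, tol=1e-10):
    """Orthonormalize vectors, dropping any that are (numerically) dependent."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = sum(w_d * b_d for w_d, b_d in zip(w, b))
            w = [w_d - coeff * b_d for w_d, b_d in zip(w, b)]
        norm = math.sqrt(sum(w_d * w_d for w_d in w))
        if norm > tol:
            basis.append([w_d / norm for w_d in w])
    return basis


def skyline_subspace(samples_e0, samples_e1, samples_e2):
    """Basis of the span of differences between mean activations of three entities."""
    m0, m1, m2 = (mean_vector(s) for s in (samples_e0, samples_e1, samples_e2))
    diffs = [
        [b - a for a, b in zip(m0, m1)],  # mean(E1) - mean(E0)
        [b - a for a, b in zip(m0, m2)],  # mean(E2) - mean(E0)
    ]
    return gram_schmidt(diffs)


# Toy samples (two per entity position) in a 3-dimensional activation space.
samples_e0 = [[0.0, 0.0, 1.0], [0.0, 0.0, 3.0]]
samples_e1 = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
samples_e2 = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]
skyline_basis = skyline_subspace(samples_e0, samples_e1, samples_e2)
```

The returned basis can then play the role of $U_{(k)}$ with $k = 2$ in the interchange intervention.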

The results show that the top 50 dimensions of the Hessian (out of 5120) are enough to capture binding information. Moreover, despite being obtained from contexts with two entities, the subspace correctly identifies the direction of the third binding vector, in that it enables the swaps “0-2” and “1-2”. In contrast, random subspaces of all dimensions fail to switch binding information without also switching content. Further, while DAS finds subspaces that successfully swap the first two binding vectors, they do not enable swaps between the second and third binding vectors (1-2). Interestingly, the top 50 dimensions of the Hessian outperform the skyline for swapping between $E_1$ and $E_2$ (and $A_1$ vs $A_2$). This could be because the skyline subspace is obtained from differences in binding vectors, which could contain systematic errors that do not contribute to binding, or perhaps because of variability in binding vectors in individual inputs. We discuss details in Appendix D.

Figure 4: The accuracy of swapping binding information in name (attribute) activations by projecting into $U_{(k)}$ ($V_{(k)}$) against $k$ in a context with 3 names and 3 attributes. We test the subspaces from the Hessian (blue), a random baseline (orange), and a skyline subspace obtained by estimating the subspace spanned by the first 3 binding vectors. We perform all 3 pairwise switches: 0-1 represents swapping the binding information of $E_0$ and $E_1$ ($A_0$ and $A_1$), and so on.

Qualitative evaluations   We visualize the binding metric in various contexts by plotting pairwise binding similarities (Eq. 1) between token activations. Specifically, on a set of activations $Z_0, \dots, Z_{S-1}$, we compute the matrix $M_{st} = d(Z_s, Z_t)$. We show the similarity matrix for three contexts in Fig. 5. We first evaluate an input context in which the entities are bound serially: the first sentence binds $E_0$ to $A_0$, and the second binds $E_1$ to $A_1$. To ensure that the binding subspace is picking up on binding, and not something spurious such as sentence boundaries, we evaluate an input context in which entities are bound in parallel: there is now only one sentence that binds $E_0$ and $E_1$ to $A_0$ and $A_1$ respectively. In both the serial context (Fig. 5 left) and parallel context (Fig. 5 middle), the activations are clustered based on which entity they refer to. In addition, we plot the similarity matrix for a context with three entities (Fig. 5 right). Interestingly, the binding metric does not clearly discriminate between the second and third entities even though the interchange interventions showed that the binding subspace captures the difference in binding between them. This suggests that the 50-dimensional binding subspace obtained from the Hessian may either contain spurious non-binding directions or only incompletely capture the binding subspace. Thus, our current methods may be too noisy for contexts with more than two entities. In Appendix E, we use similar plots to show that coreferred entities share the same binding vectors.
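The pairwise matrix $M_{st} = d(Z_s, Z_t)$ can be sketched as below. Eq. 1 is not reproduced in this section, so the sketch assumes that $d$ is a bilinear form $d(x, y) = x^{\top} H y$ for some matrix $H$ (in line with the "binding form $H$" named in Alg. 1); the function names and the toy $H$ and activations are hypothetical.

```python
def bilinear(x, H, y):
    """Assumed form of the binding similarity: d(x, y) = x^T H y."""
    return sum(
        x[i] * H[i][j] * y[j] for i in range(len(x)) for j in range(len(y))
    )


def similarity_matrix(Z, H):
    """M[s][t] = d(Z[s], Z[t]) for every pair of token activations."""
    return [[bilinear(z_s, H, z_t) for z_t in Z] for z_s in Z]


# Toy setup: the first coordinate carries binding, the second carries content.
H = [[1.0, 0.0],
     [0.0, 0.0]]
Z = [
    [1.0, 5.0],    # entity 0
    [1.0, -2.0],   # attribute 0
    [-1.0, 5.0],   # entity 1
    [-1.0, -2.0],  # attribute 1
]
M = similarity_matrix(Z, H)
```

In the toy matrix, each entity is more similar to its own attribute than to the other attribute, which is the block structure one would look for in plots such as Fig. 5.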

Figure 5: Similarity between token activations under the binding similarity metric $d(\cdot, \cdot)$ for two-entity serial (left) and parallel (middle) contexts. Right: Three-entity serial context.

5 Propositional Probes

In this section we construct and evaluate propositional probes. To construct propositional probes, we first train 4 domain probes (Sec. 5.1), then compose them together using the binding similarity metric (Eq. 1) obtained from the binding subspace (Sec. 5.2). We then evaluate our probes on the three datasets synth, para, and trans. Despite being trained only on synth, propositional probes perform comparably with a prompting skyline on all three datasets (Sec. 5.3).

5.1 Domain probes

For every domain $\mathcal{D}_k$, we train a probe $P_k$ that classifies the activation at each position $Z_s$ into either a value $P_k(Z_s) \in \mathcal{D}_k$ in the domain or the null class $\bot$. We use the activation at a particular layer $l$ as the input of the probe. We describe later how $l$ is chosen. The probe then has the type $P_k : \mathbb{R}^{d_{\text{model}}} \rightarrow \mathcal{D}_k \cup \{\bot\}$.

We parameterize the probe with $|\mathcal{D}_k|$ vectors, $u_k^{(0)}, \dots, u_k^{(|\mathcal{D}_k|-1)} \in \mathbb{R}^{d_{\text{model}}}$, as well as a threshold $h_k \in \mathbb{R}$. Each vector can be thought of as a direction in activation space corresponding to a value in the domain. The classification of the probe is simply the value whose vector has the highest dot product with the activation. When all vectors have a dot product smaller than the threshold, the probe classifies $\bot$ instead. Formally,

$$P_k(Z) = \begin{cases} \operatorname*{arg\,max}_i \; u_k^{(i)} \cdot Z, & \text{if } \max_i \; u_k^{(i)} \cdot Z > h_k \\ \bot, & \text{otherwise} \end{cases}$$
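The decision rule above can be sketched in a few lines. This is an illustrative sketch only: the function name `domain_probe`, the use of `None` for the null class $\bot$, and the toy vectors and threshold are assumptions made here, and the probe returns the index $i$ of the winning value.

```python
NULL = None  # stands in for the null class


def domain_probe(vectors, threshold, z):
    """Return the index of the value whose vector has the largest dot product
    with z, or NULL if no dot product exceeds the threshold."""
    scores = [sum(u_d * z_d for u_d, z_d in zip(u, z)) for u in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] > threshold else NULL


# Toy probe with two values in a 2-dimensional activation space.
vectors = [[1.0, 0.0], [0.0, 1.0]]
threshold = 0.5
```

For example, `domain_probe(vectors, threshold, [0.2, 0.9])` selects the second value, while an activation such as `[0.1, 0.2]`, whose dot products all fall below the threshold, is assigned the null class.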

To learn the vectors, we generate a dataset of activations and their corresponding values. Then, we set each vector to the mean of the activations with that value. Finally, we subtract from each vector the average vector $\frac{1}{|\mathcal{D}_k|} \sum_i u_k^{(i)}$. This can be seen as a multi-class generalization of the so-called difference-in-means probes [24].
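A minimal sketch of this fitting procedure is given below, assuming that token-level labels are already available as integer indices into the domain. The function name `fit_probe_vectors` and the toy data are hypothetical; the threshold $h_k$ is not fitted here, since its selection is not described in this section.

```python
def fit_probe_vectors(activations, labels, num_values):
    """Per-value mean activations, centred by subtracting the mean of the means."""
    dim = len(activations[0])
    sums = [[0.0] * dim for _ in range(num_values)]
    counts = [0] * num_values
    for z, label in zip(activations, labels):
        counts[label] += 1
        for d in range(dim):
            sums[label][d] += z[d]
    if min(counts) == 0:
        raise ValueError("every value needs at least one labelled activation")
    # Mean activation for each value in the domain.
    means = [[s / n for s in row] for row, n in zip(sums, counts)]
    # Average of the per-value means, subtracted from each of them.
    centre = [sum(col) / num_values for col in zip(*means)]
    return [[m - c for m, c in zip(row, centre)] for row in means]


# Toy dataset: two values, two labelled activations each.
activations = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
labels = [0, 0, 1, 1]
probe_vectors = fit_probe_vectors(activations, labels, num_values=2)
```

With two values, the centred vectors are negatives of each other, which recovers the usual two-class difference-in-means direction up to a factor of one half.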

To generate this dataset, we use synthetic templates. However, this only provides context-level supervision: we know that the activations in the context collectively represent a certain value in the domain, but we do not know which activations represent the value and which represent $\bot$. Thus, we have to assign a ground-truth label to the activation at each token position.

To do so, we use a Grad-CAM-style attribution technique  [38] similar to that used by Olah et al. [39]. Broadly speaking, we backpropagate through the model to estimate how much the activation at each layer/token position contributes towards the model’s knowledge of the content. This identifies both the token position that carries value information and the layers that are the most informative.

The attribution results indicate that the middle layers at the last token position are the most informative. We thus choose layer $l = 20$ (out of 40 layers). In Appendix G we discuss the attribution results further. In Sec. 5.3 we evaluate the accuracy of domain probes on the synth dataset on which they are trained, as well as the para and trans datasets.

5.2 Propositional probes

With the binding similarity (Eq. 1) obtained from the binding subspace, we can compose domain probes with a simple lookup algorithm. We first identify all the names mentioned in the context with the name domain probe $P_0$. Then, for every other domain probe $P_k$, we identify the values it picks up in the context, and for each of these values we look for the best matching name to compose together. The pseudocode is described in Alg. 1.

Algorithm 1 Lookup algorithm to propose predicates
Require: Domain probes $\{P_k\}_k$ and binding form $H$.
procedure ProposePredicates($\{Z_s\}_s$)
    $N \leftarrow$ subset of $\{Z_s\}_s$ for which $P_0$ is not $\bot$    ▷ Detect names
    for all $P_k$, $k > 0$ do
        $V \leftarrow$ subset of $\{Z_s\}_s$ for which $P_k$ is not $\bot$
        for all $Z_v$ in $V$ do
            $n \leftarrow \operatorname*{arg\,max}_{i \in N} d(Z_i, Z_v)$    ▷ Find best matching name
            Propose new proposition $(n, v)$
        end for
    end for
    return all proposed propositions
end procedure
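The lookup in Alg. 1 can be sketched in Python as follows. The sketch treats each domain probe as a callable that returns a value or `None` (standing in for $\bot$) and takes the binding similarity $d$ as a callable; these interfaces, the function name `propose_predicates`, the `(name, k, value)` output format, and the toy activations are all assumptions made for illustration.

```python
def propose_predicates(activations, probes, d):
    """Pair every detected non-name value with its best-matching name.

    probes[0] is the name probe; probes[k] for k > 0 are the other domain probes.
    Returns a list of (name, k, value) triples.
    """
    # Detect names: keep (activation, name) pairs where the name probe fires.
    names = []
    for z in activations:
        name = probes[0](z)
        if name is not None:
            names.append((z, name))

    propositions = []
    if not names:
        return propositions
    for k in range(1, len(probes)):
        for z_v in activations:
            value = probes[k](z_v)
            if value is None:
                continue
            # Find the name whose activation is most similar under d.
            _, best_name = max(names, key=lambda pair: d(pair[0], z_v))
            propositions.append((best_name, k, value))
    return propositions


# Toy activations [binding coordinate, content id] for
# "Greg is a nurse. Laura is a physicist."
toy_activations = [[1, 1], [1, 3], [-1, 2], [-1, 4]]
name_probe = lambda z: {1: "Greg", 2: "Laura"}.get(z[1])
job_probe = lambda z: {3: "nurse", 4: "physicist"}.get(z[1])
toy_d = lambda x, y: x[0] * y[0]  # similarity on the binding coordinate only
result = propose_predicates(toy_activations, [name_probe, job_probe], toy_d)
```

On the toy input, the sketch pairs "nurse" with "Greg" and "physicist" with "Laura", because matching binding coordinates yield the larger similarity.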

5.3 Evaluations of domain and propositional probes

Figure 6: Left: Accuracy of domain probes on synth. Middle: Exact match accuracy and Jaccard index of probing and prompting. “DAS-1”, “DAS-50”, and “random” are ablations of propositional probes where we use the 1-dim and 50-dim DAS subspace or a random 50-dim subspace as the binding subspace. Right: Exact match accuracy and Jaccard index of probing and prompting on prompt-injected (p) and backdoored (ft) settings. All: Error bars indicate standard error.

In this section we evaluate propositional probes on the three datasets, synth, para, and trans. The constituents of propositional probes, domain probes and the binding subspace, are obtained from templated data, and we thus view para and trans as evaluating out-of-domain generalization. We first evaluate each of the four domain probes in isolation, and then evaluate the propositional probe.

Domain probes   We evaluate the exact-match accuracy of each of the four domain probes at predicting the correct set of values that are in the context for the three datasets (Fig. 6 left). Each context contains $N = 2$ values, and domain probes have to get both right. Despite being trained only on a synthetic dataset, we find that domain probes generalize to the paraphrased dataset para, and even the Spanish dataset trans.

Propositional probes   We evaluate the Jaccard index and exact-match accuracy between the ground-truth propositions and the set of propositions returned by the propositional probe (or prompting skyline). Specifically, for an input context, let the set of propositions returned by the propositional probe (or prompting) be $A$, and the ground-truth set of propositions be $B$. The exact-match accuracy is the fraction of contexts for which $A = B$, and the Jaccard index is the average value of $|A \cap B| / |A \cup B|$. Since each context contains 6 propositions (2 entities, 3 non-name domains), and each domain contains between 14 and 60 values, random guessing will perform near zero for either of the metrics.
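The two metrics can be computed as in the following sketch. The function names and the toy propositions are hypothetical, and scoring two empty sets as a Jaccard index of 1 is a convention assumed here, not one stated in the text.

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets are scored as 1.0 by assumption."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def evaluate(predicted, truth):
    """Exact-match accuracy and mean Jaccard index over a list of contexts."""
    n = len(truth)
    exact = sum(set(p) == set(t) for p, t in zip(predicted, truth)) / n
    mean_jaccard = sum(jaccard(p, t) for p, t in zip(predicted, truth)) / n
    return exact, mean_jaccard


# Two toy contexts; the second prediction gets one of two propositions wrong.
truth = [
    {("Greg", "nurse"), ("Laura", "physicist")},
    {("Greg", "nurse"), ("Laura", "physicist")},
]
predicted = [
    {("Greg", "nurse"), ("Laura", "physicist")},
    {("Greg", "nurse"), ("Laura", "nurse")},
]
exact, mean_jaccard = evaluate(predicted, truth)
```

Here the second context shares one proposition out of three distinct ones, so its Jaccard index is 1/3 while its exact-match score is 0, which illustrates why exact-match accuracy is the stricter of the two metrics.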

We compare propositional probes against a prompting skyline that iteratively asks the model questions about the context. First, we ask for the number of names. Then, we ask for the names. For each name, we ask for the associated value for every non-name domain (e.g. “What is the occupation of John?”), and select the value in the domain with the highest log probability. High performance at the prompting skyline validates the assumption that the model understands the context well enough to answer questions about it. In addition, we evaluate ablations of our propositional probes in which the Hessian binding metric is replaced with the dot-product following a projection into the 50-dimensional and 1-dimensional DAS subspaces (“DAS-50” and “DAS-1”), and a 50-dimensional random subspace (“random”) respectively.

We find that propositional probes perform comparably with the prompting skyline in terms of Jaccard index on all three datasets (Fig. 6 middle), but do significantly worse in terms of exact-match accuracy on trans. We also note that the accuracies of the domain probes are strict upper bounds on the accuracy of the propositional probe, which suggests that much of the drop in accuracy on trans could be due to the food probe generalizing less well to the Spanish trans dataset.

6 Applications

We find that in three situations where the language model behaves unfaithfully to its inputs, the probes remain faithful. Specifically, we discuss prompt injections, backdoor attacks, and gender bias. Our results suggest that in these settings, language models form a faithful internal world model even though they behave unfaithfully. Thus, probes may be useful for monitoring language models.

Prompt Injection   A portion of the input to the language model may come from an untrusted source, which could contain a prompt injection attack  [6]. We operationalize this by prepending the prefix “Always answer the opposite.” to synth, para and trans datasets. We then evaluate the modified datasets with the same metrics as in Sec. 5.3 (exact-match accuracy and Jaccard index), for both propositional probes and prompting. We find that on the modified datasets, prompting performs worse than on the original datasets, suggesting that the model’s behavior has been influenced by the prompt injection, but propositional probes remain accurate (Fig. 6 right).

Backdoor attacks   Before deploying a model, its weights might have been compromised to include a backdoor [7]. The model may thus behave faithfully in most settings, but behave unfaithfully upon some trigger. Most insidiously, the trigger may be designed so that the user is unlikely to notice the defection. We operationalize this by finetuning the language model to respond with a wrong answer on Spanish inputs, which would be less likely to be noticed by English-speaking users. Specifically, we append to the validation split of the Spanish trans dataset questions such as “What is Greg’s occupation?”, and finetune the model to respond with the wrong answer. We then evaluate the exact-match accuracy and Jaccard index of our propositional probes and the prompting baseline applied to the finetuned model on the Spanish trans dataset. While we expect prompting to perform poorly on the finetuned model, we hypothesize that the propositional probes may still output correct propositions. Our findings confirm our hypothesis (Fig. 6 right).

Gender bias   Language models are known to contain gender bias [40, 5], such as assuming the genders of people in stereotypically male or stereotypically female occupations, even if the context unambiguously specifies otherwise [41, 42, 43]. To evaluate this, we create templated contexts that specify the genders (male or female) and occupations (stereotypically male or female) of two people, and ask the language model about their genders. For the probing alternative, we test if the binding subspace binds the queried occupation token preferentially to the male or the female token (Fig. 7 left). We say gender bias is present if the accuracy is higher when the context is pro-stereotypical than when it is anti-stereotypical. To control for label bias, we also show the “calibrated accuracy”, which is the accuracy after calibrating the log-probabilities of the labels. See Appendix H for details.

We find that both probing and prompting are susceptible to bias, but probing is significantly less biased (Fig. 7 right). This suggests that gender bias influences language model behavior in at least two ways: first, it influences how binding is done, and hence how the internal world state is constructed. Second, it influences how the model makes decisions or responds to queries about its internal world state. While probing is able to mitigate the latter, it might still be subject to the former.

Figure 7: Left: Anti-stereotypical example. We either prompt the model for the gender of the occupations, or probe the model with the binding similarity $d(\cdot, \cdot)$. Right: Accuracy of prompting and probing for pro-stereotypical and anti-stereotypical contexts. We also show the “calibrated accuracy”, which is designed to reduce label bias (discussion in Appendix H). Error bars indicate standard error.

7 Conclusion

This work presents evidence relating to two hypotheses about language models: first, that language models internally construct symbolic models of input contexts; and second, when language models are influenced by unfaithful tendencies, these internal models may remain faithful to the input context even if the outputs of the language models are not.

For the first hypothesis, we develop probes that decode symbolic propositions in a small, closed world from the internal activations of a language model. Our work is primarily enabled by the discovery of the binding mechanism in language models—we believe that this approach could be scaled to larger worlds with more complex semantics if more aspects of how language models represent meaning are discovered, such as representations of role-filler binding [44] and state changes [45].

For the second hypothesis, we showed that our propositional probes are faithful to the input contexts even in settings where the language model outputs tend to be unfaithful. This suggests that propositional probes, when scaled to sufficient complexity to be useful, can serve as monitors on language models at inference time for mitigating adversarial attacks by malicious agents, as well as unintended tendencies and biases learned by the model. The latter could be more insidious: we often only discover surprising tendencies in models after we deploy them [3, 46, 47].

Limitations   Even though the propositional probes were faithful for the three settings we investigated, we have no guarantees that they will remain so for other settings. Thus, while these probes can be used as evidence for detecting unfaithful behavior in language models, we caution against using probes to certify faithfulness in language models. We believe a deeper mechanistic understanding of how these latent states are formed and used for decision making can make progress towards the latter.

Acknowledgments

We thank Yossi Gandelsman and Shawn Im for their helpful feedback on the manuscript. JF acknowledges support from the OpenAI Superalignment Fellowship. JS was supported by the National Science Foundation under Grants No. 2031899 and 1804794. In addition, we thank Open Philanthropy for its support of both JS and the Center for Human-Compatible AI.

References

  • Anil et al. [2024] Cem Anil, Esin Durmus, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Nina Rimsky, Meg Tong, Jesse Mu, Daniel Ford, et al. Many-shot jailbreaking, 2024.
  • Halawi et al. [2023] Danny Halawi, Jean-Stanislas Denain, and Jacob Steinhardt. Overthinking the truth: Understanding how language models process false demonstrations. arXiv preprint arXiv:2307.09476, 2023.
  • Sharma et al. [2023] Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, et al. Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548, 2023.
  • Blodgett et al. [2020] Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. Language (technology) is power: A critical survey of "bias" in NLP. arXiv preprint arXiv:2005.14050, 2020.
  • Liang et al. [2022] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
  • Perez and Ribeiro [2022] Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527, 2022.
  • Wallace et al. [2020] Eric Wallace, Tony Z Zhao, Shi Feng, and Sameer Singh. Concealed data poisoning attacks on NLP models. arXiv preprint arXiv:2010.12563, 2020.
  • Hubinger et al. [2024] Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M Ziegler, Tim Maxwell, Newton Cheng, et al. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566, 2024.
  • Christiano et al. [2021] Paul Christiano, Ajeya Cotra, and Mark Xu. Eliciting latent knowledge. Technical report, Alignment Research Center, 2021.
  • Viégas and Wattenberg [2023] Fernanda Viégas and Martin Wattenberg. The system model and the user model: Exploring AI dashboard design. arXiv preprint arXiv:2305.02469, 2023.
  • Li et al. [2021] Belinda Z Li, Maxwell Nye, and Jacob Andreas. Implicit representations of meaning in neural language models. arXiv preprint arXiv:2106.00737, 2021.
  • Li et al. [2022] Kenneth Li, Aspen K Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv preprint arXiv:2210.13382, 2022.
  • Feng and Steinhardt [2023] Jiahai Feng and Jacob Steinhardt. How do language models bind entities in context? arXiv preprint arXiv:2310.17191, 2023.
  • Geiger et al. [2021a] Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. Advances in Neural Information Processing Systems, 34:9574–9586, 2021a.
  • Vig et al. [2020] Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating gender bias in language models using causal mediation analysis. Advances in Neural Information Processing Systems, 33:12388–12401, 2020.
  • Pearl [2022] Judea Pearl. Direct and indirect effects. In Probabilistic and causal inference: the works of Judea Pearl, pages 373–392. 2022.
  • Mikolov et al. [2013] Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies, pages 746–751, 2013.
  • Abdou et al. [2021] Mostafa Abdou, Artur Kulmizev, Daniel Hershcovich, Stella Frank, Ellie Pavlick, and Anders Søgaard. Can language models encode perceptual structure without grounding? a case study in color. arXiv preprint arXiv:2109.06129, 2021.
  • Bolukbasi et al. [2016] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing systems, 29, 2016.
  • Gurnee and Tegmark [2023] Wes Gurnee and Max Tegmark. Language models represent space and time. arXiv preprint arXiv:2310.02207, 2023.
  • Burns et al. [2022] Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827, 2022.
  • Marks and Tegmark [2023] Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023.
  • Roger et al. [2023] Fabien Roger, Ryan Greenblatt, Max Nadeau, Buck Shlegeris, and Nate Thomas. Benchmarks for detecting measurement tampering, 2023.
  • Mallen and Belrose [2023] Alex Mallen and Nora Belrose. Eliciting latent knowledge from quirky language models. arXiv preprint arXiv:2312.01037, 2023.
  • von der Malsburg [1981] Christoph von der Malsburg. The correlation theory of brain function. Internal Report 81-2, Department of Neurobiology, Max-Planck-Institute for Biophysical Chemistry, 1981.
  • Feldman [1982] Jerome A Feldman. Dynamic connections in neural networks. Biological cybernetics, 46(1):27–39, 1982.
  • Feldman [2013] Jerome Feldman. The neural binding problem(s). Cognitive neurodynamics, 7:1–11, 2013.
  • Treisman [1996] Anne Treisman. The binding problem. Current opinion in neurobiology, 6(2):171–178, 1996.
  • Tenney et al. [2019] Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R Bowman, Dipanjan Das, et al. What do you learn from context? probing for sentence structure in contextualized word representations. arXiv preprint arXiv:1905.06316, 2019.
  • Belinkov et al. [2020] Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. On the linguistic representational power of neural machine translation models. Computational Linguistics, 46(1):1–52, 2020.
  • Peters et al. [2018] Matthew E Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. Dissecting contextual word embeddings: Architecture and representation. arXiv preprint arXiv:1808.08949, 2018.
  • Prakash et al. [2024] Nikhil Prakash, Tamar Rott Shaham, Tal Haklay, Yonatan Belinkov, and David Bau. Fine-tuning enhances existing mechanisms: A case study on entity tracking. arXiv preprint arXiv:2402.14811, 2024.
  • Ivison et al. [2023] Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A Smith, Iz Beltagy, et al. Camels in a changing climate: Enhancing lm adaptation with tulu 2. arXiv preprint arXiv:2311.10702, 2023.
  • Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Hernandez et al. [2023] Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. Linearity of relation decoding in transformer language models. arXiv preprint arXiv:2308.09124, 2023.
  • Geiger et al. [2021b] Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. Advances in Neural Information Processing Systems, 34:9574–9586, 2021b.
  • Geiger et al. [2024] Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah Goodman. Finding alignments between interpretable causal variables and distributed neural representations. In Causal Learning and Reasoning, pages 160–187. PMLR, 2024.
  • Selvaraju et al. [2017] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
  • Olah et al. [2018] Chris Olah, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Katherine Ye, and Alexander Mordvintsev. The building blocks of interpretability. Distill, 3(3):e10, 2018.
  • Orgad and Belinkov [2022] Hadas Orgad and Yonatan Belinkov. Choose your lenses: Flaws in gender bias evaluation. arXiv preprint arXiv:2210.11471, 2022.
  • Zhao et al. [2018] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Gender bias in coreference resolution: Evaluation and debiasing methods. arXiv preprint arXiv:1804.06876, 2018.
  • Rudinger et al. [2018] Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. Gender bias in coreference resolution. arXiv preprint arXiv:1804.09301, 2018.
  • Parrish et al. [2021] Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R Bowman. Bbq: A hand-built bias benchmark for question answering. arXiv preprint arXiv:2110.08193, 2021.
  • Smolensky [1990] Paul Smolensky. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial intelligence, 46(1-2):159–216, 1990.
  • Kim and Schuster [2023] Najoung Kim and Sebastian Schuster. Entity tracking in language models. arXiv preprint arXiv:2305.02363, 2023.
  • Pan et al. [2022] Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The effects of reward misspecification: Mapping and mitigating misaligned models. arXiv preprint arXiv:2201.03544, 2022.
  • Roose [2023] Kevin Roose. A conversation with bing’s chatbot left me deeply unsettled. 2023.
  • Wolf et al. [2019] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
  • Wu et al. [2024] Zhengxuan Wu, Atticus Geiger, Aryaman Arora, Jing Huang, Zheng Wang, Noah D. Goodman, Christopher D. Manning, and Christopher Potts. pyvene: A library for understanding and improving PyTorch models via interventions. 2024. URL arxiv.longhoe.net/abs/2403.07809.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • Robinson et al. [2022] Joshua Robinson, Christopher Michael Rytting, and David Wingate. Leveraging large language models for multiple choice question answering. arXiv preprint arXiv:2210.12353, 2022.

Appendix A Datasets

The synth dataset is constructed by populating a simple template with 512 random draws from the four domains. A validation set, built from another 512 random draws, is used to select thresholds for the domain probes. The template is the following:

The name domain consists of 60 common English first names, each of which is one token wide in the Llama 2 tokenizer. They are:

The country domain consists of 16 countries, each of which is one token wide. They are:

The food domain consists of 41 common foods, each of which is two tokens wide. They are:

The occupation domain consists of 14 occupations, namely the subset of occupations used in the Winobias dataset [41] (MIT license) that are one token wide. They are:

The para dataset is constructed by instructing GPT-3.5-turbo to rewrite the synth dataset into a story. The instructions used are:

The trans dataset is constructed by instructing GPT-3.5-turbo to translate the para dataset into Spanish. The instructions used are:

Appendix B Experimental details

Compute   All of our experiments are conducted on an internal GPU cluster. All experiments require at most 4 A100 80GB GPUs. Computing the Hessian takes about 5 hours. The other experiments take less than an hour to run.

Models   We use the huggingface implementation [48] as well as the TransformerLens library to run the Tulu-2-13B [33] and Llama-2-13B-chat models [34].

Appendix C Details for the Hessian algorithm

Here we provide some details of the Hessian-based algorithm.

Concretely, to construct $F(x,y)$, we use a template that looks like this:

The names and countries are random samples from the name and country domains. To reduce noise, we use 20 contexts, each constructed the same way. $F(x,y)$ itself is the average accuracy over these 20 contexts. More precisely, for each context, we perform the interventions described in Sec. 4.1, and measure the probability of returning the correct country, averaged over the two names we can query. We then average this across the 20 contexts.

Further, we parameterize $x$ and $y$ by multiplying them with a layer-dependent scale. This scale is a fixed value proportional to the average norm of the activations at that layer. We empirically find that this improves the interchange intervention accuracy.

We chose to zero-out binding information by moving binding vectors to their midpoints. We could also have chosen to change $E_0, A_0$ to match $E_1, A_1$, or vice versa. Empirically, mid-point works best of the three.

We chose layer 15 to use when computing the binding similarity (Eq. 1). This choice is motivated by convergent evidence from circuit-level analyses, the Grad-CAM-based attribution, and qualitative analyses of the resultant binding similarity matrices.
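The averaging and midpoint steps described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function names (`midpoint`, `scale_direction`, `average_accuracy`) are hypothetical, vectors are plain Python lists, and the per-context probabilities are assumed to have been obtained from the model separately.

```python
from statistics import mean

def midpoint(b0, b1):
    """Move two binding vectors to their common midpoint.

    After this, the two vectors are identical, so they no longer carry
    any information distinguishing one binding from the other.
    """
    return [(a + b) / 2 for a, b in zip(b0, b1)]

def scale_direction(direction, layer_scale):
    """Multiply a perturbation direction by a fixed, layer-dependent scale.

    The value of layer_scale is an assumption here; the text only says it
    is proportional to the average activation norm at that layer.
    """
    return [layer_scale * d for d in direction]

def average_accuracy(per_context_probs):
    """Average the probability of the correct country.

    per_context_probs holds one (p_name0, p_name1) pair per context:
    first average over the two queried names, then over contexts.
    """
    return mean(mean(pair) for pair in per_context_probs)

# Toy numbers, for illustration only.
mid = midpoint([1.0, 3.0], [3.0, 1.0])
toy_F = average_accuracy([(1.0, 0.5), (0.5, 0.0)])
```

In the setup described above, `per_context_probs` would contain 20 pairs, one per context.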

Appendix D Details for Hessian evaluation

D.1 Distributed Alignment Search Baseline

In this section, we provide details of our implementation of Distributed Alignment Search. We base our implementation and hyperparameters on the pyvene [49] library. We use the Adam optimizer [50], with learning rate 0.001, with a linear schedule over 5 epochs (with the first 0.1 steps as warmup), over a dataset of 128 samples, with batch size 8. We optimize over a subspace parametrized to be orthogonal using Householder reflections as implemented in pytorch [51]. This subspace is shared across all layers. The loss we use is the log probability of returning the desired attribute after performing the interchange intervention.
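To make the optimization budget concrete, the sketch below works out the step counts implied by these hyperparameters and evaluates a linear learning-rate schedule with warmup. The exact shape of the schedule is an assumption (linear ramp from zero to the peak rate, then linear decay to zero); the function name `linear_schedule_lr` is hypothetical and is not taken from pyvene.

```python
def linear_schedule_lr(step, total_steps, peak_lr=1e-3, warmup_frac=0.1):
    """Learning rate at a given optimizer step.

    Assumed shape: linear warmup from 0 to peak_lr over the first
    warmup_frac of the steps, then linear decay back to 0.
    """
    warmup_steps = round(warmup_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

# Hyperparameters stated in the text.
num_samples, batch_size, num_epochs = 128, 8, 5
steps_per_epoch = num_samples // batch_size   # 16 steps per epoch
total_steps = num_epochs * steps_per_epoch    # 80 optimizer steps in total
warmup_steps = round(0.1 * total_steps)       # 8 warmup steps

schedule = [linear_schedule_lr(s, total_steps) for s in range(total_steps + 1)]
```

Under these assumptions the run consists of 80 optimizer steps, of which the first 8 are warmup.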

D.2 Dataset details

Both the Hessian and DAS are trained on templated datasets that draw from the names and countries domains. We partition each domain into a train and a test split, and construct train/test datasets by randomly populating the template described in Appendix C.

Appendix E Qualitative Hessian analysis

In this section we show more qualitative plots of the binding similarity metric for various contexts. First, we evaluate on a context with coreferencing (Fig. 8 left). Specifically, the context introduces two entities, and then refers to them with either “the former” or “the latter”. The qualitative visualizations show that coreferring expressions have the same binding vectors as their referents, independent of the order in which the references appear.

Figure 8: Similarity between token activations under the binding similarity metric $d(\cdot,\cdot)$ for coreferences. Left: Coreferencing in “cis” order, where the references appear in the same order as the referents. Right: Coreferencing in “trans” order, where the opposite is true.
Figure 9: Similarity between token activations under the binding similarity metric $d(\cdot,\cdot)$ for a three-entity context.

Next, we show the similarity matrix for a context with three entities (Fig. 9). Interestingly, the similarity metric does not strongly distinguish between the second and third entities, even though the interchange interventions work. One reason for this could be that the 50-dimensional subspace we identified contains directions that carry spurious, non-binding information. While switching information along these spurious directions may not affect the success of the interchange intervention, their presence may add noise to the metric, thus harming its ability to discriminate between the second and third entities. This indicates room for future work to obtain a more minimal estimate of the binding subspace.

Appendix F Grad-CAM attribution

In this section we describe the Grad-CAM-style attribution we use to attribute information about domain values to context activations at specific layers and token positions. It is a general attribution technique that was originally invented for attributing information to pixels in the input space [38], but has since been adapted for attributing information to internal activations [39].

The goal of the Grad-CAM attribution technique is to attribute the change in behavior to particular layers or token positions evaluated on contrast pairs. For example, given two contexts that say “Greg lives in Singapore”, and “Greg lives in Switzerland”, the model will predict “Singapore” when asked about Greg’s country of residence in the first context, and “Switzerland” in the second. The model constructs internal activations of the context, which we can parameterize as $Z_{s,l} \in \mathbb{R}^{d_{\text{model}}}$, where $s$ indicates the token position and $l$ indicates the layer. We want to assign each $s$ and $l$ an attribution score $A_{s,l}$ that describes how much the change in that activation vector contributed to the change in model behavior.

We do so by first quantifying the change in behavior. In this case, it is as simple as the difference in log probabilities of predicting “Singapore” vs “Switzerland”. Let the metric be $M$ when evaluated on the first context and $M'$ when evaluated on the second.

Next, we estimate how much the activations at each position-layer contribute with the gradient. Ideally, we want to capture the extent to which changing $Z_{s,l}$ to $Z_{s,l}'$ helps in changing the metric from $M$ to $M'$. We do so by taking a linear approximation $(Z_{s,l}' - Z_{s,l})^\top \nabla_{Z_{s,l}} M$. See the original Grad-CAM paper for more motivation. Doing so at every position $s$ and every layer $l$ gives us an attribution score $A_{s,l}$ for every token/layer.
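The attribution score is just a dot product between the activation difference and the gradient of the metric. The sketch below illustrates this on plain Python lists. It is a toy under stated assumptions, not the authors' code: the function names are hypothetical, and the gradient is supplied by hand, whereas in practice it would come from automatic differentiation through the model. The toy metric is linear, so the linear approximation is exact there.

```python
def attribution(z, z_prime, grad):
    """Linear approximation (z' - z)^T grad, with grad = dM/dz evaluated at z."""
    return sum((zp - zi) * g for zi, zp, g in zip(z, z_prime, grad))

def attribution_map(Z, Z_prime, G):
    """Attribution score A[s][l] for every token position s and layer l.

    Z, Z_prime and G are nested lists indexed as [s][l], each entry being
    an activation (or gradient) vector of length d_model.
    """
    return [
        [attribution(Z[s][l], Z_prime[s][l], G[s][l]) for l in range(len(Z[s]))]
        for s in range(len(Z))
    ]

# Toy check with a linear metric M(z) = w . z, whose gradient is w everywhere.
w = [3.0, 1.0]

def toy_metric(z):
    return sum(wi * zi for wi, zi in zip(w, z))

z, z_prime = [1.0, 2.0], [2.0, 0.0]
score = attribution(z, z_prime, w)
exact_change = toy_metric(z_prime) - toy_metric(z)

# One token position, two layers; only the first layer's activation changes.
scores = attribution_map([[z, z]], [[z_prime, z]], [[w, w]])
```

For a real model the metric is nonlinear, so the score is only a first-order estimate of how much swapping that activation moves the metric.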

Fig. 10 (left) shows an example. In the “Name Grad-CAM” plot, we use the contrast pairs “Matthew lives in Switzerland. Alexander lives in Netherlands.” and “Alexander lives in Switzerland. Matthew lives in Netherlands.” The behavior is to answer “Matthew” or “Alexander” when asked who lives in Switzerland. The attribution results indicate that the information is mostly localized to the name token for “Matthew”, and is strongest in the middle layers. The attribution results for countries are similar.

Appendix G Domain probes

We use Grad-CAM-style attribution to estimate which layers and which tokens carry the domain value information. We find that it is mostly localized to the token position that lexically carries the value information, and to the middle layers (Fig. 10 left). For values in the food domain, which are two tokens wide, we find that the information is carried in the second token position, which is consistent with prior results [35]. We thus choose $l = 20$ as the layer to probe from.

To validate that the choice of layer is correct, we compute the Area Under Precision-Recall Curve (AUC-PRC) for every layer (Fig. 10 right). This supports our choice of $l = 20$.

Finally, to select the threshold $h$, we use accuracy on validation subsets of the paraphrased (para) and translated (trans) datasets.
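One simple way to pick a threshold from validation accuracy is a sweep over candidate values, as sketched below. This is an assumption about the procedure rather than a description of it: the text does not specify the candidate grid, the decision rule (here, a value counts as present when its probe score exceeds $h$), or how ties are broken, and the function names are hypothetical.

```python
def accuracy_at(threshold, scores, labels):
    """Fraction of validation examples classified correctly at this threshold.

    Assumed decision rule: predict "present" when the probe score
    is strictly greater than the threshold.
    """
    preds = [s > threshold for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def select_threshold(scores, labels, candidates):
    """Return the candidate threshold with the highest validation accuracy.

    Ties go to the earliest candidate, because max() keeps the first maximum.
    """
    return max(candidates, key=lambda h: accuracy_at(h, scores, labels))

# Toy validation data, for illustration only.
val_scores = [0.1, 0.4, 0.6, 0.9]
val_labels = [False, False, True, True]
h = select_threshold(val_scores, val_labels, candidates=[0.0, 0.5, 1.0])
```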

Figure 10: Left: Grad-CAM style attributions for name and country domains. Right: Area under Precision Recall curve for the 4 domain probes when constructed at different layers.

Appendix H Gender bias evaluations

This section contains details about the gender bias evaluation.

We use a synthetic dataset of 400 contexts constructed using the occupation and country domains. We use the following template:

To ensure that the probing and prompting methods are detecting binding, and not relying on shortcuts such as sentence order, we swap the order of the last two sentences for half of the contexts. We compare probing and prompting at correctly predicting the gender of an occupation mentioned in the context. For prompting, we prompt the model with “The gender of the [occupation] is”, and take the gender with the higher log probability to be the model’s answer. For probing, we take the gender token with the higher binding similarity to the occupation token to be the probe’s answer.

To evaluate these two methods, we report both accuracy and calibrated accuracy. Accuracy is the fraction of the time that asking for the gender of an occupation in the context returns the correct answer.

Calibrated accuracy requires more explanation. Language models sometimes exhibit innocuous but systematic preferences when evaluated in a forced-choice setting. For example, a model might encounter the “male” token much more frequently than the “female” token, and so might output “male” over “female” regardless of what the context, or even the bias in the occupation, indicates. A common practice is to calibrate the log probabilities [52]. We do so by subtracting the mean log probabilities in paired responses. Specifically, let $v_0, v_1 \in \mathbb{R}^2$ be the log probabilities over the male and female tokens when queried with occupation 0 and occupation 1. The calibrated log probabilities for occupation 0 are $v_0 - (v_0 + v_1)/2$, and those for occupation 1 are $v_1 - (v_0 + v_1)/2$. After obtaining the calibrated log probabilities, we apply the same decision rule as before, i.e. we choose the gender with the higher calibrated log probability as the answer. The same procedure can be applied to the binding similarities to calibrate probing. However, we find that calibration does not significantly change the accuracy of either method.
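The calibration step is small enough to write out directly. The sketch below applies the two formulas above to a pair of toy log-probability vectors; the function names and the numbers are illustrative assumptions, not values from the experiments.

```python
def calibrate(v0, v1):
    """Subtract the mean of the paired log-probability vectors from each one.

    v0 and v1 are [male, female] log probabilities for occupation 0 and 1.
    """
    mean_v = [(a + b) / 2 for a, b in zip(v0, v1)]
    c0 = [a - m for a, m in zip(v0, mean_v)]
    c1 = [b - m for b, m in zip(v1, mean_v)]
    return c0, c1

def predict(v, genders=("male", "female")):
    """Decision rule: choose the gender with the higher (calibrated) score."""
    return genders[0] if v[0] >= v[1] else genders[1]

# Toy values: the raw scores favor "male" for both occupations.
v0 = [-0.5, -1.0]    # occupation 0
v1 = [-0.25, -1.5]   # occupation 1

raw_answers = (predict(v0), predict(v1))
c0, c1 = calibrate(v0, v1)
calibrated_answers = (predict(c0), predict(c1))
```

Note that the two calibrated vectors are negatives of each other, since $v_0 - (v_0 + v_1)/2 = (v_0 - v_1)/2$. Barring ties, the calibrated decision rule therefore assigns opposite genders to the two occupations in a pair, which removes any shared preference for one token.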