Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs

Sheridan Feucht    David Atkinson    Byron C. Wallace    David Bau
Northeastern University
{feucht.s, atkinson.da, b.wallace, d.bau}@northeastern.edu
Abstract

LLMs process text as sequences of tokens that roughly correspond to words, where less common words are represented by multiple tokens. However, individual tokens are often semantically unrelated to the meanings of the words/concepts they comprise. For example, Llama-2-7b’s tokenizer splits the word “northeastern” into the tokens [_n, ort, he, astern], none of which correspond to semantically meaningful units like “north” or “east.” Similarly, the overall meanings of named entities like “Neil Young” and multi-word expressions like “break a leg” cannot be directly inferred from their constituent tokens. Mechanistically, how do LLMs convert such arbitrary groups of tokens into useful higher-level representations? In this work, we find that last token representations of named entities and multi-token words exhibit a pronounced “erasure” effect, where information about previous and current tokens is rapidly forgotten in early layers. Using this observation, we propose a method to “read out” the implicit vocabulary of an autoregressive LLM by examining differences in token representations across layers, and present results of this method for Llama-2-7b and Llama-3-8b. To our knowledge, this is the first attempt to probe the implicit vocabulary of an LLM. (Code and data are available at footprints.baulab.info.)

1 Introduction

Despite their widespread use, the specific mechanisms by which LLMs are able to “understand” and generate coherent text are not well understood. One mystery is the process by which groups of subword tokens are converted into meaningful representations, a process described by Elhage et al., 2022 and Gurnee et al., 2023 as detokenization.

Figure 1: We observe “erasure” of token-level information in later layers of LLMs for multi-token words and entities (top). We hypothesize that this is a result of a process that converts token embeddings into useful lexical representations, and introduce a new method for enumerating these lexical items (bottom).

Current language models process text as a series of tokens drawn from a set token vocabulary: One token can correspond to a single word (_fish), or to a piece of a larger word (mon in “salmon”). The vocabulary of tokens available to a model is typically determined before training with byte-pair encoding (Sennrich et al., 2016), which is based on a specific dataset and can lead to unintuitive results. For example, Llama-2-7b’s (Touvron et al., 2023) tokenizer breaks the word “northeastern” into the tokens [_n, ort, he, astern], none of which correspond to semantically meaningful units like “north” or “east.” Capitalization also creates unexpected issues: for example, the word “Hawaii” is split into two tokens if the first letter is capitalized [_Hawai, i], but four if the first letter is lowercase [_ha, w, ai, i]. In spite of these challenges, large models are apparently able to “understand” such idiosyncratic tokenizations of multi-token words with few observable effects on downstream performance (Gutiérrez et al., 2023), unless these weaknesses are directly targeted (Wang et al., 2024; Batsuren et al., 2024). How is this possible?

We hypothesize that during pretraining, LLMs develop an implicit vocabulary that maps from groups of arbitrary tokens to semantically meaningful units. These lexical units may be multi-token words (“northeastern”), named entities (“Neil Young”), or idiomatic multi-word expressions (“break a leg”) and can be understood as “item[s] that function as single unit[s] of meaning” in a model’s vocabulary (Simpson, 2011). Lexical items are also non-compositional: Just as the meaning of “break a leg” cannot be predicted from the individual meanings of “break” and “leg,” the meaning of “patrolling” cannot be predicted from its constituent tokens pat and rolling. This arbitrariness necessitates some kind of storage system, implicit or otherwise (Murphy, 2010).

How exactly do LLMs deal with these cases mechanistically? In this paper, we begin to answer this question by investigating token-level information stored in LLM representations.

  • We find that last token positions of multi-token words and named entities “erase” token-level information in early layers for both Llama-2-7b (Touvron et al., 2023) and Llama-3-8b (Meta, 2024).

  • We develop a heuristic for scoring the “lexicality” of a given sequence of tokens, and use it to “read out” a list of an LLM’s lexical items given a large dataset of natural text.

We interpret this erasure effect as a “footprint” of a mechanism in early layers that orchestrates the formation of meaningful lexical items.

2 Background

Figure 2: Test accuracy on CounterFact subject last tokens versus other tokens in the dataset for probes trained on Llama-2-7b hidden states ($n=5063$). $i$ represents the position being predicted (e.g., $i=-1$ is previous-token prediction; $i=1$ is next-token prediction). We observe an “erasure” effect in last subject tokens that is not present for other types of tokens: these last subject tokens consistently “forget” about preceding tokens and themselves. Appendix A shows Llama-3-8b results and in-distribution performance on Pile tokens.

Previous work has shown that knowledge about a multi-token entity is often stored in the last token of that entity. For example, Meng et al. (2022) found that factual information about a subject like “The Space Needle” would be concentrated in the representation for le. Geva et al. (2023) find evidence for a subject enrichment stage during factual recall, where information about an entity is collected at its last token in early layers, which is also seen in other work on factual recall using the same dataset (Katz et al., 2024), and corroborated by research on athlete $\rightarrow$ sport lookups (Nanda et al., 2023). This phenomenon may be due to the autoregressive nature of decoder transformer models: models cannot enrich “Space” with information about Seattle until after “Needle” is seen, as “Space” could refer to a number of unrelated concepts (“Space Jam,” “Space Station”). This is not a hard-and-fast rule; it depends on entity frequency and context cues. For example, if a model sees _The, _E, and iff, it may already know that these tokens refer to “The Eiffel Tower” without needing to see el and Tower.

Other work in interpretability has also started to uncover evidence of models encoding lexical items. Elhage et al. (2022) observe neurons in early layers that fire on the last tokens of multi-token words, names of famous people, generic nouns, compound words, and LaTeX commands. They also find late-layer neurons that seem to be relevant to retokenization, i.e., conversion from internal representations back into tokens. For example, a retokenization neuron might fire on _st and promote rag in order to facilitate the output of the word “straggler.” Gurnee et al. (2023) also find examples of polysemantic neurons in Pythia models (Mallen and Belrose, 2023) that activate for a number of multi-token constructions like “apple developer,” “Bloom.ington,” and “research.gate.”

3 Linear Probing of Hidden States

3.1 Method

If last token positions are so important (Section 2), then what do these representations encode? Perhaps the last hidden state directly stores information about other subject tokens (e.g., _Wars might contain some encoding for _Star in its hidden state). To test this hypothesis, we investigate hidden states for both Llama-2-7b and Llama-3-8b, as they have significantly different token vocabulary sizes (32k and 128k tokens, respectively). We train linear probes $p_i^{(\ell)}$ to take a hidden state $h_t^{(\ell)}$ at layer $\ell$ and token position $t$ and predict the value of a nearby token $t+i$ (e.g., a probe trained to predict the previous token for layer 5 hidden states would be denoted by $p_{-1}^{(5)}$).

We train probes for all layer indexes $0 \leq \ell < 32$ and offsets $i \in \{-3, -2, -1, 0, 1\}$. We also train probes in the same manner on the embedding layer ($\ell=-1$) and on the final outputs of the network before the language modelling head ($\ell=32$). We trained probes on a random sample of 428k tokens from the Pile (Gao et al., 2020) using AdamW for 16 epochs with a batch size of 4 and a learning rate of 0.1. Hyperparameters were selected based on validation performance on a separate Pile sample (279k tokens) after a random sweep. Each probe takes 6-8 hours to train on an RTX-A6000.
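To make the probe setup concrete, the following minimal sketch shows how training pairs for a single offset $i$ could be assembled from one document. The function name `make_probe_pairs`, the toy hidden states, and the choice to skip positions whose target falls outside the document are illustrative assumptions, not the released implementation.

```python
from typing import List, Sequence, Tuple


def make_probe_pairs(
    hidden_states: Sequence[Sequence[float]],
    token_ids: Sequence[int],
    offset: int,
) -> List[Tuple[Sequence[float], int]]:
    """Pair each hidden state h_t with the id of the token at position t + offset."""
    if len(hidden_states) != len(token_ids):
        raise ValueError("need exactly one hidden state per token")
    pairs = []
    for t, h in enumerate(hidden_states):
        target = t + offset
        # Assumption: positions whose target falls off either end are skipped.
        if 0 <= target < len(token_ids):
            pairs.append((h, token_ids[target]))
    return pairs


# Toy document: 4 tokens with 2-dimensional stand-in hidden states.
toy_hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
toy_tokens = [11, 22, 33, 44]

prev_pairs = make_probe_pairs(toy_hidden, toy_tokens, offset=-1)  # i = -1
next_pairs = make_probe_pairs(toy_hidden, toy_tokens, offset=1)   # i = 1
```

One such dataset would be built per layer and per offset, and a separate linear classifier over the token vocabulary fit on each.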

3.2 CounterFact Subjects

After training probes in Section 3.1, we test them on the CounterFact dataset (Meng et al., 2022), which consists of prompts about subjects that require factual knowledge to complete correctly (e.g. “Mount Passel is in Antarctica”). We filter the dataset to include only prompts that the model answers correctly, yielding 5,063 examples for Llama-2-7b and 5,495 examples for Llama-3-8b. To augment this dataset, we also sampled and filtered down [album/movie/series $\rightarrow$ creator] pairs from Wikidata (Vrandečić and Krötzsch, 2014) and embedded them in prompts in the same manner, yielding a total of 12,135 correctly-answered prompts for Llama-2-7b and 13,995 for Llama-3-8b.

Figure 2 shows probe test results on CounterFact last subject tokens (right) versus every other type of token in the dataset (left). We see a striking “erasure” effect for last tokens of CounterFact subjects, where these hidden states consistently “forget about” preceding and current tokens. Subject tokens that are not in the last position (e.g., _Star) do not exhibit this pattern (Appendix A, Figure 13). This striking drop in token accuracy is reminiscent of the subject enrichment stage described by Geva et al. (2023), suggesting that the tokens _Star and _Wars may be overwritten in the process of representing the concept of Star Wars.

We also observe the same phenomenon when testing on named entities identified by spaCy in Wikipedia articles (Appendix A, Figure 12), suggesting that this effect is not an artifact of the short templates found in the CounterFact dataset. Additionally, we consider whether this result is due to imbalances in probe training data, but this seems not to be the case either (Appendix B).

3.3 Multi-Token Words

Intuitively, the process of converting a multi-token sequence like [_n, ort, he, astern] into a meaningful representation of the word “northeastern” resembles the process of converting [_E, iff, el, Tower] into “Eiffel Tower.” We hypothesize that models treat multi-token words in the same way that they treat multi-token entities, and test our probes from Section 3.1 on multi-token words. After sampling 500 articles ($\sim$256k tokens) from the 20220301.en split of the Wikipedia dump (Foundation, 2022), we split by white-space to naively identify word boundaries. As predicted, we see the same “erasing” pattern for multi-token words that we do for multi-token entities (Appendix A, Figure 11). This suggests that they may be processed in a similar manner in early layers.
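The naive word-boundary step can be sketched as follows. This is an approximation that works on an already-tokenized sequence and assumes that a leading "_" marks the first token of a whitespace-separated word, as in the token examples above; the helper name `multi_token_word_spans` is hypothetical.

```python
from typing import List, Tuple

WORD_START = "_"  # stand-in for the tokenizer's word-boundary marker (assumption)


def multi_token_word_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return inclusive (first, last) token indices of words made of 2+ tokens."""
    spans = []
    start = 0
    for idx in range(1, len(tokens) + 1):
        # A word ends at the end of the document or right before the next word-start token.
        if idx == len(tokens) or tokens[idx].startswith(WORD_START):
            if idx - start >= 2:
                spans.append((start, idx - 1))
            start = idx
    return spans


tokens = ["_n", "ort", "he", "astern", "_fish", "_Hawai", "i"]
spans = multi_token_word_spans(tokens)
# Probes are then evaluated at the last token position of each multi-token word.
last_positions = [last for _, last in spans]
```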

4 Building a Vocabulary

After examination of probe behavior for multi-token words and entities, we hypothesize that this “erasure” effect is a result of the implicit formation of lexical representations in early layers. To characterize this phenomenon, we propose an erasure score $\psi$ to identify token sequences that follow the pattern observed in Section 3. We then introduce an approach to “reading out” a list of implicit vocabulary entries for a given model using this score.

4.1 An Erasure Score

We first define an erasure score $\psi_{p,q}$, which is a heuristic designed around the intuition that the last token representation of a lexical item should exhibit a strong “erasing” effect for probe predictions from layer 1 to layer $L$ (for both Llama-2-7b and Llama-3-8b, we set $L=9$). The score quantifies “erasing” behavior at token positions within a given sequence at indices $p$ through $q$, and penalizes any erasure of tokens outside of these boundaries. Equation 1 defines the score $\psi_{p,q}$ for a sequence $s_{p,q}$ of length $n=q-p+1$ as:

$$\frac{1}{1+2n}\left(\delta(q,0)+\sum_{t=p}^{q}\sum_{i=-2}^{-1}\mathbbm{1}_{\text{within}}(t,i)\cdot\delta(t,i)\right) \tag{1}$$

where $\delta(t,i)$ denotes the change in probability of the predicted token $t+i$ from layer 1 to layer $L$, based on probes $p_i^{(\ell)}$ from Section 3.1.

$$\delta(t,i)=P_{p_i^{(1)}}\left(t+i \mid h_t^{(1)}\right)-P_{p_i^{(L)}}\left(t+i \mid h_t^{(L)}\right) \tag{2}$$

If $t+i$ lies outside the boundaries of $s$, we decrease $\psi_{p,q}$. Otherwise, a large drop between layers $\delta(t,i)$ increases the value of $\psi_{p,q}$.

$$\mathbbm{1}_{\text{within}}(t,i)=\begin{cases}-1 & \text{if } t+i<p\\ 1 & \text{else}\end{cases} \tag{3}$$

We provide further explanation of the intuition behind this approach in Appendix C.
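The score in Equations 1 to 3 can be sketched in a few lines. Here `delta` is passed in as a function standing in for the probe-based probability drop of Equation 2; the toy values below are hypothetical and chosen only to show that a window aligned with an "erased" span scores higher than a misaligned one.

```python
from typing import Callable


def within_sign(t: int, i: int, p: int) -> int:
    """Equation 3: -1 when the probed position t + i precedes the sequence start p."""
    return -1 if t + i < p else 1


def erasure_score(p: int, q: int, delta: Callable[[int, int], float]) -> float:
    """Equation 1 for the sequence spanning token indices p..q (inclusive)."""
    n = q - p + 1
    total = delta(q, 0)  # how much the last token "forgets itself"
    for t in range(p, q + 1):
        for i in (-2, -1):
            total += within_sign(t, i, p) * delta(t, i)
    return total / (1 + 2 * n)


def toy_delta(t: int, i: int) -> float:
    """Hypothetical probe drops: only tokens 2..4 are erased between layer 1 and L."""
    return 0.5 if 2 <= t + i <= 4 else 0.0


score_aligned = erasure_score(2, 4, toy_delta)  # window matches the erased span
score_shifted = erasure_score(3, 5, toy_delta)  # window is off by one token
```

In the shifted window, the drops for positions before $p$ enter with a negative sign, which is what pulls the score down.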

4.2 Segmenting Documents

We develop an algorithm built around our erasure score $\psi$ that breaks any given document $d \in \mathcal{D}$ into high-scoring, non-overlapping segments covering all of $d$ (Algorithm 1, Appendix D). Figure 1 shows the top-scoring sequences $s_{p,q}$ calculated in this manner from a Wikipedia excerpt about Thelonious Monk, where unigram scores are excluded for clarity. Not all multi-token words are scored highly via our approach, but the highest-scoring sequences are plausible lexical items that are non-compositional in nature (“dram.atic”, “sil.ences”, “tw.ists”). We share examples of complete segmentations in Appendix D.

4.3 Model Vocabularies

| Model | Data | MTW prec. | MTW recall | MTE prec. | MTE recall |
|---|---|---|---|---|---|
| Llama-2-7b | wiki | 0.306 | 0.016 | 0.143 | 0.016 |
| Llama-2-7b | pile | 0.296 | 0.017 | 0.080 | 0.018 |
| Llama-3-8b | wiki | 0.044 | 0.001 | 0.010 | 0.000 |
| Llama-3-8b | pile | 0.023 | 0.001 | 0.012 | 0.001 |

Table 1: Precision and recall for aggregated results of Algorithm 1 run on Llama-2-7b and Llama-3-8b, using either Wikipedia or Pile documents ($|\mathcal{D}|=500$). MTW refers to all multi-token words in the dataset when split by whitespace; MTE refers to all spaCy named entities.

Finally, we propose a method to “read out” the implicit vocabulary of a model $\mathcal{M}$ given a dataset $\mathcal{D}$. For each document $d \in \mathcal{D}$, we segment $d$ using Algorithm 1. We then average scores $\psi$ for every multi-token sequence that appears more than once across all documents. As this process is very data-dependent, we compare the top 50 results for Pile and Wikipedia text in Appendix E.
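The aggregation step can be sketched as below, assuming each document has already been segmented into (token sequence, score) pairs. The name `read_out_vocabulary` and the toy scores are hypothetical, and "appears more than once" is interpreted here as more than one occurrence in total across all segmentations.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Segment = Tuple[Tuple[str, ...], float]  # (token sequence, erasure score)


def read_out_vocabulary(
    segmented_docs: Iterable[List[Segment]],
) -> Dict[Tuple[str, ...], float]:
    """Average the scores of multi-token sequences that occur more than once."""
    scores = defaultdict(list)
    for segments in segmented_docs:
        for tokens, psi in segments:
            if len(tokens) > 1:  # unigrams are already in the token vocabulary
                scores[tokens].append(psi)
    return {seq: sum(vals) / len(vals) for seq, vals in scores.items() if len(vals) > 1}


doc_a = [(("_Ne", "il"), 0.6), (("_the",), 0.1), (("dram", "atic"), 0.8)]
doc_b = [(("_Ne", "il"), 0.4), (("sil", "ences"), 0.7)]
vocab = read_out_vocabulary([doc_a, doc_b])

# Rank entries from most to least "lexical".
ranked = sorted(vocab.items(), key=lambda kv: kv[1], reverse=True)
```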

With this approach, we are able to recover $\sim$1800 sequences for Llama-2-7b and $\sim$900 for Llama-3-8b using the same five hundred Wikipedia articles. Although recall is quite low (Table 1), we find that 44.9% of sequences recovered for Llama-2-7b on Wikipedia text are either multi-token words or multi-token entities (29.68% for Pile text). For Llama-3-8b, only 5% and 3% of sequences are MTWs or MTEs on Wikipedia and Pile text, respectively. However, looking at examples of Llama-3-8b sequences in Appendix E, we can observe other interesting cases, like multi-token expressions (“gold medalists,” “by per capita income,” “thank you for your understanding”) and LaTeX commands (as similarly observed by Elhage et al. (2022)). Because Llama-3-8b’s token vocabulary is four times the size of Llama-2-7b’s, its implicit vocabulary also seems to consist of more multi-word expressions and chunks of code rather than multi-token words (Appendix E, Table 6).
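For reference, precision and recall of a recovered vocabulary against a reference set (for example, all multi-token words in the dataset) can be computed as in the sketch below. Exact matching of token sequences is an assumption made here for illustration, and the toy sets are hypothetical.

```python
from typing import Set, Tuple

Seq = Tuple[str, ...]


def precision_recall(recovered: Set[Seq], reference: Set[Seq]) -> Tuple[float, float]:
    """Precision: share of recovered items in the reference. Recall: share of the reference recovered."""
    hits = len(recovered & reference)
    precision = hits / len(recovered) if recovered else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


recovered = {("_Ne", "il"), ("dram", "atic"), ("_of", "_the"), ("\\", "frac")}
reference = {
    ("_Ne", "il"), ("dram", "atic"), ("sil", "ences"), ("tw", "ists"),
    ("_n", "ort", "he", "astern"), ("_Hawai", "i"), ("Cal", "g", "ary"), ("pat", "rolling"),
}
precision, recall = precision_recall(recovered, reference)
```

A large reference set with a short recovered list gives the pattern seen in Table 1: moderate precision with very low recall.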

5 Conclusion

In this work, we present preliminary evidence for the existence of an implicit vocabulary that allows models to convert from byte-pair encoded tokens to useful lexical items. We posit that the “erasure” effect we observe for Llama-2-7b and Llama-3-8b is a result of model processes that deal with multi-token expressions, and use this insight to propose a new method for “reading out” an LLM’s implicit vocabulary. This is a first step towards understanding the formation of lexical representations in LLMs, and may serve as a useful tool for elucidation of words that a given model “knows.”

Limitations

Evaluation of implicit vocabulary-building methods (Section 4) is challenging due to the lack of a known ground-truth. Our approach is motivated by the desire to understand the inner workings of the model being studied, but we have no authoritative reference that distinguishes between situations where a given sequence gets a high $\psi$ value because it is truly treated as a lexical unit by the model, or where it may be due to an error in our methodology. To quantify our results, we have compared the extracted vocabulary to sequences that we assume to be likely lexical items: multi-token words and spaCy named entities. However, this likely does not cover all cases for which “token grouping” occurs in LLMs.

Another limitation of this work is that we have restricted our analysis to known entities. There is also the question of what happens for intermediate cases such as plausible-sounding fictional towns or names of people who are not famous. If $\psi$ correlates with sequence presence in training data, these results could be useful for understanding how familiar an LLM is with a given word or entity.

Finally, our measurements have been run only on the Llama family of models and do not yet extend to non-Llama models of comparable size, or Llama models of larger sizes.

Ethics Statement

In this work, we restrict our analysis to English words, due to our biases as native speakers of English. We hope that this work can also provide valuable insights for other languages, especially low-resource languages, where understanding “what words an LLM knows” may be especially useful.

Acknowledgments

We thank David Smith, Bilal Chughtai, Chantal Shaib, Atticus Geiger, and Adrian Chang for helpful discussion and feedback throughout the course of this project. This work was supported in part by Open Philanthropy, and by the National Science Foundation (NSF) grant IIS-1901117.

Experiments were implemented using the nnsight library; many were run on the Center for AI Safety Compute Cluster. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.

References

  • Batsuren et al. (2024) Khuyagbaatar Batsuren, Ekaterina Vylomova, Verna Dankers, Tsetsuukhei Delgerbaatar, Omri Uzan, Yuval Pinter, and Gábor Bella. 2024. Evaluating subword tokenization: Alien subword composition and OOV generalization challenge. Preprint, arXiv:2404.13292.
  • Elhage et al. (2022) Nelson Elhage, Tristan Hume, Catherine Olsson, Neel Nanda, Tom Henighan, Scott Johnston, Sheer ElShowk, Nicholas Joseph, Nova DasSarma, Ben Mann, Danny Hernandez, Amanda Askell, Kamal Ndousse, Andy Jones, Dawn Drain, Anna Chen, Yuntao Bai, Deep Ganguli, Liane Lovitt, Zac Hatfield-Dodds, Jackson Kernion, Tom Conerly, Shauna Kravec, Stanislav Fort, Saurav Kadavath, Josh Jacobson, Eli Tran-Johnson, Jared Kaplan, Jack Clark, Tom Brown, Sam McCandlish, Dario Amodei, and Christopher Olah. 2022. Softmax linear units. Transformer Circuits Thread. https://transformer-circuits.pub/2022/solu/index.html.
  • Foundation (2022) Wikimedia Foundation. 2022. Wikimedia downloads.
  • Gao et al. (2020) Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The Pile: An 800GB dataset of diverse text for language modeling. Preprint, arXiv:2101.00027.
  • Geva et al. (2023) Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. 2023. Dissecting recall of factual associations in auto-regressive language models. Preprint, arXiv:2304.14767.
  • Gurnee et al. (2023) Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. 2023. Finding neurons in a haystack: Case studies with sparse probing. Preprint, arXiv:2305.01610.
  • Gutiérrez et al. (2023) Bernal Jiménez Gutiérrez, Huan Sun, and Yu Su. 2023. Biomedical language models are robust to sub-optimal tokenization. Preprint, arXiv:2306.17649.
  • Katz et al. (2024) Shahar Katz, Yonatan Belinkov, Mor Geva, and Lior Wolf. 2024. Backward lens: Projecting language model gradients into the vocabulary space. Preprint, arXiv:2402.12865.
  • Mallen and Belrose (2023) Alex Mallen and Nora Belrose. 2023. Eliciting latent knowledge from quirky language models. Preprint, arXiv:2312.01037.
  • Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. In Neural Information Processing Systems.
  • Meta (2024) Meta. 2024. Introducing Meta Llama 3: The most capable openly available LLM to date.
  • Murphy (2010) M. Lynne Murphy. 2010. Lexical Meaning. Cambridge Textbooks in Linguistics. Cambridge University Press.
  • Nanda et al. (2023) Neel Nanda, Senthooran Rajamanoharan, János Krámar, and Rohin Shah. 2023. Fact finding: Attempting to reverse-engineer factual recall on the neuron level.
  • Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
  • Simpson (2011) James Simpson. 2011. The Routledge handbook of applied linguistics. Taylor & Francis.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.
  • Vrandečić and Krötzsch (2014) Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Commun. ACM, 57(10):78–85.
  • Wang et al. (2024) Dixuan Wang, Yanda Li, Junyuan Jiang, Zepeng Ding, Guochao Jiang, Jiaqing Liang, and Deqing Yang. 2024. Tokenization matters! Degrading large language models through challenging their tokenization. Preprint, arXiv:2405.17067.

Appendix A Linear Probing on Hidden States

A.1 Llama-3-8b Results

CounterFact Accuracy

We share results analogous to Figure 2 but for Llama-3-8b, which shows a similar “erasure” pattern (Figure 8). Probes are tested only on prompts that Llama-3-8b answers correctly.

Multi-Token Word Accuracy

Figure 9 shows results for probes tested on the last token positions of multi-token words from Wikipedia (where “words” are determined by whitespace separation).

Multi-Token Entity Accuracy

Figure 10 shows results for probes tested on the last token positions of multi-token entities identified by spaCy, using the same dataset as A.1. We use spaCy’s named entity recognition pipeline to identify named entities. Because digits 0-9 are added to Llama-2-7b’s vocabulary, we filter out all classes relating to numbers (PERCENT, DATE, CARDINAL, TIME, ORDINAL, MONEY, QUANTITY), with the thought that these sequences may be treated differently at the detokenization stage.

A.2 Llama-2-7b Results

Multi-Token Word Accuracy

Figure 11 shows results from Llama-2-7b probes tested on multi-token words from Wikipedia as described in Section 3.3.

Multi-Token Entity Accuracy

Figure 12 shows results for Llama-2-7b probes tested on multi-token entities from Wikipedia, using the same dataset from Section 3.3 and also filtering out number-based entity classes as in Section A.1.

A.3 Llama-2-7b Extras

Figure 3: Overall test accuracy on unseen Pile tokens ($n=273$k) for probes trained on Llama-2-7b hidden states. Next token prediction becomes more accurate throughout model layers as current and previous token accuracy decreases.

Pile Accuracy

While Figure 2 shows test accuracy of linear probes on model hidden states, Figure 3 shows in-distribution test accuracy on Pile tokens. We can observe a smoother trajectory of gradual “forgetting” of previous and current token-level information throughout layers.

Comparison of Token Positions

Figure 13 shows the breakdown of probe performance on different types of subject tokens: first subject tokens, middle subject tokens, and last subject tokens. The drop in previous and current token representation that we observe for last subject tokens also appears for first and middle subject tokens, but it is not as drastic.

Comparison of Subject Lengths

We also show previous token representation broken down by CounterFact subject length for last token representations in Figure 14. Unigram subjects represent previous token information at a rate even higher than non-subject tokens. For bigrams and trigrams, we see a pattern similar to Figure 2.

Appendix B Accounting for Possible Training Imbalance

One explanation for the observed drop in accuracy for CounterFact entities across layers is that our probes have simply not been exposed to as many entity tokens during training. We do not believe this is the case for Llama-2-7b for two reasons: (1) If this effect were due to probes being less sensitive to tokens found in multi-token entities, we would also see a significant drop for first and middle tokens, which does not occur (Figure 13). (2) We measure the frequency of all test n-grams in the original Pile data used to train our probes, and find that both subject and non-subject n-grams are found in the probe training dataset at similar rates, with the median number of occurrences in the test set for both types of sequences being zero. After removing the few non-subject sequences that do appear often in the probe training set, we still see the same “erasure” effect.
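A frequency check of this kind can be sketched as follows: count how often each test n-gram occurs in the probe training tokens, then compare medians between subject and non-subject n-grams. The function names and the toy corpus are hypothetical; the actual check was run on the Pile sample described in Section 3.1.

```python
from statistics import median
from typing import List, Sequence, Tuple


def count_ngram(corpus: Sequence[str], ngram: Tuple[str, ...]) -> int:
    """Number of (possibly overlapping) occurrences of ngram in the token corpus."""
    n = len(ngram)
    return sum(
        1 for start in range(len(corpus) - n + 1)
        if tuple(corpus[start:start + n]) == ngram
    )


def median_occurrences(corpus: Sequence[str], ngrams: List[Tuple[str, ...]]) -> float:
    """Median training-set frequency over a list of test n-grams."""
    return median([count_ngram(corpus, g) for g in ngrams])


train_tokens = ["_the", "_Space", "_Need", "le", "_is", "_in", "_the", "_city"]
subject_ngrams = [("_Space", "_Need", "le"), ("_E", "iff", "el")]
other_ngrams = [("_the",), ("_is", "_in")]

subject_median = median_occurrences(train_tokens, subject_ngrams)
other_median = median_occurrences(train_tokens, other_ngrams)
```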

Appendix C Intuition for $\psi$

C.1 Explanation of Equation 1

Our first assumption when designing $\psi$ is that if an LLM is processing a sequence of tokens corresponding to a lexical item (e.g., [Cal, g, ary]), the last token ary in that sequence should “forget itself” between layers 1 and $L=9$, according to the pattern we observe in Section 3. We measure this drop using probability scores from probe outputs for a single example. In plain English, the first term $\delta(q,0)$ in Equation 1 represents how much $P(t=\texttt{ary})$ drops between layer 1 and layer $L$ when probing the hidden representation for ary.

In the double summation term, we take into account probe predictions for tokens one ($i=-1$) or two ($i=-2$) positions before the current token. If there is a drop in probability between layer 1 and $L$ for these positions, we take this as evidence of token clumping.

However, we also observe (from Figure 13 and manual inspection) that probe predictions for token positions that lie outside the boundaries of a presumed lexical item are maintained across layers, or even promoted. This is clear from the $i=-1$ case in the leftmost plot for Figure 13, and was also a consistent pattern when we examined probe behavior over a number of example documents. Given this fact, we include the indicator function $\mathbbm{1}_{\text{within}}(t,i)$ to decrease $\psi$ in cases where erasure is happening outside the bounds of the given sequence.
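As a worked instance of Equation 1 (a sketch that only expands the definitions given in Section 4.1), consider a three-token sequence such as [Cal, g, ary] occupying positions $p$, $p+1$, and $q=p+2$, so that $n=3$ and the normalizer is $1/7$. Of the six $(t,i)$ pairs in the double sum, three probe a position before $p$ and therefore enter with a negative sign:

```latex
\psi_{p,p+2} = \frac{1}{7}\Big(
    \delta(p+2, 0)
  + \delta(p+1, -1) + \delta(p+2, -1) + \delta(p+2, -2)
  - \delta(p, -1) - \delta(p, -2) - \delta(p+1, -2)
\Big)
```

The score is therefore largest when the last token forgets itself and the tokens inside the sequence, while information about the tokens preceding the sequence is retained.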

C.2 Choice of $L$

We choose $L=9$ based on probe “erasure” behavior for Llama-2-7b and Llama-3-8b, particularly Figure 2. We also present a short ablation experiment for $L \in \{5, 9, 13, 17, 21\}$ with results in Table 2, which shows that other values of $L$ after “the drop” are roughly equivalent to $L=9$.

$L$ MTW prec. MTW recall MTE prec. MTE recall
5 0.307 0.002 0.143 0.002
9 0.306 0.016 0.143 0.016
13 0.328 0.003 0.169 0.003
17 0.330 0.003 0.180 0.003
21 0.319 0.003 0.172 0.003
Table 2: Precision and recall for different values of $L$ for Algorithm 1 applied to Llama-2-7b on Wikipedia text. Recall seems to be best for $L=9$, with precision improving by a few points in mid-late layers.

Appendix D Document Segmentation

Algorithm 1 Document Segmentation
Require: document $d \in \mathcal{D}$ of length $l$
for $n = 1$ to $l$ do    ▷ all ngram lengths
    for $p = 0$ to $l - n$ do
        for $q = p + n - 1$ to $l - 1$ do
            assign score $\psi_{p,q}$ to sequence $s_{p,q}$
        end for
    end for
end for
sort $s$ in descending order of $\psi$
$\mathit{segms} \leftarrow \emptyset$
for $s_{p,q}$ in sorted $s$ do
    if $\forall s_{x,y} \in \mathit{segms},\ (x > q \lor y < p)$ then
        $\mathit{segms} \leftarrow \mathit{segms} \cup \{s_{p,q}\}$
    end if
end for
return $\mathit{segms}$    ▷ non-overlapping segments
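The greedy procedure in Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the released code: `segment_document` and `score_fn` are hypothetical names, `score_fn(p, q)` stands in for whatever computes $\psi_{p,q}$, and the sketch scores each span $(p, q)$ exactly once instead of following the three nested loops literally.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # inclusive (p, q) token indices


def segment_document(length: int,
                     score_fn: Callable[[int, int], float]
                     ) -> List[Tuple[Span, float]]:
    """Greedily pick the highest-scoring spans that do not overlap."""
    # Score every contiguous span s_{p,q} with 0 <= p <= q < length.
    scored = [((p, q), score_fn(p, q))
              for p in range(length)
              for q in range(p, length)]
    # Consider the highest-scoring spans first.
    scored.sort(key=lambda item: item[1], reverse=True)

    chosen: List[Tuple[Span, float]] = []
    for (p, q), score in scored:
        # Keep a span only if it is disjoint from every span kept so far.
        if all(x > q or y < p for (x, y), _ in chosen):
            chosen.append(((p, q), score))
    # Return the segments in document order.
    return sorted(chosen, key=lambda item: item[0][0])


# Toy scores for a four-token document (made-up values for illustration).
toy_scores = {(1, 2): 0.9, (0, 1): 0.5, (2, 3): 0.4}


def toy_score_fn(p: int, q: int) -> float:
    if (p, q) in toy_scores:
        return toy_scores[(p, q)]
    return 0.1 if p == q else 0.0


segments = segment_document(4, toy_score_fn)
```

Because single-token spans are also candidates, every position ends up covered by exactly one segment: once the span (1, 2) is selected, the overlapping candidates (0, 1) and (2, 3) are rejected and positions 0 and 3 fall back to single-token segments.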

We show full document segmentations using Algorithm 1 for short excerpts from the same Wikipedia article in Figure 4 and Figure 5. Figure 6 and Figure 7 show more segmentations for a Pile document.

Figure 4: Full segmentation of a document from Wikipedia via Algorithm 1 on Llama-2-7b. Borders indicate segmentation, with bolded letters indicating multi-token segments. Darker blue cells have higher scores, yellow cells have negative scores. The highest-scoring sequence in this document is “Australian Institute” ($\psi=0.579$).
Figure 5: Full segmentation of a document from Wikipedia via Algorithm 1 on Llama-3-8b. Borders indicate segmentation, with bolded letters indicating multi-token segments. Darker blue cells have higher scores, yellow cells have negative scores. The highest-scoring sequence in this document is “. After the Games she commented "” ($\psi=0.443$).
Figure 6: Full segmentation of a document from the Pile via Algorithm 1 on Llama-2-7b. Borders indicate segmentation, with bolded letters indicating multi-token segments. Darker blue cells have higher scores, yellow cells have negative scores. The highest-scoring sequence in this document is “submodel” ($\psi=0.559$).
Figure 7: Full segmentation of a document from the Pile via Algorithm 1 on Llama-3-8b. Borders indicate segmentation, with bolded letters indicating multi-token segments. Darker blue cells have higher scores, yellow cells have negative scores. The highest-scoring sequence in this document is “re really brave:” ($\psi=0.634$).

Appendix E Model Vocabularies

Tables 3 through 6 show the top 50 highest-scoring multi-token sequences for Llama-2-7b and Llama-3-8b across either five hundred Wikipedia articles or five hundred Pile samples. Entries were filtered to show only sequences that appear more than once.
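The aggregation behind these tables can be sketched as follows. This is a hedged illustration, not the authors' script: `top_sequences` is a hypothetical helper, and it assumes that segmenting each document yields (token tuple, $\psi$) pairs. Following the text and the table captions, it keeps multi-token sequences that occur more than once, averages $\psi$ over occurrences, and returns the highest-scoring entries.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[Tuple[str, ...], int, int, float]  # (tokens, n, ct, mean psi)


def top_sequences(segments: Iterable[Tuple[Tuple[str, ...], float]],
                  k: int = 50, min_count: int = 2) -> List[Row]:
    """Aggregate per-occurrence scores into a ranked vocabulary table."""
    scores: Dict[Tuple[str, ...], List[float]] = defaultdict(list)
    for tokens, psi in segments:
        if len(tokens) > 1:  # multi-token sequences only
            scores[tuple(tokens)].append(psi)

    rows = [(tokens, len(tokens), len(vals), sum(vals) / len(vals))
            for tokens, vals in scores.items()
            if len(vals) >= min_count]  # appears more than once
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows[:k]


# Made-up segments pooled from several documents.
toy_segments = [
    (("Cal", "g", "ary"), 0.6),
    (("Cal", "g", "ary"), 0.4),
    (("the",), 0.9),
    (("the",), 0.9),
    (("New", "York"), 0.7),
]
toy_table = top_sequences(toy_segments)
```

In the toy input, the single-token segment is dropped regardless of frequency and the sequence seen only once is filtered out, leaving one row with $n=3$ and ct $=2$.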

Figure 8: Test accuracy on CounterFact subject last tokens versus other tokens in the dataset for probes trained on Llama-3-8b ($n=5495$). $i$ represents the position being predicted (e.g., $i=-1$ is previous token prediction; $i=1$ is next-token prediction). We observe an “erasure” effect similar to Figure 2.
Figure 9: Test accuracy of probes on last tokens of Wikipedia multi-token words for probes trained on Llama-3-8b ($n=91935$; right). Test accuracy on all other tokens shown on the left. Similarly to Figure 2, we see an erasing effect that is not present for other types of tokens.
Figure 10: Test accuracy of probes on last tokens of Wikipedia multi-token entities for probes trained on Llama-3-8b ($n=36723$; right). Test accuracy on all other tokens shown on the left. Entities are identified via spaCy named entity recognition, excluding entity types that include digits.
Figure 11: Test accuracy of probes on last tokens of Wikipedia multi-token words for Llama-2-7b ($n=80606$; right). Test accuracy on all other tokens shown on the left. Similarly to Figure 2, we see an erasing effect that is not present for other types of tokens.
Figure 12: Test accuracy of probes on last tokens of Wikipedia multi-token entities for Llama-2-7b ($n=36723$; right). Test accuracy on all other tokens shown on the left. Entities are identified via spaCy named entity recognition, excluding entity types that include digits.
Figure 13: Breakdown for Section 3 probes tested on CounterFact first subject tokens, middle subject tokens, and last subject tokens. We observe an “erasing” effect only for last subject tokens. Because BOS tokens are recoverable by $i=-1$ probes at high rates, and since 55% of prompts tested on had subjects at the beginning, we filter examples for which BOS tokens are labels from the leftmost plot.
Figure 14: Probe test results for CounterFact subject last tokens broken down for unigrams, bigrams, and trigrams. Unigram subjects store previous token information at rates near 100%, even excluding BOS tokens.
Token Sequence $n$ ct $\psi$
Gottsche 3 2 0.685220
berth 3 2 0.680793
carries 3 2 0.647844
Eurocop 3 2 0.644104
franchises 3 2 0.642707
0 Women 3 2 0.639162
rape 3 2 0.632567
Rebell 3 3 0.614295
intermittently 4 2 0.613479
enn State 4 3 0.607535
North Dakota 4 10 0.600616
Sride 3 2 0.600013
fiction 2 2 0.599339
Sox 3 3 0.599043
Bazz 3 2 0.598242
erect 3 2 0.597915
borough 3 3 0.596054
encompasses 5 2 0.592084
northernmost 3 2 0.591607
Madras 3 2 0.590394
hull 3 2 0.586968
iron 2 2 0.586959
Galaxy 3 2 0.585879
began operations 3 2 0.584680
Redding 3 2 0.584244
gloss 3 2 0.576740
cello 3 2 0.573732
Gators 3 5 0.573675
senator 3 2 0.572947
restructuring 4 2 0.570552
supervised 3 3 0.570421
Mediterranean 4 2 0.567790
Madera 3 2 0.567563
sequel 3 2 0.563626
scarp 3 3 0.561548
Sout 3 2 0.560640
South Division 3 2 0.558720
rectangular 3 2 0.557339
Danny 3 2 0.556836
Examiner 4 2 0.555797
Kuwait 4 4 0.554636
Bogue 3 6 0.552219
Lancaster 3 3 0.552166
Leuven 4 3 0.548806
the Park 3 2 0.548687
first Baron 3 2 0.547447
fights 3 2 0.547171
Carpio 3 2 0.547116
Czech Republic 3 2 0.546651
Survive 4 2 0.546255
Table 3: Llama-2-7b Wikipedia results (1808 sequences total). $n$ is the number of tokens in the sequence, and ‘ct’ represents occurrences of this segment. $\psi$ is averaged over all occurrences.
Token Sequence $n$ ct $\psi$
1992 births 7 2 0.573139
19th-century 7 3 0.568861
dehydrogen 5 2 0.553029
Swahili 4 4 0.539052
Chuck Liddell 6 2 0.537169
its population was 5 5 0.534977
by per capita income 6 3 0.518991
are brownish 4 2 0.515703
ate women’s football 7 4 0.509384
Almeida 4 5 0.507277
of New South Wales 5 3 0.503120
2015 deaths 8 2 0.503074
Pittsburgh 3 3 0.503070
21st-century 7 4 0.499362
(NSW 4 9 0.497107
age of the United Kingdom 6 3 0.487303
Presidential 3 2 0.485317
Landmark 3 2 0.484965
Alistair 4 2 0.484930
Tauri 3 8 0.482449
2 km 4 2 0.479984
20th-century 7 3 0.475703
East Bay 3 2 0.475156
game goes in extra time, if the scored 10 2 0.472323
São Paulo 3 2 0.470874
Atlantic City 3 2 0.470726
Chaluk 3 2 0.467165
Frank Lloyd 3 2 0.462585
may refer to: 6 4 0.462234
gold medalists 4 2 0.458494
, 2nd Baron 6 2 0.456996
people) 4 4 0.454926
series aired 4 2 0.453057
Srib 3 2 0.451708
with blackish 4 2 0.450033
World Cup players 4 2 0.448979
main role 3 2 0.448569
Bos 4 2 0.448425
Asenath 4 2 0.448259
Royal Navy 3 3 0.445617
2. Bundesliga players 7 2 0.445210
External links 3 69 0.444921
an unincorpor 6 2 0.443527
Gast 2 4 0.437695
Pfor 3 2 0.432194
Elisio de Med 5 2 0.431518
" (2007) "Jad 12 2 0.429412
Elkh 3 2 0.428984
Früh 3 2 0.427781
order of the NK 5 2 0.424037
Table 4: Llama-3-8b Wikipedia results (892 sequences total). $n$ is the number of tokens in the sequence, and ‘ct’ represents occurrences of this segment. $\psi$ is averaged over all occurrences.
Token Sequence $n$ ct $\psi$
lower case 3 2 0.736012
storm 2 4 0.716379
excursion 4 2 0.713134
====… (72 ‘equals’ signs) 8 2 0.712982
Mom 3 2 0.706778
acre 3 2 0.629213
Subject 3 2 0.607172
ninth 3 2 0.606669
processing elements 3 2 0.599549
CVC 3 2 0.596735
VPN 3 3 0.596052
Regul 3 2 0.591968
bore 2 2 0.590212
$\dot{G 5 2 0.589714
Rates 3 2 0.589637
INSURANCE 5 2 0.584323
Commercial 4 2 0.581543
Barney 3 3 0.574872
PTA 3 2 0.571932
penetrated 4 2 0.570164
MG 3 2 0.569830
Leigh 3 2 0.567894
jail 3 3 0.567225
TNS 3 2 0.567003
peptides 4 2 0.565775
John Arena 3 2 0.565648
Disease 4 2 0.564662
welfare 4 4 0.564364
wild type 3 2 0.560699
uws 3 3 0.557799
ongrel 4 3 0.554208
liquid cry 3 3 0.553408
princess 3 2 0.551672
Denmark 3 2 0.548702
birthday 3 2 0.548504
atedmes 4 2 0.548171
"ENOENT 5 2 0.547169
third-party 4 2 0.546949
aliens 3 2 0.546507
Durban 3 4 0.545848
Bouncy 4 3 0.545826
CHO 3 2 0.542762
unjust 3 2 0.538813
these motivational 4 3 0.537485
DLS 3 4 0.535933
\n& 3 2 0.534510
uneven 3 2 0.533137
watt 3 2 0.532243
’She 3 2 0.531300
HP 3 3 0.529555
Table 5: Llama-2-7b Pile results (1658 sequences total). $n$ is the number of tokens in the sequence, and ‘ct’ represents occurrences of this segment. $\psi$ is averaged over all occurrences.
Token Sequence $n$ ct $\psi$
</td>\n<td> 9 2 0.627583
{d}x 5 3 0.599395
\n 4 3 0.587016
_{n=1}{̂\in 7 4 0.585434
</td>\n<td 8 2 0.573310
-2-2007-061 12 3 0.551581
reticulum 4 3 0.549337
INSURANCE 5 2 0.548263
32;\n internal static 8 2 0.547893
;\n internal static 6 9 0.540374
: At 4 2 0.538609
(2,9,’ 6 4 0.537495
Respondent 4 2 0.534509
\t\t}\n\n\t 7 3 0.530669
(3,0,’ 6 4 0.529493
_{n-1}\ar 7 2 0.527303
thank you for your understanding 6 2 0.513979
hydroxyl 4 2 0.510059
>\n*/\private $ 9 2 0.510054
in mukaan 5 2 0.506333
{w}{̂B}_{ 6 2 0.505970
/2\Z 5 2 0.501998
’); \nINSERT INTO 6 10 0.501055
7-f131 7 2 0.496881
0, 1L> 8 2 0.495809
/0 S 5 2 0.492042
5 Audi 4 2 0.491043
all that apply 4 3 0.490469
": true,\n 6 2 0.486807
4,\n 5 2 0.485315
to as DSP 5 2 0.484967
*B**]{}\ 6 2 0.483484
;\ninternal 5 3 0.479777
100% used 6 2 0.475673
", "x": 5 3 0.474701
2.7 4 2 0.473720
</td>\n 6 2 0.473578
" code=" 4 4 0.473514
e2d-d 6 2 0.473418
is under conversion 4 5 0.473355
{ int|sys 5 3 0.471213
();\n}\n\nprivate boolean isAny 12 2 0.470941
(2,8,’ 6 4 0.470214
trachea 4 2 0.469154
use in an automobile 6 2 0.467788
at org.apache.c 7 5 0.467637
world around us 4 2 0.464469
2\left(1+x 8 2 0.463555
or Commodore 5 3 0.463106
11-117 7 2 0.459824
Table 6: Llama-3-8b Pile results (819 sequences total). $n$ is the number of tokens in the sequence, and ‘ct’ represents occurrences of this segment. $\psi$ is averaged over all occurrences.