Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
Abstract
LLMs process text as sequences of tokens that roughly correspond to words, where less common words are represented by multiple tokens. However, individual tokens are often semantically unrelated to the meanings of the words/concepts they comprise. For example, Llama-2-7b’s tokenizer splits the word “northeastern” into the tokens [_n, ort, he, astern], none of which correspond to semantically meaningful units like “north” or “east.” Similarly, the overall meanings of named entities like “Neil Young” and multi-word expressions like “break a leg” cannot be directly inferred from their constituent tokens. Mechanistically, how do LLMs convert such arbitrary groups of tokens into useful higher-level representations? In this work, we find that last token representations of named entities and multi-token words exhibit a pronounced “erasure” effect, where information about previous and current tokens is rapidly forgotten in early layers. Using this observation, we propose a method to “read out” the implicit vocabulary of an autoregressive LLM by examining differences in token representations across layers, and present results of this method for Llama-2-7b and Llama-3-8b. To our knowledge, this is the first attempt to probe the implicit vocabulary of an LLM. Code and data are available at footprints.baulab.info.
Sheridan Feucht David Atkinson Byron C. Wallace David Bau Northeastern University {feucht.s, atkinson.da, b.wallace, d.bau}@northeastern.edu
1 Introduction
Despite their widespread use, the specific mechanisms by which LLMs are able to “understand” and generate coherent text are not well understood. One mystery is the process by which groups of subword tokens are converted into meaningful representations, a process described by Elhage et al. (2022) and Gurnee et al. (2023) as detokenization.
![Refer to caption](x1.png)
Current language models process text as a series of tokens drawn from a set token vocabulary: One token can correspond to a single word (_fish), or to a piece of a larger word (mon in “salmon”). The vocabulary of tokens available to a model is typically determined before training with byte-pair encoding (Sennrich et al., 2016), which is based on a specific dataset and can lead to unintuitive results. For example, Llama-2-7b’s (Touvron et al., 2023) tokenizer breaks the word “northeastern” into the tokens [_n, ort, he, astern], none of which correspond to semantically meaningful units like “north” or “east.” Capitalization also creates unexpected issues: for example, the word “Hawaii” is split into two tokens if the first letter is capitalized [_Hawai, i], but four if the first letter is lowercase [_ha, w, ai, i]. In spite of these challenges, large models are apparently able to “understand” such idiosyncratic tokenizations of multi-token words with few observable effects on downstream performance (Gutiérrez et al., 2023), unless these weaknesses are directly targeted (Wang et al., 2024; Batsuren et al., 2024). How is this possible?
We hypothesize that during pretraining, LLMs develop an implicit vocabulary that maps from groups of arbitrary tokens to semantically meaningful units. These lexical units may be multi-token words (“northeastern”), named entities (“Neil Young”), or idiomatic multi-word expressions (“break a leg”) and can be understood as “item[s] that function as single unit[s] of meaning” in a model’s vocabulary (Simpson, 2011). Lexical items are also non-compositional: Just as the meaning of “break a leg” cannot be predicted from the individual meanings of “break” and “leg,” the meaning of “patrolling” cannot be predicted from its constituent tokens pat and rolling. This arbitrariness necessitates some kind of storage system, implicit or otherwise (Murphy, 2010).
How exactly do LLMs deal with these cases mechanistically? In this paper, we begin to answer this question by investigating token-level information stored in LLM representations.
- We develop a heuristic for scoring the “lexicality” of a given sequence of tokens, and use it to “read out” a list of an LLM’s lexical items given a large dataset of natural text.
We interpret this erasure effect as a “footprint” of a mechanism in early layers that orchestrates the formation of meaningful lexical items.
2 Background
Previous work has shown that knowledge about a multi-token entity is often stored in the last token of that entity. For example, Meng et al. (2022) found that factual information about a subject like “The Space Needle” is concentrated in the representation for le. Geva et al. (2023) find evidence for a subject enrichment stage during factual recall, where information about an entity is collected at its last token in early layers, which is also seen in other work on factual recall using the same dataset (Katz et al., 2024), and corroborated by research on athlete sport lookups (Nanda et al., 2023). This phenomenon may be due to the autoregressive nature of decoder transformer models: models cannot enrich “Space” with information about Seattle until after “Needle” is seen, as “Space” could refer to a number of unrelated concepts (“Space Jam,” “Space Station”). This is not a hard-and-fast rule, however; it depends on entity frequency and context cues. For example, if a model sees _The, _E, and iff, it may already know that these tokens refer to “The Eiffel Tower” without needing to see el and Tower.
Other work in interpretability has also started to uncover evidence of models encoding lexical items. Elhage et al. (2022) observe neurons in early layers that fire on the last tokens of multi-token words, names of famous people, generic nouns, compound words, and LaTeX commands. They also find late-layer neurons that seem to be relevant to retokenization, i.e., conversion from internal representations back into tokens. For example, a retokenization neuron might fire on _st and promote rag in order to facilitate the output of the word “straggler.” Gurnee et al. (2023) also find examples of polysemantic neurons in Pythia models (Mallen and Belrose, 2023) that activate for a number of multi-token constructions like “apple developer,” “Bloom.ington,” and “research.gate.”
3 Linear Probing of Hidden States
3.1 Method
If last token positions are so important (Section 2), then what do these representations encode? Perhaps the last hidden state directly stores information about other subject tokens (e.g., _Wars might contain some encoding for _Star in its hidden state). To test this hypothesis, we investigate hidden states for both Llama-2-7b and Llama-3-8b, as they have significantly different token vocabulary sizes (32k and 128k tokens, respectively). We train linear probes $p_i^{(\ell)}$ to take a hidden state $h_t^{(\ell)}$ at layer $\ell$ and token position $t$ and predict the value of a nearby token $t+i$ (e.g., a probe trained to predict the previous token for layer 5 hidden states would be denoted by $p_{-1}^{(5)}$).
We train probes for all layer indexes $\ell$ and offsets $i$. We also train probes in the same manner on the embedding layer and on the final outputs of the network before the language modelling head. Probes are trained on a random sample of 428k tokens from the Pile (Gao et al., 2020) using AdamW for 16 epochs with a batch size of 4 and a learning rate of 0.1. Hyperparameters were selected based on validation performance on a separate Pile sample (279k tokens) after a random sweep. Each probe takes 6-8 hours to train on an RTX-A6000.
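The way probe training examples pair a hidden state with a nearby token can be sketched as follows. This is a minimal illustration rather than the training code used for the experiments: `make_probe_pairs` is a hypothetical helper, and hidden states are represented by plain lists of floats with made-up values.

```python
from typing import List, Sequence, Tuple


def make_probe_pairs(
    hidden_states: Sequence[Sequence[float]],
    token_ids: Sequence[int],
    offset: int,
) -> List[Tuple[Sequence[float], int]]:
    """Pair the hidden state at position t with the token id at t + offset.

    Positions whose target index falls outside the document are skipped.
    """
    if len(hidden_states) != len(token_ids):
        raise ValueError("need exactly one hidden state per token")
    pairs = []
    for t, h in enumerate(hidden_states):
        target = t + offset
        if 0 <= target < len(token_ids):
            pairs.append((h, token_ids[target]))
    return pairs


# Toy document: four tokens with 2-dimensional hidden states (hypothetical values).
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
tokens = [11, 22, 33, 44]

prev_pairs = make_probe_pairs(states, tokens, offset=-1)  # previous-token probe data
curr_pairs = make_probe_pairs(states, tokens, offset=0)   # current-token probe data
```

With an offset of -1 the first position has no target and is dropped, so a four-token document yields three previous-token examples and four current-token examples. One such dataset would be built per layer and offset.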
3.2 CounterFact Subjects
After training probes in Section 3.1, we test them on the CounterFact dataset (Meng et al., 2022), which consists of prompts about subjects that require factual knowledge to complete correctly (e.g., “Mount Passel is in Antarctica”). We filter the dataset to include only prompts that the model answers correctly, yielding 5,063 examples for Llama-2-7b and 5,495 examples for Llama-3-8b. To augment this dataset, we also sampled and filtered down [album/movie/series creator] pairs from Wikidata (Vrandečić and Krötzsch, 2014) and embedded them in prompts in the same manner, yielding a total of 12,135 correctly-answered prompts for Llama-2-7b and 13,995 for Llama-3-8b.
Figure 2 shows probe test results on CounterFact last subject tokens (right) versus every other type of token in the dataset (left). We see a striking “erasure” effect for last tokens of CounterFact subjects, where these hidden states consistently “forget about” preceding and current tokens. Subject tokens that are not in the last position (e.g., _Star) do not exhibit this pattern (Appendix A, Figure 13). This striking drop in token accuracy is reminiscent of the subject enrichment stage described by Geva et al. (2023), suggesting that the tokens _Star and _Wars may be overwritten in the process of representing the concept of Star Wars.
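The comparison behind this kind of plot reduces to per-layer probe accuracy computed separately for two groups of token positions. The sketch below assumes probe predictions have already been collected as `(layer, predicted_id, gold_id)` records; the record format, the function name, and the toy numbers are all illustrative assumptions, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_layer(records: Iterable[Tuple[int, int, int]]) -> Dict[int, float]:
    """Top-1 probe accuracy per layer from (layer, predicted_id, gold_id) records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for layer, predicted, gold in records:
        totals[layer] += 1
        hits[layer] += int(predicted == gold)
    return {layer: hits[layer] / totals[layer] for layer in sorted(totals)}


# Toy records (made up): probes are right at layer 1 for both groups,
# but only the last-subject-token group collapses by layer 9.
last_subject = [(1, 5, 5), (1, 7, 7), (9, 3, 5), (9, 2, 7)]
other_tokens = [(1, 5, 5), (1, 7, 7), (9, 5, 5), (9, 2, 7)]

last_acc = accuracy_by_layer(last_subject)
other_acc = accuracy_by_layer(other_tokens)
```

An "erasure" pattern would show up as a much larger accuracy drop between early layers for the last-subject-token group than for the other group.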
We also observe the same phenomenon when testing on named entities identified by spaCy in Wikipedia articles (Appendix A, Figure 12), suggesting that this effect is not an artifact of the short templates found in the CounterFact dataset. Additionally, we consider whether this result is due to imbalances in probe training data, but this seems not to be the case either (Appendix B).
3.3 Multi-Token Words
Intuitively, the process of converting a multi-token sequence like [_n, ort, he, astern] into a meaningful representation of the word “northeastern” resembles the process of converting [_E, iff, el, Tower] into “Eiffel Tower.” We hypothesize that models treat multi-token words in the same way that they treat multi-token entities, and test our probes from Section 3.1 on multi-token words. After sampling 500 articles (256k tokens) from the 20220301.en split of the Wikipedia dump (Foundation, 2022), we split by white-space to naively identify word boundaries. As predicted, we see the same “erasing” pattern for multi-token words that we do for multi-token entities (Appendix A, Figure 11). This suggests that they may be processed in a similar manner in early layers.
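Naive whitespace-based identification of multi-token words can be sketched as below. The toy tokenizer is a stand-in assumption (a real subword tokenizer would be used in practice, and it may tokenize words differently in context); only the split of “northeastern” is taken from the text above.

```python
from typing import Callable, List, Tuple


def multi_token_words(
    text: str, tokenize: Callable[[str], List[str]]
) -> List[Tuple[str, List[str]]]:
    """Return (word, tokens) for every whitespace-delimited word split into 2+ tokens."""
    found = []
    for word in text.split():  # naive word boundaries: split on whitespace
        pieces = tokenize(word)
        if len(pieces) > 1:
            found.append((word, pieces))
    return found


# Hypothetical lookup table standing in for a subword tokenizer.
TOY_VOCAB = {
    "northeastern": ["_n", "ort", "he", "astern"],
    "fish": ["_fish"],
}


def toy_tokenize(word: str) -> List[str]:
    # Unknown words are treated as a single token in this toy setup.
    return TOY_VOCAB.get(word, ["_" + word])


found = multi_token_words("fish swim northeastern", toy_tokenize)
```

Here only “northeastern” is returned, since the other two words map to a single token each. Punctuation attached to a word is not stripped, which is part of what makes the approach naive.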
4 Building a Vocabulary
After examination of probe behavior for multi-token words and entities, we hypothesize that this “erasure” effect is a result of the implicit formation of lexical representations in early layers. To characterize this phenomenon, we propose an erasure score to identify token sequences that follow the pattern observed in Section 3. We then introduce an approach to “reading out” a list of implicit vocabulary entries for a given model using this score.
4.1 An Erasure Score
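We first define an erasure score $\psi$, which is a heuristic designed around the intuition that the last token representation of a lexical item should exhibit a strong “erasing” effect for probe predictions from layer 1 to layer $L$. (For both Llama-2-7b and Llama-3-8b we set $L=9$.) The score $\psi_{p,q}$ quantifies “erasing” behavior at token positions within a given sequence $s_{p,q}$ at indices $p$ through $q$, and penalizes any erasure of tokens outside of these boundaries. Equation 1 defines the score for a sequence $s_{p,q}$ of length $n = q - p + 1$ as:

$$\psi_{p,q} = \frac{1}{1+2n}\left(\delta(q,0) + \sum_{t=p}^{q}\,\sum_{i=-2}^{-1} \mathbb{1}_{\text{within}}(t,i)\cdot\delta(t,i)\right) \tag{1}$$

where $\delta(t,i)$ denotes the change in probability of the predicted token $t+i$ from layer 1 to layer $L$, based on probes from Section 3.1. Writing $P_{p_i^{(\ell)}}(t+i \mid h_t^{(\ell)})$ for the probability that the offset-$i$ probe at layer $\ell$ assigns to the token at position $t+i$ given the hidden state $h_t^{(\ell)}$ at position $t$:

$$\delta(t,i) = P_{p_i^{(1)}}\left(t+i \mid h_t^{(1)}\right) - P_{p_i^{(L)}}\left(t+i \mid h_t^{(L)}\right) \tag{2}$$

If $t+i$ lies outside the boundaries of $s_{p,q}$, we decrease $\psi_{p,q}$. Otherwise, a large drop between layers increases the value of $\psi_{p,q}$.

$$\mathbb{1}_{\text{within}}(t,i) = \begin{cases} -1 & \text{if } t+i < p \\ \phantom{-}1 & \text{otherwise} \end{cases} \tag{3}$$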
We provide further explanation of the intuition behind this approach in Appendix C.
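A small numerical sketch may make the score more concrete. It assumes the score averages the current-token probability drop at the last position together with the previous-token drops (offsets -1 and -2) at every position in the sequence, flipping the sign of any drop whose target lies before the start of the sequence; the normalization by 1 + 2n and the dictionary layout of probe probabilities are assumptions of this sketch, and all probability values are made up.

```python
from typing import Dict, Tuple

ProbeProbs = Dict[Tuple[int, int], float]  # (position t, offset i) -> probability of token t+i


def delta(first: ProbeProbs, last: ProbeProbs, t: int, i: int) -> float:
    """Drop in probe probability for token t+i between the first and the L-th layer."""
    return first.get((t, i), 0.0) - last.get((t, i), 0.0)


def erasure_score(first: ProbeProbs, last: ProbeProbs, p: int, q: int) -> float:
    """Erasure score for the token span p..q (inclusive)."""
    n = q - p + 1
    total = delta(first, last, q, 0)  # last token "forgetting itself"
    for t in range(p, q + 1):
        for i in (-2, -1):
            sign = -1.0 if t + i < p else 1.0  # penalize erasure outside the span
            total += sign * delta(first, last, t, i)
    return total / (1 + 2 * n)


# Toy span covering positions 2..3 (so n = 2).
layer_1 = {(3, 0): 0.9, (3, -1): 0.8, (3, -2): 0.5, (2, -1): 0.6, (2, -2): 0.3}
layer_L = {(3, 0): 0.1, (3, -1): 0.2, (3, -2): 0.5, (2, -1): 0.4, (2, -2): 0.3}

score = erasure_score(layer_1, layer_L, p=2, q=3)
```

In this toy case the within-span drops contribute 0.8 and 0.6, the one out-of-span drop contributes -0.2, and the sum 1.2 is divided by 5, giving a score of about 0.24.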
4.2 Segmenting Documents
We develop an algorithm built around our erasure score that breaks any given document $d$ into high-scoring, non-overlapping segments covering all of $d$ (Algorithm 1, Appendix D). Figure 1 shows the top-scoring sequences calculated in this manner from a Wikipedia excerpt about Thelonious Monk, where unigram scores are excluded for clarity. Not all multi-token words are scored highly via our approach, but the highest-scoring sequences are plausible lexical items that are non-compositional in nature (“dram.atic”, “sil.ences”, “tw.ists”). We share examples of complete segmentations in Appendix D.
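One way to obtain a high-scoring, non-overlapping segmentation that covers a whole document is a greedy pass over scored candidate spans. The sketch below is not Algorithm 1; it is one plausible strategy under stated assumptions: the `score` callback, the `max_len` cap on span length, and the toy scores are all hypothetical.

```python
from typing import Callable, List, Tuple


def segment(
    n_tokens: int, score: Callable[[int, int], float], max_len: int = 4
) -> List[Tuple[int, int]]:
    """Greedily pick the highest-scoring non-overlapping spans (p, q), inclusive.

    Single-token spans are always candidates, so every token ends up covered.
    """
    candidates = [
        (score(p, q), p, q)
        for p in range(n_tokens)
        for q in range(p, min(n_tokens, p + max_len))
    ]
    candidates.sort(key=lambda c: c[0], reverse=True)  # best score first
    covered = [False] * n_tokens
    chosen = []
    for _, p, q in candidates:
        if not any(covered[p:q + 1]):
            chosen.append((p, q))
            for t in range(p, q + 1):
                covered[t] = True
    return sorted(chosen)


# Toy scores: two overlapping strong candidates and one weaker one.
toy_scores = {(1, 2): 0.9, (2, 3): 0.8, (3, 4): 0.5}


def toy_score(p: int, q: int) -> float:
    return toy_scores.get((p, q), 0.1 if p == q else 0.0)


segments = segment(5, toy_score)
```

On the toy input the span (1, 2) wins over the overlapping (2, 3), after which (3, 4) and the leftover unigram (0, 0) complete the cover. A greedy choice is simple but not guaranteed to maximize the total score; a dynamic program over span boundaries would be the natural alternative.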
4.3 Model Vocabularies
llama | data | MTW prec. | MTW recall | MTE prec. | MTE recall |
---|---|---|---|---|---|
2-7b | wiki | 0.306 | 0.016 | 0.143 | 0.016 |
2-7b | pile | 0.296 | 0.017 | 0.080 | 0.018 |
3-8b | wiki | 0.044 | 0.001 | 0.010 | 0.000 |
3-8b | pile | 0.023 | 0.001 | 0.012 | 0.001 |
Finally, we propose a method to “read out” the implicit vocabulary of a model given a dataset $D$. For each document $d \in D$, we segment $d$ using Algorithm 1. We then average scores for every multi-token sequence that appears more than once across all documents. As this process is very data-dependent, we compare the top 50 results for Pile and Wikipedia text in Appendix E.
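The aggregation step described here can be sketched in a few lines. The input format (each document as a list of `(token_tuple, score)` segments) and the function name are assumptions made for illustration; the scores are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Segment = Tuple[Tuple[str, ...], float]  # (tokens of the segment, its score)


def read_out_vocabulary(
    segmented_docs: Iterable[List[Segment]],
) -> List[Tuple[Tuple[str, ...], float]]:
    """Average scores of multi-token segments seen more than once, best first."""
    scores: Dict[Tuple[str, ...], List[float]] = defaultdict(list)
    for doc in segmented_docs:
        for tokens, score in doc:
            if len(tokens) > 1:  # unigrams are not multi-token sequences
                scores[tokens].append(score)
    averaged = {
        seq: sum(vals) / len(vals) for seq, vals in scores.items() if len(vals) > 1
    }
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)


# Two toy segmented documents with hypothetical scores.
doc_a = [(("_n", "ort", "he", "astern"), 0.6), (("_the",), 0.1), (("_Neil", "_Young"), 0.5)]
doc_b = [(("_n", "ort", "he", "astern"), 0.4), (("_br", "eak"), 0.9)]

vocab = read_out_vocabulary([doc_a, doc_b])
```

Only the sequence that occurs in both toy documents survives the "more than once" filter; its two scores average to 0.5. Sequences seen a single time are dropped regardless of how high they score.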
With this approach, we are able to recover 1800 sequences for Llama-2-7b and 900 for Llama-3-8b using the same five hundred Wikipedia articles. Although recall is quite low (Table 1), we find that 44.9% of sequences recovered for Llama-2-7b on Wikipedia text are either multi-token words or multi-token entities (29.68% for Pile text). For Llama-3-8b, only 5% and 3% of sequences are MTWs or MTEs. However, looking at examples of Llama-3-8b sequences in Appendix E, we can observe other interesting cases, like multi-token expressions (“gold medalists,” “by per capita income,” “thank you for your understanding”) and LaTeX commands (as similarly observed by Elhage et al. (2022)). Because Llama-3-8b’s token vocabulary is four times larger than Llama-2-7b’s, its implicit vocabulary also seems to consist of more multi-word expressions and chunks of code rather than multi-token words (Appendix E, Table 6).
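For reference, precision and recall of a recovered vocabulary against a reference set of assumed lexical items can be computed as below. Exact string matching is an assumption of this sketch (the matching procedure is not spelled out above), and the toy sets are made up.

```python
from typing import Iterable, Tuple


def precision_recall(recovered: Iterable[str], reference: Iterable[str]) -> Tuple[float, float]:
    """Precision and recall of recovered sequences against a reference set."""
    recovered_set, reference_set = set(recovered), set(reference)
    hits = len(recovered_set & reference_set)
    precision = hits / len(recovered_set) if recovered_set else 0.0
    recall = hits / len(reference_set) if reference_set else 0.0
    return precision, recall


# Toy example: two of three recovered sequences are in the reference set,
# which itself contains four items.
recovered = ["northeastern", "Neil Young", "of the"]
reference = ["northeastern", "Neil Young", "Hawaii", "patrolling"]

precision, recall = precision_recall(recovered, reference)
```

Precision answers "how many recovered sequences are reference items" and recall answers "how many reference items were recovered", which is why a short, selective list can have reasonable precision while recall stays very low.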
5 Conclusion
In this work, we present preliminary evidence for the existence of an implicit vocabulary that allows models to convert from byte-pair encoded tokens to useful lexical items. We posit that the “erasure” effect we observe for Llama-2-7b and Llama-3-8B is a result of model processes that deal with multi-token expressions, and use this insight to propose a new method for “reading out” an LLM’s implicit vocabulary. This is a first step towards understanding the formation of lexical representations in LLMs, and may serve as a useful tool for elucidation of words that a given model “knows.”
Limitations
Evaluation of implicit vocabulary-building methods (Section 4) is challenging due to the lack of a known ground truth. Our approach is motivated by the desire to understand the inner workings of the model being studied, but we have no authoritative reference that distinguishes between situations where a given sequence gets a high $\psi$ value because it is truly treated as a lexical unit by the model, or where it may be due to an error in our methodology. To quantify our results, we have compared the extracted vocabulary to sequences that we assume to be likely lexical items: multi-token words and spaCy named entities. However, this likely does not cover all cases for which “token grouping” occurs in LLMs.
Another limitation of this work is that we have restricted our analysis to known entities. There is also the question of what happens for intermediate cases such as plausible-sounding fictional towns or names of people who are not famous. If $\psi$ correlates with sequence presence in training data, these results could be useful for understanding how familiar an LLM is with a given word or entity.
Finally, our measurements have been run only on the Llama family of models and do not yet extend to non-Llama models of comparable size, or Llama models of larger sizes.
Ethics Statement
In this work, we restrict our analysis to English words, due to our biases as native speakers of English. We hope that this work can also provide valuable insights for other languages, especially low-resource languages, where understanding “what words an LLM knows” may be especially useful.
Acknowledgments
We thank David Smith, Bilal Chughtai, Chantal Shaib, Atticus Geiger, and Adrian Chang for helpful discussion and feedback throughout the course of this project. This work was supported in part by Open Philanthropy, and by the National Science Foundation (NSF) grant IIS-1901117.
Experiments were implemented using the nnsight library; many were run on the Center for AI Safety Compute Cluster. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.
References
- Batsuren et al. (2024) Khuyagbaatar Batsuren, Ekaterina Vylomova, Verna Dankers, Tsetsuukhei Delgerbaatar, Omri Uzan, Yuval Pinter, and Gábor Bella. 2024. Evaluating subword tokenization: Alien subword composition and oov generalization challenge. Preprint, arXiv:2404.13292.
- Elhage et al. (2022) Nelson Elhage, Tristan Hume, Catherine Olsson, Neel Nanda, Tom Henighan, Scott Johnston, Sheer ElShowk, Nicholas Joseph, Nova DasSarma, Ben Mann, Danny Hernandez, Amanda Askell, Kamal Ndousse, Andy Jones, Dawn Drain, Anna Chen, Yuntao Bai, Deep Ganguli, Liane Lovitt, Zac Hatfield-Dodds, Jackson Kernion, Tom Conerly, Shauna Kravec, Stanislav Fort, Saurav Kadavath, Josh Jacobson, Eli Tran-Johnson, Jared Kaplan, Jack Clark, Tom Brown, Sam McCandlish, Dario Amodei, and Christopher Olah. 2022. Softmax linear units. Transformer Circuits Thread. https://transformer-circuits.pub/2022/solu/index.html.
- Foundation (2022) Wikimedia Foundation. 2022. Wikimedia downloads.
- Gao et al. (2020) Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.
- Geva et al. (2023) Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. 2023. Dissecting recall of factual associations in auto-regressive language models. ArXiv, abs/2304.14767.
- Gurnee et al. (2023) Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. 2023. Finding neurons in a haystack: Case studies with sparse probing. Preprint, arXiv:2305.01610.
- Gutiérrez et al. (2023) Bernal Jiménez Gutiérrez, Huan Sun, and Yu Su. 2023. Biomedical language models are robust to sub-optimal tokenization. Preprint, arXiv:2306.17649.
- Katz et al. (2024) Shahar Katz, Yonatan Belinkov, Mor Geva, and Lior Wolf. 2024. Backward lens: Projecting language model gradients into the vocabulary space. Preprint, arXiv:2402.12865.
- Mallen and Belrose (2023) Alex Mallen and Nora Belrose. 2023. Eliciting latent knowledge from quirky language models. Preprint, arXiv:2312.01037.
- Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in gpt. In Neural Information Processing Systems.
- Meta (2024) Meta. 2024. Introducing meta llama 3: The most capable openly available llm to date.
- Murphy (2010) M. Lynne Murphy. 2010. Lexical Meaning. Cambridge Textbooks in Linguistics. Cambridge University Press.
- Nanda et al. (2023) Neel Nanda, Senthooran Rajamanoharan, János Krámar, and Rohin Shah. 2023. Fact finding: Attempting to reverse-engineer factual recall on the neuron level.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
- Simpson (2011) James Simpson. 2011. The Routledge handbook of applied linguistics. Taylor & Francis.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.
- Vrandečić and Krötzsch (2014) Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Commun. ACM, 57(10):78–85.
- Wang et al. (2024) Dixuan Wang, Yanda Li, Junyuan Jiang, Zepeng Ding, Guochao Jiang, Jiaqing Liang, and Deqing Yang. 2024. Tokenization matters! degrading large language models through challenging their tokenization. Preprint, arXiv:2405.17067.
Appendix A Linear Probing on Hidden States
A.1 Llama-3-8b Results
CounterFact Accuracy
Multi-Token Word Accuracy
Figure 9 shows results for probes tested on the last token positions of multi-token words from Wikipedia (where “words” are determined by whitespace separation).
Multi-Token Entity Accuracy
Figure 10 shows results for probes tested on the last token positions of multi-token entities identified by spaCy, using the same dataset as A.1. We use spaCy’s named entity recognition pipeline to identify named entities. Because digits 0-9 are added to Llama-2-7b’s vocabulary, we filter out all classes relating to numbers (PERCENT, DATE, CARDINAL, TIME, ORDINAL, MONEY, QUANTITY), with the thought that these sequences may be treated differently at the detokenization stage.
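The label filter described above can be sketched without running an NER model. The `(text, label)` pairs below stand in for the text and label of each entity that a named entity recognizer would return; the example entities and the helper name are illustrative assumptions, while the excluded label names are the ones listed in the text.

```python
from typing import List, Tuple

# Entity classes relating to numbers, excluded from the analysis.
NUMERIC_LABELS = {"PERCENT", "DATE", "CARDINAL", "TIME", "ORDINAL", "MONEY", "QUANTITY"}


def filter_entities(entities: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only entities whose label is not a number-related class."""
    return [(text, label) for text, label in entities if label not in NUMERIC_LABELS]


# Hypothetical NER output for a short passage.
entities = [
    ("Thelonious Monk", "PERSON"),
    ("1947", "DATE"),
    ("North Dakota", "GPE"),
    ("three", "CARDINAL"),
]

kept = filter_entities(entities)
```

Only the person and place entities remain in this toy example; the date and the cardinal number are removed.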
A.2 Llama-2-7b Results
Multi-Token Word Accuracy
Multi-Token Entity Accuracy
A.3 Llama-2-7b Extras
![Refer to caption](extracted/5696859/figures/probe-pile.png)
Pile Accuracy
Comparison of Token Positions
Figure 13 shows the breakdown of probe performance on different types of subject tokens: first subject tokens, middle subject tokens, and last subject tokens. The drop in previous- and current-token accuracy seen for last subject tokens still exists for first and middle subject tokens, but is not as drastic.
Comparison of Subject Lengths
Appendix B Accounting for Possible Training Imbalance
One explanation for the observed drop in accuracy for CounterFact entities across layers is that our probes have simply not been exposed to as many entity tokens during training. We do not believe this is the case for Llama-2-7b for two reasons: (1) If this effect were due to probes being less sensitive to tokens found in multi-token entities, we would also see a significant drop for first and middle tokens, which does not occur (Figure 13). (2) We measure the frequency of all test n-grams in the original Pile data used to train our probes, and find that both subject and non-subject n-grams are found in the probe training dataset at similar rates, with the median number of occurrences in the test set for both types of sequences being zero. After removing the few non-subject sequences that do appear often in the probe training set, we still see the same “erasure” effect.
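The frequency check in point (2) amounts to counting how often each test n-gram occurs in the probe training tokens and comparing medians across the two groups. The sketch below uses a made-up training stream and made-up n-grams; the helper names are hypothetical.

```python
from statistics import median
from typing import Iterable, Sequence, Tuple


def count_ngram(tokens: Sequence[str], ngram: Sequence[str]) -> int:
    """Number of (possibly overlapping) occurrences of ngram in tokens."""
    n = len(ngram)
    target = tuple(ngram)
    return sum(
        1 for k in range(len(tokens) - n + 1) if tuple(tokens[k:k + n]) == target
    )


def median_occurrences(tokens: Sequence[str], ngrams: Iterable[Tuple[str, ...]]) -> float:
    """Median training-set frequency over a group of test n-grams."""
    return median(count_ngram(tokens, g) for g in ngrams)


# Toy probe-training token stream and two groups of test n-grams.
train = ["_the", "_Space", "_Need", "le", "_is", "_in", "_the", "_city"]
subject_ngrams = [("_Space", "_Need", "le"), ("_E", "iff", "el"), ("_Star", "_Wars")]
other_ngrams = [("_is", "_in"), ("_the", "_city"), ("_in", "_a")]

subject_median = median_occurrences(train, subject_ngrams)
other_median = median_occurrences(train, other_ngrams)
```

If the two medians are similar on real data, a difference in probe behavior between the groups is harder to attribute to how often the probes saw those sequences during training.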
Appendix C Intuition for $\psi$
C.1 Explanation of Equation 1
Our first assumption when designing $\psi$ is that if an LLM is processing a sequence of tokens corresponding to a lexical item (e.g., [Cal, g, ary]), the last token ary in that sequence should “forget itself” between layers 1 and $L$, according to the pattern we observe in Section 3. We measure this drop using probability scores from probe outputs for a single example. In plain English, the first term in Equation 1 represents how much the predicted probability of ary drops between layer 1 and layer $L$ when probing the hidden representation for ary.
In the double summation term, we take into account probe predictions for tokens one ($i=-1$) or two ($i=-2$) positions before the current token. If there is a drop in probability between layer 1 and layer $L$ for these positions, we take this as evidence of token clumping.
However, we also observe (from Figure 13 and manual inspection) that probe predictions for token positions that lie outside the boundaries of a presumed lexical item are maintained across layers, or even promoted. This is clear from the case in the leftmost plot for Figure 13, and was also a consistent pattern when we examined probe behavior over a number of example documents. Given this fact, we include the indicator function to decrease $\psi$ in cases where erasure is happening outside the bounds of the given sequence.
C.2 Choice of $L$
We choose $L=9$ based on probe “erasure” behavior for Llama-2-7b and Llama-3-8b, particularly Figure 2. We also present a short ablation experiment for $L$ with results in Table 2, which shows that other values of $L$ after “the drop” are roughly equivalent to $L=9$.
$L$ | MTW prec. | MTW recall | MTE prec. | MTE recall |
---|---|---|---|---|
5 | 0.307 | 0.002 | 0.143 | 0.002 |
9 | 0.306 | 0.016 | 0.143 | 0.016 |
13 | 0.328 | 0.003 | 0.169 | 0.003 |
17 | 0.330 | 0.003 | 0.180 | 0.003 |
21 | 0.319 | 0.003 | 0.172 | 0.003 |
Appendix D Document Segmentation
We show full document segmentations using Algorithm 1 for short excerpts from the same Wikipedia article in Figure 4 and Figure 5. Figure 6 and Figure 7 show more segmentations for a Pile document.
![Refer to caption](extracted/5696859/figures/doc_examples/llama2_wikival_doc0.png)
![Refer to caption](extracted/5696859/figures/doc_examples/llama3_wikival_doc0.png)
![Refer to caption](extracted/5696859/figures/doc_examples/llama2_pile_doc2.png)
![Refer to caption](extracted/5696859/figures/doc_examples/llama3_pile_doc2.png)
Appendix E Model Vocabularies
Tables 3 through 6 show the top 50 highest-scoring multi-token sequences for Llama-2-7b and Llama-3-8b across either five hundred Wikipedia articles or five hundred Pile samples. Entries were filtered to show only sequences that appear more than once.
![Refer to caption](x3.png)
![Refer to caption](x4.png)
![Refer to caption](x5.png)
![Refer to caption](x6.png)
![Refer to caption](x7.png)
![Refer to caption](x8.png)
![Refer to caption](x9.png)
Token Sequence | $n$ | ct | $\psi$ |
---|---|---|---|
Gottsche | 3 | 2 | 0.685220 |
berth | 3 | 2 | 0.680793 |
carries | 3 | 2 | 0.647844 |
Eurocop | 3 | 2 | 0.644104 |
franchises | 3 | 2 | 0.642707 |
0 Women | 3 | 2 | 0.639162 |
rape | 3 | 2 | 0.632567 |
Rebell | 3 | 3 | 0.614295 |
intermittently | 4 | 2 | 0.613479 |
enn State | 4 | 3 | 0.607535 |
North Dakota | 4 | 10 | 0.600616 |
Sride | 3 | 2 | 0.600013 |
fiction | 2 | 2 | 0.599339 |
Sox | 3 | 3 | 0.599043 |
Bazz | 3 | 2 | 0.598242 |
erect | 3 | 2 | 0.597915 |
borough | 3 | 3 | 0.596054 |
encompasses | 5 | 2 | 0.592084 |
northernmost | 3 | 2 | 0.591607 |
Madras | 3 | 2 | 0.590394 |
hull | 3 | 2 | 0.586968 |
iron | 2 | 2 | 0.586959 |
Galaxy | 3 | 2 | 0.585879 |
began operations | 3 | 2 | 0.584680 |
Redding | 3 | 2 | 0.584244 |
gloss | 3 | 2 | 0.576740 |
cello | 3 | 2 | 0.573732 |
Gators | 3 | 5 | 0.573675 |
senator | 3 | 2 | 0.572947 |
restructuring | 4 | 2 | 0.570552 |
supervised | 3 | 3 | 0.570421 |
Mediterranean | 4 | 2 | 0.567790 |
Madera | 3 | 2 | 0.567563 |
sequel | 3 | 2 | 0.563626 |
scarp | 3 | 3 | 0.561548 |
Sout | 3 | 2 | 0.560640 |
South Division | 3 | 2 | 0.558720 |
rectangular | 3 | 2 | 0.557339 |
Danny | 3 | 2 | 0.556836 |
Examiner | 4 | 2 | 0.555797 |
Kuwait | 4 | 4 | 0.554636 |
Bogue | 3 | 6 | 0.552219 |
Lancaster | 3 | 3 | 0.552166 |
Leuven | 4 | 3 | 0.548806 |
the Park | 3 | 2 | 0.548687 |
first Baron | 3 | 2 | 0.547447 |
fights | 3 | 2 | 0.547171 |
Carpio | 3 | 2 | 0.547116 |
Czech Republic | 3 | 2 | 0.546651 |
Survive | 4 | 2 | 0.546255 |
Token Sequence | $n$ | ct | $\psi$ |
---|---|---|---|
1992 births | 7 | 2 | 0.573139 |
19th-century | 7 | 3 | 0.568861 |
dehydrogen | 5 | 2 | 0.553029 |
Swahili | 4 | 4 | 0.539052 |
Chuck Liddell | 6 | 2 | 0.537169 |
its population was | 5 | 5 | 0.534977 |
by per capita income | 6 | 3 | 0.518991 |
are brownish | 4 | 2 | 0.515703 |
ate women’s football | 7 | 4 | 0.509384 |
Almeida | 4 | 5 | 0.507277 |
of New South Wales | 5 | 3 | 0.503120 |
2015 deaths | 8 | 2 | 0.503074 |
Pittsburgh | 3 | 3 | 0.503070 |
21st-century | 7 | 4 | 0.499362 |
(NSW | 4 | 9 | 0.497107 |
age of the United Kingdom | 6 | 3 | 0.487303 |
Presidential | 3 | 2 | 0.485317 |
Landmark | 3 | 2 | 0.484965 |
Alistair | 4 | 2 | 0.484930 |
Tauri | 3 | 8 | 0.482449 |
2 km | 4 | 2 | 0.479984 |
20th-century | 7 | 3 | 0.475703 |
East Bay | 3 | 2 | 0.475156 |
game goes in extra time, if the scored | 10 | 2 | 0.472323 |
São Paulo | 3 | 2 | 0.470874 |
Atlantic City | 3 | 2 | 0.470726 |
Chaluk | 3 | 2 | 0.467165 |
Frank Lloyd | 3 | 2 | 0.462585 |
may refer to: | 6 | 4 | 0.462234 |
gold medalists | 4 | 2 | 0.458494 |
, 2nd Baron | 6 | 2 | 0.456996 |
people) | 4 | 4 | 0.454926 |
series aired | 4 | 2 | 0.453057 |
Srib | 3 | 2 | 0.451708 |
with blackish | 4 | 2 | 0.450033 |
World Cup players | 4 | 2 | 0.448979 |
main role | 3 | 2 | 0.448569 |
Bos | 4 | 2 | 0.448425 |
Asenath | 4 | 2 | 0.448259 |
Royal Navy | 3 | 3 | 0.445617 |
2. Bundesliga players | 7 | 2 | 0.445210 |
External links | 3 | 69 | 0.444921 |
an unincorpor | 6 | 2 | 0.443527 |
Gast | 2 | 4 | 0.437695 |
Pfor | 3 | 2 | 0.432194 |
Elisio de Med | 5 | 2 | 0.431518 |
" (2007) "Jad | 12 | 2 | 0.429412 |
Elkh | 3 | 2 | 0.428984 |
Früh | 3 | 2 | 0.427781 |
order of the NK | 5 | 2 | 0.424037 |
Token Sequence | $n$ | ct | $\psi$ |
---|---|---|---|
lower case | 3 | 2 | 0.736012 |
storm | 2 | 4 | 0.716379 |
excursion | 4 | 2 | 0.713134 |
====… (72 ‘equals’ signs) | 8 | 2 | 0.712982 |
Mom | 3 | 2 | 0.706778 |
acre | 3 | 2 | 0.629213 |
Subject | 3 | 2 | 0.607172 |
ninth | 3 | 2 | 0.606669 |
processing elements | 3 | 2 | 0.599549 |
CVC | 3 | 2 | 0.596735 |
VPN | 3 | 3 | 0.596052 |
Regul | 3 | 2 | 0.591968 |
bore | 2 | 2 | 0.590212 |
$\dot{G | 5 | 2 | 0.589714 |
Rates | 3 | 2 | 0.589637 |
INSURANCE | 5 | 2 | 0.584323 |
Commercial | 4 | 2 | 0.581543 |
Barney | 3 | 3 | 0.574872 |
PTA | 3 | 2 | 0.571932 |
penetrated | 4 | 2 | 0.570164 |
MG | 3 | 2 | 0.569830 |
Leigh | 3 | 2 | 0.567894 |
jail | 3 | 3 | 0.567225 |
TNS | 3 | 2 | 0.567003 |
peptides | 4 | 2 | 0.565775 |
John Arena | 3 | 2 | 0.565648 |
Disease | 4 | 2 | 0.564662 |
welfare | 4 | 4 | 0.564364 |
wild type | 3 | 2 | 0.560699 |
uws | 3 | 3 | 0.557799 |
ongrel | 4 | 3 | 0.554208 |
liquid cry | 3 | 3 | 0.553408 |
princess | 3 | 2 | 0.551672 |
Denmark | 3 | 2 | 0.548702 |
birthday | 3 | 2 | 0.548504 |
atedmes | 4 | 2 | 0.548171 |
"ENOENT | 5 | 2 | 0.547169 |
third-party | 4 | 2 | 0.546949 |
aliens | 3 | 2 | 0.546507 |
Durban | 3 | 4 | 0.545848 |
Bouncy | 4 | 3 | 0.545826 |
CHO | 3 | 2 | 0.542762 |
unjust | 3 | 2 | 0.538813 |
these motivational | 4 | 3 | 0.537485 |
DLS | 3 | 4 | 0.535933 |
\n& | 3 | 2 | 0.534510 |
uneven | 3 | 2 | 0.533137 |
watt | 3 | 2 | 0.532243 |
’She | 3 | 2 | 0.531300 |
HP | 3 | 3 | 0.529555 |
Token Sequence | $n$ | ct | $\psi$ |
---|---|---|---|
</td>\n<td> | 9 | 2 | 0.627583 |
{d}x | 5 | 3 | 0.599395 |
\n | 4 | 3 | 0.587016 |
_{n=1}{̂\in | 7 | 4 | 0.585434 |
</td>\n<td | 8 | 2 | 0.573310 |
-2-2007-061 | 12 | 3 | 0.551581 |
reticulum | 4 | 3 | 0.549337 |
INSURANCE | 5 | 2 | 0.548263 |
32;\n internal static | 8 | 2 | 0.547893 |
;\n internal static | 6 | 9 | 0.540374 |
: At | 4 | 2 | 0.538609 |
(2,9,’ | 6 | 4 | 0.537495 |
Respondent | 4 | 2 | 0.534509 |
\t\t}\n\n\t | 7 | 3 | 0.530669 |
(3,0,’ | 6 | 4 | 0.529493 |
_{n-1}\ar | 7 | 2 | 0.527303 |
thank you for your understanding | 6 | 2 | 0.513979 |
hydroxyl | 4 | 2 | 0.510059 |
>\n*/\private $ | 9 | 2 | 0.510054 |
in mukaan | 5 | 2 | 0.506333 |
{w}{̂B}_{ | 6 | 2 | 0.505970 |
/2\Z | 5 | 2 | 0.501998 |
’); \nINSERT INTO | 6 | 10 | 0.501055 |
7-f131 | 7 | 2 | 0.496881 |
0, 1L> | 8 | 2 | 0.495809 |
/0 S | 5 | 2 | 0.492042 |
5 Audi | 4 | 2 | 0.491043 |
all that apply | 4 | 3 | 0.490469 |
": true,\n | 6 | 2 | 0.486807 |
4,\n | 5 | 2 | 0.485315 |
to as DSP | 5 | 2 | 0.484967 |
*B**]{}\ | 6 | 2 | 0.483484 |
;\ninternal | 5 | 3 | 0.479777 |
100% used | 6 | 2 | 0.475673 |
", "x": | 5 | 3 | 0.474701 |
2.7 | 4 | 2 | 0.473720 |
</td>\n | 6 | 2 | 0.473578 |
" code=" | 4 | 4 | 0.473514 |
e2d-d | 6 | 2 | 0.473418 |
is under conversion | 4 | 5 | 0.473355 |
{ int|sys | 5 | 3 | 0.471213 |
();\n}\n\nprivate boolean isAny | 12 | 2 | 0.470941 |
(2,8,’ | 6 | 4 | 0.470214 |
trachea | 4 | 2 | 0.469154 |
use in an automobile | 6 | 2 | 0.467788 |
at org.apache.c | 7 | 5 | 0.467637 |
world around us | 4 | 2 | 0.464469 |
2\left(1+x | 8 | 2 | 0.463555 |
or Commodore | 5 | 3 | 0.463106 |
11-117 | 7 | 2 | 0.459824 |