On Implications of Scaling Laws on Feature Superposition

Pavan Katta*
Abstract

Using results from scaling laws, this theoretical note argues that the following two statements cannot be simultaneously true:

  1. The superposition hypothesis, where sparse features are linearly represented across a layer, is a complete theory of feature representation.

  2. Features are universal, meaning two models trained on the same data and achieving equal performance will learn identical features.

*Correspondence to [email protected]

1 Introduction

Scaling laws for language models give us a relation between a model’s macroscopic properties: the cross-entropy loss $L$, the amount of data $D$ used, and the number of non-embedding parameters $N$ in the model (Kaplan et al., 2020).

$$L(N,D)=\left[\left(\frac{N_{c}}{N}\right)^{\frac{\alpha_{N}}{\alpha_{D}}}+\frac{D_{c}}{D}\right]^{\alpha_{D}} \tag{1}$$

where $N_{c}$, $D_{c}$, $\alpha_{N}$, and $\alpha_{D}$ are constants for a given task such as language modeling.
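As a rough numerical illustration, the joint form given in Kaplan et al. (2020) can be sketched as below. The function name `kaplan_loss` and all constant values are assumptions made for this example; the constants are placeholders, not fitted values.

```python
def kaplan_loss(N, D, N_c, D_c, alpha_N, alpha_D):
    """Joint scaling law L(N, D) in the form of Kaplan et al. (2020)."""
    model_term = (N_c / N) ** (alpha_N / alpha_D)  # limited by parameters
    data_term = D_c / D                            # limited by data
    return (model_term + data_term) ** alpha_D


# Placeholder constants for illustration only (not fitted values).
CONSTS = dict(N_c=1e14, D_c=1e13, alpha_N=0.08, alpha_D=0.10)

small = kaplan_loss(N=1e8, D=1e10, **CONSTS)
large = kaplan_loss(N=1e9, D=1e10, **CONSTS)
print(f"L(N=1e8) = {small:.3f}, L(N=1e9) = {large:.3f}")
```

With effectively unlimited data the data term vanishes and the expression reduces to $(N_{c}/N)^{\alpha_{N}}$, so the loss is then governed by the parameter count alone.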

The scaling laws are not mere empirical observations and can be seen as predictive laws on the limits of language model performance. For GPT-4, OpenAI was able to use scaling laws to predict the final loss with high accuracy early in the training process (OpenAI, 2023).

An important detail is that the relation is expressed in terms of the number of parameters. It’s natural to think of a model’s computational capacity in terms of parameters, as they are the fundamental independent variables that the model can tune during learning. The amount of computation that a model performs in FLOPs for each input is also estimated to be approximately $2N$ (Kaplan et al., 2020).

Let’s compare this with interpretability, where the representation of a feature is defined in terms of neurons or groups of neurons. At first glance, it might seem unnecessary to distinguish between computational capacity and feature representational capacity, as parameters are connections between neurons after all. However, we can change the number of neurons in a model while keeping the number of parameters constant. Kaplan et al. found that Transformer performance depends very weakly on the shape parameters $n_{layer}$ (number of layers), $n_{heads}$ (number of attention heads), and $d_{ff}$ (feedforward layer dimension) when we hold the total non-embedding parameter count $N$ fixed (Kaplan et al., 2020). The paper reports that the aspect ratio (the ratio of the number of neurons per layer to the number of layers) can vary by more than an order of magnitude, with performance changing by less than 1%.

In this paper, we assume the above to be true: the number of parameters is the true limiting factor, and similar model performance can be achieved over a range of aspect ratios. We then apply this as a postulate to the superposition hypothesis, currently our best and most successful theory of feature representation, and explore the implications.

The superposition hypothesis states that models can pack more features than the number of neurons they have (Elhage et al., 2022). There will be interference between the features as they can’t be represented orthogonally, but when the features are sparse enough, the benefit of representing a feature outweighs the cost of interference. Concretely, given a layer of activations of $m$ neurons, we can decompose it linearly into activations of $n$ features, where $n>m$, as:

$$\mathrm{activation}_{layer}=x_{f_{1}}W_{f_{1}}+x_{f_{2}}W_{f_{2}}+\cdots+x_{f_{n}}W_{f_{n}} \tag{2}$$

where $\mathrm{activation}_{layer}$ and $W_{f_{i}}$ are vectors of size $m$, and $x_{f_{i}}$ represents the magnitude of activation of the $i$-th feature. Sparsity means that for a given input, only a small fraction of features are active, which means $x_{f_{i}}$ is non-zero for only a few values of $i$.
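The decomposition in Eq. (2) can be illustrated with a small toy sketch. Everything here is an assumption for illustration: the sizes `m`, `n`, `k`, the use of random unit vectors as feature directions, and the helper names `layer_activation` and `readout`. It is not how a trained model chooses its directions.

```python
import math
import random


def random_unit_vector(dim, rng):
    """A random direction in R^dim, normalised to length 1."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def layer_activation(x, W):
    """Eq. (2): sum over features of x_i * W_i, each W_i a vector of size m."""
    m = len(W[0])
    act = [0.0] * m
    for x_i, W_i in zip(x, W):
        if x_i != 0.0:  # sparse input: most features contribute nothing
            for j in range(m):
                act[j] += x_i * W_i[j]
    return act


def readout(act, W_i):
    """Project the layer activation onto one feature direction."""
    return sum(a * w for a, w in zip(act, W_i))


rng = random.Random(0)
m, n, k = 20, 50, 2  # hypothetical: 20 neurons, 50 features, 2 active
W = [random_unit_vector(m, rng) for _ in range(n)]

x = [0.0] * n
for i in rng.sample(range(n), k):
    x[i] = 1.0

act = layer_activation(x, W)
print("activation size:", len(act))
print("readouts of active features:",
      [round(readout(act, W[i]), 3) for i in range(n) if x[i] != 0.0])
```

Because the 50 directions cannot all be orthogonal in 20 dimensions, the readout of an active feature generally deviates from its true value of 1 by the interference from the other active feature; with fewer active features that deviation shrinks.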

2 Case study on changing Aspect Ratio

Table 1: Macroscopic Properties of Transformer Models with Different Aspect Ratios but Equal Parameters

| Property | Model A | Model B |
|---|---|---|
| Total Parameters | $d_{model}^{2}n_{layer}$ | $d_{model}^{2}n_{layer}$ |
| Neurons per Layer | $d_{model}$ | $2d_{model}$ |
| Number of Layers | $n_{layer}$ | $\frac{n_{layer}}{4}$ |
| Total Number of Neurons | $d_{model}n_{layer}$ | $\frac{d_{model}n_{layer}}{2}$ |
| Total Number of Features Learned | $F$ | $F$ |
| Number of Features per Layer | $\frac{F}{n_{layer}}$ | $\frac{4F}{n_{layer}}$ |
| Features per Neuron | $\frac{F}{d_{model}n_{layer}}$ | $\frac{2F}{d_{model}n_{layer}}$ |

Let’s consider two Transformer models, Model A and Model B, having the same macroscopic properties. Both have an equal number of non-embedding parameters, are trained on the same dataset, and achieve similar loss according to scaling laws (Kaplan et al., 2020). However, their shape parameters differ. Using the same notation as Kaplan et al., let’s denote the number of layers as $n_{layer}$ and the number of neurons per layer as $d_{model}$. Model B has twice the number of neurons per layer compared to A. As the number of parameters is approximately proportional to $d_{model}^{2}n_{layer}$, Model B must have $\frac{1}{4}$ the number of layers to maintain the same number of parameters as Model A. This means Model B has 8 times the aspect ratio ($\frac{d_{model}}{n_{layer}}$) of A, which falls within the range reported in Kaplan et al.

The total number of neurons in a model is calculated by multiplying the number of neurons per layer by the number of layers. As a result, Model B has half the total number of neurons compared to Model A.

Now, let’s apply the superposition hypothesis, which states that features can be linearly represented in each layer. Since both models achieve equal loss on the same dataset, it’s reasonable to assume that they have learned the same features (Olah et al., 2020). Let’s denote the total number of features learned by both models as $F$.

The above three paragraphs are summarized in Table 1.

The average number of features per neuron is calculated by dividing the number of features per layer by the number of neurons per layer. In Model B, this value is twice as high as in Model A, which means that Model B is effectively compressing twice as many features per neuron; in other words, it has a higher degree of superposition. However, superposition comes with a cost of interference between features, and a higher degree of superposition requires more sparsity.
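The ratios in Table 1 can be checked with concrete numbers. The sketch below uses hypothetical sizes (`d_model=512`, `n_layer=32`, one million features) chosen only to make the arithmetic visible; the class name `Shape` is likewise an assumption of this example.

```python
from dataclasses import dataclass


@dataclass
class Shape:
    d_model: int  # neurons per layer
    n_layer: int  # number of layers

    @property
    def params(self):
        # Parameter count up to a constant factor, as in the text.
        return self.d_model ** 2 * self.n_layer

    @property
    def neurons(self):
        return self.d_model * self.n_layer

    @property
    def aspect_ratio(self):
        return self.d_model / self.n_layer

    def features_per_layer(self, F):
        return F / self.n_layer

    def features_per_neuron(self, F):
        return F / self.neurons


# Hypothetical sizes chosen only to make the ratios concrete.
A = Shape(d_model=512, n_layer=32)
B = Shape(d_model=2 * A.d_model, n_layer=A.n_layer // 4)
F = 1_000_000  # assumed total number of features, shared by both models

for name, s in (("A", A), ("B", B)):
    print(name, s.params, s.neurons, s.aspect_ratio,
          s.features_per_layer(F), s.features_per_neuron(F))
```

Both shapes have the same parameter count, while Model B has half the neurons, four times the features per layer, and twice the features per neuron.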

Elhage et al. (2022) show that, using lower bounds from compressed sensing (Ba et al., 2010), if we want to recover $n$ features compressed in $m$ neurons (where $n>m$), the bound is

$$m=\Omega(-n(1-S)\log(1-S)), \tag{3}$$

where $S$ is the sparsity of the features, so $1-S$ is the fraction of inputs on which a feature is non-zero. For example, if a feature is non-zero only 1 in 100 times, then $1-S$ equals 0.01. Treating the bound as tight up to a constant factor, we can define the degree of superposition as

$$\frac{n}{m}=\frac{-1}{(1-S)\log(1-S)} \tag{4}$$

which is a function of sparsity alone, in line with our theoretical understanding.

So Model B, with a higher degree of superposition, should have sparser features compared to Model A. But the sparsity of a feature is a property of the data itself, and the same feature can’t be sparser in Model B if both models are trained on the same data. This might suggest that they are not the same features, which breaks our initial assumption of two models learning the same features. So either our starting assumption of feature representation through superposition or feature universality needs revision. In the next section, we discuss how we might modify our assumptions.
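To get a feel for how much sparser Model B’s features would need to be, the sketch below evaluates the degree of superposition from Eq. (4) and solves for the density $1-S$ that doubles it. Several assumptions are made here: the bound is treated as an equality with its hidden constant set to 1, the logarithm is natural, the sign is chosen so that the ratio is positive (since $\log(1-S)<0$), and the starting density of 0.01 is hypothetical.

```python
import math


def degree_of_superposition(density):
    """n/m from Eq. (4), with density = 1 - S and the hidden constant set to 1."""
    return -1.0 / (density * math.log(density))


def density_for_degree(target, hi, lo=1e-12, iters=200):
    """Bisection for the density in (lo, hi) whose degree equals `target`.

    Valid for hi < 1/e, where the degree decreases as density grows.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if degree_of_superposition(mid) > target:
            lo = mid  # too sparse for the target, move towards denser features
        else:
            hi = mid
    return 0.5 * (lo + hi)


density_A = 0.01  # hypothetical: each feature is active 1 in 100 times
degree_A = degree_of_superposition(density_A)
density_B = density_for_degree(2 * degree_A, hi=density_A)

print(f"Model A: density {density_A}, degree {degree_A:.2f}")
print(f"Model B: density {density_B:.5f} needed for degree {2 * degree_A:.2f}")
```

Under these assumptions, doubling the degree of superposition requires features that are active less than half as often, which the same data cannot provide.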

3 Discussion

To recap, we started with the postulate that model performance is invariant over a wide range of aspect ratios and arrived at the inconsistency between superposition and feature universality. Though we framed the argument through the lens of superposition, the core issue is that the model’s computational capacity is a function of parameters whereas the model’s representational capacity is a function of total neurons.

A useful, though non-rigorous, analogy is to visualize a solid cylinder of radius $d_{model}$ and height $n_{layer}$. The volume (parameters) of the cylinder can be thought of as computational capacity whereas features are represented on the surface (neurons). We can change the aspect ratio of the cylinder while keeping the volume constant by stretching or squashing it. This changes the surface area accordingly. Though this analogy doesn’t include sparsity, it captures the essentials of the argument in a simple way.
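The geometry can be worked through in a few lines, under one added assumption: "surface" is read as the lateral surface only (end caps ignored), which is what makes it scale like the neuron count. The symbols $V$ and $A$ are introduced here for illustration.

```latex
\begin{align*}
V &= \pi\, d_{model}^{2}\, n_{layer} \;\propto\; \text{parameters}, &
A &= 2\pi\, d_{model}\, n_{layer} \;\propto\; \text{neurons}. \\
\intertext{Doubling the radius and quartering the height (Model A to Model B):}
V' &= \pi\,(2 d_{model})^{2}\,\frac{n_{layer}}{4} = V, &
A' &= 2\pi\,(2 d_{model})\,\frac{n_{layer}}{4} = \frac{A}{2}.
\end{align*}
```

The volume is unchanged while the lateral surface halves, mirroring equal parameters and half the neurons in Table 1.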

Figure 1: A visual representation of the cylinder analogy, showing the relationship between the number of neurons per layer and the number of layers. The volume of the cylinder represents the total number of parameters, while the surface area corresponds to the total number of neurons available for feature representation.

Coming to solutions, I do not have one that’s consistent with scaling laws, the superposition hypothesis, and feature universality, but I will speculate on what a possible one might look like.

3.1 Schemes of Compression Alternative to Superposition

A crude and simple way to convert the total number of features into a function of parameters is to add a square term to the compressed sensing bounds so that it becomes $n=m^{2}\cdot f(1-S)$. But this would require a completely new compression scheme compared to superposition. Methods such as dictionary learning, which disentangle features assuming the superposition hypothesis, have been successful at extracting interpretable features (Bricken et al., 2023). So it’s not ideal to ignore it; representation schemes whose first-order approximation looks like superposition might be more viable.
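As a quick consistency check, not a proposed theory, one can apply the hypothetical per-layer rule $n=m^{2}\cdot f(1-S)$ to the two models of Table 1, assuming the same $f(1-S)$ in both because the data is the same. $F_A$ and $F_B$ are names introduced here for the total feature counts.

```latex
\begin{align*}
F_A &= n_{layer} \cdot d_{model}^{2}\, f(1-S), \\
F_B &= \frac{n_{layer}}{4} \cdot (2 d_{model})^{2}\, f(1-S)
     = n_{layer} \cdot d_{model}^{2}\, f(1-S) = F_A .
\end{align*}
```

With a quadratic dependence on layer width, both shapes could hold the same number of features at the same sparsity, which the linear bound cannot deliver.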

This isn’t to say there’s nothing we can improve on in the superposition hypothesis. Although the dictionary learning features in Bricken et al. are much more monosemantic than individual neurons, the lower activation levels of these features still look quite polysemantic.

3.2 Cross Layer Superposition

Previously, we looked for features in single neurons (Radford et al., 2017); now we have extended the search to groups of neurons in a layer. A natural progression is to look for features localized to neurons across multiple layers. But Model B from the above section has half as many neurons as A, and the same inconsistencies would arise if the number of features grows linearly with the number of neurons. The number of features represented across two or more layers by cross-layer superposition should grow superlinearly if Model B were to compensate for fewer neurons and still have the same representational capacity.

Acknowledgements

I’m thankful to Jeffrey Wu and Tom McGrath for their helpful feedback on this topic. Thanks to Vinay Bantupalli for providing feedback on the draft. An earlier version of this work was supported by an Open Philanthropy research grant.

References

  • Ba et al. (2010) Khanh Do Ba, Piotr Indyk, Eric Price, and David P Woodruff. Lower bounds for sparse recovery. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pages 1190–1197, 2010.
  • Bricken et al. (2023) Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023.
  • Elhage et al. (2022) Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022.
  • Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  • Olah et al. (2020) Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001.
  • OpenAI (2023) OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Radford et al. (2017) Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444, 2017.