Transcendence: Generative Models Can Outperform
The Experts That Train Them

Edwin Zhang
Harvard University
Founding

Vincent Zhu
UC Santa Barbara
Founding

Naomi Saphra
Harvard University
Kempner Institute

Anat Kleiman
Harvard University
Apple

Benjamin L. Edelman
Princeton University
Harvard University

Milind Tambe
Harvard University

Sham Kakade
Harvard University
Kempner Institute

Eran Malach
Harvard University
Kempner Institute

Abstract

Generative models are trained with the simple objective of imitating the conditional probability distribution induced by the data they are trained on. Therefore, when trained on data generated by humans, we may not expect the artificial model to outperform the humans on their original objectives. In this work, we study the phenomenon of transcendence: when a generative model achieves capabilities that surpass the abilities of the experts generating its data. We demonstrate transcendence by training an autoregressive transformer to play chess from game transcripts, and show that the trained model can sometimes achieve better performance than all players in the dataset. [Footnote 1: To play with our models, code, and data, please see our website at https://transcendence.eddie.win.] We theoretically prove that transcendence can be enabled by low-temperature sampling, and rigorously assess this claim experimentally. Finally, we discuss other sources of transcendence, laying the groundwork for future investigation of this phenomenon in a broader setting.

Figure 1: Ratings of our autoregressive decoder-only transformer, ChessFormer, over several different temperatures. We refer to our models as “ChessFormer <Maximum Glicko-2 rating seen during training>” to easily distinguish between different models in subsequent sections. Each model is trained only on games with players up to a certain rating (1000, 1300, and 1500, respectively). We report 95% confidence intervals calculated by taking $\pm 1.96\sigma$.

1 Introduction

Generative models (GMs) are typically trained to mimic human behavior. These humans may be skilled in their various human objectives: answering a question, creating art, singing a song. The model has only one objective: minimizing the cross-entropy loss with respect to the output distribution, thereby adjusting it to match the distribution of human labels. [Footnote 2: Although chatbots are subject to a variety of post-training tuning methods, e.g., RLHF, we restrict our scope by assuming that the specialized knowledge and capacities are already provided by cross-entropy loss.] Therefore, one might assume the model can, at best, match the performance of an expert on their human objectives. Is it possible for these models to surpass, or transcend, their expert sources in some domains?

We illustrate an example of such transcendence in Figure 1, which measures the chess ratings (Glicko-2 [7]) of several transformer [35] models. Our experimental testbed is generative modeling on chess, which we choose as a domain for its well-understood, constrained nature. The transformer models are trained on public datasets of human chess transcripts, autoregressively predicting the next move in the game. To test for transcendence, we limit the maximal rating of the human players in the dataset below a specified score. We find that ChessFormer 1000 and ChessFormer 1300 (the number being the maximum rating seen during training) achieve significant levels of transcendence, surpassing the maximal rating seen in the dataset. Our focus is this capacity of a GM to transcend its expert sources by broadly outperforming any one expert. The key to our findings is the observation that GMs implicitly perform majority voting over the human experts. As these models are trained on a collection of many experts with diverse capacities, predilections, and biases, this majority vote oftentimes outperforms any individual expert, a phenomenon known as “wisdom of the crowd”.

Our objective is to formalize the notion of transcendence and focus narrowly on this source of improvement over the experts: the removal of diverse human biases and errors. We prove that this form of denoising is enabled by low-temperature sampling, which implicitly induces a majority vote. Our result draws a subtle but deep connection from our new setting to a rich prior literature on model ensembling [1, 6, 19], enabling several key results. We precisely characterize the conditions under which transcendence is possible, and give a rigorous theoretical framework for enabling future study into the phenomenon. To test the predictive power of our theory, we then empirically demonstrate these effects. Digging deeper into the effects of majority voting, we show that its advantage is primarily due to performing much better on a small subset of states—that is, under conditions that are likely key to determining the outcome of the game. We also find that diversity in the data is a necessary condition for practically effective majority voting, confirming our theoretical findings. In short:

  • We formalize the notion of transcendence in generative models (Section 2).

  • We find a key insight explaining one cause of transcendence by connecting the case of denoising experts to model ensembling. In low temperature sampling settings, we prove that a generative model can transcend if trained on a single expert that makes mistakes uniformly at random. We then extend this result to transcending a collection of experts that are each skilled in different domains (Section 3).

  • We train a chess transformer on game transcripts that only include players up to a particular skill level. We confirm our theoretical prediction that this model only surpasses the maximum rating of its expert data generators at low temperature settings (Section 4).

  • We visualize the distribution of changes in reward by setting a lower sampling temperature, attributing the increased performance to large improvements on a relatively small portion of states (Section 4.2).

  • We explore the necessity of dataset diversity, and the inability of ChessFormer to transcend when trained on less diverse datasets (Section 4.2).

2 Definition of Transcendence

Denote by $\mathcal{X}$ the (variable-length) input space and by $\mathcal{Y}$ the (finite) output space. Let $\mathcal{F}$ be the class of all functions mapping $\mathcal{X} \mapsto P(\mathcal{Y})$ (where we use the notation $P(\mathcal{Y})$ to denote probability distributions over $\mathcal{Y}$). That is, the functions in $\mathcal{F}$ map inputs in $\mathcal{X}$ to probability distributions over $\mathcal{Y}$, so each function $f \in \mathcal{F}$ defines a conditional probability distribution of $y \in \mathcal{Y}$ given $x \in \mathcal{X}$. We denote this distribution by $f(y|x)$.

Fix some input distribution $p$ over $\mathcal{X}$ such that $p$ has full support (namely, for every $x \in \mathcal{X}$ we have $p(x) > 0$). Throughout the paper, we assume that our data is labeled by $k$ experts, denoted $f_1, \dots, f_k \in \mathcal{F}$. Namely, we assume that the inputs are sampled from the input distribution $p$ and then each input $x \in \mathcal{X}$ is labeled by some expert chosen uniformly at random. [Footnote 3: Equivalently, we can assume that each example is labeled by all experts.] This process induces a joint probability distribution over $\mathcal{X} \times \mathcal{Y}$, which we denote by $\mathrm{D}$. Specifically, $\mathrm{D}(x,y) = p(x)\overline{f}(y|x)$ where $\overline{f}$ is the mixture of the expert distributions, namely

$$\overline{f}(y|x) = \frac{1}{k}\sum_{i=1}^{k} f_i(y|x) \tag{1}$$

We measure the quality of some prediction function $f \in \mathcal{F}$ using a reward assigned to each input-output pair. Namely, we define a reward function $r : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, s.t. for all $x$, the function $r(x, \cdot)$ is not constant (i.e., for every input $x$ not all outputs have the same reward). We choose some test distribution $p_{\mathrm{test}}$ over $\mathcal{X}$, and for some $f \in \mathcal{F}$ define the average reward of $f$ over $p_{\mathrm{test}}$ by:

$$R_{p_{\mathrm{test}}}(f) = \mathbb{E}_{x \sim p_{\mathrm{test}}}\left[r_x(f)\right], \quad \mathrm{where} \quad r_x(f) = \mathbb{E}_{y \sim f(\cdot|x)}\left[r(x,y)\right] \tag{2}$$

A learner has access to the distribution $\mathrm{D}$, and needs to find a function that minimizes the cross-entropy loss over $\mathrm{D}$. Namely, the learner chooses some function $\hat{f} \in \mathcal{F}$ s.t. $\hat{f} = \arg\min_{f \in \mathcal{F}} \mathbb{E}_{x \sim p}\left[H(\overline{f}, f)\right]$ where $H$ is the cross-entropy function.

Definition 1.

We define “transcendence” to be a setting of $f_1, \dots, f_k \in \mathcal{F}$ and $p \in P(\mathcal{X})$ where:

$$R_{p_{\mathrm{test}}}(\hat{f}) > \max_{i \in [k]} R_{p_{\mathrm{test}}}(f_i) \tag{3}$$

In other words, transcendence describes cases where the learned predictor performs better (achieves better reward) than the best expert generating the data. Note that we are focusing on an idealized setting, where the learner has access to an infinite amount of data from the distribution $\mathrm{D}$, and can arbitrarily choose any function to fit the distribution (not limited to a particular choice of architecture or optimization constraints). As we will show, even in this idealized setting, transcendence can be impossible to achieve without further modifying the distribution.
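The definitions above can be made concrete with a small sketch. It assumes a finite toy input space and represents each expert as a function returning a probability list over outputs; the names `mixture`, `average_reward`, and `transcends`, and the toy experts and reward, are illustrative choices and not part of the paper.

```python
from typing import Callable, Dict, List

Dist = List[float]               # a probability distribution over a finite output space
Policy = Callable[[str], Dist]   # maps an input x to a distribution over outputs


def mixture(experts: List[Policy]) -> Policy:
    """Uniform mixture of the expert distributions, as in Eq. (1)."""
    def f_bar(x: str) -> Dist:
        dists = [f(x) for f in experts]
        return [sum(column) / len(dists) for column in zip(*dists)]
    return f_bar


def average_reward(f: Policy, p_test: Dict[str, float],
                   r: Callable[[str, int], float]) -> float:
    """Average reward of f over the test distribution, as in Eq. (2)."""
    return sum(
        px * sum(py * r(x, y) for y, py in enumerate(f(x)))
        for x, px in p_test.items()
    )


def transcends(candidate: Policy, experts: List[Policy],
               p_test: Dict[str, float],
               r: Callable[[str, int], float]) -> bool:
    """Check the inequality in Eq. (3) for a candidate predictor."""
    best_expert = max(average_reward(f, p_test, r) for f in experts)
    return average_reward(candidate, p_test, r) > best_expert


# Hypothetical toy problem: two inputs, two outputs, output 0 is always best.
p_test = {"a": 0.5, "b": 0.5}
reward = lambda x, y: 1.0 if y == 0 else 0.0
expert_1 = lambda x: [0.9, 0.1]
expert_2 = lambda x: [0.6, 0.4]
experts = [expert_1, expert_2]
f_bar = mixture(experts)
oracle = lambda x: [1.0, 0.0]

mixture_transcends = transcends(f_bar, experts, p_test, reward)
oracle_transcends = transcends(oracle, experts, p_test, reward)
```

In this toy problem the mixture scores between the two experts, so it does not satisfy Eq. (3), while a predictor that always picks the best output does.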

Remark 1.

We have made various simplifying assumptions when introducing our setting. For example, we assume that all experts share the same input distribution, we assume that all inputs have non-zero probability under the training distribution $p$, and we assume the experts are sampled uniformly at random. We leave a complete analysis of a more general setting to future work, and discuss this point further in Section 6.

3 Conditions for Transcendence

In this section we analyze the necessary and sufficient conditions for transcendence in our setting. We begin by showing that low-temperature sampling is necessary for transcendence in our specific setting. Then, we analyze specific sufficient conditions for transcendence, both in the case where the data is generated by a single expert and when the data is generated by multiple experts. We defer all proofs to Appendix A.

Figure 2: Visualizing the denoising effects of low temperature on the action distribution: an example of ChessFormer shifting probability mass towards the high-reward move of trapping the queen with the rook as the temperature $\tau$ decreases. The opacity of the red arrows represents the probability mass given to different moves. The color of a square represents the reward that would be given for taking the action that moves the given piece to that square. Purple here is high reward, while blue is low. For more visualizations, see Appendix B.

3.1 Low-Temperature Sampling is Necessary for Transcendence

Observe that by definition of $\hat{f}$, and using standard properties of the cross-entropy loss, we get that $\hat{f} = \overline{f}$, as defined in Eq. (1). Therefore, the conditional probability distribution generated by $\hat{f}$ is simply an average of the distributions generated by the experts. Since the reward is a linear function of these distributions, we get that $\hat{f}$ never achieves transcendence:

Theorem 1.

For every choice of $f_1, \dots, f_k$ and $p_{\mathrm{test}}$, there exists some $f_i$ s.t. $R_{p_{\mathrm{test}}}(f_i) \geq R_{p_{\mathrm{test}}}(\hat{f})$.

Note that in our setting, we assume that all experts are sampled uniformly for a given input $x$. If this assumption is removed, then it may be possible to achieve transcendence with a Bayesian weighting. We leave this analysis for future work.
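The linearity argument behind Theorem 1 can be written out in one chain. This is a sketch of the intuition stated in the text, using only Eqs. (1) and (2) and the identity $\hat{f} = \overline{f}$; the formal proof is the one in Appendix A.

```latex
\begin{align*}
R_{p_{\mathrm{test}}}(\hat{f})
  &= R_{p_{\mathrm{test}}}(\overline{f})
   = \mathbb{E}_{x \sim p_{\mathrm{test}}}\Big[\sum_{y \in \mathcal{Y}} \overline{f}(y|x)\, r(x,y)\Big] \\
  &= \frac{1}{k}\sum_{i=1}^{k} \mathbb{E}_{x \sim p_{\mathrm{test}}}\Big[\sum_{y \in \mathcal{Y}} f_i(y|x)\, r(x,y)\Big]
   = \frac{1}{k}\sum_{i=1}^{k} R_{p_{\mathrm{test}}}(f_i)
   \le \max_{i \in [k]} R_{p_{\mathrm{test}}}(f_i).
\end{align*}
```

An average of the expert rewards can never exceed the largest of them, so at least one expert does at least as well as $\hat{f}$.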

3.2 Transcendence with Low-Temperature Sampling

Now, we consider a temperature sampling scheme over the learned function $\hat{f}$. Namely, for some temperature $\tau > 0$, and some probability distribution $q \in P(\mathcal{Y})$, denote the softmax operator with temperature $\tau$ by $\mathrm{softmax}(q;\tau) \in P(\mathcal{Y})$ s.t. $\mathrm{softmax}(q;\tau)_y = \dfrac{\exp(q_y/\tau)}{\sum_{y' \in \mathcal{Y}} \exp(q_{y'}/\tau)}$.
Additionally, we define $\operatorname*{arg\,max}(q) \in P(\mathcal{Y})$ to be the uniform distribution over the maximal values of $q$, namely $\operatorname*{arg\,max}(q) = 1/\lvert Y_q \rvert$ if $y \in Y_q$ and 0 if $y \notin Y_q$, where $Y_q = \{y \in \mathcal{Y} : q_y = \max(q)\}$. Now, define $\hat{f}_\tau$ to be the temperature sampling of $\hat{f}$, i.e. $\hat{f}_\tau(\cdot|x) = \mathrm{softmax}(\hat{f}(\cdot|x);\tau)$ and $\hat{f}_{\max}$ the arg-max “sampling” of $\hat{f}$, i.e. $\hat{f}_{\max}(\cdot|x) = \operatorname*{arg\,max}(\hat{f}(\cdot|x))$.
We now show that if the arg-max predictor $\hat{f}_{\max}$ is better than the best expert, then transcendence is possible with low-temperature sampling.

Theorem 2.

$R_{p_{\mathrm{test}}}(\hat{f}_{\max}) > \max_{i \in [k]} R_{p_{\mathrm{test}}}(f_i)$ if and only if there exists some temperature $\tau \in (0,1)$ s.t. for all $0 \leq \tau' \leq \tau$, it holds that $R_{p_{\mathrm{test}}}(\hat{f}_{\tau'}) > \max_{i \in [k]} R_{p_{\mathrm{test}}}(f_i)$.

The above shows that, even though transcendence cannot be achieved when directly modeling the distribution, it can be achieved by temperature sampling, assuming that the arg-max predictor achieves higher reward compared to all experts. In other words, we make the subtle connection here that low-temperature sampling can be thought of as performing a majority vote [1, 6] between the experts. When the experts put non-negligible mass onto the best actions, the resulting majority vote may find the best action [9], which improves performance compared to individual experts (i.e., “wisdom of the crowd”) and thus achieves transcendence.
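A minimal sketch of the two operators defined in this section may help. Following the definition in the text, the softmax here is applied to a probability vector $q$ directly; the function names and the example vector are illustrative assumptions, not the paper's code.

```python
import math
from typing import List


def softmax_with_temperature(q: List[float], tau: float) -> List[float]:
    """softmax(q; tau)_y = exp(q_y / tau) / sum_y' exp(q_y' / tau)."""
    top = max(q)  # subtracting the max leaves the result unchanged and avoids overflow
    weights = [math.exp((q_y - top) / tau) for q_y in q]
    total = sum(weights)
    return [w / total for w in weights]


def argmax_distribution(q: List[float]) -> List[float]:
    """Uniform distribution over the maximal entries of q."""
    top = max(q)
    winners = [y for y, q_y in enumerate(q) if q_y == top]
    return [1.0 / len(winners) if y in winners else 0.0 for y in range(len(q))]


# Hypothetical mixture distribution over three moves.
q = [0.5, 0.3, 0.2]
warm = softmax_with_temperature(q, 1.0)     # still spread over all moves
cold = softmax_with_temperature(q, 0.001)   # almost all mass on the plurality move
limit = argmax_distribution(q)              # the tau -> 0 limit
```

As the temperature decreases, the sampled distribution concentrates on the output that the mixture favors most, which is the sense in which low-temperature sampling acts like a vote among the experts.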

3.3 Denoising a Single Expert

We now turn to study particular cases where low-temperature sampling can lead to transcendence. The simplest case is that of a single expert that outputs correct but noisy predictions. Denote by $f^*$ the optimal expert, s.t. for all $x$ we have $f^*(y|x) = \dfrac{\delta(y \in Y^*_x)}{\lvert Y^*_x \rvert}$, where $Y^*_x = \{y \in \mathcal{Y} : r(x,y) = \max_{y'} r(x,y')\}$ and $\delta(\text{condition})$ is 1 if the condition is true and 0 otherwise.
Now, for some $\rho \in (0,1)$, let $f_\rho$ be a “noisy” expert that, for all $x$, chooses a uniformly random output with probability $\rho$, and with probability $1-\rho$ chooses an output according to the optimal expert $f^*(\cdot|x)$, namely $f_\rho(y|x) = \rho/\lvert\mathcal{Y}\rvert + (1-\rho) f^*(y|x)$. We show that transcendence is achieved with low-temperature sampling for data generated by $f_\rho$:

Theorem 3.

Assume the data is generated by a single expert $f_\rho$. Then, there exists some temperature $\tau \in (0,1)$ s.t. for all $\tau' \leq \tau$, the predictor $\hat{f}_{\tau'}$ achieves “transcendence”.
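The single-expert case can be illustrated numerically. The sketch below assumes a toy problem with four outputs, a single optimal output, and a 0/1 reward; the helper names and the choice $\rho = 0.4$ are illustrative and do not come from the paper.

```python
import math
from typing import List


def noisy_expert(optimal: List[float], rho: float) -> List[float]:
    """f_rho(y|x) = rho / |Y| + (1 - rho) * f*(y|x) for one fixed input x."""
    n = len(optimal)
    return [rho / n + (1.0 - rho) * p for p in optimal]


def softmax_with_temperature(q: List[float], tau: float) -> List[float]:
    """Temperature softmax applied to a probability vector, as in Section 3.2."""
    top = max(q)
    weights = [math.exp((q_y - top) / tau) for q_y in q]
    total = sum(weights)
    return [w / total for w in weights]


def expected_reward(dist: List[float], rewards: List[float]) -> float:
    """Expected reward of sampling an output from dist."""
    return sum(p * r for p, r in zip(dist, rewards))


# Hypothetical toy input: output 0 is the only optimal output.
rewards = [1.0, 0.0, 0.0, 0.0]
optimal = [1.0, 0.0, 0.0, 0.0]

f_rho = noisy_expert(optimal, rho=0.4)            # the expert that generates the data
learned = f_rho                                   # with one expert, the learner recovers f_rho
cooled = softmax_with_temperature(learned, 0.05)  # low-temperature sampling

expert_reward = expected_reward(f_rho, rewards)
cooled_reward = expected_reward(cooled, rewards)
```

Because the uniform noise spreads evenly over all outputs, the optimal output remains the most likely one, and lowering the temperature moves the sampled distribution back toward it.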

3.4 Transcendence from Multiple Experts

Next, we consider the case where the dataset is generated by multiple experts that complement each other in terms of their ability to correctly predict the best output. For example, consider the case where the input space is partitioned into $k$ disjoint subsets, $\mathcal{X} = \mathcal{X}_1 \dot{\cup} \dots \dot{\cup} \mathcal{X}_k$, s.t. the $i$-th expert performs well on the subset $\mathcal{X}_i$, but behaves randomly on other subsets. Namely, assume the expert $f_i$ behaves as follows: $f_i(y|x) = \left(\frac{\delta(x \in \mathcal{X}_i,\, y \in Y_x^*)}{\lvert Y^*_x \rvert} + \frac{\delta(x \notin \mathcal{X}_i)}{\lvert \mathcal{Y} \rvert}\right)$ where $Y_x^*$ is as previously defined and $\delta(\text{condition})$ is 1 if the condition is true and 0 otherwise. We show that, assuming that the test distribution $p_{\mathrm{test}}$ is not concentrated on a single subset $\mathcal{X}_i$, we achieve transcendence with low-temperature sampling:

Theorem 4.

Let $p_{\mathrm{test}}$ be some distribution s.t. there are at least two subsets $\mathcal{X}_i \neq \mathcal{X}_j$ s.t. $p_{\mathrm{test}}(\mathcal{X}_i), p_{\mathrm{test}}(\mathcal{X}_j) > 0$. Then, if the data is generated by $f_1, \dots, f_k$, there exists some temperature $\tau \in (0,1)$ s.t. for all $\tau' \leq \tau$, the predictor $\hat{f}_{\tau'}$ achieves “transcendence”.

For intuition behind Theorem 4, see the diagram in Appendix C.
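The multiple-expert setting can also be checked on a toy instance. The sketch below assumes $k = 2$ experts, one input per subset, three outputs, and a 0/1 reward for the single best output; all names and values are illustrative and are not taken from the paper or its appendix.

```python
from typing import Callable, Dict, List

NUM_OUTPUTS = 3
# Hypothetical toy problem: input "x1" lies in subset 0, input "x2" in subset 1.
BEST_OUTPUT: Dict[str, int] = {"x1": 0, "x2": 2}
SUBSET_OF: Dict[str, int] = {"x1": 0, "x2": 1}
P_TEST: Dict[str, float] = {"x1": 0.5, "x2": 0.5}


def expert(i: int, x: str) -> List[float]:
    """Expert i is optimal on its own subset and uniformly random elsewhere."""
    if SUBSET_OF[x] == i:
        return [1.0 if y == BEST_OUTPUT[x] else 0.0 for y in range(NUM_OUTPUTS)]
    return [1.0 / NUM_OUTPUTS] * NUM_OUTPUTS


def mixture(x: str, k: int = 2) -> List[float]:
    """Uniform mixture of the k experts, as in Eq. (1)."""
    dists = [expert(i, x) for i in range(k)]
    return [sum(d[y] for d in dists) / k for y in range(NUM_OUTPUTS)]


def argmax_distribution(q: List[float]) -> List[float]:
    """Uniform distribution over the maximal entries of q (the tau -> 0 limit)."""
    top = max(q)
    winners = [y for y, q_y in enumerate(q) if q_y == top]
    return [1.0 / len(winners) if y in winners else 0.0 for y in range(len(q))]


def average_reward(policy: Callable[[str], List[float]]) -> float:
    """Reward is 1 for the best output and 0 otherwise, averaged over P_TEST."""
    return sum(px * policy(x)[BEST_OUTPUT[x]] for x, px in P_TEST.items())


expert_rewards = [average_reward(lambda x, i=i: expert(i, x)) for i in range(2)]
mixture_reward = average_reward(mixture)
argmax_reward = average_reward(lambda x: argmax_distribution(mixture(x)))
```

Each expert is only right by chance outside its own subset, and the plain mixture does no better than the experts, but on every input the mixture still places the most mass on the best output, so the arg-max predictor recovers it everywhere.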

4 Experiments

To evaluate the predictive power of our impossibility result of transcendence with no temperature sampling (Theorem 1) as well as our result of transcendence from multiple experts with low temperature sampling (Theorem 2), we turn to modeling and training chess players. Chess stands out as an attractive option for several reasons. Chess is a well-understood domain and more constrained than other settings such as natural language generation, which lends itself to easier and stronger analysis. Evaluation of skill in chess is also natural and well-studied, with several rigorous statistical rating systems available. In this paper, we use the Glicko-2 rating system [7], which is also adopted by https://lichess.org, the free and open-source online chess server from which we source our dataset.

4.1 Experimental Setup

Figure 3: Inspired by Mnih et al. [20], we generate a t-SNE embedding [34] of ChessFormer’s last hidden layer latent representations of game transcripts during training time. The colors represent the probability of winning, with $+1$ corresponding to a state where White has won and $0$ to a state where Black has won. The probability of winning is computed through the Stockfish analysis engine. We also visualize several board states associated with different clusters in the t-SNE embedding, and their associated expected reward when following the expert Stockfish distribution. Note that the model distinguishes between states where the outcome has already been determined (the two left boards), versus opening states that are extremely similar (the two right boards). See the full t-SNE in Appendix G.

Training Details.

We trained several 50M parameter autoregressive transformer decoders following best practices from modern large model training, including a cosine learning rate schedule and similar batch size-learning rate ratios as prescribed by the OPT-175B team [37]. Our dataset consists of human chess games from the lichess.org open source database from January 2023 to October 2023. In total, this dataset contains approximately one billion games. In this setting, an expert is a specific individual player. To test for transcendence, we truncate this dataset by a maximum rating, so that during training a model only sees data up to a given rating. We train our model on the next-token prediction objective, and represent our chess games as Portable Game Notation (PGN) strings, such as 1.e4 e5 2.Nf3 Nc6 3.Bb5... 1/2-1/2. Note that we do not give any rating or reward information during training: the only inputs the model sees are the moves and the outcome of the game. We tokenize our dataset at the character level, with a vocabulary of 32 symbols. (For further details, see Appendix E.) Our model plays chess “blind” (without direct access to the board state) and, furthermore, is never explicitly given the rules of the game: at no point is play constrained to valid outputs for a given piece or board state. Nontrivial chess skill is therefore not straightforward to acquire, and if not for the surprising capabilities of modern large transformers, one might imagine such a model would fail to learn even the basic rules of playing chess. This blindfolded setting has also been studied by prior work [23, 30], as discussed further in Section 5.
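As a rough illustration of character-level tokenization and the next-token objective on a PGN string, consider the sketch below. The vocabulary here is built from the example string alone and is only a stand-in: the paper's actual 32-symbol vocabulary is described in Appendix E and is not reproduced here, and the helper names are hypothetical.

```python
from typing import Dict, List

# Move prefix of the example PGN string given in the text.
pgn = "1.e4 e5 2.Nf3 Nc6 3.Bb5"

# Assumption: derive a character vocabulary from this one string for illustration.
vocab: List[str] = sorted(set(pgn))
char_to_id: Dict[str, int] = {ch: i for i, ch in enumerate(vocab)}
id_to_char: Dict[int, str] = {i: ch for ch, i in char_to_id.items()}


def encode(text: str) -> List[int]:
    """Map each character to its integer token id."""
    return [char_to_id[ch] for ch in text]


def decode(ids: List[int]) -> str:
    """Map token ids back to the original characters."""
    return "".join(id_to_char[i] for i in ids)


token_ids = encode(pgn)
# Next-token prediction: the target at each position is the following character.
inputs, targets = token_ids[:-1], token_ids[1:]
```

Under this representation the model never receives a board; it only sees the sequence of characters that make up the moves so far.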

One gap between our theory and practice is that in our theory, we assume that each expert is defined over the entire input space $\mathcal{X}$. However, in the chess setting such full coverage is extremely unlikely to be the case after around move 15, as there are more unique chess games than atoms in the universe due to the high branching factor of the game tree. To address this gap, we visualize the latent representation of our model in Figure 3, where we find the model is able to capture meaningful semantics regarding both the relative advantage of a state, as well as the identity of the black and white player. This visualization illustrates the ability of our model to generalize by compressing games into some shared latent representation, enabling experts to generalize to unseen states, bridging this gap between theory and practice.

Evaluation.

We evaluate each model by its Glicko-2 rating against Stockfish 16.1 [29], a popular open-source chess engine. Stockfish uses a traditional minimax search equipped with a bespoke CPU-efficient neural network for evaluation [22] and $\alpha$-$\beta$ pruning for further efficiency. We evaluate Stockfish at levels 1, 3, and 5 with a 100ms timeout directly on Lichess’ platform against the Maia [18] 1, 5, and 9 bots (human behavior cloned convolutional networks trained at rating bins 1100-1200, 1500-1600, and 1900-2000, respectively) for several hundred games, obtaining calibrated Glicko-2 ratings for Stockfish specifically on Lichess’ platform ($1552 \pm 45.2$, $1842 \pm 45.2$, and $2142 \pm 59$ for Stockfish levels 1, 3, and 5, respectively). To evaluate our own models, we then play against Stockfish levels 1, 3, and 5 for 100 games each, computing a final rating from all 300 games. We report both the Glicko-2 rating $R$ and the rating deviation $RD$ of our models, where $R \pm 2 \cdot RD$ provides a 95% confidence interval. To play against Stockfish, we successively prompt our model with the current game PGN string. Note that our output is entirely unconstrained, and may be either illegal in the current board state or altogether unparsable. If our model fails to generate a valid legal move after 5 samples, we consider it to have lost. After generation, we give the updated board state to Stockfish and pass a new PGN string, appended with Stockfish’s reply, back to our model. We repeat this process until the game ends.
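The retry rule and the reported interval can be sketched as follows. This is a minimal sketch under stated assumptions: `sample_move` and `is_legal` are hypothetical callables standing in for the model and for a chess rules library, and the numbers passed to `confidence_interval` are illustrative rather than results from the paper.

```python
def play_move(sample_move, is_legal, pgn, max_attempts=5):
    """Sample from the model until a legal move appears or attempts run out.

    sample_move(pgn) -> str and is_legal(pgn, move) -> bool are hypothetical
    stand-ins for the model and a rules checker. Returns the move, or None
    if every attempt was illegal or unparsable (the model then forfeits).
    """
    for _ in range(max_attempts):
        move = sample_move(pgn)
        if is_legal(pgn, move):
            return move
    return None


def confidence_interval(rating, rating_deviation):
    """Approximate 95% interval R +/- 2 * RD, as described in the text."""
    return rating - 2 * rating_deviation, rating + 2 * rating_deviation


# Stub model: two unusable outputs, then a legal move on the third attempt.
attempts = iter(["Zz9", "Ke9", "e4"])
move = play_move(lambda pgn: next(attempts), lambda pgn, m: m in {"e4", "d4"}, "1.")

# Stub model that never produces a legal move: the game counts as a loss.
forfeit = play_move(lambda pgn: "??", lambda pgn, m: False, "1.")

interval = confidence_interval(1500.0, 40.0)  # illustrative numbers only
```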

4.2 Experimental Results

Main Result: Low-temperature sampling enables transcendence.

In this section we attempt to answer our primary research question: can low-temperature sampling actually induce transcendence in practice? We test Theorem 2 by evaluating several ChessFormers across different temperature values, from $0.001$ (nearly deterministic), to $1.0$ (original distribution), to $1.5$ (high entropy). In Figure 1 we confirm the existence of transcendence. Our ChessFormer 1000 (where the latter number refers to the maximum rating seen during training) and ChessFormer 1300 models are able to transcend to a rating of around 1500 at temperature $\tau = 0.001$. Interestingly, ChessFormer 1500 is unable to transcend at test time, a result that we analyze further later in this subsection.
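To make the role of the temperature $\tau$ concrete, the sketch below applies a temperature-scaled softmax to a small vector of logits. The logit values are illustrative assumptions, not model outputs; the three temperatures are the ones named above.

```python
import math


def softmax_with_temperature(logits, tau):
    """Return softmax(logits / tau), computed in a numerically stable way."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max so the largest exponent is exp(0)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # illustrative values only
for tau in (1.5, 1.0, 0.001):
    probs = softmax_with_temperature(logits, tau)
    print(tau, [round(p, 3) for p in probs])
```

At $\tau = 1.0$ the model's learned distribution is sampled unchanged, at $\tau = 1.5$ it is flattened, and at $\tau = 0.001$ essentially all probability mass sits on the highest-scoring option, which approximates arg-max decoding.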

To more deeply understand when and why transcendence occurs, we investigate two questions. (1) How does the reward function defined in Equation 2 shift with respect to low-temperature sampling? (2) Does transcendence rely on dataset diversity, as introduced theoretically in subsection 3.4?

Lowering temperature increases rewards in expectation on specific states, leading to transcendence over the full game.

When playing chess, a low-skilled player may play reasonably well until they make a significant blunder at a key point in play. If these errors are idiosyncratic, averaging across many experts would have a denoising effect, leaving the best moves with higher probability. Therefore, low-temperature sampling would move probability mass towards better moves in specific play contexts. Without low-temperature sampling, the model would still put probability mass onto blunders. To gain intuition for this idea, we visualize it theoretically in Appendix C and empirically in Figure 2 and Appendix B. This hypothesis motivates our first research question in this section: Does low-temperature sampling improve the expected reward very much for just some specific key game states, or a little for many game states?
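The denoising intuition can be checked on a toy example. The sketch below is an illustration under strong assumptions (three hypothetical experts, three equally likely states, two moves, reward 1 for the good move and 0 for a blunder); it is not the chess setting itself.

```python
# experts[i][s] is the probability that expert i plays the good move in state s.
# Each hypothetical expert blunders in exactly one state, and no two experts
# blunder in the same state (the errors are idiosyncratic).
experts = [
    [0.0, 1.0, 1.0],  # expert 0 blunders only in state 0
    [1.0, 0.0, 1.0],  # expert 1 blunders only in state 1
    [1.0, 1.0, 0.0],  # expert 2 blunders only in state 2
]
num_states = 3

# Imitating pooled data: per state, the good-move probability is the average
# over experts (each expert contributes equally to every state).
pooled = [sum(e[s] for e in experts) / len(experts) for s in range(num_states)]


def expected_reward(p_good_per_state):
    """Average probability of the good move over equally likely states."""
    return sum(p_good_per_state) / len(p_good_per_state)


expert_rewards = [expected_reward(e) for e in experts]
reward_tau_1 = expected_reward(pooled)  # sampling the imitated distribution

# Low-temperature limit: always play the majority (highest-probability) move.
greedy = [1.0 if p > 0.5 else 0.0 for p in pooled]
reward_greedy = expected_reward(greedy)
```

In this toy case every expert, and the imitator sampled at $\tau = 1$, earns an expected reward of $2/3$, while the majority move is correct in every state, so the low-temperature limit earns a reward of $1$.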

To formalize this notion, we first define a “favor” function, which captures the improvement in reward by following some new probability distribution over some baseline probability distribution. Our definition is inspired by the Performance Difference Lemma (PDL) [10] from Reinforcement Learning (RL), which establishes an equivalence between the change in performance from following some new policy (a probability distribution of actions given a state) over some old policy, and the expected value of the advantage function of the old policy sampled with respect to the new policy. In RL, the advantage function is defined as the difference between the value of taking a single action in a given state versus the expected value of following some policy distribution of actions in that state.

Here, we define the “favor” of $f'$ over $f$ in $x$ as the change in reward obtained by following $f'$ rather than $f$ for a given input $x$:

$$F(f', f; x) = \mathbb{E}_{x \sim d^{f'},\, y \sim f'(\cdot|x)}[r(x,y)] - \mathbb{E}_{x \sim d^{f'},\, y \sim f(\cdot|x)}[r(x,y)]. \tag{4}$$

Here, $d^{f}$ refers to the state visitation distribution [31] when following $f$ in a sequential setting. Informally, this variable can be thought of as the distribution of states seen when sampling from $f$ with a fixed transition function that takes in an input $x$ and an output $y$, and outputs a next input $x$. Here, that transition function is given by the rules of chess and the opponent player. Given this favor function, we can now quantitatively explore the effects that lead to transcendence by setting the baseline $f$ to be the original imitation-learned probability distribution (temperature $\tau = 1$), and $f'$ as a low-temperature intervention on $f$ (e.g. temperature $\tau = 0$). We can empirically calculate the reward by using the evaluation function [22] of Stockfish, an expert neural reward function that Stockfish uses to calculate its next move. This reward function is a neural network trained to predict the probability of winning through a sigmoid on a linear combination of handcrafted expert heuristics, such as amount of material versus opponent material, and number of moves to a potential checkmate.
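One way to estimate favor from played games is a simple Monte Carlo procedure: for each state visited under the low-temperature policy, compare the reward of the move actually played with the mean reward of counterfactual moves sampled from the baseline. The sketch below assumes hypothetical callables `act_new`, `act_base`, and `reward` (in the paper's setting the reward would come from the Stockfish evaluation); the toy policies at the bottom are illustrative only.

```python
import random
from statistics import mean


def estimate_favor(states, act_new, act_base, reward, n_samples=100, rng=None):
    """Per-state Monte Carlo estimate of the favor of a new policy over a baseline.

    states: states visited while following the new (low-temperature) policy.
    act_new(state, rng), act_base(state, rng): sample a move from each policy.
    reward(state, move): scalar reward, e.g. an estimated win probability.
    All three callables are hypothetical stand-ins.
    """
    if rng is None:
        rng = random.Random(0)
    favors = []
    for x in states:
        new_reward = reward(x, act_new(x, rng))
        # Baseline expectation from counterfactual samples in the same state.
        base_reward = mean(reward(x, act_base(x, rng)) for _ in range(n_samples))
        favors.append(new_reward - base_reward)
    return favors


# Toy check: the new policy always plays the good move, while the baseline
# blunders half of the time.
states = list(range(5))
reward = lambda x, move: 1.0 if move == "good" else 0.0
act_new = lambda x, rng: "good"
act_base = lambda x, rng: rng.choice(["good", "blunder"])
favors = estimate_favor(states, act_new, act_base, reward)
```

Collecting the per-state values, rather than only their mean, is what allows the shape of the favor distribution to be inspected.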

Refer to caption
Figure 4: The favor probability distribution, or change in expected reward by setting temperature lower than $\tau = 1.0$. We plot the favor distribution across two different temperatures, $\tau = 0.75$ and $\tau = 0.001$, by running the Stockfish analysis engine across 100 total ChessFormer 1000 games played at temperature $0.001$ against Stockfish level 1 (as theoretically justified by PDL [10]). We calculate favor by sampling 100 counterfactual potential moves at $\tau = 1.0$ per actual move made at $\tau = 0.001$ to compute a baseline expected reward. In total, we gather an empirical probability distribution with $n = 382{,}000$ total samples per $\tau$ (38.2 moves on average per game). Note that we plot the distributions with transparency, so the brownish area is where the two overlap. We visualize several long-tail examples in Appendix B.

In Figure 4, we find that lowering the temperature has the effect of skewing the expected reward distribution to the left, especially for the green $\tau = 0.001$ distribution. This result implies that the model does not improve the expected reward by a small amount for many game states, but rather improves the expected reward by a relatively large amount for a few game states. Overall, $\tau = 0.001$ improves the expected reward (probability of winning) by an average of $\mathbf{2.15 \pm 0.17\%}$, but for some states, this expected improvement is over 5%. Note that the original temperature expected reward can be thought of as a Dirac distribution centered at $0$. The above finding answers our research question in this section: low-temperature sampling is able to improve the expected reward by relatively large amounts for some specific game states, which is likely why the ChessFormer 1000 and 1300 models were able to achieve transcendence.

| Temperature | $\mathbb{E}[\mathbb{P}_{\tau}]$ (%) | $\mathbb{E}[\mathbb{P}_{\tau} - \mathbb{P}_{1.0}]$ | Top-1 Acc (%) | Top-3 Acc (%) | Top-5 Acc (%) |
|---|---|---|---|---|---|
| $\tau = 0.001$ | $\mathbf{39.95 \pm 0.92}$ | $\mathbf{2.15 \pm 0.17}$ | $\mathbf{29.61 \pm 1.43}$ | $\mathbf{54.26 \pm 1.57}$ | $\mathbf{66.86 \pm 1.47}$ |
| $\tau = 0.75$ | $38.79 \pm 0.90$ | $0.99 \pm 0.06$ | $25.08 \pm 0.95$ | $47.84 \pm 1.09$ | $60.37 \pm 1.04$ |
| $\tau = 1.0$ | $37.80 \pm 0.87$ | $0 \pm 0$ | $22.61 \pm 0.86$ | $44.00 \pm 9.96$ | $56.27 \pm 0.93$ |
Table 1: Statistics describing the relationship between reward at $\tau = 0$ vs. $\tau = 1$. In the first column, we display the expected reward across our dataset, which is the probability $\mathbb{P}$ of winning as calculated by Stockfish 16.1. In the second column, we display $F$, the change in reward for the given temperature $\tau$ versus the baseline. In the last three columns we display the accuracy for the best moves ranked by Stockfish analysis run at a time cutoff of 1 second. Here, the top-$k$ accuracy is the percentage of games where the actual move sampled by the model was in the top-$k$ moves as ranked by Stockfish. We report 95% bootstrapped confidence intervals with 10K resamples.

In Table 1, we present the statistics of the favor function for different temperature values. From this table, we observe that as the temperature decreases, the top-$k$ accuracies monotonically increase, suggesting that the model becomes more consistent in selecting good moves. We also observe that although the model improves as temperature decreases, the probability of winning is still below 50%, meaning our model should tend to lose more games than it wins against Stockfish level 1. This result matches our results in Figure 1, as the rating of Stockfish level 1 is also higher than the reported rating for $\tau = 0.001$ ($1550$ for Stockfish level 1 vs. $\sim 1450$ for ChessFormer 1000). Overall, the analysis of the favor statistics provides further evidence for the effectiveness of low-temperature sampling in inducing transcendence in chess models.
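The top-$k$ accuracy and bootstrapped intervals reported in Table 1 can be computed along the following lines. This is a sketch only: the percentile bootstrap is an assumption (the text does not say which bootstrap variant was used), and the moves and engine rankings below are illustrative stand-ins.

```python
import random


def top_k_accuracy(played, ranked, k):
    """Fraction of positions where the played move is among the engine's top k."""
    hits = [move in ranking[:k] for move, ranking in zip(played, ranked)]
    return sum(hits) / len(hits)


def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Illustrative data: two positions with hypothetical engine rankings.
played = ["e4", "d4"]
ranked = [["e4", "d4", "c4"], ["c4", "Nf3", "d4"]]
acc_top1 = top_k_accuracy(played, ranked, k=1)
acc_top3 = top_k_accuracy(played, ranked, k=3)

# Illustrative hit indicators: 30 hits out of 100 positions.
hits = [1] * 30 + [0] * 70
lower, upper = bootstrap_ci(hits, n_resamples=2000)
```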

Dataset diversity is essential for transcendence.

As we note in subsection 3.4, our theory requires dataset diversity as a necessary condition for enabling transcendence. Importantly, we find in Figure 1 that not all models are able to transcend. Unlike ChessFormer 1000 or 1300, ChessFormer 1500 fails to transcend. We hypothesize that this result is due to the fact that in the band of ratings from 1000 to 1500, diversity does not significantly increase. If so, a 1000-rated player can be thought of as a noisy 1500-rated player, but a 1500-rated player cannot be thought of as a noisy 2000-rated player. In this section we ask the following research question: Is diversity in data required for enabling transcendence?

In Figure 5, we explore this research question by quantifying dataset diversity through the normalized entropy on the action distribution, $\mathcal{H}_f(Y|X) = \mathbb{E}_{y \sim f(y|x=X)}[-\log_2 f(y|x=X)] / \log_2 |\mathcal{Y}|$. To gain intuition for this metric, imagine the action distribution of moves taken for any given state. Entropy will be higher for more uniform action distributions, and lower for more deterministic, peaked action distributions. The average entropy of these action distributions can therefore serve as a measurement of the diversity of the dataset. We normalize this entropy to the range $[0, 1]$ by dividing by the binary log of the number of legal moves, $\log_2 |\mathcal{Y}|$.

Importantly, we cannot calculate this normalized entropy for every state, as most states after move 16 in the midgame and before the endgame are unique within the dataset, and we therefore observe just a single action for those states. Our metric is therefore limited in that it only considers opening moves, the beginning of the midgame, and the endgame. We consider only common states with greater than 100 actions by sampling 1,000,000 games from each dataset. The average entropy confirms our hypothesis: the $<1500$ cutoff dataset has on average less diversity than the $<1300$ dataset, which in turn has less diversity than the $<1000$ dataset. This result suggests that ChessFormer 1500 likely is not transcendent due to a lack of diversity in its dataset. If the entropy had instead stayed constant across datasets, it would imply that each had a similar level of diversity, and we would expect ChessFormer 1500 to transcend as well.
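The diversity metric and the common-state filter described above can be sketched as follows. This is a minimal sketch, assuming the dataset is available as (state, move) pairs and that a hypothetical `legal_move_count(state)` function supplies $|\mathcal{Y}|$ for each state; neither is specified in the text.

```python
import math
from collections import Counter, defaultdict


def normalized_entropy(action_counts, num_legal_moves):
    """Entropy (in bits) of the empirical move distribution, scaled to [0, 1]
    by dividing by log2 of the number of legal moves."""
    if num_legal_moves < 2:
        return 0.0  # a single legal move leaves no room for diversity
    total = sum(action_counts.values())
    entropy = sum(
        (c / total) * math.log2(total / c) for c in action_counts.values() if c > 0
    )
    return entropy / math.log2(num_legal_moves)


def average_diversity(observations, legal_move_count, min_actions=100):
    """Mean normalized entropy over states seen more than `min_actions` times.

    observations: iterable of (state, move) pairs.
    legal_move_count(state) -> int is a hypothetical helper.
    """
    per_state = defaultdict(Counter)
    for state, move in observations:
        per_state[state][move] += 1
    scores = [
        normalized_entropy(counts, legal_move_count(state))
        for state, counts in per_state.items()
        if sum(counts.values()) > min_actions
    ]
    return sum(scores) / len(scores) if scores else float("nan")


# Toy data: one common state with two equally popular moves and one rare
# state that the filter drops.
observations = [("s", "a")] * 60 + [("s", "b")] * 60 + [("rare", "a")] * 5
diversity = average_diversity(observations, legal_move_count=lambda state: 2)
```

Because the counts come directly from the data, a measure of this kind does not depend on any trained model.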

Refer to caption
Figure 5: Action distribution diversity, as measured by the average normalized entropy over different chess rating dataset cutoffs with $n = 2681, 3037, 3169$ common states for ratings $1000, 1300, 1500$, respectively. These entropies are calculated directly from the empirical frequencies of our dataset, and are model-agnostic.

5 Related Work

Chess and AI.

Chess has been motivating AI research since the field began. In 1950, before anyone had used the term “artificial intelligence”, automated chess was explored by both Claude Shannon [26] and Alan Turing [32]. Arguably, this history goes back even further: the famed “mechanical turk” of the 18th century was a fraudulently automated chess player. These centuries of mechanical ambitions were finally realized in 1997, when world champion Garry Kasparov was defeated by IBM’s Deep Blue [3]. Since then, chess program developers have drawn on neural approaches, with the RL-based convolutional network AlphaZero [27] far surpassing prior world champion engines such as Stockfish [25]. Our chess model testbed is inspired by a number of existing approaches, including other models trained on lichess data [18], and other transformer-based sequential chess agents [23, 5].

Diversity beats Strength.

Another historical thread in AI research is the strength of diverse learners. Since the development of ensemble methods that exploit learner diversity, including bagging [1], boosting [6], and model averaging [19], researchers have continued to articulate this insight across settings. Similar to our chess setting, a diverse team of Go-playing agents has been proven and empirically shown to outperform solitary agents [9] and homogeneous teams [28], even when the alternative models individually outperform the diverse team members [17]. We draw a connection to this deep literature through our theoretical results, which show that training on just the imitation learning objective and then performing low-temperature sampling subtly implies the same principle of majority voting used in this literature.

Teacher diversity has also been explored in the machine learning literature. One related method is ensemble distillation [16], in which a model is trained with an additional objective to match a variety of weaker teacher models. Closer to our setting, ensemble self-training approaches [24] train a learner directly on the labels produced by varied teachers. Large language models supervised by smaller or less trained models are said to exhibit “weak to strong generalization” [2]. Overall, evidence continues to accrue that the general phenomenon we address is pervasive: that is, models can substantially improve over the experts that generate their training data.

Offline Reinforcement Learning.

Our work also draws connections to the Offline Reinforcement Learning [14] setting, where one attempts to learn a new policy $\pi$ that improves upon a fixed dataset generated by some behavior policy $\pi_{\beta}$. However, our setting of imitation learning differs substantially from this literature, as we do not explicitly train our model on an RL objective that attempts to improve upon the dataset. Importantly, such an objective oftentimes introduces training instabilities [15] and also assumes reward labels, both of which are avoided with a pure imitation or self-supervised learning objective. We defer a more extended discussion of related work to Appendix D.

6 Discussion and Future Work

This paper introduces the concept of transcendence, where generative models trained on expert data outperform the best individual experts. Our theoretical analysis shows that low-temperature sampling is key to achieving transcendence by denoising expert biases and consolidating diverse knowledge. We validate our findings empirically by training several chess models which, under low-temperature sampling, surpass the performance of the players who produced their training data. We highlight the necessity of dataset diversity for transcendence, emphasizing the role of varied expert perspectives.

Limitations.

While our work provides a strong foundation for understanding and achieving transcendence in generative models, several avenues for future research remain. Future work may investigate transcendence and its causes in domains and contexts beyond chess, such as natural language processing, computer vision, and text-to-video, to understand the generalizability of our findings. Additionally, our theoretical framework assumes that game conditions at test time match those seen during training; in order to extend our findings to cases of composition or reasoning, we must forgo this assumption.

Future Work.

Future work could also explore the practical implementations of transcendence, and ethical considerations in the broader context of deployed generative models. Ultimately, our findings lay the groundwork for leveraging generative models to not only match but exceed human expertise across diverse applications, pushing the theoretical boundaries of what generative models can achieve.

Broader Impact.

The possibility of “superintelligent” AGI has recently fueled many speculative hopes and fears. It is therefore possible that our work will be cited by concerned communities as evidence of a threat, but we would highlight that the denoising effect addressed in this paper does not offer any evidence for a model being able to produce novel solutions that a human expert would be incapable of devising. In particular, we do not present evidence that low temperature sampling leads to novel abstract reasoning, but rather denoising of errors.

Acknowledgements

Sham Kakade acknowledges this work has been made possible in part by a gift from the Chan Zuckerberg Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial Intelligence; support from the Office of Naval Research under award N00014-22-1-2377, and the National Science Foundation Grant under award #IIS 2229881.

References

  • Breiman [1996] L. Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996. URL https://api.semanticscholar.org/CorpusID:47328136.
  • Burns et al. [2023] C. Burns, P. Izmailov, J. H. Kirchner, B. Baker, L. Gao, L. Aschenbrenner, Y. Chen, A. Ecoffet, M. Joglekar, J. Leike, et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390, 2023.
  • Campbell et al. [2002] M. Campbell, A. J. Hoane, and F.-h. Hsu. Deep Blue. Artificial Intelligence, 134(1):57–83, Jan. 2002. ISSN 0004-3702. doi: 10.1016/S0004-3702(01)00129-1.
  • Chen et al. [2021] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch. Decision transformer: Reinforcement learning via sequence modeling, 2021.
  • Feng et al. [2023] X. Feng, Y. Luo, Z. Wang, H. Tang, M. Yang, K. Shao, D. Mguni, Y. Du, and J. Wang. ChessGPT: Bridging Policy Learning and Language Modeling. Advances in Neural Information Processing Systems, 36:7216–7262, Dec. 2023.
  • Freund and Schapire [1999] Y. Freund and R. E. Schapire. A short introduction to boosting, 1999. URL https://api.semanticscholar.org/CorpusID:9621074.
  • Glickman [2012] M. E. Glickman. Example of the glicko-2 system. Boston University, 28, 2012.
  • Janner et al. [2021] M. Janner, Q. Li, and S. Levine. Offline reinforcement learning as one big sequence modeling problem, 2021.
  • Jiang et al. [2014] A. Jiang, L. Soriano Marcolino, A. D. Procaccia, T. Sandholm, N. Shah, and M. Tambe. Diverse randomized agents vote to win. Advances in Neural Information Processing Systems, 27, 2014.
  • Kakade and Langford [2002] S. M. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 2002. URL https://api.semanticscholar.org/CorpusID:31442909.
  • Karpathy [2022] A. Karpathy. NanoGPT. https://github.com/karpathy/nanoGPT, 2022.
  • Karvonen [2024] A. Karvonen. Emergent world models and latent variable estimation in chess-playing language models. arXiv preprint arXiv:2403.15498, 2024.
  • Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Levine et al. [2020] S. Levine, A. Kumar, G. Tucker, and J. Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Li et al. [2023] J. Li, E. Zhang, M. Yin, Q. Bai, Y.-X. Wang, and W. Y. Wang. Offline reinforcement learning with closed-form policy improvement operators. In International Conference on Machine Learning, pages 20485–20528. PMLR, 2023.
  • Lin et al. [2020] T. Lin, L. Kong, S. U. Stich, and M. Jaggi. Ensemble distillation for robust model fusion in federated learning. Advances in Neural Information Processing Systems, 33:2351–2363, 2020.
  • Marcolino et al. [2013] L. S. Marcolino, A. X. Jiang, and M. Tambe. Multi-agent team formation: Diversity beats strength? In IJCAI, volume 13, 2013.
  • McIlroy-Young et al. [2020] R. McIlroy-Young, S. Sen, J. Kleinberg, and A. Anderson. Aligning Superhuman AI with Human Behavior: Chess as a Model System. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1677–1687, Aug. 2020. doi: 10.1145/3394486.3403219.
  • McMahan et al. [2023] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data, 2023.
  • Mnih et al. [2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015.
  • Munos et al. [2016] R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare. Safe and efficient off-policy reinforcement learning. Advances in neural information processing systems, 29, 2016.
  • Nasu [2018] Y. Nasu. Efficiently Updatable Neural-Network-based Evaluation Functions for Computer Shogi, 2018.
  • Noever et al. [2020] D. Noever, M. Ciolino, and J. Kalin. The Chess Transformer: Mastering Play using Generative Language Models, Sept. 2020.
  • Odonnat et al. [2024] A. Odonnat, V. Feofanov, and I. Redko. Leveraging ensemble diversity for robust self-training in the presence of sample selection bias, 2024.
  • Pete [2018] Pete. AlphaZero Crushes Stockfish In New 1,000-Game Match. https://www.chess.com/news/view/updated-alphazero-crushes-stockfish-in-new-1-000-game-match, Dec. 2018.
  • Shannon [1950] C. E. Shannon. XXII. Programming a computer for playing chess. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 41(314):256–275, Mar. 1950. ISSN 1941-5982, 1941-5990. doi: 10.1080/14786445008521796.
  • Silver et al. [2017] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, Dec. 2017.
  • Soriano Marcolino et al. [2014] L. Soriano Marcolino, H. Xu, A. Xin Jiang, M. Tambe, and E. Bowring. Give a Hard Problem to a Diverse Team: Exploring Large Action Spaces. Proceedings of the AAAI Conference on Artificial Intelligence, 28(1), June 2014. ISSN 2374-3468, 2159-5399. doi: 10.1609/aaai.v28i1.8880.
  • The Stockfish developers [2024] The Stockfish developers (see AUTHORS file). Stockfish, 2024. URL https://stockfishchess.org/.
  • Toshniwal et al. [2022] S. Toshniwal, S. Wiseman, K. Livescu, and K. Gimpel. Chess as a Testbed for Language Model State Tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11385–11393, June 2022. doi: 10.1609/aaai.v36i10.21390.
  • Touati et al. [2020] A. Touati, A. Zhang, J. Pineau, and P. Vincent. Stable policy optimization via off-policy divergence regularization. In Conference on Uncertainty in Artificial Intelligence, pages 1328–1337. PMLR, 2020.
  • Turing [2004] A. Turing. Chess (1953). In B. J. Copeland, editor, The Essential Turing, page 0. Oxford University Press, Sept. 2004. ISBN 978-0-19-825079-1. doi: 10.1093/oso/9780198250791.003.0023.
  • Uma et al. [2021] A. N. Uma, T. Fornaciari, D. Hovy, S. Paun, B. Plank, and M. Poesio. Learning from disagreement: A survey. Journal of Artificial Intelligence Research, 72:1385–1470, 2021.
  • Van der Maaten and Hinton [2008] L. Van der Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
  • Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Xie et al. [2020] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10687–10698, 2020.
  • Zhang et al. [2022] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

Appendix A Proofs

Here we prove Theorem 1, which states that transcendence cannot occur by purely using imitation learning in our setting where all experts are sampled uniformly across the input distribution.

Proof.

From linearity of the expectation

$$
\begin{aligned}
R_{p_{\mathrm{test}}}(\hat{f}) &= \mathbb{E}_{x \sim p_{\mathrm{test}}}\left[r_x(\overline{f})\right] \\
&= \mathbb{E}_{x \sim p_{\mathrm{test}}}\left[\frac{1}{k}\sum_{i=1}^{k} r_x(f_i)\right] = \frac{1}{k}\sum_{i=1}^{k} R_{p_{\mathrm{test}}}(f_i) \leq \max_i R_{p_{\mathrm{test}}}(f_i)
\end{aligned}
$$

We now give the proof of Theorem 2 that if the arg-max prediction is better than the best expert, then transcendence is possible with low-temperature sampling.

Proof.

Observe that for all $q$, it holds that $\lim_{\tau \to 0} \mathrm{softmax}(q; \tau) = \arg\max(q)$. Therefore, for all $x$,

$$
\lim_{\tau\to 0} r_x(\hat{f}_\tau) = \lim_{\tau\to 0}\sum_y r(x,y)\cdot\hat{f}_\tau(y|x) = \sum_y r(x,y)\,\hat{f}_{\max}(y|x) = r_x(\hat{f}_{\max})
$$

and so,

$$
\begin{aligned}
\lim_{\tau\to 0} R_{p_{\mathrm{test}}}(\hat{f}_\tau) &= \lim_{\tau\to 0}\mathbb{E}_{x\sim p_{\mathrm{test}}}\left[r_x(\hat{f}_\tau)\right] \\
&= \mathbb{E}_{x\sim p_{\mathrm{test}}}\left[\lim_{\tau\to 0} r_x(\hat{f}_\tau)\right] = \mathbb{E}_{x\sim p_{\mathrm{test}}} r_x(\hat{f}_{\max}) = R_{p_{\mathrm{test}}}(\hat{f}_{\max})
\end{aligned}
$$

Therefore, the required result follows immediately. ∎
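The limit used in this proof can be illustrated with a short sketch. It assumes that $\mathrm{softmax}(q;\tau)_y \propto \exp(q_y/\tau)$ and that $\arg\max(q)$ is read as the uniform distribution over the maximizers of $q$; the score vector and rewards below are hypothetical.

```python
import math

def softmax_t(q, tau):
    """Temperature softmax: p_y proportional to exp(q_y / tau)."""
    m = max(q)  # subtract the max for numerical stability
    w = [math.exp((v - m) / tau) for v in q]
    z = sum(w)
    return [v / z for v in w]

def argmax_dist(q):
    """Uniform distribution over the maximizers of q."""
    m = max(q)
    hits = [1.0 if v == m else 0.0 for v in q]
    n = sum(hits)
    return [h / n for h in hits]

def expected_reward(p, r):
    """r_x(f) = sum_y r(x, y) * f(y | x) for one fixed input x."""
    return sum(pi * ri for pi, ri in zip(p, r))

q = [0.20, 0.45, 0.35]  # hypothetical scores for one input x
r = [0.0, 1.0, 0.0]     # hypothetical rewards r(x, y)

taus = (1.0, 0.1, 0.01, 0.001)
rewards = [expected_reward(softmax_t(q, tau), r) for tau in taus]
limit = expected_reward(argmax_dist(q), r)

for tau, value in zip(taus, rewards):
    print(f"tau={tau}: expected reward {value:.6f}")
print("arg-max reward:", limit)
```

As $\tau$ shrinks, the expected reward of the tempered distribution approaches that of the arg-max distribution, which is the pointwise statement the proof then integrates over $p_{\mathrm{test}}$.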

To prove Theorem 3, we directly use the result in Theorem 2.

Proof.

Notice that for this expert, $\operatorname*{arg\,max}(f(\cdot|x)) = f^{*}(y|x)$, which achieves higher reward compared to $f$. Therefore, Theorem 2 implies that we achieve transcendence in the setting where all the data is generated by a single expert $f$. ∎

Finally, we give the proof of Theorem 4, the statement that transcendence can occur from multiple experts if the test distribution $p_{\mathrm{test}}$ is spread across multiple disjoint subsets $\mathcal{X}_i$.

Proof.

In this case, observe that for all $i$

$$
\begin{aligned}
R_{p_{\mathrm{test}}}(f_i) &= p_{\mathrm{test}}(\mathcal{X}_i)\cdot\mathbb{E}_{x\sim p_{\mathrm{test}}|_{\mathcal{X}_i}} r_x(f^{*}) + p_{\mathrm{test}}(\mathcal{X}\setminus\mathcal{X}_i)\cdot\mathbb{E}_{x\sim\overline{p}|_{\mathcal{X}\setminus\mathcal{X}_i}}\left[\mathbb{E}_{y\sim\mathrm{Uni}(\mathcal{Y})} r(x,y)\right] \\
&< R_{p_{\mathrm{test}}}(f^{*})
\end{aligned}
$$

Therefore, we get that for all $x$

$$
\hat{f}(y|x) = \frac{1}{k}\sum_{j=1}^{k} f_j(y|x) = \frac{k-1}{k}\cdot\frac{1}{\left\lvert\mathcal{Y}\right\rvert} + \frac{1}{k} f^{*}(y|x) = \frac{k-1}{k\cdot\left\lvert\mathcal{Y}\right\rvert} + \frac{1}{k\left\lvert Y_x^{*}\right\rvert}\cdot\mathbf{1}_{y\in Y_x^{*}}
$$

Thus, we get $f_{\max} = f^{*}$, and the required follows from Theorem 2.
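The construction in this proof can be sketched concretely. The example below assumes, as the displayed formulas suggest, that expert $i$ plays $f^{*}$ (uniform over the optimal set $Y_x^{*}$) on its own subset $\mathcal{X}_i$ and plays uniformly over $\mathcal{Y}$ elsewhere. The sizes, the partition, the optimal sets, the uniform test distribution, and the 0/1 reward are all hypothetical choices made for illustration.

```python
K = 3                               # number of experts (hypothetical)
Y = list(range(4))                  # output set
X = list(range(6))                  # inputs; X_i = {x : x % K == i}
best = {x: {x % 4} for x in X}      # hypothetical optimal sets Y*_x
best[5] = {0, 1}                    # one input with two optimal outputs

def f_star(x):
    """Optimal policy: uniform over the optimal outputs Y*_x."""
    return [1.0 / len(best[x]) if y in best[x] else 0.0 for y in Y]

def expert(i, x):
    """Expert i is optimal on its own subset and uniform elsewhere."""
    if x % K == i:
        return f_star(x)
    return [1.0 / len(Y)] * len(Y)

def average(x):
    """f_hat(y | x) = (1/K) * sum_j f_j(y | x)."""
    return [sum(expert(i, x)[y] for i in range(K)) / K for y in Y]

def argmax_dist(q):
    """Uniform distribution over the maximizers of q."""
    m = max(q)
    hits = [1.0 if v == m else 0.0 for v in q]
    n = sum(hits)
    return [h / n for h in hits]

def reward(policy):
    """Uniform p_test with r(x, y) = 1 if y is in Y*_x, else 0."""
    return sum(sum(policy(x)[y] for y in best[x]) for x in X) / len(X)

f_max_equals_f_star = all(argmax_dist(average(x)) == f_star(x) for x in X)
expert_rewards = [reward(lambda x, i=i: expert(i, x)) for i in range(K)]
optimal_reward = reward(f_star)

print("f_max == f*:", f_max_equals_f_star)
print("expert rewards:", expert_rewards, "optimal:", optimal_reward)
```

Each expert is strictly worse than $f^{*}$ because it acts uniformly outside its own subset, yet the arg-max of the averaged distribution recovers $f^{*}$ on every input. The strict gap relies on $Y_x^{*}$ being a proper subset of $\mathcal{Y}$, which holds in this toy setup.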

Appendix B Additional Denoising Visualizations

Figure 6: An example of where denoising helps black find the only correct move. White has pinned the black rook to the Queen: any move where the rook does not move to e4 results in a heavy loss of material. As $\tau$ decreases, the expected reward increases substantially and converges onto the correct move.
Figure 7: Another example where denoising helps avoid errors. Moving the queen to either d1 or h1 takes a bishop or rook, respectively, but loses the queen in the following turn. While queen to e5 does not put the queen in immediate danger, it allows white to push the pawn on f3 to d3, where it threatens the queen and is protected by the bishop on c1. The queen then must move out of danger, losing its opportunity to take the free pawn on h4 and giving white valuable space towards the center of the board. As $\tau$ decreases, the expected reward converges to the move queen to d4, taking the pawn and checking the black king.
Figure 8: In this setup, a higher temperature shows two plausible moves for the black rook: g1 or f1. As the temperature decreases, the expected reward converges to g1. If the black rook were to move to f1, the white rook would take the black rook, blocking the black pawn on f2 from promoting and protecting the promotion square from the h2 pawn. If the rook were to move to g1, on the other hand, it would open the promotion square from the h2 pawn without being at any immediate risk. If white responded by moving its bishop to g2, protecting the promotion squares from both of the advanced black pawns, black could respond by taking the rook on a1, gaining significant material.

Appendix C Intuition of low temperature sampling inducing transcendence

To build intuition for the primary mechanism of transcendence that we explore in this paper, we give the following toy progression of distributions to illustrate how low-temperature sampling can induce transcendence through majority voting. Here, the middle purple action represents the correct, high-reward output, whilst the left and right actions are low-reward bad outputs. We plot the probability of each output as a label on the x-axis.

Figure 9: The first expert output distribution. Although it puts non-negligible mass on the purple, high-reward action, it still samples a low-reward action the majority of the time.
Figure 10: The second expert output distribution. Symmetric to the first expert, it also puts non-negligible mass on the purple, high-reward action. However, it samples a low-reward action on the right the majority of the time.
Figure 11: By taking the average of the first and second expert, we observe that this distribution now puts more mass on the correct action than on any other single action.
Figure 12: Finally, by setting temperature $\tau$ to be $<1$, more weight is shifted towards the high probability action, leading to a gain in the expected reward.
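The progression in Figures 9 to 12 can be reproduced with a small sketch. The probabilities below are hypothetical and are not read off the figures. The sketch also assumes the common convention in which temperature acts on log-probabilities, so that the tempered distribution is proportional to $p_y^{1/\tau}$ and $\tau = 1$ leaves the distribution unchanged.

```python
def sharpen(p, tau):
    """Temperature sampling on log-probabilities: p_y^(1/tau) / Z."""
    m = max(p)  # rescale by the max to avoid underflow at small tau
    w = [(v / m) ** (1.0 / tau) for v in p]
    z = sum(w)
    return [v / z for v in w]

def expected_reward(p, r):
    return sum(pi * ri for pi, ri in zip(p, r))

# Actions: left (bad), middle (good, "purple"), right (bad).
reward = [0.0, 1.0, 0.0]
expert_1 = [0.6, 0.4, 0.0]   # mostly errs to the left (hypothetical)
expert_2 = [0.0, 0.4, 0.6]   # mostly errs to the right (hypothetical)

# Averaging spreads the errors apart while the correct votes add up.
average = [(a + b) / 2 for a, b in zip(expert_1, expert_2)]

print("expert 1:", expected_reward(expert_1, reward))
print("expert 2:", expected_reward(expert_2, reward))
for tau in (1.0, 0.5, 0.1):
    print(f"average, tau={tau}:", expected_reward(sharpen(average, tau), reward))
```

In this toy setup each expert errs in a different direction, so the average has its mode on the middle action even though neither expert picks it most of the time. Lowering $\tau$ then moves mass onto that mode, lifting the expected reward above that of either expert.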

Appendix D Further Related Work

D.1 Label Disagreement

Label disagreement in training data, in particular, can improve models in practice. Xie et al. [36] empirically show that adding random noise to teacher-generated labels can improve a student model. Uma et al. [33] even survey the literature on human interannotator disagreement and find a trend of improvements when models are trained on the full set of disagreeing labels rather than on majority vote labels or only on data where labelers agree. Our theoretical claims build on these findings by making the point that the learner can even improve on these original diverse labelers.

D.2 Offline Reinforcement Learning

Although most Offline Reinforcement Learning algorithms train on an RL objective, perhaps most similar to our work are Decision Transformer [4] and Trajectory Transformer [8]: prior models trained on just the sequence prediction of trajectories. Most notably, Decision Transformer also finds a form of transcendence different from the one explored in this paper: by conditioning the trained transformer on the performance of the trajectory, one can prompt the model at inference time to perform better than the best trajectory seen during training. This remains another promising direction in which to explore transcendence.

Interestingly, an analogue to low-temperature sampling also has been noticed and exploited by Reinforcement Learning practitioners in the context of off-policy learning, where a different exploration policy $\pi_E$ is used than the final learned target policy $\pi_T$. Oftentimes $\pi_T$ will just be set to a greedy version of $\pi_E$ [21], such as choosing $\pi_T$ to take the $\operatorname*{arg\,max}$ action of $\pi_E$, which we note is directly equivalent to setting temperature to 0.

Appendix E Training Details

We give a full list of the hyperparameters we used for training here. Note that we largely follow the same hyperparameter set as [37], but lower the batch size to 125K tokens, as we found training to still be stable at this level. We also release our code openly to support further research into transcendence, which was built off the wonderful work done by Karvonen [12] and Karpathy [11].

| Hyperparameter | Value |
|---|---|
| ChessFormer | |
| Optimizer | AdamW [13] |
| Activation Function | ReLU |
| Mini-batch size | 125K tokens |
| Gradient Accumulation Steps | 1 |
| Transformer num. layers | 16 |
| Transformer num. heads | 8 |
| Transformer embedding dim. | 512 |
| Dropout | 0.0 |
| Learning Rate | 3e-4 |
| Number of gradient steps | 100K |
| Weight Decay | 0.1 |
| Critic hidden layers | 3 |
| Adam $\beta_1$ | 0.90 |
| Adam $\beta_2$ | 0.95 |
| Gradient Clip | 1.0 |
| Cosine Learning Rate | True |
| Warmup Iterations | 2000 |
| Minimum Learning Rate | 3e-5 |
| Learning Rate Decay Iterations | 400K |
| Tensor datatype | bfloat16 |
Table 2: Hyperparameters for our ChessFormer model.
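To make the learning-rate entries in Table 2 concrete, the following is a minimal sketch of a cosine schedule with linear warmup. The exact schedule used in training is not stated in this appendix; the form below is an assumption, modeled on the warmup-plus-cosine pattern common in GPT-style training scripts, and the function name `get_lr` is illustrative.

```python
import math

# Values taken from Table 2.
LEARNING_RATE = 3e-4
MIN_LR = 3e-5
WARMUP_ITERS = 2_000
LR_DECAY_ITERS = 400_000
NUM_STEPS = 100_000

def get_lr(step):
    """Assumed schedule: linear warmup, then cosine decay to MIN_LR."""
    if step < WARMUP_ITERS:
        # Linear warmup from 0 to the peak learning rate.
        return LEARNING_RATE * step / WARMUP_ITERS
    if step > LR_DECAY_ITERS:
        # After the decay horizon, hold the minimum learning rate.
        return MIN_LR
    # Cosine interpolation between the peak and the minimum.
    ratio = (step - WARMUP_ITERS) / (LR_DECAY_ITERS - WARMUP_ITERS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))
    return MIN_LR + coeff * (LEARNING_RATE - MIN_LR)

for step in (0, 1_000, WARMUP_ITERS, NUM_STEPS, LR_DECAY_ITERS):
    print(step, get_lr(step))
```

Under this assumed schedule, the 100K gradient steps cover only the early part of the 400K-iteration cosine decay, so the learning rate would still be well above the minimum when training ends.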

Appendix F Compute Resources

We train all of our models on the Nvidia H100 80GB GPU. Training one model takes around 6 to 12 hours.

Appendix G Full t-SNE

We visualize the full t-SNE here, coloring by the reward of the game. We see that the model has learned some representation of the reward, with high absolute reward states being more likely to be near each other in the latent space. This also points towards evidence that the model has learned some sort of equivariant representation of the player identity, as the region of symmetric high reward states indicates. Note that reward is not directly given to the model during training.

[Figure: full t-SNE visualization, colored by game reward.]

We visualize the same t-SNE, but this time coloring by game length rather than reward. We see that games with high reward tend to be longer, which makes sense, as the result of the game tends to become clearer as the game progresses.

[Figure: full t-SNE visualization, colored by game length.]