Transcendence: Generative Models Can Outperform the Experts That Train Them
Abstract
Generative models are trained with the simple objective of imitating the conditional probability distribution induced by the data they are trained on. Therefore, when trained on data generated by humans, we may not expect the artificial model to outperform the humans on their original objectives. In this work, we study the phenomenon of transcendence: when a generative model achieves capabilities that surpass the abilities of the experts generating its data. We demonstrate transcendence by training an autoregressive transformer to play chess from game transcripts, and show that the trained model can sometimes achieve better performance than all players in the dataset. (To play with our models, code, and data, please see our website at https://transcendence.eddie.win.) We theoretically prove that transcendence can be enabled by low-temperature sampling, and rigorously assess this claim experimentally. Finally, we discuss other sources of transcendence, laying the groundwork for future investigation of this phenomenon in a broader setting.
![Refer to caption](x1.png)
1 Introduction
Generative models (GMs) are typically trained to mimic human behavior. These humans may be skilled in their various human objectives: answering a question, creating art, singing a song. The model has only one objective: minimizing the cross-entropy loss with respect to the output distribution, thereby adjusting it to match the distribution of human labels. (Although chatbots are subject to a variety of post-training tuning methods, e.g., RLHF, we restrict our scope by assuming that the specialized knowledge and capacities are already provided by cross-entropy loss.) Therefore, one might assume the model can, at best, match the performance of an expert on their human objectives. Is it possible for these models to surpass, that is, to transcend, their expert sources in some domains?
We illustrate an example of such transcendence in Figure 1, which measures the chess ratings (Glicko-2 [7]) of several transformer [35] models. Our experimental testbed is generative modeling on chess, which we choose as a domain for its well-understood, constrained nature. The transformer models are trained on public datasets of human chess transcripts, autoregressively predicting the next move in the game. To test for transcendence, we limit the maximal rating of the human players in the dataset below a specified score. We find that ChessFormer 1000 and ChessFormer 1300 (the latter number being the maximum rating seen during training) achieve significant levels of transcendence, surpassing the maximal rating seen in the dataset. Our focus is this capacity of a GM to transcend its expert sources by broadly outperforming any one expert. The key to our findings is the observation that GMs implicitly perform majority voting over the human experts. As these models are trained on a collection of many experts with diverse capacities, predilections, and biases, this majority vote oftentimes outperforms any individual expert, a phenomenon known as “wisdom of the crowd”.
Our objective is to formalize the notion of transcendence and focus narrowly on this source of improvement over the experts: the removal of diverse human biases and errors. We prove that this form of denoising is enabled by low-temperature sampling, which implicitly induces a majority vote. Our result draws a subtle but deep connection from our new setting to a rich prior literature on model ensembling [1, 6, 19], enabling several key results. We precisely characterize the conditions under which transcendence is possible, and give a rigorous theoretical framework for enabling future study into the phenomenon. To test the predictive power of our theory, we then empirically demonstrate these effects. Digging deeper into the effects of majority voting, we show that its advantage is primarily due to performing much better on a small subset of states—that is, under conditions that are likely key to determining the outcome of the game. We also find that diversity in the data is a necessary condition for practically effective majority voting, confirming our theoretical findings. In short:
- We formalize the notion of transcendence in generative models (Section 2).
- We find a key insight explaining one cause of transcendence by connecting the case of denoising experts to model ensembling. In low-temperature sampling settings, we prove that a generative model can transcend if trained on a single expert that makes mistakes uniformly at random. We then extend this result to transcending a collection of experts that are each skilled in different domains (Section 3).
- We train a chess transformer on game transcripts that only include players up to a particular skill level. We confirm our theoretical prediction that this model only surpasses the maximum rating of its expert data generators at low temperature settings (Section 4).
- We visualize the distribution of changes in reward by setting a lower sampling temperature, attributing the increased performance to large improvements on a relatively small portion of states (Section 4.2).
- We explore the necessity of dataset diversity, and the inability of ChessFormer to transcend when trained on less diverse datasets (Section 4.2).
2 Definition of Transcendence
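Denote by $\mathcal{X}$ the (variable-length) input space and by $\mathcal{Y}$ the (finite) output space. Let $\mathcal{F}$ be the class of all functions mapping $\mathcal{X} \to P(\mathcal{Y})$ (where we use the notation $P(\mathcal{Y})$ to denote probability distributions over $\mathcal{Y}$). That is, the functions in $\mathcal{F}$ map inputs in $\mathcal{X}$ to probability distributions over $\mathcal{Y}$, so each function $f \in \mathcal{F}$ defines a conditional probability distribution of $y \in \mathcal{Y}$ given $x \in \mathcal{X}$. We denote this distribution by $f(\cdot \mid x)$.
Fix some input distribution $p$ over $\mathcal{X}$ such that $p$ has full support (namely, for every $x \in \mathcal{X}$ we have $p(x) > 0$). Throughout the paper, we assume that our data is labeled by $k$ experts, denoted $f_1, \dots, f_k \in \mathcal{F}$. Namely, we assume that the inputs are sampled from the input distribution $p$ and then each input $x$ is labeled by some expert chosen uniformly at random. (Equivalently, we can assume that each example is labeled by all experts.) This process induces a joint probability distribution over $\mathcal{X} \times \mathcal{Y}$, which we denote by $\mathcal{D}$. Specifically, $\mathcal{D}(x, y) = p(x)\, \bar{f}(y \mid x)$, where $\bar{f}$ is the mixture of the expert distributions, namely
$$\bar{f}(y \mid x) = \frac{1}{k} \sum_{i=1}^{k} f_i(y \mid x) \tag{1}$$
We measure the quality of some prediction function $f \in \mathcal{F}$ using a reward assigned to each input-output pair. Namely, we define a reward function $r : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, s.t. for all $x \in \mathcal{X}$, the function $r(x, \cdot)$ is not constant (i.e., for every input not all outputs have the same reward). We choose some test distribution $p_{\text{test}}$ over $\mathcal{X}$, and for some $f \in \mathcal{F}$ define the average reward of $f$ over $p_{\text{test}}$ by:
$$R_{p_{\text{test}}}(f) = \mathbb{E}_{x \sim p_{\text{test}}} \left[ \mathbb{E}_{y \sim f(\cdot \mid x)} \left[ r(x, y) \right] \right] \tag{2}$$
A learner has access to the distribution $\mathcal{D}$, and needs to find a function that minimizes the cross-entropy loss over $\mathcal{D}$. Namely, the learner chooses some function $\hat{f}$ s.t. $\hat{f} = \arg\min_{f \in \mathcal{F}} \mathbb{E}_{x \sim p} \left[ H(\bar{f}(\cdot \mid x), f(\cdot \mid x)) \right]$, where $H$ is the cross-entropy function.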
Definition 1.
We define “transcendence” to be a setting of $f_1, \dots, f_k \in \mathcal{F}$ and $p \in P(\mathcal{X})$ where:
$$R_{p_{\text{test}}}(\hat{f}) > \max_{i \in [k]} R_{p_{\text{test}}}(f_i) \tag{3}$$
In other words, transcendence describes cases where the learned predictor performs better (achieves better reward) than the best expert generating the data. Note that we are focusing on an idealized setting, where the learner has access to an infinite amount of data from the distribution $\mathcal{D}$, and can arbitrarily choose any function to fit the distribution (not limited to a particular choice of architecture or optimization constraints). As we will show, even in this idealized setting, transcendence can be impossible to achieve without further modifying the distribution.
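As a minimal sketch of the objects defined in this section, the following Python snippet computes the uniform mixture of Eq. (1), the average reward of Eq. (2), and the check of Definition 1 on a toy problem. The two inputs, two outputs, the reward table, the expert probabilities, and all function names are illustrative assumptions, not values or code from our experiments.

```python
from typing import Dict, List

Dist = Dict[str, float]                 # output y -> probability
Policy = Dict[str, Dist]                # input x -> conditional distribution over y
Reward = Dict[str, Dict[str, float]]    # x -> y -> reward


def mixture(experts: List[Policy]) -> Policy:
    """Uniform mixture of the expert conditionals (Eq. 1)."""
    k = len(experts)
    return {
        x: {y: sum(f[x][y] for f in experts) / k for y in experts[0][x]}
        for x in experts[0]
    }


def average_reward(f: Policy, reward: Reward, p_test: Dict[str, float]) -> float:
    """Expected reward of f under the test distribution (Eq. 2)."""
    return sum(
        p_x * sum(prob * reward[x][y] for y, prob in f[x].items())
        for x, p_x in p_test.items()
    )


def transcends(f_hat: Policy, experts: List[Policy],
               reward: Reward, p_test: Dict[str, float]) -> bool:
    """Definition 1: f_hat is strictly better than the best expert."""
    best_expert = max(average_reward(f, reward, p_test) for f in experts)
    return average_reward(f_hat, reward, p_test) > best_expert


# Toy problem; every number below is hypothetical.
reward = {"x1": {"a": 1.0, "b": 0.0}, "x2": {"a": 0.0, "b": 1.0}}
expert_1 = {"x1": {"a": 0.75, "b": 0.25}, "x2": {"a": 0.5, "b": 0.5}}
expert_2 = {"x1": {"a": 0.5, "b": 0.5}, "x2": {"a": 0.25, "b": 0.75}}
p_test = {"x1": 0.5, "x2": 0.5}

f_bar = mixture([expert_1, expert_2])
mixture_reward = average_reward(f_bar, reward, p_test)
mixture_transcends = transcends(f_bar, [expert_1, expert_2], reward, p_test)
```

In this toy case the mixture earns exactly the same average reward as each expert, so the strict inequality of Definition 1 fails, which illustrates why fitting the distribution alone may not be enough.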
Remark 1.
We have made various simplifying assumptions when introducing our setting. For example, we assume that all experts share the same input distribution, we assume that all inputs have non-zero probability under the training input distribution, and we assume the experts are sampled uniformly at random. We leave a complete analysis of a more general setting to future work, and discuss this point further in section 6.
3 Conditions for Transcendence
In this section we analyze the necessary and sufficient conditions for transcendence in our setting. We begin by showing that low-temperature sampling is necessary for transcendence in our specific setting. Then, we analyze specific sufficient conditions for transcendence, both in the case where the data is generated by a single expert and when the data is generated by multiple experts. We defer all proofs to Appendix A.
![Refer to caption](extracted/5697352/advantage-analysis.png)
3.1 Low-Temperature Sampling is Necessary for Transcendence
Observe that by definition of $\hat{f}$, and using standard properties of the cross-entropy loss, we get that $\hat{f} = \bar{f}$, as defined in Eq. (1). Therefore, the conditional probability distribution generated by $\hat{f}$ is simply an average of the distributions generated by the experts. Since the reward is a linear function of these distributions, we get that $\hat{f}$ never achieves transcendence:
Theorem 1.
For all choices of $f_1, \dots, f_k$ and $p_{\text{test}}$, there exists some $f_i$ s.t. $R_{p_{\text{test}}}(\hat{f}) \le R_{p_{\text{test}}}(f_i)$.
Note that in our setting, we assume that all experts are sampled uniformly for a given input $x$. If instead this assumption is removed, then it may be possible to achieve transcendence with a Bayesian weighting. We leave this analysis for future work.
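The linearity argument behind Theorem 1 can be written out in a few lines. The derivation below is a sketch; it writes $\bar{f}$ for the uniform mixture of the $k$ experts $f_1, \dots, f_k$, $R$ for the average reward of Eq. (2), and $p_{\text{test}}$ for the test distribution, where the last symbol is an assumed name.

```latex
\begin{aligned}
R(\bar{f})
  &= \mathbb{E}_{x \sim p_{\text{test}}} \Big[ \sum_{y} \bar{f}(y \mid x)\, r(x, y) \Big] \\
  &= \frac{1}{k} \sum_{i=1}^{k} \mathbb{E}_{x \sim p_{\text{test}}} \Big[ \sum_{y} f_i(y \mid x)\, r(x, y) \Big] \\
  &= \frac{1}{k} \sum_{i=1}^{k} R(f_i) \;\le\; \max_{i} R(f_i).
\end{aligned}
```

An average can never exceed its largest term, so a predictor that exactly matches the mixture cannot beat the best expert.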
3.2 Transcendence with Low-Temperature Sampling
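Now, we consider a temperature sampling scheme over the learned function $\hat{f}$. Namely, for some temperature $\tau > 0$ and some probability distribution $q \in P(\mathcal{Y})$, denote the softmax operator with temperature $\tau$ by $\mathrm{softmax}(q; \tau) \in P(\mathcal{Y})$ s.t. $\mathrm{softmax}(q; \tau)_y = \frac{\exp(q_y / \tau)}{\sum_{y' \in \mathcal{Y}} \exp(q_{y'} / \tau)}$. Additionally, we define $\arg\max(q) \in P(\mathcal{Y})$ to be the uniform distribution over the maximal values of $q$, namely $\arg\max(q)_y = 1 / |Y_q|$ if $y \in Y_q$ and 0 if $y \notin Y_q$, where $Y_q = \{ y \in \mathcal{Y} : q_y = \max_{y'} q_{y'} \}$. Now, define $\hat{f}_\tau$ to be the temperature sampling of $\hat{f}$, i.e. $\hat{f}_\tau(\cdot \mid x) = \mathrm{softmax}(\hat{f}(\cdot \mid x); \tau)$, and $\hat{f}_{\max}$ the arg-max “sampling” of $\hat{f}$, i.e. $\hat{f}_{\max}(\cdot \mid x) = \arg\max(\hat{f}(\cdot \mid x))$. We now show that if the arg-max predictor is better than the best expert, then transcendence is possible with low-temperature sampling.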
Theorem 2.
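$R_{p_{\text{test}}}(\hat{f}_{\max}) > \max_{i \in [k]} R_{p_{\text{test}}}(f_i)$ if and only if there exists some temperature $\tau > 0$ s.t. for all $0 \le \tau' \le \tau$ (with the convention $\hat{f}_0 = \hat{f}_{\max}$), it holds that $R_{p_{\text{test}}}(\hat{f}_{\tau'}) > \max_{i \in [k]} R_{p_{\text{test}}}(f_i)$.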
The above shows that, even though transcendence cannot be achieved when directly modeling the distribution, it can be achieved by temperature sampling, assuming that the arg-max predictor achieves higher reward compared to all experts. In other words, we make the subtle connection here that low-temperature sampling can be thought of as performing a majority vote [1, 6] between the experts. When the experts put non-negligible mass onto the best actions, the resulting majority vote may find the best action [9], which improves performance compared to individual experts (i.e., “wisdom of the crowd”) and thus achieves transcendence.
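To make the temperature operator concrete, the sketch below applies a temperature softmax to a probability vector, following the definition in this section, and compares it with the arg-max distribution. In practice, implementations usually apply the temperature to logits rather than probabilities; the three-move distribution `q` and the function names are hypothetical.

```python
import math
from typing import List


def softmax_temperature(q: List[float], tau: float) -> List[float]:
    """Temperature softmax over a probability vector q, for tau > 0."""
    peak = max(q)  # subtracting the max keeps exp() numerically stable
    weights = [math.exp((p - peak) / tau) for p in q]
    total = sum(weights)
    return [w / total for w in weights]


def argmax_distribution(q: List[float]) -> List[float]:
    """Uniform distribution over the maximal entries of q."""
    peak = max(q)
    n_winners = sum(1 for p in q if p == peak)
    return [1.0 / n_winners if p == peak else 0.0 for p in q]


# Hypothetical averaged expert distribution over three candidate moves.
q = [0.5, 0.3, 0.2]
warm = softmax_temperature(q, tau=1.0)   # still spreads mass over all moves
cold = softmax_temperature(q, tau=0.01)  # nearly all mass on the plurality move
vote = argmax_distribution(q)            # the limiting "majority vote"
```

As the temperature shrinks, `cold` approaches `vote`: the move favored by the largest share of the averaged experts is selected almost deterministically.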
3.3 Denoising a Single Expert
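We now turn to study particular cases where low-temperature sampling can lead to transcendence. The simplest case is that of a single expert that outputs correct but noisy predictions. Denote by $f^*$ the optimal expert, s.t. for all $x \in \mathcal{X}$ we have $f^*(y \mid x) = \frac{1}{|Y^*_x|} \mathbb{1}[y \in Y^*_x]$, where $Y^*_x = \arg\max_{y \in \mathcal{Y}} r(x, y)$ and $\mathbb{1}[\cdot]$ is 1 if the condition is true and 0 otherwise. Now, for some $\rho \in (0, 1)$, let $f_\rho$ be a “noisy” expert, s.t., for all $x$, with probability $\rho$ it chooses a uniformly random output, and with probability $1 - \rho$ it chooses an output according to the optimal expert $f^*$, namely $f_\rho(y \mid x) = \frac{\rho}{|\mathcal{Y}|} + (1 - \rho) f^*(y \mid x)$. We show that transcendence is achieved with low-temperature sampling for data generated by $f_\rho$: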
Theorem 3.
Assume the data is generated by a single expert $f_\rho$. Then, there exists some temperature $\tau > 0$ s.t. for all $0 \le \tau' \le \tau$, the predictor $\hat{f}_{\tau'}$ achieves “transcendence”.
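A small numerical sketch of the single noisy expert case follows. It assumes a single input with four outputs, one of which is optimal with reward 1, and a noise level of one half; these values, like the function names, are hypothetical and chosen only for illustration.

```python
import math
from typing import List


def noisy_expert(n_outputs: int, best: int, rho: float) -> List[float]:
    """With probability rho act uniformly at random, otherwise play the best output."""
    return [
        rho / n_outputs + (1.0 - rho) * (1.0 if y == best else 0.0)
        for y in range(n_outputs)
    ]


def softmax_temperature(q: List[float], tau: float) -> List[float]:
    """Temperature softmax over a probability vector q, for tau > 0."""
    peak = max(q)
    weights = [math.exp((p - peak) / tau) for p in q]
    total = sum(weights)
    return [w / total for w in weights]


def expected_reward(dist: List[float], rewards: List[float]) -> float:
    """Reward averaged over the outputs drawn from dist."""
    return sum(p * r for p, r in zip(dist, rewards))


n_outputs, best, rho = 4, 0, 0.5          # hypothetical setting
rewards = [1.0, 0.0, 0.0, 0.0]            # only the best output is rewarded

expert = noisy_expert(n_outputs, best, rho)
expert_reward = expected_reward(expert, rewards)
cold_reward = expected_reward(softmax_temperature(expert, tau=0.05), rewards)
```

The noisy expert keeps the optimal output as its single most likely choice, so sharpening its distribution removes most of the uniform noise and `cold_reward` exceeds `expert_reward`.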
3.4 Transcendence from Multiple Experts
Next, we consider the case where the dataset is generated by multiple experts that complement each other in terms of their ability to correctly predict the best output. For example, consider the case where the input space $\mathcal{X}$ is partitioned into $k$ disjoint subsets, $\mathcal{X} = \mathcal{X}_1 \cup \dots \cup \mathcal{X}_k$, s.t. the $i$-th expert performs well on the subset $\mathcal{X}_i$, but behaves randomly on other subsets. Namely, assume the expert $f_i$ behaves as follows: $f_i(y \mid x) = \frac{1}{|Y^*_x|} \mathbb{1}[y \in Y^*_x]$ if $x \in \mathcal{X}_i$, and $f_i(y \mid x) = \frac{1}{|\mathcal{Y}|}$ if $x \notin \mathcal{X}_i$, where $Y^*_x$ is as previously defined and $\mathbb{1}[\cdot]$ is 1 if the condition is true and 0 otherwise. We show that, assuming that the test distribution is not concentrated on a single subset $\mathcal{X}_i$, we achieve transcendence with low-temperature sampling:
Theorem 4.
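Let $p_{\text{test}}$ be some distribution s.t. there are at least two subsets $\mathcal{X}_i \neq \mathcal{X}_j$ s.t. $p_{\text{test}}(\mathcal{X}_i) > 0$ and $p_{\text{test}}(\mathcal{X}_j) > 0$. Then, if the data is generated by $f_1, \dots, f_k$, there exists some temperature $\tau > 0$ s.t. for all $0 \le \tau' \le \tau$, the predictor $\hat{f}_{\tau'}$ achieves “transcendence”.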
For intuition on Theorem 4, see the diagram in Appendix C.
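The multiple-expert setting can also be checked numerically. The sketch below assumes two inputs, each forming its own subset, two experts that are each optimal on one subset and uniform on the other, a uniform test distribution, and the arg-max limit of low-temperature sampling. All values and names are hypothetical.

```python
from typing import Callable, List

N_OUTPUTS = 3
BEST = {0: 2, 1: 0}  # hypothetical best output for input 0 and input 1


def expert(i: int, x: int) -> List[float]:
    """Expert i is optimal on its own subset {i} and uniform elsewhere."""
    if x == i:
        return [1.0 if y == BEST[x] else 0.0 for y in range(N_OUTPUTS)]
    return [1.0 / N_OUTPUTS] * N_OUTPUTS


def mixture(x: int) -> List[float]:
    """Uniform average of the two experts on input x."""
    dists = [expert(i, x) for i in BEST]
    return [sum(d[y] for d in dists) / len(dists) for y in range(N_OUTPUTS)]


def argmax_distribution(q: List[float]) -> List[float]:
    """Uniform distribution over the maximal entries of q."""
    peak = max(q)
    n_winners = sum(1 for p in q if p == peak)
    return [1.0 / n_winners if p == peak else 0.0 for p in q]


def average_reward(policy: Callable[[int], List[float]]) -> float:
    """Average reward under a uniform test distribution over both inputs."""
    return sum(
        0.5 * sum(p * (1.0 if y == BEST[x] else 0.0)
                  for y, p in enumerate(policy(x)))
        for x in BEST
    )


expert_rewards = [average_reward(lambda x, i=i: expert(i, x)) for i in BEST]
mixture_reward = average_reward(mixture)
vote_reward = average_reward(lambda x: argmax_distribution(mixture(x)))
```

Each expert is right on its own subset and guesses on the other, so its reward stays below 1, while the arg-max over the mixture picks the best output on both subsets.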
4 Experiments
To evaluate the predictive power of our impossibility result of transcendence with no temperature sampling (Theorem 1) as well as our result of transcendence from multiple experts with low temperature sampling (Theorem 2), we turn to modeling and training chess players. Chess stands out as an attractive option for several reasons. Chess is a well-understood domain and more constrained than other settings such as natural language generation, lending itself to easier and stronger analysis. Evaluation of skill in chess is also natural and well-studied, with several rigorous statistical rating systems available. In this paper, we use the Glicko-2 rating system [7], which is also adopted by https://lichess.org, the free and open-source online chess server from which we source our dataset.
4.1 Experimental Setup
![Refer to caption](extracted/5697352/latent_board_state_reward_tsne.png)
Training Details.
We trained several M parameter autoregressive transformer decoders following best practices from modern large model training, including a cosine learning rate schedule and similar batch size-learning rate ratios as prescribed by the OPT-175B team [37]. Our dataset consists of human chess games from the lichess.org open source database from January 2023 to October 2023. In total, this dataset contains approximately one billion games. In this setting, an expert is a specific individual player. To test for transcendence, we truncate this dataset by a maximum rating, so that during training a model only sees data up to a given rating. We train our model on the next-token prediction objective, and represent our chess games as Portable Game Notation (PGN) strings, such as 1.e4 e5 2.Nf3 Nc6 3.Bb5... 1/2-1/2. Note that we do not give any rating or reward information during training: the only inputs the model sees are the moves and the outcome of the game. We tokenize our dataset at the -symbol character level. (For further details, see Appendix E.) Our model plays chess “blind”, without direct access to the board state, and, furthermore, is never explicitly given the rules of the game: at no point is play constrained to valid outputs for a given piece or board state. Nontrivial chess skill is therefore not straightforward to acquire, and if not for the surprising capabilities of modern large transformers, one might imagine such a model would fail to learn even the basic rules of playing chess. This blindfolded setting has also been studied by prior work [23, 30], as discussed further in section 5.
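Character-level tokenization of PGN strings and the construction of next-token targets can be sketched as follows. This is an illustration only: the vocabulary here is built from a single toy transcript, whereas the actual vocabulary and tokenizer are described in Appendix E, and the helper names are hypothetical.

```python
from typing import Dict, List, Tuple


def build_vocab(games: List[str]) -> Dict[str, int]:
    """Assign an integer id to every character that appears in the corpus."""
    symbols = sorted(set("".join(games)))
    return {ch: idx for idx, ch in enumerate(symbols)}


def encode(pgn: str, vocab: Dict[str, int]) -> List[int]:
    """Map a PGN string to a list of character ids."""
    return [vocab[ch] for ch in pgn]


def decode(ids: List[int], vocab: Dict[str, int]) -> str:
    """Invert encode()."""
    inverse = {idx: ch for ch, idx in vocab.items()}
    return "".join(inverse[i] for i in ids)


def next_token_pairs(ids: List[int]) -> Tuple[List[int], List[int]]:
    """Inputs and targets for next-token prediction: targets are inputs shifted by one."""
    return ids[:-1], ids[1:]


games = ["1.e4 e5 2.Nf3 Nc6 3.Bb5 a6 1/2-1/2"]  # toy transcript, not from the dataset
vocab = build_vocab(games)
ids = encode(games[0], vocab)
inputs, targets = next_token_pairs(ids)
```

Note that nothing in this encoding carries rating or board-state information; the model only ever sees the character sequence of moves and the result.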
One gap between our theory and practice is that in our theory, we assume that each expert is defined over the entire input space $\mathcal{X}$. However, in the chess setting such full coverage is extremely unlikely to be the case after around move , as there are more unique chess games than atoms in the universe due to the high branching factor of the game tree. To address this gap, we visualize the latent representation of our model in Figure 3, where we find the model is able to capture meaningful semantics regarding both the relative advantage of a state, as well as the identity of the black and white player. This visualization illustrates the ability of our model to generalize by compressing games into some shared latent representation, enabling experts to generalize to unseen states, bridging this gap between theory and practice.
Evaluation.
We evaluate each model by its Glicko-2 ratings against Stockfish 16.1 [29], a popular open-source chess engine. Stockfish uses a traditional minimax search equipped with a bespoke CPU-efficient neural network for evaluation [22] and $\alpha$-$\beta$ pruning for further efficiency. We evaluate Stockfish at levels 1, 3, and 5 with a 100ms timeout directly on Lichess’ platform against the Maia [18] 1, 5, and 9 bots (human behavior cloned convolutional networks trained at rating bins 1100-1200, 1500-1600, and 1900-2000, respectively) for several hundred games, obtaining calibrated Glicko-2 ratings for Stockfish specifically on Lichess’ platform (, , for Stockfish Levels 1, 3, and 5, respectively). Next, to evaluate our own models, we play against Stockfish levels 1, 3, and 5 for 100 games each, reaching a final rating calculation with 300 games. We then report both the Glicko-2 rating as well as rating deviation of our models, where provides a confidence interval. To play against Stockfish, we successively prompt our model with the current game PGN string. Note that our output is entirely unconstrained, and may be either illegal in the current board state or altogether unparsable. If our model fails to generate a valid legal move after 5 samples, we consider it to have lost. After generation, we give the updated board state to Stockfish and pass a new PGN string appended with the prior move of Stockfish back to our model. We repeat this process until the game ends.
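The move-generation rule described above, up to 5 samples and a loss if none is legal, can be sketched as a small retry loop. The callables `sample_move` and `is_legal` are hypothetical stand-ins for the model and for a chess rules checker; the stubs used in the example are not real engine or model calls.

```python
from typing import Callable, Iterable, Optional


def next_model_move(
    pgn: str,
    sample_move: Callable[[str], str],
    is_legal: Callable[[str, str], bool],
    max_attempts: int = 5,
) -> Optional[str]:
    """Sample up to max_attempts moves; return the first legal one, else None.

    A return value of None means the model forfeits the game.
    """
    for _ in range(max_attempts):
        candidate = sample_move(pgn)
        if is_legal(pgn, candidate):
            return candidate
    return None


def scripted_sampler(outputs: Iterable[str]) -> Callable[[str], str]:
    """Stub model that replays a fixed list of outputs, one per call."""
    iterator = iter(outputs)
    return lambda pgn: next(iterator)


def toy_is_legal(pgn: str, move: str) -> bool:
    """Stub legality check that accepts a few opening moves only."""
    return move in {"e4", "d4", "Nf3"}


recovered = next_model_move("1.", scripted_sampler(["Qz9", "e4"]), toy_is_legal)
forfeited = next_model_move("1.", scripted_sampler(["??"] * 5), toy_is_legal)
```

In the first call the second sample is legal and is returned; in the second call all five samples are rejected, which corresponds to a loss.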
4.2 Experimental Results
Main Result: Low-temperature sampling enables transcendence.
In this section we attempt to answer our primary research question: can low-temperature sampling actually induce transcendence in practice? We test Theorem 2 by evaluating several ChessFormers across different temperature values, from (nearly deterministic), to (original distribution), to (high entropy). In Figure 1 we definitively confirm the existence of transcendence. Our ChessFormer 1000 (where the latter number refers to the maximum rating seen during training) and ChessFormer 1300 models are able to transcend to around 1500 rating at temperature equal to . Interestingly, ChessFormer 1500 is unable to transcend at test time, a result that we further analyze in subsection 4.2.
To more deeply understand when and why transcendence occurs, we investigate two questions. (1) How does the reward function defined in Equation 2 shift with respect to low-temperature sampling? (2) Does transcendence rely on dataset diversity, as introduced theoretically in subsection 3.4?
Lowering temperature increases rewards in expectation on specific states, leading to transcendence over the full game.
When playing chess, a low-skilled player may play reasonably well until they make a significant blunder at a key point in play. If these errors are idiosyncratic, averaging across many experts would have a denoising effect, leaving the best moves with higher probability. Therefore, low-temperature sampling would move probability mass towards better moves in specific play contexts. Without low-temperature sampling, the model would still put probability mass onto blunders. To gain intuition for this idea, we visualize it theoretically in Appendix C and empirically in Figure 2 and Appendix B. This hypothesis motivates our first research question in this section: Does low-temperature sampling improve the expected reward very much for just some specific key game states, or a little for many game states?
To formalize this notion, we first define a “favor” function, which captures the improvement in reward by following some new probability distribution over some baseline probability distribution. Our definition is inspired by the Performance Difference Lemma (PDL) [10] from Reinforcement Learning (RL), which establishes an equivalence between the change in performance from following some new policy (a probability distribution of actions given a state) over some old policy, and the expected value of the advantage function of the old policy sampled with respect to the new policy. In RL, the advantage function is defined as the difference between the value of taking a single action in a given state versus the expected value of following some policy distribution of actions in that state.
Here, we define the “favor” of $f'$ over $f$ in $x$ as the change in the reward function by comparing what $f'$ would have done when following $f$ for a given input $x$:
$$F(f', f; x) = \mathbb{E}_{y \sim f'(\cdot \mid x)} \left[ r(x, y) \right] - \mathbb{E}_{y \sim f(\cdot \mid x)} \left[ r(x, y) \right], \quad x \sim d^{f} \tag{4}$$
where $d^{f}$ refers to the state visitation distribution [31] when following $f$ in a sequential setting. Informally, this variable can be thought of as the distribution of states seen when sampling from $f$ with a fixed transition function that takes in an input $x$ and an output $y$, and outputs a next input $x'$. Here, that transition function is given by the rules of chess and the opponent player. Given this favor function, we can now quantitatively explore the effects that lead to transcendence by setting the baseline $f$ to be the original imitation-learned probability distribution (temperature ), and $f'$ as a low-temperature intervention on $f$ (e.g. temperature ). We can empirically calculate the reward by using the evaluation function [22] of Stockfish, an expert neural reward function that Stockfish uses to calculate its next move. This reward function is a neural network trained to predict the probability of winning through a sigmoid on a linear combination of handcrafted expert heuristics, such as amount of material versus opponent material, and number of moves to a potential checkmate.
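For a single state, the favor reduces to a difference of two expected rewards. The sketch below computes it for one hypothetical state with three candidate moves; the reward values stand in for win probabilities and are not outputs of Stockfish, and the state is assumed to have been reached by following the baseline distribution.

```python
from typing import List


def expected_reward(dist: List[float], rewards: List[float]) -> float:
    """Reward averaged over moves drawn from dist."""
    return sum(p * r for p, r in zip(dist, rewards))


def favor(new_dist: List[float], base_dist: List[float], rewards: List[float]) -> float:
    """Change in expected reward from playing new_dist instead of base_dist in one state."""
    return expected_reward(new_dist, rewards) - expected_reward(base_dist, rewards)


# One hypothetical state with three candidate moves.
rewards = [0.75, 0.5, 0.25]      # assumed win probability after each move
base = [0.5, 0.25, 0.25]         # baseline move distribution
sharpened = [1.0, 0.0, 0.0]      # low-temperature limit of the baseline

gain = favor(sharpened, base, rewards)
baseline_gain = favor(base, base, rewards)
```

Here the sharpened distribution gains 0.1875 in expected reward over the baseline, while the favor of the baseline over itself is zero by construction. Averaging such per-state values over many visited states gives the distribution examined in this section.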
![Refer to caption](extracted/5697352/adv-gain-dist-flat.png)
In Figure 4, we find that lowering the temperature has the effect of skewing the expected reward distribution to the left, especially for the green distribution. This result implies that the model does not improve the expected reward by a small amount for many game states, but rather improves the expected reward by a relatively large amount for a few game states. Thus, the low-temperature intervention improves the expected reward (probability of winning) by an average of , but for some states, this expected improvement is over 5%. Note that the original temperature expected reward can be thought of as a Dirac distribution centered at 0. The above finding answers our research question in this section: low-temperature sampling is able to improve the expected reward by relatively large amounts for some specific game states, which is likely why the ChessFormer 1000 and 1300 models were able to achieve transcendence.
Temperature | Top 1 (%) | Top 3 (%) | Top 5 (%) | ||
---|---|---|---|---|---|
In Table 1, we present the statistics of the favor function for different temperature values. From this table, we observe that as the temperature decreases, the top-$k$ accuracies monotonically increase, suggesting that the model becomes more consistent in selecting good moves. We also observe that although the model improves as temperature decreases, the probability of winning is still below , meaning our model should tend to lose more games than it wins against Stockfish . This result matches our results in Figure 1, as the rating of Stockfish is also higher than the reported rating for ( for Stockfish 1 vs for ChessFormer ). Overall, the analysis of the advantage statistics provides further evidence for the effectiveness of low-temperature sampling in inducing transcendence in chess models.
Dataset diversity is essential for transcendence.
As we note in subsection 3.4, our theory requires dataset diversity as a necessary condition for enabling transcendence. Importantly, we find in Figure 1 that not all models are able to transcend. Unlike ChessFormer 1000 or 1300, ChessFormer 1500 fails to transcend. We hypothesize that this result is due to the fact that in the band of ratings from to , diversity does not significantly increase. If so, a rated player can be thought of as a noisy rated player, but a rated player cannot be thought of as a noisy rated player. In this section we ask the following research question: Is diversity in data required for enabling transcendence?
In Figure 5, we explore this research question by quantifying dataset diversity through the normalized entropy of the action distribution, $\mathcal{H}_f(Y \mid x) = \mathbb{E}_{y \sim f(\cdot \mid x)} \left[ -\log_2 f(y \mid x) \right] / \log_2 |\mathcal{Y}|$. To gain intuition for this metric, imagine the action distribution of moves taken for any given state. Entropy will be higher for more uniform action distributions, and lower for more deterministic, peaked action distributions. The average entropy of these action distributions can therefore serve as a measurement of the diversity of the dataset. We normalize this entropy to the range $[0, 1]$ by dividing by the binary log of the number of legal moves $|\mathcal{Y}|$: $\log_2 |\mathcal{Y}|$.
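The normalized entropy of the moves observed in one state can be computed as in the sketch below. The move lists and the legal-move counts are hypothetical, and the function name is not taken from our codebase.

```python
import math
from collections import Counter
from typing import List


def normalized_entropy(moves: List[str], n_legal: int) -> float:
    """Entropy (in bits) of the observed moves, divided by log2 of the legal-move count."""
    if n_legal < 2:
        return 0.0  # a forced move carries no diversity
    counts = Counter(moves)
    total = len(moves)
    entropy = sum(-(c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(n_legal)


# Hypothetical states: moves played by different players, and legal-move counts.
split_state = normalized_entropy(["e4", "e4", "d4", "d4"], n_legal=4)
peaked_state = normalized_entropy(["e4", "e4", "e4"], n_legal=20)
uniform_state = normalized_entropy(["a", "b", "c", "d"], n_legal=4)
```

A state where every player makes the same move scores 0, a state where players are spread evenly over all legal moves scores 1, and averaging these values over common states gives one diversity number per dataset.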
Importantly, we cannot calculate this normalized entropy for every state, as most states after move in the midgame and before the endgame are unique within the dataset, and we therefore observe just a single action for those states. Therefore our metric is limited in that it only considers opening moves, the beginning of the midgame, and the endgame. We consider only common states with greater than actions by sampling games from each dataset. The average entropies confirm our hypothesis: the 1500 cutoff dataset has on average less diversity than the 1300 dataset, which in turn has less than the 1000 dataset. If the entropy instead stayed constant for each dataset, it would imply that each had a similar level of diversity. In such a case, we would expect that ChessFormer 1500 likely would also transcend. Instead, as predicted, ChessFormer 1500 likely is not transcendent due to a lack of diversity in its dataset.
![Refer to caption](x2.png)
5 Related Work
Chess and AI.
Chess has been motivating AI research since the field began. In 1950, before anyone had used the term “artificial intelligence”, automated chess was explored by both Claude Shannon [26] and Alan Turing [32]. Arguably, this history goes back even further: the famed “Mechanical Turk” of the 18th century was a fraudulently automated chess player. These centuries of mechanical ambitions were finally realized in 1997, when world champion Garry Kasparov was defeated by IBM’s Deep Blue [3]. Since then, chess program developers have drawn on neural approaches, with the RL-based convolutional network AlphaZero [27] far surpassing prior world champion engines such as Stockfish [25]. Our chess model testbed is inspired by a number of existing approaches, including other models trained on lichess data [18], and other transformer-based sequential chess agents [23, 5].
Diversity beats Strength.
Another historical thread in AI research is the strength of diverse learners. Since the development of ensemble methods that exploit learner diversity, including bagging [1], boosting [6], and model averaging [19], researchers have continued to articulate this insight across settings. Similar to our chess setting, a diverse team of Go-playing agents has been proven and empirically shown to outperform solitary agents [9] and homogeneous teams [28], even when the alternative models individually outperform the diverse team members [17]. We draw a connection to this deep literature through our theoretical results, which show that training on just the imitation learning objective and then performing low-temperature sampling subtly implies the same principle of majority voting used in this literature.
Teacher diversity has also been explored in the machine learning literature. One related method is ensemble distillation [16], in which a model is trained with an additional objective to match a variety of weaker teacher models. Closer to our setting, ensemble self-training approaches [24] train a learner directly on the labels produced by varied teachers. Large language models supervised by smaller or less trained models are said to exhibit “weak to strong generalization” [2]. Overall, evidence continues to accrue that the general phenomenon we address is pervasive: that is, models can substantially improve over the experts that generate their training data.
Offline Reinforcement Learning.
Our work also draws connections to the Offline Reinforcement Learning [14] setting, where one attempts to learn a new policy that improves upon a fixed dataset generated by some behavior policy. However, our setting of imitation learning differs substantially from this literature, as we do not explicitly train our model on an RL objective that attempts to improve upon the dataset. Importantly, such an objective oftentimes introduces training instabilities [15] and also assumes reward labels, both of which are avoided with a pure imitation or self-supervised learning objective. We defer a more extended discussion of related work to Appendix D.
6 Discussion and Future Work
This paper introduces the concept of transcendence, where generative models trained on expert data outperform the best individual experts. Our theoretical analysis shows that low-temperature sampling is key to achieving transcendence by denoising expert biases and consolidating diverse knowledge. We validate our findings empirically by training several chess models which, under low-temperature sampling, surpass the performance of the players who produced their training data. We highlight the necessity of dataset diversity for transcendence, emphasizing the role of varied expert perspectives.
Limitations.
While our work provides a strong foundation for understanding and achieving transcendence in generative models, several avenues for future research remain. Future work may investigate transcendence and its causes in domains and contexts beyond chess, such as natural language processing, computer vision, and text-to-video, to understand the generalizability of our findings. Additionally, our theoretical framework assumes that game conditions at test time match those seen during training; in order to extend our findings to cases of composition or reasoning, we must forgo this assumption.
Future Work.
Future work could also explore the practical implementations of transcendence, and ethical considerations in the broader context of deployed generative models. Ultimately, our findings lay the groundwork for leveraging generative models to not only match but exceed human expertise across diverse applications, pushing the theoretical boundaries of what generative models can achieve.
Broader Impact.
The possibility of “superintelligent” AGI has recently fueled many speculative hopes and fears. It is therefore possible that our work will be cited by concerned communities as evidence of a threat, but we would highlight that the denoising effect addressed in this paper does not offer any evidence for a model being able to produce novel solutions that a human expert would be incapable of devising. In particular, we do not present evidence that low temperature sampling leads to novel abstract reasoning, but rather denoising of errors.
Acknowledgements
Sham Kakade acknowledges this work has been made possible in part by a gift from the Chan Zuckerberg Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial Intelligence; support from the Office of Naval Research under award N00014-22-1-2377, and the National Science Foundation Grant under award #IIS 2229881.
References
- Breiman [1996] L. Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996. URL https://api.semanticscholar.org/CorpusID:47328136.
- Burns et al. [2023] C. Burns, P. Izmailov, J. H. Kirchner, B. Baker, L. Gao, L. Aschenbrenner, Y. Chen, A. Ecoffet, M. Joglekar, J. Leike, et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390, 2023.
- Campbell et al. [2002] M. Campbell, A. J. Hoane, and F.-h. Hsu. Deep Blue. Artificial Intelligence, 134(1):57–83, Jan. 2002. ISSN 0004-3702. doi: 10.1016/S0004-3702(01)00129-1.
- Chen et al. [2021] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch. Decision transformer: Reinforcement learning via sequence modeling, 2021.
- Feng et al. [2023] X. Feng, Y. Luo, Z. Wang, H. Tang, M. Yang, K. Shao, D. Mguni, Y. Du, and J. Wang. ChessGPT: Bridging Policy Learning and Language Modeling. Advances in Neural Information Processing Systems, 36:7216–7262, Dec. 2023.
- Freund and Schapire [1999] Y. Freund and R. E. Schapire. A short introduction to boosting, 1999. URL https://api.semanticscholar.org/CorpusID:9621074.
- Glickman [2012] M. E. Glickman. Example of the Glicko-2 system. Boston University, 28, 2012.
- Janner et al. [2021] M. Janner, Q. Li, and S. Levine. Offline reinforcement learning as one big sequence modeling problem, 2021.
- Jiang et al. [2014] A. Jiang, L. Soriano Marcolino, A. D. Procaccia, T. Sandholm, N. Shah, and M. Tambe. Diverse randomized agents vote to win. Advances in Neural Information Processing Systems, 27, 2014.
- Kakade and Langford [2002] S. M. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 2002. URL https://api.semanticscholar.org/CorpusID:31442909.
- Karpathy [2022] A. Karpathy. NanoGPT. https://github.com/karpathy/nanoGPT, 2022.
- Karvonen [2024] A. Karvonen. Emergent world models and latent variable estimation in chess-playing language models. arXiv preprint arXiv:2403.15498, 2024.
- Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Levine et al. [2020] S. Levine, A. Kumar, G. Tucker, and J. Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
- Li et al. [2023] J. Li, E. Zhang, M. Yin, Q. Bai, Y.-X. Wang, and W. Y. Wang. Offline reinforcement learning with closed-form policy improvement operators. In International Conference on Machine Learning, pages 20485–20528. PMLR, 2023.
- Lin et al. [2020] T. Lin, L. Kong, S. U. Stich, and M. Jaggi. Ensemble distillation for robust model fusion in federated learning. Advances in Neural Information Processing Systems, 33:2351–2363, 2020.
- Marcolino et al. [2013] L. S. Marcolino, A. X. Jiang, and M. Tambe. Multi-agent team formation: Diversity beats strength? In IJCAI, volume 13, 2013.
- McIlroy-Young et al. [2020] R. McIlroy-Young, S. Sen, J. Kleinberg, and A. Anderson. Aligning Superhuman AI with Human Behavior: Chess as a Model System. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1677–1687, Aug. 2020. doi: 10.1145/3394486.3403219.
- McMahan et al. [2023] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data, 2023.
- Mnih et al. [2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Munos et al. [2016] R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare. Safe and efficient off-policy reinforcement learning. Advances in Neural Information Processing Systems, 29, 2016.
- Nasu [2018] Y. Nasu. Efficiently Updatable Neural-Network-based Evaluation Functions for Computer Shogi, 2018.
- Noever et al. [2020] D. Noever, M. Ciolino, and J. Kalin. The Chess Transformer: Mastering Play using Generative Language Models, Sept. 2020.
- Odonnat et al. [2024] A. Odonnat, V. Feofanov, and I. Redko. Leveraging ensemble diversity for robust self-training in the presence of sample selection bias, 2024.
- Pete [2018] Pete. AlphaZero Crushes Stockfish In New 1,000-Game Match. https://www.chess.com/news/view/updated-alphazero-crushes-stockfish-in-new-1-000-game-match, Dec. 2018.
- Shannon [1950] C. E. Shannon. XXII. Programming a computer for playing chess. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 41(314):256–275, Mar. 1950. ISSN 1941-5982, 1941-5990. doi: 10.1080/14786445008521796.
- Silver et al. [2017] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, Dec. 2017.
- Soriano Marcolino et al. [2014] L. Soriano Marcolino, H. Xu, A. Xin Jiang, M. Tambe, and E. Bowring. Give a Hard Problem to a Diverse Team: Exploring Large Action Spaces. Proceedings of the AAAI Conference on Artificial Intelligence, 28(1), June 2014. ISSN 2374-3468, 2159-5399. doi: 10.1609/aaai.v28i1.8880.
- The Stockfish developers [2024] The Stockfish developers (see AUTHORS file). Stockfish, 2024. URL https://stockfishchess.org/.
- Toshniwal et al. [2022] S. Toshniwal, S. Wiseman, K. Livescu, and K. Gimpel. Chess as a Testbed for Language Model State Tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11385–11393, June 2022. doi: 10.1609/aaai.v36i10.21390.
- Touati et al. [2020] A. Touati, A. Zhang, J. Pineau, and P. Vincent. Stable policy optimization via off-policy divergence regularization. In Conference on Uncertainty in Artificial Intelligence, pages 1328–1337. PMLR, 2020.
- Turing [2004] A. Turing. Chess (1953). In B. J. Copeland, editor, The Essential Turing, page 0. Oxford University Press, Sept. 2004. ISBN 978-0-19-825079-1. doi: 10.1093/oso/9780198250791.003.0023.
- Uma et al. [2021] A. N. Uma, T. Fornaciari, D. Hovy, S. Paun, B. Plank, and M. Poesio. Learning from disagreement: A survey. Journal of Artificial Intelligence Research, 72:1385–1470, 2021.
- Van der Maaten and Hinton [2008] L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
- Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- Xie et al. [2020] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le. Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10687–10698, 2020.
- Zhang et al. [2022] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
Appendix A Proofs
Here we prove Theorem 1, which states that transcendence cannot occur purely through imitation learning in our setting, where all experts are sampled uniformly across the input distribution.
Proof.
From linearity of the expectation
∎
We now give the proof of Theorem 2 that if the arg-max prediction is better than the best expert, then transcendence is possible with low-temperature sampling.
Proof.
Observe that for all , it holds that . Therefore, for all
and so,
Therefore, the required result follows immediately. ∎
Proof.
Notice that for this expert, , which achieves higher reward compared to . Therefore, Theorem 2 implies that we achieve transcendence in the setting where all the data is generated by a single expert . ∎
Finally, we give the proof of Theorem 4, which states that transcendence can occur from multiple experts if the test distribution is spread across multiple disjoint subsets of .
Proof.
In this case, observe that for all
Therefore, we get that for all
Thus, we get , and the required follows from Theorem 2.
∎
Appendix B Additional Denoising Visualizations
![Refer to caption](extracted/5697352/denoising_viz_1.png)
![Refer to caption](extracted/5697352/denoising_viz_2.png)
![Refer to caption](extracted/5697352/denoising_viz_3.png)
Appendix C Intuition of low temperature sampling inducing transcendence
To build intuition for the primary mechanism of transcendence that we explore in this paper, we give the following toy progression of distributions to illustrate how low-temperature sampling can induce transcendence through majority voting. Here, the middle purple action represents the correct, high-reward output, whilst the left and right actions are low-reward, bad outputs. We plot the probability of each output as a label on the x-axis.
![Refer to caption](extracted/5697352/intuition1.png)
![Refer to caption](extracted/5697352/intuition2.png)
![Refer to caption](extracted/5697352/intuition3.png)
![Refer to caption](extracted/5697352/intuition4.png)
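The same progression can be mimicked numerically. The following is a minimal sketch rather than the code used in the paper: the two expert distributions, the reward vector, and the helper names `temperature_softmax` and `expected_reward` are hypothetical choices made for illustration. It assumes a learner that exactly recovers the even mixture of two experts and then sharpens that mixture by dividing its log-probabilities by a temperature.

```python
import numpy as np

def temperature_softmax(probs, tau):
    """Sharpen (tau < 1) or flatten (tau > 1) a distribution via softmax(log p / tau)."""
    logits = np.log(np.asarray(probs, dtype=float))
    scaled = logits / tau
    scaled -= scaled.max()  # subtract the max for numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()

def expected_reward(probs, rewards):
    """Expected reward of sampling one action from `probs`."""
    return float(np.dot(probs, rewards))

# Hypothetical toy problem: actions are (left, middle, right); only the middle is rewarded.
rewards = np.array([0.0, 1.0, 0.0])

# Two hypothetical experts whose most likely action is wrong, in opposite directions.
expert_a = np.array([0.50, 0.45, 0.05])
expert_b = np.array([0.05, 0.45, 0.50])

# Assumption: imitation learning on an even mix of the experts recovers their average.
mixture = 0.5 * (expert_a + expert_b)

best_expert = max(expected_reward(expert_a, rewards), expected_reward(expert_b, rewards))
print(f"best single expert reward: {best_expert:.3f}")
for tau in (1.0, 0.5, 0.1):
    sharpened = temperature_softmax(mixture, tau)
    print(f"tau={tau}: probs={np.round(sharpened, 3)}, "
          f"reward={expected_reward(sharpened, rewards):.3f}")
```

In this toy setup each expert's own most likely action is a bad one, so sharpening a single expert would not help. Only the averaged distribution has the high-reward action as its arg-max, which is why lowering the temperature on the mixture raises the expected reward above that of either expert, while at temperature 1 the mixture only matches their average.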
Appendix D Further Related Work
D.1 Label Disagreement
Label disagreement in training data, in particular, can improve models in practice. Xie et al. [36] empirically show that adding random noise to teacher-generated labels can improve a student model. Uma et al. [33] even survey the literature on human interannotator disagreement and find a trend of improvements when models are trained on the full set of disagreeing labels rather than on majority vote labels or only on data where labelers agree. Our theoretical claims build on these findings by making the point that the learner can even improve on these original diverse labelers.
D.2 Offline Reinforcement Learning
Although most Offline Reinforcement Learning algorithms train on an RL objective, perhaps most similar to our work are Decision Transformer [4] and Trajectory Transformer [8]: prior models trained purely on sequence prediction of trajectories. Most notably, Decision Transformer also finds a form of transcendence different from the one explored in this paper: by conditioning the trained transformer on the performance of the trajectory, the model can be prompted at inference time to perform better than the best trajectory seen during training. This remains another promising direction in which to explore transcendence.
Interestingly, an analogue to low-temperature sampling has also been noticed and exploited by Reinforcement Learning practitioners in the context of off-policy learning, where the exploration policy differs from the final learned target policy. Oftentimes the target policy is simply set to a greedy version of the exploration policy [21], such as always taking the arg-max action, which we note is directly equivalent to setting the temperature to 0.
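The correspondence between a greedy policy and zero temperature can be checked with a short sketch. It assumes, for illustration only, that the exploration policy is a Boltzmann (softmax) policy over a hypothetical vector of action values; the names `boltzmann_policy`, `greedy_policy`, and `q_values` are not taken from the paper or from [21].

```python
import numpy as np

def boltzmann_policy(q_values, tau):
    """Softmax over action values with temperature tau > 0."""
    q = np.asarray(q_values, dtype=float)
    scaled = (q - q.max()) / tau  # shift by the max for numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()

def greedy_policy(q_values):
    """Put all probability mass on the arg-max action."""
    q = np.asarray(q_values, dtype=float)
    probs = np.zeros_like(q)
    probs[np.argmax(q)] = 1.0
    return probs

q_values = [1.0, 2.5, 2.0]  # hypothetical action values
for tau in (1.0, 0.1, 0.01):
    gap = np.abs(boltzmann_policy(q_values, tau) - greedy_policy(q_values)).max()
    print(f"tau={tau}: max deviation from greedy = {gap:.2e}")
```

As the temperature shrinks, the softmax policy concentrates on the arg-max action, so the greedy target policy is the limiting case of low-temperature sampling from the exploration policy.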
Appendix E Training Details
We list the full set of training hyperparameters below. We largely follow the hyperparameter set of [37], but use a smaller batch size (the mini-batch size listed in the table below), as we found training to remain stable at this level. We also release our code openly to support further research into transcendence; it builds on the wonderful work done by Karvonen [12] and Karpathy [11].
ChessFormer Hyperparameter | Value
---|---
Optimizer | AdamW [13]
Activation Function | ReLU
Mini-batch size | 125K tokens
Gradient Accumulation Steps | 1
Transformer num. layers | 16
Transformer num. heads | 8
Transformer embedding dim. | 512
Dropout | 0.0
Learning Rate | 3e-4
Number of gradient steps | 100K
Weight Decay | 0.1
Critic hidden layers | 3
Adam $\beta_1$ | 0.90
Adam $\beta_2$ | 0.95
Gradient Clip | 1.0
Cosine Learning Rate | True
Warmup Iterations | 2000
Minimum Learning Rate | 3e-5
Learning Rate Decay Iterations | 400K
Tensor datatype | bfloat16
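To make the learning-rate entries in the table concrete, the following sketch shows one common way such values are combined: linear warmup followed by cosine decay to a floor, in the style of nanoGPT-like training scripts. Whether the released code uses exactly this form is an assumption; only the four constants come from the table, and the function name `get_lr` is illustrative.

```python
import math

# Values taken from the hyperparameter table above.
LEARNING_RATE = 3e-4
MIN_LR = 3e-5
WARMUP_ITERS = 2000
LR_DECAY_ITERS = 400_000

def get_lr(step):
    """Linear warmup, then cosine decay from LEARNING_RATE down to MIN_LR (assumed form)."""
    if step < WARMUP_ITERS:
        # Ramp linearly from 0 up to the peak learning rate.
        return LEARNING_RATE * step / WARMUP_ITERS
    if step > LR_DECAY_ITERS:
        # Past the decay horizon, hold the learning rate at its floor.
        return MIN_LR
    # Fraction of the decay horizon completed, in [0, 1].
    decay_ratio = (step - WARMUP_ITERS) / (LR_DECAY_ITERS - WARMUP_ITERS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes from 1 to 0
    return MIN_LR + coeff * (LEARNING_RATE - MIN_LR)

for step in (0, 1000, 2000, 100_000, 400_000):
    print(step, get_lr(step))
```

Under this assumed schedule, a run of 100K gradient steps ends well before the 400K-iteration decay horizon, so the learning rate at the final step would still be far above the listed minimum.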
Appendix F Compute Resources
We train all of our models on an Nvidia H100 80GB GPU. Training one model takes around 6 to 12 hours.
Appendix G Full t-SNE
We visualize the full t-SNE here, coloring by the reward of the game. We see that the model has learned some representation of the reward, with states of high absolute reward more likely to be near each other in the latent space. This also suggests that the model has learned some sort of equivariant representation of the player identity, as the regions of symmetric high-reward states indicate. Note that reward is not directly given to the model during training.
We visualize the same t-SNE, this time coloring by game length rather than reward. We see that games with high reward tend to be longer, which makes sense, as the result of the game tends to become clearer as the game progresses.