Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing

Ye Tian, Baolin Peng*, Linfeng Song*, Lifeng **, Dian Yu, Haitao Mi, Dong Yu
Tencent AI Lab, Bellevue, WA
{yaptian,baolinpeng,lfsong,lifeng**,yudian,haitaomi}@global.tencent.com

Equal Contribution; †Corresponding Author
Abstract

Despite the impressive capabilities of Large Language Models (LLMs) on various tasks, they still struggle with scenarios that involve complex reasoning and planning. Recent work has proposed advanced prompting techniques and emphasized the necessity of fine-tuning with high-quality data to augment LLMs’ reasoning abilities. However, these approaches are inherently constrained by data availability and quality. In light of this, self-correction and self-learning emerge as viable solutions, employing strategies that allow LLMs to refine their outputs and learn from self-assessed rewards. Yet, the efficacy of LLMs in self-refining their responses, particularly in complex reasoning and planning tasks, remains dubious. In this paper, we introduce AlphaLLM for the self-improvement of LLMs, which integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop, thereby enhancing the capabilities of LLMs without additional annotations. Drawing inspiration from the success of AlphaGo, AlphaLLM addresses the unique challenges of combining MCTS with LLMs for self-improvement, including data scarcity, the vastness of the search spaces of language tasks, and the subjective nature of feedback in language tasks. AlphaLLM comprises a prompt synthesis component, an efficient MCTS approach tailored for language tasks, and a trio of critic models for precise feedback. Our experimental results on mathematical reasoning tasks demonstrate that AlphaLLM significantly enhances the performance of LLMs without additional annotations, showing the potential for self-improvement in LLMs.

1 Introduction

LLMs, trained on trillions of tokens with billions of parameters, have shown unparalleled capabilities in a wide range of natural language processing tasks (Touvron et al., 2023b; Team et al., 2023; OpenAI, 2023). Nevertheless, they continue to face challenges in scenarios requiring complex reasoning and strategic planning (Valmeekam et al., 2022; Stechly et al., 2024). While advanced prompting approaches such as Chain, Tree, Graph-of-Thought (Wei et al., 2022; Yao et al., 2024; Besta et al., 2024; Ding et al., 2023), which generate intermediate steps in the reasoning process, demonstrate large improvements in the reasoning capability of LLMs, it remains essential to fine-tune LLMs using a substantial volume of high-quality, supervised data to fundamentally improve model performance (Nye et al., 2021; Lewkowycz et al., 2022; Chung et al., 2022). This methodology is inherently limited by the scope and quality of data that humans can provide.

Considering these challenges, the concepts of self-correction and self-learning have been proposed as promising solutions (Madaan et al., 2024; Saunders et al., 2022; Chen et al., 2024). Within these frameworks, LLMs typically operate by employing two main strategies: 1) they continuously refine their responses based on the feedback of their past responses, and 2) they extensively sample responses and then learn from preferences that they judge themselves, acting as reward models, with PPO or DPO (Yuan et al., 2024a, b; Chen et al., 2024). However, it remains a matter of ongoing research whether LLMs can effectively critique their own outputs to either enhance response quality or apply a scalar reward to indicate the quality of responses, especially in contexts demanding intricate planning and reasoning (Valmeekam et al., 2022; Stechly et al., 2024; Huang et al., 2023; Hong et al., 2023). On the other hand, advanced search algorithms such as Monte Carlo Tree Search (MCTS), combined with reinforcement learning, have enabled models to learn from self-play and achieve human parity or even surpass human performance in complex tasks such as the game of Go (Silver et al., 2016, 2017). This naturally raises a question: is it viable to leverage the strengths of MCTS alongside LLMs to inaugurate a novel paradigm of self-improvement? More precisely, could the assimilation of MCTS empower LLMs to more effectively explore better responses, guided by strategic signals, and subsequently optimize these responses to enhance overall performance?

To answer this question, we begin with a systematic examination of AlphaGo, identifying three critical aspects for its success: (i) The large volume of expert and self-play data; imitation learning on expert data enables it to simulate human-like strategies, and reinforcement learning on self-play data fosters the emergence of novel tactics that surpass human capabilities (Clark & Storkey, 2015). (ii) The use of tree search, which facilitates the exploration of potential moves through statistical sampling of the large search space. This approach allows AlphaGo to effectively identify and simulate the most promising strategies, thereby making highly informed decisions in the complex and vast decision space (Silver et al., 2016). (iii) Accurate and unambiguous environment feedback; the direct and accurate feedback (win or loss) provided by the game of Go offers a clear and unequivocal learning signal (Silver et al., 2017). The integration of MCTS with LLMs for self-improvement faces several corresponding challenges: (i) Limited Data: High-quality annotated data for LLMs is generally scarce. Furthermore, how to construct synthetic data for LLM training, similar to AlphaGo’s self-play data, remains unclear. (ii) Search Efficiency: The vast number of potential token combinations in natural language tasks results in an exponentially large search space, posing a significant challenge to the efficiency of MCTS (Ramamurthy et al., 2022). (iii) Imperfect Feedback: In contrast to the clear win/loss feedback in Go, feedback in natural language tasks is often subjective and nuanced, without a straightforward measure of success.

Figure 1: Imagination-Searching-Criticizing self-improvement loop: Imagination component synthesizes prompts as new learning examples, with MCTS searching better trajectories guided by signals from critics for policy improving.

In this paper, we introduce AlphaLLM, an imagination-searching-criticizing framework designed for the self-improvement of LLMs. AlphaLLM consists of three key components, as illustrated in Figure 1. First, an imagination component is designed to synthesize prompts, alleviating the issue of data scarcity. Second, we propose $\eta$Mcts, tailored for efficient searching in language tasks. In particular, it has been shown that planning at multiple levels of temporal abstraction is critical for RL problems with a long horizon and large action space (Sutton et al., 1999b; Peng et al., 2017; Luketina et al., 2019). As such, we propose formulating the text generation process as options over a Markov Decision Process (MDP) problem, where each option represents the generation of a collection of tokens for a specific subtask, similar to the concept of chains in chain-of-thought prompting. This formulation improves search efficiency by substantially reducing the search depth. Additionally, we propose the use of state fusion and adaptive branching factors to further enhance search efficiency by balancing the trade-off between search width and depth. Lastly, since accurate feedback is crucial to the success of MCTS, we introduce a trio of critic models to guide $\eta$Mcts, including a value function for estimating future rewards, a process reward model for assessing node correctness, and an outcome reward model for evaluating the overall trajectory. For complex tasks that LLMs struggle to assess, such as arithmetic computation and code execution, we ensure the accuracy of feedback by augmenting the critics with the capacity to make dynamic decisions on which tools to use, when to use them, and how to use them effectively. After the $\eta$Mcts stage, we collect the trajectory with the largest reward from the critic models as the training examples to improve LLMs.

The experimental results on mathematical reasoning tasks demonstrate that AlphaLLM can efficiently search for better responses and use them to improve LLMs’ performance, forming an effective self-improving loop. Notably, based on LLaMA-2 70B, AlphaLLM can improve its performance from 57.8 to 92.0 on GSM8K and from 20.7 to 51.0 on MATH, performing comparably to GPT-4. In summary, our contributions are threefold:

  • We examine the inherent challenges in harnessing AlphaGo’s self-learning algorithms for LLMs, which are data scarcity, the complexity of search spaces, and the nuanced nature of feedback.

  • We introduce AlphaLLM, an imagination-searching-criticizing framework that integrates MCTS with LLMs, enabling them to self-improve without the need for additional annotations.

  • Experiments on mathematical reasoning problems show that, by employing AlphaLLM, we can significantly enhance the performance of LLaMA-2 70B, elevating it to levels comparable with GPT-4 on the GSM8K and MATH datasets when $\eta$Mcts decoding is utilized.

2 Related Work

Search with LLM

An effective search strategy has been shown to be crucial for tasks that involve complex reasoning and planning, such as Go (Silver et al., 2016) and math reasoning (Cobbe et al., 2021; Hendrycks et al., 2021). For math reasoning tasks, various search methods have been studied. One direction of research (Zhu et al., 2024; Xie et al., 2024) designed beam search with dynamic pruning, where beam items of low quality are pruned. Another line of work (Yao et al., 2024; Long, 2023; Besta et al., 2024; Hao et al., 2023; Feng et al., 2023) maintains a tree or a graph that represents the current progress of solving the input question, where potential branches are iteratively expanded. Both our approach and Feng et al. (2023) are based on the MCTS algorithm; one main difference is how a search step is defined: Feng et al. (2023) fix a search step to be either a token or a sentence, while our approach is more flexible in deciding steps. More importantly, we also study how to leverage MCTS for effective self-improvement. We also design the MCTS process more carefully; for example, we merge multiple critique signals to effectively guide the search process. As a result, our approach achieves much better performance than Feng et al. (2023).

LLM Self-improving

As a key to the success of scalable oversight (Bowman et al., 2022), self-improving for LLMs aims to align the LLM with human preferences and values, mainly using supervision from the knowledge inside the LLM. One crucial part of self-improving is how to obtain a reliable critique signal to distinguish good responses of the LLM from bad ones. Initial work (Bai et al., 2022; Wang et al., 2022) first asks the LLM to generate input queries of diverse tasks and the corresponding outputs. It then relies on hand-crafted heuristic rules to filter out redundant or low-quality data pairs (e.g., the query is too long or too short). Since it is non-trivial to compose effective heuristic rules, later work (Sun et al., 2023; Li et al., 2023; Guo et al., 2024) proposes a few general principles or judging criteria and asks the LLM itself to evaluate the quality of its responses based on this guidance. The hope is that the LLM can automatically apply these principles to each data point to better guide data filtering. However, this requires the LLM to have strong abilities to apply these principles to each specific case and make correct judgements. Different from previous work, we propose to leverage the supervision from MCTS for LLM self-improvement: taking the outputs of MCTS to continue training the LLM. This is because the outputs from MCTS are usually of much better quality than those of standard nucleus sampling, and the large gap ensures that the LLM can self-improve.

Another line of research explores cheaply available knowledge. Some work (Saunders et al., 2022; Wang et al., 2023b) collects large-scale critique data from question-and-answer websites (e.g., Stack Exchange) for continued pretraining, while other work (Gou et al., 2023a) utilizes external tools to provide more fine-grained guidance. The goal of both directions is to enhance the critique ability of the LLM for self-improving. Our approach based on MCTS is intuitively orthogonal to this line of research.

3 Preliminaries

3.1 Problem Formulation

In this paper, we consider an LLM characterized by probability $p_{\theta}$ and denoted as policy $\pi_{\theta}$. It takes a sequence $\bm{x}=[x_{1},\cdots,x_{n}]$ as input, which is typically referred to as the prompt, to generate the response $\bm{y}=[y_{1},\cdots,y_{m}]$. The response $\bm{y}$ can be viewed as a sample from the conditional probability distribution $p_{\theta}(\cdot|\bm{x})$. In the context of LLMs, each $x_{i}$ and $y_{i}$ represents a token from a pre-defined vocabulary. The policy $\pi_{\theta}$ operates in an autoregressive manner, where each token is generated sequentially, relying solely on the context provided by the previously generated tokens. The policy therefore constitutes a Markov process in which the conditional probability distribution $p_{\theta}(\bm{y}|\bm{x})$ can be decomposed and expressed with the chain rule:

$$p_{\theta}(\bm{y}|\bm{x})=\prod_{i=1}^{m}p_{\theta}(y_{i}|\bm{x},\bm{y}_{<i})$$
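
As a concrete instance of this factorization (an illustrative expansion added here, not a new result), a response of length $m=3$ decomposes as:

```latex
p_{\theta}(\bm{y}|\bm{x})
  = p_{\theta}(y_{1}|\bm{x})\,
    p_{\theta}(y_{2}|\bm{x},y_{1})\,
    p_{\theta}(y_{3}|\bm{x},y_{1},y_{2})
```

Each factor conditions only on the prompt and on the tokens generated before it.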

With this property, the text generation task can be formulated as a Markov Decision Process (MDP) problem consisting of $(\mathcal{S},\mathcal{A},T,R,\gamma)$, in which:

  • State $\bm{s}_{t}\in\mathcal{S}$: Represents the context information of the current trajectory, i.e., the current status of the generation process, e.g., a partial response to a prompt. The initial state $s_{0}$ corresponds to the original prompt.

  • Action $a_{t}\in\mathcal{A}$: Denotes a single action or sampled token from the vocabulary, leading to a transition to a new state $\bm{s}_{t+1}$, by concatenating $\bm{s}_{t}$ and $a_{t}$.

  • Reward $r_{t}=R(\bm{s}_{t},a_{t})$: Manifests the evaluation of the generation with respect to the prompt, reflecting the desirability or preferences of each state-action pair, such as whether the actions follow the instructions in the prompt.

$\gamma$ denotes the discount factor, while $T$ here signifies the transition probability function. We omit its detailed description, as the transition is deterministic in the text generation environment.
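
The formulation above can be illustrated with a minimal Python sketch. The names `State`, `transition`, and `discounted_return` are hypothetical and introduced only for illustration; the discounted sum is the standard RL definition of cumulative reward, assumed here rather than spelled out in the text.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class State:
    """A state s_t: the prompt plus the tokens generated so far."""
    prompt: Tuple[str, ...]
    response: Tuple[str, ...] = ()


def transition(state: State, action: str) -> State:
    """Deterministic transition T: s_{t+1} is s_t concatenated with a_t."""
    return State(state.prompt, state.response + (action,))


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Standard discounted sum of rewards: sum over t of gamma^t * r_t."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# The initial state s_0 is the prompt; one action appends one token.
s0 = State(prompt=("What", "is", "2+3", "?"))
s1 = transition(s0, "5")
```

Because the transition is a pure concatenation, the same state and action always yield the same next state, which is why $T$ needs no further description.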

This MDP framework sets the stage for applying Reinforcement Learning (RL) methods to optimize the policy $\pi_{\bm{\theta}}$, aiming to maximize the expected cumulative reward $R$. Based on this setup, we describe the self-improving problem. Given an LLM $\pi_{\bm{\theta}}$ and an initial dataset $\mathcal{D}^{0}$, which consists of $N$ expert-generated prompt-response pairs $\{(\bm{x}_{i}^{0},\bm{y}_{i}^{0})\mid i\in[N]\}$, the goal of self-improving is to iteratively refine $\pi_{\theta}$ to maximize the reward. The refinement process includes learning from synthesized prompts and corresponding responses. These responses are obtained using an advanced search algorithm that navigates the space of possible responses to maximize the expected reward. The detailed process is described in Algorithm 1. The primary challenges in forming an effective self-improving loop lie in synthesizing suitable prompts, efficiently searching over a vast action space, and obtaining precise feedback, which will be discussed in §4.

Input Initial dataset $\mathcal{D}^{0}=\{(\bm{x}_{i}^{0},\bm{y}_{i}^{0})\mid i\in[N]\}$, policy model $\pi_{\theta}^{0}$, reward model $R$, number of self-improving training loops $K$
Output $\theta^{k}$
for $k\leftarrow 1,\dots,K$ do
       Generate synthetic prompts $[\bm{x}^{k}]=\texttt{SYN}(\pi_{\theta}^{k-1},\mathcal{D}^{k-1})$
       Collect trajectories with search algorithm, e.g., MCTS guided by $R$: $[\hat{\bm{y}}^{k}]=\texttt{MCTS}(\pi_{\theta}^{k-1},[\bm{x}^{k}])$
       Construct dataset $\mathcal{D}^{k}=\{(\bm{x}^{k},\hat{\bm{y}}^{k})\}$
       Update policy $\theta^{k}=\arg\min_{\theta}L(\pi_{\theta}^{k-1},\mathcal{D}^{k})$
end for
Algorithm 1 LLM self-improving loop
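
The control flow of Algorithm 1 can be sketched in Python as follows. This is only a skeleton under stated assumptions: `synthesize`, `search`, and `update` are hypothetical callables standing in for SYN, MCTS guided by $R$, and the loss minimization, and the toy stand-ins at the bottom exist purely to make the loop executable.

```python
from typing import Any, Callable, List, Tuple

Pair = Tuple[str, str]  # (prompt, response)


def self_improve(
    policy: Any,
    dataset: List[Pair],
    synthesize: Callable[[Any, List[Pair]], List[str]],
    search: Callable[[Any, List[str]], List[str]],
    update: Callable[[Any, List[Pair]], Any],
    num_loops: int,
) -> Tuple[Any, List[Pair]]:
    """Run K rounds of synthesize -> search -> rebuild dataset -> update."""
    for _ in range(num_loops):
        prompts = synthesize(policy, dataset)    # SYN(pi^{k-1}, D^{k-1})
        responses = search(policy, prompts)      # MCTS(pi^{k-1}, [x^k])
        dataset = list(zip(prompts, responses))  # D^k
        policy = update(policy, dataset)         # theta^k
    return policy, dataset


# Toy stand-ins (hypothetical): the "policy" is just a round counter.
def toy_synthesize(policy: int, dataset: List[Pair]) -> List[str]:
    return [prompt + "'" for prompt, _ in dataset]


def toy_search(policy: int, prompts: List[str]) -> List[str]:
    return [prompt.upper() for prompt in prompts]


def toy_update(policy: int, dataset: List[Pair]) -> int:
    return policy + 1


final_policy, final_data = self_improve(
    0, [("q", "a")], toy_synthesize, toy_search, toy_update, num_loops=3
)
```

Note that each round consumes the dataset built in the previous round, mirroring the $\mathcal{D}^{k-1}$ argument of SYN in Algorithm 1.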

3.2 Monte Carlo Tree Search

MCTS is a sampling-based search algorithm for policy optimization in decision-making problems. It iteratively builds a search tree by repeating four phases: selection, expansion, evaluation, and backpropagation. In the selection phase, it recursively selects children, starting from the root node, by the Upper Confidence Bound (UCB) bandit (Auer et al., 2002), which is

$$UCB(i)=w_{i}+C*\sqrt{2*\ln{\frac{N_{i}}{n_{i}}}} \tag{1}$$

where $n_{i}$ and $N_{i}$ are the visit counts for the node $i$ and its parent respectively, $C$ represents a hyperparameter balancing exploration and exploitation, and $w_{i}$ is the average value of all descendant nodes of $i$. Following selection, the tree undergoes expansion according to the defined policy in the expansion phase. Then, in the evaluation phase, the value of the newly expanded node is estimated by sampling or model-based methods. Finally, in the backpropagation phase, the estimated value is backpropagated to all ancestor nodes of the newly expanded node.
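
The selection rule can be sketched in Python as below. The function `ucb` transcribes Equation 1 as printed; the helper `select_child`, the `(w_i, n_i)` tuple layout, and the treatment of unvisited children (infinite score, so they are tried first) are assumptions made for illustration.

```python
import math
from typing import List, Tuple


def ucb(w_i: float, n_i: int, parent_visits: int, c: float) -> float:
    """Equation 1 as printed: w_i + C * sqrt(2 * ln(N_i / n_i)).

    Assumes 0 <= n_i <= N_i, which holds when every visit to a child
    is also counted as a visit to its parent.
    """
    if n_i == 0:
        # Assumption: an unvisited child gets priority over visited ones.
        return math.inf
    return w_i + c * math.sqrt(2.0 * math.log(parent_visits / n_i))


def select_child(children: List[Tuple[float, int]], parent_visits: int,
                 c: float = 1.0) -> int:
    """Return the index of the child (w_i, n_i) with the highest score."""
    return max(
        range(len(children)),
        key=lambda j: ucb(children[j][0], children[j][1], parent_visits, c),
    )


# A rarely visited child can outrank a child with a higher average value.
children = [(0.6, 8), (0.5, 2)]
chosen = select_child(children, parent_visits=10)
```

Here the second child is chosen despite its lower average value, because its small visit count makes the exploration term dominate.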

4 AlphaLLM

4.1 Overview

The architecture of AlphaLLM is depicted in Figure 1, comprising three key components. Firstly, the imagination component is tasked with synthesizing prompts as learning examples. Secondly, an efficient search component, named $\eta$Mcts, is proposed to search for high-quality trajectories for optimizing the policy. Lastly, the search process is guided by critics specifically designed to provide reliable signals.

4.2 Data Synthesizing

Let $\mathcal{D}^{0}=\{(\bm{x}_{i},\bm{y}_{i})\mid i\in[N]\}$ denote the initial dataset consisting of $N$ expert-generated prompt-response pairs. The data synthesizing process aims to expand this dataset by generating a set of synthesized prompts $\mathcal{D}^{1}=\{(\bm{x}_{i}^{1},\cdots)\mid i\in[N]\}$. The generation of each synthesized prompt $\bm{x}_{i}^{1}$ can be mathematically described as a transformation $g$ applied to one or more examples from $\mathcal{D}^{0}$:

$$\bm{x}_{i}^{1}=g(\bm{x}_{i_{1}}^{0},\cdots,\bm{x}_{i_{m}}^{0},\pi^{0})$$

where $\bm{x}_{i_{1}}^{0},\cdots,\bm{x}_{i_{m}}^{0}$ are selected examples from $\mathcal{D}^{0}$. The transformation function $g$ controls the synthesis process, which can be a learnable function, manually defined heuristic rules, a strong LLM, or the policy model $\pi^{0}$ itself equipped with data synthesis instructions. The data synthesizing process aims to enrich the diversity and complexity of the prompts presented for the training of the policy model. Among various strategies, such as Self-instruct (Wang et al., 2022) and Evol-instruct (Xu et al., 2023), we opt for a method akin to that described in Yu et al. (2023).

4.3 $\eta$Mcts

Figure 2: An overview of the four operations of $\eta$Mcts. A node is selected, expanded, simulated with a fast rollout policy until a terminal node is reached, then the signals from the value function, PRM and ORM are backpropagated.

4.3.1 Option-level MCTS

Search Node | Example | Termination
Token-level | $y_{0}\rightarrow y_{1}\rightarrow y_{2}\rightarrow y_{3}\rightarrow y_{5}\rightarrow y_{6}\rightarrow y_{7}\rightarrow y_{8}$ | token
Sentence-level | $y_{0}y_{1}y_{2}$ ↵ $\rightarrow y_{4}y_{5}y_{6}$ ↵ $\rightarrow y_{7}y_{8}y_{9}y_{10}$ | new line
Option-level | $y_{0}$ $\rightarrow y_{1}y_{2}$ ↵ $\rightarrow y_{4}y_{5}y_{6}$ ↵ $y_{7}y_{8}y_{9}$ ↵ $\rightarrow y_{10}$ | termination function
Table 1: Comparative illustration of token-level, sentence-level, and option-level MCTS search nodes. $y$ denotes a token sampled from the policy model. The arrow $\rightarrow$ represents the transition from one search node to the subsequent node within the search process.

When applying MCTS to LLMs, it is natural to perform token-level search, where each token is considered as an action (Liu et al., 2023). However, the substantial vocabulary size typical of LLMs presents a significant challenge, i.e., conducting a deep search in such a vast space becomes increasingly complex as the search space expands exponentially. To mitigate this, some work proposes a sentence-level search, treating each sentence or step as a search node (Feng et al., 2023). While this method reduces the search space, it might compromise the flexibility and effectiveness of applying MCTS to LLMs, which is particularly true for tasks where subtle variations in tokens can dramatically impact the outcome, or where a more comprehensive search beyond a sentence is necessary.

Inspired by Sutton et al. (1999a); De Waard et al. (2016), we use the term option as a search node and propose option-level MCTS, where each option represents a sequence of tokens, which can range from multiple tokens to several sentences. A comparison of the different levels of search is listed in Table 1. Mathematically, an option is $o=\langle\mathcal{I},\pi,\beta\rangle$, where $\mathcal{I}\subseteq\mathcal{S}$ is a set of initial states for the option; $\pi:\mathcal{S}\times\mathcal{A}\rightarrow[0,1]$ is a policy to generate actions, which in our case is an LLM; and $\beta:\mathcal{S}^{+}\rightarrow[0,1]$ is the termination function. Starting from a state $s_{t}$, we can choose all the options for which $s_{t}\in\mathcal{I}$. Once an option is chosen, the policy $\pi$ will generate actions for several steps until the option terminates according to the termination function $\beta$. As illustrated in Figure 2, option-level MCTS consists of the following operations:

  • Selection Starting from the root node, we iteratively select the child node based on Equation 1.

  • Expansion Once an expandable leaf node is selected, a new node is generated by starting with the previous state of the parent node as the initial option state. The option is then sampled using the policy $\pi$, and its completion is determined by the termination function $\beta$.

  • Simulation The scaled reward of the newly expanded node, as well as some simulated future trajectories, is evaluated using the feedback functions, which will be discussed in §4.4.

  • Backpropagation The average value of the newly generated node and all its ancestors is updated using the scaled reward from the evaluation step. Meanwhile, the visit counts for these nodes are also increased by one.

Using an option rather than a single token as the unit of each node can reduce the search space, as the number of options in a trajectory is much smaller than the number of tokens. This facilitates a deeper search and broader coverage of the search space, and it reduces how often feedback must be requested from functions such as the value model. Moreover, the option level offers more flexibility than the sentence level, as a new line can be treated as a special case of the termination function, as demonstrated in Table 1.
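To make the role of the termination function concrete, the following sketch builds one option by consuming tokens until a termination check fires. It is a simplification under stated assumptions: the termination function here is deterministic (the paper's $\beta$ maps to $[0,1]$), the token stream is a hard-coded stand-in for the LLM policy, and the names `sample_option` and `newline_termination` are hypothetical.

```python
from typing import Callable, Iterable, List


def newline_termination(tokens: List[str]) -> bool:
    """Sentence-level search as a special case: stop once a newline token appears."""
    return bool(tokens) and tokens[-1].endswith("\n")


def sample_option(token_stream: Iterable[str],
                  beta: Callable[[List[str]], bool],
                  max_tokens: int = 256) -> str:
    """Consume tokens from the policy until the termination function fires."""
    option: List[str] = []
    for token in token_stream:
        option.append(token)
        if beta(option) or len(option) >= max_tokens:
            break
    return "".join(option)


# Stand-in for tokens sampled from the LLM policy.
fake_policy_output = iter(["3", " *", " 4", " =", " 12", "\n", "Then", " add", " 2", "\n"])
first_option = sample_option(fake_policy_output, newline_termination)
second_option = sample_option(fake_policy_output, newline_termination)
print(repr(first_option), repr(second_option))
```

Swapping `newline_termination` for a different predicate changes the granularity of the search without touching the rest of the loop.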

4.3.2 Importance Weighted Expansion

In previous work on option- or sentence-level tree search (Feng et al., 2023; Yao et al., 2024), it has been common practice to assume that each node in the tree has the same predefined width, i.e., branching factor. This is because, unlike token-level MCTS with its limited action space, the sample space at the option level is exceedingly large, with an unlimited number of token combinations. Consequently, it is necessary to set a predefined maximum width. However, a fixed width often results in an inefficient search, as it may be either too large or too small for a given node.

A more effective and efficient way to determine the branching factor for each node is to dynamically adjust it based on the importance of each node. This approach allows us to allocate a larger child budget to nodes of higher importance, thereby preventing inefficient exploration of these nodes and ensuring that we do not miss promising solutions. Meanwhile, by reducing the number of children for less important nodes, we can perform deeper searches at various levels of the tree, rather than considering all possible options at each node. Inspired by Taylor et al. (2014); Clouse (1996), we define the importance of a node $\bm{s}_t$ as:

$$I(\bm{s}_t) = \max_{\bm{o}_t} \left| v^{\pi}([\bm{s}_t, \bm{o}_t]) - v^{\pi}(\bm{s}_t) \right|$$

where $v^{\pi}$ is the value function, which will be detailed in §4.4. $I(\bm{s}_t)$ captures the maximum value deviation from the current state. When this value is small, there is no need to explore further on this node, as there will not be a significant difference by rolling out on this node. Conversely, if the value is large, it is worth trying different children. We set the number of children allowed for a node, $n(\bm{s}_t)$, to be linear in this importance, using a factor $\alpha$. In practice, to avoid extreme cases, we bound the number of children by depth-dependent constants $c_{\mathtt{min}}(t)$ and $c_{\mathtt{max}}(t)$:

$$n(\bm{s}_t) = \max\left(c_{\mathtt{min}}(t), \min\left(\lfloor \alpha I(\bm{s}_t) \rfloor, c_{\mathtt{max}}(t)\right)\right).$$
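As a worked illustration of the two formulas above, the sketch below computes the importance from a parent value and a handful of child values, then clamps the resulting child budget. The function names and the numeric values (including $\alpha = 16$) are illustrative assumptions, and the maximum is taken over sampled options only, since the full option space cannot be enumerated.

```python
import math
from typing import Sequence


def importance(parent_value: float, child_values: Sequence[float]) -> float:
    """I(s_t): largest absolute value change over the sampled options."""
    return max(abs(v - parent_value) for v in child_values)


def num_children(imp: float, alpha: float, c_min: int, c_max: int) -> int:
    """n(s_t) = max(c_min, min(floor(alpha * I), c_max))."""
    return max(c_min, min(math.floor(alpha * imp), c_max))


imp = importance(parent_value=0.5, child_values=[0.25, 0.5, 0.875])
budget = num_children(imp, alpha=16, c_min=2, c_max=10)
print(imp, budget)

# Clamping at both ends: an unimportant node still gets c_min children,
# and a very important node never exceeds c_max.
print(num_children(0.0, alpha=16, c_min=2, c_max=10))
print(num_children(1.0, alpha=16, c_min=2, c_max=10))
```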

4.3.3 State Merge

With $n(\bm{s}_t)$ determined, another issue is that states under the same node can be very similar, causing many unnecessary sub-trees. To maximize diversity among states and cover as much space as possible with limited rollouts, we utilize the concept of move groups (Van Eyck & Müller, 2012). By partitioning available options into distinct groups based on their similarities, with the maximum number of groups equal to the branching factor, we enhance diversity among groups. This strategy allows us to cover a larger problem space with limited search rollouts, making the search process more efficient.

In practice, each time we generate a new option from the policy, we use a heuristic function to measure its similarity with existing options. The heuristic function can be either a fast rule-based measurement (e.g., edit distance) or a model-based method (e.g., prompting an LLM). Based on this, we decide whether to merge this option with a previous one or create a new group. This process is repeated until a maximum number of repetitions is reached. The details of this process are outlined in Algorithm 2.

Input: maximum number of trials $max\_trials$, threshold $thres$
Output: pool of children nodes
$n \leftarrow 0$
$min\_d \leftarrow 0$
while $n < max\_trials$ and $min\_d \leq thres$ do
       $\bm{o}_t \sim \pi(\bm{s}_t)$
       $min\_d \leftarrow \min_{\bm{o} \in A_{t,\mathtt{pool}}} \mathtt{Dist}(\bm{o}_t, \bm{o})$
       $n \leftarrow n + 1$
end while
Add $\bm{s}_{t+1} = [\bm{s}_t, \bm{o}_t]$ to the pool of children nodes
Algorithm 2: Find Action with Minimum Distance Larger Than Threshold

In Algorithm 2, we iteratively sample an option $\bm{o}_t$ from the policy $\pi(\bm{s}_t)$ and compute the minimum distance $min\_d$ between $\bm{o}_t$ and the actions in the pool $A_{t,\mathtt{pool}}$, measured by the distance function $\mathtt{Dist}$. If $min\_d$ is larger than a predefined threshold $thres$ or the maximum number of trials $max\_trials$ is reached, the loop terminates and the resulting state $\bm{s}_{t+1}$ is added to the pool of children nodes.
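A minimal Python sketch of the loop in Algorithm 2 follows. Several pieces are assumptions made for illustration: the distance is one minus the similarity ratio from Python's `difflib` (the paper only names edit distance or an LLM judge as possible choices), the policy is replaced by a fixed list of candidate strings, and an empty pool is treated as infinitely far from any option.

```python
import difflib
from typing import Callable, List


def dist(a: str, b: str) -> float:
    """Illustrative distance in [0, 1]: 1 minus difflib's similarity ratio."""
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()


def sample_distinct_option(sample_option: Callable[[], str],
                           pool: List[str],
                           max_trials: int,
                           thres: float) -> str:
    """Resample until the option is farther than `thres` from every pooled option,
    or until `max_trials` samples have been drawn; the last sample is returned."""
    n, min_d = 0, 0.0
    option = ""
    while n < max_trials and min_d <= thres:
        option = sample_option()
        # With an empty pool nothing can be too close, so use an infinite distance.
        min_d = min((dist(option, o) for o in pool), default=float("inf"))
        n += 1
    return option


# Stand-in for repeated sampling from the LLM policy.
candidates = iter(["x = 3 * 4", "x = 3 * 4.", "Let y be the total cost"])
pool = ["x = 3 * 4"]
chosen = sample_distinct_option(lambda: next(candidates), pool, max_trials=5, thres=0.5)
pool.append(chosen)
print(chosen)
```

The first two candidates are near-duplicates of the pooled option and are skipped; the third is sufficiently different and becomes a new child.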

4.3.4 Fast Rollout with Specialized LM

The simulation operation, which employs a rollout policy to project future trajectories from a given state, is crucial for an effective MCTS. This process significantly improves the efficiency of exploration and exploitation, and enhances the accuracy of reward estimation (typically, the closer the simulation is to the termination state, the more accurate the reward estimation becomes). By simulating numerous potential trajectories, MCTS can better approximate the likely outcomes of various actions, thereby facilitating a more informed search process.

Ideally, $\pi_{\theta}$ would serve as the rollout policy, yet its computational demands render it impractical for the rapid simulations required by MCTS. To address this challenge, we propose the use of a smaller, specialized LM as the fast rollout policy $\pi^{\mathtt{fast}}$. Given a state $\bm{s}_t$, the fast rollout policy $\pi^{\mathtt{fast}}$ efficiently continues generation until it reaches a termination condition, denoted as $\pi^{\mathtt{fast}}(\bm{s}_t)$.

4.4 Critic

It is crucial for search algorithms to have reliable guidance signals towards achieving the end goal. In AlphaLLM, we design three types of critic models to guide the search process: a value function $v^{\pi}$ that predicts the future reward, a process reward model (PRM) that estimates node quality, and an outcome reward model (ORM) that assesses the overall trajectory quality.

Value Function

The value function, denoted as $v^{\pi}(\bm{s})$, is the expected reward starting from state $\bm{s}$ and following the policy $\pi$ thereafter. To train a value function $v^{\pi}_{\phi}(\bm{s})$ parameterized by $\phi$, we use the Monte Carlo (MC) estimate to empirically approximate the expected reward by averaging the rewards observed over many samples starting from state $\bm{s}$ and following policy $\pi$. The reward from a state is the sum of rewards obtained in the future, discounted by a factor $\gamma$ at each time step. Thus, the MC estimate of $v^{\pi}_{\phi}(\bm{s})$ can be written as $v^{\pi}_{\phi}(\bm{s}) \approx \frac{1}{J}\sum_{j=1}^{J} G^{(j)}(\bm{s})$, where $J$ is the number of trajectories starting from state $\bm{s}$ and $G^{(j)}(\bm{s})$ is the total discounted reward from state $\bm{s}$ in the $j$-th trajectory.
In particular, given the expert demonstration dataset $\mathcal{D} = \{(\bm{x}_i, \bm{y}_i)\}$, for each prompt $\bm{x}_i$ we generate trajectories $\bm{\tau}_i^j = \{\bm{x}_i, \bm{o}_{i1}^j, \bm{o}_{i2}^j, \cdots, \bm{o}_{iT}^j\}$ by following policy $\pi$. A reward $r_i^j$ is assigned to indicate whether $\bm{\tau}_i^j$ aligns with $\bm{y}_i$, e.g., rewarding trajectories that contain the correct answer in mathematical tasks or that closely follow the instruction, as the ground truth does.
We then construct a dataset $\mathcal{D}_{\mathtt{value}} = \{(\bm{s}_{it}, v_{it}) \mid i \in [N], t \in [T]\}$, in which $\bm{s}_{it} = [\bm{x}_i \cdot \bm{o}_{<it}]$ and $v_{it} = \frac{1}{J}\sum_{j=1}^{J} r^{j}_{iT}$. The value function $v_{\phi}^{\pi}$ is optimized by minimizing the mean squared error:

$$\mathcal{L}_{\phi} = \mathbb{E}_{(\bm{s}, v) \sim \mathcal{D}_{\mathtt{value}}} \left(v_{\phi}^{\pi}(\bm{s}) - v\right)^2$$

We opt to initialize $v_{\phi}^{\pi}$ using the parameters from policy $\pi_{\theta}$, incorporating an MLP layer on top of it to output a scalar on each token. The scalar prediction at the last token of each state is used as the value.
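The construction of the value targets and the squared-error objective can be sketched in a few lines of plain Python. The rollout rewards and the constant predictions below are made-up numbers, and the function names are hypothetical; a real implementation would compute predictions with the value head described above.

```python
from typing import Sequence


def mc_value_target(terminal_rewards: Sequence[float]) -> float:
    """v_it: mean terminal reward over the J rollouts that share the prefix s_it."""
    return sum(terminal_rewards) / len(terminal_rewards)


def mse_loss(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between predicted values and Monte Carlo targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Rewards of J = 4 rollouts from each of two prefixes (1.0 = correct final answer).
rollout_rewards = [[1.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 0.0]]
targets = [mc_value_target(r) for r in rollout_rewards]
loss = mse_loss([0.5, 0.5], targets)
print(targets, loss)
```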

PRM

The value function often struggles with the credit assignment problem (Sutton, 1984), and its learning can be inefficient due to delayed and sparse rewards (Sutton & Barto, 2018). Therefore, we propose to incorporate a PRM that introduces process supervision (Lightman et al., 2023) for direct option assessment. The PRM generates intrinsic rewards (Chentanez et al., 2004) to encourage exploration of advantageous options, effectively mitigating reward sparsity by providing immediate, action-specific rewards. Given a state $\bm{s}_t$ and an option $\bm{o}_t$ at time $t$, the PRM aims to predict the immediate reward $r_t^{\mathtt{PRM}}$ that results from taking option $\bm{o}_t$ in state $\bm{s}_t$. Formally, the PRM is a function $R(\bm{s}_t, \bm{o}_t) \rightarrow r^{\mathtt{PRM}}_t$. Instead of adding an MLP layer on top of the policy model to output a scalar reward (Ouyang et al., 2022), we formulate the PRM as a text generation task to best leverage the LLM’s intrinsic knowledge for assessing the quality of an option. We use prefix sampling (Wang et al., 2023a) to estimate the quality of an option by starting from an option and exploring the final reward after reaching terminal states.
The intuition is that an intermediate step can be regarded as good if it frequently leads to achieving the goal. We adapt the dataset constructed for the value function as $\mathcal{D}_{\mathtt{PRM}} = \{(\bm{s}_{it}, \bm{o}_t, r_t^{\mathtt{PRM}}) \mid i \in [N], t \in [T]\}$, where $r_t^{\mathtt{PRM}}$ is the textual description of the reward, e.g., an option can be regarded as good if $v_{it}$ is larger than a certain threshold. To train the PRM, we initialize it from the policy model $\pi$ and use the following prompt template and the typical language model loss.

###[A detailed rubric that specifies how to evaluate a step of a task]\n\n### State\n{state}\n\n###Action\n{option}\n\n###Assessment\n{textual reward}

ORM

In addition to the value function and the PRM, we introduce an ORM to guide MCTS. The ORM is designed to evaluate option sequences in their entirety, assessing the extent to which the complete trajectory aligns with the desired end goal. The outcome evaluation complements the value function and the PRM by offering a comprehensive assessment of trajectories. Crucially, the ORM plays a vital role in the simulation stage of MCTS by providing more accurate signals on the terminal state, which in turn facilitates a better balance between exploration and exploitation. The ORM is formulated as a text generation task, similar to the PRM. We leverage the same dataset used for value function training and construct $\mathcal{D}_{\mathtt{ORM}} = \{(\bm{x}_i, \bm{o}_{1:T}^{i}, r_i^{\mathtt{ORM}}) \mid i \in [N]\}$, where each instance includes an initial state or prompt $\bm{x}_i$, a sequence of actions or options $\bm{o}_{1:T}^{i}$ taken from that state, and a textual reward $r_i^{\mathtt{ORM}}$ indicating the sequence’s success or quality.
Similarly, the ORM is initialized from the policy model $\pi$, and the following prompt template and language model loss are used for training.

###[A detailed rubric that specifies how to evaluate a complete trajectory of a task]\n\n### Prompt\n{prompt}\n\n###Trajectory\n{trajectory}\n\n###Assessment\n{textual reward}
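For illustration, the PRM and ORM prompt templates can be filled programmatically as sketched below. The rubric wording, the labels "good" and "bad", and the 0.5 threshold are hypothetical placeholders, since the paper specifies neither the rubric text nor the exact label vocabulary; only the section markers follow the templates shown in the text.

```python
PRM_TEMPLATE = (
    "###{rubric}\n\n### State\n{state}\n\n###Action\n{option}"
    "\n\n###Assessment\n{textual_reward}"
)
ORM_TEMPLATE = (
    "###{rubric}\n\n### Prompt\n{prompt}\n\n###Trajectory\n{trajectory}"
    "\n\n###Assessment\n{textual_reward}"
)


def textual_reward(value: float, threshold: float = 0.5) -> str:
    """Map a Monte Carlo value estimate to a text label (labels are hypothetical)."""
    return "good" if value > threshold else "bad"


prm_example = PRM_TEMPLATE.format(
    rubric="Judge whether the step moves toward a correct solution.",
    state="Q: What is 2 + 3 * 4?",
    option="First compute 3 * 4 = 12.",
    textual_reward=textual_reward(0.75),
)
orm_example = ORM_TEMPLATE.format(
    rubric="Judge whether the full solution reaches the correct answer.",
    prompt="Q: What is 2 + 3 * 4?",
    trajectory="First compute 3 * 4 = 12. Then 2 + 12 = 14.",
    textual_reward=textual_reward(1.0),
)
print(prm_example)
```

During training the whole string would be fed to the model with the language model loss; at inference time the text up to the assessment header serves as the prompt.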

4.5 Policy Self-Improvement

We have discussed how $\eta$Mcts can guide the policy to find trajectories of higher quality. In this subsection, we discuss how to leverage these trajectories to further improve the policy. It is an iterative process, with each iteration containing two main steps: data generation and policy finetuning.

Data generation

In this step, we assume that we have the current policy $\pi_{\theta_k}$ and synthetic prompts $\mathcal{D}_k = \{\bm{x}^k_1, \dots\}$ at the $k$-th round, where each $\bm{x}^k_i$ represents a question. We obtain the corresponding training data $\mathcal{D}_k$ for policy $\pi_{\theta_k}$ by first performing $\eta$Mcts on $\mathcal{D}_k$ (§4.3) and then sampling a trajectory $\bm{y}^k_i$ from the corresponding MCTS forest for each question $\bm{x}^k_i$. There are several ways to select a trajectory from an MCTS forest, such as taking a greedy path based on the critic score ($w_i$ in Eq. 1). Here we choose the trajectory that yields the highest critic score on the leaf node for each input question. As the next step, we filter out instances where the corresponding trajectory is not of high quality:

$$\mathcal{D}_k = \{(\bm{x}^k_i, \bm{y}^k_i) \mid f(\bm{x}^k_i, \bm{y}^k_i) > \gamma\}$$

where $f$ represents a function for quality scoring, and $\gamma$ represents the threshold. There can be several ways to implement the function, and here we simply use the ORM (§4.4).
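The selection and filtering described in this step can be sketched as follows. This is an illustration under assumptions: each prompt comes with a list of (trajectory, leaf critic score) pairs already extracted from its search tree, and `toy_orm` is a made-up stand-in for the ORM that simply checks the final answer against a lookup table.

```python
from typing import Callable, Dict, List, Tuple


def build_training_set(
    candidates: Dict[str, List[Tuple[str, float]]],
    quality_fn: Callable[[str, str], float],
    gamma: float,
) -> List[Tuple[str, str]]:
    """For each prompt keep the trajectory with the highest leaf critic score,
    then drop pairs whose quality score does not exceed `gamma`."""
    dataset = []
    for prompt, trajectories in candidates.items():
        best_trajectory, _ = max(trajectories, key=lambda pair: pair[1])
        if quality_fn(prompt, best_trajectory) > gamma:
            dataset.append((prompt, best_trajectory))
    return dataset


# Toy data: (trajectory, leaf critic score) pairs per synthetic prompt.
candidates = {
    "2 + 2 = ?": [("The answer is 4", 0.9), ("The answer is 5", 0.2)],
    "3 * 3 = ?": [("The answer is 6", 0.6), ("The answer is 8", 0.4)],
}
answers = {"2 + 2 = ?": "4", "3 * 3 = ?": "9"}


def toy_orm(prompt: str, trajectory: str) -> float:
    """Made-up quality function standing in for the ORM."""
    return 1.0 if trajectory.endswith(answers[prompt]) else 0.0


dataset = build_training_set(candidates, toy_orm, gamma=0.5)
print(dataset)
```

The second prompt is dropped because even its best-scored trajectory fails the quality check, which is the purpose of the filter.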

Policy finetuning

With the obtained training data $\mathcal{D}_k$, we organize the data into the following prompt template:

A chat between a curious user and an artificial intelligence assistant.\n The assistant gives helpful, detailed, and polite answers to the user’s questions.\n User: $\bm{x}_i$\n Assistant: $\bm{y}_i$

Then the policy $\pi_{\theta_k}$ is finetuned using target-loss SFT:

$$\mathcal{L}_{\theta_k} = \mathbb{E}_{(\bm{x}^k_i, \bm{y}^k_i) \sim \mathcal{D}_k} \big[\log \pi_{\theta_k}(\bm{y}^k_i \mid \bm{x}^k_i)\big]$$

This results in an updated policy $\pi_{\theta_{k+1}}$. We leave other training methods, such as DPO (Rafailov et al., 2023) or PPO (Schulman et al., 2017), to future work.
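One common reading of "target-loss" SFT is that only the response tokens $\bm{y}$ contribute to the objective while the prompt tokens $\bm{x}$ are masked out. The sketch below follows that reading, which is an assumption rather than something the text spells out. It returns the negative log-likelihood so that lower is better, and it averages over target tokens, which is also an implementation choice; the per-token log-probabilities are made-up numbers.

```python
from typing import Sequence


def target_only_nll(token_logprobs: Sequence[float], is_target: Sequence[bool]) -> float:
    """Average negative log-likelihood over response tokens only.

    Prompt tokens (is_target == False) are masked out of the loss.
    """
    selected = [lp for lp, keep in zip(token_logprobs, is_target) if keep]
    return -sum(selected) / len(selected)


# Two prompt tokens followed by three response tokens.
logprobs = [-2.0, -1.5, -0.5, -0.25, -0.25]
mask = [False, False, True, True, True]
nll = target_only_nll(logprobs, mask)
print(nll)
```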

5 Experiments

5.1 Evaluation Setups

Datasets

AlphaLLM is generally applicable to a wide spectrum of tasks. As an early exploration, in this paper we conduct experiments on mathematical reasoning problems, where the learning signal is easy to define, i.e., whether the final answer is correct or wrong. We choose to evaluate on two widely used datasets, GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021). For GSM8K, we utilize the whole test set, while for MATH, due to computation constraints, we utilize a subset following the same procedure as Lightman et al. (2023).

Metrics

We evaluate how accurately the policy models predict the final answers. At the same time, we calculate the average number of rollouts, represented by the number of nodes in the tree, as a measure of computational efficiency.

5.2 Baseline Systems

We evaluate the performance of AlphaLLM against a suite of proprietary models, including OpenAI’s GPT-4 and GPT-3.5, Anthropic’s Claude-2, as well as Google’s PaLM-2 and the Gemini model family. To ensure a fair and consistent evaluation, we employ CoT as our primary prompting method. We additionally report PAL (Gao et al., 2023) prompting performance with GPT-4, as it demonstrates enhanced performance.

Additionally, we conduct comparisons with strong open-source models, including LLaMA-2 70B (Touvron et al., 2023a) and WizardMath 70B (Luo et al., 2023). For LLaMA-2 70B, we present results from few-shot prompting as well as zero-shot prompting for its SFT version, which was trained using CoT rationales and final answers. WizardMath 70B has been trained on a diverse set of mathematical data generated by ChatGPT, employing both SFT and RLHF. We provide zero-shot prompting results.

5.3 Implementation Details

We select LLaMA-2 70B as the policy model for the GSM8K dataset and WizardMath 70B V1.0 for the MATH dataset. To construct the training dataset for the value function, PRM, and ORM, we generate 50 trajectories for each prompt and construct the training targets following Section 4.4. Both PRM and ORM are initialized using the weights from the policy model. In the design of ORM, tool usage is not incorporated for GSM8K. However, for MATH, we enhance ORM by incorporating tools such as Python's sympy to assess the quality of a trajectory, in a manner similar to that described by Gou et al. (2023b). Training uses a learning rate of 1e-6 and runs for one epoch. For the fast rollout policy model, we opt for the Abel-002-7B model (Chern et al., 2023) for both the GSM8K and MATH tasks for its high efficiency and superior performance.

We set the MCTS parameters as follows: in GSM8K, $c = 1$ for the small scale (#rollout) and $1.5$ for the large scale, with $\alpha = 1$. For $t = 0$, $c_{\mathtt{min}}(0) = 10$ for the small scale and $40$ for the large scale, while for the rest of $t$, $c_{\mathtt{min}}(t) = 2$. We also set $c_{\mathtt{max}}(0) = 10$ for the small scale and $40$ for the large scale, and for the remaining $t$, $c_{\mathtt{max}}(t) = 10$. The termination condition is based on sentence termination. In MATH, the parameters are $c = 1$, $\alpha = 1$, and for $t = 0$, $c_{\mathtt{min}}(0) = 10$ for the small scale and $20$ for the large scale, while for the rest of $t$, $c_{\mathtt{min}}(t) = 3$. We set $c_{\mathtt{max}}(0) = 10$ for the small scale and $20$ for the large scale, and for the remaining $t$, $c_{\mathtt{max}}(t) = 10$. The termination function is rule-based, checking if there are any formulations or calculations in the sentence. If there are, the option is terminated; otherwise, the option continues to extend.
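A rule-based termination function of the kind described for MATH might look like the sketch below. The regular expression is a hypothetical heuristic chosen for illustration (an equals sign, or two digits joined by an arithmetic operator); the actual rule used in the experiments is not given in the text.

```python
import re
from typing import Iterable, List

# Hypothetical heuristic: an equals sign, or digit-operator-digit, counts as a calculation.
_CALC_PATTERN = re.compile(r"\d\s*[-+*/^=]\s*\d|=")


def should_terminate(sentence: str) -> bool:
    """Return True if the sentence appears to contain a formula or calculation."""
    return bool(_CALC_PATTERN.search(sentence))


def extend_option(sentences: Iterable[str]) -> str:
    """Append sentences to the option until one of them contains a calculation."""
    option: List[str] = []
    for sentence in sentences:
        option.append(sentence)
        if should_terminate(sentence):
            break
    return " ".join(option)


generated = [
    "We need the total number of apples.",
    "Tom has 3 bags with 4 apples each.",
    "So the total is 3 * 4 = 12.",
    "Next we subtract the apples he gave away.",
]
option = extend_option(generated)
print(option)
```

With this rule an option can span several plain-language sentences and ends at the first sentence that performs a computation.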

For policy self-improving (§4.5), we train the policy model for up to 3 epochs, with a batch size of 128, a learning rate of $5\times 10^{-6}$, and a minimal learning rate of $1\times 10^{-6}$. Linear warm-up and decay are used, with a warm-up ratio of 10%. We perform early stopping based on a dev set held out from the training instances. For second-round self-improving, we sample 7.9k MetaMath (Yu et al., 2023) prompts to obtain the corresponding MCTS outputs for training.
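As a rough illustration of the schedule just described, the sketch below computes a learning rate with linear warm-up over the first 10% of steps followed by linear decay from the peak to the minimal learning rate. The function name `learning_rate`, the choice to warm up from near zero, and the choice to reach the minimum exactly at the final step are assumptions; the paper does not specify these details.

```python
def learning_rate(step: int, total_steps: int,
                  peak_lr: float = 5e-6, min_lr: float = 1e-6,
                  warmup_frac: float = 0.10) -> float:
    """Linear warm-up to peak_lr, then linear decay to min_lr (assumed shape)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Warm-up: ramp linearly so the last warm-up step hits peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Decay: interpolate linearly from peak_lr down to min_lr.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return peak_lr - (peak_lr - min_lr) * progress


if __name__ == "__main__":
    total = 1000
    for s in (0, 50, 99, 100, 550, 1000):
        print(s, learning_rate(s, total))
```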

5.4 Results

| Model | Decoding | #Annotation | RN | FA | SYN | GSM8K | MATH |
|---|---|---|---|---|---|---|---|
| GPT-3.5 | Sampling | - | - | - | - | 80.8 | 35.5 |
| GPT-4 | Sampling | - | - | - | - | 92.0 | 42.5 |
| GPT-4 (PAL) | Sampling | - | - | - | - | 94.2 | 51.8 |
| Gemini 1.0 Pro | Sampling | - | - | - | - | 77.9 | 32.6 |
| Gemini 1.0 Ultra | Sampling | - | - | - | - | 88.9 | 53.2 |
| Gemini 1.5 Pro | Sampling | - | - | - | - | 92.5 | 58.5 |
| Claude-2 | Sampling | - | - | - | - | 85.2 | 32.5 |
| PaLM-2 540B | Sampling | - | - | - | - | 80.7 | 34.3 |
| LLaMA-2 70B | Greedy | 0 | × | × | × | 57.8 | - |
| LLaMA-2 70B SFT | Greedy | 7.5k | ✓ | ✓ | × | 69.3 | - |
| WizardMath 70B V1.0 | Greedy | 96k | ✓ | ✓ | × | - | 20.7 |
| AlphaLLM | Greedy | 7.5k/3k | × | ✓ | ✓ | 73.7 | 23.6 |
| AlphaLLM | $\eta$Mcts | 7.5k/3k | × | ✓ | × | 88.9 | 48.7 |
| AlphaLLM | $\eta$Mcts | 7.5k/3k | × | ✓ | ✓ | 92.0 | 51.0 |
Table 2: Comparison results of AlphaLLM on the GSM8K and MATH datasets, utilizing LLaMA-2 70B and WizardMath 70B V1.0 as base models for GSM8K and MATH, respectively. #Annotation indicates the quantity of labeled data employed for fine-tuning each base model. The annotations used for training are denoted RN for rationales and FA for final answers. SYN means models trained on synthetic prompts, where trajectories were generated using $\eta$Mcts.

Table 2 lists the performance comparisons of various methods on the GSM8K and MATH datasets. Our findings reveal that AlphaLLM, which utilizes only final answer annotations and self-improves through training on synthetic prompts with responses from $\eta$Mcts, outperforms both LLaMA-2 70B and WizardMath 70B V1.0, even though these models are trained on a larger set of examples that include both rationales and final answer annotations. This comparison underscores the efficacy and broad applicability of our imagination-searching-criticizing self-improving framework. Moreover, when our model is augmented with the $\eta$Mcts decoding strategy, its performance markedly improves, achieving scores of 88.9 and 48.7 on the GSM8K and MATH datasets, respectively. Following two iterations of self-improvement using synthetic prompts, AlphaLLM demonstrates performance comparable to that of GPT-4. This suggests a viable approach to improving LLMs’ capabilities in complex problem-solving tasks in a self-improving fashion, leveraging a minimal amount of labeled data.

In addition, Table 3 presents the performance of various methods applied to different numbers of responses, from 10 to 50. Our analysis confirms several key findings: 1) Reranking utilizing ORM consistently outperforms self-consistency techniques, indicating that ORM is capable of generating meaningful signals for searching. 2) $\eta$Mcts demonstrates superior performance while requiring significantly fewer rollouts. For instance, on the MATH dataset, $\eta$Mcts achieves better results with only half the number of rollouts compared to reranking. These results suggest that our design of an efficient MCTS in AlphaLLM can serve as an effective policy improvement operation, enabling the search for high-quality trajectories with reduced computational cost.
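To make the difference between the two baselines concrete, the sketch below contrasts self-consistency (majority vote over sampled final answers) with ORM reranking (pick the answer of the highest-scored response). The function names and the toy answers and scores are invented for illustration; they are not outputs of the paper's models.

```python
from collections import Counter
from typing import List, Tuple


def self_consistency(answers: List[str]) -> str:
    """Return the most frequent final answer among the sampled responses."""
    return Counter(answers).most_common(1)[0][0]


def orm_rerank(scored: List[Tuple[str, float]]) -> str:
    """Return the final answer of the response with the highest ORM score."""
    return max(scored, key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    # Toy data: (final answer, hypothetical ORM score) for five sampled responses.
    samples = [("18", 0.91), ("20", 0.35), ("20", 0.40), ("20", 0.22), ("18", 0.88)]
    print(self_consistency([a for a, _ in samples]))
    print(orm_rerank(samples))
```

In this toy case the majority answer and the top-scored answer differ, which is the situation in which a well-calibrated outcome reward model can help over voting alone.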

5.5 Ablation Study

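| Method | #Responses | GSM8K #Rollouts | GSM8K Accuracy | MATH #Rollouts | MATH Accuracy |
|---|---|---|---|---|---|
| Greedy | 1 | 4.6 | 57.8 | 9.9 | 20.7 |
| Self-consistency | 10 | 46 | 67.4 | 99 | 22.5 |
| Self-consistency | 30 | 137 | 74.2 | 299 | 27.3 |
| Self-consistency | 50 | 229 | 75.4 | 499 | 28.8 |
| Re-ranking | 10 | 46 | 80.8 | 99 | 34.1 |
| Re-ranking | 30 | 137 | 86.3 | 299 | 39.0 |
| Re-ranking | 50 | 229 | 87.7 | 499 | 42.0 |
| $\eta$Mcts | - | 55 | 87.0 | 223 | 45.4 |
| $\eta$Mcts | - | 230 | 88.9 | 341 | 48.7 |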
Table 3: Comparative results of various searching methods on GSM8K and MATH.
| PRM | FR-ORM | SM | LG-#Rollout | Acc |
|---|---|---|---|---|
| × | × | × | × | 84.9 |
| ✓ | × | × | × | 85.9 |
| ✓ | ✓ | × | × | 86.5 |
| ✓ | ✓ | ✓ | × | 87.0 |
| ✓ | ✓ | ✓ | ✓ | 88.9 |

(a) Ablation study on GSM8K

| TA-ORM | Option | Acc | #Rollout |
|---|---|---|---|
| × | × | 38.8 | 201 |
| ✓ | × | 44.1 | 198 |
| ✓ | ✓ | 45.4 | 148 |

(b) Ablation study on MATH
Table 4: (a) Ablation studies on the GSM8K test set of various components of $\eta$Mcts, including PRM, fast-rollout with ORM, state merge, and large number of rollouts. (b) Ablation studies of the impacts of tool-augmented ORM and option-level formulation on MATH.

We assess the effectiveness of each component in AlphaLLM and report the results on GSM8K in Table 4(a). Vanilla MCTS, which is coupled with only the value function, yields an accuracy of 84.9% and serves as a reference point for assessing the incremental benefit of each subsequent component. Adding PRM improves the accuracy modestly to 85.9%, showing the effectiveness of process supervision for searching. Introducing ORM with fast rollout yields a further improvement, to 86.5%. Integrating state merging increases the accuracy again, to 87.0%. Finally, combining a larger number of rollouts with the other components yields the best performance on this task, 88.9%.
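The incremental contribution of each component can be read off Table 4(a) with simple arithmetic, as in the short sketch below. The accuracies are copied from the table; the row labels and variable names are illustrative.

```python
# GSM8K accuracies from Table 4(a), in the order components are added.
ablation = [
    ("value function only", 84.9),
    ("+ PRM", 85.9),
    ("+ fast-rollout ORM", 86.5),
    ("+ state merge", 87.0),
    ("+ large #rollout", 88.9),
]

# Gain of each row over the previous one, in accuracy points.
gains = [
    (name, round(acc - prev_acc, 1))
    for (_, prev_acc), (name, acc) in zip(ablation, ablation[1:])
]

if __name__ == "__main__":
    for name, gain in gains:
        print(f"{name}: +{gain}")
    print("total:", round(ablation[-1][1] - ablation[0][1], 1))
```

Note that the last row changes the search budget rather than adding a component, so its gain is not directly comparable with the three component-level gains before it.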

Table 4(b) presents the ablation study of option formulation and the tool-augmented critic on the MATH dataset. Our proposed $\eta$Mcts achieves an accuracy of 45.4 with 148 rollouts. When options are excluded, reverting to essentially sentence-level MCTS, the performance decreases to 44.1 with a noticeable increase in the number of rollouts to 198. This demonstrates that option formulation introduces enhanced flexibility to MCTS, enabling better performance with fewer search efforts. Furthermore, the most significant decrease in performance is observed when only intrinsic knowledge is utilized for ORM, which drops to an accuracy of 38.8. This suggests that the absence of an external tool critically impedes the ORM’s capability to effectively assess challenging math problems.

Figure 3: Empirical analysis on GSM8K of different self-improving data collection methods and number of iterations. Models are evaluated with greedy decoding, $\eta$Mcts with small #rollout and large #rollout. Two iterations of self-improvement are conducted using data from reranking and $\eta$Mcts.

Figure 3 compares, on GSM8K, two rounds of self-improving trained on trajectories collected using reranking and $\eta$Mcts. We report the performance of greedy decoding, $\eta$Mcts with a moderate number of rollouts (55), and $\eta$Mcts with a large number of rollouts (230) for each model. We observe that 1) models trained on the trajectories from reranking or $\eta$Mcts outperform the initial policy by a significant margin. In addition, the performance can be iteratively improved with training, suggesting that self-improving has the potential to achieve continual performance gains. 2) While both reranking and $\eta$Mcts can generate high-quality trajectories for self-improving, $\eta$Mcts achieves better accuracy with higher efficiency. Models trained on trajectories generated by it not only exceed the performance of those trained on reranked trajectories but also, when decoded with $\eta$Mcts, demonstrate performance on par with GPT-4, revealing that AlphaLLM is an effective self-improving framework.

6 Limitations and Future Work

Despite the promising results demonstrated by AlphaLLM in this study, there are several limitations that require further exploration. (i) Our current implementation employs relatively simple methods for generating synthetic prompts. Future iterations of AlphaLLM should explore advanced techniques, such as Self-Instruct, to create prompts that are both diverse and aware of the model's capability. (ii) Although AlphaLLM demonstrates improvements over base models, its performance with greedy decoding is substantially inferior to that observed when decoded with $\eta$Mcts. This indicates that the full potential of MCTS for self-improvement in LLMs has not yet been realized. Two potential factors contributing to this issue have been identified: a) the self-improvement loop may not be leveraging sufficient data; and b) the base model may be limited in its capacity for rapid learning. Addressing these concerns could lead to more significant improvements. (iii) In our existing framework, the critic models remain static. We will explore mechanisms to continually update critic models to adapt to new policy models. This will help maintain the discriminator-generator gap and improve the overall training dynamics. (iv) The evaluation of AlphaLLM has been limited to mathematical reasoning tasks. To verify the generalizability and broader applicability of the framework, future research will need to extend its application to other domains.

7 Conclusion

In this paper, we introduce AlphaLLM, an imagination-searching-criticizing framework designed for the self-improvement of LLMs without the necessity of additional annotations. At the heart of it is the integration of MCTS with LLMs. To tackle the inherent challenges associated with this integration, including data scarcity, the vastness of search spaces, and the subjective nature of feedback in language tasks, we introduce a data synthesizer for strategic prompt synthesis, an optimized MCTS tailored for efficient search in language tasks, and a trio of critic models to provide precise feedback. Our experimental findings on mathematical reasoning tasks reveal that AlphaLLM significantly boosts the performance of LLMs without requiring extra data annotations. Moreover, when decoded with $\eta$Mcts, AlphaLLM performs comparably to GPT-4, highlighting the potential for self-improvement in LLMs.

References

  • Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47:235–256, 2002.
  • Bai et al. (2022) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
  • Besta et al. (2024) Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pp.  17682–17690, 2024.
  • Bowman et al. (2022) Samuel R Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, et al. Measuring progress on scalable oversight for large language models. arXiv preprint arXiv:2211.03540, 2022.
  • Chen et al. (2024) Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024.
  • Chentanez et al. (2004) Nuttapong Chentanez, Andrew Barto, and Satinder Singh. Intrinsically motivated reinforcement learning. Advances in neural information processing systems, 17, 2004.
  • Chern et al. (2023) Ethan Chern, Haoyang Zou, Xuefeng Li, Jiewen Hu, Kehua Feng, Junlong Li, and Pengfei Liu. Generative ai for math: Abel. https://github.com/GAIR-NLP/abel, 2023.
  • Chung et al. (2022) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  • Clark & Storkey (2015) Christopher Clark and Amos Storkey. Training deep convolutional neural networks to play go. In International conference on machine learning, pp.  1766–1774. PMLR, 2015.
  • Clouse (1996) Jeffery Allen Clouse. On integrating apprentice learning and reinforcement learning. University of Massachusetts Amherst, 1996.
  • Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  • De Waard et al. (2016) Maarten De Waard, Diederik M Roijers, and Sander CJ Bakkes. Monte carlo tree search with options for general video game playing. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), pp.  1–8. IEEE, 2016.
  • Ding et al. (2023) Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Wei Zhang, Si Qin, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang. Everything of thoughts: Defying the law of penrose triangle for thought generation. arXiv preprint arXiv:2311.04254, 2023.
  • Feng et al. (2023) Xidong Feng, Ziyu Wan, Muning Wen, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179, 2023.
  • Gao et al. (2023) Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. In International Conference on Machine Learning, pp.  10764–10799. PMLR, 2023.
  • Gou et al. (2023a) Zhibin Gou, Zhihong Shao, Yeyun Gong, Yujiu Yang, Nan Duan, Weizhu Chen, et al. Critic: Large language models can self-correct with tool-interactive critiquing. In Second Agent Learning in Open-Endedness Workshop, 2023a.
  • Gou et al. (2023b) Zhibin Gou, Zhihong Shao, Yeyun Gong, Yujiu Yang, Minlie Huang, Nan Duan, Weizhu Chen, et al. Tora: A tool-integrated reasoning agent for mathematical problem solving. arXiv preprint arXiv:2309.17452, 2023b.
  • Guo et al. (2024) Hongyi Guo, Yuanshun Yao, Wei Shen, Jiaheng Wei, Xiaoying Zhang, Zhaoran Wang, and Yang Liu. Human-instruction-free llm self-alignment with limited samples. arXiv preprint arXiv:2401.06785, 2024.
  • Hao et al. (2023) Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. Reasoning with language model is planning with world model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  8154–8173, 2023.
  • Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021.
  • Hong et al. (2023) Ruixin Hong, Hongming Zhang, Xinyu Pang, Dong Yu, and Changshui Zhang. A closer look at the self-verification abilities of large language models in logical reasoning. arXiv preprint arXiv:2311.07954, 2023.
  • Huang et al. (2023) Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798, 2023.
  • Lewkowycz et al. (2022) Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022.
  • Li et al. (2023) Xian Li, ** Yu, Chunting Zhou, Timo Schick, Luke Zettlemoyer, Omer Levy, Jason Weston, and Mike Lewis. Self-alignment with instruction backtranslation. arXiv preprint arXiv:2308.06259, 2023.
  • Lightman et al. (2023) Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023.
  • Liu et al. (2023) Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Ye** Choi, Hannaneh Hajishirzi, and Asli Celikyilmaz. Making ppo even better: Value-guided monte-carlo tree search decoding. arXiv preprint arXiv:2309.15028, 2023.
  • Long (2023) Jieyi Long. Large language model guided tree-of-thought. arXiv preprint arXiv:2305.08291, 2023.
  • Luketina et al. (2019) Jelena Luketina, Nantas Nardelli, Gregory Farquhar, Jakob N. Foerster, Jacob Andreas, Edward Grefenstette, Shimon Whiteson, and Tim Rocktäschel. A survey of reinforcement learning informed by natural language. ArXiv, abs/1906.03926, 2019. URL https://api.semanticscholar.org/CorpusID:182952502.
  • Luo et al. (2023) Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023.
  • Madaan et al. (2024) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36, 2024.
  • Nye et al. (2021) Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021.
  • OpenAI (2023) OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • Peng et al. (2017) Baolin Peng, Xiujun Li, Lihong Li, Jianfeng Gao, Asli Celikyilmaz, Sung** Lee, and Kam-Fai Wong. Composite task-completion dialogue policy learning via hierarchical deep reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2017.
  • Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
  • Ramamurthy et al. (2022) Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kianté Brantley, Jack Hessel, Rafet Sifa, Christian Bauckhage, Hannaneh Hajishirzi, and Ye** Choi. Is reinforcement learning (not) for natural language processing?: Benchmarks, baselines, and building blocks for natural language policy optimization. ArXiv, abs/2210.01241, 2022. URL https://api.semanticscholar.org/CorpusID:252693405.
  • Saunders et al. (2022) William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802, 2022.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.
  • Silver et al. (2017) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
  • Stechly et al. (2024) Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati. On the self-verification limitations of large language models on reasoning and planning tasks. arXiv preprint arXiv:2402.08115, 2024.
  • Sun et al. (2023) Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision. arXiv preprint arXiv:2305.03047, 2023.
  • Sutton & Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Sutton et al. (1999a) Richard S. Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999a. ISSN 0004-3702. doi: https://doi.org/10.1016/S0004-3702(99)00052-1. URL https://www.sciencedirect.com/science/article/pii/S0004370299000521.
  • Sutton et al. (1999b) Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999b.
  • Sutton (1984) Richard Stuart Sutton. Temporal credit assignment in reinforcement learning. University of Massachusetts Amherst, 1984.
  • Taylor et al. (2014) Matthew E Taylor, Nicholas Carboni, Anestis Fachantidis, Ioannis Vlahavas, and Lisa Torrey. Reinforcement learning agents providing advice in complex video games. Connection Science, 26(1):45–63, 2014.
  • Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • Touvron et al. (2023a) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023a.
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  • Valmeekam et al. (2022) Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Large language models still can’t plan (a benchmark for llms on planning and reasoning about change). arXiv preprint arXiv:2206.10498, 2022.
  • Van Eyck & Müller (2012) Gabriel Van Eyck and Martin Müller. Revisiting move groups in monte-carlo tree search. In Advances in Computer Games: 13th International Conference, ACG 2011, Tilburg, The Netherlands, November 20-22, 2011, Revised Selected Papers 13, pp.  13–23. Springer, 2012.
  • Wang et al. (2023a) Peiyi Wang, Lei Li, Zhihong Shao, RX Xu, Damai Dai, Yifei Li, Deli Chen, Y Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. CoRR, abs/2312.08935, 2023a.
  • Wang et al. (2023b) Tianlu Wang, ** Yu, Xiaoqing Ellen Tan, Sean O’Brien, Ramakanth Pasunuru, Jane Dwivedi-Yu, Olga Golovneva, Luke Zettlemoyer, Maryam Fazel-Zarandi, and Asli Celikyilmaz. Shepherd: A critic for language model generation. arXiv preprint arXiv:2308.04592, 2023b.
  • Wang et al. (2022) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
  • Xie et al. (2024) Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, James Xu Zhao, Min-Yen Kan, Junxian He, and Michael Xie. Self-evaluation guided beam search for reasoning. Advances in Neural Information Processing Systems, 36, 2024.
  • Xu et al. (2023) Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
  • Yao et al. (2024) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024.
  • Yu et al. (2023) Longhui Yu, Weisen Jiang, Han Shi, **cheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
  • Yuan et al. (2024a) Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, et al. Advancing llm reasoning generalists with preference trees. arXiv preprint arXiv:2404.02078, 2024a.
  • Yuan et al. (2024b) Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, **g Xu, and Jason Weston. Self-rewarding language models. arXiv preprint arXiv:2401.10020, 2024b.
  • Zhu et al. (2024) Tinghui Zhu, Kai Zhang, Jian Xie, and Yu Su. Deductive beam search: Decoding deducible rationale for chain-of-thought reasoning. arXiv preprint arXiv:2401.17686, 2024.