License: CC BY 4.0
arXiv:2402.02658v1 [cs.AI] 05 Feb 2024

Multi-step Problem Solving Through a Verifier: An Empirical Analysis on Model-induced Process Supervision

Zihan Wang1      Yunxuan Li2      Yuexin Wu2      Liangchen Luo2      Le Hou2      Hongkun Yu2      **gbo Shang1
1University of California, San Diego        2Google
  This work was done during the author’s internship at Google.  Correspondence: [email protected]
Abstract

Large language models often struggle to generate error-free solutions in complex problem-solving scenarios. To address this, recent advancements have adopted a reasoner-verifier framework, where a verifier model is trained in a supervised fashion to evaluate intermediate solution steps generated by a reasoning model. However, this “process supervision” approach requires obtaining ground truth annotations on intermediate solution steps to train the verifier model, which is resource-intensive and expensive. In this paper, we introduce Model-induced Process Supervision (MiPS), a novel method for automating data curation. MiPS annotates an intermediate step by sampling completions of this solution through the reasoning model, and obtaining an accuracy defined as the proportion of correct completions. Errors in the reasoner cause MiPS to underestimate the accuracy of intermediate steps; therefore, we suggest and empirically show that verification focusing on high predicted values of the verifier should be preferred over verification focusing on low predicted values, contrary to prior work. Our approach significantly improves the performance of PaLM 2 on math and coding tasks (accuracy +0.67% on GSM8K, +4.16% on MATH, +0.92% on MBPP compared with an output verifier). Additionally, our study demonstrates that the verifier exhibits strong generalization ability across different reasoning models.

1 Introduction

Multi-step problem solving (e.g., math problems and coding challenges) showcases the capabilities of machine intelligence. While researchers have shown that scaling up models and data remains effective for large language models (LLMs) on multi-step problem solving Achiam et al. (2023); Touvron et al. (2023); Team Gemini et al. (2023); Huang et al. (2022); Azerbayev et al. (2023); Luo et al. (2023a); Yu et al. (2023a), even state-of-the-art LLMs still produce easily observable mistakes. Furthermore, standard fine-tuning directly on the training dataset or on model-generated solutions does not yield consistent and significant improvements Luo et al. (2023a); Yu et al. (2023a); Ni et al. (2022). In Tab. 2, we show the test set performance of PaLM 2-S on a math problem dataset, GSM8K Cobbe et al. (2021), before and after task-specific fine-tuning. We observe that such fine-tuning does not exploit the model’s full potential, as the performance falls far behind that of a standard self-consistency approach Wang et al. (2022), especially considering that the model very likely generates at least one solution with the correct answer among its sampled solutions.

Self-consistency can be seen as an inference-time technique where the goal is to pick one model-generated solution among many. This broadly falls under a reasoner-verifier paradigm (Fig. 2). In particular, the reasoner model generates one or more solutions to a given problem, and the verifier model scores these solutions; the solution with the highest predicted score is selected.
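To make the two selection rules concrete, the following minimal Python sketch contrasts majority voting over final answers (self-consistency) with picking the highest-scoring solution under a verifier. The function names and the hard-coded toy scores are assumptions made for illustration; they do not come from the experiments in this paper.

```python
from collections import Counter
from typing import Callable, Sequence


def self_consistency(answers: Sequence[str]) -> str:
    """Majority vote over the final answers of the sampled solutions."""
    return Counter(answers).most_common(1)[0][0]


def verifier_select(solutions: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the solution to which the verifier assigns the highest score."""
    return max(solutions, key=score)


# Toy usage: final answers parsed from five sampled solutions.
answers = ["18", "20", "18", "17", "18"]
majority = self_consistency(answers)

# Hypothetical verifier scores, hard-coded for illustration only.
toy_scores = {"sol_a": 0.31, "sol_b": 0.92, "sol_c": 0.55}
best = verifier_select(list(toy_scores), toy_scores.get)
```

In practice the `score` callable would wrap a trained verifier; here a dictionary lookup stands in for it.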

Training a verifier in a supervised fashion has demonstrated strong performance on both coding and math problems. Cobbe et al. (2021) showed that by simply gathering correct and incorrect solutions to train a binary classification model, and using this model to pick the highest-confidence solution generated by the reasoner at inference time, the accuracy can be improved significantly. More recent studies suggest that verifying intermediate steps could offer better guidance than training solely on whole solutions Li et al. (2022); Uesato et al. (2022); Paul et al. (2023); Lightman et al. (2023); Feng et al. (2023); Yu et al. (2023b); Liu et al. (2023a); Wang et al. (2023). Verifiers trained (and applied) on intermediate steps are called process supervised verifiers (PSV), whereas those trained on whole solutions are called output supervised verifiers (OSV). In prior work, process supervision data is obtained either by an ad-hoc algorithm Li et al. (2022); Paul et al. (2023), or through expensive human annotations Uesato et al. (2022); Lightman et al. (2023); an automatic and generic way of constructing annotations of intermediate solutions is lacking.

Figure 1: An illustration of the reasoner-verifier paradigm. The verifier predicts scores for the solutions generated by the reasoner, and selects the solution with the highest score.
Method                                Acc.
Base, FT on Train (2K)                61.6
FT on Train (7K)                      62.9
FT on Generated (~100K)               66.0
Self-consistency (Base, 32 paths)     77.1
At Least 1 Solve (Base, 32 paths)     95.6
Figure 2: We fine-tune PaLM 2-S on GSM8K. The base model is tuned with 2K training examples to adhere to the problem format.
Figure 3: The Model-induced Process Supervision (MiPS) data construction method we introduce in this work. By completing an intermediate solution with a reasoner several times, we obtain the percentage of these completions that are correct, which is used to annotate the step-wise solutions.

Training verifier models requires solution-wise or step-wise labels, which are expensive to collect. There has been a series of works following an LLM-as-a-verifier approach, where an off-the-shelf LLM is employed to judge the solutions through prompting Madaan et al. (2023); Kim et al. (2023); Pan et al. (2023). However, while such work may have seen improvements on language tasks, it has not been very successful on math or coding problems Huang et al. (2023); Luo et al. (2023b).

In conclusion, to achieve optimal quality, training data is needed to build a strong verifier model. On the other hand, manually collecting solution verification labels is expensive and not scalable. In this work, we propose to use Monte Carlo sampling on the completions of intermediate solutions to obtain step-wise training annotations (Fig. 3). Specifically, for each intermediate solution, we complete the solution with the reasoner several times through a sampling-based decoding mechanism, and the percentage of the completed solutions that are correct is referred to as the correctness of the intermediate solution. Such correctness scores can be used to train a PSV. Because the construction of this data involves the reasoning model’s completions of the intermediate solutions, we call it Model-induced Process Supervision (MiPS). While such an idea is also explored in a concurrent work Wang et al. (2023), we supplement it with an analysis of using MiPS-constructed data. We find that because the reasoner model, which completes the solutions, is not perfect, the noise it introduces affects the design choices of using the process supervised verifier:

  • We analyze various ways to merge step-wise prediction scores into a single score (we refer to this process as using an aggregation function) when using the verifier. Prior work used aggregation functions that focus on low predicted scores, which worked well for a PSV trained on noise-free human-annotated data Lightman et al. (2023). For the noisy MiPS data, we suggest aggregation functions that focus on high predicted scores.

  • We re-examine the usefulness of process supervision by isolating the trained PSV and studying the benefits of incorporating the score from each intermediate step during verification. Our results reveal that (1) the verification scores from later intermediate steps are indeed useful even for a PSV trained on the noisy MiPS data, whereas the earlier step scores could hurt the verification, depending on the aggregation function; and (2) using only the PSV predicted score of the last step, in a similar fashion to OSV, can sometimes be much better than OSV itself, indicating that process supervision data can regularize OSV training.

  • We show that verifiers trained on MiPS data generated by a reasoner can transfer to validate solutions by a different (and more competent) reasoner. This indicates that MiPS would not produce verifiers that are overly biased towards mistakes of the reasoner that generated the data.

In the remainder of this paper, we provide a more complete review of related work, a precise definition of our method, and empirical results and analysis of the method on two math problem datasets and one coding dataset. The main contributions of the paper are: (1) we propose MiPS to construct process supervision data automatically for training process supervised verifiers; (2) we extend the evaluation of problem solving verifiers to coding problems; (3) we provide empirical analysis on design choices and properties of the verifier trained on MiPS data.

2 Related Works

Advances in problem solving with LLMs can be broadly characterized into two regimes: the first trains a better reasoning model, and the second validates the solutions from the reasoning model at inference time.

Pre-training/Fine-tuning Better Reasoners. Standard training recipes also transfer to training better reasoners for problem solving. During pre-training, larger model sizes and more training compute yield an LLM that is more competent in multiple aspects, including language abilities and problem solving (Achiam et al., 2023; Touvron et al., 2023; Anil et al., 2023, inter alia). Within fine-tuning, it is also observed that transfer learning from a pile of generic math datasets Azerbayev et al. (2023) and training on datasets augmented with failure examples or diverse statements Huang et al. (2022); Luo et al. (2023a); Yu et al. (2023a); Ni et al. (2022) lead to improvements. Despite these approaches, it is apparent that (1) state-of-the-art LLMs can still make simple mistakes during multi-step problem solving and (2) the improvement from a simple verification method, majority voting (self-consistency Wang et al. (2022)), remains significant after fine-tuning. Therefore, the exploration of verifiers to validate and pick the solutions is necessary.

Validating Through LLM-as-a-verifier. There have been numerous attempts at using the LLM reasoner itself to correct and validate its generated solutions. Madaan et al. (2023); Kim et al. (2023) and many methods surveyed in Pan et al. (2023) broadly follow the strategy where the LLM validates and provides feedback on the generated solutions through prompting. Huang et al. (2023) and Luo et al. (2023b) revisited these methods and found that LLMs do not currently serve as good verifiers for equally competent solutions, and that such methods improve only marginally on math word problems.

Validating Through Trained Verifiers. In contrast, verifiers trained on a human-labelled dataset do show significant improvements Cobbe et al. (2021); Uesato et al. (2022); Lightman et al. (2023). Importantly, Lightman et al. (2023) showed that on a challenging competition-level mathematics problem set Hendrycks et al. (2021), verifiers trained on annotated intermediate solutions (PSV) surpass verifiers trained on final solutions by a large margin, and both are substantially better than self-consistency. Other analyses also emphasize the importance of step-wise feedback: Uesato et al. (2022) showed that PSV selects solutions that are more accurate in their reasoning, and Yao et al. (2023); Feng et al. (2023); Liu et al. (2023b), inter alia, showed that during decoding, LLMs can be guided towards better solutions step-by-step. We believe that improving the training of a PSV, and especially identifying a scalable way to generate the process supervision data, is of great importance. Therefore, in this work, we identify an automatic and generic solution to generate process supervision data (MiPS), and conduct a detailed analysis centered on the noise of this automatic process.

Math-Shepherd. Coincidentally, such an automatic process supervision data curation method was studied concurrently and independently by Wang et al. (2023). We share a generally similar methodology with their work, with a few minor design differences that we highlight in later sections. The empirical results of MiPS are similar on the two datasets we share (GSM8K and MATH) despite using different, but comparably competent, LLMs. Their work extended the use of the verifier by applying it to fine-tune the reasoner through reinforcement learning, while our work includes an additional coding dataset (MBPP) and provides analysis on the design choices of using the verifier, addressing the data noise. We believe these two works complement each other.

3 Model-induced Process Supervision

We consider the reasoner-verifier framework where we start with a fairly competent reasoner on a task, generate verifier training data on a given set of problems with the reasoner, and train a verifier on the data to validate newly generated solutions from a reasoner. We first introduce Model-induced Process Supervision, our data curation method that automatically creates process supervision data (annotations of the accuracy of intermediate solutions). Then, we discuss the details of the verification process.

3.1 Scope

We start our introduction of MiPS with the two requirements it is bound to: there must be a solution checker, and the solutions must be decomposable into intermediate steps.

Without a solution checker, MiPS might require as much human annotation power as the original process supervision annotation schema Lightman et al. (2023). Fortunately, many tasks have a checker. For example, several math tasks are written in the form of multi-class classification (MMLU Hendrycks et al. (2020), MATHQA Amini et al. (2019)) or open-ended generation but with a special format that supports automatic parsing and checking (GSM8K Cobbe et al. (2021), MATH Hendrycks et al. (2021)). For coding problems, while the generated code is not directly verifiable, a common approach is to execute it on a series of test cases designed for the problem, and the code is only correct if it passes all the test cases (MBPP Austin et al. (2021), HumanEval Chen et al. (2021)).
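As a minimal sketch of the execution-based checker described above for coding problems, the snippet below runs a candidate program and then a list of test statements, and accepts the program only if nothing fails. It assumes each test case is a Python `assert` statement given as a string; the function name `passes_all_tests` is hypothetical. The sketch omits sandboxing and timeouts, both of which a real checker for model-generated code would need.

```python
from typing import List


def passes_all_tests(code: str, tests: List[str]) -> bool:
    """Return True only if `code` runs and every test statement passes."""
    namespace: dict = {}
    try:
        exec(code, namespace)       # define the candidate function(s)
        for test in tests:
            exec(test, namespace)   # each test is an `assert ...` statement
    except Exception:
        # Any error (syntax error, runtime error, failed assert) counts as incorrect.
        return False
    return True


# Toy usage with a hypothetical problem: implement add(a, b).
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]
good = passes_all_tests("def add(a, b):\n    return a + b", tests)
bad = passes_all_tests("def add(a, b):\n    return a - b", tests)
```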

Step decomposition requires that solutions consist of multiple steps and can be programmatically broken into intermediate solutions. This is also often the case. For example, in the math problems, the common approach is to treat each individual line in the solution as a valid step, so that the decomposition simply divides the solution at the line breaks.
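The line-break decomposition can be sketched in a few lines of Python. The helper names `split_steps` and `prefixes` are hypothetical, and dropping blank lines is an assumption of this sketch rather than a detail stated in the paper.

```python
from typing import List


def split_steps(solution: str) -> List[str]:
    """Treat every non-empty line of a solution as one step."""
    return [line for line in solution.splitlines() if line.strip()]


def prefixes(steps: List[str]) -> List[str]:
    """Return the intermediate solutions: steps 1..k joined, for every k."""
    return ["\n".join(steps[:k]) for k in range(1, len(steps) + 1)]


# Toy solution with three lines, hence three intermediate solutions.
solution = "Tom has 3 bags of 4 apples.\n3 * 4 = 12\n#### 12"
steps = split_steps(solution)
intermediate = prefixes(steps)
```

Each element of `intermediate` is one prefix of the solution; the last prefix is the full solution itself.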

3.2 Obtaining MiPS data

MiPS constructs process supervision data through Monte Carlo sampling. First, we employ a reasoner model $r_g$ to generate a fixed number $n_g$ of solutions for each problem, using temperature-based decoding with a temperature of $t_g$. Then, we decompose each solution into individual steps. After that, for each intermediate solution (a prefix of the list of steps), we employ a reasoner model $r_{mc}$ to generate a fixed number $n_{mc}$ of completions of the intermediate solution, with a temperature of $t_{mc}$. For each intermediate solution, we calculate the fraction of its $n_{mc}$ completions that are correct, and these correctness values comprise the MiPS data. In all experiments in this paper, we consider $r_g = r_{mc}$; namely, the reasoner model used to estimate an intermediate solution’s correctness is the same model that generates the solution data. This is the most challenging case for MiPS, since using a more capable reasoner for the completions would reduce the noise in the MiPS data.
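The annotation loop described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: `complete` stands in for sampling one completion from $r_{mc}$ at temperature $t_{mc}$, `is_correct` stands in for the solution checker, and the toy reasoner and checker below are invented so that the example runs on its own.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical interfaces: complete(problem, prefix) samples one completion of the
# prefix; is_correct(problem, solution) is the solution checker.
Completer = Callable[[str, str], str]
Checker = Callable[[str, str], bool]


def mips_annotate(problem: str, solution: str, complete: Completer,
                  is_correct: Checker, n_mc: int) -> List[Tuple[str, float]]:
    """Label each prefix of `solution` with the fraction of n_mc sampled
    completions that the checker accepts."""
    steps = [line for line in solution.splitlines() if line.strip()]
    labelled = []
    for k in range(1, len(steps) + 1):
        prefix = "\n".join(steps[:k])
        hits = sum(is_correct(problem, complete(problem, prefix)) for _ in range(n_mc))
        labelled.append((prefix, hits / n_mc))
    return labelled


rng = random.Random(0)


def toy_complete(problem: str, prefix: str) -> str:
    # Stand-in for r_mc. A prefix that already states an answer is returned as is.
    if "####" in prefix:
        return prefix
    # Otherwise finish correctly 80% of the time, unless the prefix holds the error.
    ok = "= 13" not in prefix and rng.random() < 0.8
    return prefix + "\n#### " + ("12" if ok else "13")


def toy_check(problem: str, solution: str) -> bool:
    return solution.strip().endswith("#### 12")


problem = "Tom has 3 bags of 4 apples. How many apples?"
good = mips_annotate(problem, "3 * 4 = 12\n#### 12", toy_complete, toy_check, n_mc=8)
bad = mips_annotate(problem, "3 * 4 = 13\n#### 13", toy_complete, toy_check, n_mc=8)
```

In the toy run, the first step of the correct solution receives a score below 1 whenever the imperfect toy reasoner fails some completions, which mirrors the underestimation effect discussed in this paper.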

3.3 Training an Output Supervised Verifier (OSV)

We digress briefly to illustrate how one would train a verifier from output supervision, which later helps explain how the process supervised verifier is trained. The training data for the OSV are the generations from the reasoner $r_g$ with the same temperature value $t_g$. The verifier itself is no different from a standard language model.

3.4 Training a Process Supervised Verifier (PSV)

The main differences between training a PSV and training an OSV are:

  • To enable predicting a score at each step in the solution, we mark the last token of each step (e.g., if each step is represented as a single line, the last token is the newline token), and optimize the step-wise predictions at all steps at the same time. During inference, we likewise obtain a score for each step in a solution.

  • While for the output supervision data, or human labelled process supervision data, the score is either 0 or 1, for MiPS data, the correctness scores are percentage values. The training objective considered in this work is to learn the percentage values $c_i$ for the $i$-th step in the solution directly, through the same binary cross entropy loss. However, we note that it is possible to consider a different learning objective. For example, Wang et al. (2023) considered learning a binarized score:

    \[ \tilde{c_i} = \begin{cases} 1, & \text{if } c_i > 0.0 \\ 0, & \text{otherwise.} \end{cases} \]
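To illustrate the difference between the two objectives, the sketch below computes a binary cross entropy loss against the soft MiPS scores and against their binarized version. It is plain Python on toy numbers; the function names, the clamping constant `eps`, and the example values are assumptions made for illustration only.

```python
import math
from typing import Sequence


def bce(predictions: Sequence[float], targets: Sequence[float], eps: float = 1e-7) -> float:
    """Mean binary cross entropy; targets may be soft labels in [0, 1]."""
    total = 0.0
    for p, c in zip(predictions, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= c * math.log(p) + (1.0 - c) * math.log(1.0 - p)
    return total / len(predictions)


def binarize(c: float) -> float:
    """Binarized label: 1 if any sampled completion was correct, else 0."""
    return 1.0 if c > 0.0 else 0.0


step_scores = [0.75, 0.5, 0.0]   # MiPS correctness per step (toy values)
predicted = [0.7, 0.6, 0.1]      # verifier outputs per step (toy values)
soft_loss = bce(predicted, step_scores)
hard_loss = bce(predicted, [binarize(c) for c in step_scores])
```

With soft targets the loss is minimized when the prediction equals the correctness score itself, whereas the binarized target pushes every step with at least one correct completion towards 1.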

3.5 Aggregating Step-wise Predictions

Table 1: Statistics of the datasets we use in this paper. The average number of steps is reported with a granularity of 0.5, using PaLM 2-S for GSM8K and MBPP, and PaLM 2-L for MATH. We note that these are not the most standard data splits, for reasons explained in Sec 4.2.
Dataset                         GSM8K    MATH    MBPP
Domain                          math     math    coding
Fine-tuning # Data              2000     4000    0
Verification Training # Data    5000     8000    384
Testing # Data                  1319     500     500
Average Steps                   4.5      11.0    7.0

The trained verifier is used to score the solutions generated by the reasoner. For OSV, the verifier prediction can be directly used as the score for the solution. For PSV, the verifier predictions are a list of predicted probabilities $p_1, p_2, \ldots, p_n$, one for each step in the solution. Aggregating the predictions into a final score is necessary. Lightman et al. (2023) considered two aggregation functions:

\[
\begin{split}
&\texttt{min} = \min\{p_1, p_2, \ldots, p_n\}, \\
&\texttt{sum\_logprob} = \sum_{i=1}^{n} \log p_i = \log \prod_{i=1}^{n} p_i,
\end{split}
\]

They claimed that both are equally good aggregation functions. In later analysis, we show that for MiPS data, these two functions underperform for the trained verifier, whereas some tweaks

\[
\begin{split}
&\texttt{max} = \max\{p_1, p_2, \ldots, p_n\}, \\
&\texttt{sum\_logit} = \sum_{i=1}^{n} \log \frac{p_i}{1 - p_i}, \\
&\texttt{mean\_odd} = \frac{\sum_{i=1}^{n} \frac{p_i}{1 - p_i}}{n},
\end{split}
\]

perform much better. We analyze a much larger set of aggregation functions and suggest that the noise in the MiPS data causes the discrepancy.
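The five aggregation functions defined in this section can be written directly in Python. The sketch below is illustrative: the clipping constant that keeps logarithms and odds finite, the helper `select`, and the toy step-wise predictions are assumptions, not values from the experiments.

```python
import math
from typing import Callable, Dict, Sequence


def _clip(p: float, eps: float = 1e-6) -> float:
    # Keep probabilities away from 0 and 1 so that logs and odds stay finite.
    return min(max(p, eps), 1.0 - eps)


AGGREGATORS: Dict[str, Callable[[Sequence[float]], float]] = {
    "min": lambda ps: min(ps),
    "max": lambda ps: max(ps),
    "sum_logprob": lambda ps: sum(math.log(_clip(p)) for p in ps),
    "sum_logit": lambda ps: sum(math.log(_clip(p) / (1.0 - _clip(p))) for p in ps),
    "mean_odd": lambda ps: sum(_clip(p) / (1.0 - _clip(p)) for p in ps) / len(ps),
}


def select(candidates: Dict[str, Sequence[float]], name: str) -> str:
    """Pick the candidate whose aggregated step-wise score is highest."""
    agg = AGGREGATORS[name]
    return max(candidates, key=lambda key: agg(candidates[key]))


# Toy step-wise predictions for two candidate solutions.
candidates = {
    "A": [0.30, 0.95, 0.97],  # one uncertain early step, confident later steps
    "B": [0.60, 0.62, 0.61],  # uniformly moderate confidence
}
choices = {name: select(candidates, name) for name in AGGREGATORS}
```

On these toy numbers, `min` selects candidate B because a single low score dominates, while `max`, `sum_logit`, and `mean_odd` select candidate A, whose later steps carry high scores. The example only shows that the functions can disagree; it says nothing about which candidate is correct.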

4 Analysis

[Figure 4 panels: (a) GSM8K, PaLM 2-S; (b) GSM8K, PaLM 2-L; (c) MATH, PaLM 2-L *; (d) MBPP base model]
Figure 4: We apply the trained output- and process-supervised verifiers to various combinations of models and datasets. No verification and self-consistency are also shown, wherever applicable. We use the default training objective that directly learns the estimated accuracies, and the max aggregation function for the process verifier. On the x-axis, we vary the number of generated solutions to which the verifier is applied, and on the y-axis we plot the performance (accuracy %). The standard deviation is also given. As a reference, we note that the purple line, representing the average performance of the generated solutions of the reasoner without any verification, matches the expectation of an (almost) flat horizontal line with decreasing standard deviation. *While the reasoner that generates MiPS data and the reasoner whose solutions the verifier validates are both PaLM 2-L, the verifier is trained from PaLM 2-S.

4.1 Models

We conduct our experiments on two LLMs, PaLM 2-S and PaLM 2-L Anil et al. (2023). We intend to understand the capability of MiPS data and analyze design choices of the verifier when trained on it. A concurrent work Wang et al. (2023) conducted a similar experiment on a different set of LLMs, namely Llama 2, Llemma, Mixtral, and DeepSeek Touvron et al. (2023); Azerbayev et al. (2023); Jiang et al. (2024); Bi et al. (2024).

4.2 Datasets

We use two math datasets and one coding dataset for evaluations in this paper.

  • GSM8K Cobbe et al. (2021) is a dataset of grade school math problems. The solution is given in a one-line-per-step format with an exact numerical answer in the last line in the format of ####{answer}. To make the reasoner follow this format, we fine-tune the reasoner model on the first 2000 instances of the training set, using the reference solutions from the training set. We use the next 5000 instances to train the verifier and evaluate the verifier on solutions generated by the reasoner on the test set.

  • MATH Hendrycks et al. (2021) is also a math word problem dataset. It consists of problems from high school math competitions. The solutions are given in a format that mixes LaTeX code and natural language. A dedicated solution checker was developed Hendrycks et al. (2021); Lightman et al. (2023). While the dataset itself does not separate steps into different lines, we prompted GPT-4 to break down the reference solutions into one step per line, and fine-tuned the reasoner on the line-separated dataset to make it follow the format. We use the test split suggested in Lightman et al. (2023).

  • MBPP is an entry-level Python programming dataset. The questions are coding challenges accompanied by a test case that defines the function format, and the solutions are Python code that is expected to pass several hidden test cases. We treat each individual line in the generated code as a step. For languages like Python, this corresponds roughly to one statement per step. Due to the small dataset size, we cannot afford to fine-tune the reasoner model, and instead use 3 prompts from the validation split as in-context examples to make sure the model generates code in the expected format.
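For the GSM8K answer format described above, a solution checker can be sketched with a regular expression that reads the number after the last "####" marker. The regular expression, the comma handling, and the function names are assumptions of this sketch; the exact parsing rules used in the experiments are not specified in this paper.

```python
import re
from typing import Optional

# Matches an optionally signed number, possibly with thousands separators,
# after the '####' marker (with or without a space in between).
_ANSWER = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_answer(solution: str) -> Optional[str]:
    """Return the number after the last '####' marker, with commas removed."""
    matches = _ANSWER.findall(solution)
    return matches[-1].replace(",", "") if matches else None


def is_correct(generated: str, reference: str) -> bool:
    """Compare the final answers of two solutions numerically."""
    got, want = extract_answer(generated), extract_answer(reference)
    return got is not None and want is not None and float(got) == float(want)


parsed = extract_answer("3 * 4 = 12\n#### 12")
same = is_correct("3 * 4 = 12\n#### 12.0", "#### 12")
```

A solution without the marker yields `None` and is treated as incorrect, which is one reason the reasoner needs to follow the expected output format.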

Table 1 contains detailed statistics about the datasets.

4.3 Settings

Throughout the paper, we use a temperature of $t_g = t_{mc} = 0.7$ for both constructing MiPS data and generating solutions on the test set. The number of generations for constructing MiPS data, $n_g$, is set to 32 for GSM8K and MBPP, and 8 for MATH. The number of completions $n_{mc}$ is also 32 for GSM8K and MBPP, and 8 for MATH. For GSM8K, we experiment with both PaLM 2-S and PaLM 2-L in the reasoner-verifier framework. For MATH, due to compute constraints, the reasoner is PaLM 2-L and the trained verifier is PaLM 2-S. For MBPP, we find marginal differences in performance between using PaLM 2-S and PaLM 2-L as the reasoner; therefore, we experiment only with PaLM 2-S. During generation, all models are 8-bit quantized, and during training, we use bfloat16 precision. Since MiPS contains an annotation for each intermediate step in the solution, it naturally contains more training labels than output supervision, by a factor equal to the number of steps. Therefore, we additionally generate more data to train the OSV. For training, we follow standard reward model training recipes, with an exception for the number of training epochs. Similar to Lightman et al. (2023), we find it better to train the OSV for 1 epoch and the PSV for 2 epochs. For the OSV on MATH data, we find that training for only 0.2 epochs (essentially training on less data) is better than training longer. For all experiments, we report the average over 5 independently trained verifiers with different random seeds.

4.4 Directly Applying MiPS

[Figure 5 panels: (a) GSM8K, PaLM 2-S; (b) MATH, PaLM 2-L; (c) MBPP, PaLM 2-S]
Figure 5: For each aggregation function, we show its accuracy on the generated MiPS data and the accuracy of using it with the trained PSV. We additionally plot two lines for easier understanding of the figure: a horizontal line corresponding to the performance of OSV, and a vertical line corresponding to the maximum accuracy achievable by the reasoner on the dataset (some problems are not solved by the reasoner in any of the solutions we generate).
[Figure 6 panels: (a) GSM8K, PaLM 2-S, sum_logit; (b) MATH, PaLM 2-L, sum_logit; (c) MBPP, PaLM 2-S, sum_logit; (d) MATH, PaLM 2-L, sum_logit; (e) MATH, PaLM 2-L, max; (f) MATH, PaLM 2-L, sum_odd]
Figure 6: We plot the performance of using various aggregation functions with PSV while restricting it to predict only the last $k$ steps or the last $p$ percentage of steps.

We first present the performance of the process verifier trained on MiPS data, using the default training objective on correctness scores directly and the max aggregation function, on the three datasets. For this experiment, we varied the number of solutions to be verified from 2 to 128, to clearly depict the trend of the compared verifiers’ performance. The results are shown in Fig. 4. The plots convey several pieces of information.

  • It is evident that using any verifier improves significantly upon no verification, matching with the initial assumption that verification plays a vital role in multi-step problem solving.

  • In all experiments, the verifier trained on MiPS using the max aggregation function showed stronger results than output verification. On GSM8K, process verification is better than self-consistency. On MATH, its performance lags slightly behind. We note that this may be because we are training a less competent verifier (PaLM 2-S) than the reasoner (PaLM 2-L). Wang et al. (2023) showed improvements upon self-consistency when the verifier and reasoner are of the same size.

  • The high performance of max may be unexpected if one believes max would be biased towards the first few correct steps of an incorrect solution. In later analysis, we will show that (1) max favors high scores, similar to some other aggregation functions that perform well, which is preferred on MiPS data; and (2) due to higher noise in the earlier steps in MiPS data, the prediction scores of the earlier steps are lower (i.e., model confidence is lower), thus having less effect on max.

  • In all experiments, the sum_logprob (product of probabilities) and min aggregation functions are much worse than max and even than using OSV, although they still provide benefits over not using a verifier.

  • For several verifiers, we observe that performance decreases when the number of generations is high. This is particularly interesting since the larger the number of generations, the more closely the performance should approximate the true verification performance. This indicates that while the verifier might identify some correct solutions with high scores, it also assigns even higher scores to a smaller number of incorrect solutions, a sign of improper generalization.

Based on these results, we focus our analysis on two subjects: (1) the choice of aggregation functions, and (2) the effect of noise on the generalization of the PSV.
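The selection procedure compared in this section can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the candidate answers, and the step scores are all hypothetical, and the verifier is replaced by hard-coded per-step scores.

```python
from collections import Counter


def self_consistency(answers):
    """Majority vote over final answers (no verifier involved)."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers, step_scores, aggregate=max):
    """Return the answer whose aggregated verifier step scores are highest.

    `step_scores[i]` holds the (hypothetical) verifier scores for the steps
    of candidate i; `aggregate` maps that list to a single solution score.
    """
    best = max(range(len(answers)), key=lambda i: aggregate(step_scores[i]))
    return answers[best]


# Hypothetical candidates for one problem: five sampled solutions.
answers = ["18", "20", "20", "18", "18"]
step_scores = [
    [0.6, 0.5, 0.4],
    [0.7, 0.9, 0.95],
    [0.5, 0.6, 0.7],
    [0.4, 0.3, 0.5],
    [0.6, 0.4, 0.3],
]

print(self_consistency(answers))             # majority answer
print(best_of_n(answers, step_scores, max))  # verifier pick with max
print(best_of_n(answers, step_scores, min))  # verifier pick with min
```

In this toy input the majority answer and the verifier's pick differ, which is the situation where verification can help or hurt depending on the quality of the scores.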

4.5 Aggregation Functions

Table 2: We train a verifier based on PaLM 2-S on data generated by PaLM 2-S and test whether it transfers to validating solutions generated by two different reasoners, PaLM 2-L and gpt-turbo-3.5. A reference score for validating solutions generated by PaLM 2-S is also given. We evaluate this on both GSM8K and MBPP.
GSM8K

| PaLM 2-S | No Verifier | Self Consistency | OSV | PSV w/ sum_logit |
| --- | --- | --- | --- | --- |
| → PaLM 2-S | 61.6 | 78.7 | 89.5 | 90.5 |
| → PaLM 2-L | 80.7 | 89.4 | 92.1 | 92.6 |
| → gpt-turbo-3.5 | 72.5 | 86.2 | 88.0 | 89.1 |

MBPP

| PaLM 2-S | No Verifier | OSV | PSV w/ sum_logit | PSV w/ sum_logit (last 3 steps) |
| --- | --- | --- | --- | --- |
| → PaLM 2-S | 41.7 | 56.8 | 54.2 | 57.8 |
| → PaLM 2-L | 42.4 | 56.6 | 55.0 | 57.4 |
| → gpt-turbo-3.5 | 66.2 | 67.6 | 67.6 | 68.2 |

To start the analysis, we consider ten aggregation functions (the sums and means of log probabilities, probabilities, logits, and odds, plus the maximum and minimum value over all steps). We measure each function's performance on the MiPS dataset and plot it against the performance of the verifier using that aggregation function on the test set (Fig. 5). We first observe that, in general, the two performances are positively correlated, indicating that it is possible to select an aggregation function on the MiPS dataset and use it during inference. Second, we notice that both min and sum_logprob perform poorly not only during inference but also on the MiPS data. This indicates that their poor performance is likely related to the construction of MiPS. Indeed, sum_logprob does not correlate highly with a correct solution because it naturally penalizes long solutions. Similarly, min verifies a set of solutions wrongly when the solution with the largest minimum correctness over all steps turns out to be wrong. This is not an unlikely event, particularly on hard datasets where the largest minimum correctness is not high. To clarify these points, we answer the following questions:
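As a rough illustration of the ten aggregation functions listed above, the sketch below builds them from four per-step transforms plus max and min. The clipping constant `EPS`, the dictionary layout, and the example scores are assumptions made for this sketch; the paper does not specify how boundary values are handled.

```python
import math

EPS = 1e-6  # assumed clipping constant so that log and odds stay finite


def _clip(p):
    return min(max(p, EPS), 1.0 - EPS)


# Per-step transforms applied to a verifier score p in [0, 1].
TRANSFORMS = {
    "logprob": lambda p: math.log(_clip(p)),
    "prob": lambda p: p,
    "logit": lambda p: math.log(_clip(p) / (1.0 - _clip(p))),
    "odds": lambda p: _clip(p) / (1.0 - _clip(p)),
}


def build_aggregators():
    """Return the ten aggregators: sum/mean of each transform, plus max and min."""
    aggs = {"max": max, "min": min}
    for name, f in TRANSFORMS.items():
        aggs[f"sum_{name}"] = lambda s, f=f: sum(f(p) for p in s)
        aggs[f"mean_{name}"] = lambda s, f=f: sum(f(p) for p in s) / len(s)
    return aggs


AGGREGATORS = build_aggregators()

# Hypothetical scores: a short solution and a longer one with higher step scores.
short = [0.9, 0.9]
long_ = [0.95] * 5

for name in ("sum_logprob", "sum_logit"):
    print(name, AGGREGATORS[name](short), AGGREGATORS[name](long_))
```

With these example scores, sum_logprob ranks the short solution above the long one even though every step of the long solution scores higher, which is the length penalty described above; sum_logit ranks them the other way.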

What is common among good aggregation functions for MiPS data? We believe that, as a rule of thumb, a good aggregation function is one that values high scores highly. Consider two functions, one that values high scores (selecting the solution with the highest high scores, e.g., max) and one that values low scores (selecting the solution with the highest low scores, e.g., min). The first function is wrong only when the highest-scoring solution is incorrect; in the simple case where there is only one observed step score for each solution, this probability is $1 - s_{\text{max}}$, where $s_{\text{max}}$ is the score. Similarly, for min, the probability that the solution with the highest minimum score is wrong is $1 - s_{\text{min}}$. Since $s_{\text{max}} \geq s_{\text{min}}$, the first function should be preferred. This is in line with the observation from Fig. 5 that aggregations of odds and logits are usually better than those of probabilities and log probabilities.
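The argument above can be written out as a short worked block. It assumes that each score is a calibrated probability of correctness, which is an idealization; the numbers in the final line are hypothetical and only illustrate the inequality. The second group of lines records standard bounds on the four per-step transforms, which is one way to read why odds and logits can reward a very confident step more strongly than probabilities and log probabilities.

```latex
% Single-score case, assuming each score s is a calibrated probability of correctness.
\begin{align*}
P(\text{max selects a wrong solution}) &= 1 - s_{\text{max}}, \\
P(\text{min selects a wrong solution}) &= 1 - s_{\text{min}}, \\
s_{\text{max}} \geq s_{\text{min}} &\;\Longrightarrow\; 1 - s_{\text{max}} \leq 1 - s_{\text{min}}.
\end{align*}
% Range of each transform for a step score p in (0, 1):
\begin{align*}
p &\in (0, 1), &
\log p &\in (-\infty, 0), \\
\log\frac{p}{1-p} &\in (-\infty, \infty), &
\frac{p}{1-p} &\in (0, \infty).
\end{align*}
% Hypothetical example: s_max = 0.9 and s_min = 0.6 give error probabilities 0.1 and 0.4.
```

Only the logit and the odds are unbounded above as $p \to 1$, while $p$ and $\log p$ are capped at 1 and 0 respectively.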

Why did sum_logprob and min work well in Lightman et al. (2023) and Wang et al. (2023)? In Lightman et al. (2023), the dataset is constructed by humans identifying all (earliest) incorrect steps, which corresponds to a prediction of 0 for the verifier (i.e., following the analogy in the previous discussion, this indicates that $s_{\text{max}} = s_{\text{min}} = 1.0$). The min function would be correct on every instance in the training dataset and, if the verifier generalizes well, would resemble human identification of mistakes on the test dataset. For Wang et al. (2023), we note two differences from our MiPS data construction. First, their training objective is to predict the binary value of the correctness score. Second, they use a different reasoner to complete the intermediate solutions than the one that generates the solutions. We leave these differences for further investigation.

The aggregation function analysis might suggest that a high score on the MiPS dataset indicates a good aggregation function. This is not completely correct: as a counterexample, the final step score, which is used to train the output supervised verifier, achieves 100% accuracy on the training dataset, yet it is not as good as the process supervised verifier on the test set. This suggests that the output supervised verifier might encounter some generalization issues from the data, and that MiPS data can help relieve them.

4.6 Different Length Aggregations

To understand how the intermediate steps in process supervision help the verifier model, we illustrate the result of applying an aggregation function to only the last $k$ steps or the last $p$ percent of the steps of the solutions in Fig. 6. In the upper three plots, we show the performance of the aggregation function sum_logit on the three datasets with $1 \leq k \leq 5$. In the lower three plots, we show the performance of three aggregation functions on MATH with $10 \leq p \leq 100$. We note that only the MATH dataset has solutions long enough for a percentage of steps to be meaningful.

  • For all experiments, the performance increases as more steps are considered from the end, up to some point. This indicates that the verifier predictions on the final few steps add value, suggesting that process scores are indeed beneficial.

  • For most experiments, the performance starts to drop after including some earlier steps. This suggests that the quality of the predictions for the first steps is poor. We believe this is because MiPS estimates the first steps less accurately than the last steps, since (1) intuitively, it is hard even for humans to predict the correctness of a very early partial solution, which makes it harder for the verifier to learn, and (2) the Monte Carlo sampling that creates MiPS data also tends to suffer from longer horizons.

  • For max, the performance does not change significantly as more earlier steps are included. We examined the predicted scores and found that earlier steps usually have smaller values, so they contribute little to max. This is further evidence that a PSV trained on MiPS data might suffer from noise in the earlier steps.

  • In all experiments, using the process verifier's predicted value for the last step alone is more beneficial than output supervision. Recall that this is not due to data quantity, as we upscaled the data used to train the output verifier. We suggest that this is because the process supervision data covers more diverse contexts, thus helping the model generalize.
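The truncated aggregation studied in this section, restricting an aggregation function to the last $k$ steps or the last $p$ percent of steps, can be sketched as below. The helper names, the rounding rule in `last_percent` (round up, keep at least one step), the clipping constant, and the example scores are assumptions for illustration; the paper does not state how fractional step counts are handled.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of a step score, clipped (assumed eps) to stay finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def last_k(scores, k):
    """Keep the final k step scores (all of them if the solution is shorter)."""
    return scores[-k:] if k > 0 else []


def last_percent(scores, p):
    """Keep the final p percent of step scores, rounding up, at least one step."""
    n = max(1, math.ceil(len(scores) * p / 100))
    return scores[-n:]


def sum_logit(scores):
    return sum(logit(s) for s in scores)


# Hypothetical verifier scores: noisy, low early steps and confident late steps.
scores = [0.2, 0.3, 0.8, 0.9, 0.95]

print(last_k(scores, 3))
print(last_percent(scores, 40))
print(sum_logit(last_k(scores, 3)), sum_logit(scores))
```

In this toy example the low early scores pull sum_logit down when all steps are included, which mirrors the drop observed once noisy early steps enter the aggregation.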

4.7 Transferring to a Different Reasoner

Finally, we provide an auxiliary experiment to check whether the trained verifiers transfer to different reasoning models. We take the verifiers trained on reasoning data generated by PaLM 2-S and use them to validate solutions generated by stronger reasoners (reasoners with higher No Verifier accuracy). We find that the sum_logit aggregation function works well in this case. The results are shown in Tab. 2: the trained verifier transfers to different and stronger reasoners with strong validation ability, indicating that the verifier is not learning something overly specific to the reasoner that generates the data.

5 Conclusion

In this work, we introduce MiPS to automatically annotate intermediate solutions for multi-step problem solving. Such data can be used to train a process supervised verifier that validates solutions generated by a reasoner. On two math datasets and one coding dataset, we demonstrated that MiPS improves the ability to pick the correct solution over an otherwise identically trained output supervised verifier. We analyze the aggregation function used to pick the solution and suggest that, compared to verifiers trained on human-annotated process supervision, verifiers trained on MiPS data prefer different aggregation functions. We also showed that such verifiers do not overly emphasize the mistakes of the reasoner that produced the data and can be transferred to different reasoners. Future work could explore a scalable way to obtain MiPS data for each token in the solutions, in order to train a more competent verifier and use it to tune the reasoner via reinforcement learning.

References

  • Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Team Gemini et al. [2023] Team Gemini, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • Huang et al. [2022] Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. arXiv preprint arXiv:2210.11610, 2022.
  • Azerbayev et al. [2023] Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics. arXiv preprint arXiv:2310.10631, 2023.
  • Luo et al. [2023a] Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023a.
  • Yu et al. [2023a] Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023a.
  • Ni et al. [2022] Ansong Ni, Jeevana Priya Inala, Chenglong Wang, Alex Polozov, Christopher Meek, Dragomir Radev, and Jianfeng Gao. Learning math reasoning from self-sampled correct and partially-correct solutions. In The Eleventh International Conference on Learning Representations, 2022.
  • Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  • Wang et al. [2022] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
  • Li et al. [2022] Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. Making large language models better reasoners with step-aware verifier. arXiv preprint arXiv:2206.02336, 2022.
  • Uesato et al. [2022] Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process-and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.
  • Paul et al. [2023] Debjit Paul, Mete Ismayilzada, Maxime Peyrard, Beatriz Borges, Antoine Bosselut, Robert West, and Boi Faltings. Refiner: Reasoning feedback on intermediate representations. arXiv preprint arXiv:2304.01904, 2023.
  • Lightman et al. [2023] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023.
  • Feng et al. [2023] Xidong Feng, Ziyu Wan, Muning Wen, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179, 2023.
  • Yu et al. [2023b] Fei Yu, Anningzhe Gao, and Benyou Wang. Outcome-supervised verifiers for planning in mathematical reasoning. arXiv preprint arXiv:2311.09724, 2023b.
  • Liu et al. [2023a] Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, and Yi Zhang. Tinygsm: achieving >80% on gsm8k with small language models. arXiv preprint arXiv:2312.09241, 2023a.
  • Wang et al. [2023] Peiyi Wang, Lei Li, Zhihong Shao, RX Xu, Damai Dai, Yifei Li, Deli Chen, Y Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. CoRR, abs/2312.08935, 2023.
  • Madaan et al. [2023] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023.
  • Kim et al. [2023] Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Language models can solve computer tasks. arXiv preprint arXiv:2303.17491, 2023.
  • Pan et al. [2023] Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, and William Yang Wang. Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies. arXiv preprint arXiv:2308.03188, 2023.
  • Huang et al. [2023] Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798, 2023.
  • Luo et al. [2023b] Liangchen Luo, Zi Lin, Yinxiao Liu, Lei Shu, Yun Zhu, Jingbo Shang, and Lei Meng. Critique ability of large language models. arXiv preprint arXiv:2310.04815, 2023b.
  • Anil et al. [2023] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  • Hendrycks et al. [2021] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
  • Yao et al. [2023] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
  • Liu et al. [2023b] Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, and Asli Celikyilmaz. Making ppo even better: Value-guided monte-carlo tree search decoding. arXiv preprint arXiv:2309.15028, 2023b.
  • Hendrycks et al. [2020] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
  • Amini et al. [2019] Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319, 2019.
  • Austin et al. [2021] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
  • Chen et al. [2021] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
  • Jiang et al. [2024] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
  • Bi et al. [2024] Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.