Natural Language Embedded Programs for
Hybrid Language Symbolic Reasoning
Abstract
How can we perform computations over natural language representations to solve tasks that require symbolic and numeric reasoning? We propose natural language embedded programs (NLEP) as a unifying framework for addressing math/symbolic reasoning, natural language understanding, and instruction following tasks. Our approach prompts a language model to generate full Python programs that define functions over data structures which contain natural language representations of structured knowledge. A Python interpreter then executes the generated code and prints the output. Despite using a task-general prompt, we find that this approach can improve upon strong baselines across a range of different tasks including math and symbolic reasoning, text classification, question answering, and instruction following. We found that the generated programs are interpretable since they outline the exact reasoning process followed by the program interpreter.
Tianhua Zhang†∗, Jiaxin Ge♢∗, Hongyin Luo‡, Yung-Sung Chuang‡, Mingye Gao‡, Yuan Gong‡, Xixin Wu†, Yoon Kim‡, Helen Meng†, James Glass‡
†The Chinese University of Hong Kong, ‡Massachusetts Institute of Technology, ♢Peking University
[email protected], [email protected], [email protected]
∗Equal contribution. Correspondence to Hongyin Luo at [email protected]. Work done while Jiaxin Ge was visiting MIT. Source code is available at https://github.com/luohongyin/LangCode.
1 Introduction
Solving complex language tasks often requires performing computations over natural language representations. For language-based reasoning, chain-of-thought prompting (CoT; Wei et al., 2022) has emerged as a promising approach for surfacing the symbolic reasoning capabilities of large language models (LLMs). However, certain types of computations (e.g., arithmetic) are unnatural to perform in pure language space, and hence present difficulties for LLMs. General-purpose programming languages, on the other hand, provide convenient abstractions as well as predefined libraries and functions for natively implementing many types of symbolic computations, and there has been much recent work on interleaving program calls within CoT-style reasoning to extend the capabilities of LLMs. While promising, existing methods are generally limited to narrow types of tasks such as math and symbolic reasoning (Chen et al., 2022; Cai et al., 2023; Gao et al., 2023), simple API calling (Schick et al., 2023; Paranjape et al., 2023; Liang et al., 2023a), and database accessing (Cheng et al., 2022). These works moreover rely on task-specific prompts which are hard to generalize across datasets.
This work describes a task-general approach for combining the language-based reasoning capabilities of LLMs with symbolic computations enabled by the use of programs. Specifically, we prompt LLMs to generate natural language embedded programs (NLEPs), which are fully executable Python programs containing appropriate package importing, structured natural language representations of knowledge, function definitions for problem solving, and response printing. The generated NLEP is then executed using a Python interpreter that captures the standard output of the program as the response. An example of an NLEP generated by GPT-4 is shown in Figure 1.
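The execution step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper name `run_nlep` and the toy program are assumptions, and a real system would run untrusted generated code in a sandbox.

```python
import contextlib
import io


def run_nlep(program_source: str) -> str:
    """Execute generated Python source and return what it printed."""
    buffer = io.StringIO()
    namespace = {"__name__": "__main__"}
    with contextlib.redirect_stdout(buffer):
        exec(program_source, namespace)  # untrusted code: sandbox this in practice
    return buffer.getvalue().strip()


# Stand-in for a model-generated program (hypothetical, for illustration only).
generated = '''
# Step 1: Import necessary built-in libraries
import math

# Step 2: Define the knowledge needed by the task
radius = 2

# Step 3: Define the problem-solving function
def circle_area(r):
    return math.pi * r ** 2

# Step 4: Print an answer in natural language
print(f"The area of a circle with radius {radius} is {circle_area(radius):.2f}.")
'''

response = run_nlep(generated)
print(response)
```

The captured standard output, rather than the model's raw text, is what gets returned to the user.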
NLEPs use code as a scaffold to reason over natural language representations of data and knowledge. This makes our approach different from Toolformer (Schick et al., 2023) and language model as tool maker (LATM; Cai et al., 2023), which instead use language as the scaffold and interleave API calls within natural language sentences during LLM generation. Compared to program-of-thought (PoT; Chen et al., 2022) and program aided language models (PAL; Gao et al., 2023), which mainly focus on math and symbolic problems, NLEPs handle natural language, math, and symbolic tasks with significantly better generalization ability based on task-general prompts. Compared to existing neuro-symbolic augmentation approaches that need a new logical language (Li et al., 2023) or predefined solvers (Pan et al., 2023), NLEP uses a general programming interface and can thus handle a significantly wider range of tasks with more flexible programming elements, including packages, databases, and APIs.
Experiments across math and symbolic reasoning, question answering and instruction following, and text classification tasks demonstrate that (1) NLEPs conduct accurate reasoning on both structured and unstructured tasks and inputs; (2) NLEP’s step-by-step, meta prompting strategy can significantly improve the prompt efficiency across different tasks. As a result, we conclude that programming language prompting with NLEP is more capable and generalizable than existing natural language and neuro-symbolic prompting strategies.
2 Approach: NLEP Prompting
An NLEP is a program containing both programming code and natural language. NLEPs use natural language in several different ways. First, they use natural language comments to guide step-by-step program generation. Second, language is used to represent structured knowledge through Python’s native data structures (e.g., dictionaries and lists). Finally, an NLEP uses language to print fluent responses to the user input by constructing a standard output string containing references to program variables.
The hybrid language-symbolic design of NLEP enables generalized problem solving for natural language, math, symbolic reasoning, and API calling tasks, which have traditionally been tackled by separate mechanisms. This approach combines the benefits of language-based reasoning with program synthesis: comments and knowledge in natural language improve program generation, while the structured/symbolic reasoning powered by program interpreters provides more accurate computations than would have been obtained via direct decoding from LLMs.
An example of an NLEP for answering a question is shown in Figure 5. In the generated program, each section is preceded by comments in natural language, and the defined counting function uses knowledge stored in a key-value dictionary (which itself is generated from GPT-4’s internal knowledge) to find the correct answer. Finally, the answer is printed through a natural language response. In this example, we generated 5 independent NLEPs and found that they achieve 100% accuracy, compared to 60% for ChatGPT-4 and 40% for the GPT-4 API.
NLEP structure. More generally, each NLEP contains four sections: (1) importing necessary libraries, (2) defining variables containing structured knowledge, (3) implementing problem-solving functions, and (4) printing the response in natural language. Instead of providing direct solutions for each task, we guide the model to arrive at a solution following this four-step process. The structured and modularized programming design disentangles the knowledge extraction and reasoning steps, in contrast to PoT (Chen et al., 2022), where different segments may intertwine. As shown in the example in Figure 1, an NLEP answers the question by constructing a structured knowledge dictionary containing the birthday and start date of the US presidents. To recognize the weekdays, the program utilizes pre-defined functions in the datetime package. The selected answers are stored in a list and then embedded into an output template. The NLEP also handles the situation when no answer is found. The correct answer is then printed by the NLEP.
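To make the four-section layout concrete, here is a hand-written sketch in the spirit of the presidents example. It is not the program from Figure 1: the question (which presidents in a small subset took office after 1950 and were born on a Monday), the four-entry dictionary, and the function name `born_on_weekday` are all illustrative assumptions.

```python
# Step 1: Import necessary built-in libraries
from datetime import date

# Step 2: Define structured knowledge (a small illustrative subset, not the paper's full dictionary)
us_presidents = {
    "Jimmy Carter": {"birthday": "1924-10-01", "term_start": "1977-01-20"},
    "Ronald Reagan": {"birthday": "1911-02-06", "term_start": "1981-01-20"},
    "Bill Clinton": {"birthday": "1946-08-19", "term_start": "1993-01-20"},
    "Barack Obama": {"birthday": "1961-08-04", "term_start": "2009-01-20"},
}


# Step 3: Define the problem-solving function
def born_on_weekday(presidents, weekday_index, start_year):
    """Return presidents who took office after start_year and were born on the given weekday (Monday = 0)."""
    selected = []
    for name, info in presidents.items():
        birthday = date.fromisoformat(info["birthday"])
        term_start = date.fromisoformat(info["term_start"])
        if term_start.year > start_year and birthday.weekday() == weekday_index:
            selected.append(name)
    return selected


# Step 4: Print the answer, handling the case where nothing matches
answer = born_on_weekday(us_presidents, weekday_index=0, start_year=1950)
if answer:
    print("Presidents in this subset born on a Monday: " + ", ".join(answer) + ".")
else:
    print("No president in this subset was born on a Monday.")
```

Keeping the knowledge (step 2) separate from the reasoning (step 3) means a factual error can be fixed by editing the dictionary alone, without touching the weekday logic.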
Task-general demonstration prompts. As is standard in chain-of-thought prompting (Nye et al., 2021; Wei et al., 2022), our approach uses demonstration prompts for NLEP generation. However, unlike previous approaches our demonstrations are not task-specific. The unified NLEP structure motivates models to solve different problems in the same disentangled four steps, bypassing the necessity for task-specific, in-domain examples to explicitly teach the models for a particular task. For example, for all classification tasks we consider we use the same demonstration prompt (derived from SST2). Similarly, we use mostly the same prompt for our math and symbolic reasoning tasks. This task-general prompt is similar in spirit to zero-shot chain-of-thought prompting (Kojima et al., 2023) which adds a task-agnostic prompt (“Let’s think step-by-step”) to elicit the reasoning capabilities of LLMs in a task-agnostic way. The prompts used for the various tasks are given in Table 1, and the exact prompts are given in Appendix D. In summary, we use 4 different demonstration prompts across 16 tasks, each of which works well within a task category. Thus, while the proposed method is not fully task-agnostic in the strictest sense of the term, it is still more flexible than previous approaches that combine program synthesis with chain-of-thought prompting (Chen et al., 2022; Gao et al., 2023), which use examples from the dataset to craft prompts.
Programmatic reasoning for natural language understanding. Prior works on combining program synthesis with LLM-based reasoning have generally focused on math and symbolic reasoning tasks (Chen et al., 2022; Gao et al., 2023), and it has not been clear how such methods could be extended to address natural language understanding (NLU) tasks. We show that NLEPs can be straightforwardly extended to text classification tasks.
For question answering, we apply NLEP prompting and the target output is constructed by the generated programs. Classification tasks, on the other hand, are handled by a different type of NLEP consisting of a decision tree, since standard program-based reasoning may not translate seamlessly to text-based classification or yield optimal performance. Each node of the decision tree is annotated by a simple natural language sentence, and the Yes/No decisions at each node are handled in a zero-shot way by an entailment classifier, which has in general been shown to be an effective approach to zero-shot text classification (Obamuyide and Vlachos, 2018; Condoravdi et al., 2003; Ge et al., 2023). Concretely, given the tree, we compute the entailment score between the input and the language description associated with each node and traverse the decision tree until a leaf node is reached. We emphasize that the topology of the tree and the textual description of each node are generated by the prompted LLM only once per benchmark. The per-question decisions are made by the smaller entailment model during tree traversal. Consequently, NLEP leverages LLMs to efficiently construct decision trees, assisting smaller models in making informed decisions and improving overall performance. The same demonstration prompt, a manually constructed example from SST2 (Wang et al., 2018), is used for all classification benchmarks. We find that this prompt can generate NLEPs containing sensible decision trees for various classification tasks without requiring task-specific examples. An example of the generated program and the corresponding decision tree is shown in Figure 2.
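The tree traversal described above can be sketched as below. This is only an outline under stated assumptions: the `Node` class, the 0.5 threshold, and especially `toy_entail_score` (a keyword stub standing in for a real entailment model such as RoBERTa or DeBERTa) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """A tree node: internal nodes carry a hypothesis, leaves carry a label."""
    hypothesis: Optional[str] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    label: Optional[str] = None


def classify(text: str, node: Node, entail_score: Callable[[str, str], float],
             threshold: float = 0.5) -> str:
    """Walk from the root to a leaf, branching on the entailment score."""
    while node.label is None:
        node = node.yes if entail_score(text, node.hypothesis) >= threshold else node.no
    return node.label


def toy_entail_score(premise: str, hypothesis: str) -> float:
    """Placeholder scorer based on keyword cues; NOT a real entailment model."""
    cues = {"This review is positive.": ("great", "love", "wonderful")}
    return 1.0 if any(w in premise.lower() for w in cues.get(hypothesis, ())) else 0.0


# The tree itself would be generated once per benchmark by the prompted LLM.
tree = Node(
    hypothesis="This review is positive.",
    yes=Node(label="positive"),
    no=Node(label="negative"),
)

print(classify("I love this film, it is wonderful.", tree, toy_entail_score))
print(classify("A dull and lifeless script.", tree, toy_entail_score))
```

Because the scorer is passed in as a function, the same tree can be reused with different entailment models without regenerating it.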
3 Experiments
We evaluate natural language embedded programs (NLEPs) on 16 tasks across three broad task categories. The tasks and corresponding prompts are summarized in Table 1.
Math and symbolic reasoning tasks include Tracking Shuffled Objects, Dyck Language, Word Sorting and Chinese Remainder Theorem from BigBench (Srivastava et al., 2023), Scheduling Meeting task from Cai et al. (2023), GSM-Hard¹ benchmark of math word problems from Gao et al. (2023), and Game of 24 (Yao et al., 2023a). We use two examples for all tasks except for Game of 24, for which we applied a word sorting example to elicit stronger game-playing reasoning ability. The exact NLEP prompts we used are given in Appendix D.1 and D.2.
¹ We opted for GSM-Hard to mitigate potential data contamination from GSM8K. The numbers in GSM-Hard are large and less likely to have been encountered by LLMs during training, which increases the task complexity. However, since GSM-Hard is automatically constructed, there are cases where the answers are not reasonable or the questions appear peculiar. We detail the analysis in Appendix B.
Question answering and instruction following tasks include the StrategyQA (Geva et al., 2021a), TruthfulQA (Lin et al., 2022), and VicunaQA (Chiang et al., 2023) benchmarks. StrategyQA requires models to answer multi-hop questions with “Yes” or “No”. TruthfulQA and VicunaQA contain questions and instructions requiring free-form responses. VicunaQA also allows us to test how NLEPs perform in the popular instruction-following setting. The evaluation metrics on question answering focus on accuracy, relevance, and factuality of the generated answers. The prompts in Appendix D.1 are used for StrategyQA. For TruthfulQA and VicunaQA, we added an example with a longer response to encourage more detailed response generation.
Text classification tasks include tasks that require understanding of both natural language inputs and labels. We evaluate NLEP on movie-review classification (SST2; Socher et al., 2013), linguistic acceptance (CoLA; Warstadt et al., 2019), emotion classification (Saravia et al., 2018), Amazon review (Ni et al., 2019), hate speech detection (de Gibert et al., 2018), and stereotypes recognition (Sap et al., 2019). We use the prompts in Appendix D.1 for model-free classification. For decision tree generation, the prompts in Appendix D.4 are applied.
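Table 1: Tasks and corresponding prompts.
Domain | Task | Prompt |
---|---|---|
Math and Symbolic Reasoning | Tracking Shuffled Objects (7) | D.1 |
Math and Symbolic Reasoning | Dyck Language | D.1 |
Math and Symbolic Reasoning | Word Sorting | D.1 |
Math and Symbolic Reasoning | Chinese Remainder Theorem | D.1 |
Math and Symbolic Reasoning | Scheduling Meeting | D.1 |
Math and Symbolic Reasoning | GSM-Hard | D.1 |
Math and Symbolic Reasoning | Game of 24 | D.2 |
Question Answering | StrategyQA | D.1 |
Question Answering | TruthfulQA | D.3 |
Question Answering | VicunaQA | D.3 |
Text Classification | SST2 | D.4 |
Text Classification | CoLA | D.4 |
Text Classification | Emotion-Classification | D.4 |
Text Classification | Amazon Review | D.4 |
Text Classification | Hate-Speech | D.4 |
Text Classification | Social Bias Frame | D.4 |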
3.1 Math and Symbolic Reasoning
We compare NLEP prompting with chain-of-thought (CoT; Wei et al., 2022), program-of-thought (PoT; Chen et al., 2022), and LLMs as tool makers (LATM; Cai et al., 2023). We also compare against tree-of-thought (ToT; Yao et al., 2023a) on the Game of 24 benchmark, where ToT outperforms CoT by a significant margin (but requires many more calls to the LLM). We evaluate CoT and PoT with both task-general and task-specific demonstrations. Since LATM needs in-domain input-output pairs to create tools, we only report the results with task-specific LATM.
Task-general prompting. For task-general prompts we use two examples as the in-context demonstration for the math and symbolic reasoning benchmarks (see Table 1 and Appendix D). For CoT, we present two examples with intermediate reasoning represented in natural language rather than as programs. Our task-general PoT implementation writes the math and symbolic reasoning steps as code, similar to Chen et al. (2022) and Gao et al. (2023), but omits the step-by-step programming scheme of NLEP; it thus serves as an ablation.
Task-specific prompting baselines. We report the task-specific prompting performance as an “upper bound” for each task. For CoT, we use the same prompting settings (from 3 to 8-shot) adopted in previous studies (Cobbe et al., 2021; Cai et al., 2023; Fu et al., 2023). For PoT, we use the same in-context examples as in the task-specific CoT examples, but provide intermediate reasoning steps in Python code. On the GSM-Hard benchmark, we adopt the demonstrations (9-shot) for GSM8K used in Chen et al. (2022). For the Chinese Remainder Theorem and Scheduling Meeting benchmarks, we construct the in-context examples with the first three successful instances of task-general PoT. For LATM, we evaluate its performance on Tracking Shuffled Objects (7) using the provided tool and cite the results for other tasks from Cai et al. (2023). Details are shown in Appendix E.
Program synthesis approaches (PoT and NLEP) may sometimes generate non-executable programs when they lack task-specific programming demonstrations. For both approaches, on certain benchmarks we resample up to three additional programs if the returned program fails at execution. Since this condition is triggered only if program execution fails, there is no label leakage. We discuss this further in Section 4 and provide detailed results in Appendix B.
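The resampling rule above can be sketched as a small loop. The function name `run_with_retries` and the `sample_program` callable (which would wrap an LLM call in practice) are assumptions made for illustration; only the "up to three additional programs on execution failure" policy comes from the text.

```python
import contextlib
import io
from typing import Callable, Optional


def run_with_retries(sample_program: Callable[[int], str], max_retries: int = 3) -> Optional[str]:
    """Return stdout of the first program that executes, or None if all attempts fail."""
    for attempt in range(1 + max_retries):
        source = sample_program(attempt)
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(source, {"__name__": "__main__"})
        except Exception:
            continue  # only an execution failure triggers a retry; no labels are consulted
        return buffer.getvalue().strip()
    return None


# Hypothetical sampler: the first draft has a bug, the second one runs.
drafts = ["print(undefined_name)", "print(sorted(['pear', 'apple', 'fig']))"]
result = run_with_retries(lambda attempt: drafts[min(attempt, len(drafts) - 1)])
print(result)
```

The retry condition depends only on whether the interpreter raised an error, never on whether the printed answer is correct, which is why the scheme does not leak labels.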
Table 2: Main results on six math and symbolic reasoning tasks.
GPT-4: (a) task-specific and (b) task-general prompting.
Task | (a) Task-Specific CoT | (a) Task-Specific PoT | (a) Task-Specific LATM | (b) Task-General CoT | (b) Task-General PoT | (b) Task-General NLEP |
---|---|---|---|---|---|---|
Tracking Shuffled Objects | 100.0 | 100.0 | 100.0 | 81.2 | 98.4 | 100.0 |
Dyck Language | 63.6† | 60.8 | 87.5† | 39.6 | 66.4 | 91.6 |
Word Sorting | 90.9† | 100.0 | 99.1† | 84.4 | 99.6 | 99.6 |
Chinese Remainder Theorem | 0.0† | 100.0 | 100.0† | 0.0 | 84.4 | 97.2 |
Scheduling Meeting | 55.6† | 75.2 | 100.0† | 82.8 | 85.2 | 93.2 |
GSM-Hard | 57.4 | 74.1 | – | 54.9 | 69.3 | 67.7 |
Average | 61.3 | 85.0 | 97.3 | 57.2 | 83.9 | 91.6 |
GPT-3.5-Turbo: (c) task-specific and (d) task-general prompting.
Task | (c) Task-Specific CoT | (c) Task-Specific PoT | (d) Task-General CoT | (d) Task-General PoT | (d) Task-General NLEP |
---|---|---|---|---|---|
Tracking Shuffled Objects | 68.0 | 6.8 | 51.2 | 88.4 | 74.4 |
Dyck Language | 20.4† | 28.4 | 38.0 | 4.0 | 7.2 |
Word Sorting | 59.2† | 100.0 | 75.2 | 100.0 | 99.6 |
Chinese Remainder Theorem | 0.0† | 100.0 | 0.0 | 72.4 | 96.4 |
Scheduling Meeting | 18.9† | 33.6 | 39.6 | 49.2 | 85.6 |
GSM-Hard | 45.0 | 63.4 | 42.8 | 52.2 | 54.1 |
Average | 35.3 | 55.4 | 41.1 | 61.0 | 69.6 |
3.1.1 Results
We show the main results of NLEP prompting on six math and symbolic reasoning tasks in Table 2. An example of NLEP generated for solving a Dyck language problem is shown in Figure 3(a).
GPT-4 Results. Among the three approaches employing task-general prompts, NLEP demonstrates superior performance over CoT across 5 of 6 tasks and outperforms PoT on 4 of 6 tasks. Additionally, NLEP achieves equivalent performance to PoT on the Word Sorting benchmark. The large performance gap between NLEP and CoT suggests that programmatic reasoning can enable more accurate answers. Compared to PoT, NLEP achieves significantly higher average accuracy, especially on the Dyck Language (66.4% → 91.6%) and the complex Chinese Remainder Theorem (84.4% → 97.2%) tasks. On GSM-Hard, we confirm the same phenomenon discovered by Gao et al. (2023) where language does not further benefit the calculation accuracy with GPT-4.
NLEP also achieves comparable performance to task-specific, few-shot prompting methods. Notably, our method achieves the best performance on Tracking Shuffled Objects (7) and Dyck Language, and outperforms task-specific CoT on many benchmarks. On the Word Sorting benchmark, NLEP only fails on one instance where the input word sequence contains “steelmake” and GPT-4 automatically corrected it to “steelmaker”. We find that the high scores of task-specific PoT on Word Sorting and Chinese Remainder Theorem come from the generally applicable programming code from the in-context demonstrations.
GPT-3.5 Results. We observe a significant decline in performance with GPT-3.5 for all reasoning approaches across nearly all benchmarks. However, NLEP still achieves the best average performance, exhibiting significant improvement on 5 of 6 tasks over both task-specific and task-general CoT baselines. NLEP exceeds task-general PoT notably across 4 tasks and demonstrates comparable performance on the Word Sorting benchmark. On the Dyck Language benchmark, program-based strategies (PoT and NLEP with task-general prompts) failed to solve the problem without task-specific examples, highlighting the need for strong backbone LLMs.
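For readers unfamiliar with the Dyck Language task, the kind of program a model needs to write is a stack-based bracket matcher. The sketch below is hand-written for illustration and is not the generated program from Figure 3(a); the bracket set and the input sequence are assumptions.

```python
# Step 1: Define the bracket knowledge
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}


# Step 2: Define the problem-solving function
def complete_dyck(prefix: str) -> str:
    """Return the closing brackets that balance a valid Dyck prefix."""
    stack = []
    for ch in prefix:
        if ch in PAIRS:
            stack.append(ch)
        elif ch in PAIRS.values():
            if not stack or PAIRS[stack.pop()] != ch:
                raise ValueError(f"unbalanced closing bracket: {ch!r}")
        # any other character (e.g. a space) is ignored
    # Close the still-open brackets, innermost first.
    return "".join(PAIRS[ch] for ch in reversed(stack))


# Step 3: Print the answer
sequence = "[ { ( < > ) "
print(f"The sequence is completed by: {complete_dyck(sequence)}")
```

A program like this is exact once written, which is consistent with the observation that results on this task depend heavily on whether the backbone model can produce a correct program at all.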
Table 3: Results on the Game of 24 task.
Prompt | Method | Accuracy (%) |
---|---|---|
Task-specific | CoT | 4 |
Task-specific | ToT (b = 1) | 45 |
Task-specific | ToT (b = 5) | 74 |
Task-general | PoT | 52 |
Task-general | NLEP | 66 |
Game of 24 results. Table 3 shows the results on the challenging Game of 24 task from Yao et al. (2023a). Our approach surpasses the oracle setup of IO/CoT, which calculates the success rate of IO/CoT by considering the best of 100 samples for each instance. Moreover, unlike ToT, which requires in-context demonstrations for each decomposed sub-task, NLEP prompting achieves a significant performance gain over ToT (b=1) without requiring a computationally expensive multi-chain inference procedure.
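One way to see why a single generated program can replace a multi-chain search is that Game of 24 admits an exhaustive solver. The sketch below is a generic brute-force search written for illustration; it is an assumption about what such a program could look like, not the program GPT-4 produced in the experiments.

```python
from fractions import Fraction
from itertools import combinations
from typing import List, Optional, Tuple

Item = Tuple[Fraction, str]  # (value, expression that produced it)


def solve_24(numbers: List[int], target: int = 24) -> Optional[str]:
    """Return an expression using every number once that equals target, or None."""
    def search(items: List[Item]) -> Optional[str]:
        if len(items) == 1:
            return items[0][1] if items[0][0] == target else None
        # Combine any two remaining items with each operator, then recurse.
        for i, j in combinations(range(len(items)), 2):
            (a, ea), (b, eb) = items[i], items[j]
            rest = [items[k] for k in range(len(items)) if k not in (i, j)]
            candidates = [(a + b, f"({ea} + {eb})"), (a * b, f"({ea} * {eb})"),
                          (a - b, f"({ea} - {eb})"), (b - a, f"({eb} - {ea})")]
            if b != 0:
                candidates.append((a / b, f"({ea} / {eb})"))
            if a != 0:
                candidates.append((b / a, f"({eb} / {ea})"))
            for item in candidates:
                found = search(rest + [item])
                if found is not None:
                    return found
        return None

    return search([(Fraction(n), str(n)) for n in numbers])


expression = solve_24([4, 9, 10, 13])
print(f"One solution: {expression} = 24" if expression else "No solution found.")
```

Exact `Fraction` arithmetic avoids floating-point errors on intermediate divisions; the search itself runs inside the interpreter, so it costs no additional LLM calls.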
3.2 QA and Instruction Following
Table 4: Results on StrategyQA.
Task | GPT-4, Task-specific CoT | GPT-4, Task-general CoT | GPT-4, Task-general PoT | GPT-4, Task-general NLEP (ours) | GPT-3.5-Turbo, Task-specific CoT | GPT-3.5-Turbo, Task-general CoT | GPT-3.5-Turbo, Task-general PoT | GPT-3.5-Turbo, Task-general NLEP (ours) |
---|---|---|---|---|---|---|---|---|
StrategyQA | 81.7 | 78.6 | 68.6 | 81.2 | 71.6 | 68.1 | 50.2 | 61.1 |
StrategyQA. Experiment results are presented in Table 4. With GPT-4, NLEP achieves the best performance under the task-general prompt setting and is competitive with the task-specific CoT. With GPT-3.5, although the scores of code-based strategies decrease more than those of CoT (PoT: 18.4 points, NLEP: 20.1, task-general CoT: 10.5, task-specific CoT: 10.1), NLEP still exceeds PoT by a significant margin. An example output is shown in Figure 3(b).
Table 5: Results on TruthfulQA.
Foundation Model | Mode | True | Info | True * Info |
---|---|---|---|---|
GPT-4 | Text | 76.01 | 97.55 | 73.56 |
GPT-4 | NLEP | 75.76 | 99.63 | 75.40 |
GPT-3.5-Turbo | Text | 68.91 | 98.90 | 67.93 |
GPT-3.5-Turbo | NLEP | 61.69 | 97.18 | 59.00 |
TruthfulQA. We also evaluate how NLEP prompting influences the factuality of question answering with the TruthfulQA benchmark (Lin et al., 2022). A fine-tuned GPT-3 model is applied for automatic scoring. In this experiment, we compare the vanilla auto-regressive text generation method against NLEP. Traditionally, such question answering tasks have been solved only with black-box language models without explicit symbolic computations due to the complexity of test questions.
The results are shown in Table 5. With GPT-4, the truth score of the NLEP prompting strategy is close to that of standard LLM-based generation, while the informativeness score is higher. However, performance degrades significantly with GPT-3.5-Turbo, indicating a strong dependence on the programming ability of the underlying language model.
Figure 4: Results on the VicunaQA benchmark.
Model | # NLEP > Text | Detail | % Score | % Length Bias |
---|---|---|---|---|
GPT-4 | 23.75 | yes | 93.08 | 72.72 |
GPT-4 | 23.75 | no | 105.06 | 26.67 |
GPT-3.5-Turbo | 38.75 | yes | 101.22 | 3.13 |
GPT-3.5-Turbo | 38.75 | no | 102.50 | 10.34 |
VicunaQA. Results on the VicunaQA benchmark are shown in Figure 4, where we follow the standard approach and evaluate the answers using GPT-4. We find that GPT-4 prefers its own generations, which are generally more detailed than GPT-3.5-Turbo and NLEP responses. To control for the bias due to response lengths, we also assess all responses without the requirement about details using another evaluation prompt. The evaluation prompts with and without the requirement on details are shown in Appendix F.1 and F.2.
As we demonstrate in Figure 4, this assessment leads to different results on GPT-4. After removing the detail requirement in the automatic scoring pipeline, NLEP achieves better performance. This suggests that NLEP can help GPT-4 generate accurate, factual, and relevant responses. However, human-generated programs for pretraining the GPT-4 models usually do not embed long pieces of natural language. As a result, the responses generated by NLEP have a limited level of detail.
3.3 Text Classification
Finally, we evaluate whether NLEPs can be applied to solve text classification tasks that have traditionally been difficult for pure program synthesis-based approaches. As discussed in Section 2, we manually construct a decision tree NLEP for SST2 and use it as a prompt to guide GPT models to generate decision trees for other tasks only with task and label descriptions. We highlight that the decision tree is generated once per benchmark. Subsequently, all decision-making for each specific question is carried out with RoBERTa/DeBERTa, which performs the tree traversal re-using the generated tree topology. An example input and output NLEP generated by GPT-4 for emotion classification is shown in Figure 2.
Table 6: Text classification performance (number of classes in parentheses).
Model | Method | cola (2) | emotion (6) | amazon (5) | hsd (2) | sbic (3) | Average |
---|---|---|---|---|---|---|---|
RoBERTa | Multi-class Prompting | 65.87 | 49.2 | 33.31 | 67.78 | 52.99 | 53.83 |
RoBERTa | Human-Generated Tree | 69.03 | 22.20 | 26.88 | 64.85 | 58.37 | 48.27 |
RoBERTa | NLEP w/ GPT-3.5 | 56.66 | 35.1 | 33.46 | 67.36 | 38.25 | 46.17 |
RoBERTa | NLEP w/ GPT-4 | 68.94 | 54.5 | 38.88 | 70.92 | 55.95 | 57.65 |
DeBERTa | Multi-class Prompting | 53.50 | 51.93 | 37.01 | 67.78 | 59.08 | 53.86 |
DeBERTa | Human-Generated Tree | 69.22 | 32.15 | 33.00 | 72.18 | 55.02 | 52.31 |
DeBERTa | NLEP w/ GPT-3.5 | 49.66 | 39.00 | 36.18 | 70.29 | 52.49 | 49.52 |
DeBERTa | NLEP w/ GPT-4 | 68.36 | 55.4 | 40.2 | 70.08 | 59.68 | 58.74 |
None | Model-free NLEP w/o Tree | 69.13 | 40.55 | 25.76 | 59.62 | 37.63 | 46.54 |
We compare NLEP against two baseline methods. Our first baseline uses the zero-shot classification method proposed in Ge et al. (2023) (“multi-class prompting”). This method uses the same entailment models but makes the prediction without the tree structure. Our second baseline asks a human expert to design a decision tree for each task, also based on the SST2 example. The results in Table 6 show that NLEP generated by GPT-4 outperforms the multi-class prompting and human-generated tree baselines on most datasets.
Model-free NLEP. We also tried using the task-general prompt shown in D.1 to generate NLEPs that directly use programs to solve these tasks. These programs do not need any neural models and are hence very efficient (e.g., finishing the entire validation set in about 2 seconds on CPUs). The results can be found in Table 6 (“Model-free NLEP”). While not achieving the performance of entailment-based methods, the generated NLEP significantly outperforms random baselines, suggesting that this may be a viable approach for quickly extracting simple and interpretable classifiers from LLMs.
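A model-free classifier of this kind might look like the rule-based sketch below. The text does not show the generated programs, so everything here is an assumption for illustration: the label set, the keyword lists, and the overlap-counting rule are hypothetical.

```python
# Step 1: Import necessary built-in libraries
import re

# Step 2: Define keyword knowledge (hypothetical word lists, for illustration only)
label_keywords = {
    "joy": {"happy", "delighted", "glad", "excited"},
    "sadness": {"sad", "lonely", "miserable", "grief"},
    "anger": {"angry", "furious", "annoyed", "outraged"},
}


# Step 3: Define the problem-solving function
def classify(text, keywords, default="unknown"):
    """Pick the label whose keyword set overlaps most with the text."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    scores = {label: len(tokens & words) for label, words in keywords.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


# Step 4: Print the prediction
sentence = "I was furious and annoyed after the meeting."
print(f"The predicted emotion is: {classify(sentence, label_keywords)}")
```

A program like this involves only set operations, which is consistent with the reported ability to process a validation set in seconds on a CPU, and every prediction can be traced back to the keywords that triggered it.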
4 Discussion
Table 7: Performance of task-general PoT and NLEP with more foundation models.
Task | CodeLlama7b, Task-Specific PoT | CodeLlama7b, Task-General PoT | CodeLlama7b, Task-General NLEP | NLEP-CodeLlama7b, Zero-Shot NLEP | CodeLlama13b, Task-General PoT | CodeLlama13b, Task-General NLEP | Claude2, Task-General PoT | Claude2, Task-General NLEP |
---|---|---|---|---|---|---|---|---|
Tracking Shuffled Objects | 95.6 | 21.2 | 30.0 | 84.4 | 23.2 | 23.2 | 93.6 | 96.0 |
Dyck Language | 15.2 | 0.8 | 0.8 | 1.2 | 2.0 | 2.8 | 33.6 | 60.4 |
Word Sorting | 78.0 | 98.0 | 93.2 | 98.4 | 95.6 | 97.6 | 99.6 | 99.6 |
Chinese Remainder Theorem | 100.0 | 0.0 | 18.8 | 0.0 | 1.2 | 27.2 | 40.4 | 60.4 |
Scheduling Meeting | 32.0 | 4.0 | 24.8 | 34.4 | 5.6 | 8.8 | 10.4† | 24.0† |
Execution failures and retries. While the task-general PoT and NLEP lack programming demonstrations for the target task, GPT-4 in general is able to generate bug-free programs as presented in Appendix B Table 11. Notably, both PoT and NLEP obtain an execution error rate of 0 on the Tracking Shuffled Objects (7) and Word Sorting tasks. One advantage of program synthesis approaches such as PoT and NLEP is that non-executable programs can be identified and filtered. This gives LLMs the chance to “self-correct” and generate new answers, and we take advantage of this in our math and symbolic reasoning tasks by generating up to three programs if there is an execution failure on certain benchmarks. (For fair comparison we apply this reattempting scheme to PoT as well.) We ablate this mechanism in Appendix B, Tables 8, 9 and 11. Besides effectively reducing the execution error as presented in Table 11, these retries greatly enhance the reasoning accuracy. In particular, improvements of 12% and 15.6% are observed on the Chinese Remainder Theorem and the Scheduling Meeting tasks in Table 8(b). In this work we only experiment with extra retries at larger temperatures for diverse sampling and leave more advanced “self-correction” algorithms (e.g., those that make use of error messages (Cai et al., 2023; Hu et al., 2023)) for future work.
Different foundation LLMs for NLEP. The performance of task-general PoT and NLEP with more foundation models is reported in Table 7. We provide detailed analysis in Appendix B. Overall, NLEP demonstrates superior performance compared to task-general PoT, particularly excelling in complex tasks such as the Chinese Remainder Theorem. We note the trend of performance decline on the reasoning tasks as the abilities of the underlying LLMs decrease from GPT-4, Claude-2, and GPT-3.5-Turbo to CodeLlama (Rozière et al., 2023). However, this observation is not limited to NLEP; it also applies to other prevalent reasoning approaches like PoT. For example, on the Dyck Language task, GPT-3.5-Turbo only achieves 7.2% (4.0%) accuracy with NLEP (PoT) prompting, while GPT-4 improves these figures to 91.6% (66.4%). This indicates that strong programming ability of the underlying LLM is vital to generate accurate responses and attain satisfactory performance. Surprisingly, zero-shot CodeLlama-7b (Rozière et al., 2023) trained using NLEP-style data (without in-domain examples) outperforms NLEP-prompted GPT-3.5 on the Tracking Shuffled Objects (7) benchmark and task-general CoT-prompted GPT-3.5 on the Word Sorting benchmark, even with significantly fewer parameters. This shows the potential for effective training of compact large language models, enabling them to achieve performance levels comparable to those of extremely large models.
5 Related Work
Large language models for reasoning. State-of-the-art LLMs (OpenAI, 2022, 2023; Touvron et al., 2023; Zeng et al., 2022) have shown very strong performance on complicated reasoning tasks, including commonsense (Geva et al., 2021b), math (Cobbe et al., 2021), symbolic reasoning (Suzgun et al., 2022), and programming (Austin et al., 2021; Chen et al., 2021). Tackling such tasks with LLMs often requires prompting them with demonstrations that elicit their reasoning capabilities. Wei et al. (2022) proposed the chain-of-thought prompting technique, which encourages language models to generate answers step-by-step. Wang et al. (2022) found that self-consistency can further improve the performance of chain-of-thought reasoning. Kojima et al. (2023) discovered that LLMs can perform reasoning without any demonstrations through adding the incantation “Let’s think step-by-step”. Tree of thoughts (Yao et al., 2023a) and graph of thoughts (Yao et al., 2023b; Besta et al., 2023) were proposed to tackle tasks that require more complicated reasoning processes. These improved reasoning methods apply chain of thoughts as the atomic reasoning step but organize reasoning “chains” through more advanced mechanisms.
Programs and tools. Previous studies have found that some limitations of LLMs can be overcome by combining program synthesis techniques with prompt-based learning. Program of thoughts (Chen et al., 2022) and program aided language model (Gao et al., 2023) both translate mathematical questions to equations and use the Python interpreter to ensure the correctness of the calculations. Another line of related work for enabling LLMs to use tools is through interleaving API calls during LLM generation (Qin et al., 2023; Liang et al., 2023b; Mialon et al., 2023; Tang et al., 2023). APIs can aid many tasks that are challenging for LLMs by providing tailored tools (e.g., calculators, search) that can solve specific tasks. Toolformer (Schick et al., 2023) addresses reasoning tasks by using predefined tools, and LLMs as tool makers (LATM) can implement functions solving a class of tasks based on few-shot examples (Cai et al., 2023). With these solutions, the correctness of a prediction can be ensured if the correct API is called and the correct inputs are selected. Existing works on combining program synthesis and tool usage with LLMs generally rely on task-specific prompts, in contrast to the more task-general prompt explored in the present work.
6 Conclusion
This work describes natural language embedded programs (NLEP), which flexibly combine natural language reasoning with program synthesis within prompt-based learning to tackle a variety of tasks. Our experiments demonstrate that NLEPs expand the scope of applications that can be addressed by program synthesis by more closely incorporating natural language during code generation.
Acknowledgement
This research was supported by the Center for Perceptual and Interactive Intelligence (CPII) Ltd under the Innovation and Technology Commission’s InnoHK Scheme.
Limitation
We found that the NLEP prompts are not suitable for generating long-form natural language responses. Experimental results on VicunaQA show that most responses generated by NLEP prompting have fewer tokens than responses obtained from usual LLM generation. This feature is expected, because most naturally-occurring programs (on which the LLMs were pretrained) do not contain large chunks of natural language. Future work could consider incorporating (potentially synthetically generated) programs with longer-form natural language within the pretraining set to enable the application of NLEP to more involved NLG tasks. In the context of prompt-based approaches without parameter adjustments, the design of prompts can affect performance. While we aimed to mitigate this issue by expanding the scope of evaluation benchmarks and minimizing random factors, we did not engage in extensive prompt variation analysis. Further exploration into the impact of prompt variations remains an avenue for enhancing the robustness of reasoning approaches.
Ethical Statement
This work intends to design an accurate and interpretable reasoning framework for language that entails more transparent and responsible LLM applications. However, the program generation method is more capable of handling different tasks in the areas of both natural and programming languages, affecting both humans and computing systems. As a result, we believe program generation models need stronger alignment and careful management.
References
- Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
- Besta et al. (2023) Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, et al. 2023. Graph of thoughts: Solving elaborate problems with large language models. arXiv preprint arXiv:2308.09687.
- Cai et al. (2023) Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. 2023. Large language models as tool makers. arXiv preprint arXiv:2305.17126.
- Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code.
- Chen et al. (2022) Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. 2022. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588.
- Cheng et al. (2022) Zhoujun Cheng, Tianbao Xie, Peng Shi, Chengzu Li, Rahul Nadkarni, Yushi Hu, Caiming Xiong, Dragomir Radev, Mari Ostendorf, Luke Zettlemoyer, et al. 2022. Binding language models in symbolic languages. In The Eleventh International Conference on Learning Representations.
- Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- Condoravdi et al. (2003) Cleo Condoravdi, Dick Crouch, Valeria de Paiva, Reinhard Stolle, and Daniel G. Bobrow. 2003. Entailment, intensionality and text understanding. In Proceedings of the HLT-NAACL 2003 Workshop on Text Meaning, pages 38–45.
- de Gibert et al. (2018) Ona de Gibert, Naiara Perez, Aitor García-Pablos, and Montse Cuadros. 2018. Hate speech dataset from a white supremacy forum. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 11–20, Brussels, Belgium. Association for Computational Linguistics.
- Fu et al. (2023) Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, and Tushar Khot. 2023. Chain-of-thought hub: A continuous effort to measure large language models’ reasoning performance.
- Gao et al. (2023) Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Pal: Program-aided language models. In International Conference on Machine Learning, pages 10764–10799. PMLR.
- Ge et al. (2023) Jiaxin Ge, Hongyin Luo, Yoon Kim, and James Glass. 2023. Entailment as robust self-learner. arXiv preprint arXiv:2305.17197.
- Geva et al. (2021a) Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021a. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies.
- Geva et al. (2021b) Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021b. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies. Transactions of the Association for Computational Linguistics (TACL).
- Hu et al. (2023) Yi Hu, Haotong Yang, Zhouchen Lin, and Muhan Zhang. 2023. Code prompting: a neural symbolic method for complex reasoning in large language models.
- Kojima et al. (2023) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2023. Large language models are zero-shot reasoners.
- Li et al. (2023) Ziyang Li, Jiani Huang, and Mayur Naik. 2023. Scallop: A language for neurosymbolic programming. Proceedings of the ACM on Programming Languages, 7(PLDI):1463–1487.
- Liang et al. (2023a) Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, Yun Wang, Linjun Shou, Ming Gong, and Nan Duan. 2023a. TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs. arXiv:2303.16434.
- Liang et al. (2023b) Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, Yun Wang, Linjun Shou, Ming Gong, and Nan Duan. 2023b. Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis.
- Lin et al. (2022) Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- Lyu et al. (2023) Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison-Burch. 2023. Faithful chain-of-thought reasoning.
- Mialon et al. (2023) Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun, and Thomas Scialom. 2023. Augmented language models: a survey.
- Ni et al. (2019) Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 188–197.
- Nye et al. (2021) Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. 2021. Show your work: Scratchpads for intermediate computation with language models.
- Obamuyide and Vlachos (2018) Abiola Obamuyide and Andreas Vlachos. 2018. Zero-shot relation classification as textual entailment. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 72–78, Brussels, Belgium. Association for Computational Linguistics.
- OpenAI (2022) OpenAI. 2022. Introducing chatgpt.
- OpenAI (2023) OpenAI. 2023. Gpt-4 technical report.
- Pan et al. (2023) Liangming Pan, Alon Albalak, Xinyi Wang, and William Yang Wang. 2023. Logic-lm: Empowering large language models with symbolic solvers for faithful logical reasoning. arXiv preprint arXiv:2305.12295.
- Paranjape et al. (2023) Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and Marco Tulio Ribeiro. 2023. ART: Automatic multi-step reasoning and tool-use for large language models. arXiv:2303.09014.
- Qin et al. (2023) Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, Yi Ren Fung, Yusheng Su, Huadong Wang, Cheng Qian, Runchu Tian, Kunlun Zhu, Shihao Liang, Xingyu Shen, Bokai Xu, Zhen Zhang, Yining Ye, Bowen Li, Ziwei Tang, Jing Yi, Yuzhang Zhu, Zhenning Dai, Lan Yan, Xin Cong, Yaxi Lu, Weilin Zhao, Yuxiang Huang, Junxi Yan, Xu Han, Xian Sun, Dahai Li, Jason Phang, Cheng Yang, Tongshuang Wu, Heng Ji, Zhiyuan Liu, and Maosong Sun. 2023. Tool learning with foundation models.
- Rozière et al. (2023) Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. 2023. Code llama: Open foundation models for code.
- Sap et al. (2019) Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A Smith, and Yejin Choi. 2019. Social bias frames: Reasoning about social and power implications of language. arXiv preprint arXiv:1911.03891.
- Saravia et al. (2018) Elvis Saravia, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. 2018. CARER: Contextualized affect representations for emotion recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3687–3697, Brussels, Belgium. Association for Computational Linguistics.
- Schick et al. (2023) Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.
- Srivastava et al. (2023) Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, et al. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.
- Suzgun et al. (2022) Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, and Jason Wei. 2022. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261.
- Tang et al. (2023) Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, and Le Sun. 2023. Toolalpaca: Generalized tool learning for language models with 3000 simulated cases.
- Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355.
- Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
- Warstadt et al. (2019) Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
- Yao et al. (2023a) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. 2023a. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601.
- Yao et al. (2023b) Yao Yao, Zuchao Li, and Hai Zhao. 2023b. Beyond chain-of-thought, effective graph-of-thought reasoning in large language models. arXiv preprint arXiv:2305.16582.
- Zeng et al. (2022) Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.
Appendix A Additional Examples
Appendix B Additional Results and Analysis on Math and Symbolic Reasoning
We present the detailed experimental results of math and symbolic reasoning tasks in Tables 8 and 9, with execution failure analysis in Table 11. The significance test analysis is outlined in Table 10.
GPT Results. We report the results of Table 2 with more details in Table 8. The effect of the extra retries described in Section 4 is shown in parentheses. The detailed experimental results on the Game of 24 benchmark are listed in Table 9.
Method | Tracking Shuffled Objects (7) | Dyck Language | Word Sorting | Chinese Remainder Theorem | Scheduling Meeting | GSM-Hard
---|---|---|---|---|---|---
(a) Task-Specific Prompting: GPT-4 | | | | | |
CoT | 100.0 | 63.6† | 90.9† | 0.0† | 55.6† | 57.4
PoT | 100.0 | 60.8 | 100.0 | 100.0 | 75.2 | 74.1
LATM | 100.0 | 87.5† | 99.1† | 100.0† | 100.0† | -
(b) Task-General Prompting: GPT-4 | | | | | |
CoT | 81.2 | 39.6 | 84.4 | 0.0 | 82.8 | 54.9
PoT | 98.4 | 66.4 | 99.6 | 76.4 (84.4) | 84.4 (85.2) | 69.3
NLEP (Ours) | 100.0 | 91.6 | 99.6 | 85.2 (97.2) | 77.6 (93.2) | 67.7
(c) Task-Specific Prompting: GPT-3.5-Turbo | | | | | |
CoT | 68.0 | 20.4† | 59.2† | 0.0† | 18.9† | 45.0
PoT | 6.8 | 28.4 | 100.0 | 100.0 | 33.6 | 63.4
(d) Task-General Prompting: GPT-3.5-Turbo | | | | | |
CoT | 51.2 | 38.0 | 75.2 | 0.0 | 39.6 | 42.8
PoT | 88.4 | 4.0 | 100.0 | 58.4 (72.4) | 46.8 (49.2) | 39.0 (52.2)
NLEP (Ours) | 74.4 | 7.2 | 99.6 | 94.8 (96.4) | 75.2 (85.6) | 50.9 (54.1)
Setting | Task-Specific | | | | | | Task-General |
---|---|---|---|---|---|---|---|---
 | IO | CoT | | | | | PoT | NLEP (ours)
Game of 24 | 7.3† | 4.0† | 33.0† | 49.0† | 45.0† | 74.0† | 52 (52) | 63 (66)
Significance Test. We report the significance test in Table 10 for the task-general PoT and NLEP prompted GPT-4 results. Owing to API cost constraints, we run the significance test using a bootstrap test: https://github.com/neubig/util-scripts/blob/master/paired-bootstrap.py. The hyperparameters are set to their defaults: num_samples=10000, sample_ratio=0.5. In addition, to make the results more reliable and reproducible, we performed the major experiments using a temperature of 0, as reported in Table 8, and assessed NLEP across a diverse range of tasks to show its general capabilities.
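To make the procedure concrete, the following is a minimal sketch of a paired bootstrap comparison between two systems. It is not the referenced script: the function name `paired_bootstrap`, the toy correctness vectors, the fixed seed, and the choice to report the interval for the second system's accuracy are all assumptions made for illustration. Only the parameter names `num_samples` and `sample_ratio` and their default values are taken from the text above.

```python
import random


def paired_bootstrap(correct_a, correct_b, num_samples=10000, sample_ratio=0.5, seed=0):
    """Compare two systems on the same test cases via paired resampling.

    correct_a / correct_b: per-example 0/1 correctness, aligned by index.
    Returns win/lose/tie rates for system B and a 95% interval of B's accuracy.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and aligned")
    rng = random.Random(seed)
    n = len(correct_a)
    k = max(1, int(n * sample_ratio))  # size of each resampled subset
    wins = losses = ties = 0
    accs_b = []
    for _ in range(num_samples):
        # Draw the same indices for both systems so the comparison stays paired.
        idx = [rng.randrange(n) for _ in range(k)]
        acc_a = sum(correct_a[i] for i in idx) / k
        acc_b = sum(correct_b[i] for i in idx) / k
        accs_b.append(acc_b)
        if acc_b > acc_a:
            wins += 1
        elif acc_b < acc_a:
            losses += 1
        else:
            ties += 1
    accs_b.sort()
    lower = accs_b[int(0.025 * num_samples)]
    upper = accs_b[min(num_samples - 1, int(0.975 * num_samples))]
    return {
        "win": wins / num_samples,
        "lose": losses / num_samples,
        "tie": ties / num_samples,
        "ci95": (lower, upper),
    }


# Toy data (hypothetical): system B is correct wherever A is, plus two more cases.
system_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
system_b = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
result = paired_bootstrap(system_a, system_b, num_samples=1000)
```

In this sketch the win, lose, and tie rates sum to one, which is the property one would expect of the "NLEP win/lose/tie" column in Table 10.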
Tasks/Evaluation | Accuracy (PoT) | Accuracy (NLEP) | NLEP win/lose/tie | 95% Confidence Interval | P-value
---|---|---|---|---|---
Tracking Shuffled Objects (7) | 98.4 | 100.0 | 86.8/0.0/13.2 | | 0.132
Dyck Languages | 66.4 | 91.6 | 100.0/0.0/0.0 | | 0.000
Word Sorting | 99.6 | 99.6 | 0.0/0.0/100.0 | | /
Chinese Remainder Theorem | 84.4 | 97.2 | 100.0/0.0/0.0 | | 0.000
Scheduling Meeting | 85.2 | 93.2 | 98.5/0.7/0.8 | | 0.015
GSM-Hard | 69.3 | 67.7 | 0.92/88.8/20.0 | | PoT better 0.112
Task | GPT-4 | | | GPT-3.5-Turbo | |
---|---|---|---|---|---|---
 | Task-Specific | Task-General | | Task-Specific | Task-General |
 | PoT | PoT | NLEP | PoT | PoT | NLEP
Track Shuffled Objects (7) | 0 | 0 | 0 | 233 | 26 | 24
Dyck Language | 16 | 24 | 10 | 32 | 81 | 26
Word Sort | 0 | 0 | 0 | 0 | 0 | 0
Chinese Remainder Theorem | 0 | 32 (0) | 37 (6) | 0 | 46 (7) | 4 (0)
Scheduling Meeting | 0 | 3 (0) | 43 (0) | 2 | 15 (2) | 36 (0)
GSM-Hard | 17 | 6 | 8 | 31 | 464 (145) | 95 (13)
Results with Different Foundation Models. To investigate the effect of NLEP on different large language models, we report the results with Claude-2 (https://www.anthropic.com/api) and CodeLlama (Rozière et al., 2023) in Table 7. Following the guidance for the instruction-following models (https://github.com/facebookresearch/codellama), we employ a chat session that includes the task-specific and task-general prompts as previous turns by interleaving the “user” and “assistant” messages, with a system message “Provide answers in Python” at the beginning, for CodeLlama-7b-Instruct and CodeLlama-13b-Instruct. Hence, we only treat bug-free Python programs that produce the desired results after execution as correct answers, regardless of natural language outputs, since we explicitly prompt CodeLlama to generate the answer in Python. Unlike the prominent performance of GPT-4, the positive impact of NLEP with CodeLlama is diminished due to the much smaller model size and greatly reduced programming capability. Although NLEP prompting outperforms the task-general PoT by a large margin on the Chinese Remainder Theorem and Scheduling Meeting benchmarks, a non-negligible performance gap is observed between NLEP and task-specific PoT on most tasks. We notice a decline in performance on the Scheduling Meeting benchmark for the 13b model relative to the 7b model. Interestingly, we found that CodeLlama-13b prefers to use tools (built-in libraries) more for solving Scheduling Meeting problems, while CodeLlama-7b tends to do it directly “by hand”. However, tool use by a less proficient model may result in more execution errors. For instance, 113 of the 125 execution errors of CodeLlama-13b (out of 250 test cases in total) are Error: ‘datetime.time’ object has no attribute ‘overlaps’, while CodeLlama-7b has only 3 execution errors. The performance with Claude-2 is more prominent, with NLEP demonstrating superior performance on the Dyck Language, Scheduling Meeting, and Chinese Remainder Theorem benchmarks.
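The reported error message can be illustrated with a short sketch: objects of Python's standard `datetime.time` class support ordering comparisons but define no `overlaps` method, so an overlap test has to be written explicitly. The helper `intervals_overlap` and the example time slots below are hypothetical and are not taken from any generated program.

```python
from datetime import time


def intervals_overlap(start_a, end_a, start_b, end_b):
    """Half-open intervals [start, end) overlap iff each starts before the other ends."""
    return start_a < end_b and start_b < end_a


# Hypothetical time slots for illustration.
slot = (time(9, 0), time(10, 30))
meeting = (time(10, 0), time(11, 0))

# datetime.time defines no `overlaps` method, so calling slot[0].overlaps(...)
# would raise AttributeError.
has_method = hasattr(slot[0], "overlaps")

# Comparing the endpoints directly avoids the missing method.
conflict = intervals_overlap(slot[0], slot[1], meeting[0], meeting[1])
```

Under the half-open convention assumed here, two slots that merely touch at an endpoint (one ends at 10:00 and the other starts at 10:00) are not counted as overlapping.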
To further investigate the benefits of NLEP, we fine-tune a CodeLlama-7b (Rozière et al., 2023) model using NLEP-style instances, resulting in a variant that we term NLEP-CodeLlama. Note that our training corpus does not include the specific evaluation tasks. During the evaluation phase, we adopt a zero-shot prompting strategy, where the model is provided with only test instances and no in-context demonstrations. As presented in Table 7(b), zero-shot NLEP-CodeLlama exhibits consistent performance improvement on 5 of 6 tasks. The only exception is the Chinese Remainder Theorem benchmark, which is notably more complex in nature. Remarkably, zero-shot NLEP-CodeLlama demonstrates superior performance on the Word Sorting benchmark when compared to task-general CoT prompted GPT-3.5-Turbo, and outperforms NLEP prompted GPT-3.5-Turbo on the Tracking Shuffled Objects (7) benchmark, despite having considerably fewer parameters.
Limitation of GSM-Hard Benchmark. We opted for GSM-Hard (Gao et al., 2023) to mitigate potential data contamination from GSM8K. The numbers in GSM-Hard are large and less likely to have been encountered by LLMs during training, which increases the task complexity. However, since GSM-Hard is automatically constructed by replacing small numbers in GSM8K with large ones, there are peculiar cases where models may refuse to reply or automatically solve it. An example is as follows: the original question in GSM8K is “Claire makes a 3 egg omelet every morning for breakfast. How many dozens of eggs will she eat in 4 weeks?” The number 3 is replaced with 6022727 and the corresponding question in GSM-Hard is “Claire makes a 6022727 egg omelet every morning for breakfast. How many dozens of eggs will she eat in 4 weeks?” The output of task-general PoT aligns with the ground truth, which is a decimal number. However, GPT-4 with NLEP prompting assumes that the number of dozens should be an integer, and consequently returns an integer result. This poses additional challenges for evaluation, and we thus mainly evaluate the former five reasoning benchmarks using more models in Table 7, which might serve as more reliable performance indicators.
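The arithmetic behind the omelet example can be checked directly. The sketch below assumes that 4 weeks means 28 days and that a dozen is 12 eggs; the variable names are illustrative, and which integer rounding the NLEP-prompted program actually applied is not stated here, so both the floor and the ceiling are shown.

```python
eggs_per_day = 6022727   # number substituted by GSM-Hard for the original 3
days = 4 * 7             # 4 weeks
eggs_per_dozen = 12

total_eggs = eggs_per_day * days

# Exact (decimal) number of dozens.
dozens_exact = total_eggs / eggs_per_dozen

# Two possible integer readings of "how many dozens" (assumed, for comparison).
dozens_floor = total_eggs // eggs_per_dozen
dozens_ceil = -(-total_eggs // eggs_per_dozen)

print(total_eggs, dozens_exact, dozens_floor, dozens_ceil)
```

Because the total number of eggs is not divisible by 12, any integer answer differs from the decimal one, so an exact-match evaluation marks it as incorrect.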
Execution Failure Analysis. We present the execution failure statistics of code-based reasoning strategies in Table 11. The effect of extra retries described in Section 4 is highlighted with (). Note that different from task-specific PoT with demonstrations showing how to return the desired outputs in Python program, e.g.,
we need to design rules to extract the target answers from the execution results of task-general PoT and NLEP, since they are allowed to generate free-form outputs. For example, given the generated programs,
we need to extract “striker”, the target answer, from the execution results “At the end of the match, Gertrude is playing striker.”.
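One simple form such an extraction rule could take is sketched below: given the printed execution result and a list of candidate answers, return the candidate mentioned last. This is an assumed rule for illustration only, not the rule set used in the experiments; the function `extract_answer` and the candidate list `positions` are hypothetical.

```python
import re


def extract_answer(execution_output, candidates):
    """Return the candidate whose last mention occurs latest in the output, or None."""
    best, best_pos = None, -1
    for cand in candidates:
        pattern = r"\b" + re.escape(cand) + r"\b"
        for match in re.finditer(pattern, execution_output, flags=re.IGNORECASE):
            if match.start() > best_pos:
                best, best_pos = cand, match.start()
    return best


output = "At the end of the match, Gertrude is playing striker."
positions = ["goalkeeper", "striker", "left winger"]  # hypothetical option list
answer = extract_answer(output, positions)
```

Preferring the last mention is a design choice that assumes the final answer is stated at the end of the output; outputs that restate several options would need a stricter rule.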
Although task-specific PoT explicitly instructs the model to generate a required line of code ans={desired_result} to capture the final result by adding
in the instruction and explicitly provides task-specific examples in the prompt, e.g.,
we found that GPT models sometimes fail to accomplish the requirement. On Tracking Shuffled Objects (7) benchmark, only 17 programs out of 250 test cases generated by GPT-3.5 successfully contain the required line of code, i.e., ans={desired_result}, which explains the huge number of failures (233). In addition to already having task-specific programming examples, the inability to distinguish between the erroneous programs and lack of required line of code is another reason why we do not apply the error retries on task-specific PoT.
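For illustration, the failure categories discussed above can be told apart offline with Python's standard `ast` module, as in the following sketch. This is not part of the described pipeline: the function names and category labels are hypothetical, and running generated code with `exec` is only appropriate inside a sandbox.

```python
import ast


def binds_name(tree, name="ans"):
    """True if the parsed program binds `name` anywhere (assignment, loop target, ...)."""
    return any(
        isinstance(node, ast.Name) and node.id == name and isinstance(node.ctx, ast.Store)
        for node in ast.walk(tree)
    )


def classify_program(program, name="ans"):
    """Label a generated program as syntax_error, missing_ans, runtime_error, or ok."""
    try:
        tree = ast.parse(program)
    except SyntaxError:
        return "syntax_error"
    if not binds_name(tree, name):
        return "missing_ans"  # the required `ans = ...` line is absent
    namespace = {}
    try:
        # Assumption: execution happens in a sandboxed environment.
        exec(compile(tree, "<generated>", "exec"), namespace)
    except Exception:
        return "runtime_error"
    return "ok" if name in namespace else "missing_ans"


labels = [
    classify_program("ans = 1 + 1"),
    classify_program("print(1 + 1)"),
    classify_program("ans = 1 / 0"),
    classify_program("ans = ("),
]
```

A static check like this only detects whether the name is bound somewhere in the program; it does not verify that the bound value is the desired result.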
NLEP vs. Task-General PoT. Compared to PoT, NLEP offers structured and modularized Python blocks. This disentanglement of knowledge extraction from problem solving facilitates enhanced interpretability and controllability. In contrast, PoT may intertwine different components, particularly blending knowledge and reasoning steps. An illustrative example is provided below. The question from GSM-Hard is: “Toula went to the bakery and bought various types of pastries. She bought 3 dozen donuts which cost 5804528 per dozen, 2 dozen mini cupcakes which cost 80 per dozen, and 6 dozen mini cheesecakes for 55 per dozen. How much was the total cost?” The generated Python block of NLEP is
The code returned by PoT is
Num | cola | emotion | amazon | hsd | sbic |
---|---|---|---|---|---|
Internal Node | 4 | 7 | 8 | 4 | 3 |
Leaf Node | 4 | 8 | 10 | 5 | 4 |
Label | 2 | 6 | 5 | 2 | 3 |
Appendix C Additional Results on Text Classification
Statistics of Decision Trees. We list the statistics of the decision trees generated by NLEP prompting in Table 12. Each internal decision node in a generated tree has two branches, “yes” and “no”, indicating whether the input satisfies the requirement or not.
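As a rough sketch of this structure, a yes/no decision tree can be stored as nested dictionaries whose leaves are label strings. The toy tree, its questions, and the keyword-based `keyword_answer` function below are all hypothetical stand-ins: in particular, the keyword matcher merely replaces whatever model answers each node's question, and none of this is a tree produced in the experiments.

```python
def count_nodes(tree):
    """Return (internal_nodes, leaf_nodes) for a nested yes/no tree."""
    if isinstance(tree, str):  # a leaf is a label string
        return 0, 1
    yes_internal, yes_leaf = count_nodes(tree["yes"])
    no_internal, no_leaf = count_nodes(tree["no"])
    return 1 + yes_internal + no_internal, yes_leaf + no_leaf


def classify(tree, text, answer_fn):
    """Walk from the root, following the "yes" or "no" branch at each internal node."""
    node = tree
    while not isinstance(node, str):
        node = node["yes"] if answer_fn(node["question"], text) else node["no"]
    return node


# Hypothetical toy tree for illustration.
toy_tree = {
    "question": "Does the text express a clear opinion?",
    "yes": {
        "question": "Is the opinion favorable?",
        "yes": "positive",
        "no": "negative",
    },
    "no": "neutral",
}


def keyword_answer(question, text):
    """Stand-in for a model that decides whether `text` satisfies `question`."""
    text = text.lower()
    if "favorable" in question:
        return any(word in text for word in ("great", "love"))
    return any(word in text for word in ("great", "love", "awful", "hate"))


internal, leaves = count_nodes(toy_tree)
prediction = classify(toy_tree, "I love this film", keyword_answer)
```

Keeping the tree as plain data separates the classification logic from the component that answers each question, so the same traversal works with any question-answering function.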
Classification Results on SST2 Benchmark. Since we use the trees derived from SST2 to prompt the LLM for the classification tasks, it would be inappropriate to use these examples for SST2 itself. For SST2, we thus use an automatically generated decision tree for the CoLA task to prompt GPT-4 to generate a new tree for SST2. As shown in Table 13, the automatically generated tree matches the performance of the SST2 decision tree created by the authors.
SST2 Performance | |
---|---|
Model-Free | 66.17 |
RoBERTa-Manual | 83.03 |
RoBERTa-Automatic | 87.36 |
DeBERTa-Manual | 84.06 |
DeBERTa-Automatic | 93.49 |
Appendix D Prompts for Task-General Strategies
D.1 Prompts for Table 2 and 4
We list the prompts for the task-general chain-of-thought (CoT), our implementation of program-of-thoughts (PoT) and the proposed natural language embedded programs (NLEP) strategies in the following code segments. They share the same two NLEP examples (one is for natural language reasoning and the other is for mathematical reasoning) but with different forms of intermediate reasoning steps (e.g., code, text etc.) to evaluate the average performance of different strategies.
Prompt for task-general chain-of-thought (CoT) in Table 2 and 4. The detailed intermediate natural language reasoning chains are generated by prompting GPT-4 given the input and target output.
Prompt for task-general NLEP in Table 2 and 4. The intermediate program reasoning chains are generated by prompting GPT-4 given the input and target output.
Prompt for our task-general program-of-thoughts (PoT) implementation in Table 2 and 4. The intermediate program reasoning chains are obtained from NLEP. We keep the key implementation steps of Python programs but eliminate the step-by-step programming instructions and detailed natural language elaboration as an ablation.
D.2 Prompts for Table 3
The Game of 24 task is much more challenging and we replace the first example in Appendix D.1 with a word sorting example to elicit stronger reasoning ability.
Prompt for task-general NLEP in Table 3. The intermediate program reasoning chains are generated by prompting GPT-4 given the input and target output.
Prompt for our task-general program-of-thoughts (PoT) implementation in Table 3. The intermediate program reasoning chains are obtained from NLEP. We keep the key implementation steps of Python programs but eliminate the step-by-step programming instructions and detailed natural language elaboration as an ablation.
D.3 Prompts for NLEP in Table 5 and Figure 4
For experiments in TruthfulQA and VicunaQA, we added the following example into the NLEP prompts shown in Appendix D.1 to encourage generating more detailed responses:
D.4 Prompts for Table 6 and 13
We use the following prompt for the entailment-based NLEP results in Table 6. The model-free result uses the NLEP prompt shown in D.1.
When we test the SST-2 performance based on a generated CoLA decision tree in Table 13, we use the following prompt:
The input format of target tasks is
Appendix E Implementation Details for Task-Specific Strategies
We detail the few-shot chain-of-thought (CoT) and program-of-thought (PoT) prompting under the task-specific setting in Tables 2 and 4:
- Tracking Shuffled Objects (7). We use the same 3-shot examples as used by previous work for both task-specific CoT and PoT. The three examples are related to Tracking Shuffled Objects (3), and the models need to learn from the demonstrations and generalize to seven-object test cases. The difference between CoT and PoT lies in the format of the intermediate reasoning: CoT adopts natural language as the reasoning chains, while we transform the thought process into concise Python code for PoT.
- Dyck Language. We cite the results of CoT from LATM (Cai et al., 2023) and transform the reasoning steps of the 3-shot examples used by previous chain-of-thought work into Python code for PoT. In order to evaluate the generalization ability of program-of-thought prompting, we try to avoid directly giving a generally applicable Python program that can be used for all test instances.
- Word Sorting. We cite the results of CoT from LATM (Cai et al., 2023) and transform the reasoning steps of the 3-shot examples used by previous chain-of-thought work into Python code for PoT. Since the task can be effectively resolved by just a few lines of code, i.e., reading in the given input and using the sorted() function, e.g.,

      # Python code, return ans
      words = ['oven', 'costume', 'counterpart']
      ans = " ".join(sorted(words))

  it can be regarded that the generally applicable tool is already given in the input prompt.
- Chinese Remainder Theorem. We cite the results of CoT from LATM (Cai et al., 2023). We build the in-context examples (3-shot) from the first three successful instances of task-general PoT, constructing the Python reasoning chains from the programs generated by task-general PoT with GPT-4. Indeed, for this complicated task, the program provided in the demonstration can also be regarded as a generally applicable tool. This is a main reason why task-specific PoT can obtain 100% accuracy on this benchmark.
- Scheduling Meeting. We cite the results of CoT from LATM (Cai et al., 2023). We build the in-context examples (3-shot) from the first three successful instances of task-general PoT, constructing the Python reasoning chains from the programs generated by task-general PoT with GPT-4. However, unlike the “ground-truth” problem-solving tool given for Chinese Remainder Theorem, the provided Python reasoning chains can only derive the correct answer for each specific demonstration question and cannot be directly applied to all scenarios, due to the complexity of the problem. We hope to compare this setup with Chinese Remainder Theorem and evaluate the performance of task-specific PoT on complicated tasks through different in-context learning demonstrations.
- GSM-Hard. We use the same 8-shot examples as used by previous work on the GSM8K dataset for CoT on GSM-Hard. For PoT, we adopt the 9-shot examples on the GSM8K dataset from program-of-thought (Chen et al., 2022), which contain concise Python code as reasoning chains.
- StrategyQA. We remove 1 example that appears in the development set from the 6-shot demonstration of previous work (Lyu et al., 2023) for CoT. As PoT is not designed for or applied to natural language question answering tasks, we did not reproduce task-specific PoT results on the StrategyQA benchmark.
Appendix F Evaluation Prompts for VicunaQA
We have two metrics for VicunaQA. The first metric assesses the level of detail and is biased toward long responses generated by GPT-4, while the other metric does not ask for details.
F.1 Evaluation prompt asking for details.
F.2 Evaluation prompt not asking for details.
F.3 Calculation of Length Bias
Suppose we have a set of evaluation cases, each receiving 2 candidate responses. A GPT-4 scorer decides the winner between the candidates. We count the number of cases in which the candidate response with more tokens wins; the length bias is calculated from this count and the total number of evaluation cases.