Applying RLAIF for Code Generation with API-usage in Lightweight LLMs

Sujan Dutta (Rochester Institute of Technology, [email protected]); Sayantan Mahinder, Raviteja Anantha, Bortik Bandyopadhyay (Apple, {smahinder, raviteja_anantha, bbandyopadhyay}@apple.com)
Abstract

Reinforcement Learning from AI Feedback (RLAIF) has demonstrated significant potential across various domains, including mitigating harm in LLM outputs, enhancing text summarization, and improving mathematical reasoning. This paper introduces an RLAIF framework for improving the code generation abilities of lightweight (<1B parameters) LLMs. We focus specifically on code generation tasks that require writing appropriate API calls, which is challenging because of the well-known issue of hallucination in LLMs. Our framework extracts AI feedback from a larger LLM (e.g., GPT-3.5) through a specialized prompting strategy and uses this data to train a reward model that better aligns smaller LLMs. We run our experiments on the Gorilla dataset and assess the quality of the model-generated code across several metrics, including AST, ROUGE, and CodeBLEU, and we develop a pipeline to compute its executability rate. Our approach improves on the fine-tuned LLM baseline, raising the executability rate by 4.5 percentage points. Notably, a smaller LLM (780M parameters) trained with RLAIF surpasses a much larger fine-tuned baseline with 7B parameters, achieving a 1.0 percentage point higher code executability rate.

† Work done as part of an internship at Apple.

1 Introduction

LLMs have recently demonstrated unprecedented natural language understanding and generation capabilities Brown et al. (2020); Chowdhery et al. (2022); OpenAI (2023); Anil et al. (2023); Jiang et al. (2023); Touvron et al. (2023). Reinforcement Learning from Human Feedback (RLHF) is a key contributor to this success. RLHF is a fine-tuning approach that incorporates human evaluations into the reward signal used to train the model. This method improves model performance on complex tasks by aligning the model’s behavior with human preferences. However, the technique is expensive because it requires high-quality human feedback. RLAIF Bai et al. (2022); Lee et al. (2023) has emerged as a promising alternative that replaces human feedback with AI feedback, making fine-tuning more scalable. Concurrently, there is growing research interest in teaching LLMs how to use external tools (APIs) Schick et al. (2024); Nakano et al. (2021); Patil et al. (2023); Qin et al. (2023); Li et al. (2023); Zhuang et al. (2024); Hao et al. (2024). However, lightweight models (<1B parameters) have received limited attention. In this work, we propose an RLAIF framework to enhance lightweight LLMs’ capability to generate code and effectively integrate API calls. Following Patil et al. (2023), we consider the task of generating Python code that includes suitable API calls, given instructions across a wide array of applications. The authors published the Gorilla dataset and showed that LLaMA-7B Touvron et al. (2023) fine-tuned on this dataset outperforms non-fine-tuned LLMs like GPT-4 OpenAI (2023) at understanding a natural language request and mapping it to API calls. Using our RLAIF framework, we fine-tune GPT-2-large (780M parameters), which not only demonstrates comparable API call correctness to Patil et al. (2023) but also surpasses its code generation performance.

Code Generation.

Although extensively studied since the early days of AI research, code generation Waldinger and Lee (1969); Budinsky et al. (1996); Svyatkovskiy et al. (2020); Li et al. (2022) remains a challenging problem. In recent years, the community has explored ways to apply RL when training machine learning models for code generation tasks. For instance, Seq2SQL Zhong et al. (2017) is a neural network trained through RL to generate SQL queries from a text description. During training, a generated query is executed against a database, and the result is used as the reward in the RL algorithm. Le et al. (2022) developed CodeRL, a sequence-to-sequence language model fine-tuned through an actor-critic RL approach for program synthesis. The code-generating LM is treated as the actor during training, and the critic model, which is trained to predict unit test results, provides the reward for generated code. In a similar vein, Shojaee et al. (2023) proposed using feedback from code execution and a ground truth target code to compute the reward. While these approaches may perform well on classical programming tasks (e.g., writing SQL queries, solving competitive/interview-level coding problems, etc.), they are inapplicable to Gorilla-like Patil et al. (2023) code generation, where the program is required to load and execute ML models using the correct API. The bottleneck is that these techniques require executing the generated code, either to compute the reward directly or to train the critic model, and running thousands of such programs is prohibitively expensive.

Reinforcement Learning from AI Feedback.

Bai et al. (2022) introduced the concept of Reinforcement Learning from AI Feedback (RLAIF), which combines preferences labeled by LLMs with human-labeled preferences to optimize for helpfulness and harmlessness. Since then, many studies have explored the usefulness of AI-generated feedback as an alternative to expensive human annotations in various tasks. For instance, Luo et al. (2023) proposed WizardMath, which enhances the mathematical reasoning abilities of Llama-2 using AI feedback in the training process. In another work, Zhang et al. (2023) used real-world data along with RLAIF to improve LLMs as medical consultants. Prior research has also explored AI evaluation for improving factual correctness in LLM-generated medical summaries Mishra et al. (2023). Kwon et al. (2023) explored the usefulness of LLMs in reward design for RL agents in the Ultimatum Game, matrix games, and the DealOrNoDeal negotiation task. Recently, Lee et al. (2023) demonstrated that RLAIF can achieve human-level performance in summarization and in helpful and harmless text generation. However, the possibility of using RLAIF to improve code generation and API usage in small models (<1B parameters) is under-explored. We demonstrate that even with relatively few model parameters, AI feedback significantly improves code generation quality over simple fine-tuning baselines. Moreover, we found that RLAIF applied to the smaller 780M-parameter GPT-2-large model outperforms the fine-tuned LLaMA-7B model, which has roughly nine times as many parameters.

2 Dataset

Figure 1: Schematic diagram of the proposed framework. Step 1 is to fine-tune a base model on the dataset. In step 2, we score the outputs generated by $\mathcal{M}_{\textit{SFT}}$ based on the GPT-3.5 feedback using the technique described in section 3. Using this score, we prepare preference data and train a reward model. Finally, in step 3, we use RL to fine-tune $\mathcal{M}_{\textit{SFT}}$, where $\mathcal{M}_{\textit{reward}}$ provides the reward.

We applied our proposed method to the Gorilla dataset published by Patil et al. (2023). The Gorilla dataset consists of three parts: HuggingFace, TensorFlow, and PyTorch. In this work, we focus only on the HuggingFace dataset, which is the most extensive of the three, featuring over 925 unique APIs. These APIs belong to 37 different domains (e.g., Multimodal Text-to-Image, Computer Vision Image Classification, Audio Text-to-Speech, etc.), and for each API there are ten unique instructions. Each data instance contains an instruction (task description), a domain, an API call (a single line of code), an explanation (how to solve the task using the API), and a complete code listing (a Python script) that accomplishes the task. Here, we highlight some key differences between Gorilla and traditional code generation datasets. Most of the problem statements and corresponding code snippets in benchmark datasets such as CodeSearchNet Husain et al. (2019), XLCoST Zhu et al. (2022), APPS Hendrycks et al. (2021), and MBPP Austin et al. (2021) relate to traditional software engineering tasks: they are representative of common interview questions, require minimal computational resources to execute, and do not need an internet connection. In contrast, the Python scripts in the Gorilla dataset focus on AI-related tasks and require an internet connection and significant computing resources (storage and processing power) to execute. The scripts are expected to download ML models hosted on HuggingFace, load them into memory, and run inference. As a result, techniques that treat code execution or unit test outcomes as feedback Zhong et al. (2017); Le et al. (2022); Shojaee et al. (2023) become inapplicable.

While Patil et al. (2023) focused only on generating the API call, we demonstrate the effectiveness of our approach both on API call correctness and on the ability to use that API in a complete program.

3 Methodology

Our framework follows a pipeline similar to RLHF Ouyang et al. (2022). However, instead of asking human annotators to rank the generated responses, we employ a bigger LLM through a novel prompting strategy. More specifically, for a given instruction and generated code (containing an API call), we ask multiple binary (yes/no) questions that capture different aspects of the generated code (and API call) to determine its quality. Our intuition is that, while generating code from natural language may still be challenging for LLMs, providing binary (yes/no) answers guided by few-shot exemplars is a much easier task. This feedback can in turn be aggregated into a preference ground truth to train the reward model in the RLHF Ouyang et al. (2022) process. Our approach thus eliminates the need for expensive human annotation. We describe the proposed framework (Figure 1) in detail below.

• Step 1: Training a base model

The first step in the pipeline is to fine-tune a language model on the dataset to get a base model. We choose GPT-2-large and train it on the Gorilla dataset using the supervised fine-tuning technique for causal language models. We denote the fine-tuned model by $\mathcal{M}_{\textit{SFT}}$.

• Step 2: Training a reward model using LLM feedback

Instead of human feedback from annotators, we employed a bigger LLM to generate the labels for the reward model.

We realized that human graders, while judging the correctness of a response, consider different aspects of the generated output. Based on this intuition, we created multiple prompts ($P_i$) that ask different questions ($Q_i$) for the same input-output pair. More specifically, we created a set of 8 questions, which we feed as prompts to a state-of-the-art language model (GPT-3.5) to get a binary response. Each of these questions addresses a different desired quality (free of bugs, correct imports, no undefined variables, correct syntax, etc.) of the output relevant to the task. Step 2 in Figure 1 presents a sample prompt made using one of the questions. The appendix contains the complete list of prompts. As the questions are binary (yes/no) in nature, we simply count the number of yes replies by $\mathcal{M}_{\textit{GPT 3.5}}$ to score each input-output pair. More formally, given a task $t$, generated output $o$, and question set $\{Q_i\}$, the prompt set is defined as $P(t,o)=\{P_i \mid P_i=[Q_i,t,o]\}$. The corresponding score ($S$) is given as:

$$S(t,o)=\frac{\sum_{P_i\in P(t,o)}\mathbb{I}\left(\mathcal{M}_{\textit{GPT 3.5}}(P_i)=\textit{yes}\right)}{|P(t,o)|}$$

where $\mathbb{I}$ is the indicator function and $\mathcal{M}_{\textit{GPT 3.5}}(P_i)$ is the reply from $\mathcal{M}_{\textit{GPT 3.5}}$ for the prompt $P_i$. We use this score to prepare the training data for $\mathcal{M}_{\textit{reward}}$ in the following way. For each instruction in the training data, we generate two outputs from $\mathcal{M}_{\textit{SFT}}$ by varying the generation parameters (top-k, temperature, etc.). The two outputs are then scored using the method described above and labeled (accept or reject) based on this score. These tuples of {input instruction, accepted output, rejected output} are combined to form the dataset for $\mathcal{M}_{\textit{reward}}$. In the training phase, $\mathcal{M}_{\textit{reward}}$ learns to classify whether machine-generated code is acceptable (or not) for a given input instruction. We append a classifier head on top of $\mathcal{M}_{\textit{SFT}}$, use this as the starting point of $\mathcal{M}_{\textit{reward}}$, and train for three epochs.
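The scoring and labeling procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `feedback_score`, `preference_tuple`, and `toy_judge` are hypothetical names, the judge is passed in as a plain callable instead of a real GPT-3.5 client, and skipping tied pairs is an assumption, since the text does not say how ties are handled.

```python
from typing import Callable, Optional


def feedback_score(prompts: list[str], judge: Callable[[str], str]) -> float:
    """Fraction of prompts that the judge answers with 'yes', i.e. S(t, o)."""
    if not prompts:
        raise ValueError("need at least one prompt")
    yes_count = sum(judge(p).strip().lower().startswith("yes") for p in prompts)
    return yes_count / len(prompts)


def preference_tuple(
    instruction: str,
    output_a: str,
    score_a: float,
    output_b: str,
    score_b: float,
) -> Optional[dict]:
    """Label the higher-scoring output as accepted and the other as rejected.

    Ties carry no preference signal, so they are skipped here (an assumption).
    """
    if score_a == score_b:
        return None
    if score_a > score_b:
        accepted, rejected = output_a, output_b
    else:
        accepted, rejected = output_b, output_a
    return {"instruction": instruction, "accepted": accepted, "rejected": rejected}


def toy_judge(prompt: str) -> str:
    """Stand-in for the larger LLM: replies 'yes' only if the prompt mentions an import."""
    return "yes" if "import" in prompt else "no"
```

In practice `judge` would wrap a call to the larger model and return its yes/no reply; keeping it as a callable lets the aggregation logic be checked without network access.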

• Step 3: Reinforcement Learning

Finally, in the RL step, we fine-tune $\mathcal{M}_{\textit{SFT}}$ using the proximal policy optimization (PPO) algorithm Schulman et al. (2017). The reward in this step is given by $\mathcal{M}_{\textit{reward}}$’s logit scores. We denote our final fine-tuned model by $\mathcal{M}_{\textit{RL}}$.
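For reference, the clipped surrogate objective from Schulman et al. (2017) is reproduced below. The text does not state which PPO variant or hyperparameters were used, so this is the standard form rather than a confirmed description of the training objective. As an interpretation for this setting, $\pi_\theta$ is the policy initialized from $\mathcal{M}_{\textit{SFT}}$, $a_t$ is the next generated token, $s_t$ is the instruction plus the tokens generated so far, and the advantage estimate $\hat{A}_t$ is derived from the reward produced by $\mathcal{M}_{\textit{reward}}$.

```latex
\[
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t\!\left[
      \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
                  \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
    \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
\]
```

The clipping range $\epsilon$ limits how far a single update can move the policy away from the one that generated the samples. RLHF-style pipelines commonly add a KL penalty toward the initial model as well; whether that was done here is not stated.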

4 Results and Discussions

| Model (Size) | Executability Rate (%) | ROUGE (×100) | CodeBLEU (×100) | AST (%) |
|---|---|---|---|---|
| $\mathcal{M}_{\textit{Gorilla}}$ Patil et al. (2023) (7B) | 26.9 | 41.2 | 36.8 | 71.68 |
| $\mathcal{M}_{\textit{SFT}}$ (780M) | 23.4 | 47.2 | 40.6 | 72.96 |
| $\mathcal{M}_{\textit{RL}}$ (780M) | 27.9 | 47.5 | 42.2 | 73.62 |

Table 1: Performance comparison of different models on the Gorilla dataset.

Figure 2: Example code generated by different models for the same instruction. In the generations of $\mathcal{M}_{\textit{Gorilla}}$ and $\mathcal{M}_{\textit{SFT}}$, the variable russian_text is undefined and hence will result in an error, whereas $\mathcal{M}_{\textit{RL}}$ defines the variable text before using it.

We compute the code generation quality using multiple metrics by comparing the generated output with the ground truth. The reported ROUGE is the average of the ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-sum metrics introduced in Lin (2004). CodeBLEU Ren et al. (2020) was specifically designed for evaluating code synthesis. Ren et al. (2020) defined CodeBLEU as the weighted average of standard BLEU Papineni et al. (2002), the weighted n-gram match ($\textit{BLEU}_{\textit{weight}}$), the syntactic AST match ($\textit{Match}_{\textit{ast}}$), and the semantic dataflow match ($\textit{Match}_{\textit{df}}$): $\textit{CodeBLEU}=\alpha\cdot\textit{BLEU}+\beta\cdot\textit{BLEU}_{\textit{weight}}+\gamma\cdot\textit{Match}_{\textit{ast}}+\delta\cdot\textit{Match}_{\textit{df}}$. We set $\alpha=\beta=\gamma=\delta=0.25$ to give equal importance to all the components. The AST sub-tree-matching metric was proposed in Patil et al. (2023) to capture the correctness of the API calls. In addition, we report the successful execution rate of the generated code (Executability Rate). It is worth noting that running this many machine-generated programs that download and use large AI models is challenging. We created a pipeline to automatically run the machine-generated code in an isolated environment.
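As a rough illustration of how an executability rate can be computed, the sketch below runs each generated script in a fresh interpreter process inside a temporary working directory and counts clean exits. The details of the actual pipeline are not given in the text, so the function names, the timeout, and the use of a subprocess instead of a container or virtual machine are all assumptions; a temporary directory on its own is not real isolation for untrusted code.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def is_executable(code: str, timeout_s: float = 600.0) -> bool:
    """Run `code` as a script in a new interpreter; True if it exits with status 0."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "generated.py"
        script.write_text(code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,          # keep downloaded or written files out of the caller's directory
                capture_output=True,  # swallow stdout/stderr of the generated program
                timeout=timeout_s,    # treat hangs as failures
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


def executability_rate(programs: list[str], timeout_s: float = 600.0) -> float:
    """Percentage of programs that run to completion without raising an error."""
    if not programs:
        return 0.0
    passed = sum(is_executable(program, timeout_s) for program in programs)
    return 100.0 * passed / len(programs)
```

A script that references an undefined variable, such as the russian_text case in Figure 2, exits with a non-zero status and therefore counts as a failure under this check.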

Table 1 compares the performance of the proposed model with $\mathcal{M}_{\textit{Gorilla}}$ (fine-tuned LLaMA-7B) Patil et al. (2023). The results show that the proposed $\mathcal{M}_{\textit{RL}}$ improves on the supervised fine-tuned $\mathcal{M}_{\textit{SFT}}$ in terms of CodeBLEU (1.6 points absolute), AST (0.66% absolute), and Executability Rate (4.5% absolute). We also note that $\mathcal{M}_{\textit{RL}}$ outperforms $\mathcal{M}_{\textit{Gorilla}}$ on every reported metric despite having only about one-ninth of the parameters, including a 1.0% absolute gain in Executability Rate. Figure 2 shows an instance where our framework helps fix a common error present in the $\mathcal{M}_{\textit{Gorilla}}$ and $\mathcal{M}_{\textit{SFT}}$ generations.
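The absolute gains quoted above follow directly from the rows of the results table. The short sketch below recomputes them; the dictionary keys and the helper name `absolute_gain` are illustrative choices, and the numbers are copied from the table.

```python
# Rows of the results table: (executability %, ROUGE x100, CodeBLEU x100, AST %).
RESULTS = {
    "Gorilla (7B)": (26.9, 41.2, 36.8, 71.68),
    "SFT (780M)": (23.4, 47.2, 40.6, 72.96),
    "RL (780M)": (27.9, 47.5, 42.2, 73.62),
}
METRICS = ("executability", "rouge", "codebleu", "ast")


def absolute_gain(model: str, baseline: str) -> dict[str, float]:
    """Per-metric difference (model minus baseline), rounded to two decimals."""
    return {
        name: round(m - b, 2)
        for name, m, b in zip(METRICS, RESULTS[model], RESULTS[baseline])
    }


# Gains of the RL model over the SFT baseline and over the 7B Gorilla model.
gain_over_sft = absolute_gain("RL (780M)", "SFT (780M)")
gain_over_gorilla = absolute_gain("RL (780M)", "Gorilla (7B)")

# Parameter ratio behind the "one-ninth" remark: 7B versus 780M.
param_ratio = 7_000 / 780
```

Here `gain_over_sft` reproduces the 4.5, 1.6, and 0.66 point improvements in executability, CodeBLEU, and AST, and `param_ratio` comes out just under 9.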

5 Ethics statement

This work adheres to the ethical guidelines and principles set out in the ACM Code of Ethics and followed by the broader research community. The dataset used in this paper was originally collected from public repositories hosted on HuggingFace. The authors are aware of the growing literature on jailbreaking language models to generate unsafe content. We hope the community will use the proposed models responsibly and only for the intended use cases.

6 Limitations

One of the common limitations faced by similar fine-tuned models is the presence of biases inherited from the pre-trained model. We anticipate that the biases present in the chosen base model (GPT-2-large) also exist in the final model $\mathcal{M}_{\textit{RL}}$ which might lead to the generation of biased code comments.

Another limitation of this work is the lack of diversity in programming languages. The public dataset we considered contains only Python code. Future work should consider extending this approach to additional programming languages such as C++, Java, and JavaScript. In addition, we have not compared performance on frequent APIs (head) with performance on infrequent APIs (tail). There may be room for improvement by focusing more on tail APIs.

Lastly, the learning methodology applied in this study is offline. Given the rapid evolution and proliferation of machine learning models and the corresponding APIs for specific tasks, the model may not leverage more suitable APIs that emerge post-training. To address this, periodic updates to the model are necessary. Our framework’s reliance on machine-generated feedback significantly reduces the resource intensity associated with the RLHF process, making these updates more feasible and less costly than a human feedback-based approach.

References

  • Anil et al. (2023) Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. Palm 2 technical report. arXiv preprint arXiv:2305.10403.
  • Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
  • Bai et al. (2022) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  • Budinsky et al. (1996) Frank J. Budinsky, Marilyn A. Finnie, John M. Vlissides, and Patsy S. Yu. 1996. Automatic code generation from design patterns. IBM systems Journal, 35(2):151–171.
  • Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
  • Hao et al. (2024) Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. 2024. Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings. Advances in neural information processing systems, 36.
  • Hendrycks et al. (2021) Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021. Measuring coding challenge competence with apps. NeurIPS.
  • Husain et al. (2019) Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436.
  • Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
  • Kwon et al. (2023) Minae Kwon, Sang Michael Xie, Kalesha Bullard, and Dorsa Sadigh. 2023. Reward design with language models. arXiv preprint arXiv:2303.00001.
  • Le et al. (2022) Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. 2022. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35:21314–21328.
  • Lee et al. (2023) Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. 2023. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267.
  • Li et al. (2023) Minghao Li, Feifan Song, Bowen Yu, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. 2023. Api-bank: A benchmark for tool-augmented llms. arXiv preprint arXiv:2304.08244.
  • Li et al. (2022) Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with alphacode. Science, 378(6624):1092–1097.
  • Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
  • Luo et al. (2023) Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. 2023. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583.
  • Mishra et al. (2023) Prakamya Mishra, Zonghai Yao, Shuwei Chen, Beining Wang, Rohan Mittal, and Hong Yu. 2023. Synthetic imitation edit feedback for factual alignment in clinical summarization. arXiv preprint arXiv:2310.20033.
  • Nakano et al. (2021) Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. 2021. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
  • OpenAI (2023) OpenAI. 2023. Gpt-4 technical report.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
  • Patil et al. (2023) Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2023. Gorilla: Large language model connected with massive apis. arXiv preprint arXiv:2305.15334.
  • Qin et al. (2023) Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. 2023. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789.
  • Ren et al. (2020) Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297.
  • Schick et al. (2024) Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2024. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • Shojaee et al. (2023) Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni, and Chandan K Reddy. 2023. Execution-based code generation using deep reinforcement learning. arXiv preprint arXiv:2301.13816.
  • Svyatkovskiy et al. (2020) Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. Intellicode compose: Code generation using transformer. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1433–1443.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  • von Werra et al. (2020) Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, and Shengyi Huang. 2020. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl.
  • Waldinger and Lee (1969) Richard J Waldinger and Richard CT Lee. 1969. Prow: A step toward automatic program writing. In Proceedings of the 1st international joint conference on Artificial intelligence, pages 241–252.
  • Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pages 38–45.
  • Zhang et al. (2023) Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Jianquan Li, Guiming Chen, Xiangbo Wu, Zhiyi Zhang, Qingying Xiao, et al. 2023. Huatuogpt, towards taming language model to be a doctor. arXiv preprint arXiv:2305.15075.
  • Zhong et al. (2017) Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.
  • Zhu et al. (2022) Ming Zhu, Aneesh Jain, Karthik Suresh, Roshan Ravindran, Sindhu Tipirneni, and Chandan K. Reddy. 2022. Xlcost: A benchmark dataset for cross-lingual code intelligence.
  • Zhuang et al. (2024) Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. 2024. Toolqa: A dataset for llm question answering with external tools. Advances in Neural Information Processing Systems, 36.

Appendix A Prompts

Table 2 lists all the prompts used in assessing the quality of the generated code.

Prompt
Given an input task and a Python code, determine if the code is functional.
TASK: [instruction]
CODE: [code]
Given an input task and a Python code, determine if the code imports all the necessary classes/modules for execution.
TASK: [instruction]
CODE: [code]
Given an input task and a Python code, determine if the code uses the correct functions/APIs.
TASK: [instruction]
CODE: [code]
Given an input task and a Python code, determine if the code is free of bugs and code smells.
TASK: [instruction]
CODE: [code]
Given an input task and a Python code, determine if the code is sufficient to accomplish the task.
TASK: [instruction]
CODE: [code]
Given an input task and a Python code, determine if the code uses indentations correctly.
TASK: [instruction]
CODE: [code]
Given an input task and a Python code, determine if the code uses quotes in string literals correctly.
TASK: [instruction]
CODE: [code]
Given an input task and a Python code, determine if the code uses duplicate parameters in a function.
TASK: [instruction]
CODE: [code]
Table 2: Complete set of prompts. The tokens [instruction] and [code] are used to denote an instruction from the dataset and the corresponding generated code respectively.
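The prompt set $P(t,o)$ can be assembled from these templates by substituting the instruction and the generated code into each one. The sketch below is a minimal illustration: the question texts are copied from the table, while the names `QUESTIONS`, `PREFIX`, and `build_prompts` are hypothetical, and the few-shot exemplars mentioned in the methodology are omitted because they are not listed here.

```python
# Question texts copied from the prompt table; each one completes the shared prefix.
QUESTIONS = [
    "determine if the code is functional.",
    "determine if the code imports all the necessary classes/modules for execution.",
    "determine if the code uses the correct functions/APIs.",
    "determine if the code is free of bugs and code smells.",
    "determine if the code is sufficient to accomplish the task.",
    "determine if the code uses indentations correctly.",
    "determine if the code uses quotes in string literals correctly.",
    "determine if the code uses duplicate parameters in a function.",
]

PREFIX = "Given an input task and a Python code, "


def build_prompts(instruction: str, code: str) -> list[str]:
    """Return one prompt per question for a single (instruction, code) pair."""
    return [
        f"{PREFIX}{question}\nTASK: {instruction}\nCODE: {code}"
        for question in QUESTIONS
    ]
```

Calling `build_prompts(instruction, generated_code)` yields eight strings, one per row of the table, each ready to be sent to the larger model for a yes/no reply.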

Appendix B Experimental details

B.1 Dataset

The HuggingFace part of the Gorilla dataset Patil et al. (2023) consists of over 9k instruction-output pairs. We trained our model on 90% of the data and kept the rest for evaluation.
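A 90/10 split of this kind can be sketched as follows. The text does not say whether the split was random or how it was seeded, so the shuffling, the seed, and the function name `train_eval_split` are assumptions made for illustration.

```python
import random


def train_eval_split(examples: list, train_fraction: float = 0.9, seed: int = 0):
    """Shuffle a copy of `examples` and split it into (train, eval) lists."""
    shuffled = list(examples)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

With ten instructions per API, a purely random split places instructions for the same API on both sides; splitting by API instead would be a different design choice that the text does not discuss.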

B.2 Model and implementation details

$\mathcal{M}_{\textit{SFT}}$, $\mathcal{M}_{\textit{reward}}$, and $\mathcal{M}_{\textit{RL}}$ all have 780M parameters. While training $\mathcal{M}_{\textit{SFT}}$ and $\mathcal{M}_{\textit{reward}}$, we used learning rates of $5\times 10^{-4}$ and $5\times 10^{-5}$, respectively. In the RL step (PPO algorithm), we set the learning rate to $6\times 10^{-6}$. We did not perform any hyperparameter search. The results are reported by taking the mean of three inference runs. We implemented the training pipeline using the following Python libraries: transformers Wolf et al. (2020) and TRL von Werra et al. (2020).

B.3 Computational cost

We used a cluster of NVIDIA A100 40GB GPUs for our experiments. In total, we spent $\sim$60 GPU hours on all of the experiments.