Applying RLAIF for Code Generation with API-usage in Lightweight LLMs

Sujan Dutta, Rochester Institute of Technology, [email protected]
Sayantan Mahinder, Raviteja Anantha, Bortik Bandyopadhyay, Apple, {smahinder, raviteja_anantha, bbandyopadhyay}@apple.com

Abstract

Reinforcement Learning from AI Feedback (RLAIF) has demonstrated significant potential across various domains, including mitigating harm in LLM outputs, enhancing text summarization, and improving mathematical reasoning. This paper introduces an RLAIF framework for improving the code generation abilities of lightweight (<1B parameters) LLMs. We specifically focus on code generation tasks that require writing appropriate API calls, which is challenging due to the well-known issue of hallucination in LLMs. Our framework extracts AI feedback from a larger LLM (e.g., GPT-3.5) through a specialized prompting strategy and uses this data to train a reward model that better aligns smaller LLMs. We run our experiments on the Gorilla dataset and assess the quality of the model-generated code across several metrics, including AST, ROUGE, and Code-BLEU, and develop a pipeline to compute its executability rate accurately. Our approach significantly enhances the fine-tuned LLM baseline’s performance, achieving a 4.5% improvement in executability rate. Notably, a smaller LLM (780M parameters) trained with RLAIF surpasses a much larger fine-tuned baseline with 7B parameters, achieving a 1.0% higher code executability rate.

† Work done as part of an internship at Apple.

1 Introduction

LLMs have demonstrated unprecedented natural language understanding and generation capabilities in recent times Brown et al. (2020); Chowdhery et al. (2022); OpenAI (2023); Anil et al. (2023); Jiang et al. (2023); Touvron et al. (2023). Reinforcement Learning with Human Feedback (RLHF) is a key contributor to this success. RLHF is a fine-tuning approach that incorporates human evaluations into the reward signal used to train the model. This method improves model performance on complex tasks by aligning the model’s behavior with human preferences. However, the technique is expensive because it requires high-quality human feedback. RLAIF Bai et al. (2022); Lee et al. (2023) has emerged as a promising alternative that replaces human feedback with AI feedback, making fine-tuning more scalable.

Concurrently, there is growing research interest in teaching LLMs how to use external tools (APIs) Schick et al. (2024); Nakano et al. (2021); Patil et al. (2023); Qin et al. (2023); Li et al. (2023); Zhuang et al. (2024); Hao et al. (2024). However, the focus on lightweight models (<1B parameters) is limited. In this work, we propose an RLAIF framework to enhance lightweight LLMs’ capability to generate code and effectively integrate API calls. Following Patil et al. (2023), we consider the task of generating Python code that includes suitable API calls, given instructions spanning a wide array of applications. The authors published the Gorilla dataset and showed that LLaMA-7B Touvron et al. (2023) fine-tuned on this dataset outperforms non-fine-tuned LLMs like GPT-4 OpenAI (2023) in understanding a natural language request and mapping it to API calls. Using our RLAIF framework, we fine-tune GPT-2-large (780M parameters), which not only demonstrates comparable API call correctness to Patil et al. (2023) but also surpasses its code generation performance.

Code Generation.

Although extensively studied since the early days of AI research, code generation Waldinger and Lee (1969); Budinsky et al. (1996); Svyatkovskiy et al. (2020); Li et al. (2022) remains a challenging problem. In recent years, the community has explored ways to apply RL in training machine learning models for code generation tasks. For instance, Seq2SQL Zhong et al. (2017) proposed a neural network trained through RL for generating SQL queries given a text description. During training, a generated query is executed against a database, and the result is used as the reward in the RL algorithm. Le et al. (2022) developed CodeRL, a sequence-to-sequence language model fine-tuned through an actor-critic RL approach for program synthesis. The code-generator LM is treated as the actor during training, and the critic model, which is trained to predict unit test results, provides the reward for the generated code. In a similar vein, Shojaee et al. (2023) proposed using feedback from code execution and a ground-truth target code to compute the reward. While these approaches may perform well on classical programming tasks (e.g., writing SQL queries, solving competitive/interview-level coding problems, etc.), they are inapplicable to Gorilla-like Patil et al. (2023) code generation, where the program is required to load and execute ML models using the correct API. The bottleneck is that these techniques require executing the generated code, either to compute the reward directly or to train the critic model, and running thousands of such programs is prohibitively expensive.

Reinforcement Learning with AI Feedback.

Bai et al. (2022) introduced the concept of Reinforcement Learning with AI Feedback (RLAIF), which combines preferences labeled by LLMs with human-labeled preferences to optimize for helpfulness and harmlessness. Since then, many studies have explored the usefulness of AI-generated feedback as an alternative to expensive human annotations in various tasks. For instance, Luo et al. (2023) proposed WizardMath, which enhances the mathematical reasoning abilities of Llama-2 using AI feedback in the training process. In another work Zhang et al. (2023), researchers used real-world data along with RLAIF to improve LLMs as medical consultants. Prior research has also explored AI evaluation for improving factual correctness in LLM-generated medical summaries Mishra et al. (2023). Kwon et al. (2023) explored the usefulness of LLMs in reward design for RL agents in the Ultimatum Game, matrix games, and the DealOrNoDeal negotiation task. Recently, Lee et al. (2023) demonstrated that RLAIF can achieve human-level performance in summarization and in helpful and harmless text generation. However, the possibility of using RLAIF to improve the code generation and API usage ability of small models (<1B parameters) is under-explored. We demonstrate that even with relatively few model parameters, AI feedback significantly improves code generation quality over simple fine-tuning baselines. Moreover, we found that RLAIF applied to the smaller 780M-parameter GPT-2-large model outperforms the fine-tuned LLaMA-7B model, which has roughly nine times as many parameters.

2 Dataset

Figure 1: Schematic diagram of the proposed framework. Step 1 is to fine-tune a base model on the dataset. In step 2, we score the $\mathcal{M}_{\textit{SFT}}$ generated outputs based on the GPT-3.5 feedback using the technique described in Section 3. Using this score, we prepare preference data and train a reward model. Finally, in step 3, we use RL to fine-tune $\mathcal{M}_{\textit{SFT}}$ where $\mathcal{M}_{\textit{reward}}$ provides the reward.

We applied our proposed method to the Gorilla dataset published by Patil et al. (2023). The Gorilla dataset consists of three parts: HuggingFace, TensorFlow, and PyTorch. In this work, we focus only on the HuggingFace dataset, which is the most extensive of the three, featuring over 925 unique APIs. These APIs belong to 37 different domains (e.g., Multimodal Text-to-Image, Computer Vision Image Classification, Audio Text-to-Speech, etc.), and for each API, there exist ten unique instructions. Each instance of the data contains an instruction (task description), domain, API call (a single code line), explanation (how to solve the task using the API), and complete code (a Python script) to accomplish the task.

Here, we highlight some key differences between Gorilla and traditional code generation datasets. Most of the problem statements and corresponding code snippets in benchmark datasets, including CodeSearchNet Husain et al. (2019), XLCoST Zhu et al. (2022), APPS Hendrycks et al. (2021), and MBPP Austin et al. (2021), relate to traditional software engineering tasks, are representative of common interview questions, require minimal computational resources to execute, and do not require an internet connection. In contrast, the Python scripts in the Gorilla dataset focus on AI-related tasks and require an internet connection and significant computing resources (storage and processing power) to execute. The scripts are expected to download ML models hosted on HuggingFace, load them into memory, and run inference. Consequently, techniques that treat code execution or unit test outcomes as feedback Zhong et al. (2017); Le et al. (2022); Shojaee et al. (2023) become inapplicable.

While Patil et al. (2023) focused only on generating the API call, we demonstrate the effectiveness of our approach both on API call correctness and on the ability to use that API in a complete program.

3 Methodology

Our framework follows a pipeline similar to RLHF Ouyang et al. (2022). However, instead of asking human annotators to rank the generated responses, we query a bigger LLM using a novel prompting strategy. More specifically, for a given instruction and generated code (containing an API call), we ask multiple binary (yes/no) questions that capture different aspects of the generated code (and API call) to determine its quality. Our intuition is that, while generating code from natural language may still be challenging for LLMs, providing binary (yes/no) answers guided by few-shot exemplars is a much easier task. This feedback can in turn be aggregated into a preference ground truth to train the reward model in the RLHF Ouyang et al. (2022) process. Our approach thus eliminates the need for expensive human annotation. We describe the proposed framework (Figure 1) in detail below.

$\bullet$ Step 1: Training a base model

The first step in the pipeline is to fine-tune a language model on the dataset to get a base model. We choose GPT-2-large and train it on the Gorilla dataset using the supervised fine-tuning technique for causal language models. We denote the fine-tuned model by $\mathcal{M}_{\textit{SFT}}$.
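As a rough illustration of what supervised fine-tuning of a causal language model consumes, the sketch below serializes one dataset instance into a single training string. The field names follow the dataset description in Section 2; the `GorillaRecord` class, the `to_training_text` helper, the `###` section markers, and the example record are all hypothetical and are not taken from the paper, which does not specify its input template.

```python
from dataclasses import dataclass


@dataclass
class GorillaRecord:
    """One instance with the fields listed in the dataset description."""
    instruction: str
    domain: str
    api_call: str
    explanation: str
    code: str


def to_training_text(rec: GorillaRecord) -> str:
    """Serialize a record into one string for next-token prediction.

    The section markers are an assumed template, not the paper's format.
    """
    return (
        f"### Instruction:\n{rec.instruction}\n"
        f"### Domain:\n{rec.domain}\n"
        f"### API call:\n{rec.api_call}\n"
        f"### Explanation:\n{rec.explanation}\n"
        f"### Code:\n{rec.code}\n"
    )


# Made-up example record, for illustration only.
example = GorillaRecord(
    instruction="Classify the sentiment of a product review.",
    domain="Natural Language Processing Text Classification",
    api_call="pipeline('sentiment-analysis')",
    explanation="Load a sentiment pipeline and apply it to the review text.",
    code=(
        "from transformers import pipeline\n"
        "clf = pipeline('sentiment-analysis')\n"
        "print(clf('Great product!'))"
    ),
)
training_text = to_training_text(example)
```

A causal language model would then be trained to predict each token of `training_text` from the tokens before it.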

$\bullet$ Step 2: Training a reward model using LLM feedback

Instead of human feedback from annotators, we employed a bigger LLM to generate the labels for the reward model.

We observed that human graders, while judging the correctness of a response, consider different aspects of the generated output. Based on this intuition, we created multiple prompts ($P_{i}$) that ask different questions ($Q_{i}$) about the same input-output pair. More specifically, we created a set of 8 questions, which we feed as prompts to a state-of-the-art language model (GPT-3.5) to get a binary response. Each of these questions addresses a different desired quality of the output relevant to the task (free of bugs, correct imports, no undefined variables, correct syntax, etc.). Step 2 in Figure 1 presents a sample prompt made using one of the questions. The appendix contains the complete list of prompts. As the questions are binary (yes/no) in nature, we simply count the number of yes replies by $\mathcal{M}_{\textit{GPT 3.5}}$ to score each input-output pair. More formally, given a task $t$, generated output $o$, and question set $\{Q_{i}\}$, the prompt set is defined as $P(t,o)=\{P_{i}\mid P_{i}=[Q_{i},t,o]\}$. The corresponding score ($S$) is given as:

$$S(t,o)=\frac{\sum_{P_{i}\in P(t,o)}\mathbbm{I}\left(\mathcal{M}_{\textit{GPT 3.5}}(P_{i})=\textit{yes}\right)}{|P(t,o)|}$$

where $\mathbbm{I}$ is the indicator function and $\mathcal{M}_{\textit{GPT 3.5}}(P_{i})$ is the reply from $\mathcal{M}_{\textit{GPT 3.5}}$ for the prompt $P_{i}$. We use this score to prepare the training data for $\mathcal{M}_{\textit{reward}}$ in the following way. For each instruction in the training data, we generate two outputs from $\mathcal{M}_{\textit{SFT}}$ by varying the generation parameters (top-k, temperature, etc.). The two outputs are then scored using the method described above and labeled (accept or reject) based on this score. These tuples of {input instruction, accepted output, rejected output} are combined to form the dataset for $\mathcal{M}_{\textit{reward}}$. In the training phase, $\mathcal{M}_{\textit{reward}}$ learns to classify whether a machine-generated code is acceptable (or not) for a given input instruction. We append a classifier head on top of $\mathcal{M}_{\textit{SFT}}$, use this as the starting point of $\mathcal{M}_{\textit{reward}}$, and train for three epochs.
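The scoring and labeling procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the prompt layout in `build_prompts`, the `judge` callable (a stand-in for a call to the larger LLM), and the rule that the higher-scoring output is accepted while ties are dropped are all assumptions, since the text only says the outputs are labeled based on the score.

```python
from typing import Callable, List, Optional, Tuple


def build_prompts(questions: List[str], task: str, output: str) -> List[str]:
    """Form P(t, o): one prompt [Q_i, t, o] per question (layout is assumed)."""
    return [
        f"{q}\n\nInstruction:\n{task}\n\nGenerated code:\n{output}\n\nAnswer (yes/no):"
        for q in questions
    ]


def score(questions: List[str], task: str, output: str,
          judge: Callable[[str], str]) -> float:
    """S(t, o): fraction of prompts the judge answers with 'yes'."""
    prompts = build_prompts(questions, task, output)
    yes = sum(1 for p in prompts if judge(p).strip().lower().startswith("yes"))
    return yes / len(prompts)


def make_preference(task: str, out_a: str, out_b: str, questions: List[str],
                    judge: Callable[[str], str]) -> Optional[Tuple[str, str, str]]:
    """Return (instruction, accepted, rejected), or None on a tie (assumed rule)."""
    s_a = score(questions, task, out_a, judge)
    s_b = score(questions, task, out_b, judge)
    if s_a == s_b:
        return None
    return (task, out_a, out_b) if s_a > s_b else (task, out_b, out_a)


def toy_judge(prompt: str) -> str:
    """Hypothetical stand-in for the larger LLM; checks only for one import."""
    return "yes" if "from transformers import pipeline" in prompt else "no"


questions = [
    "Are all required imports present?",
    "Is the code free of undefined variables?",
]
task = "Classify the sentiment of a review."
good = "from transformers import pipeline\nclf = pipeline('sentiment-analysis')"
bad = "clf = pipeline('sentiment-analysis')"

pair = make_preference(task, bad, good, questions, toy_judge)
```

In practice `judge` would wrap a request to the feedback model with few-shot exemplars prepended, and the question list would hold the eight questions given in the appendix.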

$\bullet$ Step 3: Reinforcement Learning

Finally, in the RL step, we fine-tune $\mathcal{M}_{\textit{SFT}}$ using the proximal policy optimization (PPO) algorithm Schulman et al. (2017). The reward in this step is given by the logit scores of $\mathcal{M}_{\textit{reward}}$. We denote our final fine-tuned model by $\mathcal{M}_{\textit{RL}}$.
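For readers unfamiliar with PPO, the sketch below computes the clipped surrogate objective from Schulman et al. (2017) for a batch of sampled actions. It is a dependency-free illustration only: the function name `ppo_clipped_objective` and the default `clip_eps=0.2` are assumptions, and the text does not state how advantages are derived from the reward model's logits (for example, whether a value baseline or a KL penalty is used), so the advantages are taken as given inputs here.

```python
import math
from typing import Sequence


def ppo_clipped_objective(new_logprobs: Sequence[float],
                          old_logprobs: Sequence[float],
                          advantages: Sequence[float],
                          clip_eps: float = 0.2) -> float:
    """Mean clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    r is the probability ratio between the current policy and the policy
    that generated the samples, computed from log-probabilities.
    """
    terms = []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)


# Identical policies give ratio 1, so the objective equals the mean advantage.
unchanged = ppo_clipped_objective([0.0, 0.0], [0.0, 0.0], [2.0, 4.0])

# A ratio of about 2 with a positive advantage is capped at 1 + clip_eps.
capped = ppo_clipped_objective([math.log(2.0)], [0.0], [1.0])
```

The objective is maximized during training; the clipping removes the incentive to move the policy far from the one that produced the samples in a single update.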

Model Name (Size)	Executability Rate (%)	ROUGE $(\times 100)$	CodeBLEU $(\times 100)$	AST (%)
$\mathcal{M}_{\textit{Gorilla}}$ Patil et al. (2023) (7B)	26.9	41.2	36.8	71.68
$\mathcal{M}_{\textit{SFT}}$ (780M)	23.4	47.2	40.6	72.96
$\mathcal{M}_{\textit{RL}}$ (780M)	27.9	47.5	42.2	73.62
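The headline figures quoted in the abstract can be recovered from the executability column of the table with simple arithmetic. The snippet below is only a check of those numbers; the dictionary keys are shorthand labels, and the values are copied from the table.

```python
# Executability rates (%) copied from the table above.
exec_rate = {"Gorilla (7B)": 26.9, "SFT (780M)": 23.4, "RL (780M)": 27.9}

# Absolute gains in percentage points, rounded to one decimal place
# to avoid floating-point noise.
gain_over_sft = round(exec_rate["RL (780M)"] - exec_rate["SFT (780M)"], 1)
gain_over_gorilla = round(exec_rate["RL (780M)"] - exec_rate["Gorilla (7B)"], 1)

# Parameter-count ratio between the 7B baseline and the 780M model.
size_ratio = round(7e9 / 780e6, 1)

print(gain_over_sft, gain_over_gorilla, size_ratio)  # prints: 4.5 1.0 9.0
```

Both improvements are differences in percentage points rather than relative gains.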
