License: arXiv.org perpetual non-exclusive license
arXiv:2311.13246v2 [cs.CL] 21 Mar 2024

CoachLM: Automatic Instruction Revisions Improve the Data Quality in LLM Instruction Tuning

Yilun Liu, Shimin Tao, Xiaofeng Zhao, Ming Zhu, Wenbing Ma, Junhao Zhu, Chang Su,
Yutai Hou, Miao Zhang, Min Zhang, Hongxia Ma, Li Zhang, Hao Yang, Yanfei Jiang
Huawei, China {liuyilun3, taoshimin, zhaoxiaofeng14, zhuming47, mawenbing, zhujunhao, suchang8, houyutai,
emma.zhangmiao, zhangmin186, mahongxia, izzie.zhangli, yanghao30, jiangyanfei}@huawei.com
Abstract

Instruction tuning is crucial for enabling Large Language Models (LLMs) to respond to human instructions. The quality of instruction pairs used for tuning greatly affects the performance of LLMs. However, the manual creation of high-quality instruction datasets is costly, leading to the adoption of automatic generation of instruction pairs by LLMs as a popular alternative. To ensure the high quality of LLM-generated instruction datasets, several approaches have been proposed. Nevertheless, existing methods either compromise dataset integrity by filtering a large proportion of samples, or are unsuitable for industrial applications. In this paper, instead of discarding low-quality samples, we propose CoachLM, a novel approach to enhance the quality of instruction datasets through automatic revisions on samples in the dataset. CoachLM is trained from the samples revised by human experts and significantly increases the proportion of high-quality samples in the dataset from 17.7% to 78.9%. The effectiveness of CoachLM is further assessed on various real-world instruction test sets. The results show that CoachLM improves the instruction-following capabilities of the instruction-tuned LLM by an average of 29.9%, which even surpasses larger LLMs with nearly twice the number of parameters. Furthermore, CoachLM is successfully deployed in a data management system for LLMs at Huawei, resulting in an efficiency improvement of up to 20% in the cleaning of 40k real-world instruction pairs. We release various assets of CoachLM, including the training data, code and test set, at https://github.com/lunyiliu/CoachLM.

Index Terms:
large language model, instruction tuning, data quality, instruction revision

I Introduction

The rapid progress of Large Language Models (LLMs) has brought a profound impact on various domains. Notable examples include ChatGPT [1] and GPT-4 [2], which have demonstrated the ability to perform complex tasks and provide appropriate responses based on human instructions [3, 4, 5]. Furthermore, these models possess an understanding of their limitations in terms of capabilities [1]. The capabilities of LLMs are developed through a three-stage process. The first stage involves pre-training, where a foundation model is trained to predict subsequent words within large corpora [6]. However, while foundation models like LLaMA [7] can complete input sentences, they lack the ability to effectively respond to human instructions. To address this limitation, LLMs undergo fine-tuning on diverse instructions, leveraging desired responses as learning signals in order to generalize to unseen instructions [8, 9, 10]. This process is commonly referred to as instruction tuning. Some LLMs also incorporate Reinforcement Learning (RL) pipelines to dynamically learn the boundaries of their responses, thereby avoiding the generation of harmful or sensitive content [11, 12, 1].

Figure 1: Illustration of instruction tuning LLMs on pairs of Instruction and Response.

Among these techniques, instruction tuning is considered a crucial process to enhance the capabilities of LLMs by leveraging stored knowledge from pre-training and effectively aligning with human expectations [13]. The process involves further training LLMs on instruction datasets, which consist of formatted instruction pairs. As illustrated in Fig. 1, an instruction pair can be represented as (Instruction, Response), with Instruction denoting the human instruction for the model and Response representing the desired output following the instruction. Crafting a high-quality instruction dataset is essential to elicit the desired behaviors of LLMs through instruction tuning. Prominent LLMs, such as ChatGPT [1], GPT-4 [2], and Bard (https://bard.google.com/), utilize proprietary instruction datasets constructed with significant amounts of human annotation. However, the collection of human-written instruction pairs is expensive, as it requires annotators with comprehensive knowledge. Alternatively, Wang et al. proposed Self-Instruct, an automatic approach to construct instruction datasets by leveraging LLMs to produce instruction pairs with high diversity [14]. With the increasing capabilities and flexibility of LLMs, instruction tuning using LLM-generated instruction datasets has emerged as a paradigm [15, 16, 17]. Notably, the Alpaca project [15] utilizes the GPT-3.5 model and the Self-Instruct strategy to generate 52k instruction pairs (referred to as the Alpaca52k dataset). The Alpaca model, fine-tuned from LLaMA using this dataset, demonstrates a strong ability to follow instructions compared to the GPT-3.5 model.

However, recent studies have raised concerns about the quality of instruction pairs generated by LLMs. These studies [17, 18, 19] suggest that the quality of the instruction dataset used for instruction tuning significantly impacts performance. In response to these concerns, the Alpaca-cleaned project (https://github.com/gururise/AlpacaDataCleaned) has identified various issues in the Alpaca52k dataset, including empty responses and inconsistent formats. To address these issues, regular expressions were employed to clean a subset of instruction pairs within the dataset, resulting in improved performance of the subsequently fine-tuned Alpaca-cleaned model. Additionally, AlpaGasus [20] utilized ChatGPT to select 9k high-quality instruction pairs from the Alpaca52k dataset. The model fine-tuned on this filtered dataset outperformed the original Alpaca model trained on the full dataset. However, despite these efforts, there remains a need for a systematic investigation into the quality of LLM-generated instruction datasets, as rule-based approaches are unable to address all issues. Furthermore, simply discarding low-rated instruction pairs may reduce the diversity of the dataset, thereby diminishing the generalization ability of LLMs.

In this paper, our objective is to propose a systematic and efficient approach to address the issue of unguaranteed data quality in LLM instruction tuning. Instead of discarding low-quality data, our approach focuses on improving their quality through revisions. To achieve this, we conducted a meticulous manual examination of 6k instruction pairs sampled from the Alpaca52k dataset. We engaged 17 language experts to review these pairs along nine dimensions, encompassing basic correctness and advanced experiences. During the primary revision, deficiencies were identified in 46.8% of the examined instruction pairs. Subsequently, the language experts were asked to rewrite the identified low-quality instruction pairs. This generated an expert revision dataset consisting of approximately 2.3k revised instruction pairs and their original counterparts. Using this dataset, we trained a coach language model (CoachLM) to learn the expert revision process and automatically provide revisions for low-quality instruction pairs. To evaluate the effectiveness of our approach, we conducted experiments on four instruction-following test sets, comprising real-world tasks from various categories. The Alpaca-CoachLM model, which was fine-tuned on the CoachLM-revised Alpaca52k dataset, outperformed other Alpaca variants on all test sets in terms of win rates. Remarkably, it even outperformed stronger LLMs with more parameters and training stages. Our contributions are summarized as follows:

  • We conducted a comprehensive examination of the Alpaca52k dataset, a widely-used LLM instruction tuning dataset. This examination resulted in the identification and rewriting of low-quality instruction pairs, leading to an average improvement of 8.4% in the win rates of our tuned Alpaca-human model, where the expert-revised subset was merged back into the Alpaca52k dataset.

  • We introduced CoachLM, an industry-friendly coach language model that automatically revises instruction pairs. CoachLM significantly increased the proportion of high-quality samples in the Alpaca52k dataset, improving it from 17.7% to 78.9%. Furthermore, CoachLM was trained from open-sourced backbone models, facilitating easy and customized deployment.

  • We demonstrated the effectiveness of CoachLM in enhancing the instruction-following capabilities of instruction-tuned LLMs. Our Alpaca-CoachLM model, fine-tuned on the CoachLM-revised Alpaca52k dataset, outperformed the top-performing Alpaca variants by up to 21.5% and even stronger LLMs with more parameters and training stages.

II Methodology

II-A Motivation

(a) The training process of CoachLM
(b) The workflow of CoachLM in boosting LLM instruction tuning
Figure 2: Illustration of CoachLM: (a) in the training stage and (b) in the inference stage. CoachLM learns from the expert revision process in the training stage and performs revisions on instruction pairs in the inference stage. The displayed instruction pairs from the Alpaca52k dataset were revised by CoachLM. For convenience of display, core revisions are marked in red, and the line breaks in the instruction pairs were adjusted. CoachLM rewrote the ambiguous instruction in the first sample, added explanations for the response in the second, and corrected the less appropriate response in the third.

Our work is motivated by the challenges of data quality in instruction tuning and the limitations of existing approaches.

(1) A systematic and deeper examination of the data quality of LLM-generated instruction datasets is needed, as unguaranteed quality of instruction pairs hinders the instruction-following abilities of subsequently tuned LLMs. Recent studies have shown that LLM-generated instruction datasets, such as the Alpaca52k dataset, contain errors in the surface form, such as invalid formats, which negatively impact the performance of LLMs. Although the Alpaca-cleaned project has designed a rule-based approach to correct these surface mistakes, our expert examination reveals deeper deficiencies in the LLM-generated instruction dataset. These deficiencies include incomplete or irrelevant responses and infeasible instructions, which cannot be fully detected by regular expressions. As will be discussed in Section III-C, fixing these deficiencies can further enhance model performance.

(2) There is a need for an automated and industry-friendly approach to improve the quality of instruction datasets, which arises from the high cost associated with manual revisions on a large scale and the uncertainties introduced by relying on API-dependent LLMs. Despite the improvement in model performance through expert revisions, a substantial amount of work, totaling 129 person-days, was required to examine only 6k out of 52k instruction pairs. The significant cost makes it challenging to further enhance the performance of LLMs by scaling up the human revisions. Therefore, an automatic approach is necessary to provide an efficient refinement of instruction datasets. Recent approaches, such as AlpaGasus [20], have utilized off-the-shelf and cloud-based LLMs, such as ChatGPT, to automatically enhance the overall quality of instruction datasets. However, the application of such API-dependent methods is often limited in industrial scenarios due to difficulties in reproducing results caused by frequent updates to the LLM and uncertainties in accessibility due to increasingly stringent blocking strategies. Furthermore, it is not feasible to locally deploy these approaches in private domains with limited internet access, emphasizing the need for an industry-friendly approach that ensures reproducibility, accessibility, and privacy protection.

(3) Existing filtering-based approaches have the potential to negatively impact the diversity of instruction datasets, which in turn hampers the generalization ability of LLMs. These approaches typically select a small subset of instruction pairs with high ratings from the dataset and fine-tune LLMs on this subset, resulting in improved performance compared to LLMs tuned on the full dataset [20, 19]. Although it has been extensively demonstrated that including low-quality instruction pairs in LLM instruction tuning diminishes the instruction-following capability of the models [17, 19, 21], dropping the majority of instruction pairs poses a risk of compromising the integrity of the instruction dataset, as this may lead to a lack of instructions from certain categories and a reduction in the instruction-following abilities of subsequently tuned LLMs in those areas. For instance, Chen et al. [20] observed that the high filtering ratio of code-related instruction pairs in the training dataset of AlpaGasus resulted in relatively weaker performance in responding to coding instructions. One potential solution to address this issue is to improve the low-quality portion of the dataset by revising it to ensure diversity, rather than simply discarding low-quality instructions.

II-B Overview of CoachLM

The architecture of CoachLM, our proposed model for automatic instruction pair revision, is depicted in Fig. 2. In the training stage (Fig. 2(a)), we construct an expert revision dataset consisting of original low-quality instruction pairs and their corresponding manually revised versions. The revisions, carried out by experts considering deficiencies in nine dimensions, involve corrections, adjustments, diversifications, and rewrites. Then, the process of coach instruction tuning adapts a backbone LLM to CoachLM, eliciting its instruction-pair revision ability through tuning on the expert revision samples.

In the inference stage, each instruction pair in an instruction dataset is input to CoachLM for revisions, resulting in a CoachLM-revised instruction dataset. This revised dataset is subsequently employed as a training dataset in LLM instruction tuning. As shown in Fig. 2(b), the displayed CoachLM-revised versions of the instruction pairs, when compared with those in the Alpaca52k dataset, alleviate ambiguity in instructions, expand the necessary reasoning process in responses, and enhance adherence to the requirements in instructions. Consequently, when used as a training dataset in LLM instruction tuning, the higher quality of the CoachLM-revised instruction dataset provides better guidance to the foundation LLM in modeling the connection between user instructions and appropriate responses, thereby improving the instruction-following abilities of the instruction-tuned LLMs.
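The inference-stage workflow described above can be sketched as a simple loop in which every pair is revised and none is discarded. This is only an illustration: `revise_dataset`, `toy_coach`, and the dictionary layout of a pair are hypothetical names introduced here, and `toy_coach` is a trivial stand-in for a call to the trained CoachLM.

```python
from typing import Callable, Dict, List

# Assumed layout of an instruction pair: {"instruction": ..., "response": ...}
Pair = Dict[str, str]


def revise_dataset(dataset: List[Pair], coach: Callable[[Pair], Pair]) -> List[Pair]:
    """Pass every pair through the reviser; the dataset size is preserved."""
    return [coach(pair) for pair in dataset]


def toy_coach(pair: Pair) -> Pair:
    """Hypothetical stand-in for CoachLM: it only trims surrounding whitespace."""
    return {key: value.strip() for key, value in pair.items()}


original = [
    {"instruction": " Name a primary color. ", "response": "Red"},
    {"instruction": "Add 2 and 3.", "response": " 5 "},
]
revised = revise_dataset(original, toy_coach)
```

In contrast to filtering-based pipelines, the revised dataset has the same number of pairs as the original, so no instruction category is lost before instruction tuning.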

The remainder of Section II is organized as follows. Section II-C introduces the expertise and grouping of the language experts involved in our work. Section II-D discusses the definition of data quality in instruction tuning and presents our criteria for evaluating the quality of instruction pairs. Section II-E describes the human revision process of instruction pairs from the Alpaca52k dataset. Section II-F provides a detailed illustration of the methodology used in the training and inference stages of CoachLM. Finally, Section II-G introduces CoachLM150, the instruction-following test set we created.

II-C Profile of Involved Language Experts

TABLE I: Expertise and Grouping of Involved Language Experts

| Group | Task | Number of Experts | Average Years of Experience |
|---|---|---|---|
| A | Revise Instruction Pairs | 17 | 11.29 years |
| B | Create Test Set | 6 | 5.64 years |
| C | Evaluate CoachLM | 3 | 12.57 years |

TABLE II: Human Evaluation Criteria for the Quality of Instruction Pairs

Criteria for Instruction

| Level | Dimension | Description | Main Checklist | Score Range |
|---|---|---|---|---|
| Advanced Requirement | Contextualization | The instruction includes a rich context or effective prompting skills to facilitate detailed and accurate responses. | Check for scenarios, roles, examples, or other requirements, and for skills like chain-of-thought. | 80-100 |
| Basic Requirement | Feasibility | The instruction is clear, specific, feasible, and easily understandable. | Check for ambiguous or vague expressions, logical errors, or requests beyond the ability of an AI model. | 0-80 |
| Basic Requirement | Readability | The instruction adheres to the conventions and stylistic norms of the target language. | Check for language-related issues such as grammar, spelling, and punctuation. | 0-80 |

Criteria for Response

| Level | Dimension | Description | Main Checklist | Score Range |
|---|---|---|---|---|
| Advanced Experience | Humanization | Responses should be warm, empathetic, and engaging, tailored to the user’s background and preferences. | Check: (1) Emotional Perception. Respond to users’ emotions with empathy; (2) Humanized Tone. Interact with users in a natural and friendly way, avoiding machine-like tone. | 90-100 |
| Advanced Experience | Richness | Responses should be diverse, informative, creative, and expanded. | Check: (1) Provide detailed and diverse information with depth and breadth; (2) Enrich the content with novelty, uniqueness, and imagination. | 80-90 |
| Basic Experience | Readability | Responses should use fluent, concise and correct language and be properly structured. | Check (1) Language: Error-free writing using precise vocabulary; (2) Content: Meaningful content without redundancy; (3) Structure: Clear, ordered, and logical organization of information with user-friendly layout. | 40-80 |
| Basic Experience | Comprehensiveness | Responses comprehensively cover all necessary angles and information. | Check (1) No omissions or deficiencies in fully explaining user questions; (2) Multiple angles, sufficient contexts and details for an unbiased response. | 40-80 |
| Basic Experience | Relevance | Responses should be effective and direct, and provide in-topic solutions. | Check (1) Irrelevance: Response misinterprets user’s intention; (2) Deviation: Response is related to user’s topic, but deviates from the focus. | 40-80 |
| Basic Experience | Correctness | Responses should be grounded in factual information, common sense, and logical reasoning, while also staying up-to-date and adhering to the user’s specific requirements. | Check (1) Factual Error: Inconsistent with reality; (2) Common Sense Error: Contradicts human common sense; (3) Logical Error: Includes concept substitution, self-contradiction, ambiguity and circular reasoning, etc.; (4) Compliance with Constraints: Includes word count, genre and style, etc.; (5) Timeliness: The provided information is up-to-date. | 40-80 |
| Experience Red Line | Safety | Responses should be harmless, protecting users’ emotions, body and property. | Check for violation of laws, personal attacks, exposure of user privacy and irresponsible advice on medical or financial matters. | 0-40 |

To ensure a comprehensive and rigorous assessment of data quality and to provide precise and scholarly revisions on instruction pairs, we established a collaboration with the language service center of a prominent international corporation. We recruited a team of highly experienced language experts who dedicated their full-time efforts to this project. These experts possess diverse skill sets encompassing translation, localization, proofreading, editing, copy-writing, technical writing, and linguistic testing. All participating experts have acquired advanced levels of education. Thus, in addition to their exceptional logical reasoning and writing proficiencies, they possess a solid foundation in arithmetic, coding, science, and general knowledge. Furthermore, owing to the existence of multilingual instructions in the Alpaca52k dataset, the multiple language capabilities of our team members, such as English, Chinese, Spanish, Arabic and French, render them uniquely qualified for this project.

As shown in Table I, a total of 26 language experts participated in the study, and they were divided into three non-overlapping groups, each assigned specific tasks. The allocation of experts into groups was based on their expressed preferences, while we initially provided an estimated size for each group that roughly corresponded to the workload of the respective tasks. Consequently, group A comprised 17 experts, possessing an average experience of 11.29 years. Their primary responsibility entailed identifying low-quality instruction pairs and manually revising them as necessary. Group B consisted of six experts tasked with creating an instruction-following test set based on real-world scenarios, as well as providing human responses as reference for the test set. Group C comprised three experts responsible for conducting a human evaluation of CoachLM and the subsequently fine-tuned LLM. Moreover, all experts in the three groups actively participated in the formulation of the quality evaluation criteria for instruction pairs. Notably, there was no overlap between the authors of this paper and the language experts.

II-D Quality Evaluation Criteria for Instruction Pairs

Before examining the data quality of the instruction dataset, it is crucial to establish a comprehensive definition of the quality of instruction pairs. Previous studies [18, 19, 20] generally agree that for LLMs, high-quality instruction pairs are advantageous for instruction tuning, while low-quality pairs may impede the instruction-following ability of LLMs trained on such data. To enhance the capabilities of models to follow human instructions, instruction pairs used for training should adhere to a human-expectation paradigm. Existing research [16, 22, 23, 24] suggests that human expectations for LLM behavior encompass various dimensions, including basic language safety and advanced expectations, such as factual correctness, contextual richness, and helpfulness of responses. A robust evaluation criterion should incorporate these dimensions to ensure high-scored training samples align well with human expectations.

By incorporating the dimensions outlined in existing evaluation criteria [16, 22, 23, 24], a comprehensive set of criteria encompassing nine different evaluation dimensions (as shown in Table II) has been proposed to assess the quality of (Instruction, Response) pairs. The Instruction and Response are evaluated independently, yielding two separate scores ranging from 0 to 100 based on their respective criteria. While all dimensions are necessary, they vary in their significance to the overall human interaction experience. Consequently, the dimensions are grouped into three levels based on their importance, which determines their contribution to the final score. The red-line level (e.g., safety) represents the minimum acceptable standard for human tolerance, where any violation results in a score no higher than 40. The basic level (e.g., correctness and relevance) signifies dimensions that enable effective human-model interaction, and any flaws in this level restrict the score to a maximum of 80. Finally, the advanced level encompasses higher human expectations, including rich context and politeness, and accounts for the top 20 points in the criteria. To mitigate bias, evaluators are instructed to independently and separately assess each dimension, since, for example, a response may still be relevant even if it contains factual inaccuracies.
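The level-based capping rule described above can be illustrated with a minimal sketch. The function names `score_ceiling` and `clamp_score` are hypothetical, and the sketch encodes only the two ceilings stated in the text (40 for a red-line violation, 80 for a basic-level flaw); how evaluators place a score within the permitted range is left to their judgment and is not modeled here.

```python
def score_ceiling(has_red_line_violation: bool, has_basic_flaw: bool) -> int:
    """Return the highest score the criteria allow for an Instruction or Response."""
    if has_red_line_violation:
        return 40   # any red-line (e.g., safety) violation caps the score at 40
    if has_basic_flaw:
        return 80   # any basic-level flaw caps the score at 80
    return 100      # only the advanced level (top 20 points) remains in play


def clamp_score(raw_score: int, has_red_line_violation: bool, has_basic_flaw: bool) -> int:
    """Clamp an evaluator's raw score to the range 0..ceiling."""
    ceiling = score_ceiling(has_red_line_violation, has_basic_flaw)
    return max(0, min(raw_score, ceiling))
```

For example, under these assumptions a response with a factual error cannot score above 80 however rich or empathetic it is, because the most severe violated level determines the ceiling.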

Regarding the criteria for assessing the quality of the Instruction in an instruction pair in Table II, firstly, an Instruction should be grammatically correct and logically feasible. Readability issues may impede accurate understanding of user intent during the training process. Additionally, infeasible Instructions containing logical errors in the training dataset may prevent the model from learning correct connections between instructions and responses, thereby exacerbating the hallucination of tuned LLMs [25, 17, 16]. Moreover, recent studies have shown that including more contextual information and details in user instructions leads to better model responses [26, 27]. Therefore, a high-quality Instruction should also be rich in specific contexts, such as requirements and examples.

Similarly, a high-quality Response to the user’s instruction ensures a desirable user experience. Firstly, the red line of a Response is the safety aspect for the user and other entities. Additionally, a basic requirement for a good user experience is a relevant and comprehensive response without factual and language errors. Furthermore, providing a Response with expanded information and a humanized tone is essential for delivering an advanced user experience.

II-E Manual Instruction Revision with Experts

In this section, we present details of the human revision process conducted on a randomly selected subset of 6k instruction pairs from the Alpaca52k dataset.

II-E1 Preliminary Filtering

TABLE III: The Distribution of the 1088 Excluded Instruction Pairs

| Reason | Example | Ratio |
|---|---|---|
| Invalid Input: The key content of the instruction is invalid. | Generate a creative title for this article. Input: [Link to an article]. | 41.7% |
| Beyond Expertise: Overly professional scenes. | Generate the chords for an E minor scale. | 27.7% |
| Massive Workload: Poem or lyric requiring massive rewriting. | From the given lyrics, create a haiku poem. | 8.2% |
| Multi-modal: Image, video and audio, which are not supported. | List the products in the photo. Input: (photo of a grocery store). | 6.5% |
| Safety: Overly toxic content, copyrighted content and sensitive content. | - | 15.9% |

Before the primary revision, experts from group A conducted a preliminary filtering on the sampled 6k instruction pairs to exclude unsuitable pairs. As shown in Table III, a total of 1088 pairs were excluded, mainly due to missing or invalid key parts, excessive expertise or workload requirements, inclusion of unsupported multi-modal information, and overly toxic or sensitive content. These excluded pairs still participated in subsequent LLM training for fair comparison. A small proportion of such pairs were retained during the revision to ensure diversity of revision.

II-E2 Expert Revision

TABLE IV: The Statistics of Expert Revisions Made on Instruction Pairs

Distribution of the 1079 revised Instructions

| Revision | Dimension | Ratio |
|---|---|---|
| Adjust the language and layout of the instruction to be clear and correct. | Readability | 68.1% |
| Rewrite infeasible instructions; Rewrite the confusing and ambiguous part of instructions. | Feasibility | 24.9% |
| Diversify the context; Add specific requirements and examples. | Contextualization | 7.0% |

Distribution of the 2301 revised Responses

| Revision | Dimension | Ratio |
|---|---|---|
| Diversify angles of the responses; Add necessary explanations and backgrounds; Expand the reasoning process. | Comprehensiveness, Richness | 43.7% |
| Rewrite the language to be fluent and natural; Rewrite the content to be relevant, useful and logically consistent. | Relevance, Readability, Correctness | 24.5% |
| Adjust response layout to be clear; Adjust the tone to be empathetic and personalized. | Readability, Humanization | 23.3% |
| Correct miscalculations, factual mistakes and common sense violations. | Correctness | 6.7% |
| Other complex and creative revisions; mitigate safety issues. | Safety, Others | 1.9% |

After excluding the 1088 filtered instruction pairs, the remaining 4.9k instruction pairs underwent the primary revision. To ensure an effective revision process, we adopted an expertise-based approach to assign instruction pairs to experts [28, 29]. Based on the categories proposed in [15], the instruction pairs were classified into three classes representing different levels of difficulty (i.e., expertise required) for revision. The first class involved language tasks that require mostly certain and objective answers, such as information extraction, grammatical correction, and summarizing. The second class included question answering (Q&A), which entails open dialogue completion, suggestion recommendation, and in-domain Q&A. Revising instruction pairs in this class demands higher language expertise due to the diverse and subjective nature of desired answers. The third and most challenging class involved creative composition, such as story creation and copywriting, which often necessitate substantial revision of creative content. In our expertise-based selection approach, the expertise of each expert was estimated by years of experience, and the 17 experts from group A were divided into three units according to their expertise, with each unit responsible for revising one class. As a result, the average years of experience for experts in each unit are 9.4 years for language tasks, 11.2 years for Q&A, and 13.1 years for creative composition.

In addition, each unit was assigned an owner whose responsibility was to assess the quality of the revised instruction pairs produced by unit members. The revision process strictly adhered to the criteria outlined in Table II, following the principle of “making all necessary revisions,” regardless of the importance of the revised dimensions. If an instruction pair was identified as lacking in one or more dimensions in the criteria, the expert was required to make substantial revisions in those dimensions until the instruction pair achieved a score of 95 or higher based on the criteria. Consequently, considering the workload of preliminary filtering, quality control, and primary revision, a total of 129 person-days were expended, resulting in 2301 instruction pairs receiving revisions either on the Instruction or Response side. Among the 2.3k revised pairs, 1079 of them underwent revisions on Instruction.
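As a quick arithmetic check of the counts reported in this section, the following sketch recomputes the shares from the stated totals. It assumes that the "6k" sample contains exactly 6000 pairs, which the text does not state explicitly; the variable names are introduced here for illustration.

```python
sampled = 6000              # assumption: "6k" is taken to mean exactly 6000 pairs
excluded = 1088             # pairs removed during preliminary filtering
revised = 2301              # pairs revised on the Instruction or Response side
revised_instruction = 1079  # revised pairs whose Instruction was changed

examined = sampled - excluded                      # pairs entering the primary revision
share_revised = revised / examined                 # share of examined pairs with deficiencies
share_instruction = revised_instruction / revised  # share of revised pairs with Instruction edits

print(examined)                     # 4912
print(f"{share_revised:.1%}")       # 46.8%
print(f"{share_instruction:.1%}")   # 46.9%
```

Under this assumption, the first ratio appears consistent with the 46.8% deficiency rate quoted in the introduction and with the "4.9k" remaining pairs mentioned above.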

During the revision, each instruction pair may have received revisions in multiple dimensions. The revised instruction pairs were categorized based on the primary type of revisions they underwent, and the distribution of each revision category is displayed in Table IV. For revisions on the Instruction side, approximately 68.1% consisted of minor adjustments in language and layout, while the remaining 31.9% involved improvements in feasibility and the inclusion of additional contextual information. As for Responses, the most common types of revisions comprised expanding the depth of the response or providing necessary supporting explanations, accounting for 43.7% of the revisions. Other revisions include content rewrites in terms of logic and relevancy, adjustments related to layout and tone, and corrections of factual and calculation errors. In order to ensure a diverse range of revisions, approximately 1.9% of the revisions were cases that should have fallen into the categories listed in Table III. Further analysis details are available in the technical report in our repository.

II-F Design of CoachLM

The effectiveness of our criteria and revision process is evident from the advantage of Alpaca-human over Alpaca in Table IX. However, it is important to note that our manual examination only encompasses a limited portion of the Alpaca52k dataset, leaving the quality of the majority of the dataset uncertain. Given the high cost associated with expert revision, expanding the manual revision process to a larger scale is impractical, which motivates CoachLM, the proposed approach for efficient automatic revisions.

II-F1 Coach Instruction Tuning

CoachLM is trained by taking content revision as a type of instruction, which LLMs can follow via instruction tuning. Similar to general instructions, the requisite knowledge for content revision exists in the pre-training stage of LLMs, and is aligned with human expectations during instruction tuning. For instance, content-revision instructions found in the Alpaca52k dataset, such as “correct the grammatical errors in the sentence”, elicit the basic capacity of instruction-tuned LLMs like Alpaca to engage in content revision. Thus, we propose the process of coach instruction tuning that involves fine-tuning an LLM using specifically designed instruction pairs. These instruction pairs prompt the LLM to provide revisions to input instructions and align its responses with expert-revised outcomes. Through this approach, the LLM is anticipated to develop the ability to revise instruction pairs in a manner consistent with expert revision practices.

Specifically, given an instruction dataset $V$ of instruction pairs $x$ = (Instruction, Response) with $x \in V$, each instruction pair $x$ undergoes a revision through the expert revision process, resulting in a revised instruction pair $x_r$. The expert revision dataset $R$ is then formed, which comprises both the original and revised instruction pairs, denoted as $R = \{(x, x_r) \mid x \in V\}$. During the coach instruction tuning process, each $(x, x_r) \in R$ is leveraged to construct an instruction pair $x_c$, leading to an instruction dataset $C = \{x_c \mid x \in V\}$.

Figure 3: Illustration of the format of the instruction pairs $x_c$ in coach instruction tuning. $x$ denotes the original instruction pair and $x_r$ represents the version revised by experts.

As shown in Fig. 3, the Instruction of $x_c$ instructs the LLM to enhance the quality of $x$, the original instruction pair, while the Response of $x_c$ is $x_r$, the expert-revised counterpart. When designing the Instruction component, we provide a succinct revision instruction that highlights the primary areas for revision based on the expert revision results. We deliberately refrain from composing an exhaustive and detailed instruction that fully encompasses all criteria, as a lengthy instruction could potentially distract the LLM from capturing the connections between the input instruction pairs and their expert-revised versions. Nonetheless, it is worth exploring whether the design of the instruction pair in Fig. 3 is optimal in future research.
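The construction of $x_c$ from $(x, x_r)$ can be sketched as follows. This is a minimal illustration, not the released implementation: the `InstructionPair` type, the `serialize` layout, and the wording of `REVISION_PROMPT` are all assumptions introduced here, and the actual prompt is the one shown in Fig. 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionPair:
    instruction: str
    response: str


# Hypothetical wording; the actual revision prompt is the one in Fig. 3.
REVISION_PROMPT = (
    "Improve the following instruction pair so that the instruction is clear "
    "and feasible and the response is accurate and sufficiently detailed."
)


def serialize(pair: InstructionPair) -> str:
    """Render an instruction pair as text (this layout is an assumption)."""
    return f"Instruction: {pair.instruction}\nResponse: {pair.response}"


def build_coach_pair(x: InstructionPair, x_r: InstructionPair) -> InstructionPair:
    """Construct x_c: its Instruction asks for a revision of x, its Response is x_r."""
    return InstructionPair(
        instruction=f"{REVISION_PROMPT}\n\n{serialize(x)}",
        response=serialize(x_r),
    )


# Toy example with an original pair x and an expert-revised pair x_r.
x = InstructionPair("Name a fruit", "apple")
x_r = InstructionPair("Name one fruit.", "An apple is a fruit.")
x_c = build_coach_pair(x, x_r)
```

Applying `build_coach_pair` to every $(x, x_r) \in R$ would yield the dataset $C$; keeping the prompt short mirrors the design choice described above.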

Given an LLM with parameters $\theta$ as the initial model for coach instruction tuning, training the model on the constructed instruction dataset $C$ results in the adaptation of the LLM’s parameters from $\theta$ to $\theta_c$, denoted as CoachLM. Specifically, $\theta_c$ is obtained by maximizing the probability of predicting the next tokens in the Response component of $x_c$, conditioned on the Instruction of $x_c \in C$, which is formulated as:

$$\theta_c = \operatorname*{arg\,max}_{\theta} \sum_{x_c \in C} \log P(\textsc{Response} \mid \textsc{Instruction};\, \theta, x_c). \tag{1}$$
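For a fixed $\theta$, the quantity maximized in Eq. (1) can be illustrated with a small sketch. It assumes the usual autoregressive factorization, in which the log-probability of the Response is the sum of per-token log-probabilities and Instruction tokens are excluded from the sum. The function names and the toy probabilities are illustrative assumptions, not part of the released code.

```python
import math
from typing import Iterable, Sequence, Tuple


def response_log_likelihood(
    token_probs: Sequence[float], is_response: Sequence[bool]
) -> float:
    """Sum log P(token | preceding tokens) over Response tokens only.

    token_probs[i] is the probability the model assigned to the i-th token
    of x_c; is_response[i] marks whether that token belongs to the Response.
    """
    if len(token_probs) != len(is_response):
        raise ValueError("token_probs and is_response must have equal length")
    return sum(math.log(p) for p, keep in zip(token_probs, is_response) if keep)


def dataset_objective(
    samples: Iterable[Tuple[Sequence[float], Sequence[bool]]]
) -> float:
    """Value of the sum in Eq. (1) for a fixed theta, over all x_c in C."""
    return sum(response_log_likelihood(p, m) for p, m in samples)


# Toy example: two Instruction tokens (ignored) and two Response tokens.
probs = [0.9, 0.8, 0.5, 0.25]
mask = [False, False, True, True]
ll = response_log_likelihood(probs, mask)  # log(0.5) + log(0.25)
```

Training then searches for the parameters that make this sum as large as possible, which in practice is done by minimizing the negated sum with gradient descent.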

II-F2 Quality Control of Human Input

In the pre-LLM era, models were required to learn both task-specific knowledge and the alignment between task input and desired output. This is why training on negative samples was sometimes beneficial, as it provided the model with supplementary knowledge and boundaries for the task-specific information [19]. However, with the adoption of current LLM techniques, most of the required knowledge is learned during pre-training. Numerous pieces of evidence suggest that when fine-tuning an LLM through instruction tuning, the introduction of low-quality instruction pairs actually hinders the performance of the tuned LLM [20, 19, 18, 30]. This phenomenon can be explained by the assumption that the instruction tuning process mainly promotes the alignment between the model and the expected user responses, and low-quality samples impede the model’s ability to correctly establish connections between its stored knowledge and following user instructions.

This concern also applies to the proposed coach instruction tuning process, as it may lead to sub-optimal performance of CoachLM if all the 2.3k available revision examples in $R$ are used to construct the training dataset $C$. Although the expert revision process includes a quality control stage that ensures each revised instruction pair $x_r$ meets the criteria in Table II, the original instruction pair $x$ may still influence the overall quality of the constructed instruction pair $x_c$. If $x$ is already in good shape, only minor revisions are made to obtain $x_r$. In extreme cases where $x$ is identical to $x_r$, including such samples in the construction of $C$ is akin to introducing negative samples into the coach instruction tuning process, which may hinder the performance of CoachLM as described above. In other words, the quality of $x_c$ can be determined by the difference between $x_r$ and $x$, with a higher difference indicating more revisions that CoachLM can learn from.

To avoid biased results from the experts, we did not impose a minimum amount of revision for each revised sample in the expert revision process. Instead, we employ the edit distance metric to assess the quality of $(x, x_r) \in R$ and define $\alpha$, the human input ratio, to determine the final subset of samples used in $C$. The edit distance, also known as the Levenshtein distance, quantifies the minimum number of single-character edits needed to transform one string into another [31]. The edit distance reflects the difference between $x$ and $x_r$, thereby measuring the quality of $x_c$. Then, by defining a ratio $\alpha$ between 0 and 1, we form $C_\alpha$ from the fraction $\alpha$ of human input samples in $R$ with the largest edit distances. By replacing $C$ with $C_\alpha$ in Eq. (1), we obtain a CoachLM trained with a high-quality subset of the constructed instruction dataset $C$.
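The selection of $C_\alpha$ can be sketched as below. The Levenshtein routine is the standard dynamic program. Treating each pair as a single string, breaking ties by input order, and rounding $\alpha |R|$ to the nearest integer are assumptions made for this illustration.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance via a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def select_top_alpha(
    pairs: List[Tuple[str, str]], alpha: float
) -> List[Tuple[str, str]]:
    """Keep the alpha fraction of (x, x_r) pairs with the largest edit distance."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    ranked = sorted(pairs, key=lambda p: edit_distance(p[0], p[1]), reverse=True)
    keep = round(alpha * len(ranked))  # rounding rule is an assumption
    return ranked[:keep]
```

With $\alpha = 1$ every pair is kept, including pairs where $x$ and $x_r$ are identical (distance 0); lowering $\alpha$ drops those lightly revised pairs first.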

II-F3 Automatic Revision with CoachLM

Through coach instruction tuning, CoachLM generates automatic revisions on input instruction pairs, creating a CoachLM-revised instruction dataset. This high-quality dataset can subsequently be used as a training dataset for LLM instruction tuning. Let $D$ represent an input instruction dataset (e.g., the Alpaca52k dataset), consisting of instruction pairs $x$. Each $x \in D$ is combined with the revision prompt shown in Fig. 3 to form an instruction pair $x'_c \in D'$, with an empty Response to be filled by CoachLM. The CoachLM-revised instruction dataset, denoted as $D_c$, is obtained by applying $\theta_c$, the CoachLM, on $D'$:

$$D_c = \{\theta_c(x'_c) \mid x'_c \in D'\}. \tag{2}$$

II-G CoachLM150 Test Set

As mentioned in Section II-C, the primary task of experts in group B is to create a high-quality LLM test suite called the CoachLM150 test set. This test set aims to evaluate the diverse abilities of LLMs acquired in the instruction tuning process. To construct this test set, the experts analyzed the categories of instructions in existing instruction tuning datasets [15, 14] and identified 42 distinct categories, including information extraction, scientific inference, dialogue completion, brainstorming, in-domain question answering, and more, to assess the instruction-following ability of LLMs.

The 42 categories were evenly assigned to five out of the six experts in group B. Each expert searched for real-world user cases related to their assigned categories and organized them into instructions. The sources of these user cases include tutorial websites (cookup.ai/chatgpt/usecases), online blogs (writesonic.com/blog/chatgpt-use-cases), and user forums (sharegpt.com). For each instruction, the corresponding expert composed a reference response. Among all the reference responses, approximately one third were post-edited from LLM-generated responses provided by the user case sources, while the remaining two thirds were written by experts from scratch. The quality control of the curated instruction pairs was performed by the remaining expert, who evaluated them based on the criteria mentioned in Table II and rejected low-quality pairs. This process resulted in a final test set consisting of 150 instructions with their corresponding reference responses.

III Experiments and Evaluations

In Section III-A, we provide an overview of the experimental set-up of CoachLM. Section III-B investigates the effectiveness of CoachLM in enhancing the data quality of the revised instruction dataset. Section III-C assesses the performance improvement achieved by tuning the LLM using the CoachLM-revised instruction dataset. Furthermore, in Sections III-D and III-E, we conduct an ablation study on the influence of parameter settings and backbone models on CoachLM.

III-A Experimental Setup

III-A1 Evaluation Approach

TABLE V: Evaluation Approaches Utilized in the Experiment

| Approach | Evaluation Task | Type | Efficiency | Availability |
| --- | --- | --- | --- | --- |
| Human | Both | Direct Score | Low | Low |
| ChatGPT [20] | Instruction Dataset | Direct Score | Medium | Medium |
| GPT-4 [16] | LLM Performance | Comparison | Medium | Low |
| PandaLM [24] | LLM Performance | Comparison | High | High |

In the experiment, a comprehensive evaluation of CoachLM is conducted using both automatic and human approaches, as shown in Table V.

Human

Three experts from group C (denoted by R1, R2, and R3, respectively) independently assign a score from 0 to 100 to each Instruction or Response based on the criteria in Table II, unaware of the sources of the rated samples. The experts evaluate the satisfaction of dimensions and assign scores within the range of satisfied dimensions. However, human evaluation is limited in efficiency and availability due to its high cost and the requirement for expertise.

ChatGPT

Following AlpaGasus [20], the overall quality of the CoachLM-revised instruction dataset is rated using ChatGPT (i.e., the GPT-3.5-turbo API). This method prompts ChatGPT to evaluate the accuracy of the Response in an instruction pair, using a rating scale ranging from 0 to 5. The desired output from ChatGPT consists of a score and an accompanying rationale for its assignment.

GPT-4

To evaluate the performance of LLMs, GPT-4 is used to compare and rate the Responses from two candidate models, using the prompt designed by Chiang et al. [16]. The prompt first displays two candidate responses to an instruction from the test set and then asks GPT-4 to assess the relative quality of the two responses based on helpfulness, relevance, accuracy, and level of detail. The desired output from GPT-4 consists of two scores from 0 to 10, denoting the quality of each candidate response, along with an accompanying rationale. However, this approach is limited by its reliance on an external API and by the reported evaluation biases when swapping candidates [24], despite the strong evaluation ability of GPT-4 relative to humans [2].

PandaLM

This open-source judge model allows for local deployment and offers efficient evaluations on LLMs [24]. By fine-tuning LLaMA [7] using 300k evaluation samples (generated by GPT-3.5), this model, with only 7B parameters, achieves an evaluation ability of 88.3% compared to GPT-4 and effectively addresses biases that may arise when swapping candidates. PandaLM takes an instruction and two candidate responses as inputs. It then generates a comparative conclusion (“win”, “tie”, or “lose”) of the two candidates and a rationale for its decision, considering factors like correctness, conciseness, and adherence to the given instruction.

To address biases in comparison-based evaluations, we used the approach in AlpaGasus [20]. This involves conducting two ratings for each comparison by swapping the order of the two candidates. Conflicting results, where a candidate is rated as a “win” in the first rating but a “lose” in the reversed order, are modified to a “tie”. Notably, a combination of “win” and “tie” (or “lose” and “tie”) is still considered a “win” (or “lose”).
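The merging rule for the two ratings can be written as a small function. This is a sketch of the rule as stated above; the function name is hypothetical, and both verdicts are assumed to be expressed from the same candidate's point of view, that is, the verdict from the swapped run has already been mapped back.

```python
def merge_swapped_verdicts(first: str, second: str) -> str:
    """Combine the verdicts for one candidate from both presentation orders."""
    valid = {"win", "tie", "lose"}
    if first not in valid or second not in valid:
        raise ValueError("verdicts must be 'win', 'tie', or 'lose'")
    verdicts = {first, second}
    if verdicts == {"win", "lose"}:
        return "tie"   # conflicting results are modified to a tie
    if "win" in verdicts:
        return "win"   # win + win, or win + tie
    if "lose" in verdicts:
        return "lose"  # lose + lose, or lose + tie
    return "tie"       # tie + tie
```

Because the rule depends only on the set of the two verdicts, the result does not change if the two ratings are passed in the opposite order.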

III-A2 Instruction-following Test Sets

TABLE VI: Test Sets on Instruction-following Ability of LLMs

| Name | Size | Number of Categories | Reference Response |
| --- | --- | --- | --- |
| CoachLM150 | 150 | 42 | Human |
| PandaLM170 [24] | 170 | 11 | ChatGPT |
| Vicuna80 [16] | 80 | 9 | Bard |
| Self-Instruct252 [14] | 252 | 15 | Human |

As shown in Table VI, in addition to the CoachLM150 test set, we also utilize three popular public LLM test sets in our experiments, namely the Self-Instruct252 test set [14], the PandaLM170 test set [24], and the Vicuna80 test set [16]. The Self-Instruct252 test set was curated by Wang et al., who provided instructions under various application scenarios such as Gmail, Twitter, and Github, along with human responses. The PandaLM170 test set was created by sampling instructions from the Self-Instruct252 test set, with reference responses generated by ChatGPT. The Vicuna80 test set comprises instructions related to writing, role-play, math, and knowledge, for which the responses from Bard were used as reference responses due to the absence of human responses.

III-A3 Implementation Details

The experiments were conducted using 8 NVIDIA A100 GPUs. We explored different backbone models $\theta$ and different $\alpha$ values for CoachLM. In our main experiment, we used ChatGLM2 [32] as the backbone model, which has 6B parameters, and set $\alpha$ to 0.3. To efficiently adapt the backbone LLMs, we employed LoRA [33], a partial fine-tuning technique. See detailed parameter settings in our repository. CoachLM was trained for seven epochs with a learning rate of $2 \times 10^{-4}$. For training the instruction-following models, we utilized the same settings as the official Alpaca repository (https://github.com/tatsu-lab/stanford_alpaca), with the exception of using different instruction datasets. During the inference stage, the beam size for decoding was set to one for all models.

III-B Data Quality of CoachLM-revised Instruction Dataset

TABLE VII: Statistics of the CoachLM-revised Alpaca52k Dataset

| Dataset | Instruction: Average Length | Instruction: Word-level Edit Distance | Response: Average Length | Response: Word-level Edit Distance |
| --- | --- | --- | --- | --- |
| Original | 17.7 | - | 43.9 | - |
| CoachLM-revised | 16.8 | 3.4 | 143.1 | 128.7 |

III-B1 CoachLM-revised Alpaca52k Dataset

By inputting every instruction pair from the Alpaca52k dataset into CoachLM for revision as described in Eq. (2), a CoachLM-revised Alpaca52k dataset was obtained. We performed automatic post-processing on the outputs of CoachLM using regular expressions to remove invalid characters and repeated strings that were occasionally produced. Approximately 1.3% of the outputs were not valid instruction pairs and were replaced with the original instruction pairs. To avoid data leakage, instructions that appeared in the training of CoachLM were withheld from inference and the original samples were adopted directly; these also accounted for around 1.3% of the dataset. Three examples revised by CoachLM are shown in Fig. 2.
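The revision pass with its two fallbacks can be sketched as follows. Everything specific here is an assumption for illustration: the `revise` callable stands in for CoachLM inference, the output layout matched by `_PAIR_RE` is hypothetical, and the cleaning of invalid characters and repeated strings is omitted because its exact patterns are not given in the text.

```python
import re
from typing import Callable, Dict, Iterable, List, Optional, Set

Pair = Dict[str, str]  # {"instruction": ..., "response": ...}

# Hypothetical output layout; the real format follows the prompt in Fig. 3.
_PAIR_RE = re.compile(
    r"Instruction:\s*(?P<i>.+?)\s*Response:\s*(?P<r>.+)", re.DOTALL
)


def parse_revision(raw: str) -> Optional[Pair]:
    """Return the revised pair, or None if the output is not a valid pair."""
    match = _PAIR_RE.search(raw)
    if match is None:
        return None
    return {"instruction": match["i"].strip(), "response": match["r"].strip()}


def revise_dataset(
    dataset: Iterable[Pair],
    revise: Callable[[Pair], str],
    seen_in_training: Set[str],
) -> List[Pair]:
    """Apply the reviser to each pair, keeping the original where required."""
    revised: List[Pair] = []
    for x in dataset:
        if x["instruction"] in seen_in_training:
            revised.append(x)  # avoid leakage: adopt the original sample
            continue
        parsed = parse_revision(revise(x))
        # Invalid outputs are replaced with the original instruction pair.
        revised.append(parsed if parsed is not None else x)
    return revised
```

Keeping the original pair in both fallback cases means the revised dataset always has the same number of samples as the input dataset.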

Table VII presents the statistics of the Alpaca52k dataset before and after revision, including the average length and average edit distance at the word-level. The CoachLM-revised dataset showed significant revisions on Responses in most instruction pairs and resulted in longer responses on average compared with the original dataset, indicating the addition of substantial new content in the revised responses. In contrast, only around 8k instruction pairs exhibited revisions on Instructions. The relatively small number of revisions and nearly unchanged average length suggest that CoachLM primarily adjusted the logical and linguistic aspects of the Instructions without adding much new content.

Figure 4: Histograms of ratings by ChatGPT on the whole Alpaca52k dataset (a) before CoachLM revision (average score 3.95) and (b) after CoachLM revision (average score 4.31).

III-B2 ChatGPT Evaluation

As described in Section III-A1, ChatGPT is employed to rate the accuracy of each Response on a scale of 0-5 [20], which we utilized as an automatic quality metric for the entire dataset. Fig. 4 illustrates the significant improvement in the average rating of responses in the Alpaca52k dataset, rising from 3.95 to 4.31 after the revision by CoachLM. The original dataset had only 17.7% (around 9k as reported in [20]) of instruction pairs with a rating above 4.5. However, this ratio increased significantly to 78.9% in the CoachLM-revised dataset. This enhancement indicates that instead of refining the Alpaca52k dataset by discarding a majority of samples, the CoachLM-revised dataset predominantly consists of high-quality instruction pairs. As a result, it can positively impact the instruction tuning of LLMs, while preserving the integrity of the original dataset.

III-B3 Human Evaluation on Data Quality

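TABLE VIII: Human Ratings on a Subset of the CoachLM-revised Dataset

| Dataset | Instruction R1 | Instruction R2 | Instruction R3 | Instruction Avg. | Response R1 | Response R2 | Response R3 | Response Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Randomly Sampled 150 Instruction Pairs | | | | | | | | |
| Original | - | - | - | - | 71.1 | 71.2 | 71.3 | 71.2 |
| CoachLM-revised | - | - | - | - | 73.9 | 77.2 | 74.0 | 75.0 |
| 18 Samples in the Subset with Modified Instructions | | | | | | | | |
| Original | 76.6 | 74.7 | 77.2 | 76.2 | 67.9 | 70.0 | 68.4 | 68.8 |
| CoachLM-revised | 78.3 | 79.6 | 79.1 | 79.0 | 75.3 | 81.8 | 75.6 | 77.6 |
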
TABLE IX: Win Rates of LLMs Against Reference Responses on Four Instruction-Following Test Sets Rated by PandaLM

| Model | Size | Type$^{\mathrm{a}}$ | CoachLM150 WR1 | CoachLM150 WR2 | CoachLM150 QS | PandaLM170 WR1 | PandaLM170 WR2 | PandaLM170 QS | Vicuna80 WR1 | Vicuna80 WR2 | Vicuna80 QS | Self-Instruct252 WR1 | Self-Instruct252 WR2 | Self-Instruct252 QS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Stronger LLMs | | | | | | | | | | | | | | |
| LLaMA2-13b-chat [34] | 13B | RL-tuned | 65.3% | 81.9% | 91.3% | 78.8% | 92.2% | 94.7% | 54.4% | 66.7% | 91.3% | 75.2% | 92.1% | 95.2% |
| Vicuna-13b [16] | 13B | I-tuned | 57.3% | 66.7% | 85.3% | 73.8% | 89.3% | 93.5% | 46.3% | 36.4% | 82.5% | 67.1% | 82.1% | 90.5% |
| LLaMA2-7b-chat [34] | 7B | RL-tuned | 61.0% | 76.2% | 90.0% | 78.2% | 94.4% | 96.5% | 50.0% | 50.0% | 88.8% | 71.0% | 89.0% | 94.0% |
| ChatGLM [32] | 6B | RL-tuned | 56.3% | 62.7% | 81.3% | 76.8% | 88.2% | 91.8% | 51.9% | 60.0% | 92.5% | 71.4% | 83.3% | 89.3% |
| ChatGLM2 [32] | 6B | RL-tuned | 52.7% | 55.3% | 77.3% | 68.8% | 82.7% | 90.0% | 44.4% | 28.6% | 81.3% | 64.3% | 75.7% | 86.5% |
| Alpaca-CoachLM (ours) | 7B | I-tuned | 67.7% | 79.8% | 88.0% | 83.5% | 95.2% | 96.5% | 46.9% | 38.1% | 83.8% | 76.0% | 87.4% | 91.3% |
| Baseline LLMs | | | | | | | | | | | | | | |
| Vicuna-7b [16] | 7B | I-tuned | 60.0% | 71.4% | 86.7% | 73.5% | 86.4% | 91.2% | 41.9% | 29.0% | 72.5% | 68.1% | 81.0% | 88.9% |
| Alpaca [15] | 7B | I-tuned | 48.0% | 45.7% | 74.7% | 62.6% | 76.5% | 88.8% | 38.8% | 20.0% | 70.0% | 53.8% | 58.6% | 81.7% |
| Alpaca-cleaned | 7B | I-tuned | 46.7% | 43.1% | 72.7% | 62.9% | 76.8% | 88.8% | 41.9% | 21.7% | 77.5% | 52.8% | 55.9% | 79.4% |
| Alpaca-PandaLM [24] | 7B | I-tuned | 57.0% | 65.7% | 84.7% | 72.9% | 88.2% | 92.9% | 45.0% | 31.8% | 81.3% | 62.7% | 75.8% | 88.1% |
| AlpaGasus [20] | 7B | I-tuned | 49.7% | 49.2% | 78.0% | 65.9% | 82.9% | 91.8% | 38.1% | 17.2% | 70.0% | 55.6% | 62.3% | 82.9% |
| Alpaca-human (ours) | 7B | I-tuned | 52.0% | 55.0% | 82.0% | 65.3% | 82.5% | 91.8% | 42.5% | 22.7% | 78.8% | 55.0% | 62.1% | 84.5% |
| Alpaca-CoachLM (ours) | 7B | I-tuned | 67.7% | 79.8% | 88.0% | 83.5% | 95.2% | 96.5% | 46.9% | 38.1% | 83.8% | 76.0% | 87.4% | 91.3% |

$^{\mathrm{a}}$ I-tuned is short for Instruction-tuned. RL-tuned denotes the LLMs tuned through RL pipelines in addition to instruction tuning.

Since the evaluation approach of ChatGPT only covers Responses, we performed a human evaluation to assess the quality of both the Responses and Instructions, as described in Section III-A1. To achieve this, we randomly selected 150 instruction pairs from the revised dataset and obtained ratings from three independent reviewers who were unaware of the sample sources. Among these pairs, 18 had modifications in terms of Instructions made by CoachLM. The results, presented in Table VIII, indicate that after the revision by CoachLM, both the Instructions and Responses received higher average scores according to all three reviewers. Notably, the improvement in Responses was more pronounced for the 18 samples with modified Instructions compared with the entire subset, implying the importance of a feasible and accurate Instruction in enhancing the quality of Response.

III-C Evaluation of LLM Tuned on CoachLM-revised Dataset

In this section, we evaluate the Alpaca-CoachLM model, which is tuned using the same settings as Alpaca [15], but with the CoachLM-revised dataset replacing the Alpaca52k dataset. We also display our Alpaca-human model, with the human-revised subset merged into the full dataset.

III-C1 Compare Alpaca-CoachLM with Existing LLMs

Setup

We compare our model with two groups of existing LLMs. The first group is Baseline LLMs, which are instruction-tuned LLMs from LLaMA with the same number of parameters (i.e., 7B) and similar amounts of training data. To further assess the boundary of Alpaca-CoachLM, we compare it with the second group of Stronger LLMs. These models have larger scales (13B), are tuned with proprietary instruction datasets (e.g., LLaMA2-chat [34], ChatGLM2 [32]), or benefit from additional feedback from RL pipelines. The four test sets used in the evaluation are described in Sections II-G and III-A2. For each sample in a test set, PandaLM rates the candidate response against the reference response and produces a conclusion of “win”, “tie”, or “lose”. We compute three types of win rates: (1) WR1, which considers a “tie” as a half-win and is calculated as $\mathrm{WR1} = \frac{\#\mathrm{win} + 0.5 \times \#\mathrm{tie}}{\#\mathrm{all}}$, where $\#\mathrm{all}$ is the number of samples in the test set; (2) WR2, which excludes tied cases and is given by $\mathrm{WR2} = \frac{\#\mathrm{win}}{\#\mathrm{all} - \#\mathrm{tie}}$; and (3) QS, a quality score that measures the ratio of responses reaching the level of references, formulated as $\mathrm{QS} = \frac{\#\mathrm{win} + \#\mathrm{tie}}{\#\mathrm{all}}$.
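The three win rates defined above can be computed from a list of per-sample verdicts as in the following sketch. The function name and the handling of the all-tie case (where WR2 is undefined and returned as NaN) are assumptions of this illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def win_rates(verdicts: Iterable[str]) -> Dict[str, float]:
    """Compute WR1, WR2, and QS from 'win' / 'tie' / 'lose' verdicts."""
    counts = Counter(verdicts)
    win, tie, lose = counts["win"], counts["tie"], counts["lose"]
    total = win + tie + lose
    if total == 0:
        raise ValueError("no verdicts given")
    decided = total - tie
    return {
        "WR1": (win + 0.5 * tie) / total,                   # tie counts as half a win
        "WR2": win / decided if decided else float("nan"),  # ties excluded
        "QS": (win + tie) / total,                          # at least on par with reference
    }


# Illustrative counts only (not reported in the paper): 32 wins, 80 ties,
# and 38 losses out of 150 samples.
rates = win_rates(["win"] * 32 + ["tie"] * 80 + ["lose"] * 38)
```

These illustrative counts round to 48.0%, 45.7%, and 74.7%, which matches the Alpaca row for CoachLM150 in Table IX. They also show how a model can reach a fairly high QS through ties while its WR2 stays below 50%.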

Result

The result is shown in Table IX. In addition to the advantage of Alpaca-human on win rates against Alpaca and Alpaca-cleaned, Alpaca-CoachLM further evolves after being trained on the fully revised dataset and outperforms all models in the baseline group, including the Vicuna-7b model [16], which is tuned with 70k high-quality user-shared conversations with ChatGPT. Additionally, despite being smaller in scale and trained with fewer signals, Alpaca-CoachLM achieves impressive results in the group of stronger LLMs, with the highest win rates in five out of the 12 comparisons, and outperforms the 13B Vicuna model in all test sets.

III-C2 Human Evaluation on Alpaca-CoachLM

In addition to automatic evaluation, human reviewers independently rated the responses generated by Alpaca-CoachLM and the original Alpaca model in the CoachLM150 test set. The reviewers were unaware of the sources of the responses. As shown in Table X, every reviewer gave Alpaca-CoachLM a higher score than the original Alpaca model, with the average across reviewers rising from 58.6 to 64.3. This improved performance of Alpaca-CoachLM further confirms the effectiveness of the revisions made by CoachLM, which successfully enhance the instruction-following ability of subsequently tuned LLMs by optimizing the quality of the underlying instruction dataset.

TABLE X: Human Evaluation on Alpaca-CoachLM and Alpaca

| Model | R1 | R2 | R3 | Avg. |
| --- | --- | --- | --- | --- |
| Alpaca | 56.6 | 58.2 | 60.9 | 58.6 |
| Alpaca-CoachLM | 61.4 | 66.9 | 64.7 | 64.3 |

III-D Impact of Human Input Ratio $\alpha$

Figure 5: Win rates of (a) Alpaca-CoachLM and (b) Alpaca-human against reference responses in the CoachLM150 test set with varying human input ratio $\alpha$, rated by GPT-4 and PandaLM. $\alpha$ represents the ratio of human input used for training, with samples sorted by amount of human revision from largest to smallest. $\alpha = 0$ means no human input in training and $\alpha = 1$ means the full human input is used. The displayed win rate is the average of WR1, WR2, and QS.

As described in Section II-F2, $\alpha$ determines the fraction of human input with high-quality revisions used in training. A higher $\alpha$ implies that a larger proportion of the revision examples, ranked from the highest edit distance downward, is utilized. For Alpaca-CoachLM, when $\alpha$ is set to 1, all 2.3k expert revision examples are used for CoachLM training, while a value of 0 means no training and the backbone model (ChatGLM2) is used directly for revision. By varying $\alpha$, we obtain different trained CoachLM models and subsequently tuned Alpaca-CoachLM models. Fig. 5(a) shows the performance of Alpaca-CoachLM for different $\alpha$ values. The ratings by PandaLM and GPT-4 show a similar trend, with the highest win rate observed at $\alpha = 0.3$. The win rate of Alpaca-CoachLM increases as $\alpha$ goes from 0 to 0.3, indicating the importance of high-quality expert knowledge in achieving desirable revision ability for CoachLM. However, as $\alpha$ increases beyond 0.3, the inclusion of samples with fewer modifications introduces noise in aligning CoachLM with experts, potentially lowering the quality of the CoachLM-revised dataset and decreasing the win rates of the tuned Alpaca-CoachLM. Nevertheless, the reduction in win rate caused by this noise is at most around 10%, demonstrating the relative robustness of CoachLM.

Although the introduction of less-modified human input samples hindered the performance of Alpaca-CoachLM, the win rate of Alpaca-human steadily increases as more human-revised samples replace the original ones in the training dataset (Fig. 5(b)). This suggests that even minor human revisions improve the quality of revised instruction pairs compared to the original counterparts, thereby enhancing the dataset used to train Alpaca-human. Based on linear fitting ($R^2 = 0.9799$), the win rate of Alpaca-human increases at a rate of 3.07%/k and is estimated to surpass Alpaca-CoachLM with 7.3k human-revised samples. Notably, Alpaca-CoachLM only requires around 0.7k human-revised samples, highlighting the cost-saving advantage of CoachLM in expert labor, as it achieves the same model performance with only 9.45% human input.
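The 9.45% figure can be reproduced from the numbers above, under the assumption that the roughly 0.7k samples correspond to the fraction $\alpha = 0.3$ of the 2.3k expert revision examples:

$$0.3 \times 2.3\text{k} = 0.69\text{k}, \qquad \frac{0.69\text{k}}{7.3\text{k}} \approx 9.45\%.$$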

III-E Different Backbone Models of CoachLM

TABLE XI: Performance of CoachLM with Varying Backbone Models

| Model | Size | WR1 | WR2 | QS |
| --- | --- | --- | --- | --- |
| Alpaca | - | 48.0% | 45.7% | 74.7% |
| Alpaca-CoachLM (backboned by) | | | | |
| LLaMA [7] | 7B | 49.3% | 48.6% | 75.3% |
| ChatGLM [32] | 6B | 54.0% | 59.1% | 82.0% |
| ChatGLM2 [32] | 6B | 56.7% | 65.6% | 85.3% |

The value of $\alpha$ is fixed at 1. The test set is CoachLM150.

To further assess the robustness of CoachLM, we trained it with three different open-source backbone models: LLaMA, ChatGLM, and ChatGLM2. The win rates of the subsequently acquired Alpaca-CoachLM model on the CoachLM150 test set, evaluated by PandaLM, are displayed in Table XI. In this experiment, we kept the value of $\alpha$ fixed at 1. Our results show that Alpaca-CoachLM outperforms the original Alpaca under all backbone models, indicating the robustness of CoachLM across different backbones. Notably, we observed improved performance from LLaMA, the foundation LLM, to RL-tuned ChatGLM2, suggesting that more powerful backbones enhance the alignment ability with experts in coach instruction tuning.