Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions
Abstract
A significant amount of research is focused on developing and evaluating large language models for a variety of code synthesis tasks. These include synthesizing code from natural language, synthesizing tests from code, and synthesizing explanations of code. In contrast, the behavior of instructional code editing with LLMs is understudied. These are tasks in which the model is provided a block of code and an instruction to modify the code. The editing instruction may ask for a feature to be added or removed, describe a bug and ask for a fix, or ask for a different kind of solution. We introduce a carefully crafted benchmark of code editing tasks and use it to evaluate several cutting-edge LLMs. Our evaluation exposes a significant gap between the capabilities of state-of-the-art open and closed models. For example, even GPT-3.5-Turbo is better than the best open model at code editing tasks. We also introduce a new, carefully curated, permissively licensed training dataset of code editing tasks coupled with natural language instructions. Using this training dataset, we show that we can fine-tune open Code LLMs to significantly improve their code editing capabilities, closing the gap between open and closed models. All code, data, and models are available at https://github.com/nuprl/CanItEdit.
Federico Cassano | Luisa Li | Akul Sethi |
Northeastern University | Northeastern University | Northeastern University |
Noah Shinn | Abby Brennan-Jones | Jacob Ginesin |
Northeastern University | Wellesley College | Northeastern University |
Edward Berman | George Chakhnashvili | Anton Lozhkov |
Northeastern University | Northeastern University | HuggingFace |
Carolyn Jane Anderson | Arjun Guha | |
Wellesley College | Northeastern University | |
1 Introduction
Instruction: Edit the C4 class and its methods to represent the C8 group instead.

-class C4(nn.Module):
+class C8(nn.Module):
-    """Represents the cyclic group C4,
+    """Represents the cyclic group C8,
     where each element represents a discrete rotation."""

     def __init__(self):
         super().__init__()

     def size(self):
         """Outputs the size of this group."""
-        return 4
+        return 8

     def elements(self):
         """Returns all the elements of this group"""
-        return torch.tensor([0., np.pi/2, np.pi, 3*np.pi/2])
+        d = np.pi / 4
+        return torch.tensor([0., d, d*2, d*3, d*4, d*5, d*6, d*7])
Large language models of code (Code LLMs) are becoming an essential tool for software engineering practice and research. There has been significant research on synthesizing code from natural language instructions, but comparatively less attention has been given to code editing tasks. However, LLM users expect models to be capable of editing code. For example, the LMsys dataset of in-the-wild conversations with chatbots (Zheng et al., 2023) has 4,188 conversations containing code, and 831 (19%) of these involve code editing, where the user prompts the model to update code based on natural language instructions (Appendix E). In general, code editing with an LLM encompasses activities like feature addition or removal, bug fixing, and code refactoring (Zhang et al., 2023; Moon et al., 2023; Shinn et al., 2023; Chen et al., 2023; Olausson et al., 2023; Jin et al., 2023).
The ability to edit code is also essential for a model to be useful for an AI-focused code editor such as Cursor (Cursor, 2023), Copilot Chat (Copilot, 2023), or ChatGPT Advanced Data Analysis (ADA) (OpenAI, 2023a). Cursor and Copilot Chat facilitate edits with human-written instructions. In contrast, ADA uses both human-written instructions and model-generated reflections (Shinn et al., 2023; Fan et al., 2023; Phung et al., 2023) to extend and edit code. This approach represents a step towards AI-driven code assistance. Both scenarios employ instructional code editing, which we define as a function that takes the original code and an instruction as input and returns the modified code. Figure 1 illustrates this process, showing how the model edits a code segment from a given instruction.
Model-generated reflections and human-written instructions both describe desired code changes. However, they differ in the level of detail: reflections, usually more detailed, are generated by a model with access to the code, offering richer context and potentially a strategic plan for code modifications. In contrast, human-written instructions are typically shorter and less detailed but may express the user’s true intent more clearly. We refer to these as descriptive and lazy instructions, respectively. We thoroughly analyze examples of such instructions in Appendix E.
In this work, we introduce CanItEdit, a novel dataset comprising 105 hand-crafted instructional code editing problems, featuring both descriptive and lazy instructions and an extensive hidden test suite. Designed to assess a model’s proficiency in handling diverse code editing scenarios, CanItEdit serves as a benchmark for evaluating state-of-the-art Code LLMs in instructional code editing. Our evaluation focuses on measuring the accuracy of a given model’s ability to write correct code modifications without introducing unnecessary code. We conduct comprehensive assessments of closed and open models, revealing significant performance disparities between the leading closed and open models (§ 5). To help address this gap, we propose a training dataset and methodology for code editing. Our findings demonstrate that fine-tuning open Code LLMs on this dataset can significantly enhance code editing performance (§ 4).
To summarize, we make four main contributions: (1) We introduce CanItEdit, a dataset of instructional code editing problems, designed to evaluate the code editing capabilities of large language models (§ 3). (2) We propose a novel metric, ExcessCode, for quantifying the typical volume of unused code produced by a model when generating the correct code edits (§ 5.1). (3) We perform a thorough evaluation of the latest Code LLMs in the context of code editing, providing insights into their current capabilities (§ 5). (4) Finally, we present a specially tailored training dataset for code editing, along with an effective training methodology, demonstrating significantly enhanced code editing performance through fine-tuning models of varying sizes (§ 4).
2 Related Work
Instruction-following Language Models. Correctly prompting an LLM is crucial for it to perform a desired task. There are multiple methods for instruction tuning LLMs to better adhere to natural language instructions. One method involves employing human annotators to create sample instructions and provide feedback on numerous model outputs (Ouyang et al., 2022; Köpf et al., 2023). This method is costly and demands substantial resources. An alternative, cost-effective method is to use an LLM to self-instruct, generating instructions from a smaller set of human-written seed instructions (Wang et al., 2023). These methods have been applied to generate datasets for instruction-tuning Code LLMs (Chaudhary, 2023; Luo et al., 2023). Specific to code generation, another strategy to instruction-tune an LLM is to use commit messages as instructions (Muennighoff et al., 2023). In this paper, we use commit messages as instructions for code editing. Our results demonstrate that while instruction-tuned models can edit code, they are not as effective as models that we explicitly train for this task (§ 5).
Code Generation Benchmarks. Several benchmarks exist that test a model’s code generation ability. HumanEval and MBPP are two prominent benchmarks for evaluating LLMs in Python programming (Chen et al., 2021; Austin et al., 2021). MultiPL-E expands these benchmarks to 18+ additional programming languages (Cassano et al., 2023b). These benchmarks assess model-generated candidate completions against a series of human-authored unit tests. EvalPlus (Liu et al., 2023) utilizes mutation testing to expand the test suites of the Python benchmarks. All of these benchmarks utilize the pass@k metric, which measures the likelihood of the model generating a completion that passes all of the tests in k tries; we also adopt this metric in our evaluation (§ 5.1). However, these benchmarks are limited to the evaluation of a model’s ability to generate a single function from a natural language description and do not assess code editing capabilities. HumanEvalPack (Muennighoff et al., 2023) is a benchmark designed for evaluating LLMs across various single-function code generation tasks, such as synthesis, code explanation, and bug fixing. Specifically, HumanEvalFix, a bug-fixing variant of HumanEvalPack, is extensively used for assessing the models’ capabilities in code refinement (Moon et al., 2023; Muennighoff et al., 2023). However, the instruction is fixed for every problem. SWE-Bench (Jimenez et al., 2023) evaluates LLMs across varied programming tasks including planning, retrieval, and code editing. Our work concentrates specifically on code editing tasks, aiming to more precisely guide model development. Unlike SWE-Bench, which sources its problems from GitHub PRs and issues, our benchmark is handcrafted, reducing contamination risks as seen with models like StarCoder and StarCoder2, which are trained extensively on GitHub data (Li et al., 2023b; Lozhkov et al., 2024).
Code Editing Using Large Language Models. Previous studies on code editing with LLMs have predominantly focused on bug fixing (Zhang et al., 2023; Moon et al., 2023; Shinn et al., 2023; Chen et al., 2023; Olausson et al., 2023; Jin et al., 2023; Joshi et al., 2023; Wei et al., 2023), a specific subset of code editing; fill-in-the-middle code completion (Bavarian et al., 2022; Fried et al., 2023; Yee & Guha, 2023; Roziere et al., 2023; Guo et al., 2024), an inference strategy that requires specific insert locations; and intrinsic code editing (Li et al., 2023a; Gupta et al., 2023), which involves editing code without a specified instruction, relying on the model’s ability to intrinsically ascertain the desired code changes. Recently, LLMs have progressed in code editing guided by natural language without specific edit locations (Hu et al., 2023; Li et al., 2023b; Muennighoff et al., 2023). However, this advancement lacks benchmark evaluations to effectively measure the models’ code editing skills. Notably, StarCoder (Li et al., 2023b), the first LLM trained on an extensive dataset of commits using the format <before><commit message><after>, has shown enhanced code editing capabilities (§ 5). Before this study, StarCoder’s practical code editing performance had not been assessed. StarCoder2 has replaced commits with pull requests and issues, which typically include more natural language (Lozhkov et al., 2024). The recent introduction of InstructCoder (Hu et al., 2023), a model explicitly trained and evaluated for code editing, marks a significant step towards code editing with LLMs. However, its evaluation involved GPT-4-generated (OpenAI, 2023b) and human-provided labels, which raises issues regarding reproducibility and comparability in future research. Moreover, the model has not been publicly released, preventing us from evaluating it on our benchmark.
3 The CanItEdit Dataset
CanItEdit Dataset Statistics | |
---|---|
Total Tasks | 105 (35/35/35) |
Total Problems | 210 (70/70/70) |
Topics | |
Data Structures & Algorithms | 39 |
Language Processing | 21 |
Mathematics | 25 |
Data Science | 10 |
Miscellaneous | 10 |
Problems With Library Usage | 22 |
Code Segment | Mean ± Std. Dev. |
Mean Lines (Before / After) | 42.5 ± 33.9 / 49.8 ± 36.6 |
Levenshtein Distance | 302.1 ± 339.6 |
Combined Mean Lines | 92.3 ± 69.9 |
Combined Mean Tokens | 865.3 ± 639.7 |
Combined Max Tokens | 3,583 |
Instruction | Mean ± Std. Dev. |
Mean Tokens (Descriptive / Lazy) | 81.7 ± 50.4 / 35.6 ± 30.6 |
Benchmark Overview
CanItEdit is a dataset comprising 105 meticulously constructed Python code editing challenges. Each problem includes the input code segment (before), the expected code segment (after), the two types of natural language instructions (descriptive and lazy), and a hidden test suite. These challenges span a broad spectrum of computer science domains, such as data structures, algorithms, mathematics, language processing, and game programming, requiring knowledge of popular external Python libraries like NumPy, Pandas, PyTorch, and others. Table 1 presents general dataset statistics for CanItEdit.
Following Swanson (1976) and follow-up work (Levin & Yehudai, 2017), we classify code editing tasks into three distinct categories based on their primary goal: a corrective edit fixes errors, a perfective edit enhances existing features, and an adaptive edit meets new requirements. We have 35 problems per category for an even distribution across the different types of code changes. The dual instruction formats test the models’ ability to execute tasks based on the provided context: descriptive instructions provide comprehensive details for explicit guidance, while lazy instructions offer minimal direction, challenging the model to infer the required actions. Both instructions should lead to an equivalent after segment. Descriptive instructions serve to replicate situations where users provide specific specifications or another model outlines a plan. In contrast, lazy instructions resemble typical user queries for LLMs in code generation. As each problem has two distinct instructions, the dataset effectively contains 210 problems. We showcase examples from CanItEdit in Appendix C.
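The problem structure described above can be sketched as a small record type. This is only an illustration: the class name `EditProblem`, its field names, and the helper `expand` are hypothetical and are not taken from the released dataset schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditProblem:
    """One benchmark problem (hypothetical field names, for illustration)."""
    name: str
    category: str                 # "corrective", "perfective", or "adaptive"
    before: str                   # input code segment
    after: str                    # reference edited code segment
    descriptive_instruction: str  # detailed instruction
    lazy_instruction: str         # terse instruction
    tests: str                    # hidden test suite source


def expand(problem: EditProblem) -> list[tuple[str, str, str]]:
    """Turn one problem into its two evaluated variants.

    Each variant is (instruction kind, before code, instruction text);
    both variants are checked against the same hidden tests.
    """
    return [
        ("descriptive", problem.before, problem.descriptive_instruction),
        ("lazy", problem.before, problem.lazy_instruction),
    ]
```

Under this sketch, 105 problems expand into 210 evaluated variants, which matches the counts reported in Table 1.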
Dataset Creation
To manually construct CanItEdit, we assembled a team of eight experienced Python programmers, each with different domain expertise, and appointed one as the lead. Our objective was to fill each change category with 35 problems, with 105 problems total. Before starting, we provided the team with verbal and written guidance, a standard template, and an example problem. They were instructed to begin by writing a brief description of the problem and the applied changes for review and refinement by the lead. Next, they were tasked to write the ‘before’ code segment and hidden test suite, followed by the ‘after’ code segment, along with the instructions. The lead initially reviewed all problems in development and they were additionally reviewed by the entire team in weekly meetings. Upon completion of a problem, the lead generated sample completions to ensure that the failures and successes were reasonable and consistent with the problem’s intent.
The team also dedicated significant effort to developing comprehensive test suites for each problem, which incorporated a variety of testing techniques such as unit tests, property-based testing, mocking, fuzzing, and integration tests. These suites were designed to rigorously evaluate whether the ‘after’ segment met the problem requirements while ensuring the ‘before’ code did not. To confirm the completeness and correctness of the test suites, we created an automated verification pipeline that ensured line coverage and that the suite passed all tests with the ‘after’ code while failing at least one with the ‘before’ code. The team also manually reviewed the tests to ensure correctness and completeness.
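The pass/fail half of that verification check can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the function names are hypothetical, the code runs in-process with `exec` rather than in a sandbox, and the line-coverage check is omitted.

```python
def passes(code: str, tests: str) -> bool:
    """Return True if `tests` run without error against `code`.

    Simplifying assumption: tests are plain Python source that raises
    (for example via `assert`) on failure. Only use with trusted code.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)   # define the functions/classes under test
        exec(tests, namespace)  # run the test statements against them
    except Exception:
        return False
    return True


def is_well_formed(before: str, after: str, tests: str) -> bool:
    """A problem is accepted only if the suite passes on the 'after'
    code and fails at least one test on the 'before' code."""
    return passes(after, tests) and not passes(before, tests)
```

A suite that the ‘before’ code already satisfies is rejected by this check, since it cannot distinguish an unedited program from a correct edit.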
4 Fine-Tuning
We describe our approach to fine-tuning Code LLMs for code editing tasks, focusing on the DeepSeekCoder-Base family (Guo et al., 2024), a variant of CodeLlama (Roziere et al., 2023) trained on 2 trillion tokens of GitHub code and natural language, using StarCoder’s filtering rules (Li et al., 2023b). These models, top-performing in code generation and open-access under a permissive license, show robust performance on CanItEdit without specific training for instructional tasks (§ 5).
For our ablation studies, we focus on the model with 6.7 billion parameters, which offers an ideal balance between size and performance. This allows us to extrapolate results to larger models with more parameters without the need for extensive training. Following the most performant training strategy identified, we also fine-tune the 1.3b and 33b models to evaluate the impact of model size on code editing performance. Our fine-tuned models are referred to as EditCoder. These models have been full-parameter fine-tuned by calculating the loss on only the ‘after’ code segment. Appendix B provides further details and experiments on the training process.
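Computing the loss on only the ‘after’ segment is commonly implemented by masking the prompt positions in the label sequence. The sketch below assumes the widespread convention of marking ignored positions with -100 (the default `ignore_index` of PyTorch's cross-entropy loss); the function name and the token ids are hypothetical, and the authors' training code may differ.

```python
IGNORE_INDEX = -100  # assumed convention: positions with this label add no loss


def build_labels(prompt_ids: list[int], after_ids: list[int]) -> tuple[list[int], list[int]]:
    """Build (input_ids, labels) for one training example.

    `prompt_ids` holds the tokens of the 'before' code and the instruction;
    `after_ids` holds the tokens of the 'after' code. Only the 'after'
    tokens keep their ids as labels, so only they contribute to the loss.
    """
    input_ids = list(prompt_ids) + list(after_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(after_ids)
    return input_ids, labels


# Toy example with made-up token ids.
input_ids, labels = build_labels([11, 12, 13], [21, 22])
```

The model still attends to the full prompt; the mask only stops it from being trained to reproduce the ‘before’ code and the instruction.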
Dataset Statistics | ||
---|---|---|
EditPackFT | Commits2023FT | |
Total Commits | 22,602 | 24,129 |
Unique Initial Verbs | 184 | 199 |
Code Segments (Mean ± Std. Dev.) | ||
Lines of Code | ||
Levenshtein Distance | ||
Commit Messages (Mean ± Std. Dev.) | ||
Tokens |
We experiment with two training datasets we gathered: EditPackFT and Commits2023FT, which we describe below. Table 2 presents the statistics for these datasets.
EditPackFT
We created the EditPackFT dataset by further filtering the Python split of the CommitPackFT dataset (Muennighoff et al., 2023). CommitPack is an extensive dataset comprising 4TB of permissively licensed commits from a 2016 GitHub snapshot across various programming languages. CommitPackFT is a subset of CommitPack, filtered to be amenable for instruction-tuning Code LLMs. The primary criterion for CommitPackFT’s selection involved retaining commits whose messages begin with an imperative verb, mirroring the typical structure of natural language instructions. We apply a series of additional filtering steps, which make the dataset more suitable for code editing. We remove any item that satisfies any of the following predicates:
1. The presence of an empty ‘before’ or ‘after’ code segment, disregarding whitespace.
2. No change detected in the ‘before’ and ‘after’ code segments.
3. The inclusion of the words TODO, FIXME, or BUG in the ‘after’ code segment, which signals an incomplete commit.
4. Incorrect parsing of the ‘after’ code using the Python ast module.
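The four predicates above can be sketched as a single function. The function name is hypothetical, and two details are assumptions rather than facts from the paper: the marker words are matched as whole words, and "no change" is taken to mean exact string equality.

```python
import ast
import re

# Assumption: markers are matched as whole words, so "DEBUG" does not count as "BUG".
INCOMPLETE_MARKERS = re.compile(r"\b(TODO|FIXME|BUG)\b")


def should_remove(before: str, after: str) -> bool:
    """Return True if a (before, after) commit pair should be filtered out."""
    # 1. Empty 'before' or 'after' segment, disregarding whitespace.
    if not before.strip() or not after.strip():
        return True
    # 2. No change between the two segments (assumed: exact equality).
    if before == after:
        return True
    # 3. Markers in the 'after' code that signal an incomplete commit.
    if INCOMPLETE_MARKERS.search(after):
        return True
    # 4. The 'after' code does not parse with Python's ast module.
    try:
        ast.parse(after)
    except (SyntaxError, ValueError):
        return True
    return False
```

For example, `should_remove("x = 1", "x = 2")` returns `False`, while a pair whose ‘after’ segment is `"x ="` is removed because it does not parse.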
Originally, the dataset contained 56,025 commits, and after applying the filtering steps, we are left with 22,602. As shown by Figure 1(a) and Table 2, the mean number of lines in the ‘before’ and ‘after’ code segments is . The mean Levenshtein distance between the ‘before’ and ‘after’ code segments is characters, indicating that the changes are relatively small. We also analyze the distribution of the commit message lengths, and find that the mean token count is , which is quite low. We further analyzed the original CommitPackFT dataset to ensure that these findings weren’t artifacts of our filtering strategy: the full dataset has similar statistics.
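For reference, the character-level Levenshtein distance used to characterize edit size can be computed with the standard dynamic-programming recurrence. The sketch below is a plain-Python illustration; it is not claimed to be the implementation used to produce the reported statistics.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the row of the DP table as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                current[j - 1] + 1,                   # insertion
                previous[j] + 1,                      # deletion
                previous[j - 1] + (char_a != char_b)  # substitution or match
            ))
        previous = current
    return previous[-1]


# A small edit in the spirit of Figure 1: one character differs.
distance = levenshtein("return 4", "return 8")
```

A small distance relative to file length means that most of the ‘after’ segment is copied unchanged from the ‘before’ segment.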
Commits2023
To address the limitations of EditPackFT, we developed the Commits2023FT dataset. This dataset is filtered from 416,792 Python file changes gathered from commits in permissively licensed GitHub repositories. We name the unfiltered dataset Commits2023. Our objective is to create a dataset akin to CommitPackFT, but with more recent data and a more diverse example length distribution. We employed the same filters on this dataset as used for EditPackFT, and also applied the initial filters from CommitPackFT, which include only commits with messages that start with an imperative verb followed by at least a noun. Additionally, we only retain one file from multi-file commits to avoid exact duplicate commit messages in our dataset. After this filtering process, we obtain a dataset comprising 24,129 Python file changes. Figure 1(b) presents a broader distribution of the number of lines in the ‘before’ and ‘after’ code segments, with an average of . We also observe a much larger change distribution, with a mean Levenshtein distance of , signaling that this dataset not only has larger code segments but also contains more varied changes. Figure 3 illustrates a sunburst plot of the most frequent initial verbs in the commit messages of Commits2023FT, along with their corresponding root nouns. This set of verbs is slightly more varied than those in EditPackFT, featuring 199 unique verbs in comparison to 184 (Table 2). Furthermore, the token count distribution of the commit messages is twice as high and much more varied than that of EditPackFT, with a mean of and a standard deviation of .
Ablation Datasets
For ablation analysis, we generated two additional datasets: Commits2023Raw25k and Commits2023FT+EditPackFT. Commits2023Raw25k consists of a random selection of 25,000 commits from Commits2023. We use this dataset to assess the impact of the filtering process on the final dataset. Commits2023FT+EditPackFT represents the combined dataset of Commits2023FT and EditPackFT. We find that the combination of Commits2023FT and EditPackFT yields the best results by a significant margin (§ 5.2), and thus we train our final models on this dataset. We believe that these results are due to the increased amount of data and the expanded length distributions.
5 Evaluation
Model | Size | Descriptive pass@1 | Descriptive ExcessCode | Lazy pass@1 | Lazy ExcessCode |
---|---|---|---|---|---|
Closed Models | | | | | |
GPT-4 | — | 63.33 | 0.15 ± 0.09 | 51.95 | 0.14 ± 0.10 |
GPT-3.5-Turbo | — | 48.14 | 0.47 ± 0.34 | 42.71 | 0.00 ± 0.00 |
Open Models | | | | | |
CodeLlama-Instruct | 70b | 45.05 | 0.28 ± 0.15 | 37.52 | 0.02 ± 0.02 |
Mixtral-Instruct | 8x7b | 30.10 | 0.40 ± 0.16 | 24.90 | 0.01 ± 0.01 |
EditCoder | 33b | 55.90 | 0.33 ± 0.21 | 42.33 | 0.27 ± 0.24 |
DeepSeekCoder-Instruct | 33b | 49.78 | 0.36 ± 0.24 | 38.94 | 0.51 ± 0.34 |
DeepSeekCoder-Base | 33b | 47.71 | 0.53 ± 0.24 | 34.71 | 0.62 ± 0.41 |
CodeLlama-Instruct | 34b | 30.63 | 0.33 ± 0.21 | 24.15 | 0.18 ± 0.14 |
StarCoder2 | 15b | 41.95 | 0.36 ± 0.20 | 31.48 | 0.04 ± 0.04 |
StarCoder | 15b | 37.10 | 0.56 ± 0.28 | 27.62 | 0.42 ± 0.34 |
OctoCoder | 15b | 34.43 | 0.12 ± 0.07 | 25.95 | 0.07 ± 0.07 |
CodeLlama-Instruct | 13b | 26.90 | 0.90 ± 0.68 | 16.89 | 0.42 ± 0.41 |
EditCoder | 6.7b | 48.33 | 0.36 ± 0.17 | 39.29 | 0.32 ± 0.25 |
DeepSeekCoder-Instruct | 6.7b | 41.03 | 0.13 ± 0.06 | 31.65 | 0.22 ± 0.12 |
DeepSeekCoder-Base | 6.7b | 32.62 | 1.01 ± 0.42 | 27.76 | 1.25 ± 0.98 |
CodeLlama-Instruct | 7b | 32.83 | 0.31 ± 0.15 | 23.49 | 0.36 ± 0.26 |
EditCoder | 1.3b | 26.67 | 0.14 ± 0.09 | 21.43 | 0.20 ± 0.12 |
DeepSeekCoder-Instruct | 1.3b | 26.22 | 0.32 ± 0.18 | 17.27 | 0.32 ± 0.13 |
DeepSeekCoder-Base | 1.3b | 17.90 | 0.69 ± 0.42 | 11.76 | 2.79 ± 2.29 |
In this section, we evaluate the performance of various open and closed models on the CanItEdit benchmark, as well as our fine-tuned models.
Evaluation Tools and Hyperparameters
We run the open-access models using HuggingFace Transformers (Wolf et al., 2020) and vLLM (Kwon et al., 2023). We use the following hyperparameters for all inference experiments: maximum new tokens, temperature , and top-p sampling cutoff of . Following Cassano et al. (2023b), we sample 20 completions for each problem. We run all tests in a Docker container to mitigate the risk of malicious code execution.
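With several samples per problem, pass@k is commonly estimated with the unbiased estimator introduced by Chen et al. (2021). The sketch below assumes that estimator; the function name is hypothetical and the example counts are made up for illustration.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled completions
    c: number of those completions that pass all tests
    k: number of attempts the metric allows

    Computes 1 - C(n - c, k) / C(n, k), the probability that a random
    size-k subset of the n samples contains at least one passing one.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Illustrative numbers only: 5 of 20 samples pass, so pass@1 is 5/20.
example = pass_at_k(n=20, c=5, k=1)
```

For k = 1 the estimator reduces to the fraction of passing samples; a benchmark-level score is then the mean of the per-problem estimates.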
Models Evaluated
We evaluate several state-of-the-art models of varying sizes, fine-tuning some of them to build EditCoder. We group the models into two categories: open models, whose weights are available to us, and closed models, whose weights are not. We prompt each model with its recommended prompt template. The specific templates used appear in § A.6. The full list of models and their sizes appears in Table 3.
5.1 Evaluation Metrics
We employed two metrics to assess model performance: pass@k assesses functional correctness, and ExcessCode assesses conciseness and precision of code edits.
• pass@k is the likelihood that at least one successful edit was made from k attempts, as assessed by the test suite. In this section, we show results only for pass@1; evaluation results for pass@10 and pass@100 with higher temperatures can be found in § A.2.
• ExcessCode evaluates the presence of unnecessary code changes, as indicated by the fraction of changed lines not covered by the test suite. We calculate this metric by averaging the mean line coverage for passing completions across all problems, omitting those with no successful completions. The Python code used to calculate this metric is found in § A.1. We additionally report the standard error of the mean for this metric.
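The aggregation described for ExcessCode can be sketched as below. This is an illustration under stated assumptions, not the code from § A.1: the function names are hypothetical, the per-completion value follows the "fraction of changed lines not covered" wording, and the standard error is computed from the sample standard deviation of the per-problem means.

```python
import math
import statistics


def excess_fraction(changed_lines: set[int], covered_lines: set[int]) -> float:
    """Fraction of changed lines that the test suite never executes."""
    if not changed_lines:
        return 0.0
    return len(changed_lines - covered_lines) / len(changed_lines)


def aggregate_excess(per_problem: dict[str, list[float]]) -> tuple[float, float]:
    """Return (mean, standard error) over problems.

    `per_problem` maps a problem name to the excess values of its passing
    completions. Problems with no passing completion are omitted.
    """
    problem_means = [statistics.mean(v) for v in per_problem.values() if v]
    if not problem_means:
        raise ValueError("no problem has a passing completion")
    mean = statistics.mean(problem_means)
    if len(problem_means) < 2:
        return mean, 0.0
    sem = statistics.stdev(problem_means) / math.sqrt(len(problem_means))
    return mean, sem


# Made-up values: problem "b" has no passing completion and is skipped.
mean, sem = aggregate_excess({"a": [0.0, 0.5], "b": [], "c": [1.0]})
```

Averaging within a problem first keeps a problem with many passing completions from dominating the score.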
5.2 Results with Existing Models
We draw several conclusions from the full results in Table 3.
Closed source models outperform open source models. Our evaluation indicates a significant performance disparity between open and closed models. GPT-4, despite not being specifically trained on code-related tasks, surpasses DeepSeekCoder-Instruct 33b, the leading open source model, by an average of about 13 percentage points in pass@1 across descriptive and lazy tasks. DeepSeekCoder-Instruct stands out as the only open model exceeding any closed model’s performance, surpassing GPT-3.5-Turbo for descriptive prompts.
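The size of this gap can be checked directly from the pass@1 values in Table 3. The short computation below is only a worked example of that arithmetic; the variable names are hypothetical.

```python
# pass@1 values copied from Table 3: (descriptive, lazy)
gpt4 = (63.33, 51.95)
deepseek_instruct_33b = (49.78, 38.94)

# Absolute gap in percentage points for each instruction type.
gaps = [closed - open_ for closed, open_ in zip(gpt4, deepseek_instruct_33b)]

# Mean gap across the two instruction types.
mean_gap = sum(gaps) / len(gaps)

print([round(g, 2) for g in gaps], round(mean_gap, 2))
```

The two gaps are 13.55 and 13.01 points, for a mean of 13.28, so the difference is an absolute one in pass@1 rather than a relative improvement.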
DeepSeekCoder-Instruct utilizes an undisclosed instruction-tuning dataset, therefore direct comparisons with other open models may not be entirely fair. In contrast, Mixtral-Instruct (Jiang et al., 2024), comparable to OpenAI models in its general instruction-following training focus, significantly lags in performance against both closed models and open models specialized in code generation tasks. Lastly, CodeLlama-Instruct-70b, ranked second in instruction-following capabilities, was developed using a dataset of examples generated by Llama 2 70b with a larger focus on code generation tasks. This may explain its superior performance compared to Mixtral-Instruct.
Descriptive prompts yield better performance than lazy prompts. Descriptive prompts result in an 8.68-point absolute increase in pass@1 on average compared to lazy prompts, which may be the result of more detailed information and additional pointers in descriptive problems. Lazy instructions generally lead to lower ExcessCode for larger models. This may indicate that lazy instructions introduce less noise into the task, as there are fewer tokens to attend to in the prompt.
Larger models perform better than smaller models. Model size correlates positively with pass@1, and negatively with ExcessCode, indicating that larger models are more adept at precise edits to code. This pattern is most clearly seen in the evaluation results of DeepSeekCoder-Base, StarCoderBase, and StarCoder2, where a steady increase in performance is seen along with an increase in model size (§ A.5).
Models pre-trained on commits are better at code editing. Among open models, StarCoder is pre-trained on GitHub commits, while StarCoder2 focuses on GitHub issues. StarCoder outperforms similar-sized DeepSeek and CodeLlama models in our benchmark, despite their superior code generation capabilities. OctoCoder, a StarCoder-based model fine-tuned on instructions (Muennighoff et al., 2023), shows lower pass@1 performance, suggesting instruction fine-tuning on commit-based models may reduce code editing efficacy. Conversely, StarCoder2, exchanging commits with issues in its training, improves in editing tasks with larger models (§ A.5). This may be attributed to extended training and modern architecture rather than data source shift.
5.3 Results after Fine-Tuning on Commits
Training Dataset | #Tokens | #Items | pass@1 | ExcessCode |
---|---|---|---|---|
Commits2023FT+EditPackFT | 74M | 46,274 | 43.81 | 0.34 ± 0.15 |
Commits2023FT | 62M | 24,129 | 41.88 | 0.33 ± 0.17 |
Commits2023Raw25k | 62M | 25,000 | 38.86 | 0.3 ± 0.12 |
EditPackFT | 12M | 22,602 | 41.6 | 0.26 ± 0.14 |
In addition to evaluating existing open models, we also fine-tuned pre-trained DeepSeek models (§ 4) to build EditCoder, which we now evaluate.
Optimal Dataset: Commits2023FT+EditPackFT. In finding the best training dataset for DeepSeekCoder-6.7b-Base, the base model for EditCoder, various ablation datasets were tested. Results in Table 4 show Commits2023FT+EditPackFT is the top performer for both descriptive and lazy instructions. The dataset’s larger size and diverse data types, including varied commits, edits, and instructions, likely contribute to its superior performance.
Fine-tuning on open commits can significantly improve code editing performance. EditCoder-33b surpasses all open models in pass@1 for both descriptive and lazy instruction types, showing an overall increase in pass@1 and a notable decrease in ExcessCode compared to its base model, DeepSeekCoder-Base-33b. Additionally, we see a substantial increase in pass@1 for every size of EditCoder over its corresponding base model, with the largest improvement at 6.7b.
Both EditCoder-33b and EditCoder-6.7b outperform GPT-3.5-Turbo in pass@1 for descriptive instructions, with EditCoder-33b also matching GPT-3.5-Turbo for lazy ones. In higher temperature scenarios (§ A.2), EditCoder-33b beats GPT-3.5-Turbo in both instruction types for pass@10 and pass@100, and even surpasses GPT-4 in pass@100 for descriptive instructions. Analysis in § A.3 shows EditCoder excels in corrective changes but is less effective in perfective changes. Further analysis in § A.4 shows that EditCoder-33b outperforms all models, including GPT-4, in simple single-function bug fixes, and retains synthesis capabilities. This demonstrates the effectiveness of targeted fine-tuning on code editing datasets, addressing the distinct needs of instructional code editing compared to general code generation.
6 Conclusion
We present CanItEdit, a benchmark designed to assess the instructional code editing skills of Code LLMs. It includes 105 hand-written code editing problems, each accompanied by dual natural language instructions: a “lazy” instruction that a human may write, and a “descriptive” instruction that may be generated by an agent revising code in a loop. Each problem has a comprehensive test suite. We evaluate contemporary state-of-the-art Code LLMs and reveal a significant gap between closed and open models. We also demonstrate that fine-tuning with a custom dataset and training methodology can significantly improve code editing capabilities across various model sizes. Our work provides a foundation for evaluating future enhancements in instructional code editing for Code LLMs, offering valuable tools and insights for AI-based software development research and practice.
Limitations
We evaluated LLMs in reproducing the entire ‘after’ code segment, which may not be the most token-efficient method. A potentially more efficient strategy would involve generating a list of specific changes to be applied to the ‘before’ code segment. Furthermore, our study does not explore varying prompt formats. Instead, we have adopted a format consistent with that used by other models (Li et al., 2023b). Another limitation is the size of our final training dataset, which is relatively modest. We have not investigated the potential benefits of utilizing larger datasets, which could notably enhance performance, particularly with larger models. Our work only targets Python. Similar results may be possible for other high-resource programming languages, but low-resource languages may require additional effort (Cassano et al., 2023a). We identify these areas as opportunities for future work.
References
- Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- (2) Ned Batchelder and Contributors to Coverage.py. Coverage.py: The code coverage tool for Python. URL https://github.com/nedbat/coveragepy.
- Bavarian et al. (2022) Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, and Mark Chen. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255, 2022.
- Broder (2000) Andrei Z. Broder. Identifying and filtering near-duplicate documents. In Combinatorial Pattern Matching, 2000.
- Cassano et al. (2023a) Federico Cassano, John Gouwar, Francesca Lucchetti, Claire Schlesinger, Carolyn Jane Anderson, Michael Greenberg, Abhinav Jangda, and Arjun Guha. Knowledge transfer from high-resource to low-resource programming languages for code llms. arXiv preprint arXiv:2308.09895, 2023a.
- Cassano et al. (2023b) Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q. Feldman, Arjun Guha, Michael Greenberg, and Abhinav Jangda. MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation. IEEE Transactions on Software Engineering (TSE), 2023b.
- Chaudhary (2023) Sahil Chaudhary. Code alpaca: An instruction-following llama model for code generation. https://github.com/sahil280114/codealpaca, 2023.
- Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Chen et al. (2023) Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128, 2023.
- Copilot (2023) GitHub Copilot. Github copilot your ai pair programmer, 2023. URL https://github.com/features/copilot.
- Cursor (2023) Cursor. Cursor: The ai-first code editor, 2023. URL https://cursor.sh/features. Accessed: 2023-12-03.
- Dao (2023) Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
- Fan et al. (2023) Z. Fan, X. Gao, M. Mirchev, A. Roychoudhury, and S. Tan. Automated repair of programs from large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2023. URL https://doi.ieeecomputersociety.org/10.1109/ICSE48619.2023.00128.
- Fried et al. (2023) Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Scott Yih, Luke Zettlemoyer, and Mike Lewis. Incoder: A generative model for code infilling and synthesis. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=hQwb-lbM6EL.
- Guo et al. (2024) Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. Deepseek-coder: When the large language model meets programming – the rise of code intelligence, 2024.
- Gupta et al. (2023) Priyanshu Gupta, Avishree Khare, Yasharth Bajpai, Saikat Chakraborty, Sumit Gulwani, Aditya Kanade, Arjun Radhakrishna, Gustavo Soares, and Ashish Tiwari. Grace: Language models meet code edits. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023. URL https://doi.org/10.1145/3611643.3616253.
- Holtzman et al. (2021) Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. Surface form competition: Why the highest probability answer isn’t always right. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7038–7051, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.564. URL https://aclanthology.org/2021.emnlp-main.564.
- Hu et al. (2023) Qisheng Hu, Kaixin Li, Xu Zhao, Yuxi Xie, Tiedong Liu, Hui Chen, Qizhe Xie, and Junxian He. Instructcoder: Empowering language models for code editing. arXiv preprint arXiv:2310.20329, 2023.
- Jiang et al. (2024) Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mixtral of experts, 2024.
- Jimenez et al. (2023) Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023.
- Jin et al. (2023) Matthew Jin, Syed Shahriar, Michele Tufano, Xin Shi, Shuai Lu, Neel Sundaresan, and Alexey Svyatkovskiy. Inferfix: End-to-end program repair with llms. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023. URL https://doi.org/10.1145/3611643.3613892.
- Joshi et al. (2023) Harshit Joshi, José Cambronero Sanchez, Sumit Gulwani, Vu Le, Ivan Radiček, and Gust Verbruggen. Repair is nearly generation: Multilingual program repair with llms. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, 2023. URL https://doi.org/10.1609/aaai.v37i4.25642.
- Köpf et al. (2023) Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, et al. Openassistant conversations–democratizing large language model alignment. arXiv preprint arXiv:2304.07327, 2023.
- Kwon et al. (2023) Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In ACM SIGOPS Symposium on Operating Systems Principles (SOSP), 2023.
- Lee et al. (2022) Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022. URL https://aclanthology.org/2022.acl-long.577.
- Leskovec et al. (2014) Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Finding Similar Items, pp. 68–122. Cambridge University Press, 2 edition, 2014. doi: 10.1017/CBO9781139924801.004.
- Levin & Yehudai (2017) Stanislav Levin and Amiram Yehudai. Boosting automatic commit classification into maintenance activities by utilizing source code changes. In Proceedings of the 13th International Conference on Predictive Models and Data Analytics in Software Engineering, PROMISE, pp. 97–106, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450353052. doi: 10.1145/3127005.3127016. URL https://doi.org/10.1145/3127005.3127016.
- Li et al. (2023a) Jia Li, Ge Li, Zhuo Li, Zhi Jin, Xing Hu, Kechi Zhang, and Zhiyi Fu. Codeeditor: Learning to edit source code with pre-trained models. ACM Transactions on Software Engineering and Methodology, 2023a.
- Li et al. (2023b) Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. StarCoder: May the source be with you! arXiv preprint arXiv:2305.06161, 2023b.
- Liu et al. (2023) Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. arXiv preprint arXiv:2305.01210, 2023.
- Lozhkov et al. (2024) Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. Starcoder 2 and the stack v2: The next generation, 2024.
- Luo et al. (2023) Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568, 2023.
- Moon et al. (2023) Seungjun Moon, Yongho Song, Hyungjoo Chae, Dongjin Kang, Taeyoon Kwon, Kai Tzu-iunn Ong, Seung-won Hwang, and Jinyoung Yeo. Coffee: Boost your code llms by fixing bugs with feedback. arXiv preprint arXiv:2311.07215, 2023.
- Muennighoff et al. (2023) Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro Von Werra, and Shayne Longpre. Octopack: Instruction tuning code large language models. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023. URL https://openreview.net/forum?id=CjrPqvvUXL.
- Olausson et al. (2023) Theo X Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. Demystifying gpt self-repair for code generation. arXiv preprint arXiv:2306.09896, 2023.
- OpenAI (2023a) OpenAI. Introducing chatgpt enterprise, 2023a. URL https://openai.com/blog/introducing-chatgpt-enterprise. Accessed: 2023-12-03.
- OpenAI (2023b) OpenAI. Gpt-4 technical report, 2023b.
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf.
- Phung et al. (2023) Tung Phung, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, and Gustavo Soares. Generating high-precision feedback for programming syntax errors using large language models. arXiv preprint arXiv:2302.04662, 2023.
- Rajbhandari et al. (2020) Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2020.
- Roziere et al. (2023) Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
- Shinn et al. (2023) Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=vAElhFcKW6.
- Swanson (1976) E. Burton Swanson. The dimensions of maintenance. In Proceedings of the 2nd International Conference on Software Engineering, ICSE ’76, pp. 492–497, Washington, DC, USA, 1976. IEEE Computer Society Press.
- Wang et al. (2023) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023. URL https://aclanthology.org/2023.acl-long.754.
- Wei et al. (2023) Yuxiang Wei, Chunqiu Steven Xia, and Lingming Zhang. Copiloting the copilots: Fusing large language models with completion engines for automated program repair. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023. URL https://doi.org/10.1145/3611643.3616271.
- Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020. URL https://aclanthology.org/2020.emnlp-demos.6.
- Yee & Guha (2023) Ming-Ho Yee and Arjun Guha. Do Machine Learning Models Produce TypeScript Types that Type Check? In Proceedings of the 37th European Conference on Object-Oriented Programming, 2023. URL https://doi.org/10.48550/arXiv.2302.12163.
- Zhang et al. (2023) Kechi Zhang, Zhuo Li, Jia Li, Ge Li, and Zhi Jin. Self-edit: Fault-aware code editor for code generation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023. URL https://aclanthology.org/2023.acl-long.45.
- Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric Xing, et al. Lmsys-chat-1m: A large-scale real-world llm conversation dataset. arXiv preprint arXiv:2309.11998, 2023.
Appendix A Additional Evaluation Details
In this section, we provide the following additional details about our evaluation:
1. A Python implementation for computing the ExcessCode metric in A.1.
2. Results of our evaluation at higher sampling parameters in A.2.
3. A deeper analysis of our results per change type in A.3.
4. An evaluation of EditCoder on HumanEvalPack in A.4.
5. A deeper comparison of the first and second versions of StarCoder in A.5.
6. Each prompt format used in our evaluation in A.6.
OpenAI Model Versions
For our evaluation we use the following versions of the OpenAI models, which at the time of writing were the latest stable versions:
- GPT-4: gpt-4-0613
- GPT-3.5-Turbo: gpt-3.5-turbo-0125
A.1 Computing ExcessCode
We provide a simple Python implementation for computing the ExcessCode metric in the listing accompanying this section. The function takes as input the original code segment, the modified code segment, and the number of lines with missing code coverage. The lines with missing code coverage are computed using a code coverage tool, in our case Coverage.py (Batchelder & Contributors to Coverage.py). For Coverage.py, the number of lines with missing code coverage can be obtained by running the command coverage report -m.
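The original listing is not reproduced in this text. As a stand-in, the following is a minimal sketch that matches the described inputs. It assumes that ExcessCode is the share of changed lines left uncovered by the tests, expressed as a percentage; both the normalization and the helper name `count_changed_lines` are assumptions, not the authors' code.

```python
import difflib


def count_changed_lines(before: str, after: str) -> int:
    """Count non-blank lines that are new or rewritten in `after` (hypothetical helper)."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff prefixes lines that only exist in `after` with "+ ".
    return sum(1 for line in diff if line.startswith("+ ") and line[2:].strip())


def excess_code(before: str, after: str, num_missing: int) -> float:
    """Assumed definition: percentage of changed lines that the tests never execute."""
    changed = count_changed_lines(before, after)
    if changed == 0:
        return 0.0
    return 100.0 * num_missing / changed


BEFORE = "def add(a, b):\n    return a + b\n"
AFTER = (
    "def add(a, b):\n    return a + b\n\n"
    "def sub(a, b):\n    return a - b\n\n"
    "def unused():\n    return 0\n"
)

# Suppose the coverage tool reports one uncovered line (the body of `unused`).
print(excess_code(BEFORE, AFTER, num_missing=1))
```

In this toy case four non-blank lines were added and one of them is uncovered, so the sketch reports 25.0.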
A.2 Results At Higher Sampling Parameters
Table 5: Results of our evaluation at higher sampling parameters.
Model | Size | pass@1 | pass@10 | pass@100 | ExcessCode |
Descriptive | | | | | |
GPT-4 | — | 63.67 | 73.67 | 80.00 | 0.18 ± 0.09 |
GPT-3.5-Turbo | — | 47.88 | 61.67 | 71.43 | 0.33 ± 0.21 |
EditCoder | 33b | 54.36 | 73.15 | 81.90 | 0.81 ± 0.38 |
CodeLlama-Instruct | 34b | 28.50 | 52.00 | 64.76 | 0.38 ± 0.16 |
StarCoder2 | 15b | 40.06 | 62.12 | 71.43 | 0.47 ± 0.19 |
StarCoder | 15b | 33.31 | 59.80 | 70.48 | 0.76 ± 0.25 |
CodeLlama-Instruct | 13b | 25.01 | 50.04 | 62.86 | 0.47 ± 0.29 |
EditCoder | 6.7b | 46.31 | 60.24 | 70.48 | 0.42 ± 0.23 |
CodeLlama-Instruct | 7b | 29.93 | 52.86 | 65.71 | 0.76 ± 0.39 |
EditCoder | 1.3b | 26.29 | 40.22 | 47.62 | 0.50 ± 0.23 |
Lazy | | | | | |
GPT-4 | — | 52.55 | 64.56 | 71.43 | 0.12 ± 0.09 |
GPT-3.5-Turbo | — | 40.93 | 53.33 | 60.95 | 0.29 ± 0.27 |
EditCoder | 33b | 40.34 | 58.63 | 68.57 | 0.46 ± 0.20 |
CodeLlama-Instruct | 34b | 21.10 | 42.69 | 56.19 | 0.19 ± 0.08 |
StarCoder2 | 15b | 30.23 | 48.82 | 59.05 | 0.17 ± 0.07 |
StarCoder | 15b | 24.75 | 50.23 | 65.71 | 0.68 ± 0.25 |
CodeLlama-Instruct | 13b | 16.67 | 38.52 | 58.10 | 0.67 ± 0.44 |
EditCoder | 6.7b | 37.74 | 51.16 | 57.14 | 0.33 ± 0.15 |
CodeLlama-Instruct | 7b | 20.85 | 41.10 | 54.29 | 0.11 ± 0.06 |
EditCoder | 1.3b | 20.20 | 32.70 | 39.05 | 1.47 ± 1.20 |
In this section, we evaluate the performance of various models under higher sampling parameters compared to those used in our main evaluation (§ 5).
Standard Sampling Parameters
In § 5, we assessed several models on CanItEdit using standard parameters: a temperature of 0.2, top-p of 0.95, and 20 samples. These parameters are often used in code generation tasks (Chen et al., 2021; Cassano et al., 2023b; a; Muennighoff et al., 2023; Li et al., 2023b; Lozhkov et al., 2024). They favor higher-probability tokens while still allowing lower-probability tokens to be sampled at conservative levels, which addresses surface form competition (Holtzman et al., 2021) and is typically more effective than greedy decoding (Chen et al., 2021).
Higher Sampling Parameters
For this evaluation, we increased the sampling parameters to assess the models’ robustness under more diverse generation conditions. Following Chen et al. (2021), we adopted a temperature of 0.8, top-p of 0.9, and 100 samples. These aggressive parameters allow the model to explore a wider range of possibilities, which is useful when multiple completion attempts are possible. Due to the higher computational costs, we limited our evaluation to a subset of the models in the main evaluation.
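To make the effect of the two parameter sets concrete, the following is a minimal, library-free sketch of temperature scaling followed by top-p (nucleus) filtering. The toy logits and the function names are illustrative assumptions; real inference engines apply the same idea to the model's full vocabulary.

```python
import math
import random


def nucleus_distribution(logits, temperature, top_p):
    """Return [(token_index, probability)] for the tokens kept by top-p filtering."""
    # Temperature scaling, then a numerically stable softmax.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of most likely tokens whose cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append((i, probs[i]))
        mass += probs[i]
        if mass >= top_p:
            break
    return kept


def sample_token(logits, temperature, top_p, rng):
    """Draw one token index from the renormalized nucleus."""
    kept = nucleus_distribution(logits, temperature, top_p)
    indices = [i for i, _ in kept]
    weights = [p for _, p in kept]
    return rng.choices(indices, weights=weights, k=1)[0]


TOY_LOGITS = [2.0, 1.5, 0.0]  # hypothetical next-token scores
conservative = nucleus_distribution(TOY_LOGITS, temperature=0.2, top_p=0.95)
aggressive = nucleus_distribution(TOY_LOGITS, temperature=0.8, top_p=0.9)
print(conservative[0][1], aggressive[0][1])
```

On these toy logits the most likely token holds far more probability mass at temperature 0.2 than at 0.8, which is why the higher setting produces more varied completions.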
Metrics
For this evaluation, we expand our metrics to include pass@10 and pass@100, alongside the standard pass@1 and ExcessCode. pass@10 and pass@100 offer deeper insight into model performance by estimating the probability that at least one of 10 or 100 sampled completions, respectively, passes the test suite. These metrics are crucial for understanding how models perform in scenarios that permit multiple attempts, such as when users are given a range of completions to select from or when an external verifier is used to determine the best completion.
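For reference, pass@k is commonly computed with the unbiased estimator of Chen et al. (2021): given n samples for a problem, of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator is the one used here; the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem with n samples, c of them correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 100 samples of which 5 pass, more attempts raise the estimate.
for k in (1, 10, 100):
    print(k, pass_at_k(n=100, c=5, k=k))
```

The benchmark-level score is the mean of this per-problem estimate over all problems.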
A.2.1 Results
The results of our evaluation at higher sampling parameters are shown in Table 5. We draw several conclusions from the results.
pass@1 decreases for open models. Closed models maintain consistent pass@1 performance under higher sampling parameters. In contrast, open source models generally exhibit a decline, showing a 2-3% reduction in pass@1 performance compared to the main evaluation (Table 3).
Multiple trials benefit open source models. Open source models significantly improve with multiple trials, showing larger gains in pass@10 and pass@100 compared to closed models, which also improve but to a lesser degree. Specifically, EditCoder-33b outperforms all models, including GPT-4, in pass@100 for descriptive instructions and matches closely in pass@10. However, EditCoder-33b lags behind GPT-4 in lazy instruction scenarios across all metrics and tends to generate more excess code for both prompt types. We expect that the performance of EditCoder on lazy instructions will improve with more data and larger pre-trained models.
Lazy instructions benefit more than descriptive instructions from multiple trials. The performance disparity between descriptive and lazy instructions persists, even under higher sampling parameters and multiple trials, as seen in pass@10 and pass@100. Despite this, the relative improvement from multiple trials is greater for lazy instructions: the model-averaged score rises by 57.76% from pass@1 to pass@10 and by a further 22.57% from pass@10 to pass@100, compared with 48.18% and 17.23%, respectively, for descriptive instructions. This indicates a more pronounced benefit from multiple attempts when instructions are less detailed.
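The relative gains can be recomputed from Table 5. The sketch below assumes they are relative increases of the model-averaged score between consecutive metrics, which is one reading that reproduces the reported percentages; the variable and function names are illustrative.

```python
# Scores copied from Table 5, in row order (GPT-4 first, EditCoder-1.3b last).
DESCRIPTIVE = {
    "pass@1": [63.67, 47.88, 54.36, 28.50, 40.06, 33.31, 25.01, 46.31, 29.93, 26.29],
    "pass@10": [73.67, 61.67, 73.15, 52.00, 62.12, 59.80, 50.04, 60.24, 52.86, 40.22],
    "pass@100": [80.00, 71.43, 81.90, 64.76, 71.43, 70.48, 62.86, 70.48, 65.71, 47.62],
}
LAZY = {
    "pass@1": [52.55, 40.93, 40.34, 21.10, 30.23, 24.75, 16.67, 37.74, 20.85, 20.20],
    "pass@10": [64.56, 53.33, 58.63, 42.69, 48.82, 50.23, 38.52, 51.16, 41.10, 32.70],
    "pass@100": [71.43, 60.95, 68.57, 56.19, 59.05, 65.71, 58.10, 57.14, 54.29, 39.05],
}


def mean(values):
    return sum(values) / len(values)


def relative_gain(scores, lower, higher):
    """Percentage increase of the model-averaged score from `lower` to `higher`."""
    return 100.0 * (mean(scores[higher]) / mean(scores[lower]) - 1.0)


for name, scores in (("descriptive", DESCRIPTIVE), ("lazy", LAZY)):
    print(
        name,
        round(relative_gain(scores, "pass@1", "pass@10"), 2),
        round(relative_gain(scores, "pass@10", "pass@100"), 2),
    )
```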
Significant increase in ExcessCode. The average ExcessCode metrics for both descriptive and lazy instructions, at 0.507 and 0.449 respectively, have increased from the main evaluation’s averages of 0.392 and 0.235. (These values are calculated by averaging over the results from the models in this table, and not the entire set of models in the main evaluation.) This is expected, as higher sampling parameters tend to yield a broader range of completions, consequently resulting in an increase in superfluous code.
A.3 Results Per Change Type
Table 6: Results of EditCoder and the OpenAI models on CanItEdit per change type, aggregated over lazy and descriptive prompts.
Model | Size | Corrective p@1 | Corrective ExcessCode | Adaptive p@1 | Adaptive ExcessCode | Perfective p@1 | Perfective ExcessCode |
Closed Models | | | | | | | |
GPT-4 | — | 62.21 | 0.05 ± 0.03 | 57.29 | 0.31 ± 0.19 | 53.43 | 0.08 ± 0.06 |
GPT-3.5-Turbo | — | 47.93 | 0.00 ± 0.00 | 42.29 | 0.17 ± 0.12 | 46.07 | 0.60 ± 0.54 |
Open Models | | | | | | | |
EditCoder | 33b | 56.86 | 0.02 ± 0.02 | 51.21 | 0.77 ± 0.42 | 39.29 | 0.05 ± 0.04 |
EditCoder | 6.7b | 48.64 | 0.00 ± 0.00 | 42.71 | 0.43 ± 0.21 | 40.07 | 0.66 ± 0.42 |
EditCoder | 1.3b | 26.36 | 0.11 ± 0.10 | 23.21 | 0.14 ± 0.10 | 22.57 | 0.26 ± 0.18 |
In this section, we analyze the results of EditCoder against the OpenAI models on CanItEdit per change type, hoping to gain insight into the strengths and weaknesses of these models for different types of code changes. Table 6 shows the results of our analysis. We aggregate the results for lazy and descriptive prompts across all problems, both for conciseness and to reduce noise: each pass@1 and ExcessCode value is calculated across 70 problems per change type instead of 35. We use the same sampling parameters as in our main evaluation. Our key findings include:
- GPT-4 outperforms other models in functional correctness across all types of changes.
- Corrective changes typically incur minimal excess code, with some models achieving perfect scores in this area. Adaptive changes, intuitively, tend to introduce the most excess code.
- Our EditCoder-33b model surpasses GPT-3.5-Turbo in both corrective and adaptive changes, while EditCoder-6.7b shows comparable performance to GPT-3.5-Turbo in these categories. However, for perfective changes, both EditCoder-33b and EditCoder-6.7b underperform compared to GPT-3.5-Turbo. This suggests a potential improvement in training data for perfective changes. As demonstrated in Figure 3, verbs associated with perfective changes, such as refactor or improve, appear less frequently than those related to corrective or adaptive changes, such as fix or add, respectively. Artificially balancing the dataset with more examples of perfective changes could potentially enhance EditCoder’s performance. We leave this as an area for future work.
- According to our dataset, the most challenging changes are perfective, followed by adaptive, with corrective being the simplest.
A.4 Evaluation on HumanEvalPack
Table 7: pass@1 on the Python subset of HumanEvalFix and HumanEvalSynthesize.
Model | Size | Fix pass@1 | Synthesize pass@1 |
GPT-4 | — | 47.0 | 86.6 |
EditCoder | 33b | 53.0 | 63.5 |
DeepSeekCoder-Instruct | 33b | 47.5 | 79.2 |
CodeLlama-Instruct | 34b | 36.5 | 43.8 |
StarCoder2 | 15b | 48.6 | 51.6 |
StarCoderBase | 15b | 25.6 | 33.6 |
OctoCoder | 15b | 30.4 | 35.1 |
CodeLlama-Instruct | 13b | 19.4 | 23.7 |
EditCoder | 6.7b | 46.6 | 52.9 |
DeepSeekCoder-Instruct | 6.7b | 44.9 | 76.2 |
EditCoder | 1.3b | 24.3 | 28.3 |
DeepSeekCoder-Instruct | 1.3b | 9.1 | 62.4 |
To draw similarities and differences between CanItEdit and HumanEvalPack, in this section we evaluate models on the Python subset of HumanEvalFix and HumanEvalSynthesize. Results are available in Table 7.
Benchmark Overview
HumanEvalPack is a benchmark comprised of 164 single-function problems aimed at evaluating both code generation and code editing. The problems are designed to not require domain-specific knowledge or familiarity with popular external libraries. HumanEvalFix contains only corrective code changes, while HumanEvalSynthesize purely focuses on code generation, tasking the model to generate a function from its signature and docstring.
Results
In this benchmark, EditCoder-33b outperforms even GPT-4 in fixing bugs, while maintaining competitive synthesis capabilities against models trained on general instructional data, with EditCoder-6.7b outperforming even larger models like CodeLlama-Instruct-34b. Additionally, we find a large disparity between the performance of models on HumanEvalFix and HumanEvalSynthesize for DeepSeekCoder-Instruct-1.3b, which performs significantly worse in fixing bugs than in synthesizing functions. These results demonstrate that HumanEvalPack and CanItEdit complement each other: the former focuses on single-function algorithmic and puzzle-like problems, while the latter emphasizes code editing tasks requiring broader knowledge of software engineering concepts in a wide range of domains.
Example Problems
To illustrate typical problems in HumanEvalPack, we selected examples from HumanEvalSynthesize and HumanEvalFix. The HumanEvalSynthesize example is a problem where the model must complete a function based on its signature and docstring. The HumanEvalFix example shows an incorrect function implementation along with its ground-truth unit test suite, where the model’s task is to correct the implementation. Unlike in CanItEdit, the model must infer the correct implementation solely from the faulty code and the test suite, without explicit instructions. One could argue that this falls under intrinsic code editing, rather than instructional code editing, since the model is not given any instructions about the intent of the function, making this benchmark more suitable for evaluating works such as Li et al. (2023a) and Gupta et al. (2023).
A.5 Comparison of StarCoder Models
Table 8: Results of StarCoder and StarCoder2 models on CanItEdit.
Model | Size | Descriptive pass@1 | Descriptive ExcessCode | Lazy pass@1 | Lazy ExcessCode |
StarCoder2 | 15b | 41.95 | 0.36 ± 0.20 | 31.48 | 0.04 ± 0.04 |
StarCoder | 15b | 37.10 | 0.56 ± 0.28 | 27.62 | 0.42 ± 0.34 |
StarCoderBase | 15b | 35.33 | 1.55 ± 0.89 | 27.05 | 0.85 ± 0.55 |
StarCoderBase | 7b | 32.90 | 0.43 ± 0.17 | 21.95 | 0.49 ± 0.37 |
StarCoder2 | 7b | 25.10 | 1.47 ± 0.78 | 13.76 | 1.81 ± 1.22 |
StarCoder2 | 3b | 15.95 | 0.91 ± 0.39 | 13.33 | 1.09 ± 0.98 |
StarCoderBase | 3b | 14.81 | 1.22 ± 0.52 | 9.90 | 1.17 ± 0.76 |
StarCoderBase | 1b | 4.90 | 0.99 ± 0.85 | 5.48 | 0.00 ± 0.00 |
StarCoder Models
The first version of the StarCoder models was pre-trained on several gigabytes of GitHub commits, as discussed in Li et al. (2023b). In contrast, the second version, StarCoder2, did not include commit data in its training process. However, it was trained on GitHub issues, which gives it instruction-following capabilities. Issue data encompasses a broader scope than commits, including discussions, bug reports, and feature requests. With prompt engineering, StarCoder2 models can be used for code editing tasks, as demonstrated in Lozhkov et al. (2024). Furthermore, in most benchmarks evaluated in Lozhkov et al. (2024), StarCoder2 surpasses its previous version, StarCoderBase, in code generation tasks.
The original StarCoder model is StarCoderBase-15b further trained on Python code from GitHub. StarCoderBase models are available in four sizes: 15b, 7b, 3b, and 1b. StarCoder2, on the other hand, has not undergone further training on additional Python code and is available in three sizes: 15b, 7b, and 3b.
Evaluation
We evaluate all sizes of StarCoder and StarCoder2 models on CanItEdit and present the results in Table 8. We find that StarCoder2 outperforms StarCoder in the 15b and 3b sizes, but not in the 7b size, where StarCoderBase-7b significantly outperforms StarCoder2-7b. Additionally, for the 7b and 3b sizes, StarCoder2 models tend to generate more excess code than StarCoderBase models. We attribute the performance improvements in StarCoder2 to the broader training data and architectural enhancements, as discussed in Lozhkov et al. (2024), rather than to the superiority of issue data over commit data for code editing tasks. However, we also believe that for utilizing StarCoder2 models directly for code editing tasks, the issue prompt format is a viable alternative to the commit format previously utilized by StarCoderBase models. We provide the prompt format we utilized for StarCoder2 models in Figure 5.
A.6 Prompt Templates Used in Evaluation
Figure 4: Prompts used for each model.
(a)
<user>
You are PythonEditGPT. You will be provided the original code snippet and an instruction that specifies the changes you need to make. You will produce the changed code, based on the original code and the instruction given. Only produce the code, do not include any additional prose.
## Code Before
```py
def add(a, b):
    return a + b
```
## Instruction
Add a "sub" function that subtracts two numbers. Also write docstrings for both functions and change a,b to x,y.
<assistant>
## Code After
```py
def add(x, y):
    """Adds two numbers."""
    return x + y

def sub(x, y):
    """Subtracts two numbers."""
    return x - y
```
<user>
You are PythonEditGPT. You will be provided the original code snippet and an instruction that specifies the changes you need to make. You will produce the changed code, based on the original code and the instruction given. Only produce the code, do not include any additional prose.
## Code Before
```py
{before}
```
## Instruction
{instruction}
(b)
<system>
You are PythonEditGPT. You will be provided the original code snippet and an instruction that specifies the changes you need to make. You will produce the changed code, based on the original code and the instruction given. Only produce the code, do not include any additional prose.
<user>
## Code Before
```py
def add(a, b):
    return a + b
```
## Instruction
Add a "sub" function that subtracts two numbers. Also write docstrings for both functions and change a,b to x,y.
<assistant>
## Code After
```py
def add(x, y):
    """Adds two numbers."""
    return x + y

def sub(x, y):
    """Subtracts two numbers."""
    return x - y
```
<user>
## Code Before
```py
{before}
```
## Instruction
{instruction}
(c)
<commit_before> {before} <commit_msg> {instruction} <commit_after>
(d)
## Code Before:
{before}
## Instruction:
{instruction}
## Code After:
Figure 5: Prompt format used for StarCoder2 models.
<issue_start>username_0: I have a program in Python that I’d like to change.
Here is the code for the program:
```py
def add(a, b):
    return a + b
```
The change I’d like to make is:
Add a "sub" function that subtracts two numbers. Also write docstrings for both functions and change a,b to x,y.
Please someone help me. Can you also provide the full code with the change?
<issue_comment>username_1: Sure, no problem. I will be able to help. I am an expert in editing Python code.
Here is the full code with the change:
```py
def add(x, y):
    \"\"\"Adds two numbers.\"\"\"
    return x + y

def sub(x, y):
    \"\"\"Subtracts two numbers.\"\"\"
    return x - y
```
Upvotes: 200<issue_comment>username_0: Thank you so much! I have another program in Python that I’d like to change.
Here is the code for the program:
```py
{before}
```
The change I’d like to make is:
{instruction}
Please someone help me. Can you also provide the full code with the change?
Upvotes: 100<issue_comment>username_1: Sure, no problem. I will be able to help. I am an expert in editing Python code.
Here is the full code with the change:
```py
{after}
```
We evaluate all of our models on CanItEdit using the same evaluation pipeline. However, we may use a different prompt for each model to generate the completions. Each prompt is chosen to match how the model was trained and is intended to maximize the model’s performance on the task, while keeping the prompts as similar as possible across models. Figure 4 shows the prompts used for each model. For a fair comparison, we evaluate all models not trained on commits or explicit code editing tasks using a basic 1-shot prompt, which shows the model how to add a sub function to a code segment with an add function and how to change the variable names from a and b to x and y.
Furthermore, given the natural language characteristics of GitHub issue data, significant prompt-engineering was required to facilitate code editing tasks for StarCoder2 models. The specific prompt format used for StarCoder2 models is provided separately in Figure 5.
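As a rough illustration of how the {before} and {instruction} placeholders in these templates are filled, the sketch below renders the plain "Code Before / Instruction / Code After" layout. The names `EDIT_TEMPLATE` and `build_edit_prompt`, as well as the exact newline placement, are assumptions made for illustration and are not taken from the released evaluation code.

```python
# Hypothetical template mirroring the "Code Before / Instruction / Code After"
# layout; the exact whitespace used by the authors is an assumption.
EDIT_TEMPLATE = (
    "## Code Before:\n{before}\n"
    "## Instruction:\n{instruction}\n"
    "## Code After:\n"
)


def build_edit_prompt(before: str, instruction: str) -> str:
    """Fill the template with the original code and the edit instruction."""
    # str.format only interprets braces in the template itself, so code that
    # contains "{" or "}" can be inserted safely as a value.
    return EDIT_TEMPLATE.format(
        before=before.rstrip("\n"),
        instruction=instruction.strip(),
    )


if __name__ == "__main__":
    print(build_edit_prompt("def add(a, b):\n    return a + b\n",
                            'Add a "sub" function that subtracts two numbers.'))
```

The prompt deliberately ends right after the "Code After" header, so that the model's completion is the edited program.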
Appendix B Training Details
In this section we provide details on the training process for EditCoder and ablation results on the loss masking technique used in training.
B.1 Training Tools and Configuration
For training all of our EditCoder models, we utilize a fine-tuning pipeline based on the HuggingFace Transformers library (Wolf et al., 2020). Additionally, we utilize DeepSpeed ZeRO 3 (Rajbhandari et al., 2020) to efficiently shard the model across multiple GPUs. Due to memory constraints, we offload the optimizer to the CPU for the 33b model. We also use FlashAttention 2 (Dao, 2023) to speed up training on large context window sizes. All of our models are trained on a single machine equipped with 8 NVIDIA H100 (80GB) HGX GPUs. The effective micro-batch size is set at ( gradient accumulation steps, with a single batch per GPU). We utilize the AdamW optimizer with a learning rate of , a linear decay scheduler, and warmup steps. These parameters were chosen based on previous work on fine-tuning for code generation tasks (Cassano et al., 2023a); it is likely that a hyperparameter search would yield better results. To facilitate reproducibility, we set the random seed to for all experiments.
Prior to training, we shuffled the dataset randomly and deduplicated it following the method outlined by Li et al. (2023b). This process combines MinHash (Broder, 2000) and Locality Sensitive Hashing (LSH) (Leskovec et al., 2014). (Footnote 2: Deduplication, achieved by concatenating the ‘before’ and ‘after’ code segments, helps mitigate overfitting to specific training examples (Lee et al., 2022).) We format the training data as a prompt, with the ‘before’ code segment followed by the ‘instruction’ and the ‘after’ code segment, and mask the loss calculation to only consider the ‘after’ code segment. All models underwent training for epochs, with a packed context window of tokens, including padding for the remaining tokens. We select the model from the epoch with the highest performance on a held-out validation set. The number of epochs chosen for each EditCoder is the following:
- EditCoder-1.3b: 8
- EditCoder-6.7b: 4
- EditCoder-33b: 2
As shown by the number of epochs, we found that larger models overfit to the data more quickly, suggesting that we could achieve better results with a larger dataset.
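The near-duplicate filtering mentioned in this section can be sketched with a small, dependency-free MinHash and LSH implementation. This is only an illustration of the general technique, not the pipeline of Li et al. (2023b): the shingle size, number of hash functions, band count, similarity threshold, and all function names are hypothetical choices.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of k-token shingles of a whitespace-tokenized text."""
    tokens = text.split()
    if len(tokens) <= k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items, num_perm=64):
    """For each seeded hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big")
            for item in items
        ))
    return tuple(signature)


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(examples, num_perm=64, bands=16, threshold=0.8):
    """Keep the first occurrence of each group of near-duplicate examples."""
    rows = num_perm // bands  # assumes num_perm is divisible by bands
    buckets = {}              # (band index, band values) -> indices into `kept`
    kept, kept_sigs = [], []
    for example in examples:
        # Assumption: similarity is measured on 'before' and 'after' concatenated.
        text = example["before"] + "\n" + example["after"]
        sig = minhash_signature(shingles(text), num_perm)
        keys = [(b, sig[b * rows:(b + 1) * rows]) for b in range(bands)]
        # LSH step: only examples sharing at least one band are compared.
        candidates = {i for key in keys for i in buckets.get(key, [])}
        if any(estimated_jaccard(sig, kept_sigs[i]) >= threshold for i in candidates):
            continue
        index = len(kept)
        kept.append(example)
        kept_sigs.append(sig)
        for key in keys:
            buckets.setdefault(key, []).append(index)
    return kept
```

Banding trades recall for speed: more bands with fewer rows each surface more candidate pairs, which the signature comparison then filters against the threshold.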
B.2 Effect of Loss Masking
Masking | pass@1 | ExcessCode |
---|---|---|
Yes | 41.6 | 0.26 ± 0.14 |
No | 40.5 | 0.43 ± 0.20 |
In our training pipeline, we mask the loss calculation to only consider the ‘after’ code segment. The intuition behind this is that we don’t need the model to learn how to reproduce the ‘before’ and ‘instruction’ segments, as these are always going to be provided as input to the model at inference time. The exact prompt format we use for training is shown in Figure 3(d).
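A minimal sketch of this kind of loss masking is shown below. It assumes the common convention, used by PyTorch's cross-entropy loss and by HuggingFace causal language models, that label positions set to -100 are ignored when the loss is computed. The helper names are hypothetical, and the per-token losses are assumed to be already aligned with the labels (a real causal language model shifts labels by one position).

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_labels(prompt_ids, after_ids):
    """Concatenate prompt and target tokens, masking the prompt in the labels.

    prompt_ids: token ids of the 'before' code and the instruction.
    after_ids:  token ids of the 'after' code, the only supervised part.
    """
    input_ids = list(prompt_ids) + list(after_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(after_ids)
    return input_ids, labels


def masked_mean_loss(per_token_losses, labels):
    """Average the per-token losses over unmasked positions only."""
    kept = [loss for loss, label in zip(per_token_losses, labels)
            if label != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0


if __name__ == "__main__":
    input_ids, labels = build_labels([11, 12, 13], [21, 22])
    print(input_ids, labels)
    print(masked_mean_loss([9.0, 9.0, 9.0, 1.0, 3.0], labels))
```

Without masking, the labels would simply equal `input_ids`, and the prompt tokens would contribute to the loss as well.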
We wish to verify that this loss masking is beneficial for our task. To assess our hypothesis, we train two DeepSeekCoder-6.7b-Base models on EditPackFT, one with loss masking and one without, calculating the loss on all tokens. (Footnote 3: We chose EditPackFT for this experiment as it is the smallest dataset we use for training, allowing us to quickly compare the two methods.) We then evaluate both models on CanItEdit, and report the pass@1 and ExcessCode metrics in Figure 7. The reported result with loss masking is the same as the one reported in Table 4. We find that the model trained with loss masking outperforms the model trained without it, and leads to a decrease in ExcessCode and its standard error. Furthermore, we plot the training loss curves for both models in Figure 7. We observe that the model trained with the loss masking technique is more stable and converges faster than the model trained without it.
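For readers unfamiliar with the pass@1 metric, the sketch below shows one common way to compute it, assuming the widely used unbiased pass@k estimator over n sampled completions per problem, of which c pass the tests. Whether the evaluation pipeline uses exactly this form, and the sample counts in the usage example, are assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples, c of which are correct."""
    if n - c < k:
        # Every size-k subset of the samples contains a correct completion.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """Average pass@k over problems; `results` holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts: two problems with 20 samples each.
    print(benchmark_pass_at_k([(20, 5), (20, 15)], k=1))
```

For k = 1 the estimator reduces to c / n, the fraction of sampled completions that pass.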
Appendix C Example CanItEdit Benchmark Problems
We showcase four examples from the CanItEdit benchmark, which we believe are representative of the types of problems present in the dataset.
External Libraries in CanItEdit
Our benchmark includes 21 problems that import external libraries, which are libraries outside of Python’s standard environment. We report the list of external libraries used and their number of appearances in the dataset: NumPy (13), Pandas (6), SciPy (3), scikit-learn (3), PyTorch (3), Z3 (2), autograd (2), Flask (1), vLLM (1)
oop_refactor
Figure 8 details a task where the model refactors code using object-oriented programming (OOP) principles. Initially, the code is a function for formatting messages based on type. The refactoring involves creating TextMessage and ImageMessage as subclasses of an abstract Message class and implementing a MessageFactory for message construction. This task provides an example of a perfective edit, focusing on reorganizing the code into an OOP style without adding new features. The transformation is quite significant, and the largest relative transformation in our dataset: from a single function to a multi-class OOP program. The goal is to assess the model’s proficiency in converting functional code into well-structured OOP designs based on comprehensive instructions and for the model to restructure small programs into much larger ones. Our test suites verify both functional correctness and the proper hierarchical class structure.
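To give a sense of the target shape of this edit, the sketch below follows the class and method names given in the task's instruction (`Message`, `TextMessage`, `ImageMessage`, `MessageFactory.get_message`). The strings returned by `process` are hypothetical placeholders, since the benchmark's actual formatting logic is not reproduced here.

```python
from abc import ABC, abstractmethod


class Message(ABC):
    """Abstract base class holding the message content."""

    def __init__(self, content: str):
        self.content = content

    @abstractmethod
    def process(self) -> str:
        """Return the formatted message."""


class TextMessage(Message):
    def process(self) -> str:
        # Placeholder formatting; the benchmark's exact output is not shown here.
        return f"Processed text: {self.content}"


class ImageMessage(Message):
    def process(self) -> str:
        # Placeholder formatting; the benchmark's exact output is not shown here.
        return f"Processed image: {self.content}"


class MessageFactory:
    @staticmethod
    def get_message(message_type: str, content: str) -> Message:
        """Build the Message subclass matching message_type."""
        if message_type == "text":
            return TextMessage(content)
        if message_type == "image":
            return ImageMessage(content)
        raise ValueError(f"Unsupported message type: {message_type}")
```

A test suite for such a problem can check the class hierarchy directly, for example with `issubclass(TextMessage, Message)`, in addition to checking the returned strings.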
group_theory
Figure 9 features a task to modify a class from representing group C4 to group C8, including its operations like inverse and product. The problem highlights domain-specific problems in CanItEdit, this one being set in the context of cyclic groups. Testing domain-specific edits is crucial, especially when comparing the capabilities of large proprietary models like GPT-4 with smaller open models. It requires the model to transform the C4 class (representing a 4-element cyclic group) into the C8 class (for an 8-element group), requiring extensive edits across various code sections. This complexity presents a significant test for other code editing approaches, such as fill-in-the-middle (Bavarian et al., 2022; Fried et al., 2023), which may struggle with multiple edit locations (Yee & Guha, 2023). Key edits involve altering the size and elements methods. The necessary understanding for these modifications stems from group theory, which is not explicitly explained in the problem. This setup tests the model’s capability to execute domain-specific edits where contextual knowledge is implied rather than provided.
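The sketch below shows what a C8 class could look like under one possible encoding, in which the integer k in 0..7 stands for a rotation by 45k degrees and the group operation is addition modulo 8. The encoding and the `unit` method are assumptions; only the method names `size`, `elements`, `inverse`, and `product` come from the description above, and the benchmark's actual class may differ.

```python
class C8:
    """Cyclic group of order 8, with k in 0..7 standing for a 45*k degree rotation."""

    def size(self):
        return 8

    def elements(self):
        return set(range(8))

    def unit(self):
        # Identity element: rotation by 0 degrees (method name is an assumption).
        return 0

    def inverse(self, a):
        # Rotating back by the same angle undoes the rotation.
        return (8 - a) % 8

    def product(self, a, b):
        # Composing two rotations adds their angles modulo a full turn.
        return (a + b) % 8
```

Under this encoding, moving from C4 to C8 changes the modulus and the element set in every method, which is why the edit touches several locations at once.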
strategy
Figure 10 presents an open-ended problem where the model devises a game strategy to defeat the already implemented CornerStrategy in Tic Tac Toe. This task represents an adaptive edit, focused on developing a new feature without altering existing classes. What makes this problem unique is that the rules of the game are not provided; the model must infer them from the code. Additionally, it leaves the strategy design entirely to the model’s discretion. Our tests ensure that the Game class remains intact and that the model’s strategy consistently outperforms CornerStrategy in the game.
sudoku_solver
Figure 11 presents a sudoku solver problem leveraging the Z3 satisfiability modulo theories (SMT) solver. The problem starts with an incomplete solver that lacks checks for 3x3 subgrids, both in its solving logic and board validity function. In sudoku, each 3x3 grid must contain distinct numbers from 1 to 9. The task involves adding these checks to ensure the solver can correctly solve a sudoku board. This problem assesses the model’s capability to implement edits across different code sections. Although it uses Z3, in-depth knowledge of the library or SMT isn’t required; the features needed to solve the problem can be inferred from the existing code, which already includes checks for row and column uniqueness.
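The missing validity check can be illustrated without Z3. The sketch below tests only the 3x3 subgrid condition on a completed 9x9 board given as a list of lists of integers; the function name and the board representation are assumptions, and the benchmark's `check_valid` function may be structured differently.

```python
def subgrids_are_unique(board):
    """Return True if every 3x3 subgrid contains each of 1..9 exactly once."""
    for top in range(0, 9, 3):
        for left in range(0, 9, 3):
            cells = [board[row][col]
                     for row in range(top, top + 3)
                     for col in range(left, left + 3)]
            if sorted(cells) != list(range(1, 10)):
                return False
    return True
```

A board whose rows are cyclic shifts by one cell, such as `board[r][c] = (r + c) % 9 + 1`, has unique rows and columns but fails this check, which is the kind of input that the incomplete solver would wrongly accept.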
Instruction | Type |
---|---|
Abstract the code into an object-oriented version of itself. To do that, create an abstract class ‘Message(ABC)‘, which can be initialized with a ‘content‘ string. The class should have an abstract method ‘process(self)‘, which should return a string. Create two children classes ‘TextMessage‘ and ‘ImageMessage‘, which implement the ‘process‘ method. Finally, create a ‘MessageFactory‘ that has a static method ‘get_message(message_type, content) -> Message‘; static methods can be defined with the ‘@staticmethod‘ decorator. The ‘get_message‘ method should return ‘Message‘ corresponding to the ‘message_type‘ (either ‘text‘ or ‘image‘), and it should throw a ValueError if the ‘message_type‘ is not valid. | Descriptive |
Make the code object-oriented. Specifically, create an abstract class ‘Message‘, and children classes ‘TextMessage‘ and ‘ImageMessage‘. The ‘Message‘ class should have a method ‘process(self)‘ that returns the message which was given to the constructor. Also, create a ‘MessageFactory‘ that has a static method ‘get_message(message_type, content) -> Message‘; should raise an exception if the message type is not supported. | Lazy |
Instruction | Type |
---|---|
Edit the C4 class, which represents rotations of 0, 90, 180 and 270 degrees, to represent the class C8, which represents rotations of 0, 45, 90, 135, 180, 225, 270 and 315 degrees. | Descriptive |
Edit the C4 class and its methods to represent the C8 group instead | Lazy |
Instruction | Type |
---|---|
The following code describes a tic-tac-toe game which takes in two strategies and determines who wins if they play each other. The ‘Strategy‘ class defines an abstract method, ‘returnMove(board)‘, which returns a tuple representing where this strategy will move, given a board state. The ‘CornerStrategy‘ class is a subclass of ‘Strategy‘ with a concrete implementation of ‘returnMove(board)‘. The ‘Game‘ class constructor takes in two strategies. It has a method ‘player1Won‘ which determines if the first strategy provided will beat the other if they both take turns alternating between moves. There are two methods, ‘playerXWon‘ and ‘gameOver‘ which determine how a game is won and when it is over. Create a class ‘GoodStrategy‘ which extends ‘Strategy‘ such that ‘Game(GoodStrategy(), CornerStrategy()).player1Won()‘ returns ‘True‘. This can not be solved by modifying the ‘Game‘, ‘Strategy‘, or ‘CornerStrategy‘ classes in any way. | Descriptive |
Create a strategy ‘GoodStrategy‘, that beats ‘CornerStrategy‘. Do not modify the ‘Game‘ class. | Lazy |
Instruction | Type |
---|---|
This version of the sudoku solver and checker does not reflect the original game of sudoku; the original game also checks for the uniqueness of 3x3 subgrids in addition to the rows and columns. Update the ‘assert_uniq‘ function to add new constraints for all nine 3x3 subgrids, and update the ‘check_valid‘ function to make sure that input grids have unique 3x3 subgrids. | Descriptive |
Make both the sudoku solver and verifier support the nine 3x3 subgrids that are in the original sudoku game. | Lazy |