Out of style: Misadventures with LLMs and code style transfer

Karl Munson1    Chih-Kai Ting1    Serenity Wade1    Anish Savla1    Julian Dolby2    Kiran Kate2&Kavitha Srinivas2 1University of California, Santa Cruz, USA
2IBM Research, USA
{ksmunson, cting3, swade1, ansavla}@ucsc.edu, [email protected], [email protected], [email protected]
Abstract

Like text, programs have styles, and certain programming styles are more desirable than others for program readability, maintainability, and performance. Code style transfer, however, is difficult to automate except for trivial style guidelines such as limits on line length. Inspired by the success of using language models for text style transfer, we investigate if code language models can perform code style transfer. Code style transfer, unlike text transfer, has rigorous requirements: the system needs to identify lines of code to change, change them correctly, and leave the rest of the program untouched. We designed CSB (Code Style Benchmark), a benchmark suite of code style transfer tasks across five categories including converting for-loops to list comprehensions, eliminating duplication in code, adding decorators to methods, etc. We then used these tests to see if large pre-trained code language models or fine-tuned models perform style transfer correctly, based on rigorous metrics to test that the transfer did occur, and the code still passes functional tests. Surprisingly, language models failed to perform all of the tasks, suggesting that they perform poorly on tasks that require code understanding. We will make available large-scale corpora to help the community build better code models.

1 Introduction

As with text, programs have style attributes, and certain styles are more desirable than others for program readability, maintainability, and even performance (Allamanis et al. (2014)). Most enterprises consider program style so important that they mandate adherence to organization-specific styles; in fact, Allamanis et al. (2014) point out that one-third of manual code reviews are about adherence to style. The process of reviewing and transferring across program styles, however, is a difficult problem to automate using symbolic approaches. As a result, enterprises rely mostly on manual code reviews to enforce style requirements; only the most syntactic style constraints such as new lines after method calls or lines that do not exceed a certain length can be checked and performed automatically.

Inspired by work on text style transfer, we ask whether it is possible to use neural models for program style transfer. Existing work on text style transfer shows that neural models perform well on text transfer (Riley et al. (2021), Tikhonov et al. (2019), Reif et al. (2021), ** et al. (2020), Reif et al. (2022)), where a single sentence is translated into a different styled sentence.

Refer to caption
Figure 1: An example of the code encapsulation task.

However, program style transfer has different challenges compared to text style transfer:

  • Unlike existing research on text transfer, which tends to require that a short single sentence be transformed as a whole into another sentence, style transfer in programs involves identifying one or more specific snippets of code in a much larger script to replace with an alternative while leaving large parts of the code unchanged. Most importantly, the semantics of programs are formally defined, which means that the tasks can serve as a stress test to the capabilities of language models.

  • Evaluation of text style transfer is subjective because precisely checking meaning preservation, as well as style transfer strength, is difficult for text. For code, there are rigorous metrics one can employ to measure how well programs perform style transfer.

Figure 1 illustrates an example of these challenges in a code encapsulation transfer task. The methods readString and readInt in the example have the same first 3 lines of code outlined in red. These lines are duplicated and encapsulating them into a new method extracted_0 is better programming practice. To do so, however, requires a non-trivial level of code understanding; duplicate lines need to be identified and moved into a new method, and the method needs to be called correctly with the right parameters. Additionally, all other lines need to be left as is.

The goal in this work was to examine in detail, the capabilities of large language models for code. Although language models have been shown to perform well on code generation tasks, we found that existing language models do not perform code style transfer at all. Surprisingly, even seemingly simple style transfer tasks such as casing were difficult for language models. Casing of variables seems trivial, but it does require an understanding of program scope, potential naming conflicts with existing variables, and appropriate changes in references to renamed variables. In addition, even when models changed the required lines correctly, they often changed other lines of code they should have left untouched. It was also fairly common for language models to ’pass’ functional tests, by not transforming the code at all. We therefore designed a new metric DiffCorrect that combines functional correctness with diffs to provide insight into these types of errors, to help us understand exactly what language models do with code.

To document and understand code style effects more carefully, we designed a set of benchmarks for code called the Code Style Benchmark (CSB) in two different languages (Python and Java). Each program had one of five classes of style changes performed on it, ensuring that changing the program’s style did not change the semantics of the program. We generated parallel corpora for (a) changing casing conventions for a language (e.g. changing variables to lowercase), (b) creating list comprehensions from for-loops for Python, (c) eliminating duplicate code in a program for reuse, (d) adding documentation to code, and (e) adding decorators to methods. We note that these are useful style transfer tasks that would be nice to have in IDEs but in fact, no IDEs exist yet to do this because many transformations such as even casing are hard to perform algorithmically.111Refactoring to extract code into methods does exist in IDEs, but, since it requires explicit code ranges to extract, it is intractable to apply it exhaustively to eliminate duplication. We also note that to generate the parallel corpora, we can easily undo casing or eliminate documentation, but the actual desired stylistic convention (i.e., to have correctly cased variables in code) is difficult to generate. To our knowledge, our benchmark is the first attempt at defining a dataset for code transformations that are potentially useful features that can be added to IDEs; the purpose of this dataset is to provide a rigorous benchmark to evaluate if language models can perform such transformations correctly. We evaluated all semantics preserving transforms of the code on the CodeNet (Puri et al. (2021)) benchmark because that benchmark has functional tests.

The contributions of this paper can be summarized as follows:

  • (i)

    We tested whether large language models understand code well enough to perform code style transforms while preserving the semantics of code.

  • (ii)

    We defined a code style benchmark (CSB) across 2 programming languages. CSB has parallel corpora for 5 style transform tasks and the dataset creation has been rigorously tested for functional correctness on the CodeNet dataset. To the best of our knowledge, CSB is the first benchmark for code style transfer. We will release the datasets publicly for further research on code style transfer.

  • (iii)

    Based on the observation that functional tests can be passed while not performing any transform at all, we define a new metric called DiffCorrect that is broadly applicable to code generation tasks where the output is a slightly modified version of the input.

  • (iv)

    The code to perform the transforms and generate parallel corpora is provided as a toolkit, so new datasets can be generated for the purpose of training LLMs for the future. To help increase the diversity of programs for training, we also ran transforms on a large Python training dataset and used it for fine-tuning. The training datasets are available at here, and the code is available here.

2 Related Work / Background

2.1 Code Style

Style is recognized as an important aspect of programming (Kernighan and Plauger (1982)), and coding style guides usually have a list of recommendations. IDEs and code formatting tools enable automatic style transformations for line lengths, spacing, etc. In this paper, we focus on complex aspects of style based on language features commonly used. Specifically, our work was informed by the analysis of common language features in popular open-source projects, performed by Peng et al. (2021) and Yang et al. (2022). Their work, however, does not address the style transfer task. Allamanis et al. (2014) uses statistical techniques to identify styles from code bases for Java code and suggest single identifier names and formatting. The style transformations in CSB are less about identifier names and more about a set of language features that define style for a programming language; casing might be thought to be similar to the work in Allamanis et al. (2014), but the goal of Allamanis et al. (2014) is to predict entire variable names rather than change their case. Mariano et al. (2022) propose a new neural network architecture to convert from imperative code snippets to their equivalent functional code. This task is similar to the list comprehension task we include in CSB, but the authors do not focus on a general style transfer benchmark.

2.2 Code Language Models and Tasks

Self-supervised training of language models has been extended to code and has led to a variety of pre-trained models. There are encoder-only pre-trained models such as CodeBERT Feng et al. (2020), CuBERT Kanade et al. (2020), and C-BERT Buratti et al. (2020), encoder-decoder models such as CodeT5 Wang et al. (2021) and PLBART Ahmad et al. (2021) and generative models such as Codex Chen et al. (2021), Starcoder Li et al. (2023). We use CodeT5 for fine-tuning to demonstrate the usefulness of our benchmarks in terms of fine-tuning and Starcoder to see if very powerful neural models can be used with prompting to transform program style.

These models have been successfully applied to many code tasks such as bug detection, clone detection, translation, code search, documentation generation, code completion, code summarization, variable misuse, incorrect use of operators or operands, etc. There are benchmark datasets available for these tasks including CodexGLUE Lu et al. (2021a), and CodeSearchNet Husain et al. (2019). To the best of our knowledge, style transfer for list comprehension, decorators, casing, and code encapsulation are novel tasks and there are no parallel corpora available for them. Documentation generation and its related task of comment generation are well known but we include it in our style transfer benchmark because docstrings and comments are often considered desirable style attributes of code.

3 Benchmark Task Definitions

This section explains the tasks under CSB and provides details of how we created the parallel corpora for each task.

In most cases, it can be observed it is easier to transform the code in one direction, but not the other. As an example, it is easy to convert an existing list comprehension to a for-loop; the reverse is harder because not every for-loop can be written as a list comprehension. For this case, our target style is functional i.e. with list comprehensions, but we do the reverse transformation to obtain a parallel corpus for that task. For most of the tasks we considered, we modified the original code, and then used the transformed code as input for the models, with the original being the gold label. The one exception was refactoring duplicate code in Java, where we exploited existing tools to refactor the code in the correct direction. We nevertheless include it in the benchmark because, although such refactoring tools exist in Java, it requires the programmer to select the code to be refactored. We exhaustively searched lines of code to compute duplication; this is possible only for smaller programs. Our goal here is to have neural models learn it for small programs, but then apply it to larger ones, and better yet generalize the transform to dynamically typed languages where such refactoring is even more difficult. To test the correctness of the transforms, we tested against the Python and Java subsets of the CodeNet (Puri et al. (2021)) corpus, so each transform was checked for correctness by verifying that the output of the transformed program on the specified inputs matches the output of the original program.

3.1 List Comprehensions

List comprehension is a popular functional programming feature in Python, and is often considered a representative of idiomatic Python (Alexandru et al. (2018)). List comprehensions make the code more readable and also execute efficiently (Yang et al. (2022)). It is an example of a transform that is relatively straightforward to convert the comprehension to a for-loop, but not all for-loops can be converted into comprehensions, and that is why a statistical approach to the style change is useful. Figure 2 shows an example of such a transform on a Python snippet; the list comprehension on line 11 gets converted into a for loop in lines 11-13. To transform list comprehensions, we traverse the AST iterating through the body of nodes looking for instances of ast.Assign and where the value of the body is ast.ListComp. An equivalent for-loop to the list comprehension is created.

\inputminted

[linenos,highlightlines=11,fontsize=]pythonfigures/examples/s188426131.py \inputminted[linenos,highlightlines=11-13,fontsize=]pythonfigures/examples/s188426131_transformed_uncomp.py

Figure 2: Original program and the list comprehension transform

3.2 Decorators

Using decorators to add functionality to existing functions is a commonly used feature in Python (Peng et al. (2021)). To transform code to eliminate decorators, we eliminated each AST node’s decorator list in the Python AST and rewrote the AST tree back to the source. Decorators were removed from ast.FunctionDef, ast.ClassDef, and ast.AsyncFunctionDef nodes and this was applied recursively to child nodes. Figure 3 shows an example of the original program with a decorator on line 4 that is eliminated by the transform. Note here that the code does contain imports on line 2 which should help the model reconstruct the missing decorator. This is typically the case with Python. Decorators in Python do sometimes change the semantics (e.g., classmethod), while others like lru_cache are hints for the performance characteristics of the program. We therefore kept only the decorators that do not change the semantics of code, which were lru_cache and njit.

We performed the analogous transformation in Java, eliminating annotations. To preserve program meaning, we eliminated only those annotations that are defined to be @Retention(value=SOURCE); in the CodeNet programs, that comprised @Override and SupressWarnings.

\inputminted

[linenos,highlightlines=4,fontsize=]pythonfigures/examples/s140959838.py \inputminted[linenos,fontsize=]pythonfigures/examples/s140959838_decorator_transform.py

Figure 3: Original program and its decorator transform

3.3 Casing

Consistent casing of identifier names is an important programming convention that improves readability. We created two different versions of this task for Java and Python, by focusing on the casing of variables alone. To create parallel corpora, we identified assignments to variables using the AST parser for Python and the Eclipse refactoring for Java; in both cases, we lower-cased them and stripped out underscores. In identifying assignments that should be changed, we ensured that we did not rename the variable to one that already existed in the program scope, rename it to a reserved keyword in the language, alias to method names, or alias to class names within the program. All references to that variable within the program scope were then renamed to the same variable. Figure 4 illustrates how case is changed carefully, to preserve program semantics. MOD is left unchanged because it is referenced as a parameter, all other variables are changed, consistently within scope.

For Java, we used Eclipse refactoring tools to perform renaming. Eclipse gives us two things: a way to iterate across variable names only, and it ensures that renamed variables do not cause conflicts with existing variables, do not become a keyword, etc. Thus our renaming just iterates across all variables in each method, trying to convert it to lowercase and skip** variables for which the refactoring fails.

\inputminted

[linenos,highlightlines=1, 5-8,11,fontsize=]pythonfigures/examples/s489853164.py \inputminted[linenos,highlightlines=1, 5-8,11,fontsize=]pythonfigures/examples/s489853164.py_transformed.py

Figure 4: Original program, and its casing transform

3.4 Documentation

Documentation for modules, classes, functions, and methods is commonly specified as docstrings in Python. Inferring documentation from code is a well-known task Lu et al. (2021b), Chen et al. (2021), and we include it here because it is an important attribute of coding style. Dataset generation was performed via the removal of docstrings using a Python AST parser. This was implemented by traversing the AST for ast.Module, ast.FunctionDef, ast.ClassDef, and ast.AsyncFunctionDef nodes. For each of these nodes, docstrings were removed by checking if the first part of the body of the nodes listed above is an expression with a string value. We do not include actual examples here because the deletion is obvious.

3.5 Code Reuse

Good style is often taken to imply little redundancy, so we created a task of removing redundancy. Since code equivalence in general is undecidable, we focus on a specific task: extracting duplicate chunks of code into a method. For constructing our dataset, this refactoring must be automatable across a repository of code. To our knowledge, the combination of non-trivial method extraction and automation is best supported by Eclipse JDT, so this task applies only to our Java code.

In particular, we apply the Extract Method refactoring to every contiguous range of code of more than one statement, and Eclipse determines whether it can be extracted into a method. If so, Eclipse also finds any other chunks of code that can use the same function. If there is at least one such chunk, we execute the refactoring, yielding a new method with at least two calls.

This approach requires an exhaustive search across possible code chunks, and, for each of them, a search for duplicates. As such, it is very expensive and suitable only for tasks such as dataset curation, where running for hours is acceptable. Even then, the search is, at best, quadratic in program size, so it cannot scale to large programs. It is sufficient for CodeNet Java.

\inputminted

[linenos,tabsize=1,highlightlines=9-10,fontsize=]javafigures/examples/reuse-before-s807028993.java \inputminted[linenos,tabsize=1,highlightlines=9-10,14-17,fontsize=]javafigures/examples/reuse-after-s807028993.java

Figure 5: Original program and the expected transformation for the reuse task

Figure 5 shows an example program for this type of refactoring. This example illustrates some aspects of this transformation:

  1. 1.

    lines 9 and 10 of the original code have the same form but use different local variables, which requires the tool to understand the dataflow of those distinct variables.

  2. 2.

    the extracted requires parameters to be passed the created methods, which requires the tool to consistently add the parameter to the method and arguments to each call.

4 Benchmark Data Characteristics

4.1 CodeNet

CodeNet has 1,430,135 Python programs and 74,984 Java programs. Each of these programs was transformed as described in Section 3 for each transform. Table 1 has the total number of transformed code examples for each task. The Test column shows the subset of programs in the task that we used to probe the different LLMs for code style changes. For the programs with reuse, 906 of them had refactored methods that require passing parameters; thus the refactoring step is quite difficult.

Transform #Total #Test
List Comprehension 179,017 1357
Decorator (Python) 2,493 400
Annotations (Java) 750 75
Casing (Python) 635,306 1357
Docstrings (Python) 9,950 1357
Casing (Java) 30,887 250
Code Encapsulation 1695 225
Table 1: Number of transforms for each task in the CodeNet dataset.

To test the functional correctness of the code after the transform, we ran the original code and the transformed code on the CodeNet corpus with the given inputs. For Java, we ran on all 74,984 Java programs to ensure the transforms produce semantically correct code.

For the Python code, there were about 6,000 cases of Python programs in CodeNet where we were not able to run the code due to missing dependencies. Another 2,722 programs failed to run on the transformed code because the library that wrote the transformed AST back to the source wrote code that caused syntax errors. For the casing transform, 45 programs failed because the transformation involved a variable that was executed using dynamically created code (i.e., using eval or exec). All the other transformed code for Python passed the functional tests. For Java, all changed programs executed correctly.

4.2 BigQuery

Although CodeNet has this desirable property of being executable, the diversity of the types of programs in CodeNet is somewhat limited. To increase the diversity of programs useful for training LLMs we ran on Google’s BigQuery dataset (bigquery-public-data.github_repos). As in prior work Kanade et al. (2020), files were deduplicated; a subset of BigQuery is the same as the Kanade et al. (2020) dataset. Because our refactoring for Java requires compilable code, we could not take the BigQuery dataset for Java and perform transforms on it. Therefore our Java training datasets were drawn from CodeNet. Table 2 shows the characteristics of the train, validation, and test sets for BigQuery. Note that we did not use these test sets in our evaluation, but we did create the splits so we could use training and validation for our fine-tuning.

Transform #Train #Val #Test #Test
Subset
List Comprehension 61383 6821 477 477
Decorator 146260 36566 9105 2001
Casing (Python) 648540 162135 68620 1997
Docstrings 2281778 253531 17353 2001
Table 2: Number of training, validation, and test examples for each task from the BigQuery datasets.

5 Experiments

Task Model CodeBLEU Added Removed No Unexp. No Unexp. Passed Passed
Correctly Correctly Removed Added Correct
List Starcoder .82 .81 .84 .62 .47 .61 .36
Comprehension Code Llama .91 .83 .86 .60 .60 .72 .41
CodeT5 .99 .96 .99 .74 .61 .57 .55
Decorator Starcoder .66 .34 - .86 .36 0.62 .26
(Python) Code Llama .72 .41 - .92 .44 .55 .23
CodeT5 .92 .16 - 0 .13 0 0
Decorator Starcoder .92 0 - .14 0 .54 0
(Java) Code Llama .97 0 - .50 .02 .73 0
CodeT5 .99 .93 - 1 .88 .46 .41
Casing Starcoder .81 .40 .41 .80 .79 .81 .29
(Python) Code Llama .84 .40 .41 .85 .81 .80 .32
CodeT5 .96 .48 .54 .63 .25 .31 .12
Casing Starcoder .92 .89 .99 .13 .004 .25 0
(Java) Code Llama .85 .86 .99 .20 .01 .72 0
CodeT5 .96 .99 .99 .08 .08 .46 .07
Docstrings Starcoder .67 .97 - .97 .95 .85 .83
Code Llama .61 .97 - .89 .16 .15 .13
CodeT5 .67 .98 - .61 .15 .56 .12
Code encapsulation Starcoder .86 0 .04 .11 .04 .27 0
(Java) Code Llama .89 0 .09 .20 .03 .70 0
CodeT5 .92 .47 .59 .70 .59 .12 .02
Table 3: The performance of models, showing which lines were added or removed correctly, with no unexpected additions or deletions; the table shows fractions of programs where the given model correctly computed the given metric

We examined the level to which large pre-trained code models understand code well enough to change program style with few shot prompting. To assess possible improvements in code understanding through fine-tuning, we also fine-tuned a language model but on a smaller code model due to a lack of sufficient resources to train large models.

Large language models (LLMs) have been shown to be capable of text style transfer with in-context learning (Reif et al. (2022); Sun et al. (2023)). To evaluate LLMs on the code style transfer task, we chose Starcoder-15.5B Li et al. (2023) and codellama-34B-instruct Rozière et al. (2023) for in-context learning. Both are state-of-the-art open-source models which have been trained on multiple programming languages. We use 1-shot prompting with a carefully chosen example and a control instruction for each of the tasks. The prompts are included in the appendix.

To examine if performance is any better from fine-tuning, we used our parallel corpora for fine-tuning a small model. We picked CodeT5 (Wang et al. (2021)) for fine-tuning as it was trained using multiple code tasks. We fine-tuned CodeT5-small (60M) with two different learning rates (2e32𝑒32e-32 italic_e - 3 and 2e52𝑒52e-52 italic_e - 5) and found that CodeT5-small with a learning rate of 2e32𝑒32e-32 italic_e - 3 performed the best. The max length of CodeT5 was set to 512. Each model was fine-tuned for four epochs for all tasks.

5.1 Model output examples

Figure 6 provides an example of the types of errors we observe with code LLMs. The model (in this example both codeLLAMA) does find the variables numofsock and converts it correctly to camel case (which is language appropriate), and changes its declaration and uses. However, it incorrectly changes the case of A and B as well, even though it would be unchanged in Java. We designed a metric called DiffCorrect to capture these sorts of errors because the model could easily pass functional tests without making any of the requisite changes. The core idea is to capture the categories of lines that are Added Correctly and Removed Correctly (green lines), but with no unexpected additions or deletions (red lines).

\inputminted

[linenos,obeytabs=true,tabsize=1,highlightlines=6, 8, 10, 11,fontsize=]javafigures/examples/s786937571.java \inputminted[linenos,obeytabs=true,tabsize=1,highlightlines=6, 8, 10, 11,fontsize=]javafigures/examples/s786937571_transformed.java {minted}[linenos,obeytabs,tabsize=1,escapeinside=——,fontsize=, escapeinside=——]java public class Main public static void main(String[] args) FastReader s = new FastReader(); —int A = s.nextInt();— —int B = s.nextInt();— —int numOfSock = 1;— int count = 0; —while(numOfSock ¡ B)— —numOfSock -= 1;— —numOfSock += A;— count++; System.out.println(count);

Figure 6: Original program, its transform, and CodeLlama output for casing changes on the same program in Java. Highlighted lines in the original program show the lines to be changed, such that variables are camel-cased. Note that CodeLlama correctly identified the lines to be removed and added them (in green), but also changed lines in red incorrectly.

5.2 DiffCorrect

As described above, we look at the changes expected for a given style transfer instance and see how well they got done while leaving code that should not be touched unchanged. This is accomplished by the code in Figure 7 which computes the DiffCorrect metric.

  Function CodeCompare input, output, expected
  
  er𝑒𝑟absenter\leftarrowitalic_e italic_r ← Filter(isDeletion, Diff(input, expected))
  ar𝑎𝑟absentar\leftarrowitalic_a italic_r ← Filter(isDeletion, Diff(input, output))
  removedCorrectly(erar)=𝑟𝑒𝑚𝑜𝑣𝑒𝑑𝐶𝑜𝑟𝑟𝑒𝑐𝑡𝑙𝑦𝑒𝑟𝑎𝑟removedCorrectly\leftarrow\left(er-ar\right)=\varnothingitalic_r italic_e italic_m italic_o italic_v italic_e italic_d italic_C italic_o italic_r italic_r italic_e italic_c italic_t italic_l italic_y ← ( italic_e italic_r - italic_a italic_r ) = ∅
  noUnexpectedRemoved(arer)=𝑛𝑜𝑈𝑛𝑒𝑥𝑝𝑒𝑐𝑡𝑒𝑑𝑅𝑒𝑚𝑜𝑣𝑒𝑑𝑎𝑟𝑒𝑟noUnexpectedRemoved\leftarrow\left(ar-er\right)=\varnothingitalic_n italic_o italic_U italic_n italic_e italic_x italic_p italic_e italic_c italic_t italic_e italic_d italic_R italic_e italic_m italic_o italic_v italic_e italic_d ← ( italic_a italic_r - italic_e italic_r ) = ∅
  
  ea𝑒𝑎absentea\leftarrowitalic_e italic_a ← Filter(isAddition, Diff(input, expected))
  aa𝑎𝑎absentaa\leftarrowitalic_a italic_a ← Filter(isAddition, Diff(input, output))
  addedCorrectly(eaaa)=𝑎𝑑𝑑𝑒𝑑𝐶𝑜𝑟𝑟𝑒𝑐𝑡𝑙𝑦𝑒𝑎𝑎𝑎addedCorrectly\leftarrow\left(ea-aa\right)=\varnothingitalic_a italic_d italic_d italic_e italic_d italic_C italic_o italic_r italic_r italic_e italic_c italic_t italic_l italic_y ← ( italic_e italic_a - italic_a italic_a ) = ∅
  noUnexpectedAdded(aaea)=𝑛𝑜𝑈𝑛𝑒𝑥𝑝𝑒𝑐𝑡𝑒𝑑𝐴𝑑𝑑𝑒𝑑𝑎𝑎𝑒𝑎noUnexpectedAdded\leftarrow\left(aa-ea\right)=\varnothingitalic_n italic_o italic_U italic_n italic_e italic_x italic_p italic_e italic_c italic_t italic_e italic_d italic_A italic_d italic_d italic_e italic_d ← ( italic_a italic_a - italic_e italic_a ) = ∅
  
  EndFunction

Figure 7: Computing model changes

Figure 7 has three parameters, input𝑖𝑛𝑝𝑢𝑡inputitalic_i italic_n italic_p italic_u italic_t, output𝑜𝑢𝑡𝑝𝑢𝑡outputitalic_o italic_u italic_t italic_p italic_u italic_t, and expected𝑒𝑥𝑝𝑒𝑐𝑡𝑒𝑑expecteditalic_e italic_x italic_p italic_e italic_c italic_t italic_e italic_d which are, respectively, the original program, the model output, and the expected restyled program. The approach is to compute diffs of diffs: er𝑒𝑟eritalic_e italic_r gets all lines removed from input𝑖𝑛𝑝𝑢𝑡inputitalic_i italic_n italic_p italic_u italic_t in expected𝑒𝑥𝑝𝑒𝑐𝑡𝑒𝑑expecteditalic_e italic_x italic_p italic_e italic_c italic_t italic_e italic_d, and ar𝑎𝑟aritalic_a italic_r gets all the lines removed from input𝑖𝑛𝑝𝑢𝑡inputitalic_i italic_n italic_p italic_u italic_t in output𝑜𝑢𝑡𝑝𝑢𝑡outputitalic_o italic_u italic_t italic_p italic_u italic_t. We then compute (erar)=𝑒𝑟𝑎𝑟\left(er-ar\right)=\varnothing( italic_e italic_r - italic_a italic_r ) = ∅, meaning whether everything expected to be removed is indeed removed, assigning that to removedCorrectly𝑟𝑒𝑚𝑜𝑣𝑒𝑑𝐶𝑜𝑟𝑟𝑒𝑐𝑡𝑙𝑦removedCorrectlyitalic_r italic_e italic_m italic_o italic_v italic_e italic_d italic_C italic_o italic_r italic_r italic_e italic_c italic_t italic_l italic_y. Then (arer)=𝑎𝑟𝑒𝑟\left(ar-er\right)=\varnothing( italic_a italic_r - italic_e italic_r ) = ∅ captures whether everything removed is expected to be removed, which is assigned to noUnexpectedRemoved𝑛𝑜𝑈𝑛𝑒𝑥𝑝𝑒𝑐𝑡𝑒𝑑𝑅𝑒𝑚𝑜𝑣𝑒𝑑noUnexpectedRemoveditalic_n italic_o italic_U italic_n italic_e italic_x italic_p italic_e italic_c italic_t italic_e italic_d italic_R italic_e italic_m italic_o italic_v italic_e italic_d. There is analogous code for additions.

Ideally, all four results are 1, meaning exactly what should be added and removed is added and removed respectively, with no unexpected changes to the code.

We tried multiple implementations of diff for this approach, and we ultimately settled on difflib222https://docs.python.org/3/library/difflib.html as providing the smallest diffs of files; a larger diff includes lines that did not in fact change, which could make our results seem worse than they are.

5.3 Results

Table 3 shows the results of the models on the DiffCorrect metrics. For transforms that do not involve any deletion (e.g. docstrings and decorators) removing lines is not needed so we leave that metric blank. CodeBLEU is provided for comparison, but DiffCorrect gives us more insights into where the models go wrong. As shown in the table, models have difficulty changing only the requisite lines of code, and leaving other lines unchanged. Both LLMs were weaker in Java than in Python. The most surprising result is for docstrings; powerful models such as Code Llama change the code to the point where it fails functional tests. In the few cases we examined manually, it was because Code Llama produced docstrings that would not parse as a string literal.

Fine-tuning did help a much smaller model (Code T5) perform list comprehension better, and for casing, it helped identify the lines to be changed correctly. However, fine-tuning did not always help with the tendency to edit incorrect lines. We also see how the fine-grained nature of the DiffCorrect metric helps us understand exactly where models go wrong compared to traditional metrics such as CodeBLEU. CodeBLEU combined with pass rates would have provided a misleading picture of the ability of these models to transform code.

6 Conclusions and Future Work

We evaluated the capability of current code LLMs to perform code style changes because this is an important requirement for many organizations. To our surprise, existing code LLMs could not perform this task well. To help LLMs perform better on such a task, we built benchmarks and defined a new metric called DiffCorrect that can help in building better models. Strengthening the ability of LLMs to perform these types of transformations is a target for future work.

References

  • Ahmad et al. [2021] Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. Unified pre-training for program understanding and generation. CoRR, abs/2103.06333, 2021.
  • Alexandru et al. [2018] Carol V. Alexandru, José J. Merchante, Sebastiano Panichella, Sebastian Proksch, Harald C. Gall, and Gregorio Robles. On the usage of pythonic idioms. In Proceedings of the 2018 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, Onward! 2018, page 1–11, New York, NY, USA, 2018. Association for Computing Machinery.
  • Allamanis et al. [2014] Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. Learning natural coding conventions. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, page 281–293, New York, NY, USA, 2014. Association for Computing Machinery.
  • Buratti et al. [2020] Luca Buratti, Saurabh Pujar, Mihaela A. Bornea, J. Scott McCarley, Yunhui Zheng, Gaetano Rossiello, Alessandro Morari, Jim Laredo, Veronika Thost, Yufan Zhuang, and Giacomo Domeniconi. Exploring software naturalness through neural language models. CoRR, abs/2006.12641, 2020.
  • Chen et al. [2021] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021.
  • Feng et al. [2020] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. Codebert: A pre-trained model for programming and natural languages, 2020. cite arxiv:2002.08155Comment: Accepted to Findings of EMNLP 2020. 12 pages.
  • Husain et al. [2019] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. Codesearchnet challenge: Evaluating the state of semantic code search, 2019.
  • ** et al. [2020] Di **, Zhi**g **, Zhiting Hu, Olga Vechtomova, and Rada Mihalcea. Deep learning for text style transfer: A survey, 2020.
  • Kanade et al. [2020] Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. Learning and evaluating contextual embedding of source code. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 12-18 July 2020, Proceedings of Machine Learning Research. PMLR, 2020.
  • Kernighan and Plauger [1982] Brian W. Kernighan and P. J. Plauger. The Elements of Programming Style. McGraw-Hill, Inc., USA, 2nd edition, 1982.
  • Li et al. [2023] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. Starcoder: may the source be with you!, 2023.
  • Lu et al. [2021a] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. Codexglue: A machine learning benchmark dataset for code understanding and generation. CoRR, abs/2102.04664, 2021.
  • Lu et al. [2021b] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. Codexglue: A machine learning benchmark dataset for code understanding and generation. CoRR, abs/2102.04664, 2021.
  • Mariano et al. [2022] Benjamin Mariano, Yanju Chen, Yu Feng, Greg Durrett, and Işil Dillig. Automated transpilation of imperative to functional code using neural-guided program synthesis. Proc. ACM Program. Lang., 6(OOPSLA1), apr 2022.
  • Peng et al. [2021] Yun Peng, Yu Zhang, and Mingzhe Hu. An empirical study for common language features used in python projects. In 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 24–35, 2021.
  • Puri et al. [2021] Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Shyam Ramji, Ulrich Finkler, Susan Malaika, and Frederick Reiss. Codenet: A large-scale ai for code dataset for learning a diversity of coding tasks, 2021.
  • Reif et al. [2021] Emily Reif, Daphne Ippolito, Ann Yuan, Andy Coenen, Chris Callison-Burch, and Jason Wei. A recipe for arbitrary text style transfer with large language models, 2021.
  • Reif et al. [2022] Emily Reif, Daphne Ippolito, Ann Yuan, Andy Coenen, Chris Callison-Burch, and Jason Wei. A recipe for arbitrary text style transfer with large language models. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 837–848, Dublin, Ireland, May 2022. Association for Computational Linguistics.
  • Riley et al. [2021] Parker Riley, Noah Constant, Mandy Guo, Girish Kumar, David Uthus, and Zarana Parekh. Text{settr}: Label-free text style extraction and tunable targeted restyling, 2021.
  • Rozière et al. [2023] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, **gyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code llama: Open foundation models for code, 2023.
  • Sun et al. [2023] Huashan Sun, Yixiao Wu, Yinghao Li, Jiawei Li, Yizhe Yang, and Yang Gao. Tsst: A benchmark and evaluation models for text speech-style transfer, 2023.
  • Tikhonov et al. [2019] Alexey Tikhonov, Viacheslav Shibaev, Aleksander Nagaev, Aigul Nugmanova, and Ivan P. Yamshchikov. Style transfer for texts: Retrain, report errors, compare with rewrites. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 2019.
  • Wang et al. [2021] Yue Wang, Weishi Wang, Shafiq Joty, and Steven C. H. Hoi. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation, 2021.
  • Yang et al. [2022] Yi Yang, Ana Milanova, and Martin Hirzel. Complex python features in the wild. In 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), pages 282–293, 2022.