License: arXiv.org perpetual non-exclusive license
arXiv:2401.06628v2 [cs.CL] 21 Feb 2024

OOP: Object-Oriented Programming Evaluation Benchmark
for Large Language Models

Paper ID: 3791    Shuai Wang1 Liang Ding2 Li Shen3 Yong Luo1 Bo Du1 Dacheng Tao2
1Wuhan University 2The University of Sydney 3JD Explore Academy
[email protected], [email protected]
Abstract

Advancing automated programming necessitates robust and comprehensive code generation benchmarks, yet current evaluation frameworks largely neglect object-oriented programming (OOP) in favour of functional programming (FP), e.g., HumanEval and MBPP. To address this, ❶ our study introduces a pioneering OOP-focused benchmark, featuring 431 Python programs that encompass essential OOP concepts and features like classes and encapsulation methods. ❷ We propose a novel evaluation metric, pass@o, tailored for OOP, enhancing the traditional pass@k metric. ❸ Our evaluation of 23 leading large language models (LLMs), including both general and code-specialized models, reveals three key insights: 1) pass@o offers a more relevant and comprehensive assessment for OOP code generation; 2) despite excelling in FP, code-specialized LLMs like WizardCoder lag in OOP compared to models like ChatGPT; 3) the poor performance of all advanced LLMs on our OOP benchmark highlights a critical need for improvements in this field. Our benchmark and scripts are publicly released at: https://github.com/alphadl/OOP-eval.

1 Introduction

Large language models (LLMs, Ouyang et al., 2022a; Touvron et al., 2023), built from Transformer blocks Vaswani et al. (2017) with billions or even trillions of parameters, have proliferated rapidly, especially since the emergence of ChatGPT (https://chat.openai.com). In comparison to small models, LLMs exhibit stronger generalization and reasoning capabilities Wei et al. (2022). Currently, LLMs are playing a crucial role in various tasks, e.g., code generation Chen et al. (2021); Li et al. (2022); Roziere et al. (2023), language understanding Zhong et al. (2023), human-computer interaction Tolomei et al. (2023); Moslem et al. (2023), and translation Peng et al. (2023); Lu et al. (2023).

Figure 1: The performance comparison of widely-used code language models on functional programming (FP) and object-oriented programming (OOP) code generation benchmarks, in terms of pass@1 scores. All models perform relatively well on the FP benchmarks, i.e., HumanEval Chen et al. (2021) and MBPP Austin et al. (2021), while exhibiting poor performance on our OOP benchmark.

| Benchmark | Number | NL | PL | Task Type |
|---|---|---|---|---|
| HumanEval Chen et al. (2021) | 164 | en | Python | Function Programming |
| MBPP Austin et al. (2021) | 974 | en | Python | Function Programming |
| APPS Hendrycks et al. (2021) | 5000 | en | Python | Function Programming |
| CodeContests Li et al. (2022) | 165 | en | Multi | Function Programming |
| MultiPL-MBPP Cassano et al. (2023) | 974* | Multi | Multi | Function Programming |
| HumanEval-X Zheng et al. (2023) | 164* | en | Multi | Function Programming |
| MultiPL-HumanEval Cassano et al. (2023) | 164* | en | Multi | Function Programming |
| MTPB Nijkamp et al. (2022) | 115 | en | Python | Function Programming |
| ODEX Wang et al. (2022) | 945 | Multi | Python | Function Programming |
| PandasEval Zan et al. (2022) | 101 | en | Python | Function Programming |
| BIG-Bench Srivastava et al. (2022) | 32 | en | Python | Function Programming |
| CodeApex⋆ Fu et al. (2023) | 476* | zh&en | C++ | Function Programming |
| OOP (Our) | 431 | en | Python | Object-Oriented Programming |

Table 1: Overview of existing code evaluation benchmarks. (“NL” denotes the natural language describing the problem or requirements; “PL” represents the generated programming language; “en” and “zh” denote English and Chinese, respectively, and “Multi” means containing multiple NLs or PLs; “*” indicates the number of samples for each language; “⋆” means that in CodeApex, we only considered code generation tasks.)

Code generation entails writing code in a suitable programming language from natural language descriptions of problems or requirements, with the aim of solving the problems or fulfilling the requirements. Given that hiring professional programmers to write code consumes significant human and material resources, the importance of automated programming becomes particularly evident. How to use the rising LLMs to generate more accurate code from problems or requirements stated in natural language has therefore become an important research topic Liu et al. (2023); Zhong and Wang (2023). In this research, code generation evaluation is crucial. It should not only reflect the current programming performance of LLMs objectively and impartially but also expose their shortcomings, so that their potential can be further enhanced.

[Importance of OOP] According to the November programming language rankings by TIOBE (https://www.tiobe.com/tiobe-index/), four of the top five programming languages are OOP languages, which reflects the importance of OOP. OOP is centred on designing code around data or objects rather than organizing it based on functionality and logic Stroustrup (1988); Stefik and Bobrow (1985). OOP focuses more on the programming paradigm of class and object Wegner (1990). Functions are commonly referred to as methods in OOP.

[Motivation] However, existing code generation evaluation benchmarks primarily focus on FP and do not evaluate the relevant concepts and features of OOP, e.g., class, inheritance, encapsulation methods, etc. Evaluating with the existing benchmarks in Table 1 shows only the FP performance of LLMs and fails to reflect their potential in OOP, as illustrated in Figure 1.

[OOP benchmark and metric] Considering the limitations of current FP code generation benchmarks and the widespread use of the Python programming language, we propose the first OOP evaluation benchmark based on Python. The OOP benchmark consists of 431 Python programs, covering key concepts and features of OOP, including class, inheritance, encapsulation methods, etc. Furthermore, to guard against cases where LLMs do not generate the required OOP concepts and features, we have optimized the pass@k Kulal et al. (2019); Chen et al. (2021) metric by matching key points in natural language with key points in the programming language, i.e., the class names, private function names, etc., stated in the natural language requirements are matched against the class names, private function names, etc., in the generated program. Our main contributions are summarized as follows:

  1. We construct and release the first OOP evaluation benchmark, which encompasses concepts and features of OOP, e.g., class, polymorphism, encapsulation methods, etc.

  2. We devise a new metric, pass@o, based on the conventional pass@k and tailored for the OOP code generation task, by matching key points in natural language and programming language.

  3. We extensively evaluate 23 advanced LLMs on our OOP benchmark, demonstrating that i) there is still significant room for improvement on OOP tasks, and ii) our benchmark can serve as a robust and fair indicator that helps the community quantify LLMs’ OOP performance.

2 Related work

Code Evaluation Benchmark

In the early days of LLMs, researchers from Google and OpenAI launched manually written code evaluation benchmarks, namely MBPP Austin et al. (2021) and HumanEval Chen et al. (2021), respectively. MBPP and HumanEval are currently the mainstream code generation evaluation benchmarks, but both of them are based on the Python programming language. Subsequently, MultiPL-MBPP Cassano et al. (2023) and MultiPL-HumanEval Cassano et al. (2023) expanded upon these two benchmarks by translating the Python problems into eighteen other programming languages, e.g., Java, C++, PHP, etc., to evaluate the performance of LLMs across other programming languages. Additionally, HumanEval-X Zheng et al. (2023) incorporated multiple test cases into the HumanEval benchmark. Apart from the extensions made to these two benchmarks, other benchmarks like CodeApex Fu et al. (2023) and ODEX Wang et al. (2022) exhibit distinctive features across different natural languages and task types. Unlike existing code evaluation benchmarks, our proposed OOP benchmark primarily focuses on the concepts and features of OOP, e.g., class, inheritance, etc. These works are summarized in Table 1.

Code Evaluation Metrics

Existing evaluation metrics can be broadly categorized into two types: dynamic evaluation metrics and static evaluation metrics. Dynamic evaluation metrics evaluate the executability of generated code by using test cases, with pass@k Kulal et al. (2019); Chen et al. (2021) serving as the primary representative. The calculation process for pass@k is shown in Appendix A. Additionally, this category of metrics includes n@k Li et al. (2022). Static evaluation metrics compute BLEU Papineni et al. (2002), ROUGE Lin (2004), CodeScore Dong et al. (2023), and CodeBLEU Ren et al. (2020) between manually written examples and generated programs. However, these code evaluation metrics do not specifically focus on evaluating the concepts and features of OOP. Therefore, we further optimized the pass@k metric based on the evaluation benchmark for OOP.
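
As a point of reference for the dynamic metrics above, the unbiased pass@k estimator of Chen et al. (2021) can be sketched as follows. The function and argument names are illustrative choices for this sketch: n is the number of programs generated for a problem and c is the number of those that pass the unit tests.

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem as 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    # Product form of C(n - c, k) / C(n, k); avoids very large binomials.
    fail_prob = 1.0
    for i in range(n - c + 1, n + 1):
        fail_prob *= 1.0 - k / i
    return 1.0 - fail_prob


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

The estimator only counts test-passing samples, so it carries no information about which classes or functions a sample defines.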

Figure 2: The generation of private functions cannot be evaluated using pass@k. (We instructed the ChatGPT Ouyang et al. (2022b); OpenAI (2023) model to generate the class SS, the public function public_Shortest_subarray, and the private function def __private_Shortest_subarray based on a given prompt and to implement the corresponding requirements within the functions. However, ChatGPT does not generate the private function named private_Shortest_subarray, as outlined in the red box.)

3 Evaluation Framework

3.1 Overview

Existing code generation benchmarks in Table 1 are confined to FP and do not involve essential concepts and features of OOP. We take the frequently used benchmarks HumanEval Chen et al. (2021) and MBPP Austin et al. (2021) in Table 1 as examples. They primarily evaluate the capabilities of LLMs in FP. Detailed descriptions of HumanEval and MBPP are provided in Appendix B. Evaluating with the existing benchmarks in Table 1 does not show the capability of LLMs in OOP, as illustrated in Figure 1: LLMs that look decent on FP tasks perform considerably worse on OOP tasks. In addition, existing code generation evaluation metrics primarily use pass@k to evaluate the executability of the generated code. However, the pass@k metric cannot reflect whether LLMs generate the required OOP concepts and features, as illustrated in Figure 2. Therefore, pass@k cannot objectively and fairly reflect the OOP capabilities of LLMs.

Figure 3: The construction process of our object-oriented programming (OOP) benchmark.

As a result, we established an OOP benchmark and proposed the evaluation metric pass@o for OOP. The process for constructing the OOP benchmark is illustrated in Figure 3.

3.2 Building OOP Benchmarks

Data Filtering.

The training data for current LLMs mostly comes from the internet. If we directly evaluated LLMs using existing OOP data from the web, the results would not reflect the OOP capabilities of LLMs. Therefore, we first rigorously selected 500 Python-based problems or requirements described in natural language from platforms like LeetCode (https://leetcode.com/), open-source repositories on GitHub (https://github.com/), Stack Overflow (https://stackoverflow.com/), and Codewars (https://www.codewars.com/). These 500 problems or requirements are limited to FP and do not involve concepts and features related to OOP.

Human Rewritten.

Subsequently, we manually rewrite the collected 500 questions or requirements by adhering to the following rules:

  1. Design, based on the problems or requirements, the relevant OOP concepts and features, e.g., class names, inheritance names (i.e., parent class names), encapsulation method names (i.e., public function names and private function names), etc.

  2. Implement the related problems or requirements within the public function and private function of the class, while ensuring the encapsulation of that implementation.

  3. Convert the variables associated with the problems or requirements into class attribute variables, ensuring that these variables are accessible in both public and private functions.

  4. If the implementation of the problems or requirements is placed within the private function of the class, design a corresponding public function for access.

  5. Ensure that the rewritten OOP problems or requirements can be successfully implemented and accessed through objects.

Following the five rules mentioned above, we conducted a standardized rewriting of the 500 Python-based problems or requirements.
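
To make the rewriting rules concrete, the sketch below rewrites a toy function-style task into an OOP-style one. The task (counting vowels) and all names (count_vowels, VowelCounter, public_count_vowels, __private_count_vowels) are hypothetical and are not taken from the benchmark; attribute variables are modelled here as instance attributes set in __init__, which is an assumption of this sketch.

```python
# Function-style version of the (hypothetical) task.
def count_vowels(text: str) -> int:
    return sum(ch in "aeiouAEIOU" for ch in text)


# OOP-style rewrite following the five rules.
class VowelCounter:  # Rule 1: a designated class name
    def __init__(self, text: str) -> None:
        # Rule 3: the task input becomes an attribute variable that is
        # accessible from both the public and the private function.
        self.text = text

    def __private_count_vowels(self) -> int:
        # Rule 2: the implementation lives inside the class, here in a
        # private function.
        return sum(ch in "aeiouAEIOU" for ch in self.text)

    def public_count_vowels(self) -> int:
        # Rule 4: a public function provides access to the private one.
        return self.__private_count_vowels()


# Rule 5: the requirement is fulfilled and accessed through an object.
counter = VowelCounter("Object-Oriented")
result = counter.public_count_vowels()
```

Both versions return the same value for the same input, so a unit test on the return value alone cannot tell whether the private function was actually written.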

Case Design.

We then designed corresponding test cases to evaluate OOP. In the end, we obtained 431 OOP samples, as shown in Figure 3. The specific construction details of OOP are provided in Appendix C.

Level Classification.

Given the varying difficulty of programming tasks, we divided the designed OOP benchmark into three levels: Simple-level OOP, Moderate-level OOP, and Difficult-level OOP, as shown in Figure 7.

Simple-level OOP has 77 program samples and includes only classes and public functions. Moderate-level OOP builds upon simple-level OOP by adding attribute variables and private functions, and has 179 program samples. Difficult-level OOP also builds on simple-level OOP, but adds inheritance, polymorphism, and other related concepts and features of OOP; it has a total of 175 program samples. Although private functions are not involved at the difficult level, the problems or requirements in difficult-level OOP are more complex and varied. With this level classification, we can not only evaluate the performance of existing LLMs in OOP but also analyze their shortcomings, which allows us to better unearth the potential of LLMs in OOP and makes it more convenient to improve their OOP performance.
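
The three levels can be illustrated with minimal skeletons. These are hypothetical toy programs written for this illustration, not benchmark samples; the class and function names are assumptions.

```python
# Simple level: a class with a public function only.
class Greeter:
    def public_greet(self, name: str) -> str:
        return f"Hello, {name}!"


# Moderate level: adds attribute variables and a private function.
class Accumulator:
    def __init__(self, values: list[int]) -> None:
        self.values = values

    def __private_total(self) -> int:
        return sum(self.values)

    def public_total(self) -> int:
        return self.__private_total()


# Difficult level: adds inheritance and polymorphism (no private functions).
class Shape:
    def __init__(self, side: float) -> None:
        self.side = side

    def area(self) -> float:
        raise NotImplementedError


class Square(Shape):
    def area(self) -> float:
        return self.side ** 2


class Cube(Shape):
    def area(self) -> float:
        # Surface area of a cube with the given side length.
        return 6 * self.side ** 2


# Polymorphism: the same call dispatches to each subclass's implementation.
areas = [shape.area() for shape in (Square(2.0), Cube(2.0))]
```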

3.3 Evaluation Metric Pass@o

To evaluate whether LLMs generate the required OOP concepts and features in the programming language, i.e., the subclass name, parent class name, private function name, public function name, etc., we propose the pass@o metric for OOP. Building on pass@k, the pass@o metric adds keyword point matching between the natural language and the programming language, i.e.,

$$\alpha = \sum_{i=1}^{n} f\left(X_{i}\right), \quad \text{where } f\left(X_{i}\right) = \begin{cases} 1, & \text{if } \mathrm{ut}\left(X_{i}\right) \text{ passed and } \sum_{j}^{m} x_{j} \,\exists\, X_{i} \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

$$\text{pass@}o := \mathop{\mathbb{E}}_{\text{Problems}} \left[ 1 - \frac{\binom{n-\alpha}{o}}{\binom{n}{o}} \right] \tag{2}$$

In Eq. (1), $n$ represents the number of programs generated for a given problem; $X_i$ represents the $i$-th generated program; $\alpha$ represents the number of the $n$ generated programs that pass the unit tests and match all keyword points; $\mathrm{ut}(\cdot)$ denotes the unit test function; $m$ represents the number of keyword points in the current prompt; and $x_j$ represents the $j$-th keyword point in natural language. In Eq. (2), $o \leq n$.

The pass@o metric not only addresses the limitations of pass@k evaluation but also objectively and fairly reflects the OOP performance of LLMs.
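
A minimal sketch of Eqs. (1) and (2) for a single problem is given below. It assumes that keyword points are matched by plain substring search and that the unit-test outcome of each sample is already known; both are simplifications made for this sketch, and the helper names (matches_keywords, count_alpha, pass_at_o) as well as the keyword strings are hypothetical. The class and function names in the usage example follow Figure 2.

```python
from math import comb


def matches_keywords(code: str, keywords: list[str]) -> bool:
    """Return True when every keyword point appears in the generated code.

    Plain substring matching is an assumption of this sketch.
    """
    return all(keyword in code for keyword in keywords)


def count_alpha(samples: list[tuple[str, bool]], keywords: list[str]) -> int:
    """Eq. (1): count samples that pass the tests and match all keyword points.

    Each sample is a pair (generated_code, passed_unit_tests).
    """
    return sum(
        1 for code, passed in samples
        if passed and matches_keywords(code, keywords)
    )


def pass_at_o(n: int, alpha: int, o: int) -> float:
    """Per-problem term of Eq. (2): 1 - C(n - alpha, o) / C(n, o)."""
    if not 0 <= alpha <= n or not 1 <= o <= n:
        raise ValueError("require 0 <= alpha <= n and 1 <= o <= n")
    return 1.0 - comb(n - alpha, o) / comb(n, o)


# Usage: three samples for one problem (toy data).
keywords = [
    "class SS",
    "def public_Shortest_subarray",
    "def __private_Shortest_subarray",
]
samples = [
    ("class SS:\n"
     "    def __private_Shortest_subarray(self): ...\n"
     "    def public_Shortest_subarray(self): ...", True),
    # Passes the unit tests but lacks the private function.
    ("class SS:\n"
     "    def public_Shortest_subarray(self): ...", True),
    ("class SS:\n"
     "    pass", False),
]
alpha = count_alpha(samples, keywords)
score = pass_at_o(n=len(samples), alpha=alpha, o=1)
```

In this toy case two samples pass the unit tests but only one also contains every keyword point, so the per-problem score at o = 1 is 1/3, whereas counting test-passing samples alone would give 2/3.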

4 Experiments

4.1 Experimental Setup

Evaluated LLMs

In the OOP task, we conduct experiments on 23 mainstream LLMs. These models include both general LLMs, e.g., ChatGPT Ouyang et al. (2022b); OpenAI (2023), Llama2 Touvron et al. (2023), InternLm Team (2023a), MPT Team (2023b), DeepSeek Team (2024), Falcon Almazrouei et al. (2023), Qwen Bai et al. (2023), Yi (https://01.ai/cn), and code-specialized LLMs, e.g., CodeLlama Roziere et al. (2023), WizardCoder Luo et al. (2023), StarCoder Li et al. (2023), as shown in Table 6. Detailed descriptions of the 23 LLMs are given in Appendix D.

Parameter Settings.

In the experiments, we followed the settings of Llama2 Touvron et al. (2023), configuring the temperature to 0.1 and 0.8 for code generation. The remaining parameters (top-p = 0.95, n = 200, o ≤ n) were kept unchanged. We evaluate the OOP benchmark on eight NVIDIA A100 GPUs using the vLLM Kwon et al. (2023) framework, version 0.2.1.post1 (https://github.com/vllm-project/vllm).

Metrics.

In terms of evaluation metrics, we use pass@k and the proposed pass@o.

| Type | Model | pass@k (1) | pass@o (1) | Δ↓ (1) | pass@k (80) | pass@o (80) | Δ↓ (80) | pass@k (100) | pass@o (100) | Δ↓ (100) |
|---|---|---|---|---|---|---|---|---|---|---|
| General | Falcon-7b | 0.01 | 0.00 | -0.01 | 0.37 | 0.19 | -0.18 | 0.47 | 0.23 | -0.24 |
| General | Falcon-40b | 0.01 | 0.00 | -0.01 | 2.90 | 1.11 | -1.79 | 3.42 | 1.26 | -2.16 |
| General | Llama2-7b | 0.01 | 0.01 | -0.00 | 4.02 | 1.72 | -2.30 | 4.62 | 1.94 | -2.68 |
| General | InternLm-7b | 0.03 | 0.02 | -0.01 | 1.04 | 0.52 | -0.52 | 1.22 | 0.58 | -0.64 |
| General | Yi-6b | 0.07 | 0.01 | -0.06 | 5.07 | 1.67 | -3.40 | 6.00 | 1.98 | -4.02 |
| General | Llama2-13b | 0.09 | 0.06 | -0.03 | 7.28 | 2.17 | -5.11 | 8.24 | 2.41 | -5.83 |
| General | MPT-7b | 0.28 | 0.02 | -0.26 | 4.77 | 1.27 | -3.50 | 5.50 | 1.46 | -4.04 |
| General | Qwen-7b | 0.94 | 0.61 | -0.33 | 15.02 | 5.68 | -9.34 | 16.35 | 5.83 | -10.52 |
| General | Qwen-14b | 1.52 | 0.75 | -0.77 | 26.28 | 10.58 | -15.70 | 28.10 | 11.48 | -16.62 |
| General | DeepSeek-7b | 1.53 | 0.50 | -1.03 | 16.83 | 7.72 | -9.11 | 18.70 | 8.70 | -10.00 |
| General | Yi-34b | 2.20 | 1.09 | -1.11 | 21.96 | 8.43 | -13.53 | 23.68 | 9.22 | -14.46 |
| General | Llama2-70b | 3.55 | 1.25 | -2.30 | 21.01 | 9.97 | -11.04 | 23.14 | 11.16 | -11.98 |
| General | DeepSeek-67b | 8.02 | 3.71 | -3.95 | 49.31 | 27.42 | _-21.89_ | 51.60 | 29.47 | _-22.13_ |
| General | Qwen-72b | 11.20 | 4.62 | -6.58 | 57.48 | 35.70 | -21.78 | 59.52 | 37.83 | -21.69 |
| General | ChatGPT | **42.88** | **15.69** | _-27.19_ | **75.71** | **58.28** | -17.43 | **76.20** | **59.80** | -16.40 |
| Specialized | GPT_BigCode | 0.10 | 0.06 | -0.04 | 7.00 | 2.58 | -4.42 | 8.01 | 2.92 | -5.09 |
| Specialized | CodeLlama-7b | 2.67 | 1.20 | -1.47 | 24.09 | 9.16 | -14.93 | 25.72 | 9.92 | -15.80 |
| Specialized | CodeLlama-13b-Python | 2.80 | 1.03 | -1.77 | 36.34 | 17.22 | -19.12 | 38.75 | 18.96 | -19.79 |
| Specialized | StarCoder | 4.61 | 1.26 | -3.35 | 28.67 | 10.05 | -18.62 | 30.44 | 10.88 | -19.56 |
| Specialized | CodeLlama-7b-Python | 4.68 | 1.27 | -3.41 | 28.68 | 12.90 | -15.78 | 30.33 | 14.07 | -16.26 |
| Specialized | CodeLlama-34b | 6.24 | 1.58 | _-4.66_ | **46.31** | **22.59** | _-23.72_ | **49.01** | **24.68** | _-24.33_ |
| Specialized | WizardCoder-15b | 6.83 | **3.02** | -3.81 | 28.10 | 9.50 | -18.60 | 29.41 | 10.01 | -19.40 |
| Specialized | CodeLlama-13b | **6.87** | 2.92 | -3.95 | 32.69 | 12.20 | -20.49 | 34.53 | 13.11 | -21.42 |

Table 2: Performance of 23 large language models (LLMs) on object-oriented programming (OOP) tasks. We also report the differences in evaluation results between pass@k and pass@o. (All LLMs are evaluated in zero-shot fashion. For pass@100 and pass@80 scores, we use a temperature of 0.8 and top-p = 0.95. For pass@1 scores, we use a temperature of 0.1 and top-p = 0.95. The best results are highlighted in bold; red indicates the differences between the pass@o and pass@k metrics; underlined values indicate the maximum disparities between the pass@o and pass@k metrics; gray indicates models with a larger number of parameters.)

4.2 Overall Evaluation Result

The evaluation results of the LLMs with temperatures set to 0.1 and 0.8 are presented in Table 2. From the experimental results, we draw the following conclusions:

The OOP capabilities of the existing LLMs fall far short of the ideal state

In Table 2, we can observe that LLMs with strong coding capabilities, e.g., WizardCoder-15b, CodeLlama-7b-Python, and CodeLlama-13b, which achieved scores of 58.12, 40.48, and 35.07, respectively, on the HumanEval code leaderboard (https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard), fall significantly short of the ideal state on the OOP benchmark. WizardCoder-15b, CodeLlama-7b-Python, and CodeLlama-13b scored 3.02, 1.27, and 2.92, respectively, on the OOP benchmark at pass@1, and only 10.01, 14.07, and 13.11, respectively, at pass@100 (pass@o scores). Even the current ChatGPT model, with strong general capabilities, scores 15.69 at pass@1 and 59.80 at pass@100. The results indicate that the potential of existing LLMs in OOP remains largely untapped.

Limitations of pass@k for evaluating OOP

The scores in Table 2 indicate that pass@k does not objectively reflect the OOP performance of LLMs, e.g., the WizardCoder-15b model achieves scores of 6.83, 28.10, and 29.41 using pass@k, while its scores drop to 3.02, 9.50, and 10.01 when using pass@o. The scores of the other LLMs in Table 2 also decline under pass@o, further demonstrating the limitations of pass@k in evaluating OOP.

In addition, we observed a notable ranking reversal. When evaluated using pass@k, Qwen-14b (score 26.28) scored lower than WizardCoder-15b (score 28.10) at pass@80. However, when evaluated using pass@o, Qwen-14b (score 10.58) scored higher than WizardCoder-15b (score 9.50) at pass@80. Analyzing the outputs of Qwen-14b and WizardCoder-15b, we found that Qwen-14b outperforms WizardCoder-15b in correctly generating the required OOP concepts and feature keywords, as illustrated in Figure 4. This again shows that pass@k cannot objectively and fairly reflect OOP evaluation results.
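
The reversal can be checked directly from the Table 2 values at 80. The snippet below only re-does this arithmetic; the variable names are illustrative.

```python
# (pass@k, pass@o) at 80, copied from Table 2.
scores_at_80 = {
    "Qwen-14b": (26.28, 10.58),
    "WizardCoder-15b": (28.10, 9.50),
}

# Delta = pass@o - pass@k, as reported in the table.
delta = {
    model: round(pass_o - pass_k, 2)
    for model, (pass_k, pass_o) in scores_at_80.items()
}

# Rank the two models under each metric (best first).
rank_by_pass_k = sorted(
    scores_at_80, key=lambda m: scores_at_80[m][0], reverse=True
)
rank_by_pass_o = sorted(
    scores_at_80, key=lambda m: scores_at_80[m][1], reverse=True
)
```

Here rank_by_pass_k places WizardCoder-15b first, while rank_by_pass_o places Qwen-14b first, and the computed deltas agree with the Δ column of Table 2.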

A larger model scale does not necessarily perform better on pass@1

In Table 2, when evaluated using pass@o, CodeLlama-34b scores 1.58 at pass@1, whereas CodeLlama-13b scores 2.92. Similarly, CodeLlama-13b-Python scores 1.03 at pass@1, while the smaller CodeLlama-7b-Python scores 1.27. On the other hand, CodeLlama-7b scores 1.20 at pass@1, which is lower than the score achieved by CodeLlama-13b. Taken together, the pass@1 scores of CodeLlama-7b, CodeLlama-13b, CodeLlama-34b, CodeLlama-7b-Python, and CodeLlama-13b-Python indicate that a larger model scale does not necessarily yield a higher pass@1 score.

Figure 4: The case comparison of generation results between Qwen-14b and WizardCoder-15b in the OOP benchmark. We see: 1) Qwen-14b can accurately generate private functions, while WizardCoder-15b cannot; 2) the results generated by Qwen-14b and WizardCoder-15b both pass the evaluation using pass@k; 3) the results generated by Qwen-14b pass the evaluation using pass@o, but the results generated by WizardCoder-15b do not.

4.3 Different-level Evaluation Results

Following the classification of OOP benchmarks in Section 3.2, we conducted evaluations for the three levels of the OOP benchmark, and the results are presented in Tables 3, 4, and 5, respectively. We draw the following conclusions:

LLMs perform better at the simple-level OOP compared to the moderate-level and difficult-level OOP

From the simple-level OOP evaluation results in Table 3, we can see that the evaluation results using pass@k and pass@o are the same. It also indicates that LLMs can comprehend the fundamental concepts and features of OOP, e.g., class, and encapsulation methods (i.e., public function). However, in Tables 4 and 5, LLMs exhibit a weaker understanding of concepts and features related to OOP, e.g., encapsulation methods (i.e., private function), inheritance, and polymorphism, and are unable to generate corresponding code accurately. Detailed descriptions are in Appendix E.

ChatGPT shows a large gap between pass@k and pass@o at the moderate level

From the results in Table 4, we observe that with pass@k evaluation, ChatGPT scores 51.71, 83.30, and 83.67, but with pass@o evaluation, it only achieves scores of 2.53, 51.54, and 54.78. We analyzed the moderate-level OOP results for ChatGPT and found that its understanding of private functions is relatively poor. When evaluated using pass@k, a total of 5551 generated programs pass the test cases. However, when evaluated using pass@o, only 272 of them pass; the remaining 5279 fail to match the pass@o criteria. Upon careful examination, we found that all 5279 failures stem from errors in generating private functions. To further validate the experimental results, we randomly selected the prompts corresponding to three erroneous results and entered them into the web version of ChatGPT for code generation, as illustrated in Figures 13, 14, and 15. We found that the code generated by online ChatGPT (https://chat.openai.com/) also contains private function errors.

ChatGPT performs better at the difficult level than at the moderate level

According to the evaluation results in Tables 4 and 5, we observe that the performance of ChatGPT at the difficult level is stronger than at the moderate level. On difficult-level OOP, ChatGPT scores 19.70, 71.83, and 73.37, whereas on moderate-level OOP, it scores only 2.53, 51.54, and 54.78.

Figure 5: Distribution of search results for ChatGPT and CodeLlama-34b. (In a program, “class” serves as the indicator for class names; if the program does not contain “class”, the LLM failed to generate the class name. Similarly, “def _” and “def __” serve as indicators for private function names, “def” for public function names, and “def __init__” for attribute variable names. In our OOP benchmark, the LLM should ideally generate at least 86,200 “class”, 36,000 “def __” or “def _”, 86,200 “def”, and 70,800 “def __init__”.)

5 Discussion

In this section, we explore the reasons behind the generally lower scores of LLMs in OOP, as well as the applicability of the Chain-of-Thought (CoT) method to OOP.

Why do LLMs score lower on OOP benchmarks?

We use the pass@1 results of ChatGPT and CodeLlama-34b as examples for analysis. Because we instruct LLMs to generate specific class names, private function names, public function names, etc., we searched the outputs of both CodeLlama-34b and ChatGPT for simple keywords, e.g., “class”, “def _”, “def __”, “def __init__”, and “def”. A detailed description of the retrieval process is provided in Appendix F. We compiled and analyzed the distribution of the retrieved keywords, as shown in Figure 5, and conclude that: 1) LLMs have weak knowledge of OOP concepts, e.g., classes and encapsulation methods; 2) LLMs particularly lack an understanding of private functions; 3) there is a noticeable gap between CodeLlama-34b and ChatGPT. A specific example is shown in Figure 11.

The applicability of CoT in OOP.

Taking CodeLlama-13b, StarCoder, and WizardCoder-15b as examples, we apply the few-shot, zero-shot CoT, and few-shot CoT methods to test whether CoT approaches are applicable to OOP, as shown in Table 7. We observe a significant improvement in OOP scores when using the few-shot approach, e.g., CodeLlama-13b achieves scores of 14.50, 48.13, and 49.85 with the few-shot method, improvements of 396.58%, 294.51%, and 280.24%, respectively, over the zero-shot method. In Table 7, we also observe that CodeLlama-13b achieves scores of 1.33, 13.31, and 14.62 with zero-shot CoT, but its pass@1 score of 1.33 is lower than the 2.92 it obtains with zero-shot. Additionally, StarCoder scores 0.28, 6.58, and 7.07 with zero-shot CoT, which are lower than its zero-shot scores of 1.26, 10.05, and 10.88, respectively. The scores of CodeLlama-13b, StarCoder, and WizardCoder-15b with few-shot CoT are also lower than their scores with few-shot. We analyzed the zero-shot and zero-shot CoT results and found that the CoT method can cause the model to hallucinate, preventing it from directly generating the corresponding code, as illustrated in Figure 12. Therefore, CoT strategies need to be designed around the concepts and features of OOP in order to improve OOP code generation by LLMs. Appendix G provides the detailed prompts for few-shot, zero-shot CoT, and few-shot CoT.
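The relative improvements quoted for CodeLlama-13b can be reproduced with a few lines of arithmetic. In this sketch the zero-shot baselines are derived from Table 7 as the few-shot score minus the reported gain (2.92, 12.20, 13.11); the function name `relative_improvement` is only illustrative.

```python
def relative_improvement(new: float, base: float) -> float:
    """Percentage gain of `new` over `base`."""
    return (new - base) / base * 100.0


# CodeLlama-13b pass@o at 1, 80, and 100 samples.
zero_shot = [2.92, 12.20, 13.11]  # few-shot score minus the gain in Table 7
few_shot = [14.50, 48.13, 49.85]

improvements = [
    round(relative_improvement(new, base), 2)
    for new, base in zip(few_shot, zero_shot)
]
print(improvements)  # [396.58, 294.51, 280.24]
```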

6 Conclusion

In this paper, we propose the first OOP evaluation benchmark based on Python, consisting of 431 Python programs that cover key concepts and features of OOP, e.g., classes and encapsulation methods. We also propose the evaluation metric pass@o for the OOP benchmark. pass@o addresses the limitations of pass@k by matching keyword points between the natural language description and the generated program. We evaluate 23 mainstream LLMs using the proposed OOP benchmark and the pass@o metric. Experimental results show that the OOP capability of current LLMs is far from ideal, which reveals that LLMs have room for further improvement. Furthermore, existing LLMs lag behind ChatGPT in OOP. We also find that applying some current improvement strategies directly to the OOP benchmark does not yield significant improvement. In the future, the OOP knowledge of LLMs needs to be strengthened further, especially regarding private functions. We also hope that more researchers will contribute to advancing research in OOP.

Limitations

Our OOP benchmark has several limitations: (1) The benchmark is based on the Python programming language and does not cover other OOP languages. (2) Although the benchmark incorporates crucial concepts such as polymorphism and inheritance, it does not specifically address more intricate scenarios, e.g., multiple inheritance and overloading. (3) While OOP languages hold a significant share, non-OOP languages, e.g., C and Go, also play irreplaceable roles. In future work, we plan to expand the OOP benchmark to cover a broader spectrum. Additionally, we encourage researchers to explore the potential of LLMs through evaluations based on the OOP benchmark.

Ethics Statement

We take ethical considerations very seriously. This paper focuses on establishing benchmarks for OOP to analyze the performance of existing LLMs. Our research reveals that existing LLMs fall far short of ideal performance in OOP. We conducted experiments on open and publicly available LLMs and accurately and objectively report the findings and conclusions of this paper. Therefore, we believe that this study does not raise ethical concerns.

References

Figure 6: Differences between the OOP benchmark and the HumanEval and MBPP benchmarks, with panels (a) HumanEval, (b) MBPP, and (c) OOP (“…” indicates that the few-shot content in MBPP is omitted). We can see that: 1) the HumanEval benchmark requires models to complete code based on the context within the function; 2) the MBPP benchmark directly requires models to generate code based on the prompt requirements; 3) our proposed OOP benchmark requires models to generate code based on the specified prompt as well as the concepts and features of OOP. Therefore, HumanEval and MBPP do not reflect the concepts and features of OOP.
Figure 7: Examples of different levels of object-oriented programming (OOP) tasks: (a) simple-level, (b) moderate-level, (c) difficult-level.
| Type | Model | pass@k (1) | pass@o (1) | Δ(↓) (1) | pass@k (80) | pass@o (80) | Δ(↓) (80) | pass@k (100) | pass@o (100) | Δ(↓) (100) |
|---|---|---|---|---|---|---|---|---|---|---|
| General | Falcon-7b | 0.00 | 0.00 | -0.00 | 1.04 | 1.04 | -0.00 | 1.30 | 1.30 | -0.00 |
| General | Falcon-40b | 0.00 | 0.00 | -0.00 | 5.10 | 5.10 | -0.00 | 5.68 | 5.68 | -0.00 |
| General | Yi-6b | 0.00 | 0.00 | -0.00 | 5.87 | 5.87 | -0.00 | 6.76 | 6.76 | -0.00 |
| General | Llama2-7b | 0.03 | 0.03 | -0.00 | 9.56 | 9.56 | -0.00 | 10.77 | 10.77 | -0.00 |
| General | InternLm-7b | 0.09 | 0.09 | -0.00 | 2.87 | 2.87 | -0.00 | 3.21 | 3.21 | -0.00 |
| General | MPT-7b | 0.13 | 0.13 | -0.00 | 7.03 | 7.03 | -0.00 | 8.13 | 8.13 | -0.00 |
| General | Llama2-13b | 0.32 | 0.32 | -0.00 | 12.05 | 12.05 | -0.00 | 13.39 | 13.39 | -0.00 |
| General | DeepSeek-7b | 0.72 | 0.72 | -0.00 | 24.03 | 24.03 | -0.00 | 26.12 | 26.12 | -0.00 |
| General | Qwen-7b | 3.36 | 3.36 | -0.00 | 30.53 | 30.53 | -0.00 | 31.24 | 31.24 | -0.00 |
| General | Yi-34b | 3.41 | 3.41 | -0.00 | 26.16 | 26.16 | -0.00 | 27.63 | 27.63 | -0.00 |
| General | Llama2-70b | 3.79 | 3.79 | -0.00 | 27.15 | 27.15 | -0.00 | 29.52 | 29.52 | -0.00 |
| General | Qwen-14b | 4.06 | 4.06 | -0.00 | 36.89 | 36.89 | -0.00 | 37.87 | 37.87 | -0.00 |
| General | DeepSeek-67b | 10.36 | 10.36 | -0.00 | 52.75 | 52.75 | -0.00 | 53.48 | 53.48 | -0.00 |
| General | Qwen-72b | 15.12 | 15.12 | -0.00 | 53.88 | 53.88 | -0.00 | 54.66 | 54.66 | -0.00 |
| General | ChatGPT | 37.34 | 37.34 | -0.00 | 54.21 | 54.21 | -0.00 | 54.45 | 54.45 | -0.00 |
| Specialized | GPT_BigCode | 0.34 | 0.34 | -0.00 | 12.28 | 12.28 | -0.00 | 13.63 | 13.63 | -0.00 |
| Specialized | CodeLlama-34b | 4.08 | 4.08 | -0.00 | 47.36 | 47.36 | -0.00 | 48.99 | 48.99 | -0.00 |
| Specialized | CodeLlama-13b-Python | 5.31 | 5.31 | -0.00 | 44.37 | 44.37 | -0.00 | 46.39 | 46.39 | -0.00 |
| Specialized | CodeLlama-7b | 6.38 | 6.38 | -0.00 | 38.44 | 38.44 | -0.00 | 40.02 | 40.02 | -0.00 |
| Specialized | CodeLlama-7b-Python | 6.73 | 6.73 | -0.00 | 43.78 | 43.78 | -0.00 | 45.43 | 45.43 | -0.00 |
| Specialized | StarCoder | 6.99 | 6.99 | -0.00 | 39.76 | 39.76 | -0.00 | 41.28 | 41.28 | -0.00 |
| Specialized | CodeLlama-13b | 16.21 | 16.21 | -0.00 | 47.72 | 47.72 | -0.00 | 48.74 | 48.74 | -0.00 |
| Specialized | WizardCoder-15b | 16.79 | 16.79 | -0.00 | 44.56 | 44.56 | -0.00 | 45.96 | 45.96 | -0.00 |

Table 3: Scores of 23 large language models (LLMs) on simple-level object-oriented programming (OOP) tasks. The number in parentheses in each column header is the number of samples (1, 80, or 100). We also report the differences in evaluation results between pass@k and pass@o. (All LLMs are evaluated in zero-shot fashion. For pass@100 and pass@80 scores, we use a temperature of 0.8 and top-p=0.95. For pass@1 scores, we use a temperature of 0.1 and top-p=0.95. Red indicates the differences evaluated using the pass@o and pass@k metrics; Underlined indicates the maximum disparities evaluated between pass@o and pass@k metrics; Gray indicates models with a larger number of parameters.)
| Type | Model | pass@k (1) | pass@o (1) | Δ(↓) (1) | pass@k (80) | pass@o (80) | Δ(↓) (80) | pass@k (100) | pass@o (100) | Δ(↓) (100) |
|---|---|---|---|---|---|---|---|---|---|---|
| General | Falcon-7b | 0.02 | 0.00 | -0.02 | 0.22 | 0.00 | -0.22 | 0.28 | 0.00 | -0.28 |
| General | Falcon-40b | 0.02 | 0.00 | -0.02 | 0.23 | 0.00 | -0.23 | 0.72 | 0.00 | -0.72 |
| General | Llama2-7b | 0.02 | 0.00 | -0.02 | 5.51 | 0.00 | -5.51 | 6.41 | 0.00 | -6.41 |
| General | InternLm-7b | 0.03 | 0.00 | -0.03 | 1.03 | 0.00 | -1.03 | 1.26 | 0.00 | -1.26 |
| General | Llama2-13b | 0.08 | 0.00 | -0.08 | 11.78 | 0.00 | -11.78 | 13.39 | 0.00 | -13.39 |
| General | Yi-6b | 0.08 | 0.00 | -0.08 | 6.23 | 0.36 | -5.87 | 7.39 | 0.42 | -6.97 |
| General | MPT-7b | 0.61 | 0.00 | -0.61 | 8.16 | 0.00 | -8.16 | 9.38 | 0.00 | -9.38 |
| General | Qwen-7b | 0.80 | 0.00 | -0.80 | 20.79 | 0.00 | -20.79 | 23.27 | 0.00 | -23.27 |
| General | DeepSeek-7b | 1.51 | 0.00 | -1.51 | 15.47 | 0.45 | -15.02 | 17.14 | 0.56 | -16.58 |
| General | Qwen-14b | 1.82 | 0.00 | -1.82 | 37.58 | 5.12 | -32.46 | 40.10 | 6.12 | -33.98 |
| General | Yi-34b | 2.10 | 0.00 | -2.10 | 25.61 | 0.58 | -25.03 | 27.79 | 0.70 | -27.09 |
| General | Llama2-70b | 5.01 | 0.00 | -5.01 | 21.94 | 1.34 | -20.60 | 24.27 | 1.68 | -22.59 |
| General | DeepSeek-67b | 7.89 | 0.00 | -7.89 | 49.79 | 13.03 | -36.76 | 52.43 | 15.30 | -37.13 |
| General | Qwen-72b | 13.02 | 0.28 | -12.74 | 63.41 | 26.97 | -36.44 | 65.21 | 29.71 | -35.50 |
| General | ChatGPT | 51.71 | 2.53 | -49.18 | 83.30 | 51.54 | -31.76 | 83.67 | 54.78 | -28.89 |
| Specialized | GPT_BigCode | 0.08 | 0.00 | -0.08 | 9.22 | 0.67 | -8.55 | 10.55 | 0.84 | -9.71 |
| Specialized | CodeLlama-7b | 3.46 | 0.00 | -3.46 | 36.85 | 3.66 | -33.19 | 39.15 | 4.40 | -34.75 |
| Specialized | CodeLlama-13b-Python | 4.31 | 0.01 | -4.30 | 42.12 | 10.06 | -32.06 | 45.07 | 11.84 | -33.23 |
| Specialized | StarCoder | 8.01 | 0.01 | -8.00 | 44.40 | 4.28 | -40.12 | 46.70 | 5.07 | -41.63 |
| Specialized | CodeLlama-7b-Python | 8.13 | 0.01 | -8.12 | 43.96 | 9.28 | -34.68 | 46.17 | 10.76 | -35.41 |
| Specialized | WizardCoder-15b | 9.10 | 0.00 | -9.10 | 45.25 | 1.29 | -43.96 | 47.41 | 1.50 | -45.91 |
| Specialized | CodeLlama-13b | 9.46 | 0.00 | -9.46 | 51.73 | 7.62 | -44.11 | 54.55 | 9.12 | -45.43 |
| Specialized | CodeLlama-34b | 10.23 | 0.00 | -10.23 | 51.68 | 11.41 | -40.27 | 54.22 | 13.48 | -40.74 |

Table 4: Scores of 23 large language models (LLMs) on moderate-level object-oriented programming (OOP) tasks. The number in parentheses in each column header is the number of samples (1, 80, or 100). We also report the differences in evaluation results between pass@k and pass@o. (All LLMs are evaluated in zero-shot fashion. For pass@100 and pass@80 scores, we use a temperature of 0.8 and top-p=0.95. For pass@1 scores, we use a temperature of 0.1 and top-p=0.95. Red indicates the differences evaluated using the pass@o and pass@k metrics; Underlined indicates the maximum disparities evaluated between pass@o and pass@k metrics; Gray indicates models with a larger number of parameters.)
| Type | Model | pass@k (1) | pass@o (1) | Δ(↓) (1) | pass@k (80) | pass@o (80) | Δ(↓) (80) | pass@k (100) | pass@o (100) | Δ(↓) (100) |
|---|---|---|---|---|---|---|---|---|---|---|
| General | Llama2-7b | 0.00 | 0.00 | -0.00 | 0.00 | 0.00 | -0.00 | 0.00 | 0.00 | -0.00 |
| General | Falcon-7b | 0.00 | 0.00 | -0.00 | 0.22 | 0.00 | -0.22 | 0.28 | 0.00 | -0.28 |
| General | MPT-7b | 0.00 | 0.00 | -0.00 | 0.23 | 0.00 | -0.23 | 0.29 | 0.00 | -0.29 |
| General | Llama2-13b | 0.00 | 0.00 | -0.00 | 0.47 | 0.00 | -0.47 | 0.58 | 0.00 | -0.58 |
| General | Qwen-7b | 0.01 | 0.01 | -0.00 | 2.08 | 0.46 | -1.62 | 2.47 | 0.51 | -1.96 |
| General | InternLm-7b | 0.01 | 0.00 | -0.01 | 0.23 | 0.00 | -0.23 | 0.29 | 0.00 | -0.29 |
| General | Falcon-40b | 0.02 | 0.00 | -0.01 | 0.23 | 0.00 | -0.23 | 0.72 | 0.00 | -0.72 |
| General | Qwen-14b | 0.07 | 0.06 | -0.01 | 9.77 | 4.70 | -5.07 | 11.24 | 5.53 | -5.73 |
| General | Yi-6b | 0.09 | 0.03 | -0.06 | 3.52 | 1.16 | -2.36 | 4.20 | 1.45 | -2.75 |
| General | DeepSeek-7b | 1.51 | 0.00 | -1.51 | 15.47 | 0.45 | -15.02 | 17.14 | 0.56 | -16.58 |
| General | Yi-34b | 1.77 | 1.20 | -0.57 | 16.27 | 8.66 | -7.61 | 17.62 | 9.83 | -7.79 |
| General | Llama2-70b | 1.94 | 1.42 | -0.52 | 17.21 | 11.26 | -5.95 | 19.05 | 12.81 | -6.24 |
| General | Qwen-72b | 7.54 | 4.43 | -3.11 | 53.28 | 36.56 | -16.72 | 56.11 | 38.67 | -17.44 |
| General | DeepSeek-67b | 7.89 | 0.00 | -7.89 | 49.79 | 13.03 | -36.76 | 52.43 | 15.30 | -37.13 |
| General | ChatGPT | 36.52 | 19.70 | -16.82 | 78.94 | 71.83 | -7.11 | 79.95 | 73.37 | -6.58 |
| Specialized | WizardCoder-15b | 0.00 | 0.00 | -0.00 | 3.05 | 2.35 | -0.70 | 3.48 | 2.76 | -0.72 |
| Specialized | CodeLlama-13b | 0.00 | 0.00 | -0.00 | 5.69 | 1.07 | -4.62 | 6.75 | 1.31 | -5.44 |
| Specialized | StarCoder | 0.00 | 0.00 | -0.00 | 7.34 | 2.74 | -4.60 | 8.65 | 3.31 | -5.34 |
| Specialized | GPT_BigCode | 0.01 | 0.00 | -0.01 | 2.08 | 0.23 | -1.85 | 2.54 | 0.29 | -2.25 |
| Specialized | CodeLlama-7b-Python | 0.17 | 0.15 | -0.02 | 5.93 | 2.84 | -3.09 | 7.02 | 3.49 | -3.53 |
| Specialized | CodeLlama-13b-Python | 0.17 | 0.17 | -0.00 | 26.74 | 16.61 | -10.13 | 25.75 | 18.64 | -7.11 |
| Specialized | CodeLlama-7b | 0.18 | 0.13 | -0.05 | 4.65 | 1.77 | -2.88 | 5.64 | 2.18 | -3.46 |
| Specialized | CodeLlama-34b | 3.08 | 2.13 | -0.95 | 40.26 | 27.83 | -12.43 | 43.59 | 30.51 | -13.08 |

Table 5: Scores of 23 large language models (LLMs) on difficult-level object-oriented programming (OOP) tasks. The number in parentheses in each column header is the number of samples (1, 80, or 100). We also report the differences in evaluation results between pass@k and pass@o. (All LLMs are evaluated in zero-shot fashion. For pass@100 and pass@80 scores, we use a temperature of 0.8 and top-p=0.95. For pass@1 scores, we use a temperature of 0.1 and top-p=0.95. Red indicates the differences evaluated using the pass@o and pass@k metrics; Underlined indicates the maximum disparities evaluated between pass@o and pass@k metrics; Gray indicates models with a larger number of parameters.)
Figure 8: Example of a prompt using the zero-shot CoT approach. (The green content guides the model to generate code step by step using the CoT approach.)
Figure 9: Prompt using the few-shot approach. (The green color indicates the added few-shot content.)
Figure 10: Prompt using the few-shot CoT approach. (The green color indicates the added few-shot content; the blue color guides the model to generate code step by step using the CoT approach.)
Figure 11: An example of code generated by CodeLlama-34b and ChatGPT. We can see that CodeLlama-34b did not generate the corresponding class and public function.
Figure 12: Comparison of results generated by zero-shot and zero-shot CoT. We can see that: 1) the zero-shot CoT approach can lead the model to hallucinate, preventing it from generating the corresponding code; 2) with the zero-shot approach, the model is directly prompted to generate the corresponding code.
Figure 13: Case 1 of generating code using the web version of ChatGPT.
Figure 14: Case 2 of generating code using the web version of ChatGPT.
Figure 15: Case 3 of generating code using the web version of ChatGPT.

Appendix A Pass@k calculation process

The calculation process for pass@k is:

$$\text{pass@}k := \mathbb{E}_{\text{Problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right] \tag{3}$$

In Eq. (3), n represents the number of code samples generated for a given problem, and c represents the number of those n samples that pass the tests.
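Eq. (3) can be evaluated directly with binomial coefficients. The sketch below is a minimal illustration rather than the released evaluation script; the function names `pass_at_k` and `mean_pass_at_k` and the sample counts are assumptions made for the example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Per-problem term of Eq. (3): 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average the per-problem term over (n, c) pairs, one per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts: three problems with 10 samples each.
results = [(10, 3), (10, 0), (10, 10)]
print(mean_pass_at_k(results, k=1))
```

With k = 1 the per-problem term reduces to c / n, the fraction of samples that pass.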

Appendix B Limitations of HumanEval and MBPP benchmarks

Existing HumanEval Chen et al. (2021) and MBPP Austin et al. (2021) benchmarks primarily focus on FP to evaluate the programming capabilities of LLMs, as illustrated in Figure 6.

Appendix C Detailed construction process of OOP.

In the process of establishing the OOP benchmark, we hired a total of nine fourth-year undergraduate computer science students. Among them, two students were involved in the data collection process, four students participated in the rewriting process, and two students contributed to the use case construction phase, as shown in Figure 3.

During the data collection process, problems or requirements described in non-English natural language are translated using the Google API, followed by manual verification. In the use case construction phase, we begin by inputting the rewritten prompt into ChatGPT to generate the corresponding code. Subsequently, the generated code is run on test inputs. Finally, the outputs are saved along with the inputs to serve as test cases. However, the code generated by ChatGPT may not always be correct, so it requires manual inspection and correction. In total, we spent $200 on building the OOP benchmark.
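The use case construction step described above (run verified reference code on inputs, then store each input with its output) might be sketched as follows. The helper `build_test_cases`, the `Counter` class, and the record layout are hypothetical and only illustrate the idea; they do not reflect the format used in the released benchmark.

```python
def build_test_cases(reference_cls, method_name, inputs):
    """Run a manually verified reference class on each input and record the output."""
    cases = []
    for init_args, call_args in inputs:
        obj = reference_cls(*init_args)
        expected = getattr(obj, method_name)(*call_args)
        cases.append(
            {"init_args": init_args, "call_args": call_args, "expected": expected}
        )
    return cases


class Counter:
    """Hypothetical reference solution, assumed to be manually checked."""

    def __init__(self, start):
        self.value = start

    def add(self, step):
        self.value += step
        return self.value


# Each input pairs constructor arguments with method-call arguments.
cases = build_test_cases(Counter, "add", [((0,), (5,)), ((10,), (-3,))])
for case in cases:
    print(case)
```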

| Model name | Size | Year | Open-source | Task type |
|---|---|---|---|---|
| Falcon | 7b, 40b | 2023 | | General |
| DeepSeek | 7b, 67b | 2023 | | General |
| Llama2 | 7b, 13b, 70b | 2023 | | General |
| Yi | 6b, 34b | 2023 | | General |
| InternLm | 7b | 2023 | | General |
| MPT | 7b | 2023 | | General |
| Qwen | 7b, 14b, 72b | 2023 | | General |
| ChatGPT | N/A | 2023 | | General |
| GPT_BigCode | 1.12b | 2023 | | Code-specialized |
| CodeLlama | 7b, 13b, 34b | 2023 | | Code-specialized |
| CodeLlama-Python | 7b, 13b | 2023 | | Code-specialized |
| StarCoder | 15b | 2023 | | Code-specialized |
| WizardCoder | 15b | 2023 | | Code-specialized |

Table 6: Overview of the Evaluated Models.

Appendix D The details of 23 LLMs

We have selected a total of 23 mainstream LLMs, including both code-specialized models and general models:

ChatGPT Ouyang et al. (2022b); OpenAI (2023): ChatGPT was released by OpenAI in November 2022 and has been widely recognized for its astonishing conversational generation capabilities. In March 2023, OpenAI released ChatGPT 4.0. In our experiments, we chose ChatGPT 3.5 (gpt-3.5-turbo) to explore its OOP capabilities.

GPT_BigCode Allal et al. (2023): GPT_BigCode, derived from the BigCode project, is a 1.12 billion parameter model trained on subsets of Java, JavaScript, and Python from The Stack.

CodeLlama Roziere et al. (2023): CodeLlama is a series of large-scale code language models based on Llama2 that offers state-of-the-art performance in open modeling, function completion, support for large input contexts, and zero-shot instruction following capabilities for programming tasks. CodeLlama includes the base model (CodeLlama), the Python specialized model (CodeLlama-Python), and the instruction-following model (CodeLlama-Instruct), each available with 7b, 13b, and 34b parameters. In our experiments, we selected the base models with 7b, 13b, and 34b parameters, as well as the Python-specialized models with 7b and 13b parameters.

WizardCoder Luo et al. (2023): WizardCoder is a model fine-tuned using the Evol-Instruct Xu et al. (2023) method based on CodeLlama. WizardCoder includes the base model and the Python specialized model (WizardCoder-Python). The base model comes in 1b, 3b, and 15b variants, while the Python specialized model is available in 7b, 13b, and 34b. In our experiments, we selected the 15b version of the base model.

StarCoder Li et al. (2023): StarCoderBase is trained on The Stack (v1.2) data (footnote 11: https://huggingface.co/datasets/bigcode/the-stack) from GitHub repositories. The StarCoder model is fine-tuned from the StarCoderBase model.

Llama2 Touvron et al. (2023): The Llama2 model was released by the Meta team in July 2023. Llama2 is a large language model (LLM) that has undergone pre-training and fine-tuning, with a range of parameters from 7 billion to 70 billion. In our experiments, we selected models with 7b, 13b, and 70b parameters.

InternLm Team (2023a): InternLM encompasses models designed for practical scenarios. The InternLM model includes both a base model and a chat model with 7b and 20b parameters. In our experiments, we selected the base model with 7b parameters.

MPT Team (2023b): The MPT model is a decoder-style transformer trained by MosaicML. In our experiments, we selected the base model with 7b parameters.

DeepSeek Team (2024): DeepSeek is an LLM based on the power-law scaling, encompassing models with 7b and 67b parameters. In our experiments, we opted to utilize the foundational models with 7b and 67b parameters.

Falcon Almazrouei et al. (2023): The Falcon series models are primarily trained on diverse and high-quality corpora assembled from web data, including the 7b, 40b, and 180b parameter models. In our experiments, we opted to use models with 7b and 40b parameters.

Qwen Bai et al. (2023): The Qwen model is a large language model based on the Transformer architecture, trained on a vast and diverse dataset for pre-training. The dataset encompasses a wide range of types, including extensive web text, professional books, code, and more. During our experiments, we selected the base models with 7b, 14b, and 72b parameters.

Yi (footnote 12: https://01.ai/cn): The Yi series models are developed as bilingual language models with a focus on Chinese and English. Yi models are trained on a 3T multilingual corpus and demonstrate promising prospects in language understanding, common sense reasoning, and reading comprehension. In our experiments, we selected models with 6 billion and 34 billion parameters.

We use 23 mainstream code-specialized and general models with the aim of better illustrating the performance of existing LLMs in OOP. The overview of the evaluated models is presented in Table 6.

Appendix E Analysis of results

In the simple-level OOP results of Table 3, ChatGPT scored 37.34 at pass@1. However, at the difficult and moderate levels, ChatGPT scored only 19.70 and 2.53 at pass@1, respectively. CodeLlama-13b scored 16.21 at pass@1 at the simple level, but only 0.00 and 0.00 at the difficult and moderate levels, respectively. Similarly, WizardCoder-15b scored 16.79 at pass@1 at the simple level, while at the difficult and moderate levels it scored only 0.00 and 0.00, respectively. This indicates that LLMs can comprehend and implement simple classes and public functions. However, their understanding of private functions, inheritance, and polymorphism is relatively weak, which leaves room for improvement.

Appendix F Detailed description of the retrieval process

During the retrieval process, we first search for the class keyword “class” and the attribute-variable keyword “def __init__”. Subsequently, we replace “def __init__” in the generated code snippets with <endoftext>, and finally we search for the private function keywords “def _” and “def __”. This approach prevents attribute variables from being inadvertently counted as private functions during the search for private functions. The search for the public function keyword “def” follows a similar method.
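The masking procedure above can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the authors' script: the function name `count_oop_keywords`, the regular expressions, and the choice to mask private functions before counting public ones (our reading of "a similar method") are all assumptions.

```python
import re

MASK = "<endoftext>"


def count_oop_keywords(code: str) -> dict:
    """Count class, __init__, private, and public definitions via keyword search."""
    counts = {
        "class": len(re.findall(r"\bclass\s", code)),
        "def __init__": code.count("def __init__"),
    }
    # Mask __init__ so it is not mistaken for a private function.
    masked = code.replace("def __init__", MASK)
    # "def _" and "def __" both start with "def _"; note that other dunder
    # methods such as __str__ would also be counted here.
    private_pattern = r"\bdef\s+_"
    counts["private"] = len(re.findall(private_pattern, masked))
    # Mask private functions so that only public ones remain for "def".
    masked = re.sub(private_pattern, MASK, masked)
    counts["public"] = len(re.findall(r"\bdef\s", masked))
    return counts


sample = '''
class Account:
    def __init__(self, balance):
        self.balance = balance

    def _audit(self):
        return True

    def __secret(self):
        return 42

    def deposit(self, amount):
        self.balance += amount
'''

print(count_oop_keywords(sample))
# {'class': 1, 'def __init__': 1, 'private': 2, 'public': 1}
```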

| pass@o | CodeLlama_13b (1) | CodeLlama_13b (80) | CodeLlama_13b (100) | WizardCoder_15b (1) | WizardCoder_15b (80) | WizardCoder_15b (100) | StarCoder (1) | StarCoder (80) | StarCoder (100) |
|---|---|---|---|---|---|---|---|---|---|
| zero-shot CoT | 1.33 (-1.59) | 13.31 (+1.11) | 14.62 (+1.51) | 2.67 (-0.35) | 13.33 (-3.89) | 14.19 (-4.18) | 0.28 (-0.98) | 6.58 (-3.47) | 7.07 (-3.81) |
| few-shot | 14.50 (+11.58) | 48.13 (+35.93) | 49.85 (+36.74) | 17.34 (+14.32) | 48.25 (+38.75) | 49.78 (+39.77) | 14.47 (+13.21) | 46.59 (+36.54) | 48.19 (+37.31) |
| few-shot CoT | 11.06 (-3.44) | 42.30 (-5.83) | 43.79 (-6.06) | 2.91 (-14.43) | 36.40 (-11.85) | 38.61 (-11.17) | 6.51 (-7.96) | 39.71 (-6.88) | 
41.76(-6.43)subscript41.76(-6.43)41.76_{{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}% \pgfsys@color@rgb@stroke{1}{0}{0}\pgfsys@color@rgb@fill{1}{0}{0}\textbf{(-6.43% )}}}41.76 start_POSTSUBSCRIPT (-6.43) end_POSTSUBSCRIPT
Table 7: Performance of the CodeLlama_13b, StarCoder, and WizardCoder_15b models on the OOP benchmark with advanced prompting strategies, i.e., few-shot, zero-shot CoT, and few-shot CoT. We also report, in parentheses, the delta between few-shot CoT and few-shot, between zero-shot CoT and zero-shot, and between few-shot and zero-shot prompting. (Red indicates a decline, while blue indicates an increase.)
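As a worked check of how the parenthesized deltas relate to the scores, the sketch below recomputes the few-shot CoT deltas from the few-shot row. The scores are copied from the first three columns of the two rows above; the variable and function names are illustrative only.

```python
# Scores copied from Table 7 (first three columns); names are illustrative.
few_shot = [14.50, 48.13, 49.85]
few_shot_cot = [11.06, 42.30, 43.79]


def deltas(new, base):
    """Return new - base for each column, rounded to two decimals."""
    return [round(n - b, 2) for n, b in zip(new, base)]


cot_deltas = deltas(few_shot_cot, few_shot)
print(cot_deltas)  # [-3.44, -5.83, -6.06]
```

The negative values match the deltas reported in the few-shot CoT row, i.e., each delta is the few-shot CoT score minus the few-shot score.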

Appendix G Details of using the CoT strategy.

zero-shot CoT. We add "Let’s think step by step" to the zero-shot prompt, so that the LLM reasons step by step before completing the code generation, as shown in Figure 8.

few-shot. We randomly select three samples from MBPP Austin et al. (2021). Because these samples contain only functions and involve no OOP concepts or features, we manually rewrite them into OOP examples based on the five major principles. Finally, we add the constructed examples to the zero-shot prompt to form the few-shot prompt, as shown in Figure 9.
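To illustrate the kind of rewriting described above, the following sketch turns a function-style task into a class-based one. The task, the class name `NumberList`, and its method are hypothetical and are not one of the three MBPP samples actually used; the sketch only shows data and behavior being encapsulated in a class.

```python
# Function-style version (the form that function-only tasks take).
def find_max(numbers):
    """Return the largest element of a non-empty list."""
    return max(numbers)


# Hypothetical OOP-style rewrite: the data is held by the object
# and the behavior is exposed as a public method.
class NumberList:
    def __init__(self, numbers):
        self._numbers = list(numbers)  # private by convention

    def find_max(self):
        """Return the largest stored element."""
        return max(self._numbers)


print(find_max([3, 9, 4]))               # 9
print(NumberList([3, 9, 4]).find_max())  # 9
```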

few-shot CoT. Building on the few-shot prompt, we first instruct the LLM to generate the solution steps for the question and then to follow those steps one by one to complete the code generation, as shown in Figure 10.
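The relationship between the prompting strategies can be sketched as a few small prompt builders. This is a minimal sketch under stated assumptions: only the trigger phrase "Let's think step by step" comes from the text; the function names, the placement of each component, and the wording of the step-generation instruction are hypothetical, and the actual templates are those in Figures 8 to 10.

```python
COT_TRIGGER = "Let's think step by step"  # phrase quoted in the text


def zero_shot(question):
    """Baseline prompt: the task description alone."""
    return question


def zero_shot_cot(question):
    # Assumption: the trigger phrase is placed after the question.
    return f"{zero_shot(question)}\n{COT_TRIGGER}"


def few_shot(question, examples):
    # Assumption: each example is a (question, solution) pair placed
    # before the zero-shot prompt.
    demos = "\n\n".join(f"{q}\n{a}" for q, a in examples)
    return f"{demos}\n\n{zero_shot(question)}"


def few_shot_cot(question, examples):
    # Hypothetical instruction wording; not the authors' exact template.
    instruction = (
        "First list the steps needed to solve the question, "
        "then complete the code step by step."
    )
    return f"{few_shot(question, examples)}\n{instruction}"


examples = [("Example question", "class Example:\n    pass")]
print(few_shot_cot("Target question", examples))
```

Each builder reuses the previous one, mirroring how the text layers few-shot on top of zero-shot and few-shot CoT on top of few-shot.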