Evaluating the Application of Large Language Models to Generate Feedback in Programming Education

1^st Sven Jacobs Computer Science Education
University of Siegen
Siegen, Germany
[email protected] 2^nd Steffen Jaschke Computer Science Education
University of Siegen
Siegen, Germany
[email protected]

Abstract

This study investigates the application of large language models, specifically GPT-4, to enhance programming education. The research outlines the design of a web application that uses GPT-4 to provide feedback on programming tasks, without giving away the solution. A web application for working on programming tasks was developed for the study and evaluated with 51 students over the course of one semester. The results show that most of the feedback generated by GPT-4 effectively addressed code errors. However, challenges with incorrect suggestions and hallucinated issues indicate the need for further improvements.

I Introduction

In courses with numerous exercises, such as programming courses, providing feedback can be time-consuming for teachers. Simultaneously, it can be disadvantageous for learners if they have to wait too long for feedback before they can continue working on the assignment. To address this issue, many automated solutions have been developed in recent years [1]. The development of large language models (LLMs) like GPT-4 [2] has opened up a wide range of new possibilities in this area [3] [4]. GPT-4’s ability to solve introductory programming tasks is already around 95% [5]. It can even solve and explain more difficult tasks effectively [6], and its performance can be further improved with prompting strategies [7]. Google Deepmind’s AlphaCode 2 demonstrates that the combination of fine-tuning and prompting strategies can already reach the 85th percentile on average on the code contest platform Codeforces [8].

While applications like ChatGPT or GitHub Copilot, which use LLMs, can help with programming, their primary purpose is to increase productivity rather than to aid learning. To be effective for skill development, such tools would require specific prompts to ensure that the correct solution is not provided immediately, allowing learners to engage in the problem-solving process. Therefore, we integrated GPT-4 into a new programming practice environment in an introductory computer science course to provide students with timely feedback. The research question for this work is: To what extent is the large language model GPT-4 able to provide feedback for programming education?

II Related work

In the context of programming exercises, a variety of tools are used that already support different types of feedback [9] [10]. The use of LLMs has opened up new possibilities for the automated creation of teaching materials and the analysis of student work, such as the generation of feedback [3].

Hellas et al. [11] compared Codex and GPT-3.5 in generating LLM responses to student help requests. They found that in 55% of student help requests, GPT-3.5 identified and mentioned all actual issues (Codex: 13%). Notably, 99% of the responses included code, despite the model being prompted not to do so. In an attempt to generate feedback as students would do, Kiesler et al. used ChatGPT (March 2023: GPT-3) to examine its responses to incorrect student solutions. In their research 79% of ChatGPT responses contained code and 62% contained misleading information [12]. Aziaz et al.’s research on feedback generation for programming exercises obtained 52% fully correct and complete feedback with GPT-4 Turbo [13], surpassing their previous research with GPT-3.5, which obtained 31% [14]. The feedback they generated almost always contained code. Since it is difficult to avoid code in the generated feedback, Lifton et al. use a second LLM for ”code removal”, which rewrites the generated answer [15]. It appears that the models, including GPT-3.5 and GPT-3.5 Turbo, are biased towards providing a complete solution or code, and they do not consistently follow instructions telling them not to do so [15] [16].

Refer to caption — Figure 1: Tutor Kai user interface

It is difficult to compare existing research as the exact model used is not always specified and there are several different models from OpenAI for GPT-3, GPT-3.5, GPT-4 and GPT-4 Turbo. For example, the GPT-4 Turbo is currently available in two different models, ”gpt-4-0125-preview” and ”gpt-4-1106-preview”. Furthermore, the parameters used, such as temperature, which controls the randomness of the response and can therefore have a large effect, are not always reported.

The input for the LLM (prompt) used also makes a difference [16] [17]. Sometimes only a simple question and the student’s code [12] or additionally the task description [14][13][11] are used for the prompt. Compiler errors [18] and examples of good output (few-shot prompting) [19] can additionally be added for context. Phung et al. show that it can also be helpful to add the result of test cases to the prompt. They also show how GPT-3.5 can be used to simulate a student response to feedback generated by GPT-4. If the generated student response does not pass the task-specific unit tests, this information can be used again to improve the feedback [20].

For the feedback generation in this work, the model GPT-4-0314 (temperature = 0) with a complex prompt is used to prevent code in the answers. The generated feedback was evaluated not only by experts, but also directly by the students.

III Evaluation

To evaluate the extent to which LLMs are able to provide feedback for programming education, we have developed a web application called Tutor Kai (Fig. 1). In Tutor Kai, computer science students can complete weekly tasks for the introductory course ”Object-Oriented and Functional Programming” and receive automated feedback generated by an LLM. Students must also rate the feedback they receive on a scale of 1 to 7.

III-A Evaluation Setup

Within Tutor Kai, students select the current week and one of the associated programming tasks. They are then presented with the task and a code editor that may already contain code. Within the code editor (Fig. 1: top right), the student solves the task and can execute their solution. The solution is compiled and unit tests are run to verify its correctness. The student receives both error messages from the compiler and the results of the unit tests as text (Fig. 1: bottom right).

However, the problem with semantic bugs in particular is that while it is communicated that there is a bug, no information is provided about what it is related to. This is where the strength of LLMs for reasoning, coding, and human like text generation comes into play. Students can therefore request feedback after they have executed the program code. For this purpose, the prompt (Fig 2) is sent to the GPT-4 API of OpenAI containing the task, the programming language, the program code, the compiler output and the result of the unit tests as context. We used the first version of GPT-4, which is currently available under the name ”gpt-4-0314”. The temperature parameter was changed to zero to decrease randomness. The LLM feedback generated in this way must be rated by the student (Fig. 1: bottom left).

III-B Method

When evaluating generated feedback, the question arises as to whether it should be viewed atomically or in its chronological sequence in the context of further code submissions per student. The advantage of the second method would be that it is immediately apparent whether the feedback has helped when the student has received more points in a subsequent code submission. However, it cannot be ruled out that students have consulted other external sources to obtain information or have used a different development environment in combination. The second option would therefore only be practical if it could be ensured that only the evaluation setup presented here was used. Such an investigation in a controlled environment is planned for the future.

Based on the methodology of Hellas et al. [11], we therefore analyzed the feedback atomically using the following categories:

1.

Identification and mention of at least one actual issue
2.

Identification and mention of all issues
3.

Wrong suggestions for improvement
4.

Hallucinated issues
5.

Unnecessary suggestions for improvement
6.

Includes code

Identification and mention of at least one actual issue is present if at least one problem, bug or error that prevents a correct solution is mentioned in the feedback. Likewise, we flagged when all of them were mentioned. Wrong suggestions for improvements are present if the feedback gives a wrong hint or suggests a direction that leads to a wrong solution. Hallucinated issues are issues where non-existent problems, bugs or errors were mentioned in the code submission. Only code snippets that are longer than one line have been marked as code. Individual variables and expressions were not marked, as these are necessary for understanding the feedback. Although these could also be avoided, this would make the feedback longer and more difficult to understand.

To learn more about the students’ perspective on how they perceived the feedback, we collected additional data. After they received feedback and before they could submit new code, they were forced to rate the feedback on a scale of one to seven. Tutor Kai was once offered to students by email for voluntary use.

IV Results

In this section, we present the overall aggregated data over the course of the semester and take a closer look at the feedback that was generated for three specific tasks.

Category	Result
Number of students	51
Number of tasks	26
Number of code submissions	3159
Number of generated feedback	1684
Number of student rated feedback	1243
Average feedback rating	5.54
Average feedback rating (adjusted)	5.05

TABLE I: Aggregated Data of all tasks and students

Task	Description
Capital-Value	Create a recursive function named ‘capitalValue()‘ to calculate the value of a principal amount of 1000 Euros after ‘n‘ years at a constant interest rate.
Maximum-Value	Implement the ‘max()‘ method in the ‘Starter‘ class within the provided code skeleton (Starter.java), which returns the larger of two natural numbers ‘a‘ and ‘b‘, or the common value if both are equal.
Sum	Create a recursive function named ‘summe‘ that takes two integers ‘m‘ and ‘n‘ as arguments and returns the sum of all integers between ‘m‘ and ‘n‘ (inclusive), with the result being 0 if ‘m‘ is greater than ‘n‘.

TABLE II: Task Overview (translated from german)

	Task
Category	Capital-Value	Maximum-Value	Sum	All 3 Tasks
Programming Language	Python	Java	Python
Number of code submissions	175	62	26	263
Number of generated feedback	97	25	15	137
Average Feedback Rating	6.3	5.4	4.8	6.0
Identification and mention of at least one actual issue	73 (75%)	25 (100%)	15 (100%)	113 (87%)
Identification and mention of all actual issues	49 (51%)	23 (92%)	13 (87%)	85 (62%)
Wrong suggestions for improvement	14 (14%)	1 (4%)	1 (7%)	16 (12%)
Hallucinated issues	8 (8%)	0 (0%)	0 (0%)	8 (6%)
Unnecessary suggestion for improvement	8 (8%)	7 (28%)	4 (27%)	19 (14%)
Includes code	4 (4%)	0 (0%)	0 (0%)	4 (3%)

TABLE III: Feedback Evaluation

IV-A Aggregated Data

An analysis of the aggregated data from all students across the 26 tasks reveals that Tutor Kai was extensively utilized by the participants (Table I). Of the 51 students who tried the tool, 25 students used it at least ten times. Twelve students used it more than 100 times. Overall, the students gave the feedback an average rating of 5.54 on a scale of 1 to 7. There were four students with more than 100 ratings who almost always gave the highest rating for feedback. In interviews it became clear that the forced rating of feedback interfered with task completion and disrupted the flow. Therefore, the highest rating was given immediately in order to continue. After removing these four students from the data set, the average feedback rating drops to 5.05. Another strategy used by the students to avoid the feedback rating was to reload the page. For this reason, the number of feedback ratings given by students is lower than the number of feedback.

IV-B Individual task evaluation

The feedback generated on the students’ solutions to three tasks (Table II) was evaluated using the methodology described. Overall, 87% of the feedback identified and mentioned at least one actual issue. In 62% of the feedback, all of them were identified and mentioned (Table III).

Hallucinated issues occurred several times in the Capital-Value task for the following reason. The unit tests associated with the task have expected values that are accurate to the twelfth decimal digit. Due to rounding, correct solutions were recognized as incorrect by the unit tests. GPT-4 receives two pieces of divergent information in the prompt template (Fig. 2). On the one hand, the unit tests indicate that the code is incorrect and on the other hand, the correct solution as code. This leads to GPT-4 hallucinating and describing errors in the correct solution that do not exist [21].

Unnecessary suggestions for improvement usually occurred when the student’s solution was already correct. Using the prompt described (Fig. 2), the model still wants to give feedback. It usually refers to the lack of comments in the solution or suggests alternative ways that could lead to a simpler solution. A positive correlation between these values and the average feedback rating of the students is not evident.

V Discussion

Despite multiple prompts, the issue of code appearing in the feedback could not be completely avoided. This exemplifies that the behavior of LLMs can neither be fully controlled nor predicted. While the inclusion of code in the feedback may be unproblematic in this case, it could potentially become a more significant issue if LLMs were to be used for grading assignments.

Compared to the results of Hellas et al. [11] and Aziaz et al. [13], the feedback from Tutor Kai appears to perform better in terms of fully correct and complete feedback at first glance. However, it is unclear whether this is due to the different assignments, the more extensive prompt including the results of the unit tests in Tutor Kai, or the specific LLM used. To attempt to reproduce and compare the results of similar research projects, the exact model and parameters used would need to be known. Even then, it is possible that the providing company may restrict access to the models.

A problem to be discussed is the deployment and subsequent evaluation of such applications in educational settings. Since the completion of assignments often takes place asynchronously, it cannot be guaranteed that only the evaluated tool was used. To address this, the tools would need to be able to compete with applications like Visual Studio Code in terms of developer experience. During the evaluation of Tutor Kai, the willingness of students to rate the feedback from 1-7 was problematic, even though it only required a single click in the user interface. It remains an open question how this can be improved in the future without negatively impacting the students’ workflow too much.

VI Conclusion

The evaluation of Tutor Kai demonstrates that the feedback generated by GPT-4 already identifies and mentions most of the issues in the code. Simultaneously, the problem of code appearing in the feedback, which related studies [15] [11] have encountered in the past, has been almost entirely resolved. Overall, the students have rated the feedback relatively positively, with an average of 5.05 on a scale from 1 to 7. One issue that was identified is that when students are required to evaluate all feedback, they may seek ways to circumvent this process, potentially distorting the data.

Furthermore, it has been shown that faulty unit tests in combination with a correct student solution can lead to hallucinated issues. In such cases, errors are addressed that are not present in the student’s solution. To avoid this, care must be taken to ensure that GPT-4 does not receive contradictory information.

VII Outlook

In the future, the data already collected will also be evaluated with regard to the different types of feedback [22] [9]. A framework for automating this process will be developed and evaluated. Based on this, faster iterations of models and prompts would then be possible.

It is expected that future models such as GPT-5 or Llama3, which may be released as early as 2024, will be even more capable of formulating and explaining code and providing feedback. The maximum number of tokens that can be processed by the LLM is also likely to increase. Google’s Gemini 1.5 increases this by a factor of eight compared to GPT-4 Turbo, to over one million. This increases the potential for prompting strategies and the use of Retrieval Augmented Generation (RAG) to fill the context window with relevant information about the situational and individual circumstances of students or the content of a lecture. It is therefore already possible to use complete lecture notes or textbooks as input for the LLM when generating feedback. How the increasingly large context windows of LLMs can be filled in a didactically meaningful way in educational contexts needs to be researched more intensively in the future.

References

[1] J. Jeuring, H. Keuning, S. Marwan, D. Bouvier, C. Izu, N. Kiesler, T. Lehtinen, D. Lohr, A. Peterson, and S. Sarsa, “Towards Giving Timely Formative Feedback and Hints to Novice Programmers,” in Proceedings of the 2022 Working Group Reports on Innovation and Technology in Computer Science Education, ser. ITiCSE-WGR ’22. New York, USA: ACM, 2022, pp. 95–115, doi: 10.1145/3571785.3574124.
[2] OpenAI, “GPT-4 Technical Report,” 2023, doi: 10.48550/ARXIV.2303.08774.
[3] J. Prather, P. Denny, J. Leinonen, B. A. Becker, I. Albluwi, M. Craig, H. Keuning, N. Kiesler, T. Kohn, A. Luxton-Reilly, S. MacNeil, A. Petersen, R. Pettit, B. N. Reeves, and J. Savelka, “The Robots Are Here: Navigating the Generative AI Revolution in Computing Education,” in Proceedings of the 2023 Working Group Reports on Innovation and Technology in Computer Science Education. New York, USA: ACM, 2023, pp. 108–159, doi: 10.1145/3623762.3633499.
[4] P. Denny, J. Prather, B. A. Becker, J. Finnie-Ansley, A. Hellas, J. Leinonen, A. Luxton-Reilly, B. N. Reeves, E. A. Santos, and S. Sarsa, “Computing Education in the Era of Generative AI,” Communications of the ACM, no. 67, pp. 55–67, 2023, doi: 10.1145/3624720.
[5] N. Kiesler and D. Schiffner, “Large Language Models in Introductory Programming Education: ChatGPT’s Performance and Implications for Assessments,” 2023, doi: 10.48550/arXiv.2308.08572.
[6] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, and Y. Zhang, “Sparks of Artificial General Intelligence: Early experiments with GPT-4,” 2023, doi: 10.48550/arXiv.2303.12712.
[7] H. Nori, Y. T. Lee, S. Zhang, D. Carignan, R. Edgar, N. Fusi, N. King, J. Larson, Y. Li, W. Liu, R. Luo, S. M. McKinney, R. O. Ness, H. Poon, T. Qin, N. Usuyama, C. White, and E. Horvitz, “Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine,” 2023, doi: 10.48550/arXiv.2311.16452.
[8] AlphaCode, “AlphaCode 2 Technical Report,” 2023.
[9] H. Keuning, J. Jeuring, and B. Heeren, “A Systematic Literature Review of Automated Feedback Generation for Programming Exercises,” ACM Transactions on Computing Education, vol. 19, pp. 1–43, 2019, doi: 10.1145/3231711.
[10] J. C. Paiva, J. P. Leal, and Á. Figueira, “Automated Assessment in Computer Science Education: A State-of-the-Art Review,” ACM Transactions on Computing Education, vol. 22, no. 3, pp. 1–40, 2022, doi: 10.1145/3513140.
[11] A. Hellas, J. Leinonen, S. Sarsa, C. Koutcheme, L. Kujanpää, and J. Sorva, “Exploring the Responses of Large Language Models to Beginner Programmers’ Help Requests,” in Proceedings of the 2023 ACM Conference on International Computing Education Research - Volume 1, ser. ICER ’23, vol. 1. New York, USA: ACM, 2023, pp. 93–105, doi: 10.1145/3568813.3600139.
[12] N. Kiesler, D. Lohr, and H. Keuning, “Exploring the Potential of Large Language Models to Generate Formative Programming Feedback,” in 2023 IEEE Frontiers in Education Conference (FIE). College Station, USA: IEEE, 2023, pp. 1–5, doi: 10.1109/FIE58773.2023.10343457.
[13] I. Azaiz, N. Kiesler, and S. Strickroth, “Feedback-Generation for Programming Exercises With GPT-4,” 2024, doi: 10.48550/arXiv.2403.04449.
[14] I. Azaiz, O. Deckarm, and S. Strickroth, “AI-Enhanced Auto-Correction of Programming Exercises: How Effective is GPT-3.5?” International Journal of Engineering Pedagogy (iJEP), vol. 13, no. 8, pp. 67–83, 2023, doi: 10.3991/ijep.v13i8.45621.
[15] M. Liffiton, B. Sheese, J. Savelka, and P. Denny, “CodeHelp: Using Large Language Models with Guardrails for Scalable Support in Programming Classes,” in Proceedings of the 23rd Koli Calling International Conference on Computing Education Research. Koli, Finland: ACM, 2023, doi: 10.1145/3631802.3631830.
[16] L. Roest, H. Keuning, and J. Jeuring, “Next-Step Hint Generation for Introductory Programming Using Large Language Models,” in Proceedings of the 26th Australasian Computing Education Conference. Sydney, Australia: ACM, 2023, pp. 144–153, doi: 10.1145/3636243.3636259.
[17] E. A. Santos, P. Prasad, and B. A. Becker, “Always Provide Context: The Effects of Code Context on Programming Error Message Enhancement,” in Proceedings of the ACM Conference on Global Computing Education Vol 1. Hyderabad, India: ACM, 2023, pp. 147–153, doi: 10.1145/3576882.3617909.
[18] M. Pankiewicz and R. S. Baker, “Large Language Models (GPT) for automating feedback on programming assignments,” in Proceedings of the 31st International Conference on Computers in Education, 2023, doi: 10.48550/arXiv.2307.00150.
[19] M. Kazemitabaar, R. Ye, X. Wang, A. Z. Henley, P. Denny, M. Craig, and T. Grossman, “CodeAid: Evaluating a Classroom Deployment of an LLM-based Programming Assistant that Balances Student and Educator Needs,” 2024, arXiv: 2401.11314.
[20] T. Phung, V.-A. Pădurean, J. Cambronero, S. Gulwani, T. Kohn, R. Majumdar, A. Singla, and G. Soares, “Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors,” in Proceedings of the 2023 ACM Conference on International Computing Education Research. Chicago, USA: ACM, 2023, pp. 41–42, doi: 10.1145/3568812.3603476.
[21] S. Jacobs and S. Jaschke, “Large Language Models in der Berufsausbildung von IT-Fachkräften,” in INFOS 2023 - Informatikunterricht zwischen Aktualität und Zeitlosigkeit. Würzburg, Germany: Gesellschaft für Informatik, 2023, doi: 10.18420/INFOS2023-017.
[22] S. Narciss, “Feedback Strategies for Interactive Learning Tasks,” in Handbook of Research on Educational Communications and Technology, M. Spector, D. Merill, J. Van Merrienboer, and M. Driscoll, Eds. New York, USA: Lawrence Erlbaum Associates, 2008, pp. 125–144.