Search | arXiv e-print repository

Evaluating Language Models for Generating and Judging Programming Feedback

Authors: Charles Koutcheme, Nicola Dainese, Arto Hellas, Sami Sarsa, Juho Leinonen, Syed Ashraf, Paul Denny

Abstract: The emergence of large language models (LLMs) has transformed research and practice in a wide range of domains. Within the computing education research (CER) domain, LLMs have received plenty of attention especially in the context of learning programming. Much of the work on LLMs in CER has however focused on applying and evaluating proprietary models. In this article, we evaluate the efficiency o… ▽ More The emergence of large language models (LLMs) has transformed research and practice in a wide range of domains. Within the computing education research (CER) domain, LLMs have received plenty of attention especially in the context of learning programming. Much of the work on LLMs in CER has however focused on applying and evaluating proprietary models. In this article, we evaluate the efficiency of open-source LLMs in generating high-quality feedback for programming assignments, and in judging the quality of the programming feedback, contrasting the results against proprietary models. Our evaluations on a dataset of students' submissions to Python introductory programming exercises suggest that the state-of-the-art open-source LLMs (Meta's Llama3) are almost on-par with proprietary models (GPT-4o) in both the generation and assessment of programming feedback. We further demonstrate the efficiency of smaller LLMs in the tasks, and highlight that there are a wide range of LLMs that are accessible even for free for educators and practitioners. △ Less

Submitted 5 July, 2024; originally announced July 2024.

arXiv:2406.04817 [pdf, other]

Experiences from Integrating Large Language Model Chatbots into the Classroom

Authors: Arto Hellas, Juho Leinonen, Leo Leppänen

Abstract: In the present study, we provided students an unfiltered access to a state-of-the-art large language model (LLM) chatbot. The chatbot was intentionally designed to mimic proprietary commercial chatbots such as ChatGPT where the chatbot has not been tailored for the educational context; the underlying engine was OpenAI GPT-4. The chatbot was integrated into online learning materials of three course… ▽ More In the present study, we provided students an unfiltered access to a state-of-the-art large language model (LLM) chatbot. The chatbot was intentionally designed to mimic proprietary commercial chatbots such as ChatGPT where the chatbot has not been tailored for the educational context; the underlying engine was OpenAI GPT-4. The chatbot was integrated into online learning materials of three courses. One of the courses focused on software engineering with LLMs, while the two other courses were not directly related to LLMs. Our results suggest that only a minority of students engage with the chatbot in the courses that do not relate to LLMs. At the same time, unsurprisingly, nearly all students in the LLM-focused course leveraged the chatbot. In all courses, the majority of the LLM usage came from a few superusers, whereas the majority of the students did not heavily use the chatbot even though it was readily available and effectively provided a free access to the OpenAI GPT-4 model. We also observe that in addition to students using the chatbot for course-specific purposes, many use the chatbot for their own purposes. These results suggest that the worst fears of educators -- all students overrelying on LLMs -- did not materialize even when the chatbot access was unfiltered. We finally discuss potential reasons for the low usage, suggesting the need for more tailored and scaffolded LLM experiences targeted for specific types of student use cases. △ Less

Submitted 7 June, 2024; originally announced June 2024.

Comments: 7 pages, 1 figure, 5 tables

arXiv:2405.05347 [pdf, other]

Benchmarking Educational Program Repair

Authors: Charles Koutcheme, Nicola Dainese, Sami Sarsa, Juho Leinonen, Arto Hellas, Paul Denny

Abstract: The emergence of large language models (LLMs) has sparked enormous interest due to their potential application across a range of educational tasks. For example, recent work in programming education has used LLMs to generate learning resources, improve error messages, and provide feedback on code. However, one factor that limits progress within the field is that much of the research uses bespoke da… ▽ More The emergence of large language models (LLMs) has sparked enormous interest due to their potential application across a range of educational tasks. For example, recent work in programming education has used LLMs to generate learning resources, improve error messages, and provide feedback on code. However, one factor that limits progress within the field is that much of the research uses bespoke datasets and different evaluation metrics, making direct comparisons between results unreliable. Thus, there is a pressing need for standardization and benchmarks that facilitate the equitable comparison of competing approaches. One task where LLMs show great promise is program repair, which can be used to provide debugging support and next-step hints to students. In this article, we propose a novel educational program repair benchmark. We curate two high-quality publicly available programming datasets, present a unified evaluation procedure introducing a novel evaluation metric rouge@k for approximating the quality of repairs, and evaluate a set of five recent models to establish baseline performance. △ Less

Submitted 8 May, 2024; originally announced May 2024.

Comments: 15 pages, 2 figures, 3 tables. Non-archival report presented at the NeurIPS'23 Workshop on Generative AI for Education (GAIED)

arXiv:2405.05253 [pdf, other]

Open Source Language Models Can Provide Feedback: Evaluating LLMs' Ability to Help Students Using GPT-4-As-A-Judge

Authors: Charles Koutcheme, Nicola Dainese, Sami Sarsa, Arto Hellas, Juho Leinonen, Paul Denny

Abstract: Large language models (LLMs) have shown great potential for the automatic generation of feedback in a wide range of computing contexts. However, concerns have been voiced around the privacy and ethical implications of sending student work to proprietary models. This has sparked considerable interest in the use of open source LLMs in education, but the quality of the feedback that such open models… ▽ More Large language models (LLMs) have shown great potential for the automatic generation of feedback in a wide range of computing contexts. However, concerns have been voiced around the privacy and ethical implications of sending student work to proprietary models. This has sparked considerable interest in the use of open source LLMs in education, but the quality of the feedback that such open models can produce remains understudied. This is a concern as providing flawed or misleading generated feedback could be detrimental to student learning. Inspired by recent work that has utilised very powerful LLMs, such as GPT-4, to evaluate the outputs produced by less powerful models, we conduct an automated analysis of the quality of the feedback produced by several open source models using a dataset from an introductory programming course. First, we investigate the viability of employing GPT-4 as an automated evaluator by comparing its evaluations with those of a human expert. We observe that GPT-4 demonstrates a bias toward positively rating feedback while exhibiting moderate agreement with human raters, showcasing its potential as a feedback evaluator. Second, we explore the quality of feedback generated by several leading open-source LLMs by using GPT-4 to evaluate the feedback. We find that some models offer competitive performance with popular proprietary LLMs, such as ChatGPT, indicating opportunities for their responsible use in educational settings. △ Less

Submitted 8 May, 2024; originally announced May 2024.

Comments: 7 pages, 4 figures, 2 tables. Accepted for publication at the 29th annual ACM conference on Innovation and Technology in Computer Science Education (ITiCSE 2024)

arXiv:2404.11734 [pdf, other]

doi 10.1145/3639474.3640058

Let's Ask AI About Their Programs: Exploring ChatGPT's Answers To Program Comprehension Questions

Authors: Teemu Lehtinen, Charles Koutcheme, Arto Hellas

Abstract: Recent research has explored the creation of questions from code submitted by students. These Questions about Learners' Code (QLCs) are created through program analysis, exploring execution paths, and then creating code comprehension questions from these paths and the broader code structure. Responding to the questions requires reading and tracing the code, which is known to support students' lear… ▽ More Recent research has explored the creation of questions from code submitted by students. These Questions about Learners' Code (QLCs) are created through program analysis, exploring execution paths, and then creating code comprehension questions from these paths and the broader code structure. Responding to the questions requires reading and tracing the code, which is known to support students' learning. At the same time, computing education researchers have witnessed the emergence of Large Language Models (LLMs) that have taken the community by storm. Researchers have demonstrated the applicability of these models especially in the introductory programming context, outlining their performance in solving introductory programming problems and their utility in creating new learning resources. In this work, we explore the capability of the state-of-the-art LLMs (GPT-3.5 and GPT-4) in answering QLCs that are generated from code that the LLMs have created. Our results show that although the state-of-the-art LLMs can create programs and trace program execution when prompted, they easily succumb to similar errors that have previously been recorded for novice programmers. These results demonstrate the fallibility of these models and perhaps dampen the expectations fueled by the recent LLM hype. At the same time, we also highlight future research possibilities such as using LLMs to mimic students as their behavior can indeed be similar for some specific tasks. △ Less

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2403.09409 [pdf, ps, other]

"Like a Nesting Doll": Analyzing Recursion Analogies Generated by CS Students using Large Language Models

Authors: Seth Bernstein, Paul Denny, Juho Leinonen, Lauren Kan, Arto Hellas, Matt Littlefield, Sami Sarsa, Stephen MacNeil

Abstract: Gras** complex computing concepts often poses a challenge for students who struggle to anchor these new ideas to familiar experiences and understandings. To help with this, a good analogy can bridge the gap between unfamiliar concepts and familiar ones, providing an engaging way to aid understanding. However, creating effective educational analogies is difficult even for experienced instructors.… ▽ More Gras** complex computing concepts often poses a challenge for students who struggle to anchor these new ideas to familiar experiences and understandings. To help with this, a good analogy can bridge the gap between unfamiliar concepts and familiar ones, providing an engaging way to aid understanding. However, creating effective educational analogies is difficult even for experienced instructors. We investigate to what extent large language models (LLMs), specifically ChatGPT, can provide access to personally relevant analogies on demand. Focusing on recursion, a challenging threshold concept, we conducted an investigation analyzing the analogies generated by more than 350 first-year computing students. They were provided with a code snippet and tasked to generate their own recursion-based analogies using ChatGPT, optionally including personally relevant topics in their prompts. We observed a great deal of diversity in the analogies produced with student-prescribed topics, in contrast to the otherwise generic analogies, highlighting the value of student creativity when working with LLMs. Not only did students enjoy the activity and report an improved understanding of recursion, but they described more easily remembering analogies that were personally and culturally relevant. △ Less

Submitted 14 March, 2024; originally announced March 2024.

Comments: 7 pages, 2 figures, ITiCSE 2024 preprint

arXiv:2311.16017 [pdf, other]

Decoding Logic Errors: A Comparative Study on Bug Detection by Students and Large Language Models

Authors: Stephen MacNeil, Paul Denny, Andrew Tran, Juho Leinonen, Seth Bernstein, Arto Hellas, Sami Sarsa, Joanne Kim

Abstract: Identifying and resolving logic errors can be one of the most frustrating challenges for novices programmers. Unlike syntax errors, for which a compiler or interpreter can issue a message, logic errors can be subtle. In certain conditions, buggy code may even exhibit correct behavior -- in other cases, the issue might be about how a problem statement has been interpreted. Such errors can be hard t… ▽ More Identifying and resolving logic errors can be one of the most frustrating challenges for novices programmers. Unlike syntax errors, for which a compiler or interpreter can issue a message, logic errors can be subtle. In certain conditions, buggy code may even exhibit correct behavior -- in other cases, the issue might be about how a problem statement has been interpreted. Such errors can be hard to spot when reading the code, and they can also at times be missed by automated tests. There is great educational potential in automatically detecting logic errors, especially when paired with suitable feedback for novices. Large language models (LLMs) have recently demonstrated surprising performance for a range of computing tasks, including generating and explaining code. These capabilities are closely linked to code syntax, which aligns with the next token prediction behavior of LLMs. On the other hand, logic errors relate to the runtime performance of code and thus may not be as well suited to analysis by LLMs. To explore this, we investigate the performance of two popular LLMs, GPT-3 and GPT-4, for detecting and providing a novice-friendly explanation of logic errors. We compare LLM performance with a large cohort of introductory computing students $(n=964)$ solving the same error detection task. Through a mixed-methods analysis of student and model responses, we observe significant improvement in logic error identification between the previous and current generation of LLMs, and find that both LLM generations significantly outperform students. We outline how such models could be integrated into computing education tools, and discuss their potential for supporting students when learning programming. △ Less

Submitted 27 November, 2023; originally announced November 2023.

arXiv:2309.05669 [pdf, other]

Implications of Edge Computing for Static Site Generation

Authors: Juho Vepsäläinen, Arto Hellas, Petri Vuorimaa

Abstract: Static site generation (SSG) is a common technique in the web development space to create performant websites that are easy to host. Numerous SSG tools exist, and the approach has been complemented by newer approaches, such as Jamstack, that extend its usability. Edge computing represents a new option to extend the usefulness of SSG further by allowing the creation of dynamic sites on top of a sta… ▽ More Static site generation (SSG) is a common technique in the web development space to create performant websites that are easy to host. Numerous SSG tools exist, and the approach has been complemented by newer approaches, such as Jamstack, that extend its usability. Edge computing represents a new option to extend the usefulness of SSG further by allowing the creation of dynamic sites on top of a static backdrop, providing dynamic resources close to the user. In this paper, we explore the impact of the recent developments in the edge computing space and consider its implications for SSG. △ Less

Submitted 8 September, 2023; originally announced September 2023.

Comments: 14 pages, 3 figures, 1 table, approved for WEBIST 2023

ACM Class: D.2.11

arXiv:2309.04188 [pdf, other]

The State of Disappearing Frameworks in 2023

Authors: Juho Vepsäläinen, Arto Hellas, Petri Vuorimaa

Abstract: Disappearing frameworks represent a new type of thinking for web development. In the current mainstream JavaScript frameworks, the focus has been on developer experience at the cost of user experience. Disappearing frameworks shift the focus by aiming to deliver as little, even zero, JavaScript to the client. In this paper, we look at the options available in the ecosystem in mid-2023 and characte… ▽ More Disappearing frameworks represent a new type of thinking for web development. In the current mainstream JavaScript frameworks, the focus has been on developer experience at the cost of user experience. Disappearing frameworks shift the focus by aiming to deliver as little, even zero, JavaScript to the client. In this paper, we look at the options available in the ecosystem in mid-2023 and characterize them in terms of functionality and features to provide a state-of-the-art view of the trend. We found that the frameworks rely heavily on compilers, often support progressive enhancement, and most of the time support static output. While solutions like Astro are UI library agnostic, others, such as Marko, are more opinionated. △ Less

Submitted 8 September, 2023; originally announced September 2023.

Comments: 15 pages, 1 figure, 2 tables, approved for WEBIST 2023

ACM Class: D.2.11

arXiv:2306.10509 [pdf, other]

Can We Trust AI-Generated Educational Content? Comparative Analysis of Human and AI-Generated Learning Resources

Authors: Paul Denny, Hassan Khosravi, Arto Hellas, Juho Leinonen, Sami Sarsa

Abstract: As an increasing number of students move to online learning platforms that deliver personalized learning experiences, there is a great need for the production of high-quality educational content. Large language models (LLMs) appear to offer a promising solution to the rapid creation of learning materials at scale, reducing the burden on instructors. In this study, we investigated the potential for… ▽ More As an increasing number of students move to online learning platforms that deliver personalized learning experiences, there is a great need for the production of high-quality educational content. Large language models (LLMs) appear to offer a promising solution to the rapid creation of learning materials at scale, reducing the burden on instructors. In this study, we investigated the potential for LLMs to produce learning resources in an introductory programming context, by comparing the quality of the resources generated by an LLM with those created by students as part of a learnersourcing activity. Using a blind evaluation, students rated the correctness and helpfulness of resources generated by AI and their peers, after both were initially provided with identical exemplars. Our results show that the quality of AI-generated resources, as perceived by students, is equivalent to the quality of resources generated by their peers. This suggests that AI-generated resources may serve as viable supplementary material in certain contexts. Resources generated by LLMs tend to closely mirror the given exemplars, whereas student-generated resources exhibit greater variety in terms of content length and specific syntax features used. The study highlights the need for further research exploring different types of learning resources and a broader range of subject areas, and understanding the long-term impact of AI-generated resources on learning outcomes. △ Less

Submitted 3 July, 2023; v1 submitted 18 June, 2023; originally announced June 2023.

arXiv:2306.05715 [pdf, ps, other]

doi 10.1145/3568813.3600139

Exploring the Responses of Large Language Models to Beginner Programmers' Help Requests

Authors: Arto Hellas, Juho Leinonen, Sami Sarsa, Charles Koutcheme, Lilja Kujanpää, Juha Sorva

Abstract: Background and Context: Over the past year, large language models (LLMs) have taken the world by storm. In computing education, like in other walks of life, many opportunities and threats have emerged as a consequence. Objectives: In this article, we explore such opportunities and threats in a specific area: responding to student programmers' help requests. More specifically, we assess how good… ▽ More Background and Context: Over the past year, large language models (LLMs) have taken the world by storm. In computing education, like in other walks of life, many opportunities and threats have emerged as a consequence. Objectives: In this article, we explore such opportunities and threats in a specific area: responding to student programmers' help requests. More specifically, we assess how good LLMs are at identifying issues in problematic code that students request help on. Method: We collected a sample of help requests and code from an online programming course. We then prompted two different LLMs (OpenAI Codex and GPT-3.5) to identify and explain the issues in the students' code and assessed the LLM-generated answers both quantitatively and qualitatively. Findings: GPT-3.5 outperforms Codex in most respects. Both LLMs frequently find at least one actual issue in each student program (GPT-3.5 in 90% of the cases). Neither LLM excels at finding all the issues (GPT-3.5 finding them 57% of the time). False positives are common (40% chance for GPT-3.5). The advice that the LLMs provide on the issues is often sensible. The LLMs perform better on issues involving program logic rather than on output formatting. Model solutions are frequently provided even when the LLM is prompted not to. LLM responses to prompts in a non-English language are only slightly worse than responses to English prompts. Implications: Our results continue to highlight the utility of LLMs in programming education. At the same time, the results highlight the unreliability of LLMs: LLMs make some of the same mistakes that students do, perhaps especially when formatting output as required by automated assessment systems. Our study informs teachers interested in using LLMs as well as future efforts to customize LLMs for the needs of programming education. △ Less

Submitted 9 June, 2023; originally announced June 2023.

Comments: 13 pages, 1 figure. To be published in Proceedings of the 2023 ACM Conference on International Computing Education Research V.1 (ICER '23 V1)

arXiv:2306.02608 [pdf, other]

Computing Education in the Era of Generative AI

Authors: Paul Denny, James Prather, Brett A. Becker, James Finnie-Ansley, Arto Hellas, Juho Leinonen, Andrew Luxton-Reilly, Brent N. Reeves, Eddie Antonio Santos, Sami Sarsa

Abstract: The computing education community has a rich history of pedagogical innovation designed to support students in introductory courses, and to support teachers in facilitating student learning. Very recent advances in artificial intelligence have resulted in code generation models that can produce source code from natural language problem descriptions -- with impressive accuracy in many cases. The wi… ▽ More The computing education community has a rich history of pedagogical innovation designed to support students in introductory courses, and to support teachers in facilitating student learning. Very recent advances in artificial intelligence have resulted in code generation models that can produce source code from natural language problem descriptions -- with impressive accuracy in many cases. The wide availability of these models and their ease of use has raised concerns about potential impacts on many aspects of society, including the future of computing education. In this paper, we discuss the challenges and opportunities such models present to computing educators, with a focus on introductory programming classrooms. We summarize the results of two recent articles, the first evaluating the performance of code generation models on typical introductory-level programming problems, and the second exploring the quality and novelty of learning resources generated by these models. We consider likely impacts of such models upon pedagogical practice in the context of the most recent advances at the time of writing. △ Less

Submitted 5 June, 2023; originally announced June 2023.

Comments: Accepted for publication as a Contributed Article in Communications of the ACM (CACM)

arXiv:2304.03938 [pdf, other]

doi 10.1145/3587102.3588785

Comparing Code Explanations Created by Students and Large Language Models

Authors: Juho Leinonen, Paul Denny, Stephen MacNeil, Sami Sarsa, Seth Bernstein, Joanne Kim, Andrew Tran, Arto Hellas

Abstract: Reasoning about code and explaining its purpose are fundamental skills for computer scientists. There has been extensive research in the field of computing education on the relationship between a student's ability to explain code and other skills such as writing and tracing code. In particular, the ability to describe at a high-level of abstraction how code will behave over all possible inputs cor… ▽ More Reasoning about code and explaining its purpose are fundamental skills for computer scientists. There has been extensive research in the field of computing education on the relationship between a student's ability to explain code and other skills such as writing and tracing code. In particular, the ability to describe at a high-level of abstraction how code will behave over all possible inputs correlates strongly with code writing skills. However, develo** the expertise to comprehend and explain code accurately and succinctly is a challenge for many students. Existing pedagogical approaches that scaffold the ability to explain code, such as producing exemplar code explanations on demand, do not currently scale well to large classrooms. The recent emergence of powerful large language models (LLMs) may offer a solution. In this paper, we explore the potential of LLMs in generating explanations that can serve as examples to scaffold students' ability to understand and explain code. To evaluate LLM-created explanations, we compare them with explanations created by students in a large course ($n \approx 1000$) with respect to accuracy, understandability and length. We find that LLM-created explanations, which can be produced automatically on demand, are rated as being significantly easier to understand and more accurate summaries of code than student-created explanations. We discuss the significance of this finding, and suggest how such models can be incorporated into introductory programming education. △ Less

Submitted 8 April, 2023; originally announced April 2023.

Comments: 8 pages, 3 figures. To be published in Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1

arXiv:2304.01947 [pdf, other]

doi 10.1007/978-3-031-34444-2_23

The Rise of Disappearing Frameworks in Web Development

Authors: Juho Vepsäläinen, Arto Hellas, Petri Vuorimaa

Abstract: The evolution of the web can be characterized as an emergence of frameworks paving the way from static websites to dynamic web applications. As the scope of web applications has grown, new technical challenges have emerged, leading to the need for new solutions. The latest of these developments is the rise of so-called disappearing web frameworks that question the axioms of earlier generations of… ▽ More The evolution of the web can be characterized as an emergence of frameworks paving the way from static websites to dynamic web applications. As the scope of web applications has grown, new technical challenges have emerged, leading to the need for new solutions. The latest of these developments is the rise of so-called disappearing web frameworks that question the axioms of earlier generations of web frameworks, providing benefits of the early web and simple static sites. △ Less

Submitted 15 June, 2023; v1 submitted 3 April, 2023; originally announced April 2023.

Comments: 9 pages, 1 figure, ICWE 2023

ACM Class: D.2.11

Journal ref: ICWE 2023. Lecture Notes in Computer Science, vol 13893. Springer, Cham

arXiv:2301.07509 [pdf, other]

Coverage of Course Topics in Learnersourced SQL Exercises

Authors: Nea Pirttinen, Arto Hellas, Juho Leinonen

Abstract: Learnersourcing is a common task in modern computing classrooms, where it is used, for example, for the creation of educational resources such as multiple-choice questions and programming exercises. One less studied type of learnersourced artefact is SQL exercises. In this work, we explore how well different SQL topics are covered by learnersourced SQL exercises. Covering most course topics would… ▽ More Learnersourcing is a common task in modern computing classrooms, where it is used, for example, for the creation of educational resources such as multiple-choice questions and programming exercises. One less studied type of learnersourced artefact is SQL exercises. In this work, we explore how well different SQL topics are covered by learnersourced SQL exercises. Covering most course topics would allow students to practice the full content of the course by completing learnersourced exercises. Our results suggest that learnersourcing can be used to create a large pool of SQL exercises that cover most of the topics of the course. △ Less

Submitted 18 January, 2023; originally announced January 2023.

arXiv:2212.07763 [pdf, ps, other]

Synthesizing Research on Programmers' Mental Models of Programs, Tasks and Concepts -- a Systematic Literature Review

Authors: Ava Heinonen, Bettina Lehtelä, Arto Hellas, Fabian Fagerholm

Abstract: Programmers' mental models represent their knowledge and understanding of programs, programming concepts, and programming in general. They guide programmers' work and influence their task performance. Understanding mental models is important for designing work systems and practices that support programmers. Although the importance of programmers' mental models is widely acknowledged, research on m… ▽ More Programmers' mental models represent their knowledge and understanding of programs, programming concepts, and programming in general. They guide programmers' work and influence their task performance. Understanding mental models is important for designing work systems and practices that support programmers. Although the importance of programmers' mental models is widely acknowledged, research on mental models has decreased over the years. The results are scattered and do not take into account recent developments in software engineering. We analyze the state of research into programmers' mental models and provide an overview of existing research. We connect results on mental models from different strands of research to form a more unified knowledge base on the topic. We conducted a systematic literature review on programmers' mental models. We analyzed literature addressing mental models in different contexts, including mental models of programs, programming tasks, and programming concepts. Using nine search engines, we found 3678 articles (excluding duplicates). 84 were selected for further analysis. Using the snowballing technique, we obtained a final result set containing 187 articles. We show that the literature shares a kernel of shared understanding of mental models. By collating and connecting results on mental models from different fields of research, we uncovered some well-researched aspects, which we argue are fundamental characteristics of programmers' mental models. This work provides a basis for future work on mental models. The research field on programmers' mental models still faces many challenges rising from a lack of a shared knowledge base and poorly defined constructs. We created a unified knowledge base on the topic. We also point to directions for future studies. In particular, we call for studies that examine programmers working with modern practices and tools. △ Less

Submitted 15 December, 2022; originally announced December 2022.

Comments: Submitted to Information and Software Technology

ACM Class: F.3.2; F.3.3

arXiv:2212.05113 [pdf, other]

doi 10.1145/3545947.3569630

Automatically Generating CS Learning Materials with Large Language Models

Authors: Stephen MacNeil, Andrew Tran, Juho Leinonen, Paul Denny, Joanne Kim, Arto Hellas, Seth Bernstein, Sami Sarsa

Abstract: Recent breakthroughs in Large Language Models (LLMs), such as GPT-3 and Codex, now enable software developers to generate code based on a natural language prompt. Within computer science education, researchers are exploring the potential for LLMs to generate code explanations and programming assignments using carefully crafted prompts. These advances may enable students to interact with code in ne… ▽ More Recent breakthroughs in Large Language Models (LLMs), such as GPT-3 and Codex, now enable software developers to generate code based on a natural language prompt. Within computer science education, researchers are exploring the potential for LLMs to generate code explanations and programming assignments using carefully crafted prompts. These advances may enable students to interact with code in new ways while hel** instructors scale their learning materials. However, LLMs also introduce new implications for academic integrity, curriculum design, and software engineering careers. This workshop will demonstrate the capabilities of LLMs to help attendees evaluate whether and how LLMs might be integrated into their pedagogy and research. We will also engage attendees in brainstorming to consider how LLMs will impact our field. △ Less

Submitted 9 December, 2022; originally announced December 2022.

Comments: In Proceedings of the 54th ACM Technical Symposium on Computing Science Education

arXiv:2211.04715 [pdf, other]

Robosourcing Educational Resources -- Leveraging Large Language Models for Learnersourcing

Authors: Paul Denny, Sami Sarsa, Arto Hellas, Juho Leinonen

Abstract: In this article, we introduce and evaluate the concept of robosourcing for creating educational content. Robosourcing lies in the intersection of crowdsourcing and large language models, where instead of a crowd of humans, requests to large language models replace some of the work traditionally performed by the crowd. Robosourcing includes a human-in-the-loop to provide priming (input) as well as… ▽ More In this article, we introduce and evaluate the concept of robosourcing for creating educational content. Robosourcing lies in the intersection of crowdsourcing and large language models, where instead of a crowd of humans, requests to large language models replace some of the work traditionally performed by the crowd. Robosourcing includes a human-in-the-loop to provide priming (input) as well as to evaluate and potentially adjust the generated artefacts; these evaluations could also be used to improve the large language models. We propose a system to outline the robosourcing process. We further study the feasibility of robosourcing in the context of education by conducting an evaluation of robosourced and programming exercises, generated using OpenAI Codex. Our results suggest that robosourcing could significantly reduce human effort in creating diverse educational content while maintaining quality similar to human-created content. △ Less

Submitted 9 November, 2022; originally announced November 2022.

arXiv:2211.02265 [pdf, other]

Experiences from Using Code Explanations Generated by Large Language Models in a Web Software Development E-Book

Authors: Stephen MacNeil, Andrew Tran, Arto Hellas, Joanne Kim, Sami Sarsa, Paul Denny, Seth Bernstein, Juho Leinonen

Abstract: Advances in natural language processing have resulted in large language models (LLMs) that are capable of generating understandable and sensible written text. Recent versions of these models, such as OpenAI Codex and GPT-3, can generate code and code explanations. However, it is unclear whether and how students might engage with such explanations. In this paper, we report on our experiences genera… ▽ More Advances in natural language processing have resulted in large language models (LLMs) that are capable of generating understandable and sensible written text. Recent versions of these models, such as OpenAI Codex and GPT-3, can generate code and code explanations. However, it is unclear whether and how students might engage with such explanations. In this paper, we report on our experiences generating multiple code explanation types using LLMs and integrating them into an interactive e-book on web software development. We modified the e-book to make LLM-generated code explanations accessible through buttons next to code snippets in the materials, which allowed us to track the use of the explanations as well as to ask for feedback on their utility. Three different types of explanations were available for students for each explainable code snippet; a line-by-line explanation, a list of important concepts, and a high-level summary of the code. Our preliminary results show that all varieties of explanations were viewed by students and that the majority of students perceived the code explanations as helpful to them. However, student engagement appeared to vary by code snippet complexity, explanation type, and code snippet length. Drawing on our experiences, we discuss future directions for integrating explanations generated by LLMs into existing computer science classrooms. △ Less

Submitted 4 November, 2022; originally announced November 2022.

arXiv:2210.11630 [pdf, ps, other]

doi 10.1145/3545945.3569770

Using Large Language Models to Enhance Programming Error Messages

Authors: Juho Leinonen, Arto Hellas, Sami Sarsa, Brent Reeves, Paul Denny, James Prather, Brett A. Becker

Abstract: A key part of learning to program is learning to understand programming error messages. They can be hard to interpret and identifying the cause of errors can be time-consuming. One factor in this challenge is that the messages are typically intended for an audience that already knows how to program, or even for programming environments that then use the information to highlight areas in code. Rese… ▽ More A key part of learning to program is learning to understand programming error messages. They can be hard to interpret and identifying the cause of errors can be time-consuming. One factor in this challenge is that the messages are typically intended for an audience that already knows how to program, or even for programming environments that then use the information to highlight areas in code. Researchers have been working on making these errors more novice friendly since the 1960s, however progress has been slow. The present work contributes to this stream of research by using large language models to enhance programming error messages with explanations of the errors and suggestions on how to fix the error. Large language models can be used to create useful and novice-friendly enhancements to programming error messages that sometimes surpass the original programming error messages in interpretability and actionability. These results provide further evidence of the benefits of large language models for computing educators, highlighting their use in areas known to be challenging for students. We further discuss the benefits and downsides of large language models and highlight future streams of research for enhancing programming error messages. △ Less

Submitted 20 October, 2022; originally announced October 2022.

Comments: 7 pages, accepted for publication at SIGCSE TS 2023

arXiv:2206.11861 [pdf, other]

doi 10.1145/3501385.3543957

Automatic Generation of Programming Exercises and Code Explanations using Large Language Models

Authors: Sami Sarsa, Paul Denny, Arto Hellas, Juho Leinonen

Abstract: This article explores the natural language generation capabilities of large language models with application to the production of two types of learning resources common in programming courses. Using OpenAI Codex as the large language model, we create programming exercises (including sample solutions and test cases) and code explanations, assessing these qualitatively and quantitatively. Our result… ▽ More This article explores the natural language generation capabilities of large language models with application to the production of two types of learning resources common in programming courses. Using OpenAI Codex as the large language model, we create programming exercises (including sample solutions and test cases) and code explanations, assessing these qualitatively and quantitatively. Our results suggest that the majority of the automatically generated content is both novel and sensible, and in some cases ready to use as is. When creating exercises we find that it is remarkably easy to influence both the programming concepts and the contextual themes they contain, simply by supplying keywords as input to the model. Our analysis suggests that there is significant value in massive generative machine learning models as a tool for instructors, although there remains a need for some oversight to ensure the quality of the generated content before it is delivered to students. We further discuss the implications of OpenAI Codex and similar tools for introductory programming education and highlight future research streams that have the potential to improve the quality of the educational experience for both teachers and students alike. △ Less

Submitted 26 June, 2022; v1 submitted 3 June, 2022; originally announced June 2022.

Comments: 18 pages, 1 figure, accepted in ICER

arXiv:2112.15072 [pdf, other]

Empirical Evaluation of Deep Learning Models for Knowledge Tracing: Of Hyperparameters and Metrics on Performance and Replicability

Authors: Sami Sarsa, Juho Leinonen, Arto Hellas

Abstract: We review and evaluate a body of deep learning knowledge tracing (DLKT) models with openly available and widely-used data sets, and with a novel data set of students learning to program. The evaluated knowledge tracing models include Vanilla-DKT, two Long Short-Term Memory Deep Knowledge Tracing (LSTM-DKT) variants, two Dynamic Key-Value Memory Network (DKVMN) variants, and Self-Attentive Knowledg… ▽ More We review and evaluate a body of deep learning knowledge tracing (DLKT) models with openly available and widely-used data sets, and with a novel data set of students learning to program. The evaluated knowledge tracing models include Vanilla-DKT, two Long Short-Term Memory Deep Knowledge Tracing (LSTM-DKT) variants, two Dynamic Key-Value Memory Network (DKVMN) variants, and Self-Attentive Knowledge Tracing (SAKT). As baselines, we evaluate simple non-learning models, logistic regression and Bayesian Knowledge Tracing (BKT). To evaluate how different aspects of DLKT models influence model performance, we test input and output layer variations found in the compared models that are independent of the main architectures. We study maximum attempt count options, including filtering out long attempt sequences, that have been implicitly and explicitly used in prior studies. We contrast the observed performance variations against variations from non-model properties such as randomness and hardware. Performance of models is assessed using multiple metrics, whereby we also contrast the impact of the choice of metric on model performance. The key contributions of this work are: Evidence that DLKT models generally outperform more traditional models, but not necessarily by much and not always; Evidence that even simple baselines with little to no predictive value may outperform DLKT models, especially in terms of accuracy -- highlighting importance of selecting proper baselines for comparison; Disambiguation of properties that affect performance in DLKT models including metric choice, input and output layer variations, common hyperparameters, random seeding and hardware; Discussion of issues in replicability when evaluating DLKT models, including discrepancies in prior reported results and methodology. Model implementations, evaluation code, and data are published as a part of this work. △ Less

Submitted 5 April, 2022; v1 submitted 30 December, 2021; originally announced December 2021.

Comments: 70 pages, 8 figures, submitted to JEDM, added acknowledgments, modified after first round of review

ACM Class: K.3; I.2

arXiv:2103.01752 [pdf, other]

Morning or Evening? An Examination of Circadian Rhythms of CS1 Students

Authors: Albina Zavgorodniaia, Raj Shrestha, Juho Leinonen, Arto Hellas, John Edwards

Abstract: Circadian rhythms are the cycles of our internal clock that play a key role in governing when we sleep and when we are active. A related concept is chronotype, which is a person's natural tendency toward activity at certain times of day and typically governs when the individual is most alert and productive. In this work we investigate chronotypes in the setting of an Introductory Computer Programm… ▽ More Circadian rhythms are the cycles of our internal clock that play a key role in governing when we sleep and when we are active. A related concept is chronotype, which is a person's natural tendency toward activity at certain times of day and typically governs when the individual is most alert and productive. In this work we investigate chronotypes in the setting of an Introductory Computer Programming (CS1) course. Using keystroke data collected from students we investigate the existence of chronotypes through unsupervised learning. The chronotypes we find align with those of typical populations reported in the literature and our results support correlations of certain chronotypes to academic achievement. We also find a lack of support for the still-popular stereotype of a computer programmer as a night owl. The analyses are conducted on data from two universities, one in the US and one in Europe, that use different teaching methods. In comparison of the two contexts, we look into programming assignment design and administration that may promote better programming practices among students in terms of procrastination and effort. △ Less

Submitted 1 March, 2021; originally announced March 2021.

Showing 1–23 of 23 results for author: Hellas, A