Search | arXiv e-print repository

Bugs in Large Language Models Generated Code: An Empirical Study

Authors: Florian Tambon, Arghavan Moradi Dakhel, Amin Nikanjam, Foutse Khomh, Michel C. Desmarais, Giuliano Antoniol

Abstract: Large Language Models (LLMs) for code have gained significant attention recently. They can generate code in different programming languages based on provided prompts, fulfilling a long-lasting dream in Software Engineering (SE), i.e., automatic code generation. Similar to human-written code, LLM-generated code is prone to bugs, and these bugs have not yet been thoroughly examined by the community. Given the increasing adoption of LLM-based code generation tools (e.g., GitHub Copilot) in SE activities, it is critical to understand the characteristics of bugs contained in code generated by LLMs. This paper examines a sample of 333 bugs collected from code generated using three leading LLMs (i.e., CodeGen, PanGu-Coder, and Codex) and identifies the following 10 distinctive bug patterns: Misinterpretations, Syntax Error, Silly Mistake, Prompt-biased code, Missing Corner Case, Wrong Input Type, Hallucinated Object, Wrong Attribute, Incomplete Generation, and Non-Prompted Consideration. The bug patterns are presented in the form of a taxonomy. The identified bug patterns are validated using an online survey with 34 LLM practitioners and researchers. The surveyed participants generally asserted the significance and prevalence of the bug patterns. Researchers and practitioners can leverage these findings to develop effective quality assurance techniques for LLM-generated code. This study sheds light on the distinctive characteristics of LLM-generated code.

Submitted 18 March, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

Comments: 47 pages, 7 figures

arXiv:2403.04896

The magnetic structure of Ce$_3$TiBi$_5$ and its relation to current-induced magnetization

Authors: Nicolas Gauthier, Romain Sibille, Vladimir Pomjakushin, Øystein S. Fjellvåg, James Fraser, Mathieu Desmarais, Andrea D. Bianchi, Jeffrey A. Quilliam

Abstract: The control of magnetization using electric fields has been extensively studied in magnetoelectric multiferroic insulator materials. Changes in magnetization in bulk metals caused by electric currents have attracted less attention. The recently discovered metallic magnet Ce$_3$TiBi$_5$ has been reported to exhibit current-induced magnetization. Here we determined the magnetic structure of Ce$_3$TiBi$_5$ using neutron diffraction, aiming to understand the microscopic origin of this magnetoelectric phenomenon in a metal. We established that the antiferromagnetic order emerging below $T_N=5$ K is a cycloid order described by $P6_3/mcm.1'(0,0,g)00sss$ with small moment sizes of $0.50(2)~\mu_B$ and propagation vector ${\bf k}=(0,0,0.386)$. Surprisingly, the symmetry of this magnetic structure is inconsistent with the presence of current-induced magnetization and potential origins of this inconsistency with previous results are discussed. Additionally, our results suggest that moments order along their hard magnetic direction in Ce$_3$TiBi$_5$, a phenomenon which has been observed in other Kondo systems.

Submitted 7 March, 2024; originally announced March 2024.

Comments: 6 pages, 3 figures, accepted for publication in Phys. Rev. B as a Letter

arXiv:2308.16557

Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing

Authors: Arghavan Moradi Dakhel, Amin Nikanjam, Vahid Majdinasab, Foutse Khomh, Michel C. Desmarais

Abstract: One of the critical phases in software development is software testing. Testing helps with identifying potential bugs and reducing maintenance costs. The goal of automated test generation tools is to ease the development of tests by suggesting efficient bug-revealing tests. Recently, researchers have leveraged Large Language Models (LLMs) of code to generate unit tests. While the code coverage of generated tests was usually assessed, the literature has acknowledged that the coverage is weakly correlated with the efficiency of tests in bug detection. To improve over this limitation, in this paper, we introduce MuTAP for improving the effectiveness of test cases generated by LLMs in terms of revealing bugs by leveraging mutation testing. Our goal is achieved by augmenting prompts with surviving mutants, as those mutants highlight the limitations of test cases in detecting bugs. MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Program Under Test (PUTs). We employ different LLMs within MuTAP and evaluate their performance on different benchmarks. Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets. Among these, 17% remained undetected by both the current state-of-the-art fully automated test generation tool (i.e., Pynguin) and zero-shot/few-shot learning approaches on LLMs. Furthermore, MuTAP achieves a Mutation Score (MS) of 93.57% on synthetic buggy code, outperforming all other approaches in our evaluation. Our findings suggest that although LLMs can serve as a useful tool to generate test cases, they require specific post-processing steps to enhance the effectiveness of the generated test cases which may suffer from syntactic or functional errors and may be ineffective in detecting certain types of bugs and testing corner cases PUTs.

Submitted 31 August, 2023; originally announced August 2023.

Comments: 16 pages, 3 figures

arXiv:2302.07738

Alloprof: a new French question-answer education dataset and its use in an information retrieval case study

Authors: Antoine Lefebvre-Brossard, Stephane Gazaille, Michel C. Desmarais

Abstract: Teachers and students are increasingly relying on online learning resources to supplement the ones provided in school. This increase in the breadth and depth of available resources is a great thing for students, but only provided they are able to find answers to their queries. Question-answering and information retrieval systems have benefited from public datasets to train and evaluate their algorithms, but most of these datasets have been in English text written by and for adults. We introduce a new public French question-answering dataset collected from Alloprof, a Quebec-based primary and high-school help website, containing 29 349 questions and their explanations in a variety of school subjects from 10 368 students, with more than half of the explanations containing links to other questions or some of the 2 596 reference pages on the website. We also present a case study of this dataset in an information retrieval task. This dataset was collected on the Alloprof public forum, with all questions verified for their appropriateness and the explanations verified both for their appropriateness and their relevance to the question. To predict relevant documents, architectures using pre-trained BERT models were fine-tuned and evaluated. This dataset will allow researchers to develop question-answering, information retrieval and other algorithms specifically for the French speaking education context. Furthermore, the range of language proficiency, images, mathematical symbols and spelling mistakes will necessitate algorithms based on a multimodal comprehension. The case study we present as a baseline shows an approach that relies on recent techniques provides an acceptable performance level, but more work is necessary before it can reliably be used and trusted in a production setting.

Submitted 14 April, 2023; v1 submitted 10 February, 2023; originally announced February 2023.

arXiv:2207.05132

doi 10.1016/j.infsof.2023.107218

Dev2vec: Representing Domain Expertise of Developers in an Embedding Space

Authors: Arghavan Moradi Dakhel, Michel C. Desmarais, Foutse Khomh

Abstract: Accurate assessment of the domain expertise of developers is important for assigning the proper candidate to contribute to a project or to attend a job role. Since the potential candidate can come from a large pool, the automated assessment of this domain expertise is a desirable goal. While previous methods have had some success within a single software project, the assessment of a developer's domain expertise from contributions across multiple projects is more challenging. In this paper, we employ doc2vec to represent the domain expertise of developers as embedding vectors. These vectors are derived from different sources that contain evidence of developers' expertise, such as the description of repositories that they contributed, their issue resolving history, and API calls in their commits. We name it dev2vec and demonstrate its effectiveness in representing the technical specialization of developers. Our results indicate that encoding the expertise of developers in an embedding vector outperforms state-of-the-art methods and improves the F1-score up to 21%. Moreover, our findings suggest that "issue resolving history" of developers is the most informative source of information to represent the domain expertise of developers in embedding spaces.

Submitted 11 July, 2022; originally announced July 2022.

Comments: 30 pages, 5 figures

arXiv:2206.15331

GitHub Copilot AI pair programmer: Asset or Liability?

Authors: Arghavan Moradi Dakhel, Vahid Majdinasab, Amin Nikanjam, Foutse Khomh, Michel C. Desmarais, Zhen Ming Jiang

Abstract: Automatic program synthesis is a long-lasting dream in software engineering. Recently, a promising Deep Learning (DL) based solution, called Copilot, has been proposed by OpenAI and Microsoft as an industrial product. Although some studies evaluate the correctness of Copilot solutions and report its issues, more empirical evaluations are necessary to understand how developers can benefit from it effectively. In this paper, we study the capabilities of Copilot in two different programming tasks: (i) generating (and reproducing) correct and efficient solutions for fundamental algorithmic problems, and (ii) comparing Copilot's proposed solutions with those of human programmers on a set of programming tasks. For the former, we assess the performance and functionality of Copilot in solving selected fundamental problems in computer science, like sorting and implementing data structures. In the latter, a dataset of programming problems with human-provided solutions is used. The results show that Copilot is capable of providing solutions for almost all fundamental algorithmic problems, however, some solutions are buggy and non-reproducible. Moreover, Copilot has some difficulties in combining multiple methods to generate a solution. Comparing Copilot to humans, our results show that the correct ratio of humans' solutions is greater than Copilot's suggestions, while the buggy solutions generated by Copilot require less effort to be repaired.

Submitted 14 April, 2023; v1 submitted 30 June, 2022; originally announced June 2022.

Comments: 27 pages, 8 figures

arXiv:1809.08713

Deep Knowledge Tracing and Dynamic Student Classification for Knowledge Tracing

Authors: Sein Minn, Yi Yu, Michel C. Desmarais, Feida Zhu, Jill Jenn Vie

Abstract: In Intelligent Tutoring System (ITS), tracing the student's knowledge state during learning has been studied for several decades in order to provide more supportive learning instructions. In this paper, we propose a novel model for knowledge tracing that i) captures students' learning ability and dynamically assigns students into distinct groups with similar ability at regular time intervals, and ii) combines this information with a Recurrent Neural Network architecture known as Deep Knowledge Tracing. Experimental results confirm that the proposed model is significantly better at predicting student performance than well known state-of-the-art techniques for student modelling.

Submitted 7 January, 2021; v1 submitted 23 September, 2018; originally announced September 2018.

Comments: IEEE International Conference on Data Mining, 2018

Showing 1–7 of 7 results for author: Desmarais, M