-
Automatic Programming: Large Language Models and Beyond
Authors:
Michael R. Lyu,
Baishakhi Ray,
Abhik Roychoudhury,
Shin Hwei Tan,
Patanamon Thongtanunam
Abstract:
Automatic programming has seen increasing popularity due to the emergence of tools like GitHub Copilot which rely on Large Language Models (LLMs). At the same time, automatically generated code faces challenges during deployment due to concerns around quality and trust. In this article, we study automated coding in a general sense and study the concerns around code quality, security and related is…
▽ More
Automatic programming has seen increasing popularity due to the emergence of tools like GitHub Copilot which rely on Large Language Models (LLMs). At the same time, automatically generated code faces challenges during deployment due to concerns around quality and trust. In this article, we study automated coding in a general sense and study the concerns around code quality, security and related issues of programmer responsibility. These are key issues for organizations while deciding on the usage of automatically generated code. We discuss how advances in software engineering such as program repair and analysis can enable automatic programming. We conclude with a forward looking view, focusing on the programming environment of the near future, where programmers may need to switch to different roles to fully utilize the power of automatic programming. Automated repair of automatically generated programs from LLMs, can help produce higher assurance code from LLMs, along with evidence of assurance
△ Less
Submitted 15 May, 2024; v1 submitted 3 May, 2024;
originally announced May 2024.
-
A Systematic Literature Review on Reasons and Approaches for Accurate Effort Estimations in Agile
Authors:
Jirat Pasuksmit,
Patanamon Thongtanunam,
Shanika Karunasekera
Abstract:
Background: Accurate effort estimation is crucial for planning in Agile iterative development. Agile estimation generally relies on consensus-based methods like planning poker, which require less time and information than other formal methods (e.g., COSMIC) but are prone to inaccuracies. Understanding the common reasons for inaccurate estimations and how proposed approaches can assist practitioner…
▽ More
Background: Accurate effort estimation is crucial for planning in Agile iterative development. Agile estimation generally relies on consensus-based methods like planning poker, which require less time and information than other formal methods (e.g., COSMIC) but are prone to inaccuracies. Understanding the common reasons for inaccurate estimations and how proposed approaches can assist practitioners is essential. However, prior systematic literature reviews (SLR) only focus on the estimation practices (e.g., [26, 127]) and the effort estimation approaches (e.g., [6]). Aim: We aim to identify themes of reasons for inaccurate estimations and classify approaches to improve effort estimation. Method: We conducted an SLR and identified the key themes and a taxonomy. Results: The reasons for inaccurate estimation are related to information quality, team, estimation practice, project management, and business influences. The effort estimation approaches were the most investigated in the literature, while only a few aim to support the effort estimation process. Yet, few automated approaches are at risk of data leakage and indirect validation scenarios. Recommendations: Practitioners should enhance the quality of information for effort estimation, potentially by adopting an automated approach. Future research should aim to improve the information quality, while avoiding data leakage and indirect validation scenarios.
△ Less
Submitted 14 April, 2024;
originally announced May 2024.
-
Practitioners' Challenges and Perceptions of CI Build Failure Predictions at Atlassian
Authors:
Yang Hong,
Chakkrit Tantithamthavorn,
Jirat Pasuksmit,
Patanamon Thongtanunam,
Arik Friedman,
Xing Zhao,
Anton Krasikov
Abstract:
Continuous Integration (CI) build failures could significantly impact the software development process and teams, such as delaying the release of new features and reducing developers' productivity. In this work, we report on an empirical study that investigates CI build failures throughout product development at Atlassian. Our quantitative analysis found that the repository dimension is the key fa…
▽ More
Continuous Integration (CI) build failures could significantly impact the software development process and teams, such as delaying the release of new features and reducing developers' productivity. In this work, we report on an empirical study that investigates CI build failures throughout product development at Atlassian. Our quantitative analysis found that the repository dimension is the key factor influencing CI build failures. In addition, our qualitative survey revealed that Atlassian developers perceive CI build failures as challenging issues in practice. Furthermore, we found that the CI build prediction can not only provide proactive insight into CI build failures but also facilitate the team's decision-making. Our study sheds light on the challenges and expectations involved in integrating CI build prediction tools into the Bitbucket environment, providing valuable insights for enhancing CI processes.
△ Less
Submitted 14 May, 2024; v1 submitted 14 February, 2024;
originally announced February 2024.
-
Improving Automated Code Reviews: Learning from Experience
Authors:
Hong Yi Lin,
Patanamon Thongtanunam,
Christoph Treude,
Wachiraphan Charoenwet
Abstract:
Modern code review is a critical quality assurance process that is widely adopted in both industry and open source software environments. This process can help newcomers learn from the feedback of experienced reviewers; however, it often brings a large workload and stress to reviewers. To alleviate this burden, the field of automated code reviews aims to automate the process, teaching large langua…
▽ More
Modern code review is a critical quality assurance process that is widely adopted in both industry and open source software environments. This process can help newcomers learn from the feedback of experienced reviewers; however, it often brings a large workload and stress to reviewers. To alleviate this burden, the field of automated code reviews aims to automate the process, teaching large language models to provide reviews on submitted code, just as a human would. A recent approach pre-trained and fine-tuned the code intelligent language model on a large-scale code review corpus. However, such techniques did not fully utilise quality reviews amongst the training data. Indeed, reviewers with a higher level of experience or familiarity with the code will likely provide deeper insights than the others. In this study, we set out to investigate whether higher-quality reviews can be generated from automated code review models that are trained based on an experience-aware oversampling technique. Through our quantitative and qualitative evaluation, we find that experience-aware oversampling can increase the correctness, level of information, and meaningfulness of reviews generated by the current state-of-the-art model without introducing new data. The results suggest that a vast amount of high-quality reviews are underutilised with current training strategies. This work sheds light on resource-efficient ways to boost automated code review models.
△ Less
Submitted 6 February, 2024;
originally announced February 2024.
-
Encoding Version History Context for Better Code Representation
Authors:
Huy Nguyen,
Christoph Treude,
Patanamon Thongtanunam
Abstract:
With the exponential growth of AI tools that generate source code, understanding software has become crucial. When developers comprehend a program, they may refer to additional contexts to look for information, e.g. program documentation or historical code versions. Therefore, we argue that encoding this additional contextual information could also benefit code representation for deep learning. Re…
▽ More
With the exponential growth of AI tools that generate source code, understanding software has become crucial. When developers comprehend a program, they may refer to additional contexts to look for information, e.g. program documentation or historical code versions. Therefore, we argue that encoding this additional contextual information could also benefit code representation for deep learning. Recent papers incorporate contextual data (e.g. call hierarchy) into vector representation to address program comprehension problems. This motivates further studies to explore additional contexts, such as version history, to enhance models' understanding of programs. That is, insights from version history enable recognition of patterns in code evolution over time, recurring issues, and the effectiveness of past solutions. Our paper presents preliminary evidence of the potential benefit of encoding contextual information from the version history to predict code clones and perform code classification. We experiment with two representative deep learning models, ASTNN and CodeBERT, to investigate whether combining additional contexts with different aggregations may benefit downstream activities. The experimental result affirms the positive impact of combining version history into source code representation in all scenarios; however, to ensure the technique performs consistently, we need to conduct a holistic investigation on a larger code base using different combinations of contexts, aggregation, and models. Therefore, we propose a research agenda aimed at exploring various aspects of encoding additional context to improve code representation and its optimal utilisation in specific situations.
△ Less
Submitted 6 February, 2024;
originally announced February 2024.
-
Toward Effective Secure Code Reviews: An Empirical Study of Security-Related Coding Weaknesses
Authors:
Wachiraphan Charoenwet,
Patanamon Thongtanunam,
Van-Thuan Pham,
Christoph Treude
Abstract:
Identifying security issues early is encouraged to reduce the latent negative impacts on software systems. Code review is a widely-used method that allows developers to manually inspect modified code, catching security issues during a software development cycle. However, existing code review studies often focus on known vulnerabilities, neglecting coding weaknesses, which can introduce real-world…
▽ More
Identifying security issues early is encouraged to reduce the latent negative impacts on software systems. Code review is a widely-used method that allows developers to manually inspect modified code, catching security issues during a software development cycle. However, existing code review studies often focus on known vulnerabilities, neglecting coding weaknesses, which can introduce real-world security issues that are more visible through code review. The practices of code reviews in identifying such coding weaknesses are not yet fully investigated.
To better understand this, we conducted an empirical case study in two large open-source projects, OpenSSL and PHP. Based on 135,560 code review comments, we found that reviewers raised security concerns in 35 out of 40 coding weakness categories. Surprisingly, some coding weaknesses related to past vulnerabilities, such as memory errors and resource management, were discussed less often than the vulnerabilities. Developers attempted to address raised security concerns in many cases (39%-41%), but a substantial portion was merely acknowledged (30%-36%), and some went unfixed due to disagreements about solutions (18%-20%). This highlights that coding weaknesses can slip through code review even when identified. Our findings suggest that reviewers can identify various coding weaknesses leading to security issues during code reviews. However, these results also reveal shortcomings in current code review practices, indicating the need for more effective mechanisms or support for increasing awareness of security issue management in code reviews.
△ Less
Submitted 8 May, 2024; v1 submitted 27 November, 2023;
originally announced November 2023.
-
Repeated Builds During Code Review: An Empirical Study of the OpenStack Community
Authors:
Rungroj Maipradit,
Dong Wang,
Patanamon Thongtanunam,
Raula Gaikovina Kula,
Yasutaka Kamei,
Shane McIntosh
Abstract:
Code review is a popular practice where developers critique each others' changes. Since automated builds can identify low-level issues (e.g., syntactic errors, regression bugs), it is not uncommon for software organizations to incorporate automated builds in the code review process. In such code review deployment scenarios, submitted change sets must be approved for integration by both peer code r…
▽ More
Code review is a popular practice where developers critique each others' changes. Since automated builds can identify low-level issues (e.g., syntactic errors, regression bugs), it is not uncommon for software organizations to incorporate automated builds in the code review process. In such code review deployment scenarios, submitted change sets must be approved for integration by both peer code reviewers and automated build bots. Since automated builds may produce an unreliable signal of the status of a change set (e.g., due to ``flaky'' or non-deterministic execution behaviour), code review tools, such as Gerrit, allow developers to request a ``recheck'', which repeats the build process without updating the change set. We conjecture that an unconstrained recheck command will waste time and resources if it is not applied judiciously. To explore how the recheck command is applied in a practical setting, in this paper, we conduct an empirical study of 66,932 code reviews from the OpenStack community.
We quantitatively analyze (i) how often build failures are rechecked; (ii) the extent to which invoking recheck changes build failure outcomes; and (iii) how much waste is generated by invoking recheck. We observe that (i) 55% of code reviews invoke the recheck command after a failing build is reported; (ii) invoking the recheck command only changes the outcome of a failing build in 42% of the cases; and (iii) invoking the recheck command increases review waiting time by an average of 2,200% and equates to 187.4 compute years of waste -- enough compute resources to compete with the oldest land living animal on earth.
△ Less
Submitted 19 August, 2023;
originally announced August 2023.
-
An Exploration of Cross-Patch Collaborations via Patch Linkage in OpenStack
Authors:
Dong Wang,
Patanamon Thongtanunam,
Raula Gaikovina Kula,
Kenichi Matsumoto
Abstract:
Contemporary development projects benefit from code review as it improves the quality of a project. Large ecosystems of inter-dependent projects like OpenStack generate a large number of reviews, which poses new challenges for collaboration (improving patches, fixing defects). Review tools allow developers to link between patches, to indicate patch dependency, competing solutions, or provide broad…
▽ More
Contemporary development projects benefit from code review as it improves the quality of a project. Large ecosystems of inter-dependent projects like OpenStack generate a large number of reviews, which poses new challenges for collaboration (improving patches, fixing defects). Review tools allow developers to link between patches, to indicate patch dependency, competing solutions, or provide broader context. We hypothesize that such patch linkage may also simulate cross-collaboration.
With a case study of OpenStack, we take a first step to explore collaborations that occur after a patch linkage was posted between two patches (i.e., cross-patch collaboration). Our empirical results show that although patch linkage that requests collaboration is relatively less prevalent, the probability of collaboration is relatively higher. Interestingly, the results also show that collaborative contributions via patch linkage are non-trivial, i.e, contributions can affect the review outcome (such as voting) or even improve the patch (i.e., revising). This work opens up future directions to understand barriers and opportunities related to this new kind of collaboration, that assists with code review and development tasks in large ecosystems.
△ Less
Submitted 27 November, 2022;
originally announced November 2022.
-
Automatically Recommend Code Updates: Are We There Yet?
Authors:
Yue Liu,
Chakkrit Tantithamthavorn,
Yonghui Liu,
Patanamon Thongtanunam,
Li Li
Abstract:
In recent years, large pre-trained Language Models of Code (CodeLMs) have shown promising results on various software engineering tasks. One such task is automatic code update recommendation, which transforms outdated code snippets into their approved and revised counterparts. Although many CodeLM-based approaches have been proposed, claiming high accuracy, their effectiveness and reliability on r…
▽ More
In recent years, large pre-trained Language Models of Code (CodeLMs) have shown promising results on various software engineering tasks. One such task is automatic code update recommendation, which transforms outdated code snippets into their approved and revised counterparts. Although many CodeLM-based approaches have been proposed, claiming high accuracy, their effectiveness and reliability on real-world code update tasks remain questionable. In this paper, we present the first extensive evaluation of state-of-the-art CodeLMs for automatically recommending code updates. We assess their performance on two diverse datasets of paired updated methods, considering factors such as temporal evolution, project specificity, method size, and update complexity. Our results reveal that while CodeLMs perform well in settings that ignore temporal information, they struggle in more realistic time-wise scenarios and generalize poorly to new projects. Furthermore, CodeLM performance decreases significantly for larger methods and more complex updates. Furthermore, we observe that many CodeLM-generated "updates" are actually null, especially in time-wise settings, and meaningful edits remain challenging. Our findings highlight the significant gap between the perceived and actual effectiveness of CodeLMs for real-world code update recommendation and emphasize the need for more research on improving their practicality, robustness, and generalizability.
△ Less
Submitted 12 May, 2024; v1 submitted 15 September, 2022;
originally announced September 2022.
-
Giving Back: Contributions Congruent to Library Dependency Changes in a Software Ecosystem
Authors:
Supatsara Wattanakriengkrai,
Dong Wang,
Raula Gaikovina Kula,
Christoph Treude,
Patanamon Thongtanunam,
Takashi Ishio,
Kenichi Matsumoto
Abstract:
Popular adoption of third-party libraries for contemporary software development has led to the creation of large inter-dependency networks, where sustainability issues of a single library can have widespread network effects. Maintainers of these libraries are often overworked, relying on the contributions of volunteers to sustain these libraries. In this work, we measure contributions that are ali…
▽ More
Popular adoption of third-party libraries for contemporary software development has led to the creation of large inter-dependency networks, where sustainability issues of a single library can have widespread network effects. Maintainers of these libraries are often overworked, relying on the contributions of volunteers to sustain these libraries. In this work, we measure contributions that are aligned with dependency changes, to understand where they come from (i.e., non-maintainer, client maintainer, library maintainer, and library and client maintainer), analyze whether they contribute to library dormancy (i.e., a lack of activity), and investigate the similarities between these contributions and developers' typical contributions. Hence, we leverage socio-technical techniques to measure the dependency-contribution congruence (DC congruence), i.e., the degree to which contributions align with dependencies. We conduct a large-scale empirical study to measure the DC congruence for the NPM ecosystem using 1.7 million issues, 970 thousand pull requests (PR), and over 5.3 million commits belonging to 107,242 NPM packages. At the ecosystem level, we pinpoint in time peaks of congruence with dependency changes (i.e., 16% DC congruence score). Surprisingly, these contributions came from the ecosystem itself (i.e., non-maintainers of either client and library). At the project level, we find that DC congruence shares a statistically significant relationship with the likelihood of a package becoming dormant. Finally, by comparing source code of contributions, we find that congruent contributions are statistically different to typical contributions. Our work has implications to encourage and sustain contributions, especially to support library maintainers that require dependency changes.
△ Less
Submitted 26 May, 2022;
originally announced May 2022.
-
Towards Just-Enough Documentation for Agile Effort Estimation: What Information Should Be Documented?
Authors:
Jirat Pasuksmit,
Patanamon Thongtanunam,
Shanika Karunasekera
Abstract:
Effort estimation is an integral part of activities planning in Agile iterative development. An Agile team estimates the effort of a task based on the available information which is usually conveyed through documentation. However, as documentation has a lower priority in Agile, little is known about how documentation effort can be optimized while achieving accurate estimation. Hence, to help pract…
▽ More
Effort estimation is an integral part of activities planning in Agile iterative development. An Agile team estimates the effort of a task based on the available information which is usually conveyed through documentation. However, as documentation has a lower priority in Agile, little is known about how documentation effort can be optimized while achieving accurate estimation. Hence, to help practitioners achieve just-enough documentation for effort estimation, we investigated the different types of documented information that practitioners considered useful for effort estimation. We conducted a survey study with 121 Agile practitioners across 25 countries. Our survey results showed that (1) despite the lower priority of documentation in Agile practices, 98% of the respondents considered documented information moderately to extremely important when estimating effort, (2) 73% of them reported that they would re-estimate a task when the documented information was changed, and (3) functional requirements, user stories, definition of done, UI wireframes, acceptance criteria, and task dependencies were ranked as the most useful types of documented information for effort estimation. Nevertheless, many respondents reported that these useful types of documented information were occasionally changing or missing. Based on our study results, we provide recommendations for agile practitioners on how effort estimation can be improved by focusing on just-enough documentation.
△ Less
Submitted 6 July, 2021;
originally announced July 2021.
-
Assessing the Students' Understanding and their Mistakes in Code Review Checklists -- An Experience Report of 1,791 Code Review Checklist Questions from 394 Students
Authors:
Chun Yong Chong,
Patanamon Thongtanunam,
Chakkrit Tantithamthavorn
Abstract:
Code review is a widely-used practice in software development companies to identify defects. Hence, code review has been included in many software engineering curricula at universities worldwide. However, teaching code review is still a challenging task because the code review effectiveness depends on the code reading and analytical skills of a reviewer. While several studies have investigated the…
▽ More
Code review is a widely-used practice in software development companies to identify defects. Hence, code review has been included in many software engineering curricula at universities worldwide. However, teaching code review is still a challenging task because the code review effectiveness depends on the code reading and analytical skills of a reviewer. While several studies have investigated the code reading techniques that students should use to find defects during code review, little has focused on a learning activity that involves analytical skills. Indeed, develo** a code review checklist should stimulate students to develop their analytical skills to anticipate potential issues (i.e., software defects). Yet, it is unclear whether students can anticipate potential issues given their limited experience in software development (programming, testing, etc.). We perform a qualitative analysis to investigate whether students are capable of creating code review checklists, and if the checklists can be used to guide reviewers to find defects. In addition, we identify common mistakes that students make when develo** a code review checklist. Our results show that while there are some misconceptions among students about the purpose of code review, students are able to anticipate potential defects and create a relatively good code review checklist. Hence, our results lead us to conclude that develo** a code review checklist can be a part of the learning activities for code review in order to scaffold students' skills.
△ Less
Submitted 12 January, 2021;
originally announced January 2021.
-
Predicting Defective Lines Using a Model-Agnostic Technique
Authors:
Supatsara Wattanakriengkrai,
Patanamon Thongtanunam,
Chakkrit Tantithamthavorn,
Hideaki Hata,
Kenichi Matsumoto
Abstract:
Defect prediction models are proposed to help a team prioritize source code areas files that need Software QualityAssurance (SQA) based on the likelihood of having defects. However, developers may waste their unnecessary effort on the whole filewhile only a small fraction of its source code lines are defective. Indeed, we find that as little as 1%-3% of lines of a file are defective. Hence, in thi…
▽ More
Defect prediction models are proposed to help a team prioritize source code areas files that need Software QualityAssurance (SQA) based on the likelihood of having defects. However, developers may waste their unnecessary effort on the whole filewhile only a small fraction of its source code lines are defective. Indeed, we find that as little as 1%-3% of lines of a file are defective. Hence, in this work, we propose a novel framework (called LINE-DP) to identify defective lines using a model-agnostic technique, i.e., an Explainable AI technique that provides information why the model makes such a prediction. Broadly speaking, our LINE-DP first builds a file-level defect model using code token features. Then, our LINE-DP uses a state-of-the-art model-agnostic technique (i.e.,LIME) to identify risky tokens, i.e., code tokens that lead the file-level defect model to predict that the file will be defective. Then, the lines that contain risky tokens are predicted as defective lines. Through a case study of 32 releases of nine Java open source systems, our evaluation results show that our LINE-DP achieves an average recall of 0.61, a false alarm rate of 0.47, a top 20%LOC recall of0.27, and an initial false alarm of 16, which are statistically better than six baseline approaches. Our evaluation shows that our LINE-DP requires an average computation time of 10 seconds including model construction and defective line identification time. In addition, we find that 63% of defective lines that can be identified by our LINE-DP are related to common defects (e.g., argument change, condition change). These results suggest that our LINE-DP can effectively identify defective lines that contain common defectswhile requiring a smaller amount of inspection effort and a manageable computation cost.
△ Less
Submitted 8 September, 2020;
originally announced September 2020.
-
Automatically Generating Documentation for Lambda Expressions in Java
Authors:
Anwar Alqaimi,
Patanamon Thongtanunam,
Christoph Treude
Abstract:
When lambda expressions were introduced to the Java programming language as part of the release of Java 8 in 2014, they were the language's first step into functional programming. Since lambda expressions are still relatively new, not all developers use or understand them. In this paper, we first present the results of an empirical study to determine how frequently developers of GitHub repositorie…
▽ More
When lambda expressions were introduced to the Java programming language as part of the release of Java 8 in 2014, they were the language's first step into functional programming. Since lambda expressions are still relatively new, not all developers use or understand them. In this paper, we first present the results of an empirical study to determine how frequently developers of GitHub repositories make use of lambda expressions and how they are documented. We find that 11% of Java GitHub repositories use lambda expressions, and that only 6% of the lambda expressions are accompanied by source code comments. We then present a tool called LambdaDoc which can automatically detect lambda expressions in a Java repository and generate natural language documentation for them. Our evaluation of LambdaDoc with 23 professional developers shows that they perceive the generated documentation to be complete, concise, and expressive, while the majority of the documentation produced by our participants without tool support was inadequate. Our contribution builds an important step towards automatically generating documentation for functional programming constructs in an object-oriented language.
△ Less
Submitted 14 March, 2019;
originally announced March 2019.
-
The Impact of Human Factors on the Participation Decision of Reviewers in Modern Code Review
Authors:
Shade Ruangwan,
Patanamon Thongtanunam,
Akinori Ihara,
Kenichi Matsumoto
Abstract:
Modern Code Review (MCR) plays a key role in software quality practices. In MCR process, a new patch (i.e., a set of code changes) is encouraged to be examined by reviewers in order to identify weaknesses in source code prior to an integration into main software repositories. To mitigate the risk of having future defects, prior work suggests that MCR should be performed with sufficient review part…
▽ More
Modern Code Review (MCR) plays a key role in software quality practices. In MCR process, a new patch (i.e., a set of code changes) is encouraged to be examined by reviewers in order to identify weaknesses in source code prior to an integration into main software repositories. To mitigate the risk of having future defects, prior work suggests that MCR should be performed with sufficient review participation. Indeed, recent work shows that a low number of participated reviewers is associated with poor software quality. However, there is a likely case that a new patch still suffers from poor review participation even though reviewers were invited. Hence, in this paper, we set out to investigate the factors that are associated with the participation decision of an invited reviewer. Through a case study of 230,090 patches spread across the Android, LibreOffice, OpenStack and Qt systems, we find that (1) 16%-66% of patches have at least one invited reviewer who did not respond to the review invitation; (2) human factors play an important role in predicting whether or not an invited reviewer will participate in a review; (3) a review participation rate of an invited reviewers and code authoring experience of an invited reviewer are highly associated with the participation decision of an invited reviewer. These results can help practitioners better understand about how human factors associate with the participation decision of reviewers and serve as guidelines for inviting reviewers, leading to a better inviting decision and a better reviewer participation.
△ Less
Submitted 26 June, 2018;
originally announced June 2018.