
An Empirical Study of Token-based Micro Commits

Authors: Masanari Kondo, Daniel M. German, Yasutaka Kamei, Naoyasu Ubayashi, Osamu Mizuno

Abstract: In software development, developers frequently apply maintenance activities to the source code that change a few lines by a single commit. A good understanding of the characteristics of such small changes can support quality assurance approaches (e.g., automated program repair), as it is likely that small changes are addressing deficiencies in other changes; thus, understanding the reasons for creating small changes can help understand the types of errors introduced. Eventually, these reasons and the types of errors can be used to enhance quality assurance approaches for improving code quality. While prior studies used code churns to characterize and investigate the small changes, such a definition has a critical limitation. Specifically, it loses the information of changed tokens in a line. For example, this definition fails to distinguish the following two one-line changes: (1) changing a string literal to fix a displayed message and (2) changing a function call and adding a new parameter. These are definitely maintenance activities, but we deduce that researchers and practitioners are interested in supporting the latter change. To address this limitation, in this paper, we define micro commits, a type of small change based on changed tokens. Our goal is to quantify small changes using changed tokens. Changed tokens allow us to identify small changes more precisely. In fact, this token-level definition can distinguish the above example.
We investigate defined micro commits in four OSS projects and understand their characteristics as the first empirical study on token-based micro commits. We find that micro commits mainly replace a single name or literal token, and micro commits are more likely used to fix bugs. Additionally, we propose the use of token-based information to support software engineering approaches in which very small changes significantly affect their effectiveness.

Submitted 15 May, 2024; originally announced May 2024.

arXiv:2404.09223 [pdf, other]

OSS Myths and Facts

Authors: Yukako Iimura, Masanari Kondo, Kazushi Tomoto, Yasutaka Kamei, Naoyasu Ubayashi, Shinobu Saito

Abstract: We have selected six myths about the OSS community and have tested whether they are true or not. The purpose of this report is to identify the lessons that can be learned from the development style of the OSS community and the issues that need to be addressed in order to achieve better Employee Experience (EX) in software development within companies and organizations. The OSS community has been led by a group of skilled developers known as hackers. We have great respect for the engineers and activities of the OSS community and aim to learn from them. On the other hand, it is important to recognize that having high expectations can sometimes result in misunderstandings. When there are excessive expectations and concerns, misunderstandings (referred to as myths) can arise, particularly when individuals who are not practitioners rely on hearsay to understand the practices of practitioners. We selected the myths to be tested based on a literature review and interviews. These myths are held by software development managers and customers who are not direct participants in the OSS community. We answered questions about each myth through: 1) Our own analysis of repository data, 2) A literature survey of data analysis conducted by previous studies, or 3) A combination of the two approaches.

Submitted 14 April, 2024; originally announced April 2024.

Comments: English Version: 28 pages + Japanese Version: 23 pages

arXiv:2402.01438 [pdf, ps, other]

Exploring the Effect of Multiple Natural Languages on Code Suggestion Using GitHub Copilot

Authors: Kei Koyanagi, Dong Wang, Kotaro Noguchi, Masanari Kondo, Alexander Serebrenik, Yasutaka Kamei, Naoyasu Ubayashi

Abstract: GitHub Copilot is an AI-enabled tool that automates program synthesis. It has gained significant attention since its launch in 2021. Recent studies have extensively examined Copilot's capabilities in various programming tasks, as well as its security issues. However, little is known about the effect of different natural languages on code suggestion. Natural language is considered a social bias in the field of NLP, and this bias could impact the diversity of software engineering. To address this gap, we conducted an empirical study to investigate the effect of three popular natural languages (English, Japanese, and Chinese) on Copilot. We used 756 questions of varying difficulty levels from AtCoder contests for evaluation purposes. The results highlight that the capability varies across natural languages, with Chinese achieving the worst performance. Furthermore, regardless of the type of natural language, the performance decreases significantly as the difficulty of questions increases. Our work represents the initial step in comprehending the significance of natural languages in Copilot's capability and introduces promising opportunities for future endeavors.

Submitted 2 February, 2024; originally announced February 2024.

arXiv:2308.10078 [pdf, other]

Repeated Builds During Code Review: An Empirical Study of the OpenStack Community

Authors: Rungroj Maipradit, Dong Wang, Patanamon Thongtanunam, Raula Gaikovina Kula, Yasutaka Kamei, Shane McIntosh

Abstract: Code review is a popular practice where developers critique each other's changes. Since automated builds can identify low-level issues (e.g., syntactic errors, regression bugs), it is not uncommon for software organizations to incorporate automated builds in the code review process. In such code review deployment scenarios, submitted change sets must be approved for integration by both peer code reviewers and automated build bots. Since automated builds may produce an unreliable signal of the status of a change set (e.g., due to "flaky" or non-deterministic execution behaviour), code review tools, such as Gerrit, allow developers to request a "recheck", which repeats the build process without updating the change set. We conjecture that an unconstrained recheck command will waste time and resources if it is not applied judiciously. To explore how the recheck command is applied in a practical setting, in this paper, we conduct an empirical study of 66,932 code reviews from the OpenStack community. We quantitatively analyze (i) how often build failures are rechecked; (ii) the extent to which invoking recheck changes build failure outcomes; and (iii) how much waste is generated by invoking recheck.
We observe that (i) 55% of code reviews invoke the recheck command after a failing build is reported; (ii) invoking the recheck command only changes the outcome of a failing build in 42% of the cases; and (iii) invoking the recheck command increases review waiting time by an average of 2,200% and equates to 187.4 compute years of waste -- enough compute resources to compete with the oldest land living animal on earth.

Submitted 19 August, 2023; originally announced August 2023.

Comments: conference

arXiv:2307.07117 [pdf, other]

When Conversations Turn Into Work: A Taxonomy of Converted Discussions and Issues in GitHub

Authors: Dong Wang, Masanari Kondo, Yasutaka Kamei, Raula Gaikovina Kula, Naoyasu Ubayashi

Abstract: Popular and large contemporary open-source projects now embrace a diverse set of documentation for communication channels. Examples include contribution guidelines (i.e., commit message guidelines, coding rules, submission guidelines), code of conduct (i.e., rules and behavior expectations), governance policies, and Q&A forum. In 2020, GitHub released Discussion to distinguish between communication and collaboration. However, it remains unclear how developers maintain these channels, how trivial it is, and whether deciding on conversion takes time. We conducted an empirical study on 259 NPM and 148 PyPI repositories, devising two taxonomies of reasons for converting discussions into issues and vice versa. The most frequent conversion from a discussion to an issue is when developers request a contributor to clarify their idea into an issue (Reporting a Clarification Request, 35.1% and 34.7%, respectively), while having a non-actionable topic (QA, ideas, feature requests; 55.0% and 42.0%, respectively) is the most frequent reason for converting an issue into a discussion. Furthermore, we show that not all reasons for conversion are trivial (e.g., not a bug), and raising a conversion intent potentially takes time (i.e., a median of 15.2 and 35.1 hours, respectively, taken from issues to discussions). Our work contributes to complementing the GitHub guidelines and helping developers effectively utilize the Issue and Discussion communication channels to maintain their collaboration.

Submitted 13 July, 2023; originally announced July 2023.

arXiv:2307.07111 [pdf, other]

More Than React: Investigating The Role of Emoji Reaction in GitHub Pull Requests

Authors: Dong Wang, Tao Xiao, Teyon Son, Raula Gaikovina Kula, Takashi Ishio, Yasutaka Kamei, Kenichi Matsumoto

Abstract: Open source software development has become more social and collaborative, as is evident on GitHub. Since 2016, GitHub started to support more informal methods such as emoji reactions, with the goal to reduce commenting noise when reviewing any code changes to a repository. From a code review context, the extent to which emoji reactions facilitate a more efficient review process is unknown. We conduct an empirical study to mine 1,850 active repositories across seven popular languages to analyze 365,811 Pull Requests (PRs) for their emoji reactions against the review time, first-time contributors, comment intentions, and the consistency of the sentiments. Answering these four research perspectives, we first find that the number of emoji reactions has a significant correlation with the review time. Second, our results show that a PR submitted by a first-time contributor is less likely to receive emoji reactions. Third, the results reveal that the comments with an intention of information giving are more likely to receive an emoji reaction. Fourth, we observe that only a small proportion of sentiments are not consistent between comments and emoji reactions, i.e., with 11.8% of instances being identified. In these cases, the prevalent reason is when reviewers cheer up authors that admit to a mistake, i.e., acknowledge a mistake. Apart from reducing commenting noise, our work suggests that emoji reactions play a positive role in facilitating collaborative communication during the review process.

Submitted 13 July, 2023; originally announced July 2023.

arXiv:2303.15684 [pdf, other]

Understanding the Role of Images on Stack Overflow

Authors: Dong Wang, Tao Xiao, Christoph Treude, Raula Gaikovina Kula, Hideaki Hata, Yasutaka Kamei

Abstract: Images are increasingly being shared by software developers in diverse channels including question-and-answer forums like Stack Overflow. Although prior work has pointed out that these images are meaningful and provide complementary information compared to their associated text, how images are used to support questions is empirically unknown. To address this knowledge gap, in this paper we specifically conduct an empirical study to investigate (I) the characteristics of images, (II) the extent to which images are used in different question types, and (III) the role of images on receiving answers. Our results first show that user interface is the most common image content and undesired output is the most frequent purpose for sharing images. Moreover, these images essentially facilitate the understanding of 68% of sampled questions. Second, we find that discrepancy questions are relatively more frequent compared to those without images, but there are no significant differences observed in description length in all types of questions. Third, the quantitative results statistically validate that questions with images are more likely to receive accepted answers, but do not speed up the time to receive answers. Our work demonstrates the crucial role that images play by approaching the topic from a new angle and lays the foundation for future opportunities to use images to assist in tasks like generating questions and identifying question-relatedness.

Submitted 27 March, 2023; originally announced March 2023.

arXiv:2202.06157 [pdf, ps, other]

doi 10.1109/MSR.2017.4

The Impact of Using Regression Models to Build Defect Classifiers

Authors: Gopi Krishnan Rajbahadur, Shaowei Wang, Yasutaka Kamei, Ahmed E. Hassan

Abstract: It is common practice to discretize continuous defect counts into defective and non-defective classes and use them as a target variable when building defect classifiers (discretized classifiers). However, this discretization of continuous defect counts leads to information loss that might affect the performance and interpretation of defect classifiers. Another possible approach to build defect classifiers is through the use of regression models then discretizing the predicted defect counts into defective and non-defective classes (regression-based classifiers). In this paper, we compare the performance and interpretation of defect classifiers that are built using both approaches (i.e., discretized classifiers and regression-based classifiers) across six commonly used machine learning classifiers (i.e., linear/logistic regression, random forest, KNN, SVM, CART, and neural networks) and 17 datasets. We find that: i) Random forest based classifiers outperform other classifiers (best AUC) for both classifier building approaches; ii) In contrast to common practice, building a defect classifier using discretized defect counts (i.e., discretized classifiers) does not always lead to better performance. Hence we suggest that future defect classification studies should consider building regression-based classifiers (in particular when the defective ratio of the modeled dataset is low).
Moreover, we suggest that both approaches for building defect classifiers should be explored, so the best-performing classifier can be used when determining the most influential features.

Submitted 12 February, 2022; originally announced February 2022.

Journal ref: IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), 2017, pp. 135-145

arXiv:2202.06146 [pdf, other]

doi 10.1109/TSE.2019.2924371

Impact of Discretization Noise of the Dependent variable on Machine Learning Classifiers in Software Engineering

Authors: Gopi Krishnan Rajbahadur, Shaowei Wang, Yasutaka Kamei, Ahmed E. Hassan

Abstract: Researchers usually discretize a continuous dependent variable into two target classes by introducing an artificial discretization threshold (e.g., median). However, such discretization may introduce noise (i.e., discretization noise) due to ambiguous class loyalty of data points that are close to the artificial threshold. Previous studies do not provide a clear directive on the impact of discretization noise on the classifiers and how to handle such noise. In this paper, we propose a framework to help researchers and practitioners systematically estimate the impact of discretization noise on classifiers in terms of its impact on various performance measures and the interpretation of classifiers. Through a case study of 7 software engineering datasets, we find that: 1) discretization noise affects the different performance measures of a classifier differently for different datasets; 2) though the interpretation of the classifiers is impacted by the discretization noise on the whole, the top 3 most important features are not affected by the discretization noise. Therefore, we suggest that practitioners and researchers use our framework to understand the impact of discretization noise on the performance of their built classifiers and estimate the exact amount of discretization noise to be discarded from the dataset to avoid the negative impact of such noise.

Submitted 12 February, 2022; originally announced February 2022.

Journal ref: IEEE Transactions on Software Engineering, Vol 47, Issue 7 (2021), 1414-1430

arXiv:2202.02389 [pdf, other]

doi 10.1109/TSE.2021.3056941

The impact of feature importance methods on the interpretation of defect classifiers

Authors: Gopi Krishnan Rajbahadur, Shaowei Wang, Yasutaka Kamei, Ahmed E. Hassan

Abstract: Classifier specific (CS) and classifier agnostic (CA) feature importance methods are widely used (often interchangeably) by prior studies to derive feature importance ranks from a defect classifier. However, different feature importance methods are likely to compute different feature importance ranks even for the same dataset and classifier. Hence such interchangeable use of feature importance methods can lead to conclusion instabilities unless there is a strong agreement among different methods. Therefore, in this paper, we evaluate the agreement between the feature importance ranks associated with the studied classifiers through a case study of 18 software projects and six commonly used classifiers. We find that: 1) The computed feature importance ranks by CA and CS methods do not always strongly agree with each other. 2) The computed feature importance ranks by the studied CA methods exhibit a strong agreement including the features reported at top-1 and top-3 ranks for a given dataset and classifier, while even the commonly used CS methods yield vastly different feature importance ranks. Such findings raise concerns about the stability of conclusions across replicated studies. We further observe that the commonly used defect datasets are rife with feature interactions and these feature interactions impact the computed feature importance ranks of the CS methods (not the CA methods).
We demonstrate that removing these feature interactions, even with simple methods like CFS, improves agreement between the computed feature importance ranks of CA and CS methods. In light of our findings, we provide guidelines for stakeholders and practitioners when performing model interpretation and directions for future research, e.g., future research is needed to investigate the impact of advanced feature interaction removal methods on computed feature importance ranks of different CS methods.

Submitted 4 February, 2022; originally announced February 2022.

arXiv:2012.08053 [pdf, other]

A Quantitative Study of Security Bug Fixes of GitHub Repositories

Authors: Daito Nakano, Mingyang Yin, Ryosuke Sato, Abram Hindle, Yasutaka Kamei, Naoyasu Ubayashi

Abstract: Software is prone to bugs and failures. Security bugs are those that expose or share privileged information and access in violation of the software's requirements. Given the seriousness of security bugs, there are centralized mechanisms for supporting and tracking these bugs across multiple products; one such mechanism is the Common Vulnerabilities and Exposures (CVE) ID description. When a bug gets a CVE, it is referenced by its CVE ID. Thus we explore thousands of Free/Libre Open Source Software (FLOSS) projects on GitHub to determine if developers reference or discuss CVEs in their code, commits, and issues. CVEs will often refer to 3rd party software dependencies of a project and thus the bug will not be in the actual product itself. We study how many of these references are intentional CVE references, and how many are relevant bugs within the projects themselves. We investigate how the bugs that reference CVEs are fixed and how long it takes to fix these bugs. The results of our manual classification for 250 bug reports show that 88 (35%), 32 (13%), and 130 (52%) are classified into "Version Update", "Fixing Code", and "Discussion". To understand how long it takes to fix those bugs, we compare two periods: Reporting Period, a period between the disclosure date of vulnerability information in CVE repositories and the creation date of the bug report in a project, and Fixing Period, a period between the creation date of the bug report and the fixing date of the bug report.
We find that 44% of bug reports that are classified into "Version Update" or "Fixing Code" have longer Reporting Period than Fixing Period. This suggests that those who submit CVEs should notify affected projects more directly.

Submitted 14 December, 2020; originally announced December 2020.

arXiv:1812.10578 [pdf, other]

Towards effective AI-powered agile project management

Authors: Hoa Khanh Dam, Truyen Tran, John Grundy, Aditya Ghose, Yasutaka Kamei

Abstract: The rise of Artificial intelligence (AI) has the potential to significantly transform the practice of project management. Project management has a large socio-technical element with many uncertainties arising from variability in human aspects, e.g., customers' needs, developers' performance and team dynamics. AI can assist project managers and team members by automating repetitive, high-volume tasks to enable project analytics for estimation and risk prediction, providing actionable recommendations, and even making decisions. AI is potentially a game changer for project management in helping to accelerate productivity and increase project success rates. In this paper, we propose a framework where AI technologies can be leveraged to offer support for managing agile projects, which have become increasingly popular in the industry.

Submitted 26 December, 2018; originally announced December 2018.

Comments: In Proceedings of International Conference on Software Engineering (ICSE 2019), (To appear), NIER track, May 2019 (Montreal, Canada)

arXiv:1810.09723 [pdf, ps, other]

Bridging Semantic Gaps between Natural Languages and APIs with Word Embedding

Authors: Xiaochen Li, He Jiang, Yasutaka Kamei, Xin Chen

Abstract: Developers increasingly rely on text matching tools to analyze the relation between natural language words and APIs. However, semantic gaps, namely textual mismatches between words and APIs, negatively affect these tools. Previous studies have transformed words or APIs into low-dimensional vectors for matching; however, inaccurate results were obtained due to the failure of modeling words and APIs simultaneously. To resolve this problem, two main challenges are to be addressed: the acquisition of massive words and APIs for mining and the alignment of words and APIs for modeling. Therefore, this study proposes Word2API to effectively estimate relatedness of words and APIs. Word2API collects millions of commonly used words and APIs from code repositories to address the acquisition challenge. Then, a shuffling strategy is used to transform related words and APIs into tuples to address the alignment challenge. Using these tuples, Word2API models words and APIs simultaneously. Word2API outperforms baselines by 10%-49.6% of relatedness estimation in terms of precision and NDCG. Word2API is also effective on solving typical software tasks, e.g., query expansion and API documents linking. A simple system with Word2API-expanded queries recommends up to 21.4% more related APIs for developers. Meanwhile, Word2API improves comparison algorithms by 7.9%-17.4% in linking questions in Question&Answer communities to API documents.

Submitted 23 October, 2018; originally announced October 2018.

Comments: accepted by IEEE Transactions on Software Engineering

Showing 1–13 of 13 results for author: Kamei, Y