From First Patch to Long-Term Contributor: Evaluating Onboarding Recommendations for OSS Newcomers

Authors: Asif Kamal Turzo, Sayma Sultana, Amiangshu Bosu

Abstract: Attracting and retaining a steady stream of new contributors is crucial to ensuring the long-term survival of open-source software (OSS) projects. However, there are two key research gaps regarding recommendations for onboarding new contributors to OSS projects. First, most of the existing recommendations are based on a limited number of projects, which raises concerns about their generalizability. If a recommendation yields conflicting results in a different context, it could hinder a newcomer's onboarding process rather than help them. Second, it is unclear whether these recommendations also apply to experienced contributors. If certain recommendations are specific to newcomers, continuing to follow them after their initial contributions are accepted could hinder their chances of becoming long-term contributors. To address these gaps, we conducted a two-stage mixed-method study. In the first stage, we conducted a Systematic Literature Review (SLR) and identified 15 task-related actionable recommendations that newcomers to OSS projects can follow to improve their odds of successful onboarding. In the second stage, we conducted a large-scale empirical study of five Gerrit-based projects and 1,155 OSS projects from GitHub to assess whether those recommendations assist newcomers' successful onboarding. Our results suggest that four recommendations positively correlate with newcomers' first patch acceptance in most contexts. Four recommendations are context-dependent, and four indicate significant negative associations for most projects. Our results also identify three newcomer-specific recommendations, which OSS joiners should abandon once they are no longer newcomers to increase their odds of becoming long-term contributors.

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2307.03852

Towards Automated Classification of Code Review Feedback to Support Analytics

Authors: Asif Kamal Turzo, Fahim Faysal, Ovi Poddar, Jaydeb Sarker, Anindya Iqbal, Amiangshu Bosu

Abstract: Background: As improving code review (CR) effectiveness is a priority for many software development organizations, projects have deployed CR analytics platforms to identify potential improvement areas. The number of issues identified, which is a crucial metric to measure CR effectiveness, can be misleading if all issues are placed in the same bin. Therefore, a finer-grained classification of issues identified during CRs can provide actionable insights to improve CR effectiveness. Although a recent work by Fregnan et al. proposed automated models to classify CR-induced changes, we have noticed two potential improvement areas: i) classifying comments that do not induce changes and ii) using deep neural networks (DNN) in conjunction with code context to improve performance. Aims: This study aims to develop an automated CR comment classifier that leverages DNN models to achieve a more reliable performance than Fregnan et al. Method: Using a manually labeled dataset of 1,828 CR comments, we trained and evaluated supervised learning-based DNN models leveraging code context, comment text, and a set of code metrics to classify CR comments into one of the five high-level categories proposed by Turzo and Bosu. Results: Based on our 10-fold cross-validation-based evaluations of multiple combinations of tokenization approaches, we found a model using CodeBERT achieving the best accuracy of 59.3%. Our approach outperforms Fregnan et al.'s approach by achieving 18.7% higher accuracy. Conclusion: Besides facilitating improved CR analytics, our proposed model can be useful for developers in prioritizing code review feedback and selecting reviewers.

Submitted 7 July, 2023; originally announced July 2023.

Journal ref: Proceedings of the 17th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 2023

arXiv:2302.11686

What Makes a Code Review Useful to OpenDev Developers? An Empirical Investigation

Authors: Asif Kamal Turzo, Amiangshu Bosu

Abstract: Context: Because Code Reviews (CR) demand significant effort, even a minor improvement in their effectiveness can yield significant savings for a software development organization. Aim: This study aims to develop a finer-grained understanding of what makes a code review comment useful to OSS developers, to what extent a code review comment is considered useful to them, and how various contextual and participant-related factors influence its usefulness level. Method: Toward this goal, we have conducted a three-stage mixed-method study. We randomly selected 2,500 CR comments from the OpenDev Nova project and manually categorized the comments. We designed a survey of OpenDev developers to better understand their perspectives on useful CRs. Combining our survey-obtained scores with our manually labeled dataset, we trained two regression models: one to identify factors that influence the usefulness of CR comments and the other to identify factors that improve the odds of 'Functional' defect identification over the others. Key findings: The results of our study suggest that a CR comment's usefulness is dictated not only by its technical contributions, such as defect findings or quality improvement tips, but also by its linguistic characteristics, such as comprehensibility and politeness. While a reviewer's coding experience positively associates with CR usefulness, the number of mutual reviews, comment volume in a file, the total number of lines added/modified, and CR interval have the opposite association. While authorship and reviewership experiences for the files under review have been the most popular attributes for reviewer recommendation systems, we do not find any significant association of those attributes with CR usefulness.

Submitted 19 June, 2023; v1 submitted 22 February, 2023; originally announced February 2023.

Journal ref: Empirical Software Engineering, 2023

arXiv:2210.00139

Code Reviews in Open Source Projects: How Do Gender Biases Affect Participation and Outcomes?

Authors: Sayma Sultana, Asif Kamal Turzo, Amiangshu Bosu

Abstract: Context: Contemporary software development organizations lack diversity, and the ratios of women in Free and open-source software (FOSS) communities are even lower than the industry average. Although the results of recent studies hint at the existence of biases against women, it is unclear to what extent such biases influence the outcomes of various software development tasks. Aim: We aim to identify whether the outcomes of or participation in code reviews (or pull requests) are influenced by the gender of a developer. Approach: With this goal, this study includes a total of 1,010 FOSS projects. We developed six regression models for each of the 14 datasets (i.e., 10 Gerrit-based and four GitHub) to identify if code acceptance, review intervals, and code review participation differ based on the gender and gender-neutral profile of a developer. Key findings: Our results find significant gender biases during code acceptance among 13 out of the 14 datasets, with seven favoring men and the remaining six favoring women. We also found significant differences between men and women in terms of code review intervals, with women encountering longer delays in three cases and the opposite in seven. Our results indicate reviewer selection as one of the most gender-biased aspects among most of the projects, with women having significantly lower code review participation among 11 out of the 14 cases. Since most of the review assignments are based on invitations, this result suggests possible affinity biases among the developers. Conclusion: Though gender bias exists among many projects, the direction and amplitude of bias vary based on project size, community, and culture. Similar bias mitigation strategies may not work across all communities, as characteristics of biases and their underlying causes differ.

Submitted 7 February, 2023; v1 submitted 30 September, 2022; originally announced October 2022.

arXiv:2202.13056

Automated Identification of Toxic Code Reviews Using ToxiCR

Authors: Jaydeb Sarker, Asif Kamal Turzo, Ming Dong, Amiangshu Bosu

Abstract: Toxic conversations during software development interactions may have serious repercussions on a Free and Open Source Software (FOSS) development project. For example, victims of toxic conversations may become afraid to express themselves, therefore get demotivated, and may eventually leave the project. Automated filtering of toxic conversations may help a FOSS community to maintain healthy interactions among its members. However, off-the-shelf toxicity detectors perform poorly on Software Engineering (SE) datasets, such as one curated from code review comments. To counter this challenge, we present ToxiCR, a supervised learning-based toxicity identification tool for code review interactions. ToxiCR includes a choice to select one of the ten supervised learning algorithms, an option to select text vectorization techniques, eight preprocessing steps, and a large-scale labeled dataset of 19,571 code review comments. Two out of those eight preprocessing steps are SE domain specific. With our rigorous evaluation of the models with various combinations of preprocessing steps and vectorization techniques, we have identified the best combination for our dataset, which achieves 95.8% accuracy and an 88.9% F1 score. ToxiCR significantly outperforms existing toxicity detectors on our dataset. We have made our dataset, pre-trained models, evaluation results, and source code publicly available at: https://github.com/WSU-SEAL/ToxiCR

Submitted 7 February, 2023; v1 submitted 25 February, 2022; originally announced February 2022.

Journal ref: ACM Transactions on Software Engineering Methodology (TOSEM), 2023

arXiv:2102.06909

Why Security Defects Go Unnoticed during Code Reviews? A Case-Control Study of the Chromium OS Project

Authors: Rajshakhar Paul, Asif Kamal Turzo, Amiangshu Bosu

Abstract: Peer code review has been found to be effective in identifying security vulnerabilities. However, despite practicing mandatory code reviews, many Open Source Software (OSS) projects still encounter a large number of post-release security vulnerabilities, as some security defects escape those reviews. Therefore, a project manager may wonder if there was any weakness or inconsistency during a code review that missed a security vulnerability. Answers to this question may help a manager pinpoint areas of concern and take measures to improve the effectiveness of his/her project's code reviews in identifying security defects. Therefore, this study aims to identify the factors that differentiate code reviews that successfully identified security defects from those that missed such defects. With this goal, we conduct a case-control study of the Chromium OS project. Using multi-stage semi-automated approaches, we build a dataset of 516 code reviews that successfully identified security defects and 374 code reviews where security defects escaped. The results of our empirical study suggest that there are significant differences between the categories of security defects that are identified and those that are missed during code reviews. A logistic regression model fitted on our dataset achieved an AUC score of 0.91 and has identified nine code review attributes that influence the identification of security defects. While time to complete a review, the number of mutual reviews between two developers, and whether the review is for a bug fix have positive impacts on vulnerability identification, opposite effects are observed from the number of directories under review, the number of total reviews by a developer, and the total number of prior commits for the file under review.

Submitted 13 February, 2021; originally announced February 2021.

Journal ref: 43rd International Conference on Software Engineering (ICSE), 2021

arXiv:2009.09331

A Benchmark Study of the Contemporary Toxicity Detectors on Software Engineering Interactions

Authors: Jaydeb Sarker, Asif Kamal Turzo, Amiangshu Bosu

Abstract: Automated filtering of toxic conversations may help an Open-source software (OSS) community to maintain healthy interactions among the project participants. Although several general-purpose tools exist to identify toxic content, those may incorrectly flag some words commonly used in the Software Engineering (SE) context as toxic (e.g., 'junk', 'kill', and 'dump') and vice versa. To counter this challenge, an SE-specific tool has been proposed by the CMU Strudel Lab (referred to as 'STRUDEL' hereinafter) by combining the output of the Perspective API with the output from a customized version of Stanford's Politeness detector tool. However, since STRUDEL's evaluation was very limited, with only 654 SE texts, its practical applicability is unclear. Therefore, this study aims to empirically evaluate the STRUDEL tool as well as four state-of-the-art general-purpose toxicity detectors on a large-scale SE dataset. Toward this goal, we empirically developed a rubric to manually label toxic SE interactions. Using this rubric, we manually labeled a dataset of 6,533 code review comments and 4,140 Gitter messages. The results of our analyses suggest significant degradation of all tools' performances on our datasets. Those degradations were significantly higher on our dataset of formal SE communication such as code review than on our dataset of informal communication such as Gitter messages. Two of the models from our study showed significant performance improvements during 10-fold cross validations after we retrained those on our SE datasets. Based on our manual investigations of the incorrectly classified text, we have identified several recommendations for developing an SE-specific toxicity detector.

Submitted 19 September, 2020; originally announced September 2020.

Journal ref: Proceedings of the 27th Asia-Pacific Software Engineering Conference (APSEC 2020)

Showing 1–7 of 7 results for author: Turzo, A K