Bridging Gaps, Building Futures: Advancing Software Developer Diversity and Inclusion Through Future-Oriented Research

Authors: Sonja M. Hyrynsalmi, Sebastian Baltes, Chris Brown, Rafael Prikladnicki, Gema Rodriguez-Perez, Alexander Serebrenik, Jocelyn Simmonds, Bianca Trinkenreich, Yi Wang, Grischa Liebel

Abstract: Software systems are responsible for nearly all aspects of modern life and society. However, the demographics of software development teams that are tasked with designing and maintaining these software systems rarely match the demographics of users. As the landscape of software engineering (SE) evolves due to technological innovations, such as the rise of automated programming assistants powered by artificial intelligence (AI) and machine learning, more effort is needed to promote software developer diversity and inclusion (SDDI) to ensure inclusive work environments for development teams and usable software for diverse populations. To this end, we present insights from SE researchers and practitioners on challenges and solutions regarding diversity and inclusion in SE. Based on these findings, we share potential utopian and dystopian visions of the future and provide future research directions and implications for academia and industry to promote SDDI in the age of AI-driven SE.

Submitted 10 April, 2024; originally announced April 2024.

arXiv:2401.13802

Investigating the Efficacy of Large Language Models for Code Clone Detection

Authors: Mohamad Khajezade, Jie JW Wu, Fatemeh Hendijani Fard, Gema Rodríguez-Pérez, Mohamed Sami Shehata

Abstract: Large Language Models (LLMs) have demonstrated remarkable success in various natural language processing and software engineering tasks, such as code generation. The LLMs are mainly utilized in the prompt-based zero/few-shot paradigm to guide the model in accomplishing the task. GPT-based models are among the popular ones studied for tasks such as code comment generation or test generation. These tasks are 'generative' tasks. However, there is limited research on the usage of LLMs for 'non-generative' tasks such as classification using the prompt-based paradigm. In this preliminary exploratory study, we investigated the applicability of LLMs for Code Clone Detection (CCD), a non-generative task. By building a mono-lingual and cross-lingual CCD dataset derived from CodeNet, we first investigated two different prompts using ChatGPT to detect Type-4 code clones in Java-Java and Java-Ruby pairs in a zero-shot setting. We then conducted an analysis to understand the strengths and weaknesses of ChatGPT in CCD. ChatGPT surpasses the baselines in cross-language CCD, attaining an F1-score of 0.877, and achieves comparable performance to fully fine-tuned models for mono-lingual CCD, with an F1-score of 0.878. Also, the prompt and the difficulty level of the problems have an impact on the performance of ChatGPT. Finally, we provide insights and future directions based on our initial analysis.

Submitted 30 January, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

arXiv:2204.04318

Towards Understanding Barriers and Mitigation Strategies of Software Engineers with Non-traditional Educational and Occupational Backgrounds

Authors: Tavian Barnes, Ken Jen Lee, Cristina Tavares, Gema Rodríguez-Pérez, Meiyappan Nagappan

Abstract: The traditional path to a software engineering career involves a post-secondary diploma in Software Engineering, Computer Science, or a related field. However, many software engineers take a non-traditional path to their career, starting from other industries or fields of study. This paper proposes a study on barriers faced by software engineers with non-traditional educational and occupational backgrounds, and possible mitigation strategies for those barriers. We propose a two-stage methodology, consisting of an exploratory study, followed by a validation study. The exploratory study will involve a grounded-theory-based qualitative analysis of relevant Reddit data to yield a framework around the barriers and possible mitigation strategies. These findings will then be validated using a survey in the validation study. Making software engineering more accessible to those with non-traditional backgrounds will not only bring about the benefits of functional diversity, but also serve as a method of filling in the labour shortages of the software engineering industry.

Submitted 8 April, 2022; originally announced April 2022.

Comments: 8 pages, 5 figures, accepted at the MSR 2022 Registered Reports Track as a Continuity Acceptance (CA)

ACM Class: D.2; K.4.2

arXiv:2104.06143

On the Relationship Between the Developer's Perceptible Race and Ethnicity and the Evaluation of Contributions in OSS

Authors: Reza Nadri, Gema Rodríguez-Pérez, Meiyappan Nagappan

Abstract: Open Source Software (OSS) projects are typically the result of collective efforts performed by developers with different backgrounds. Although the quality of developers' contributions should be the only factor influencing the evaluation of the contributions to OSS projects, recent studies have shown that diversity issues are correlated with the acceptance or rejection of developers' contributions. This paper assists this emerging state-of-the-art body on diversity research with the first empirical study that analyzes how developers' perceptible race and ethnicity relate to the evaluation of the contributions in OSS. We performed a large-scale quantitative study of OSS projects in GitHub. We extracted the developers' perceptible race and ethnicity from their names in GitHub using the Name-Prism tool and applied regression modeling of contributions (i.e., pull requests) data from GHTorrent and GitHub. We observed that among the developers whose perceptible race and ethnicity was captured by the tool, only 16.56% were perceptible as Non-White developers; contributions from perceptible White developers have about 6-10% higher odds of being accepted when compared to contributions from perceptible Non-White developers; and submitters with perceptible Non-White races and ethnicities are more likely to get their pull requests accepted when the integrator is estimated to be from their same race and ethnicity rather than when the integrator is estimated to be White. Our initial analysis shows a low number of Non-White developers participating in OSS. Furthermore, the results from our regression analysis lead us to believe that there may exist differences between the evaluation of the contributions from different perceptible races and ethnicities. Thus, our findings reinforce the need for further studies on racial and ethnic diversity in software engineering to foster healthier OSS communities.

Submitted 13 April, 2021; originally announced April 2021.

arXiv:2103.15180

doi 10.1109/TSE.2020.3021380

Watch out for Extrinsic Bugs! A Case Study of their Impact in Just-In-Time Bug Prediction Models on the OpenStack project

Authors: Gema Rodriguez-Perez, Meiyappan Nagappan, Gregorio Robles

Abstract: Intrinsic bugs are bugs for which a bug introducing change can be identified in the version control system of a software. In contrast, extrinsic bugs are caused by external changes to a software, such as errors in external APIs; thereby they do not have an explicit bug introducing change in the version control system. Although most previous research literature has assumed that all bugs are of intrinsic nature, in a previous study, we show that not all bugs are intrinsic. This paper shows an example of how considering extrinsic bugs can affect software engineering research. Specifically, we study the impact of extrinsic bugs in Just In Time bug prediction by partially replicating a recent study by McIntosh and Kamei on JIT models. These models are trained using properties of earlier bug-introducing changes. Since extrinsic bugs do not have bug introducing changes in the version control system, we manually curate McIntosh and Kamei's dataset to distinguish between intrinsic and extrinsic bugs. Then, we address their original research questions, this time removing extrinsic bugs, to study whether bug-introducing changes are a moving target in Just-In-Time bug prediction. Finally, we study whether characteristics of intrinsic and extrinsic bugs are different. Our results show that intrinsic and extrinsic bugs are of different nature. When removing extrinsic bugs the performance is different up to 16% Area Under the Curve points. This indicates that our JIT models obtain a more accurate representation of the real world. We conclude that extrinsic bugs negatively impact Just-In-Time models. Furthermore, we offer evidence that extrinsic bugs should be further investigated, as they can significantly impact how software engineers understand bugs.

Submitted 28 March, 2021; originally announced March 2021.

Comments: in IEEE Transactions on Software Engineering, 2020

arXiv:2011.06244

A Fine-grained Data Set and Analysis of Tangling in Bug Fixing Commits

Authors: Steffen Herbold, Alexander Trautsch, Benjamin Ledel, Alireza Aghamohammadi, Taher Ahmed Ghaleb, Kuljit Kaur Chahal, Tim Bossenmaier, Bhaveet Nagaria, Philip Makedonski, Matin Nili Ahmadabadi, Kristof Szabados, Helge Spieker, Matej Madeja, Nathaniel Hoy, Valentina Lenarduzzi, Shangwen Wang, Gema Rodríguez-Pérez, Ricardo Colomo-Palacios, Roberto Verdecchia, Paramvir Singh, Yihao Qin, Debasish Chakroborti, Willard Davis, Vijay Walunj, Hongjun Wu, et al. (23 additional authors not shown)

Abstract: Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs. Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits. Methods: We use a crowd sourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus. Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case. Conclusion: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy, until proven otherwise.

Submitted 13 October, 2021; v1 submitted 12 November, 2020; originally announced November 2020.

Comments: Status: Accepted at Empirical Software Engineering

Showing 1–6 of 6 results for author: Rodríguez-Pérez, G