-
Automated Identification of Sexual Orientation and Gender Identity Discriminatory Texts from Issue Comments
Authors:
Sayma Sultana,
Jaydeb Sarker,
Farzana Israt,
Rajshakhar Paul,
Amiangshu Bosu
Abstract:
In an industry dominated by straight men, many developers representing other gender identities and sexual orientations often encounter hateful or discriminatory messages. Such communications pose barriers to participation for women and LGBTQ+ persons. Due to sheer volume, manual inspection of all communications for discriminatory communication is infeasible for a large-scale Free Open-Source Softw…
▽ More
In an industry dominated by straight men, many developers representing other gender identities and sexual orientations often encounter hateful or discriminatory messages. Such communications pose barriers to participation for women and LGBTQ+ persons. Due to sheer volume, manual inspection of all communications for discriminatory communication is infeasible for a large-scale Free Open-Source Software (FLOSS) community. To address this challenge, this study aims to develop an automated mechanism to identify Sexual orientation and Gender identity Discriminatory (SGID) texts from software developers' communications. On this goal, we trained and evaluated SGID4SE ( Sexual orientation and Gender Identity Discriminatory text identification for (4) Software Engineering texts) as a supervised learning-based SGID detection tool. SGID4SE incorporates six preprocessing steps and ten state-of-the-art algorithms. SGID4SE implements six different strategies to improve the performance of the minority class. We empirically evaluated each strategy and identified an optimum configuration for each algorithm. In our ten-fold cross-validation-based evaluations, a BERT-based model boosts the best performance with 85.9% precision, 80.0% recall, and 82.9% F1-Score for the SGID class. This model achieves 95.7% accuracy and 80.4% Matthews Correlation Coefficient. Our dataset and tool establish a foundation for further research in this direction.
△ Less
Submitted 14 November, 2023;
originally announced November 2023.
-
Towards Automated Classification of Code Review Feedback to Support Analytics
Authors:
Asif Kamal Turzo,
Fahim Faysal,
Ovi Poddar,
Jaydeb Sarker,
Anindya Iqbal,
Amiangshu Bosu
Abstract:
Background: As improving code review (CR) effectiveness is a priority for many software development organizations, projects have deployed CR analytics platforms to identify potential improvement areas. The number of issues identified, which is a crucial metric to measure CR effectiveness, can be misleading if all issues are placed in the same bin. Therefore, a finer-grained classification of issue…
▽ More
Background: As improving code review (CR) effectiveness is a priority for many software development organizations, projects have deployed CR analytics platforms to identify potential improvement areas. The number of issues identified, which is a crucial metric to measure CR effectiveness, can be misleading if all issues are placed in the same bin. Therefore, a finer-grained classification of issues identified during CRs can provide actionable insights to improve CR effectiveness. Although a recent work by Fregnan et al. proposed automated models to classify CR-induced changes, we have noticed two potential improvement areas -- i) classifying comments that do not induce changes and ii) using deep neural networks (DNN) in conjunction with code context to improve performances. Aims: This study aims to develop an automated CR comment classifier that leverages DNN models to achieve a more reliable performance than Fregnan et al. Method: Using a manually labeled dataset of 1,828 CR comments, we trained and evaluated supervised learning-based DNN models leveraging code context, comment text, and a set of code metrics to classify CR comments into one of the five high-level categories proposed by Turzo and Bosu. Results: Based on our 10-fold cross-validation-based evaluations of multiple combinations of tokenization approaches, we found a model using CodeBERT achieving the best accuracy of 59.3%. Our approach outperforms Fregnan et al.'s approach by achieving 18.7% higher accuracy. Conclusion: Besides facilitating improved CR analytics, our proposed model can be useful for developers in prioritizing code review feedback and selecting reviewers.
△ Less
Submitted 7 July, 2023;
originally announced July 2023.
-
Automated Identification of Toxic Code Reviews Using ToxiCR
Authors:
Jaydeb Sarker,
Asif Kamal Turzo,
Ming Dong,
Amiangshu Bosu
Abstract:
Toxic conversations during software development interactions may have serious repercussions on a Free and Open Source Software (FOSS) development project. For example, victims of toxic conversations may become afraid to express themselves, therefore get demotivated, and may eventually leave the project. Automated filtering of toxic conversations may help a FOSS community to maintain healthy intera…
▽ More
Toxic conversations during software development interactions may have serious repercussions on a Free and Open Source Software (FOSS) development project. For example, victims of toxic conversations may become afraid to express themselves, therefore get demotivated, and may eventually leave the project. Automated filtering of toxic conversations may help a FOSS community to maintain healthy interactions among its members. However, off-the-shelf toxicity detectors perform poorly on Software Engineering (SE) datasets, such as one curated from code review comments. To encounter this challenge, we present ToxiCR, a supervised learning-based toxicity identification tool for code review interactions. ToxiCR includes a choice to select one of the ten supervised learning algorithms, an option to select text vectorization techniques, eight preprocessing steps, and a large-scale labeled dataset of 19,571 code review comments. Two out of those eight preprocessing steps are SE domain specific. With our rigorous evaluation of the models with various combinations of preprocessing steps and vectorization techniques, we have identified the best combination for our dataset that boosts 95.8% accuracy and 88.9% F1 score. ToxiCR significantly outperforms existing toxicity detectors on our dataset. We have released our dataset, pre-trained models, evaluation results, and source code publicly available at: https://github.com/WSU-SEAL/ToxiCR
△ Less
Submitted 7 February, 2023; v1 submitted 25 February, 2022;
originally announced February 2022.
-
A Benchmark Study of the Contemporary Toxicity Detectors on Software Engineering Interactions
Authors:
Jaydeb Sarker,
Asif Kamal Turzo,
Amiangshu Bosu
Abstract:
Automated filtering of toxic conversations may help an Open-source software (OSS) community to maintain healthy interactions among the project participants. Although, several general purpose tools exist to identify toxic contents, those may incorrectly flag some words commonly used in the Software Engineering (SE) context as toxic (e.g., 'junk', 'kill', and 'dump') and vice versa. To encounter thi…
▽ More
Automated filtering of toxic conversations may help an Open-source software (OSS) community to maintain healthy interactions among the project participants. Although, several general purpose tools exist to identify toxic contents, those may incorrectly flag some words commonly used in the Software Engineering (SE) context as toxic (e.g., 'junk', 'kill', and 'dump') and vice versa. To encounter this challenge, an SE specific tool has been proposed by the CMU Strudel Lab (referred as the `STRUDEL' hereinafter) by combining the output of the Perspective API with the output from a customized version of the Stanford's Politeness detector tool. However, since STRUDEL's evaluation was very limited with only 654 SE text, its practical applicability is unclear. Therefore, this study aims to empirically evaluate the Strudel tool as well as four state-of-the-art general purpose toxicity detectors on a large scale SE dataset. On this goal, we empirically developed a rubric to manually label toxic SE interactions. Using this rubric, we manually labeled a dataset of 6,533 code review comments and 4,140 Gitter messages. The results of our analyses suggest significant degradation of all tools' performances on our datasets. Those degradations were significantly higher on our dataset of formal SE communication such as code review than on our dataset of informal communication such as Gitter messages. Two of the models from our study showed significant performance improvements during 10-fold cross validations after we retrained those on our SE datasets. Based on our manual investigations of the incorrectly classified text, we have identified several recommendations for develo** an SE specific toxicity detector.
△ Less
Submitted 19 September, 2020;
originally announced September 2020.