Legal Aspects for Software Developers Interested in Generative AI Applications

Authors: Steffen Herbold, Brian Valerius, Anamaria Mojica-Hanke, Isabella Lex, Joel Mittel

Abstract: Recent successes in Generative Artificial Intelligence (GenAI) have led to new technologies capable of generating high-quality code, natural language, and images. The next step is to integrate GenAI technology into products, a task typically conducted by software developers. Such product development always comes with a certain risk of liability. Within this article, we want to shed light on the current state of two such risks: data protection and copyright. Both aspects are crucial for GenAI. This technology deals with data for both model training and generated output. We summarize key aspects regarding our current knowledge that every software developer involved in product development using GenAI should be aware of to avoid critical mistakes that may expose them to liability claims.

Submitted 25 April, 2024; originally announced April 2024.

Comments: Submission under review

arXiv:2309.12697 [pdf, other]

Semantic similarity prediction is better than other semantic similarity measures

Authors: Steffen Herbold

Abstract: Semantic similarity between natural language texts is typically measured either by looking at the overlap between subsequences (e.g., BLEU) or by using embeddings (e.g., BERTScore, S-BERT). Within this paper, we argue that when we are only interested in measuring the semantic similarity, it is better to directly predict the similarity using a fine-tuned model for such a task. Using a fine-tuned model for the Semantic Textual Similarity Benchmark tasks (STS-B) from the GLUE benchmark, we define the STSScore approach and show that the resulting similarity is better aligned with our expectations on a robust semantic similarity measure than other approaches.

Submitted 17 January, 2024; v1 submitted 22 September, 2023; originally announced September 2023.

Comments: Accepted at TMLR: https://openreview.net/forum?id=bfsNmgN5je

arXiv:2308.12095 [pdf, other]

On Using Information Retrieval to Recommend Machine Learning Good Practices for Software Engineers

Authors: Laura Cabra-Acela, Anamaria Mojica-Hanke, Mario Linares-Vásquez, Steffen Herbold

Abstract: Machine learning (ML) is nowadays widely used for different purposes and in several disciplines. From self-driving cars to automated medical diagnosis, machine learning models extensively support users' daily activities, and software engineering tasks are no exception. Not embracing good ML practices may lead to pitfalls that hinder the performance of an ML system and potentially lead to unexpected results. Despite the existence of documentation and literature about ML best practices, many non-ML experts turn towards gray literature like blogs and Q&A systems when looking for help and guidance when implementing ML systems. To better aid users in distilling relevant knowledge from such sources, we propose a recommender system that recommends ML practices based on the user's context. As a first step in creating a recommender system for machine learning practices, we implemented Idaka, a tool that provides two different approaches for retrieving/generating ML best practices: i) an information retrieval (IR) engine and ii) a large language model. The IR engine uses BM25 as the algorithm for retrieving the practices; the large language model is, in our case, Alpaca. The platform has been designed to allow comparative studies of best practices retrieval tools. Idaka is publicly available at GitHub: https://bit.ly/idaka. Video: https://youtu.be/cEb-AhIPxnM.

Submitted 25 August, 2023; v1 submitted 23 August, 2023; originally announced August 2023.

Comments: Accepted for Publication at ESEC/FSE demonstrations track
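
The abstract above names BM25 as the retrieval algorithm of the IR engine. As a rough illustration only, the following sketch scores a few tokenized texts against a query with the standard Okapi BM25 formula. It is not Idaka's implementation: the function name `bm25_scores`, the parameter defaults `k1=1.5` and `b=0.75`, and the example practice snippets are all assumptions made for this sketch.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query (Okapi BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1.0" keeps the value non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical practice snippets, invented for illustration (not from Idaka).
practices = [
    "normalize features before training a model".split(),
    "use cross validation to estimate model performance".split(),
    "keep the test set separate from training data".split(),
]
query = "cross validation performance".split()
scores = bm25_scores(query, practices)
best = max(range(len(practices)), key=scores.__getitem__)
print(" ".join(practices[best]))
```

Documents that share no term with the query receive a score of zero, so only the second snippet is ranked here. A real engine would add tokenization, stemming, and an inverted index on top of this scoring step.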

arXiv:2304.14276 [pdf, other]

AI, write an essay for me: A large-scale comparison of human-written versus ChatGPT-generated essays

Authors: Steffen Herbold, Annette Hautli-Janisz, Ute Heuer, Zlata Kikteva, Alexander Trautsch

Abstract: Background: Recently, ChatGPT and similar generative AI models have attracted hundreds of millions of users and become part of the public discourse. Many believe that such models will disrupt society and will result in a significant change in the education system and information generation in the future. So far, this belief is based on either colloquial evidence or benchmarks from the owners of the models -- both lack scientific rigour. Objective: Through a large-scale study comparing human-written versus ChatGPT-generated argumentative student essays, we systematically assess the quality of the AI-generated content. Methods: A large corpus of essays was rated using standard criteria by a large number of human experts (teachers). We augment the analysis with a consideration of the linguistic characteristics of the generated essays. Results: Our results demonstrate that ChatGPT generates essays that are rated higher for quality than human-written essays. The writing style of the AI models exhibits linguistic characteristics that are different from those of the human-written essays, e.g., it is characterized by fewer discourse and epistemic markers, but more nominalizations and greater lexical diversity. Conclusions: Our results clearly demonstrate that models like ChatGPT outperform humans in generating argumentative essays. Since the technology is readily available for anyone to use, educators must act immediately. We must re-invent homework and develop teaching concepts that utilize these AI models in the same way as math utilized the calculator: teach the general concepts first and then use AI tools to free up time for other learning objectives.

Submitted 24 April, 2023; originally announced April 2023.

Comments: Submitted

arXiv:2304.06367 [pdf, other]

Understanding issues related to personal data and data protection in open source projects on GitHub

Authors: Anne Henning, Lukas Schulte, Steffen Herbold, Oksana Kulyk, Peter Mayer

Abstract: Context: Data protection regulations such as the GDPR and the CCPA affect how software may handle the personal data of its users and how consent for handling of such data may be given. Prior literature focused on how this works in operation, but lacks a perspective of the impact on the software development process. Objective: Within our work, we will address this gap and explore how software development itself is impacted. We want to understand which data protection-related issues are reported, who reports them, and how developers react to such issues. Method: We will conduct an exploratory study based on issues that are reported with respect to data protection in open source software on GitHub. We will determine the roles of the actors involved, the status of such issues, and we use inductive coding to understand the data protection issues. We qualitatively analyze the issues as part of the inductive coding and further explore the reasoning for resolutions. We quantitatively analyze the relation between the roles, resolutions, and data protection issues to understand correlations.

Submitted 13 April, 2023; originally announced April 2023.

Comments: Registered Report with Continuity Acceptance (CA) for submission to Empirical Software Engineering granted by RR-Committee of the MSR'23

arXiv:2304.05358 [pdf, ps, other]

An exploratory study of bug-introducing changes: what happens when bugs are introduced in open source software?

Authors: Lukas Schulte, Anamaria Mojica-Hanke, Mario Linares-Vásquez, Steffen Herbold

Abstract: Context: Many studies consider the relation between individual aspects and bug-introduction, e.g., software testing and code review. Due to the design of the studies the results are usually only about correlations as interactions or interventions are not considered. Objective: Within this study, we want to narrow this gap and provide a broad empirical view on aspects of software development and their relation to bug-introducing changes. Method: We consider the bugs, the type of work when the bug was introduced, aspects of the build process, code review, software tests, and any other discussion related to the bug that we can identify. We use a qualitative approach that first describes variables of the development process and then groups the variables based on their relations. From these groups, we can induce how their (pair-wise) interactions affect bug-introducing changes.

Submitted 11 April, 2023; originally announced April 2023.

Comments: Registered Report with Continuity Acceptance (CA) for submission to Empirical Software Engineering granted by RR-Committee of the MSR'23

arXiv:2301.10516 [pdf, other]

What are the Machine Learning best practices reported by practitioners on Stack Exchange?

Authors: Anamaria Mojica-Hanke, Andrea Bayona, Mario Linares-Vásquez, Steffen Herbold, Fabio A. González

Abstract: Machine Learning (ML) is being used in multiple disciplines due to its powerful capability to infer relationships within data. In particular, Software Engineering (SE) is one of those disciplines in which ML has been used for multiple tasks, like software categorization, bug prediction, and testing. In addition to the multiple ML applications, some studies have been conducted to detect and understand possible pitfalls and issues when using ML. However, to the best of our knowledge, only a few studies have focused on presenting ML best practices or guidelines for the application of ML in different domains. In addition, the practices presented in previous literature (i) are domain-specific (e.g., concrete practices in biomechanics), (ii) are few in number, or (iii) lack rigorous validation and are presented in gray literature. In this paper, we present a study listing 127 ML best practices, obtained by systematically mining 242 posts of 14 different Stack Exchange (STE) websites and validated by four independent ML experts. The list of practices is presented in a set of categories related to different stages of the implementation process of an ML-enabled system; for each practice, we include explanations and examples. In all the practices, the provided examples focus on SE tasks. We expect this list of practices could help practitioners to understand better the practices and use ML in a more informed way, in particular newcomers to this new area that sits at the intersection of software engineering and machine learning.

Submitted 25 January, 2023; originally announced January 2023.

arXiv:2209.07623 [pdf, other]

Studying the explanations for the automated prediction of bug and non-bug issues using LIME and SHAP

Authors: Benjamin Ledel, Steffen Herbold

Abstract: Context: The identification of bugs within the reported issues in an issue tracker is crucial for the triage of issues. Machine learning models have shown promising results regarding the performance of automated issue type prediction. However, we have only limited knowledge beyond our assumptions how such models identify bugs. LIME and SHAP are popular techniques to explain the predictions of classifiers. Objective: We want to understand if machine learning models provide explanations for the classification that are reasonable to us as humans and align with our assumptions of what the models should learn. We also want to know if the prediction quality is correlated with the quality of explanations. Method: We conduct a study where we rate LIME and SHAP explanations based on their quality of explaining the outcome of an issue type prediction model. For this, we rate the quality of the explanations themselves, i.e., if they align with our expectations and if they help us to understand the underlying machine learning model.

Submitted 15 September, 2022; originally announced September 2022.

Comments: This registered report received an In-Principle Acceptance (IPA) in the ESEM 2022 RR track

arXiv:2207.11976 [pdf, other]

Differential testing for machine learning: an analysis for classification algorithms beyond deep learning

Authors: Steffen Herbold, Steffen Tunkel

Abstract: Context: Differential testing is a useful approach that uses different implementations of the same algorithms and compares the results for software testing. In recent years, this approach was successfully used for test campaigns of deep learning frameworks. Objective: There is little knowledge on the application of differential testing beyond deep learning. Within this article, we want to close this gap for classification algorithms. Method: We conduct a case study using Scikit-learn, Weka, Spark MLlib, and Caret in which we identify the potential of differential testing by considering which algorithms are available in multiple frameworks, the feasibility by identifying pairs of algorithms that should exhibit the same behavior, and the effectiveness by executing tests for the identified pairs and analyzing the deviations. Results: While we found a large potential for popular algorithms, the feasibility seems limited because often it is not possible to determine configurations that are the same in other frameworks. The execution of the feasible tests revealed that there is a large amount of deviations for the scores and classes. Only a lenient approach based on statistical significance of classes does not lead to a huge amount of test failures. Conclusions: The potential of differential testing beyond deep learning seems limited for research into the quality of machine learning libraries. Practitioners may still use the approach if they have deep knowledge about implementations, especially if a coarse oracle that only considers significant differences of classes is sufficient.

Submitted 25 July, 2022; originally announced July 2022.

Comments: Under review
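
The idea of differential testing described in the abstract above (run two implementations of the same algorithm on the same inputs and compare classes and scores) can be sketched in a few lines. This is a toy under stated assumptions, not the study's setup: the two nearest-centroid classifiers `predict_a` and `predict_b`, the helper `differential_test`, the tolerance, and the data are all hypothetical and stand in for the framework pairs (Scikit-learn, Weka, Spark MLlib, Caret) that the paper actually examines.

```python
import math


def predict_a(x, centroids):
    """Implementation A: nearest centroid via Euclidean distance."""
    scores = {label: math.dist(x, c) for label, c in centroids.items()}
    return min(scores, key=scores.get), scores


def predict_b(x, centroids):
    """Implementation B: same algorithm, computed via squared distances."""
    squared = {
        label: sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        for label, c in centroids.items()
    }
    label = min(squared, key=squared.get)
    return label, {k: math.sqrt(v) for k, v in squared.items()}


def differential_test(inputs, centroids, tol=1e-9):
    """Return the inputs on which the predicted classes or scores deviate."""
    class_deviations, score_deviations = [], []
    for x in inputs:
        label_a, scores_a = predict_a(x, centroids)
        label_b, scores_b = predict_b(x, centroids)
        if label_a != label_b:
            class_deviations.append(x)
        # Scores are compared with a tolerance to allow floating-point noise.
        if any(
            not math.isclose(scores_a[k], scores_b[k], rel_tol=tol, abs_tol=tol)
            for k in scores_a
        ):
            score_deviations.append(x)
    return class_deviations, score_deviations


# Hypothetical centroids and inputs, chosen so that no distance ties occur.
centroids = {"neg": (0.0, 0.0), "pos": (4.0, 4.0)}
inputs = [(1.0, 0.5), (3.0, 3.5), (0.2, 1.9)]
class_dev, score_dev = differential_test(inputs, centroids)
print(len(class_dev), len(score_dev))
```

Comparing classes separately from scores mirrors the distinction drawn in the abstract: an oracle that only checks classes is more lenient than one that also requires matching scores.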

arXiv:2205.01335 [pdf, ps, other]

Predicting Issue Types with seBERT

Authors: Alexander Trautsch, Steffen Herbold

Abstract: Pre-trained transformer models are the current state-of-the-art for natural language processing. seBERT is such a model that was developed based on the BERT architecture, but trained from scratch with software engineering data. We fine-tuned this model for the NLBSE challenge for the task of issue type prediction. Our model dominates the baseline fastText for all three issue types in both recall and precision to achieve an overall F1-score of 85.7%, which is an increase of 4.1% over the baseline.

Submitted 3 May, 2022; originally announced May 2022.

Comments: Accepted for Publication at the NLBSE'22 Tool Competition

arXiv:2111.09188 [pdf, other]

Are automated static analysis tools worth it? An investigation into relative warning density and external software quality

Authors: Alexander Trautsch, Steffen Herbold, Jens Grabowski

Abstract: Automated Static Analysis Tools (ASATs) are part of software development best practices. ASATs are able to warn developers about potential problems in the code. On the one hand, ASATs are based on best practices so there should be a noticeable effect on software quality. On the other hand, ASATs suffer from false positive warnings, which developers have to inspect and then ignore or mark as invalid. In this article, we ask the question if ASATs have a measurable impact on external software quality, using the example of PMD for Java. We investigate the relationship between ASAT warnings emitted by PMD and defects per change and per file. Our case study includes data for the history of each file as well as the differences between changed files and the project in which they are contained. We investigate whether files that induce a defect have more static analysis warnings than the rest of the project. Moreover, we investigate the impact of two different sets of ASAT rules. We find that bug-inducing files contain fewer static analysis warnings than other files of the project at that point in time. However, this can be explained by the overall decreasing warning density. When compared with all other changes, we find a statistically significant difference in one metric for all rules and two metrics for a subset of rules. However, the effect size is negligible in all cases, showing that the actual difference in warning density between bug-inducing changes and other changes is small at best.

Submitted 18 November, 2021; v1 submitted 17 November, 2021; originally announced November 2021.

arXiv:2109.11902 [pdf, other]

Broccoli: Bug localization with the help of text search engines

Authors: Benjamin Ledel, Steffen Herbold

Abstract: Bug localization is a tedious activity in the bug fixing process in which a software developer tries to locate bugs in the source code described in a bug report. Since this process is time-consuming and requires additional knowledge about the software project, information retrieval techniques can aid the bug localization process. In this paper, we investigate if normal text search engines can improve existing bug localization approaches. In a case study, we evaluate the performance of our search engine approach Broccoli against seven state-of-the-art bug localization algorithms on 82 open source projects in two data sets. Our results show that including a search engine can increase the performance of the bug localization and that it is a useful extension to existing approaches. As part of our analysis we also exposed a flaw in a commonly used benchmark strategy, i.e., that files of a single release are considered. To increase the number of detectable files, we mitigate this flaw by considering the state of the software repository at the time of the bug report. Our results show that using single releases may lead to an underestimation of the prediction performance.

Submitted 10 October, 2021; v1 submitted 24 September, 2021; originally announced September 2021.

arXiv:2109.04738 [pdf, other]

On the validity of pre-trained transformers for natural language processing in the software engineering domain

Authors: Julian von der Mosel, Alexander Trautsch, Steffen Herbold

Abstract: Transformers are the current state-of-the-art of natural language processing in many domains and are gaining traction within software engineering research as well. Such models are pre-trained on large amounts of data, usually from the general domain. However, we only have a limited understanding regarding the validity of transformers within the software engineering domain, i.e., how good such models are at understanding words and sentences within a software engineering context and how this improves the state-of-the-art. Within this article, we shed light on this complex, but crucial issue. We compare BERT transformer models trained with software engineering data with transformers based on general domain data in multiple dimensions: their vocabulary, their ability to understand which words are missing, and their performance in classification tasks. Our results show that for tasks that require understanding of the software engineering context, pre-training with software engineering data is valuable, while general domain models are sufficient for general language understanding, also within the software engineering domain.

Submitted 12 May, 2022; v1 submitted 10 September, 2021; originally announced September 2021.

Comments: Review status: submitted

arXiv:2109.03544 [pdf, other]

What really changes when developers intend to improve their source code: a commit-level study of static metric value and static analysis warning changes

Authors: Alexander Trautsch, Johannes Erbel, Steffen Herbold, Jens Grabowski

Abstract: Many software metrics are designed to measure aspects that are believed to be related to software quality. Static software metrics, e.g., size, complexity and coupling are used in defect prediction research as well as software quality models to evaluate software quality. While this indicates a relationship between quality and software metrics, the extent of it is not well understood. Moreover, recent studies found that complexity metrics may be unreliable indicators for understandability of the source code. To explore this relationship, we leverage the intent of developers about what constitutes a quality improvement in their own code base. We manually classify a randomized sample of 2,533 commits from 54 Java open source projects as quality improving depending on the intent of the developer by inspecting the commit message. We distinguish between perfective and corrective maintenance via predefined guidelines and use this data as ground truth for the fine-tuning of a state-of-the-art deep learning model for natural language processing. We use the model to increase our data set to 125,482 commits. Based on the resulting data set, we investigate the differences in size and 14 static source code metrics between changes that increase quality, as indicated by the developer, and other changes. We find that quality improving commits are smaller than other commits. Perfective changes have a positive impact on static source code metrics while corrective changes tend to add complexity. Furthermore, we find that files which are the target of perfective maintenance already have a lower median complexity than other files. Our study results provide empirical evidence for which static source code metrics capture quality improvement from the developers' point of view. This has implications for program understanding as well as code smell detection and recommender systems.

Submitted 30 May, 2022; v1 submitted 8 September, 2021; originally announced September 2021.

arXiv:2104.02517 [pdf, other]

A new perspective on the competent programmer hypothesis through the reproduction of bugs with repeated mutations

Authors: Zaheed Ahmed, Eike Stein, Steffen Herbold, Fabian Trautsch, Jens Grabowski

Abstract: The competent programmer hypothesis states that most programmers are competent enough to create correct or almost correct source code. Because this implies that bugs should usually manifest through small variations of the correct code, the competent programmer hypothesis is one of the fundamental assumptions of mutation testing. Unfortunately, it is still unclear if the competent programmer hypothesis holds and past research presents contradictory claims. Within this article, we provide a new perspective on the competent programmer hypothesis and its relation to mutation testing. We try to re-create real-world bugs through chains of mutations to understand if there is a direct link between mutation testing and bugs. The lengths of these paths help us to understand if the source code is really almost correct, or if large variations are required. Our results indicate that while the competent programmer hypothesis seems to be true, mutation testing is missing important operators to generate representative real-world bugs.

Submitted 15 May, 2023; v1 submitted 6 April, 2021; originally announced April 2021.

Comments: Submitted and under review

arXiv:2104.00566 [pdf, other]

Exploring the relationship between performance metrics and cost saving potential of defect prediction models

Authors: Steffen Tunkel, Steffen Herbold

Abstract: Context: Performance metrics are a core component of the evaluation of any machine learning model and used to compare models and estimate their usefulness. Recent work started to question the validity of many performance metrics for this purpose in the context of software defect prediction. Objective: Within this study, we explore the relationship between performance metrics and the cost saving potential of defect prediction models. We study whether performance metrics are suitable proxies to evaluate the cost saving capabilities and derive a theory for the relationship between performance metrics and cost saving potential. Methods: We measure performance metrics and cost saving potential in defect prediction experiments. We use a multinomial logit model, a decision tree, and a random forest to model the relationship between the metrics and the cost savings. Results: We could not find a stable relationship between cost savings and performance metrics. We attribute the lack of the relationship to the inability of performance metrics to account for the property that a small proportion of very large software artifacts are the main driver of the costs. Conclusion: Any defect prediction study interested in finding the best prediction model must consider cost savings directly, because no reasonable claims regarding the economic benefits of defect prediction can be made otherwise.

Submitted 27 July, 2022; v1 submitted 1 April, 2021; originally announced April 2021.

Comments: Under review

arXiv:2103.00255 [pdf, other]

doi 10.1121/10.0009322

Expert Decision Support System for aeroacoustic source type identification using clustering

Authors: Armin Goudarzi, Carsten Spehr, Steffen Herbold

Abstract: This paper presents an Expert Decision Support System for the identification of time-invariant, aeroacoustic source types. The system comprises two steps: first, acoustic properties are calculated based on spectral and spatial information. Second, clustering is performed based on these properties. The clustering aims at helping and guiding an expert for quick identification of different source types, providing an understanding of how sources differ. This supports the expert in determining similar or atypical behavior. A variety of features are proposed for capturing the characteristics of the sources. These features represent aeroacoustic properties that can be interpreted by both the machine and by experts. The features are independent of the absolute Mach number which enables the proposed method to cluster data measured at different flow configurations. The method is evaluated on deconvolved beamforming data from two scaled airframe half-model measurements. For this exemplary data, the proposed support system method results in clusters that mostly correspond to the source types identified by the authors. The clustering also provides the mean feature values and the cluster hierarchy for each cluster and for each cluster member a clustering confidence. This additional information makes the results transparent and allows the expert to understand the clustering choices.

Submitted 18 November, 2021; v1 submitted 27 February, 2021; originally announced March 2021.

Comments: Preprint for JASA Journal

arXiv:2102.11540 [pdf, other]

MSR Mining Challenge: The SmartSHARK Repository Mining Data

Authors: Alexander Trautsch, Fabian Trautsch, Steffen Herbold

Abstract: The SmartSHARK repository mining data is a collection of rich and detailed information about the evolution of software projects. The data is unique in its diversity and contains detailed information about each change, issue tracking data, continuous integration data, as well as pull request and code review data. Moreover, the data does not contain only raw data scraped from repositories, but also annotations in form of labels determined through a combination of manual analysis and heuristics, as well as links between the different parts of the data set. The SmartSHARK data set provides a rich source of data that enables us to explore research questions that require data from different sources and/or longitudinal data over time.

Submitted 4 August, 2021; v1 submitted 23 February, 2021; originally announced February 2021.

arXiv:2012.09643 [pdf, other]

doi 10.1121/10.0005885

Automatic source localization and spectra generation from sparse beamforming maps

Authors: Armin Goudarzi, Carsten Spehr, Steffen Herbold

Abstract: Beamforming is an imaging tool for the investigation of aeroacoustic phenomena and results in high dimensional data that is broken down to spectra by integrating spatial Regions Of Interest. This paper presents two methods that enable the automated identification of aeroacoustic sources in sparse beamforming maps and the extraction of their corresponding spectra to overcome the manual definition of Regions Of Interest. The methods are evaluated on two scaled airframe half-model wind-tunnel measurements and on a generic monopole source. The first relies on the spatial normal distribution of aeroacoustic broadband sources in sparse beamforming maps. The second uses hierarchical clustering methods. Both methods are robust to statistical noise and predict the existence, location, and spatial probability estimation for sources based on which Regions Of Interest are automatically determined.

Submitted 22 July, 2021; v1 submitted 16 December, 2020; originally announced December 2020.

Comments: Preprint for JASA special issue on machine learning in acoustics, Revision 2

arXiv:2011.06244 [pdf, other]

A Fine-grained Data Set and Analysis of Tangling in Bug Fixing Commits

Authors: Steffen Herbold, Alexander Trautsch, Benjamin Ledel, Alireza Aghamohammadi, Taher Ahmed Ghaleb, Kuljit Kaur Chahal, Tim Bossenmaier, Bhaveet Nagaria, Philip Makedonski, Matin Nili Ahmadabadi, Kristof Szabados, Helge Spieker, Matej Madeja, Nathaniel Hoy, Valentina Lenarduzzi, Shangwen Wang, Gema Rodríguez-Pérez, Ricardo Colomo-Palacios, Roberto Verdecchia, Paramvir Singh, Yihao Qin, Debasish Chakroborti, Willard Davis, Vijay Walunj, Hongjun Wu , et al. (23 additional authors not shown)

Abstract: Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs. Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits. Methods: We use a crowd sourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus. Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case. Conclusion: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy, until proven otherwise.

Submitted 13 October, 2021; v1 submitted 12 November, 2020; originally announced November 2020.

Comments: Status: Accepted at Empirical Software Engineering

arXiv:2009.01521 [pdf, other]

doi 10.1007/s10664-021-10073-7

Smoke Testing for Machine Learning: Simple Tests to Discover Severe Defects

Authors: Steffen Herbold, Tobias Haar

Abstract: Machine learning is nowadays a standard technique for data analysis within software applications. Software engineers need quality assurance techniques that are suitable for these new kinds of systems. Within this article, we discuss the question whether standard software testing techniques that have been part of textbooks for decades are also useful for the testing of machine learning software. Concretely, we try to determine generic and simple smoke tests that can be used to assert that basic functions can be executed without crashing. We found that we can derive such tests using techniques similar to equivalence classes and boundary value analysis. Moreover, we found that these concepts can also be applied to hyperparameters, to further improve the quality of the smoke tests. Even though our approach is almost trivial, we were able to find bugs in all three machine learning libraries that we tested and severe bugs in two of the three libraries. This demonstrates that common software testing techniques are still valid in the age of machine learning and that considerations of how they can be adapted to this new context can help to find and prevent severe bugs, even in mature machine learning libraries.

Submitted 29 October, 2021; v1 submitted 3 September, 2020; originally announced September 2020.

Comments: Accepted at Empirical Software Engineering, Springer

arXiv:2003.05357 [pdf, other]

doi 10.1007/s10664-020-09885-w

On the feasibility of automated prediction of bug and non-bug issues

Authors: Steffen Herbold, Alexander Trautsch, Fabian Trautsch

Abstract: Context: Issue tracking systems are used to track and describe tasks in the development process, e.g., requested feature improvements or reported bugs. However, past research has shown that the reported issue types often do not match the description of the issue. Objective: We want to understand the overall maturity of the state of the art of issue type prediction with the goal to predict if issues are bugs and evaluate if we can improve existing models by incorporating manually specified knowledge about issues. Method: We train different models for the title and description of the issue to account for the difference in structure between these fields, e.g., the length. Moreover, we manually detect issues whose description contains a null pointer exception, as these are strong indicators that issues are bugs. Results: Our approach performs best overall, but not significantly different from an approach from the literature based on the fastText classifier from Facebook AI Research. The small improvements in prediction performance are due to structural information about the issues we used. We found that using information about the content of issues in form of null pointer exceptions is not useful. We demonstrate the usefulness of issue type prediction through the example of labelling bugfixing commits. Conclusions: Issue type prediction can be a useful tool if the use case allows either for a certain amount of missed bug reports or the prediction of too many issues as bug is acceptable.

Submitted 8 October, 2021; v1 submitted 11 March, 2020; originally announced March 2020.

arXiv:2001.01972 [pdf, ps, other]

With Registered Reports Towards Large Scale Data Curation

Authors: Steffen Herbold

Abstract: The scale of manually validated data is currently limited by the effort that small groups of researchers can invest for the curation of such data. Within this paper, we propose the use of registered reports to scale the curation of manually validated data. The idea is inspired by the mechanical turk and replaces monetary payment with authorship of data set publication.

Submitted 7 January, 2020; originally announced January 2020.

arXiv:2001.01606 [pdf, other]

The SmartSHARK Ecosystem for Software Repository Mining

Authors: Alexander Trautsch, Fabian Trautsch, Steffen Herbold, Benjamin Ledel, Jens Grabowski

Abstract: Software repository mining is the foundation for many empirical software engineering studies. The collection and analysis of detailed data can be challenging, especially if data shall be shared to enable replicable research and open science practices. SmartSHARK is an ecosystem that supports replicable and reproducible research based on software repository mining.

Submitted 6 January, 2020; originally announced January 2020.

Comments: Submitted to ICSE 2020 Demo Track

arXiv:1912.02179 [pdf, other]

doi 10.1007/s10664-020-09880-1

A Longitudinal Study of Static Analysis Warning Evolution and the Effects of PMD on Software Quality in Apache Open Source Projects

Authors: Alexander Trautsch, Steffen Herbold, Jens Grabowski

Abstract: Automated static analysis tools (ASATs) have become a major part of the software development workflow. Acting on the generated warnings, i.e., changing the code indicated in the warning, should be part of, at latest, the code review phase. Despite this being a best practice in software development, there is still a lack of empirical research regarding the usage of ASATs in the wild. In this work, we want to study ASAT warning trends in software via the example of PMD as an ASAT and its usage in open source projects. We analyzed the commit history of 54 projects (with 112,266 commits in total), taking into account 193 PMD rules and 61 PMD releases. We investigate trends of ASAT warnings over up to 17 years for the selected study subjects regarding changes of warning types, short and long term impact of ASAT use, and changes in warning severities. We found that large global changes in ASAT warnings are mostly due to coding style changes regarding braces and naming conventions. We also found that, surprisingly, the influence of the presence of PMD in the build process of the project on warning removal trends for the number of warnings per lines of code is small and not statistically significant. Regardless, if we consider defect density as a proxy for external quality, we see a positive effect if PMD is present in the build configuration of our study subjects.

Submitted 27 August, 2020; v1 submitted 2 December, 2019; originally announced December 2019.

Comments: preprint

Journal ref: Empirical Software Engineering 25 (2020) 5137-5192

arXiv:1911.08938 [pdf, other]

doi 10.1007/s10664-021-10092-4

Problems with SZZ and Features: An empirical study of the state of practice of defect prediction data collection

Authors: Steffen Herbold, Alexander Trautsch, Fabian Trautsch, Benjamin Ledel

Abstract: Context: The SZZ algorithm is the de facto standard for labeling bug fixing commits and finding inducing changes for defect prediction data. Recent research uncovered potential problems in different parts of the SZZ algorithm. Most defect prediction data sets provide only static code metrics as features, while research indicates that other features are also important. Objective: We provide an empirical analysis of the defect labels created with the SZZ algorithm and the impact of commonly used features on results. Method: We used a combination of manual validation and adopted or improved heuristics for the collection of defect data. We conducted an empirical study on 398 releases of 38 Apache projects. Results: We found that only half of the bug fixing commits determined by SZZ are actually bug fixing. If a six-month time frame is used in combination with SZZ to determine which bugs affect a release, one file is incorrectly labeled as defective for every file that is correctly labeled as defective. In addition, two defective files are missed. We also explored the impact of the relatively small set of features that are available in most defect prediction data sets, as there are multiple publications that indicate that, e.g., churn related features are important for defect prediction. We found that the difference of using more features is not significant. Conclusion: Problems with inaccurate defect labels are a severe threat to the validity of the state of the art of defect prediction. Small feature sets seem to be a less severe threat.

Submitted 11 November, 2021; v1 submitted 20 November, 2019; originally announced November 2019.

Comments: Accepted at Empirical Software Engineering, Springer. First three authors are equally contributing

arXiv:1911.04309 [pdf, other]

doi 10.1109/TSE.2019.2957794

On the costs and profit of software defect prediction

Authors: Steffen Herbold

Abstract: Defect prediction can be a powerful tool to guide the use of quality assurance resources. However, while lots of research covered methods for defect prediction as well as methodological aspects of defect prediction research, the actual cost saving potential of defect prediction is still unclear. Within this article, we close this research gap and formulate a cost model for software defect prediction. We derive mathematically provable boundary conditions that must be fulfilled by defect prediction models such that there is a positive profit when the defect prediction model is used. Our cost model includes aspects like the costs for quality assurance, the costs of post-release defects, the possibility that quality assurance fails to reveal predicted defects, and the relationship between software artifacts and defects. We initialize the cost model using different assumptions and perform experiments to show trends of the behavior of costs on real projects. Our results show that the unrealistic assumption that defects only affect a single software artifact, which is a standard practice in the defect prediction literature, leads to inaccurate cost estimations. Moreover, the results indicate that thresholds for machine learning metrics are also not suited to define success criteria for software defect prediction.

Submitted 11 November, 2019; originally announced November 2019.

Comments: Under Review (minor revision)

arXiv:1902.07499 [pdf, other]

A systematic mapping study of developer social network research

Authors: Steffen Herbold, Aynur Amirfallah, Fabian Trautsch, Jens Grabowski

Abstract: Developer social networks (DSNs) are a tool for the analysis of community structures and collaborations between developers in software projects and software ecosystems. Within this paper, we present the results of a systematic mapping study on the use of DSNs in software engineering research. We identified 255 primary studies on DSNs. We mapped the primary studies to research directions, collected information about the data sources and the size of the studies, and conducted a bibliometric assessment. We found that nearly half of the research investigates the structure of developer communities. Other frequent topics are prediction systems built using DSNs, collaboration behavior between developers, and the roles of developers. Moreover, we determined that many publications use a small sample size regarding the number of projects, which could be problematic for the external validity of the research. Our study uncovered several open issues in the state of the art, e.g., studying inter-company collaborations, using multiple information sources for DSN research, as well as a general lack of reporting guidelines or replication studies.

Submitted 21 August, 2020; v1 submitted 20 February, 2019; originally announced February 2019.

Comments: Accepted at the Journal of Systems and Software

arXiv:1812.09746 [pdf, other]

A Multi-Objective Anytime Rule Mining System to Ease Iterative Feedback from Domain Experts

Authors: Tobias Baum, Steffen Herbold, Kurt Schneider

Abstract: Data extracted from software repositories is used intensively in Software Engineering research, for example, to predict defects in source code. In our research in this area, with data from open source projects as well as an industrial partner, we noticed several shortcomings of conventional data mining approaches for classification problems: (1) Domain experts' acceptance is of critical importance, and domain experts can provide valuable input, but it is hard to use this feedback. (2) The evaluation of the model is not a simple matter of calculating AUC or accuracy. Instead, there are multiple objectives of varying importance, but their importance cannot be easily quantified. Furthermore, the performance of the model cannot be evaluated on a per-instance level in our case, because it shares aspects with the set cover problem. To overcome these problems, we take a holistic approach and develop a rule mining system that simplifies iterative feedback from domain experts and can easily incorporate the domain-specific evaluation needs. A central part of the system is a novel multi-objective anytime rule mining algorithm. The algorithm is based on the GRASP-PR meta-heuristic but extends it with ideas from several other approaches. We successfully applied the system in the industrial context. In the current article, we focus on the description of the algorithm and the concepts of the system. We provide an implementation of the system for reuse.

Submitted 23 December, 2018; originally announced December 2018.

arXiv:1812.09510 [pdf, other]

An Industrial Case Study on Shrinking Code Review Changesets through Remark Prediction

Authors: Tobias Baum, Steffen Herbold, Kurt Schneider

Abstract: Change-based code review is used widely in industrial software development. Thus, research on tools that help the reviewer to achieve better review performance can have a high impact. We analyze one possibility to provide cognitive support for the reviewer: Determining the importance of change parts for review, specifically determining which parts of the code change can be left out from the review without harm. To determine the importance of change parts, we extract data from software repositories and build prediction models for review remarks based on this data. The approach is discussed in detail. To gather the input data, we propose a novel algorithm to trace review remarks to their triggers. We apply our approach in a medium-sized software company. In this company, we can avoid the review of 25% of the change parts and of 23% of the changed Java source code lines, while missing only about 1% of the review remarks. Still, we also observe severe limitations of the tried approach: Much of the savings are due to simple syntactic rules, noise in the data hampers the search for better prediction models, and some developers in the case company oppose the taken approach. Besides the main results on the mining and prediction of triggers for review remarks, we contribute experiences with a novel, multi-objective and interactive rule mining approach. The anonymized dataset from the company is made available, as are the implementations for the devised algorithms.

Submitted 22 December, 2018; originally announced December 2018.

arXiv:1801.04107 [pdf, other]

Benchmarking cross-project defect prediction approaches with costs metrics

Authors: Steffen Herbold

Abstract: Defect prediction can be a powerful tool to guide the use of quality assurance resources. In recent years, many researchers focused on the problem of Cross-Project Defect Prediction (CPDP), i.e., the creation of prediction models based on training data from other projects. However, only a few of the published papers evaluate the cost efficiency of predictions, i.e., whether they save costs when they are used to guide quality assurance efforts. Within this paper, we provide a benchmark of 26 CPDP approaches based on cost metrics. Our benchmark shows that trivially assuming everything as defective is on average better than CPDP under cost considerations. Moreover, we show that our ranking of approaches using cost metrics is uncorrelated to a ranking based on metrics that do not directly consider costs. These findings show that we must put more effort into evaluating the actual benefits of CPDP, as the current state of the art of CPDP can actually be beaten by a trivial approach in cost-oriented evaluations.

Submitted 12 January, 2018; originally announced January 2018.

Comments: Rejected at ICSE Technical Track, will be presented as a poster and hopefully appear in an extended version in a journal at some point in 2018

arXiv:1707.09281 [pdf, other]

Correction of "A Comparative Study to Benchmark Cross-project Defect Prediction Approaches"

Authors: Steffen Herbold, Alexander Trautsch, Jens Grabowski

Abstract: Unfortunately, the article "A Comparative Study to Benchmark Cross-project Defect Prediction Approaches" has a problem in the statistical analysis which was pointed out almost immediately after the pre-print of the article appeared online. While the problem does not negate the contribution of the article and all key findings remain the same, it does alter some rankings of approaches used in the study. Within this correction, we will explain the problem, how we resolved it, and present the updated results.

Submitted 27 July, 2017; originally announced July 2017.

arXiv:1705.06429 [pdf, other]

A systematic mapping study on cross-project defect prediction

Authors: Steffen Herbold

Abstract: Cross-Project-Defect Prediction as a sub-topic of defect prediction in general has become a popular topic in research. In this article, we present a systematic mapping study with the focus on CPDP, for which we found 50 publications. We summarize the approaches presented by each publication and discuss the case study setups and results. We discovered a great amount of heterogeneity in the way case studies are conducted, because of differences in the data sets, classifiers, performance metrics, and baseline comparisons used. Due to this, we could not compare the results of our review on a qualitative basis, i.e., determine which approaches perform best for CPDP.

Submitted 18 May, 2017; originally announced May 2017.

Comments: Under Review

Showing 1–33 of 33 results for author: Herbold, S