-
A Causal Framework for Evaluating Deferring Systems
Authors:
Filippo Palomba,
Andrea Pugnana,
José Manuel Alvarez,
Salvatore Ruggieri
Abstract:
Deferring systems extend supervised Machine Learning (ML) models with the possibility to defer predictions to human experts. However, evaluating the impact of a deferring strategy on system accuracy is still an overlooked area. This paper fills this gap by evaluating deferring systems through a causal lens. We link the potential outcomes framework for causal inference with deferring systems. This…
▽ More
Deferring systems extend supervised Machine Learning (ML) models with the possibility to defer predictions to human experts. However, evaluating the impact of a deferring strategy on system accuracy is still an overlooked area. This paper fills this gap by evaluating deferring systems through a causal lens. We link the potential outcomes framework for causal inference with deferring systems. This allows us to identify the causal impact of the deferring strategy on predictive accuracy. We distinguish two scenarios. In the first one, we can access both the human and the ML model predictions for the deferred instances. In such a case, we can identify the individual causal effects for deferred instances and aggregates of them. In the second scenario, only human predictions are available for the deferred instances. In this case, we can resort to regression discontinuity design to estimate a local causal effect. We empirically evaluate our approach on synthetic and real datasets for seven deferring systems from the literature.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
When Code Smells Meet ML: On the Lifecycle of ML-specific Code Smells in ML-enabled Systems
Authors:
Gilberto Recupito,
Giammaria Giordano,
Filomena Ferrucci,
Dario Di Nucci,
Fabio Palomba
Abstract:
Context. The adoption of Machine Learning (ML)--enabled systems is steadily increasing. Nevertheless, there is a shortage of ML-specific quality assurance approaches, possibly because of the limited knowledge of how quality-related concerns emerge and evolve in ML-enabled systems. Objective. We aim to investigate the emergence and evolution of specific types of quality-related concerns known as ML…
▽ More
Context. The adoption of Machine Learning (ML)--enabled systems is steadily increasing. Nevertheless, there is a shortage of ML-specific quality assurance approaches, possibly because of the limited knowledge of how quality-related concerns emerge and evolve in ML-enabled systems. Objective. We aim to investigate the emergence and evolution of specific types of quality-related concerns known as ML-specific code smells, i.e., sub-optimal implementation solutions applied on ML pipelines that may significantly decrease both the quality and maintainability of ML-enabled systems. More specifically, we present a plan to study ML-specific code smells by empirically analyzing (i) their prevalence in real ML-enabled systems, (ii) how they are introduced and removed, and (iii) their survivability. Method. We will conduct an exploratory study, mining a large dataset of ML-enabled systems and analyzing over 400k commits about 337 projects. We will track and inspect the introduction and evolution of ML smells through CodeSmile, a novel ML smell detector that we will build to enable our investigation and to detect ML-specific code smells.
△ Less
Submitted 13 March, 2024;
originally announced March 2024.
-
Classification, Challenges, and Automated Approaches to Handle Non-Functional Requirements in ML-Enabled Systems: A Systematic Literature Review
Authors:
Vincenzo De Martino,
Fabio Palomba
Abstract:
Context: Machine learning (ML) is nowadays so pervasive and diffused that virtually no application can avoid its use. Nonetheless, its enormous potential is often tempered by the need to manage non-functional requirements and navigate pressing, contrasting trade-offs. Objective: In this respect, we notice the lack of a comprehensive synthesis of the non-functional requirements affecting ML-enabled…
▽ More
Context: Machine learning (ML) is nowadays so pervasive and diffused that virtually no application can avoid its use. Nonetheless, its enormous potential is often tempered by the need to manage non-functional requirements and navigate pressing, contrasting trade-offs. Objective: In this respect, we notice the lack of a comprehensive synthesis of the non-functional requirements affecting ML-enabled systems, other than the major challenges faced to deal with them. Such a synthesis may not only provide a comprehensive summary of the state of the art, but also drive further research on the analysis, management, and optimization of non-functional requirements of ML-intensive systems. Method: In this paper, we propose a systematic literature review targeting two key aspects such as (1) the classification of the non-functional requirements investigated so far, and (2) the challenges to be faced when develo** models in ML-enabled systems. Through the combination of well-established guidelines for conducting systematic literature reviews and additional search criteria, we survey a total amount of 69 research articles. Results: Our findings report that current research identified 30 different non-functional requirements, which can be grouped into six main classes. We also compiled a catalog of more than 23 software engineering challenges, based on which further research should consider the nonfunctional requirements of machine learning-enabled systems. Conclusion: We conclude our work by distilling implications and a future outlook on the topic.
△ Less
Submitted 10 April, 2024; v1 submitted 29 November, 2023;
originally announced November 2023.
-
Test Code Refactoring Unveiled: Where and How Does It Affect Test Code Quality and Effectiveness?
Authors:
Luana Martins,
Valeria Pontillo,
Heitor Costa,
Filomena Ferrucci,
Fabio Palomba,
Ivan Machado
Abstract:
Context. Refactoring has been widely investigated in the past in relation to production code quality, yet still little is known on how developers apply refactoring on test code. Specifically, there is still a lack of investigation into how developers typically refactor test code and its effects on test code quality and effectiveness. Objective. This paper presents a research agenda aimed to bridge…
▽ More
Context. Refactoring has been widely investigated in the past in relation to production code quality, yet still little is known on how developers apply refactoring on test code. Specifically, there is still a lack of investigation into how developers typically refactor test code and its effects on test code quality and effectiveness. Objective. This paper presents a research agenda aimed to bridge this gap of knowledge by investigating (1) whether test refactoring actually targets test classes affected by quality and effectiveness concerns and (2) the extent to which refactoring contributes to the improvement of test code quality and effectiveness. Method. We plan to conduct an exploratory mining software repository study to collect test refactoring data of open-source Java projects from GitHub and statistically analyze them in combination with quality metrics, test smells, and code/mutation coverage indicators. Furthermore, we will measure how refactoring operations impact the quality and effectiveness of test code.
△ Less
Submitted 18 August, 2023;
originally announced August 2023.
-
A systematic literature review on the code smells datasets and validation mechanisms
Authors:
Morteza Zakeri-Nasrabadi,
Saeed Parsa,
Ehsan Esmaili,
Fabio Palomba
Abstract:
The accuracy reported for code smell-detecting tools varies depending on the dataset used to evaluate the tools. Our survey of 45 existing datasets reveals that the adequacy of a dataset for detecting smells highly depends on relevant properties such as the size, severity level, project types, number of each type of smell, number of smells, and the ratio of smelly to non-smelly samples in the data…
▽ More
The accuracy reported for code smell-detecting tools varies depending on the dataset used to evaluate the tools. Our survey of 45 existing datasets reveals that the adequacy of a dataset for detecting smells highly depends on relevant properties such as the size, severity level, project types, number of each type of smell, number of smells, and the ratio of smelly to non-smelly samples in the dataset. Most existing datasets support God Class, Long Method, and Feature Envy while six smells in Fowler and Beck's catalog are not supported by any datasets. We conclude that existing datasets suffer from imbalanced samples, lack of supporting severity level, and restriction to Java language.
△ Less
Submitted 2 June, 2023;
originally announced June 2023.
-
FedCSD: A Federated Learning Based Approach for Code-Smell Detection
Authors:
Sadi Alawadi,
Khalid Alkharabsheh,
Fahed Alkhabbas,
Victor Kebande,
Feras M. Awaysheh,
Fabio Palomba,
Mohammed Awad
Abstract:
This paper proposes a Federated Learning Code Smell Detection (FedCSD) approach that allows organizations to collaboratively train federated ML models while preserving their data privacy. These assertions have been supported by three experiments that have significantly leveraged three manually validated datasets aimed at detecting and examining different code smell scenarios. In experiment 1, whic…
▽ More
This paper proposes a Federated Learning Code Smell Detection (FedCSD) approach that allows organizations to collaboratively train federated ML models while preserving their data privacy. These assertions have been supported by three experiments that have significantly leveraged three manually validated datasets aimed at detecting and examining different code smell scenarios. In experiment 1, which was concerned with a centralized training experiment, dataset two achieved the lowest accuracy (92.30%) with fewer smells, while datasets one and three achieved the highest accuracy with a slight difference (98.90% and 99.5%, respectively). This was followed by experiment 2, which was concerned with cross-evaluation, where each ML model was trained using one dataset, which was then evaluated over the other two datasets. Results from this experiment show a significant drop in the model's accuracy (lowest accuracy: 63.80\%) where fewer smells exist in the training dataset, which has a noticeable reflection (technical debt) on the model's performance. Finally, the last and third experiments evaluate our approach by splitting the dataset into 10 companies. The ML model was trained on the company's site, then all model-updated weights were transferred to the server. Ultimately, an accuracy of 98.34% was achieved by the global model that has been trained using 10 companies for 100 training rounds. The results reveal a slight difference in the global model's accuracy compared to the highest accuracy of the centralized model, which can be ignored in favour of the global model's comprehensive knowledge, lower training cost, preservation of data privacy, and avoidance of the technical debt problem.
△ Less
Submitted 26 March, 2024; v1 submitted 31 May, 2023;
originally announced June 2023.
-
The Quantum Frontier of Software Engineering: A Systematic Map** Study
Authors:
Manuel De Stefano,
Fabiano Pecorelli,
Dario Di Nucci,
Fabio Palomba,
Andrea De Lucia
Abstract:
Context. Quantum computing is becoming a reality, and quantum software engineering (QSE) is emerging as a new discipline to enable developers to design and develop quantum programs.
Objective. This paper presents a systematic map** study of the current state of QSE research, aiming to identify the most investigated topics, the types and number of studies, the main reported results, and the mos…
▽ More
Context. Quantum computing is becoming a reality, and quantum software engineering (QSE) is emerging as a new discipline to enable developers to design and develop quantum programs.
Objective. This paper presents a systematic map** study of the current state of QSE research, aiming to identify the most investigated topics, the types and number of studies, the main reported results, and the most studied quantum computing tools/frameworks. Additionally, the study aims to explore the research community's interest in QSE, how it has evolved, and any prior contributions to the discipline before its formal introduction through the Talavera Manifesto.
Method. We searched for relevant articles in several databases and applied inclusion and exclusion criteria to select the most relevant studies. After evaluating the quality of the selected resources, we extracted relevant data from the primary studies and analyzed them.
Results. We found that QSE research has primarily focused on software testing, with little attention given to other topics, such as software engineering management. The most commonly studied technology for techniques and tools is Qiskit, although, in most studies, either multiple or none specific technologies were employed. The researchers most interested in QSE are interconnected through direct collaborations, and several strong collaboration clusters have been identified. Most articles in QSE have been published in non-thematic venues, with a preference for conferences.
Conclusions. The study's implications are providing a centralized source of information for researchers and practitioners in the field, facilitating knowledge transfer, and contributing to the advancement and growth of QSE.
△ Less
Submitted 1 June, 2023; v1 submitted 31 May, 2023;
originally announced May 2023.
-
Toward End-to-End MLOps Tools Map: A Preliminary Study based on a Multivocal Literature Review
Authors:
Sergio Moreschi,
Gilberto Recupito,
Valentina Lenarduzzi,
Fabio Palomba,
David Hastbacka,
Davide Taibi
Abstract:
MLOps tools enable continuous development of machine learning, following the DevOps process. Different MLOps tools have been presented on the market, however, such a number of tools often create confusion on the most appropriate tool to be used in each DevOps phase. To overcome this issue, we conducted a multivocal literature review map** 84 MLOps tools identified from 254 Primary Studies, on th…
▽ More
MLOps tools enable continuous development of machine learning, following the DevOps process. Different MLOps tools have been presented on the market, however, such a number of tools often create confusion on the most appropriate tool to be used in each DevOps phase. To overcome this issue, we conducted a multivocal literature review map** 84 MLOps tools identified from 254 Primary Studies, on the DevOps phases, highlighting their purpose, and possible incompatibilities. The result of this work will be helpful to both practitioners and researchers, as a starting point for future investigations on MLOps tools, pipelines, and processes.
△ Less
Submitted 6 April, 2023;
originally announced April 2023.
-
Machine Learning-Based Test Smell Detection
Authors:
Valeria Pontillo,
Dario Amoroso d'Aragona,
Fabiano Pecorelli,
Dario Di Nucci,
Filomena Ferrucci,
Fabio Palomba
Abstract:
Context: Test smells are symptoms of sub-optimal design choices adopted when develo** test cases. Previous studies have proved their harmfulness for test code maintainability and effectiveness. Therefore, researchers have been proposing automated, heuristic-based techniques to detect them. However, the performance of such detectors is still limited and dependent on thresholds to be tuned.
Obje…
▽ More
Context: Test smells are symptoms of sub-optimal design choices adopted when develo** test cases. Previous studies have proved their harmfulness for test code maintainability and effectiveness. Therefore, researchers have been proposing automated, heuristic-based techniques to detect them. However, the performance of such detectors is still limited and dependent on thresholds to be tuned.
Objective: We propose the design and experimentation of a novel test smell detection approach based on machine learning to detect four test smells.
Method: We plan to develop the largest dataset of manually-validated test smells. This dataset will be leveraged to train six machine learners and assess their capabilities in within- and cross-project scenarios. Finally, we plan to compare our approach with state-of-the-art heuristic-based techniques.
△ Less
Submitted 16 August, 2022;
originally announced August 2022.
-
On the Adoption and Effects of Source Code Reuse on Defect Proneness and Maintenance Effort
Authors:
Giammaria Giordano,
Gerardo Festa,
Gemma Catolino,
Fabio Palomba,
Filomena Ferrucci,
Carmine Gravino
Abstract:
Context. Software reusability mechanisms, like inheritance and delegation in Object-Oriented programming, are widely recognized as key instruments of software design. These are used to reduce the risks of source code being affected by defects, other than to reduce the effort required to maintain and evolve source code. Previous work has traditionally employed source code reuse metrics for predicti…
▽ More
Context. Software reusability mechanisms, like inheritance and delegation in Object-Oriented programming, are widely recognized as key instruments of software design. These are used to reduce the risks of source code being affected by defects, other than to reduce the effort required to maintain and evolve source code. Previous work has traditionally employed source code reuse metrics for prediction purposes, e.g., in the context of defect prediction. Objective. However, our research identifies two noticeable limitations of current literature. First, still little is known on the extent to which developers actually employ code reuse mechanisms over time. Second, it is still unclear how these mechanisms may contribute to explain defect-proneness and maintenance effort during software evolution. We aim at bridging this gap of knowledge, as an improved understanding of these aspects might provide insights into the actual support provided by these mechanisms, e.g., by suggesting whether and how to use them for prediction purposes. Method. We propose an exploratory study aiming at (1) assessing how developers use inheritance and delegation during software evolution; and (2) statistically analyze the impact of inheritance and delegation on fault proneness and maintenance effort. The study will be conducted on the commits of 17 Java projects of the DEFECTS4J dataset.
△ Less
Submitted 15 August, 2022;
originally announced August 2022.
-
Toward Granular Automatic Unit Test Case Generation
Authors:
Fabiano Pecorelli,
Giovanni Grano,
Fabio Palomba,
Harald C. Gall,
Andrea De Lucia
Abstract:
Unit testing verifies the presence of faults in individual software components. Previous research has been targeting the automatic generation of unit tests through the adoption of random or search-based algorithms. Despite their effectiveness, these approaches do not implement any strategy that allows them to create unit tests in a structured manner: indeed, they aim at creating tests by optimizin…
▽ More
Unit testing verifies the presence of faults in individual software components. Previous research has been targeting the automatic generation of unit tests through the adoption of random or search-based algorithms. Despite their effectiveness, these approaches do not implement any strategy that allows them to create unit tests in a structured manner: indeed, they aim at creating tests by optimizing metrics like code coverage without ensuring that the resulting tests follow good design principles. In order to structure the automatic test case generation process, we propose a two-step systematic approach to the generation of unit tests: we first force search-based algorithms to create tests that cover individual methods of the production code, hence implementing the so-called intra-method tests; then, we relax the constraints to enable the creation of intra-class tests that target the interactions among production code methods.
△ Less
Submitted 12 April, 2022;
originally announced April 2022.
-
Software Engineering for Quantum Programming: How Far Are We?
Authors:
Manuel De Stefano,
Fabiano Pecorelli,
Dario Di Nucci,
Fabio Palomba,
Andrea De Lucia
Abstract:
Quantum computing is no longer only a scientific interest but is rapidly becoming an industrially available technology that can potentially overcome the limits of classical computation. Over the last years, all major companies have provided frameworks and programming languages that allow developers to create their quantum applications. This shift has led to the definition of a new discipline calle…
▽ More
Quantum computing is no longer only a scientific interest but is rapidly becoming an industrially available technology that can potentially overcome the limits of classical computation. Over the last years, all major companies have provided frameworks and programming languages that allow developers to create their quantum applications. This shift has led to the definition of a new discipline called quantum software engineering, which is demanded to define novel methods for engineering large-scale quantum applications. While the research community is successfully embracing this call, we notice a lack of systematic investigations into the state of the practice of quantum programming. Understanding the challenges that quantum developers face is vital to precisely define the aims of quantum software engineering. Hence, in this paper, we first mine all the GitHub repositories that make use of the most used quantum programming frameworks currently on the market and then conduct coding analysis sessions to produce a taxonomy of the purposes for which quantum technologies are used. In the second place, we conduct a survey study that involves the contributors of the considered repositories, which aims to elicit the developers' opinions on the current adoption and challenges of quantum programming. On the one hand, the results highlight that the current adoption of quantum programming is still limited. On the other hand, there are many challenges that the software engineering community should carefully consider: these do not strictly pertain to technical concerns but also socio-technical matters.
△ Less
Submitted 11 April, 2022; v1 submitted 31 March, 2022;
originally announced March 2022.
-
A Critical Comparison on Six Static Analysis Tools: Detection, Agreement, and Precision
Authors:
Valentina Lenarduzzi,
Savanna Lujan,
Nyyti Saarimaki,
Fabio Palomba
Abstract:
Background. Developers use Automated Static Analysis Tools (ASATs) to control for potential quality issues in source code, including defects and technical debt. Tool vendors have devised quite a number of tools, which makes it harder for practitioners to select the most suitable one for their needs. To better support developers, researchers have been conducting several studies on ASATs to favor th…
▽ More
Background. Developers use Automated Static Analysis Tools (ASATs) to control for potential quality issues in source code, including defects and technical debt. Tool vendors have devised quite a number of tools, which makes it harder for practitioners to select the most suitable one for their needs. To better support developers, researchers have been conducting several studies on ASATs to favor the understanding of their actual capabilities.
Aims. Despite the work done so far, there is still a lack of knowledge regarding (1) which source quality problems can actually be detected by static analysis tool warnings, (2) what is their agreement, and (3) what is the precision of their recommendations. We aim at bridging this gap by proposing a large-scale comparison of six popular static analysis tools for Java projects: Better Code Hub, CheckStyle, Coverity Scan, Findbugs, PMD, and SonarQube.
Method. We analyze 47 Java projects and derive a taxonomy of warnings raised by 6 state-of-the-practice ASATs. To assess their agreement, we compared them by manually analyzing - at line-level - whether they identify the same issues. Finally, we manually evaluate the precision of the tools.
Results. The key results report a comprehensive taxonomy of ASATs warnings, show little to no agreement among the tools and a low degree of precision.
Conclusions. We provide a taxonomy that can be useful to researchers, practitioners, and tool vendors to map the current capabilities of the tools. Furthermore, our study provides the first overview on the agreement among different tools as well as an extensive analysis of their precision.
△ Less
Submitted 21 January, 2021;
originally announced January 2021.
-
DeepIaC: Deep Learning-Based Linguistic Anti-pattern Detection in IaC
Authors:
Nemania Borovits,
Indika Kumara,
Parvathy Krishnan,
Stefano Dalla Palma,
Dario Di Nucci,
Fabio Palomba,
Damian A. Tamburri,
Willem-Jan van den Heuvel
Abstract:
Linguistic anti-patterns are recurring poor practices concerning inconsistencies among the naming, documentation, and implementation of an entity. They impede readability, understandability, and maintainability of source code. This paper attempts to detect linguistic anti-patterns in infrastructure as code (IaC) scripts used to provision and manage computing environments. In particular, we conside…
▽ More
Linguistic anti-patterns are recurring poor practices concerning inconsistencies among the naming, documentation, and implementation of an entity. They impede readability, understandability, and maintainability of source code. This paper attempts to detect linguistic anti-patterns in infrastructure as code (IaC) scripts used to provision and manage computing environments. In particular, we consider inconsistencies between the logic/body of IaC code units and their names. To this end, we propose a novel automated approach that employs word embeddings and deep learning techniques. We build and use the abstract syntax tree of IaC code units to create their code embedments. Our experiments with a dataset systematically extracted from open source repositories show that our approach yields an accuracy between0.785and0.915in detecting inconsistencies
△ Less
Submitted 22 September, 2020;
originally announced September 2020.
-
Success and Failure in Software Engineering: a Followup Systematic Literature Review
Authors:
Damian A. Tamburri,
Fabio Palomba,
Rick Kazman
Abstract:
Success and failure in software engineering are still among the least understood phenomena in the discipline. In a recent special journal issue on the topic, Mantyla et al. started discussing these topics from different angles; the authors focused their contributions on offering a general overview of both topics without deeper detail. Recognising the importance and impact of the topic, we have exe…
▽ More
Success and failure in software engineering are still among the least understood phenomena in the discipline. In a recent special journal issue on the topic, Mantyla et al. started discussing these topics from different angles; the authors focused their contributions on offering a general overview of both topics without deeper detail. Recognising the importance and impact of the topic, we have executed a followup, more in-depth systematic literature review with additional analyses beyond what was previously provided. These new analyses offer: (a) a grounded-theory of success and failure factors, harvesting over 500+ factors from the literature; (b) 14 manually-validated clusters of factors that provide relevant areas for success- and failure-specific measurement and risk-analysis; (c) a quality model composed of previously unmeasured organizational structure quantities which are germane to software product, process, and community quality. We show that the topics of success and failure deserve further study as well as further automated tool support, e.g., monitoring tools and metrics able to track the factors and patterns emerging from our study. This paper provides managers with risks as well as a more fine-grained analysis of the parameters that can be appraised to anticipate the risks.
△ Less
Submitted 22 June, 2020;
originally announced June 2020.
-
Towards a Catalogue of Software Quality Metrics for Infrastructure Code
Authors:
Stefano Dalla Palma,
Dario Di Nucci,
Fabio Palomba,
Damian A. Tamburri
Abstract:
Infrastructure-as-code (IaC) is a practice to implement continuous deployment by allowing management and provisioning of infrastructure through the definition of machine-readable files and automation around them, rather than physical hardware configuration or interactive configuration tools. On the one hand, although IaC represents an ever-increasing widely adopted practice nowadays, still little…
▽ More
Infrastructure-as-code (IaC) is a practice to implement continuous deployment by allowing management and provisioning of infrastructure through the definition of machine-readable files and automation around them, rather than physical hardware configuration or interactive configuration tools. On the one hand, although IaC represents an ever-increasing widely adopted practice nowadays, still little is known concerning how to best maintain, speedily evolve, and continuously improve the code behind the IaC practice in a measurable fashion. On the other hand, source code measurements are often computed and analyzed to evaluate the different quality aspects of the software developed. However, unlike general-purpose programming languages (GPLs), IaC scripts use domain-specific languages, and metrics used for GPLs may not be applicable for IaC scripts. This article proposes a catalogue consisting of 46 metrics to identify IaC properties focusing on Ansible, one of the most popular IaC language to date, and shows how they can be used to analyze IaC scripts.
△ Less
Submitted 7 July, 2020; v1 submitted 27 May, 2020;
originally announced May 2020.
-
On The Effect Of Code Review On Code Smells
Authors:
Luca Pascarella,
Davide Spadini,
Fabio Palomba,
Alberto Bacchelli
Abstract:
Code smells are symptoms of poor design quality. Since code review is a process that also aims at improving code quality, we investigate whether and how code review influences the severity of code smells. In this study, we analyze more than 21,000 code reviews belonging to seven Java open-source projects; we find that active and participated code reviews have a significant influence on the likelih…
▽ More
Code smells are symptoms of poor design quality. Since code review is a process that also aims at improving code quality, we investigate whether and how code review influences the severity of code smells. In this study, we analyze more than 21,000 code reviews belonging to seven Java open-source projects; we find that active and participated code reviews have a significant influence on the likelihood of reducing the severity of code smells. This result seems to confirm the expectations around code review's influence on code quality. However, by manually investigating 365 cases in which the severity of a code smell in a file was reduced with a review, we found that-in 95% of the cases-the reduction was a side effect of changes that reviewers requested on matters unrelated to code smells. Data and materials [https://doi.org/10.5281/zenodo.3588501].
△ Less
Submitted 20 December, 2019;
originally announced December 2019.
-
Not All Bugs Are the Same: Understanding, Characterizing, and Classifying the Root Cause of Bugs
Authors:
Gemma Catolino,
Fabio Palomba,
Andy Zaidman,
Filomena Ferrucci
Abstract:
Modern version control systems such as Git or SVN include bug tracking mechanisms, through which developers can highlight the presence of bugs through bug reports, i.e., textual descriptions reporting the problem and what are the steps that led to a failure. In past and recent years, the research community deeply investigated methods for easing bug triage, that is, the process of assigning the fix…
▽ More
Modern version control systems such as Git or SVN include bug tracking mechanisms, through which developers can highlight the presence of bugs through bug reports, i.e., textual descriptions reporting the problem and what are the steps that led to a failure. In past and recent years, the research community deeply investigated methods for easing bug triage, that is, the process of assigning the fixing of a reported bug to the most qualified developer. Nevertheless, only a few studies have reported on how to support developers in the process of understanding the type of a reported bug, which is the first and most time-consuming step to perform before assigning a bug-fix operation. In this paper, we target this problem in two ways: first, we analyze 1,280 bug reports of 119 popular projects belonging to three ecosystems such as Mozilla, Apache, and Eclipse, with the aim of building a taxonomy of the root causes of reported bugs; then, we devise and evaluate an automated classification model able to classify reported bugs according to the defined taxonomy. As a result, we found nine main common root causes of bugs over the considered systems. Moreover, our model achieves high F-Measure and AUC-ROC (64% and 74% on overall, respectively).
△ Less
Submitted 25 July, 2019;
originally announced July 2019.
-
Understanding Flaky Tests: The Developer's Perspective
Authors:
Moritz Eck,
Fabio Palomba,
Marco Castelluccio,
Alberto Bacchelli
Abstract:
Flaky tests are software tests that exhibit a seemingly random outcome (pass or fail) when run against the same, identical code. Previous work has examined fixes to flaky tests and has proposed automated solutions to locate as well as fix flaky tests--we complement it by examining the perceptions of software developers about the nature, relevance, and challenges of this phenomenon.
We asked 21 p…
▽ More
Flaky tests are software tests that exhibit a seemingly random outcome (pass or fail) when run against the same, identical code. Previous work has examined fixes to flaky tests and has proposed automated solutions to locate as well as fix flaky tests--we complement it by examining the perceptions of software developers about the nature, relevance, and challenges of this phenomenon.
We asked 21 professional developers to classify 200 flaky tests they previously fixed, in terms of the nature of the flakiness, the origin of the flakiness, and the fixing effort. We complement this analysis with information about the fixing strategy. Subsequently, we conducted an online survey with 121 developers with a median industrial programming experience of five years. Our research shows that: The flakiness is due to several different causes, four of which have never been reported before, despite being the most costly to fix; flakiness is perceived as significant by the vast majority of developers, regardless of their team's size and project's domain, and it can have effects on resource allocation, scheduling, and the perceived reliability of the test suite; and the challenges developers report to face regard mostly the reproduction of the flaky behavior and the identification of the cause for the flakiness. Data and materials [https://doi.org/10.5281/zenodo.3265785].
△ Less
Submitted 2 July, 2019;
originally announced July 2019.
-
Improving Change Prediction Models with Code Smell-Related Information
Authors:
Gemma Catolino,
Fabio Palomba,
Francesca Arcelli Fontana,
Andrea De Lucia,
Andy Zaidman,
Filomena Ferrucci
Abstract:
Code smells represent sub-optimal implementation choices applied by developers when evolving software systems. The negative impact of code smells has been widely investigated in the past: besides developers' productivity and ability to comprehend source code, researchers empirically showed that the presence of code smells heavily impacts the change-proneness of the affected classes. On the basis o…
▽ More
Code smells represent sub-optimal implementation choices applied by developers when evolving software systems. The negative impact of code smells has been widely investigated in the past: besides developers' productivity and ability to comprehend source code, researchers empirically showed that the presence of code smells heavily impacts the change-proneness of the affected classes. On the basis of these findings, in this paper we conjecture that code smell-related information can be effectively exploited to improve the performance of change prediction models, ie models having as goal that of indicating to developers which classes are more likely to change in the future, so that they may apply preventive maintenance actions. Specifically, we exploit the so-called intensity index - a previously defined metric that captures the severity of a code smell - and evaluate its contribution when added as additional feature in the context of three state of the art change prediction models based on product, process, and developer-based features. We also compare the performance achieved by the proposed model with the one of an alternative technique that considers the previously defined antipattern metrics, namely a set of indicators computed considering the history of code smells in files. Our results report that (i) the prediction performance of the intensity-including models is statistically better than that of the baselines and (ii) the intensity is a more powerful metric with respect to the alternative smell-related ones.
△ Less
Submitted 26 May, 2019;
originally announced May 2019.