-
Relevant information in TDD experiment reporting
Authors:
Fernando Uyaguari,
Silvia T. Acuña,
John W. Castro,
Davide Fucci,
Oscar Dieste,
Sira Vegas
Abstract:
Experiments are a commonly used method of research in software engineering (SE). Researchers report their experiments following detailed guidelines. However, researchers do not, in the field of test-driven development (TDD) at least, specify how they operationalized the response variables and the measurement process. This article has three aims: (i) identify the response variable operationalization components in TDD experiments that study external quality; (ii) study their influence on the experimental results; (iii) determine if the experiment reports describe the measurement process components that have an impact on the results. Sequential mixed method. The first part of the research adopts a quantitative approach applying a statistical analysis (SA) of the impact of the operationalization components on the experimental results. The second part follows on with a qualitative approach applying a systematic mapping study (SMS). The test suites, intervention types and measurers have an influence on the measurements and results of the SA of TDD experiments in SE. The test suites have a major impact on both the measurements and the results of the experiments. The intervention type has less impact on the results than on the measurements. While the measurers have an impact on the measurements, this is not transferred to the experimental results. On the other hand, the results of our SMS confirm that TDD experiments do not usually report either the test suites, the test case generation method, or the details of how external quality was measured. A measurement protocol should be used to assure that the measurements made by different measurers are similar. It is necessary to report the test cases, the experimental task and the intervention type in order to be able to reproduce the measurements and SA, as well as to replicate experiments and build dependable families of experiments.
Submitted 10 June, 2024;
originally announced June 2024.
-
On (Mis)perceptions of testing effectiveness: an empirical study
Authors:
Sira Vegas,
Patricia Riofrio,
Esperanza Marcos,
Natalia Juristo
Abstract:
A recurring problem in software development is incorrect decision making on the techniques, methods and tools to be used. Mostly, these decisions are based on developers' perceptions about them. A factor influencing people's perceptions is past experience, but it is not the only one. In this research, we aim to discover how well the perceptions of the defect detection effectiveness of different techniques match their real effectiveness in the absence of prior experience. To do this, we conduct an empirical study plus a replication. During the original study, we conduct a controlled experiment with students applying two testing techniques and a code review technique. At the end of the experiment, they take a survey to find out which technique they perceive to be most effective. The results show that participants' perceptions are wrong and that this mismatch is costly in terms of quality. In order to gain further insight into the results, we replicate the controlled experiment and extend the survey to include questions about participants' opinions on the techniques and programs. The results of the replicated study confirm the findings of the original study and suggest that participants' perceptions might be based not on their opinions about complexity or preferences for techniques but on how well they think that they have applied the techniques.
Submitted 11 February, 2024;
originally announced February 2024.
-
Content and structure of laboratory packages for software engineering experiments
Authors:
Martín Solari,
Sira Vegas,
Natalia Juristo
Abstract:
Context: Experiment replications play a central role in the scientific method. Although software engineering experimentation has matured a great deal, the number of experiment replications is still relatively small. Software engineering experiments are composed of complex concepts, procedures and artefacts. Laboratory packages are a means of transferring knowledge among researchers to facilitate experiment replications. Objective: This paper investigates the experiment replication process to find out what information is needed to successfully replicate an experiment. Our objective is to propose the content and structure of laboratory packages for software engineering experiments. Method: We evaluated seven replications of three different families of experiments. Each replication had a different experimenter who was, at the time, unfamiliar with the experiment. During the first iterations of the study, we identified experimental incidents and then proposed a laboratory package structure that addressed these incidents, including document usability improvements. We used the later iterations to validate and generalize the laboratory package structure for use in all software engineering experiments. We aimed to solve a specific problem, while at the same time looking at how to contribute to the body of knowledge on laboratory packages. Results: We generated a laboratory package for three different experiments. These packages eased the replication of the respective experiments. The evaluation that we conducted shows that the laboratory package proposal is acceptable and reduces the effort currently required to replicate experiments in software engineering. Conclusion: We think that the content and structure that we propose for laboratory packages can be useful for other software engineering experiments.
Submitted 11 February, 2024;
originally announced February 2024.
-
Does Microservices Adoption Impact the Development Velocity? A Cohort Study. A Registered Report
Authors:
Nyyti Saarimaki,
Mikel Robredo,
Sira Vegas,
Natalia Juristo,
Davide Taibi,
Valentina Lenarduzzi
Abstract:
[Context] Microservices enable the decomposition of applications into small and independent services connected together. The independence between services could positively affect the development velocity of a project, which is considered an important metric measuring the time taken to implement features and fix bugs. However, no studies have investigated the connection between microservices and development velocity. [Objective and Method] The goal of this study plan is to investigate the effect microservices have on development velocity. The study compares GitHub projects adopting microservices from the beginning and similar projects using monolithic architectures. We designed this study using a cohort study method, to enable obtaining a high level of evidence. [Results] The result of this work enables the confirmation of the effective improvement of the development velocity of microservices. Moreover, this study will contribute to the body of knowledge of empirical methods being among the first works adopting the cohort study methodology.
Submitted 21 June, 2023; v1 submitted 3 June, 2023;
originally announced June 2023.
-
Pitfalls in Experiments with DNN4SE: An Analysis of the State of the Practice
Authors:
Sira Vegas,
Sebastian Elbaum
Abstract:
Software engineering techniques are increasingly relying on deep learning approaches to support many software engineering tasks, from bug triaging to code generation. To assess the efficacy of such techniques researchers typically perform controlled experiments. Conducting these experiments, however, is particularly challenging given the complexity of the space of variables involved, from specialized and intricate architectures and algorithms to a large number of training hyper-parameters and choices of evolving datasets, all compounded by how rapidly the machine learning technology is advancing, and the inherent sources of randomness in the training process. In this work we conduct a mapping study, examining 194 experiments with techniques that rely on deep neural networks appearing in 55 papers published in premier software engineering venues to provide a characterization of the state-of-the-practice, pinpointing experiments' common trends and pitfalls. Our study reveals that most of the experiments, including those that have received ACM artifact badges, have fundamental limitations that raise doubts about the reliability of their findings. More specifically, we find: weak analyses to determine that there is a true relationship between independent and dependent variables (87% of the experiments); limited control over the space of DNN relevant variables, which can render a relationship between dependent variables and treatments that may not be causal but rather correlational (100% of the experiments); and lack of specificity in terms of what are the DNN variables and their values utilized in the experiments (86% of the experiments) to define the treatments being applied, which makes it unclear whether the techniques designed are the ones being assessed, or how the sources of extraneous variation are controlled. We provide some practical recommendations to address these limitations.
Submitted 19 May, 2023;
originally announced May 2023.
-
Comparing 2D and Augmented Reality Visualizations for Microservice System Understandability: A Controlled Experiment
Authors:
Amr S. Abdelfattah,
Tomas Cerny,
Davide Taibi,
Sira Vegas
Abstract:
Microservice-based systems are often complex to understand, especially when their sizes grow. Abstracted views help practitioners with the system understanding from a certain perspective. Recent advancement in interactive data visualization begs the question of whether established software engineering models to visualize system design remain the most suited approach for the service-oriented design of microservices. Our recent work proposed presenting a 3D visualization for microservices in augmented reality. This paper analyzes whether such an approach brings any benefits to practitioners when dealing with selected architectural questions related to system design quality. For this purpose, we conducted a controlled experiment involving 20 participants investigating their performance in identifying service dependency, service cardinality, and bottlenecks. Results show that the 3D enables novices to perform as well as experts in the detection of service dependencies, especially in large systems, while no differences are reported for the identification of service cardinality and bottlenecks. We recommend industry and researchers to further investigate AR for microservice architectural analysis, especially to ease the onboarding of new developers in microservice projects.
Submitted 3 March, 2023;
originally announced March 2023.
-
Test cases as a measurement instrument in experimentation
Authors:
Oscar Dieste,
Fernando Uyaguari,
Sira Vegas,
Natalia Juristo
Abstract:
Background: Test suites are frequently used to quantify relevant software attributes, such as quality or productivity. Problem: We have detected that the same response variable, measured using different test suites, yields different experiment results. Aims: Assess to which extent differences in test case construction influence measurement accuracy and experimental outcomes. Method: Two industry experiments have been measured using two different test suites, one generated using an ad-hoc method and another using equivalence partitioning. The accuracy of the measures has been studied using standard procedures, such as ISO 5725, Bland-Altman and Interclass Correlation Coefficients. Results: There are differences in the values of the response variables up to ±60%, depending on the test suite (ad-hoc vs. equivalence partitioning) used. Conclusions: The disclosure of datasets and analysis code is insufficient to ensure the reproducibility of SE experiments. Experimenters should disclose all experimental materials needed to perform independent measurement and re-analysis.
Submitted 25 April, 2022; v1 submitted 9 November, 2021;
originally announced November 2021.
-
Towards a Methodology for Participant Selection in Software Engineering Experiments. A Vision of the Future
Authors:
Valentina Lenarduzzi,
Oscar Dieste,
Davide Fucci,
Sira Vegas
Abstract:
Background. Software Engineering (SE) researchers extensively perform experiments with human subjects. Well-defined samples are required to ensure external validity. Samples are selected purposely or by convenience, limiting the generalizability of results. Objective. We aim to depict the current status of participants selection in empirical SE, identifying the main threats and how they are mitigated. We draft a robust approach to participants' selection. Method. We reviewed existing participants' selection guidelines in SE, and performed a preliminary literature review to find out how participants' selection is conducted in SE in practice. Results. We outline a new selection methodology, by 1) defining the characteristics of the desired population, 2) locating possible sources of sampling available for researchers, and 3) identifying and reducing the "distance" between the selected sample and its corresponding population. Conclusion. We propose a roadmap to develop and empirically validate the selection methodology.
Submitted 27 August, 2021;
originally announced August 2021.
-
A Family of Experiments on Test-Driven Development
Authors:
Adrian Santos,
Sira Vegas,
Oscar Dieste,
Fernando Uyaguari,
Aysee Tosun,
Davide Fucci,
Burak Turhan,
Giuseppe Scanniello,
Simone Romano,
Itir Karac,
Marco Kuhrmann,
Vladimir Mandic,
Robert Ramac,
Dietmar Pfahl,
Christian Engblom,
Jarno Kyykka,
Kerli Rungi,
Carolina Palomeque,
Jaroslav Spisak,
Markku Oivo,
Natalia Juristo
Abstract:
Context: Test-driven development (TDD) is an agile software development approach that has been widely claimed to improve software quality. However, the extent to which TDD improves quality appears to be largely dependent upon the characteristics of the study in which it is evaluated (e.g., the research method, participant type, programming environment, etc.). The particularities of each study make the aggregation of results untenable. Objectives: The goal of this paper is to: increase the accuracy and generalizability of the results achieved in isolated experiments on TDD, provide joint conclusions on the performance of TDD across different industrial and academic settings, and assess the extent to which the characteristics of the experiments affect the quality-related performance of TDD. Method: We conduct a family of 12 experiments on TDD in academia and industry. We aggregate their results by means of meta-analysis. We perform exploratory analyses to identify variables impacting the quality-related performance of TDD. Results: TDD novices achieve a slightly higher code quality with iterative test-last development (i.e., ITL, the reverse approach of TDD) than with TDD. The task being developed largely determines quality. The programming environment, the order in which TDD and ITL are applied, or the learning effects from one development approach to another do not appear to affect quality. The quality-related performance of professionals using TDD drops more than for students. We hypothesize that this may be due to their being more resistant to change and potentially less motivated than students. Conclusion: Previous studies seem to provide conflicting results on TDD performance (i.e., positive vs. negative, respectively). We hypothesize that these conflicting results may be due to different study durations, experiment participants being unfamiliar with the TDD process...
Submitted 24 November, 2020;
originally announced November 2020.
-
Comparing the Results of Replications in Software Engineering
Authors:
Adrian Santos,
Sira Vegas,
Markku Oivo,
Natalia Juristo
Abstract:
Context: It has been argued that software engineering replications are useful for verifying the results of previous experiments. However, it has not yet been agreed how to check whether the results hold across replications. Besides, some authors suggest that replications that do not verify the results of previous experiments can be used to identify contextual variables causing the discrepancies. Objective: Study how to assess the (dis)similarity of the results of SE replications when they are compared to verify the results of previous experiments and understand how to identify whether contextual variables are influencing results. Method: We run simulations to learn how different ways of comparing replication results behave when verifying the results of previous experiments. We illustrate how to deal with context-induced changes. To do this, we analyze three groups of replications from our own research on test-driven development and testing techniques. Results: The direct comparison of p-values and effect sizes does not appear to be suitable for verifying the results of previous experiments and examining the variables possibly affecting the results in software engineering. Analytical methods such as meta-analysis should be used to assess the similarity of software engineering replication results and identify discrepancies in results. Conclusion: The results achieved in baseline experiments should no longer be regarded as a result that needs to be reproduced, but as a small piece of evidence within a larger picture that only emerges after assembling many small pieces to complete the puzzle.
Submitted 5 November, 2020;
originally announced November 2020.
-
Empirical Standards for Software Engineering Research
Authors:
Paul Ralph,
Nauman bin Ali,
Sebastian Baltes,
Domenico Bianculli,
Jessica Diaz,
Yvonne Dittrich,
Neil Ernst,
Michael Felderer,
Robert Feldt,
Antonio Filieri,
Breno Bernard Nicolau de França,
Carlo Alberto Furia,
Greg Gay,
Nicolas Gold,
Daniel Graziotin,
Pinjia He,
Rashina Hoda,
Natalia Juristo,
Barbara Kitchenham,
Valentina Lenarduzzi,
Jorge Martínez,
Jorge Melegati,
Daniel Mendez,
Tim Menzies,
Jefferson Molleri,
et al. (18 additional authors not shown)
Abstract:
Empirical Standards are natural-language models of a scientific community's expectations for a specific kind of study (e.g. a questionnaire survey). The ACM SIGSOFT Paper and Peer Review Quality Initiative generated empirical standards for research methods commonly used in software engineering. These living documents, which should be continuously revised to reflect evolving consensus around research best practices, will improve research quality and make peer review more effective, reliable, transparent and fair.
Submitted 4 March, 2021; v1 submitted 7 October, 2020;
originally announced October 2020.
-
Increasing Validity Through Replication: An Illustrative TDD Case
Authors:
Adrian Santos,
Sira Vegas,
Fernando Uyaguari,
Oscar Dieste,
Burak Turhan,
Natalia Juristo
Abstract:
Context: Software Engineering (SE) experiments suffer from threats to validity that may impact their results. Replication allows researchers to build on top of previous experiments' weaknesses and increase the reliability of the findings. Objective: Illustrating the benefits of replication to increase the reliability of the findings and uncover moderator variables. Method: We replicate an experiment on Test-Driven-Development (TDD) and address some of its threats to validity and those of a previous replication. We compare the replications' results and hypothesize on plausible moderators impacting results. Results: Differences across TDD replications' results might be due to the operationalization of the response variables, the allocation of subjects to treatments, the allowance to work outside the laboratory, the provision of stubs, or the task. Conclusion: Replications allow examining the robustness of the findings, hypothesizing on plausible moderators influencing results, and strengthening the evidence obtained.
Submitted 11 April, 2020;
originally announced April 2020.
-
A Procedure and Guidelines for Analyzing Groups of Software Engineering Replications
Authors:
Adrian Santos,
Sira Vegas,
Markku Oivo,
Natalia Juristo
Abstract:
Context: Researchers from different groups and institutions are collaborating on building groups of experiments by means of replication (i.e., conducting groups of replications). Disparate aggregation techniques are being applied to analyze groups of replications. The application of unsuitable techniques to aggregate replication results may undermine the potential of groups of replications to provide in-depth insights from experiment results. Objectives: Provide an analysis procedure with a set of embedded guidelines to aggregate software engineering (SE) replication results. Method: We compare the characteristics of groups of replications for SE and other mature experimental disciplines such as medicine and pharmacology. In view of their differences, the limitations with regard to the joint data analysis of groups of SE replications and the guidelines provided in mature experimental disciplines to analyze groups of replications, we build an analysis procedure with a set of embedded guidelines specifically tailored to the analysis of groups of SE replications. We apply the proposed analysis procedure to a representative group of SE replications to illustrate its use. Results: All the information contained within the raw data should be leveraged during the aggregation of replication results. The analysis procedure that we propose encourages the use of stratified individual participant data and aggregated data in tandem to analyze groups of SE replications. Conclusion: The aggregation techniques used to analyze groups of replications should be justified in research articles. This will increase the reliability and transparency of joint results. The proposed guidelines should ease this endeavor.
Submitted 11 April, 2020;
originally announced April 2020.