Search | arXiv e-print repository

doi 10.1007/s10664-020-09805-y

On (Mis)perceptions of testing effectiveness: an empirical study

Authors: Sira Vegas, Patricia Riofrio, Esperanza Marcos, Natalia Juristo

Abstract: A recurring problem in software development is incorrect decision making on the techniques, methods and tools to be used. Mostly, these decisions are based on developers' perceptions about them. A factor influencing people's perceptions is past experience, but it is not the only one. In this research, we aim to discover how well the perceptions of the defect detection effectiveness of different te… ▽ More A recurring problem in software development is incorrect decision making on the techniques, methods and tools to be used. Mostly, these decisions are based on developers' perceptions about them. A factor influencing people's perceptions is past experience, but it is not the only one. In this research, we aim to discover how well the perceptions of the defect detection effectiveness of different techniques match their real effectiveness in the absence of prior experience. To do this, we conduct an empirical study plus a replication. During the original study, we conduct a controlled experiment with students applying two testing techniques and a code review technique. At the end of the experiment, they take a survey to find out which technique they perceive to be most effective. The results show that participants' perceptions are wrong and that this mismatch is costly in terms of quality. In order to gain further insight into the results, we replicate the controlled experiment and extend the survey to include questions about participants' opinions on the techniques and programs. The results of the replicated study confirm the findings of the original study and suggest that participants' perceptions might be based not on their opinions about complexity or preferences for techniques but on how well they think that they have applied the techniques. △ Less

Submitted 11 February, 2024; originally announced February 2024.

Journal ref: Empirical Software Engineering Journal, 25, pp. 2844-2896, 2020

arXiv:2402.07217 [pdf]

doi 10.1016/j.infsof.2017.12.016

Content and structure of laboratory packages for software engineering experiments

Authors: Martín Solari, Sira Vegas, Natalia Juristo

Abstract: Context: Experiment replications play a central role in the scientific method. Although software engineering experimentation has matured a great deal, the number of experiment replications is still relatively small. Software engineering experiments are composed of complex concepts, procedures and artefacts. Laboratory packages are a means of transfer-ring knowledge among researchers to facilitate… ▽ More Context: Experiment replications play a central role in the scientific method. Although software engineering experimentation has matured a great deal, the number of experiment replications is still relatively small. Software engineering experiments are composed of complex concepts, procedures and artefacts. Laboratory packages are a means of transfer-ring knowledge among researchers to facilitate experiment replications. Objective: This paper investigates the experiment replication process to find out what information is needed to successfully replicate an experiment. Our objective is to propose the content and structure of laboratory packages for software engineering experiments. Method: We evaluated seven replications of three different families of experiments. Each replication had a different experimenter who was, at the time, unfamiliar with the experi-ment. During the first iterations of the study, we identified experimental incidents and then proposed a laboratory package structure that addressed these incidents, including docu-ment usability improvements. We used the later iterations to validate and generalize the laboratory package structure for use in all software engineering experiments. We aimed to solve a specific problem, while at the same time looking at how to contribute to the body of knowledge on laboratory packages. Results: We generated a laboratory package for three different experiments. These packages eased the replication of the respective experiments. The evaluation that we conducted shows that the laboratory package proposal is acceptable and reduces the effort currently required to replicate experiments in software engineering. Conclusion: We think that the content and structure that we propose for laboratory pack-ages can be useful for other software engineering experiments. △ Less

Submitted 11 February, 2024; originally announced February 2024.

Journal ref: Information and Software Technology, 97, 64-79, 2018

arXiv:2306.02034 [pdf, other]

Does Microservices Adoption Impact the Development Velocity? A Cohort Study. A Registered Report

Authors: Nyyti Saarimaki, Mikel Robredo, Sira vegas, Natalia Juristo, David Taibi, Valentina Lenarduzzi

Abstract: [Context] Microservices enable the decomposition of applications into small and independent services connected together. The independence between services could positively affect the development velocity of a project, which is considered an important metric measuring the time taken to implement features and fix bugs. However, no studies have investigated the connection between microservices and de… ▽ More [Context] Microservices enable the decomposition of applications into small and independent services connected together. The independence between services could positively affect the development velocity of a project, which is considered an important metric measuring the time taken to implement features and fix bugs. However, no studies have investigated the connection between microservices and development velocity. [Objective and Method] The goal of this study plan is to investigate the effect microservices have on development velocity. The study compares GitHub projects adopting microservices from the beginning and similar projects using monolithic architectures. We designed this study using a cohort study method, to enable obtaining a high level of evidence. [Results] The result of this work enables the confirmation of the effective improvement of the development velocity of microservices. Moreover, this study will contribute to the body of knowledge of empirical methods being among the first works adopting the cohort study methodology. △ Less

Submitted 21 June, 2023; v1 submitted 3 June, 2023; originally announced June 2023.

arXiv:2111.05287 [pdf, other]

Test cases as a measurement instrument in experimentation

Authors: Oscar Dieste, Fernando Uyaguari, Sira Vegas, Natalia Juristo

Abstract: Background: Test suites are frequently used to quantify relevant software attributes, such as quality or productivity. Problem: We have detected that the same response variable, measured using different test suites, yields different experiment results. Aims: Assess to which extent differences in test case construction influence measurement accuracy and experimental outcomes. Method: Two industry e… ▽ More Background: Test suites are frequently used to quantify relevant software attributes, such as quality or productivity. Problem: We have detected that the same response variable, measured using different test suites, yields different experiment results. Aims: Assess to which extent differences in test case construction influence measurement accuracy and experimental outcomes. Method: Two industry experiments have been measured using two different test suites, one generated using an ad-hoc method and another using equivalence partitioning. The accuracy of the measures has been studied using standard procedures, such as ISO 5725, Bland-Altman and Interclass Correlation Coefficients. Results: There are differences in the values of the response variables up to +-60%, depending on the test suite (ad-hoc vs. equivalence partitioning) used. Conclusions: The disclosure of datasets and analysis code is insufficient to ensure the reproducibility of SE experiments. Experimenters should disclose all experimental materials needed to perform independent measurement and re-analysis. △ Less

Submitted 25 April, 2022; v1 submitted 9 November, 2021; originally announced November 2021.

Comments: Author list fixed

arXiv:2110.09366 [pdf]

doi 10.1109/TSE.2021.3113558

Use and Misuse of the Term Experiment in Mining Software Repositories Research

Authors: Claudia Ayala, Burak Turhan, Xavier Franch, Natalia Juristo

Abstract: The significant momentum and importance of Mining Software Repositories (MSR) in Software Engineering (SE) has fostered new opportunities and challenges for extensive empirical research. However, MSR researchers seem to struggle to characterize the empirical methods they use into the existing empirical SE body of knowledge. This is especially the case of MSR experiments. To provide evidence on the… ▽ More The significant momentum and importance of Mining Software Repositories (MSR) in Software Engineering (SE) has fostered new opportunities and challenges for extensive empirical research. However, MSR researchers seem to struggle to characterize the empirical methods they use into the existing empirical SE body of knowledge. This is especially the case of MSR experiments. To provide evidence on the special characteristics of MSR experiments and their differences with experiments traditionally acknowledged in SE so far, we elicited the hallmarks that differentiate an experiment from other types of empirical studies and characterized the hallmarks and types of experiments in MSR. We analyzed MSR literature obtained from a small-scale systematic map** study to assess the use of the term experiment in MSR. We found that 19% of the papers claiming to be an experiment are indeed not an experiment at all but also observational studies, so they use the term in a misleading way. From the remaining 81% of the papers, only one of them refers to a genuine controlled experiment while the others stand for experiments with limited control. MSR researchers tend to overlook such limitations, compromising the interpretation of the results of their studies. We provide recommendations and insights to support the improvement of MSR experiments. △ Less

Submitted 18 October, 2021; originally announced October 2021.

arXiv:2108.12800 [pdf, ps, other]

A City upon a Hill: Casting Light on a Real Experimental Process

Authors: Efraín R. Fonseca C., Oscar Dieste, Natalia Juristo

Abstract: Context: The overall scientific community is proposing measures to improve the reproducibility and replicability of experiments. Reproducibility is relatively easy to achieve. However, replicability is considerably more complex in both the sciences and Empirical Software Engineering (ESE). Several strategies, e.g., replication packages and families of experiments, have been proposed to improve rep… ▽ More Context: The overall scientific community is proposing measures to improve the reproducibility and replicability of experiments. Reproducibility is relatively easy to achieve. However, replicability is considerably more complex in both the sciences and Empirical Software Engineering (ESE). Several strategies, e.g., replication packages and families of experiments, have been proposed to improve replication in ESE, with limited success. We wonder whether the failures are due to some mismatch, i.e., the researchers' needs are not satisfied by the proposed replication procedures. Objectives: Find out how experimental researchers conduct \textit{experiments in practice}. Methods: We carried out an ethnography study within a SE Research Group. Our main activity was to observe/approach the experimental researchers in their day-to-day settings for two years. Their preferred literature and experimental materials were studied. We used individual and group interviews to gain understanding and examine unclear topics in-depth. Results: We have created conceptual and process models that represent how experimentation is really conducted in the Research Group. Models fit the community's procedures and terminology at a high level, but they become particular in their minute details. Conclusion: The actual experimental process differs from textbooks in several points, namely: (1) Number and diversity of activities, (2) existence of different roles, (3) the granularity of the concepts used by the roles, and (4) the viewpoints that different sub-areas or families of experiments have about the overall process. △ Less

Submitted 29 August, 2021; originally announced August 2021.

arXiv:2106.12533 [pdf, ps, other]

doi 10.1145/3383219.3383233

Publication Bias: A Detailed Analysis of Experiments Published in ESEM

Authors: Rolando P. Reyes, Óscar Dieste, Efraín R. Fonseca C., Natalia Juristo

Abstract: Background: Publication bias is the failure to publish the results of a study based on the direction or strength of the study findings. The existence of publication bias is firmly established in areas like medical research. Recent research suggests the existence of publication bias in Software Engineering. Aims: Finding out whether experiments published in the International Workshop on Empirical S… ▽ More Background: Publication bias is the failure to publish the results of a study based on the direction or strength of the study findings. The existence of publication bias is firmly established in areas like medical research. Recent research suggests the existence of publication bias in Software Engineering. Aims: Finding out whether experiments published in the International Workshop on Empirical Software Engineering and Measurement (ESEM) are affected by publication bias. Method: We review experiments published in ESEM. We also survey with experimental researchers to triangulate our findings. Results: ESEM experiments do not define hypotheses and frequently perform multiple testing. One-tailed tests have a slightly higher rate of achieving statistically significant results. We could not find other practices associated with publication bias. Conclusions: Our results provide a more encouraging perspective of SE research than previous research: (1) ESEM publications do not seem to be strongly affected by biases and (2) we identify some practices that could be associated with p-hacking, but it is more likely that they are related to the conduction of exploratory research. △ Less

Submitted 23 June, 2021; originally announced June 2021.

Comments: 10 pages, 7 tables, 7 figures

Journal ref: Proceedings of the Evaluation and Assessment in Software Engineering. 2020. 130-139

arXiv:2105.03312 [pdf, other]

doi 10.1016/j.jss.2021.110937

Studying Test-Driven Development and its Retainment Over a Six-month Time Span

Authors: Maria Teresa Baldassarre, Danilo Caivano, Davide Fucci, Natalia Juristo, Simone Romano, Giuseppe Scanniello, BurakTurhan

Abstract: In this paper, we investigate the effect of TDD, as compared to a non-TDD approach, as well as its retainment (or retention) over a time span of (about) six months. To pursue these objectives, we conducted a (quantitative) longitudinal cohort study with 30 novice developers (i.e., third-year undergraduate students in Computer Science). We observed that TDD affects neither the external quality of s… ▽ More In this paper, we investigate the effect of TDD, as compared to a non-TDD approach, as well as its retainment (or retention) over a time span of (about) six months. To pursue these objectives, we conducted a (quantitative) longitudinal cohort study with 30 novice developers (i.e., third-year undergraduate students in Computer Science). We observed that TDD affects neither the external quality of software products nor developers' productivity. However, we observed that the participants applying TDD produced significantly more tests, with a higher fault-detection capability than those using a non-TDD approach. As for the retainment of TDD, we found that TDD is retained by novice developers for at least six months. △ Less

Submitted 11 May, 2021; v1 submitted 7 May, 2021; originally announced May 2021.

Journal ref: Journal of Systems and Software, Volume 176, 2021, 110937, ISSN 0164-1212

arXiv:2011.11942 [pdf, other]

A Family of Experiments on Test-Driven Development

Authors: Adrian Santos, Sira Vegas, Oscar Dieste, Fernando Uyaguari, Aysee Tosun, Davide Fucci, Burak Turhan, Giuseppe Scanniello, Simone Romano, Itir Karac, Marco Kuhrmann, Vladimir Mandic, Robert Ramac, Dietmar Pfahl, Christian Engblom, Jarno Kyykka, Kerli Rungi, Carolina Palomeque, Jaroslav Spisak, Markku Oivo, Natalia Juristo

Abstract: Context: Test-driven development (TDD) is an agile software development approach that has been widely claimed to improve software quality. However, the extent to which TDD improves quality appears to be largely dependent upon the characteristics of the study in which it is evaluated (e.g., the research method, participant type, programming environment, etc.). The particularities of each study make… ▽ More Context: Test-driven development (TDD) is an agile software development approach that has been widely claimed to improve software quality. However, the extent to which TDD improves quality appears to be largely dependent upon the characteristics of the study in which it is evaluated (e.g., the research method, participant type, programming environment, etc.). The particularities of each study make the aggregation of results untenable. Objectives: The goal of this paper is to: increase the accuracy and generalizability of the results achieved in isolated experiments on TDD, provide joint conclusions on the performance of TDD across different industrial and academic settings, and assess the extent to which the characteristics of the experiments affect the quality-related performance of TDD. Method: We conduct a family of 12 experiments on TDD in academia and industry. We aggregate their results by means of meta-analysis. We perform exploratory analyses to identify variables impacting the quality-related performance of TDD. Results: TDD novices achieve a slightly higher code quality with iterative test-last development (i.e., ITL, the reverse approach of TDD) than with TDD. The task being developed largely determines quality. The programming environment, the order in which TDD and ITL are applied, or the learning effects from one development approach to another do not appear to affect quality. The quality-related performance of professionals using TDD drops more than for students. We hypothesize that this may be due to their being more resistant to change and potentially less motivated than students. Conclusion: Previous studies seem to provide conflicting results on TDD performance (i.e., positive vs. negative, respectively). We hypothesize that these conflicting results may be due to different study durations, experiment participants being unfamiliar with the TDD process... △ Less

Submitted 24 November, 2020; originally announced November 2020.

arXiv:2011.02861 [pdf, other]

Comparing the Results of Replications in Software Engineering

Authors: Adrian Santos, Sira Vegas, Markku Oivo, Natalia Juristo

Abstract: Context: It has been argued that software engineering replications are useful for verifying the results of previous experiments. However, it has not yet been agreed how to check whether the results hold across replications. Besides, some authors suggest that replications that do not verify the results of previous experiments can be used to identify contextual variables causing the discrepancies. O… ▽ More Context: It has been argued that software engineering replications are useful for verifying the results of previous experiments. However, it has not yet been agreed how to check whether the results hold across replications. Besides, some authors suggest that replications that do not verify the results of previous experiments can be used to identify contextual variables causing the discrepancies. Objective: Study how to assess the (dis)similarity of the results of SE replications when they are compared to verify the results of previous experiments and understand how to identify whether contextual variables are influencing results. Method: We run simulations to learn how different ways of comparing replication results behave when verifying the results of previous experiments. We illustrate how to deal with context-induced changes. To do this, we analyze three groups of replications from our own research on test-driven development and testing techniques. Results: The direct comparison of p-values and effect sizes does not appear to be suitable for verifying the results of previous experiments and examining the variables possibly affecting the results in software engineering. Analytical methods such as meta-analysis should be used to assess the similarity of software engineering replication results and identify discrepancies in results. Conclusion: The results achieved in baseline experiments should no longer be regarded as a result that needs to be reproduced, but as a small piece of evidence within a larger picture that only emerges after assembling many small pieces to complete the puzzle. △ Less

Submitted 5 November, 2020; originally announced November 2020.

arXiv:2010.03525 [pdf]

Empirical Standards for Software Engineering Research

Authors: Paul Ralph, Nauman bin Ali, Sebastian Baltes, Domenico Bianculli, Jessica Diaz, Yvonne Dittrich, Neil Ernst, Michael Felderer, Robert Feldt, Antonio Filieri, Breno Bernard Nicolau de França, Carlo Alberto Furia, Greg Gay, Nicolas Gold, Daniel Graziotin, Pinjia He, Rashina Hoda, Natalia Juristo, Barbara Kitchenham, Valentina Lenarduzzi, Jorge Martínez, Jorge Melegati, Daniel Mendez, Tim Menzies, Jefferson Molleri , et al. (18 additional authors not shown)

Abstract: Empirical Standards are natural-language models of a scientific community's expectations for a specific kind of study (e.g. a questionnaire survey). The ACM SIGSOFT Paper and Peer Review Quality Initiative generated empirical standards for research methods commonly used in software engineering. These living documents, which should be continuously revised to reflect evolving consensus around resear… ▽ More Empirical Standards are natural-language models of a scientific community's expectations for a specific kind of study (e.g. a questionnaire survey). The ACM SIGSOFT Paper and Peer Review Quality Initiative generated empirical standards for research methods commonly used in software engineering. These living documents, which should be continuously revised to reflect evolving consensus around research best practices, will improve research quality and make peer review more effective, reliable, transparent and fair. △ Less

Submitted 4 March, 2021; v1 submitted 7 October, 2020; originally announced October 2020.

Comments: For the complete standards, supplements and other resources, see https://github.com/acmsigsoft/EmpiricalStandards

arXiv:2008.12528 [pdf, ps, other]

Researcher Bias in Software Engineering Experiments: a Qualitative Investigation

Authors: Simone Romano, Davide Fucci, Giuseppe Scanniello, Maria Teresa Baldassarre, Burak Turhan, Natalia Juristo

Abstract: Researcher Bias (RB) occurs when researchers influence the results of an empirical study based on their expectations.RB might be due to the use of Questionable Research Practices(QRPs). In research fields like medicine, blinding techniques have been applied to counteract RB. We conducted an explorative qualitative survey to investigate RB in Software Engineering (SE)experiments, with respect to (i… ▽ More Researcher Bias (RB) occurs when researchers influence the results of an empirical study based on their expectations.RB might be due to the use of Questionable Research Practices(QRPs). In research fields like medicine, blinding techniques have been applied to counteract RB. We conducted an explorative qualitative survey to investigate RB in Software Engineering (SE)experiments, with respect to (i) QRPs potentially leading to RB, (ii) causes behind RB, and (iii) possible actions to counteract including blinding techniques. Data collection was based on semi-structured interviews. We interviewed nine active experts in the empirical SE community. We then analyzed the transcripts of these interviews through thematic analysis. We found that some QRPs are acceptable in certain cases. Also, it appears that the presence of RB is perceived in SE and, to counteract RB, a number of solutions have been highlighted: some are intended for SE researchers and others for the boards of SE research outlets. △ Less

Submitted 28 August, 2020; originally announced August 2020.

Comments: Published at SEAA2020

arXiv:2004.05335 [pdf, ps, other]

Increasing Validity Through Replication: An Illustrative TDD Case

Authors: Adrian Santos, Sira Vegas, Fernando Uyaguari, Oscar Dieste, Burak Turhan, Natalia Juristo

Abstract: Context: Software Engineering (SE) experiments suffer from threats to validity that may impact their results. Replication allows researchers building on top of previous experiments' weaknesses and increasing the reliability of the findings. Objective: Illustrating the benefits of replication to increase the reliability of the findings and uncover moderator variables. Method: We replicate an experi… ▽ More Context: Software Engineering (SE) experiments suffer from threats to validity that may impact their results. Replication allows researchers building on top of previous experiments' weaknesses and increasing the reliability of the findings. Objective: Illustrating the benefits of replication to increase the reliability of the findings and uncover moderator variables. Method: We replicate an experiment on Test-Driven-Development (TDD) and address some of its threats to validity and those of a previous replication. We compare the replications' results and hypothesize on plausible moderators impacting results. Results: Differences across TDD replications' results might be due to the operationalization of the response variables, the allocation of subjects to treatments, the allowance to work outside the laboratory, the provision of stubs, or the task. Conclusion: Replications allow examining the robustness of the findings, hypothesizing on plausible moderators influencing results, and strengthening the evidence obtained. △ Less

Submitted 11 April, 2020; originally announced April 2020.

arXiv:2004.05332 [pdf, other]

A Procedure and Guidelines for Analyzing Groups of Software Engineering Replications

Authors: Adrian Santos, Sira Vegas, Markku Oivo, Natalia Juristo

Abstract: Context: Researchers from different groups and institutions are collaborating on building groups of experiments by means of replication (i.e., conducting groups of replications). Disparate aggregation techniques are being applied to analyze groups of replications. The application of unsuitable techniques to aggregate replication results may undermine the potential of groups of replications to prov… ▽ More Context: Researchers from different groups and institutions are collaborating on building groups of experiments by means of replication (i.e., conducting groups of replications). Disparate aggregation techniques are being applied to analyze groups of replications. The application of unsuitable techniques to aggregate replication results may undermine the potential of groups of replications to provide in-depth insights from experiment results. Objectives: Provide an analysis procedure with a set of embedded guidelines to aggregate software engineering (SE) replication results. Method: We compare the characteristics of groups of replications for SE and other mature experimental disciplines such as medicine and pharmacology. In view of their differences, the limitations with regard to the joint data analysis of groups of SE replications and the guidelines provided in mature experimental disciplines to analyze groups of replications, we build an analysis procedure with a set of embedded guidelines specifically tailored to the analysis of groups of SE replications. We apply the proposed analysis procedure to a representative group of SE replications to illustrate its use. Results: All the information contained within the raw data should be leveraged during the aggregation of replication results. The analysis procedure that we propose encourages the use of stratified individual participant data and aggregated data in tandem to analyze groups of SE replications. Conclusion: The aggregation techniques used to analyze groups of replications should be justified in research articles. This will increase the reliability and transparency of joint results. The proposed guidelines should ease this endeavor. △ Less

Submitted 11 April, 2020; originally announced April 2020.

arXiv:1809.01828 [pdf, ps, other]

Improving Development Practices through Experimentation: an Industrial TDD Case

Authors: Adrian Santos, Jaroslav Spisak, Markku Oivo, Natalia Juristo

Abstract: Test-Driven Development (TDD), an agile development approach that enforces the construction of software systems by means of successive micro-iterative testing coding cycles, has been widely claimed to increase external software quality. In view of this, some managers at Paf-a Nordic gaming entertainment company-were interested in knowing how would TDD perform at their premises. Eventually, if TDD… ▽ More Test-Driven Development (TDD), an agile development approach that enforces the construction of software systems by means of successive micro-iterative testing coding cycles, has been widely claimed to increase external software quality. In view of this, some managers at Paf-a Nordic gaming entertainment company-were interested in knowing how would TDD perform at their premises. Eventually, if TDD outperformed their traditional way of coding (i.e., YW, short for Your Way), it would be possible to switch to TDD considering the empirical evidence achieved at the company level. We conduct an experiment at Paf to evaluate the performance of TDD, YW and the reverse approach of TDD (i.e., ITL, short for Iterative-Test Last) on external quality. TDD outperforms YW and ITL at Paf. Despite the encouraging results, we cannot recommend Paf to immediately adopt TDD as the difference in performance between YW and TDD is small. However, as TDD looks promising at Paf, we suggest to move some developers to TDD and to run a future experiment to compare the performance of TDD and YW. TDD slightly outperforms ITL in controlled experiments for TDD novices. However, more industrial experiments are still needed to evaluate the performance of TDD in real-life contexts. △ Less

Submitted 6 September, 2018; originally announced September 2018.

arXiv:1807.06851 [pdf, ps, other]

Moving Beyond the Mean: Analyzing Variance in Software Engineering Experiments

Authors: Adrian Santos, Markku Oivo, Natalia Juristo

Abstract: Software Engineering (SE) experiments are traditionally analyzed with statistical tests (e.g., t-tests, ANOVAs, etc.) that assume equally spread data across treatments (i.e., the homogeneity of variances assumption). Differences across treatments' variances in SE are not seen as an opportunity to gain insights on technology performance, but instead, as a hindrance to analyze the data. We have stud… ▽ More Software Engineering (SE) experiments are traditionally analyzed with statistical tests (e.g., t-tests, ANOVAs, etc.) that assume equally spread data across treatments (i.e., the homogeneity of variances assumption). Differences across treatments' variances in SE are not seen as an opportunity to gain insights on technology performance, but instead, as a hindrance to analyze the data. We have studied the role of variance in mature experimental disciplines such as medicine. We illustrate the extent to which variance may inform on technology performance by means of simulation. We analyze a real-life industrial experiment on Test-Driven Development (TDD) where variance may impact technology desirability. Evaluating the performance of technologies just based on means-as traditionally done in SE-may be misleading. Technologies that make developers resemble more to each other (i.e., technologies with smaller variances) may be more suitable if the aim is minimizing the risk of adopting them in real practice. △ Less

Submitted 5 September, 2018; v1 submitted 18 July, 2018; originally announced July 2018.

arXiv:1807.06850 [pdf, other]

Does the performance of TDD hold across software companies and premises? A group of industrial experiments on TDD

Authors: Adrian Santos, Janne Jarvinen, Jari Partanen, Markku Oivo, Natalia Juristo

Abstract: Test-Driven Development (TDD) has been claimed to increase external software quality. However, the extent to which TDD increases external quality has been seldom studied in industrial experiments. We conduct four industrial experiments in two different companies to evaluate the performance of TDD on external quality. We study whether the performance of TDD holds across premises within the same com… ▽ More Test-Driven Development (TDD) has been claimed to increase external software quality. However, the extent to which TDD increases external quality has been seldom studied in industrial experiments. We conduct four industrial experiments in two different companies to evaluate the performance of TDD on external quality. We study whether the performance of TDD holds across premises within the same company and across companies. We identify participant-level characteristics impacting results. Iterative-Test Last (ITL), the reverse approach of TDD, outperforms TDD in three out of four premises. ITL outperforms TDD in both companies. The larger the experience with unit testing and testing tools, the larger the difference in performance between ITL and TDD (in favour of ITL). Technological environment (i.e., programming language and testing tool) seems not to impact results. Evaluating participant-level characteristics impacting results in industrial experiments may ease the understanding of the performance of TDD in realistic settings. △ Less

Submitted 20 July, 2018; v1 submitted 18 July, 2018; originally announced July 2018.

arXiv:1807.06809 [pdf, other]

Comparing Techniques for Aggregating Interrelated Replications in Software Engineering

Authors: Adrian Santos, Natalia Juristo

Abstract: Context: Researchers from different groups and institutions are collaborating towards the construction of groups of interrelated replications. Applying unsuitable techniques to aggregate interrelated replications' results may impact the reliability of joint conclusions. Objectives: Comparing the advantages and disadvantages of the techniques applied to aggregate interrelated replications' result… ▽ More Context: Researchers from different groups and institutions are collaborating towards the construction of groups of interrelated replications. Applying unsuitable techniques to aggregate interrelated replications' results may impact the reliability of joint conclusions. Objectives: Comparing the advantages and disadvantages of the techniques applied to aggregate interrelated replications' results in Software Engineering (SE). Method: We conducted a literature review to identify the techniques applied to aggregate interrelated replications' results in SE. We analyze a prototypical group of interrelated replications in SE with the techniques that we identified. We check whether the advantages and disadvantages of each technique -according to mature experimental disciplines such as medicine- materialize in the SE context. Results: Narrative synthesis and Aggregation of p-values do not take advantage of all the information contained within the raw-data for providing joint conclusions. Aggregated Data (AD) meta-analysis provides visual summaries of results and allows assessing experiment-level moderators. Individual Participant Data (IPD) meta-analysis allows interpreting results in natural units and assessing experiment-level and participant-level moderators. Conclusion: All the information contained within the raw-data should be used to provide joint conclusions. AD and IPD, when used in tandem, seem suitable to analyze groups of interrelated replications in SE. △ Less

Submitted 29 July, 2018; v1 submitted 18 July, 2018; originally announced July 2018.

arXiv:1807.04100 [pdf, other]

The Effect of Noise on Sofware Engineers' Performance

Authors: Simone Romano, Giuseppe Scanniello, Davide Fucci, Natalia Juristo, Burak Turhan

Abstract: Background: Noise, defined as an unwanted sound, is one of the commonest factors that could affect people's performance in their daily work activities. The software engineering research community has marginally investigated the effects of noise on software engineers' performance. Aims: We studied if noise affects software engineers' performance in (i) comprehending functional requirements and (ii)… ▽ More Background: Noise, defined as an unwanted sound, is one of the commonest factors that could affect people's performance in their daily work activities. The software engineering research community has marginally investigated the effects of noise on software engineers' performance. Aims: We studied if noise affects software engineers' performance in (i) comprehending functional requirements and (ii) fixing faults in the source code. Method: We conducted two experiments with final-year undergraduate students in Computer Science. In the first experiment, we asked 55 students to comprehend functional requirements exposing them or not to noise, while in the second experiment 42 students were asked to fix faults in Java code. Results: The participants in the second experiment, when exposed to noise, had significantly worse performance in fixing faults in the source code. On the other hand, we did not observe any statistically significant difference in the first experiment. Conclusions: Fixing faults in source code seems to be more vulnerable to noise than comprehending functional requirements. △ Less

Submitted 11 July, 2018; originally announced July 2018.

Comments: ESEM18, Oulu (Finland), October 2018

arXiv:1807.02971 [pdf, other]

A Longitudinal Cohort Study on the Retainment of Test-Driven Development

Authors: Davide Fucci, Simone Romano, Maria Teresa Baldassarre, Danilo Caivano, Giuseppe Scanniello, Burak Thuran, Natalia Juristo

Abstract: Background: Test-Driven Development (TDD) is an agile software development practice, which is claimed to boost both external quality of software products and developers' productivity. Aims: We want to study (i) the TDD effects on the external quality of software products as well as the developers' productivity, and (ii) the retainment of TDD over a period of five months. Method: We conducted a (qu… ▽ More Background: Test-Driven Development (TDD) is an agile software development practice, which is claimed to boost both external quality of software products and developers' productivity. Aims: We want to study (i) the TDD effects on the external quality of software products as well as the developers' productivity, and (ii) the retainment of TDD over a period of five months. Method: We conducted a (quantitative) longitudinal cohort study with 30 third year undergraduate students in Computer Science at the University of Bari in Italy. Results: The use of TDD has a statistically significant effect neither on the external quality of software products nor on the developers' productivity. However, we observed that participants using TDD produced significantly more tests than those applying a non-TDD development process and that the retainment of TDD is particularly noticeable in the amount of tests written. Conclusions: Our results should encourage software companies to adopt TDD because who practices TDD tends to write more tests---having more tests can come in handy when testing software systems or localizing faults---and it seems that novice developers retain TDD. △ Less

Submitted 9 July, 2018; originally announced July 2018.

Comments: ESEM, October 2018, Oulu, Finland

arXiv:1805.09009 [pdf, other]

Analyzing Families of Experiments in SE: A Systematic Map** Study

Authors: Adrian Santos, Omar Gomez, Natalia Juristo

Abstract: Context: Families of experiments (i.e., groups of experiments with the same goal) are on the rise in Software Engineering (SE). Selecting unsuitable aggregation techniques to analyze families may undermine their potential to provide in-depth insights from experiments' results. Objectives: Identifying the techniques used to aggregate experiments' results within families in SE. Raising awareness o… ▽ More Context: Families of experiments (i.e., groups of experiments with the same goal) are on the rise in Software Engineering (SE). Selecting unsuitable aggregation techniques to analyze families may undermine their potential to provide in-depth insights from experiments' results. Objectives: Identifying the techniques used to aggregate experiments' results within families in SE. Raising awareness of the importance of applying suitable aggregation techniques to reach reliable conclusions within families. Method: We conduct a systematic map** study (SMS) to identify the aggregation techniques used to analyze families of experiments in SE. We outline the advantages and disadvantages of each aggregation technique according to mature experimental disciplines such as medicine and pharmacology. We provide preliminary recommendations to analyze and report families of experiments in view of families' common limitations with regard to joint data analysis. Results: Several aggregation techniques have been used to analyze SE families of experiments, including Narrative synthesis, Aggregated Data (AD), Individual Participant Data (IPD) mega-trial or stratified, and Aggregation of p-values. The rationale used to select aggregation techniques is rarely discussed within families. Families of experiments are commonly analyzed with unsuitable aggregation techniques according to the literature of mature experimental disciplines. Conclusion: Data analysis' reporting practices should be improved to increase the reliability and transparency of joint results. AD and IPD stratified appear to be suitable to analyze SE families of experiments. △ Less

Submitted 2 August, 2018; v1 submitted 23 May, 2018; originally announced May 2018.

arXiv:1805.02544 [pdf, other]

Need for Sleep: the Impact of a Night of Sleep Deprivation on Novice Developers' Performance

Authors: Davide Fucci, Giuseppe Scanniello, Simone Romano, Natalia Juristo

Abstract: We present a quasi-experiment to investigate whether, and to what extent, sleep deprivation impacts the performance of novice software developers using the agile practice of test-first development (TFD). We recruited 45 undergraduates and asked them to tackle a programming task. Among the participants, 23 agreed to stay awake the night before carrying out the task, while 22 slept usually. We analy… ▽ More We present a quasi-experiment to investigate whether, and to what extent, sleep deprivation impacts the performance of novice software developers using the agile practice of test-first development (TFD). We recruited 45 undergraduates and asked them to tackle a programming task. Among the participants, 23 agreed to stay awake the night before carrying out the task, while 22 slept usually. We analyzed the quality (i.e., the functional correctness) of the implementations delivered by the participants in both groups, their engagement in writing source code (i.e., the amount of activities performed in the IDE while tackling the programming task) and ability to apply TFD (i.e., the extent to which a participant can use this practice). By comparing the two groups of participants, we found that a single night of sleep deprivation leads to a reduction of 50% in the quality of the implementations. There is important evidence that the developers' engagement and their prowess to apply TFD are negatively impacted. Our results also show that sleep-deprived developers make more fixes to syntactic mistakes in the source code. We conclude that sleep deprivation has possibly disruptive effects on software development activities. The results open opportunities for improving developers' performance by integrating the study of sleep with other psycho-physiological factors in which the software engineering research community has recently taken an interest in. △ Less

Submitted 7 May, 2018; originally announced May 2018.

Comments: Accepted to IEEE Transactions on Software Engineering

arXiv:1611.05994 [pdf, other]

doi 10.1109/TSE.2016.2616877

A Dissection of the Test-Driven Development Process: Does It Really Matter to Test-First or to Test-Last?

Authors: Davide Fucci, Hakan Erdogmus, Burak Turhan, Markku Oivo, Natalia Juristo

Abstract: Background: Test-driven development (TDD) is a technique that repeats short coding cycles interleaved with testing. The developer first writes a unit test for the desired functionality, followed by the necessary production code, and refactors the code. Many empirical studies neglect unique process characteristics related to TDD iterative nature. Aim: We formulate four process characteristic: seque… ▽ More Background: Test-driven development (TDD) is a technique that repeats short coding cycles interleaved with testing. The developer first writes a unit test for the desired functionality, followed by the necessary production code, and refactors the code. Many empirical studies neglect unique process characteristics related to TDD iterative nature. Aim: We formulate four process characteristic: sequencing, granularity, uniformity, and refactoring effort. We investigate how these characteristics impact quality and productivity in TDD and related variations. Method: We analyzed 82 data points collected from 39 professionals, each capturing the process used while performing a specific development task. We built regression models to assess the impact of process characteristics on quality and productivity. Quality was measured by functional correctness. Result: Quality and productivity improvements were primarily positively associated with the granularity and uniformity. Sequencing, the order in which test and production code are written, had no important influence. Refactoring effort was negatively associated with both outcomes. We explain the unexpected negative correlation with quality by possible prevalence of mixed refactoring. Conclusion: The claimed benefits of TDD may not be due to its distinctive test-first dynamic, but rather due to the fact that TDD-like processes encourage fine-grained, steady steps that improve focus and flow. △ Less

Submitted 18 November, 2016; originally announced November 2016.

Showing 1–23 of 23 results for author: Juristo, N