Automating Test Case Identification in Java Open Source Projects on GitHub
Authors:
Matej Madeja,
Jaroslav Porubän,
Michaela Bačíková,
Matúš Sulír,
Ján Juhár,
Sergej Chodarev,
Filip Gurbáľ
Abstract:
Software testing is one of the very important Quality Assurance (QA) components. A lot of researchers deal with the testing process in terms of tester motivation and how tests should or should not be written. However, it is not known from the recommendations how the tests are written in real projects. In this paper, the following was investigated: (i) the denotation of the word "test" in different…
▽ More
Software testing is one of the very important Quality Assurance (QA) components. A lot of researchers deal with the testing process in terms of tester motivation and how tests should or should not be written. However, it is not known from the recommendations how the tests are written in real projects. In this paper, the following was investigated: (i) the denotation of the word "test" in different natural languages; (ii) whether the number of occurrences of the word "test" correlates with the number of test cases; and (iii) what testing frameworks are mostly used. The analysis was performed on 38 GitHub open source repositories thoroughly selected from the set of 4.3M GitHub projects. We analyzed 20,340 test cases in 803 classes manually and 170k classes using an automated approach. The results show that: (i) there exists a weak correlation (r = 0.655) between the number of occurrences of the word "test" and the number of test cases in a class; (ii) the proposed algorithm using static file analysis correctly detected 97% of test cases; (iii) 15% of the analyzed classes used main() function whose represent regular Java programs that test the production code without using any third-party framework. The identification of such tests is very complex due to implementation diversity. The results may be leveraged to more quickly identify and locate test cases in a repository, to understand practices in customized testing solutions, and to mine tests to improve program comprehension in the future.
△ Less
Submitted 30 July, 2021; v1 submitted 23 February, 2021;
originally announced February 2021.
A Fine-grained Data Set and Analysis of Tangling in Bug Fixing Commits
Authors:
Steffen Herbold,
Alexander Trautsch,
Benjamin Ledel,
Alireza Aghamohammadi,
Taher Ahmed Ghaleb,
Kuljit Kaur Chahal,
Tim Bossenmaier,
Bhaveet Nagaria,
Philip Makedonski,
Matin Nili Ahmadabadi,
Kristof Szabados,
Helge Spieker,
Matej Madeja,
Nathaniel Hoy,
Valentina Lenarduzzi,
Shangwen Wang,
Gema Rodríguez-Pérez,
Ricardo Colomo-Palacios,
Roberto Verdecchia,
Paramvir Singh,
Yihao Qin,
Debasish Chakroborti,
Willard Davis,
Vijay Walunj,
Hongjun Wu
, et al. (23 additional authors not shown)
Abstract:
Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs.
Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits.
Metho…
▽ More
Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs.
Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits.
Methods: We use a crowd sourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus.
Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case.
Conclusion: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy, until proven otherwise.
△ Less
Submitted 13 October, 2021; v1 submitted 12 November, 2020;
originally announced November 2020.