-
Enhancing Testing at Meta with Rich-State Simulated Populations
Authors:
Nadia Alshahwan,
Arianna Blasi,
Kinga Bojarczuk,
Andrea Ciancone,
Natalija Gucevska,
Mark Harman,
Simon Schellaert,
Inna Harper,
Yue Jia,
Michał Królikowski,
Will Lewis,
Dragos Martac,
Rubmary Rojas,
Kate Ustiuzhanina
Abstract:
This paper reports the results of the deployment of Rich-State Simulated Populations at Meta for both automated and manual testing. We use simulated users (aka test users) to mimic user interactions and acquire state in much the same way that real user accounts acquire state. For automated testing, we present empirical results from deployment on the Facebook, Messenger, and Instagram apps for iOS…
▽ More
This paper reports the results of the deployment of Rich-State Simulated Populations at Meta for both automated and manual testing. We use simulated users (aka test users) to mimic user interactions and acquire state in much the same way that real user accounts acquire state. For automated testing, we present empirical results from deployment on the Facebook, Messenger, and Instagram apps for iOS and Android Platforms. These apps consist of tens of millions of lines of code, communicating with hundreds of millions of lines of backend code, and are used by over 2 billion people every day. Our results reveal that rich state increases average code coverage by 38\%, and endpoint coverage by 61\%. More importantly, it also yields an average increase of 115\% in the faults found by automated testing. The rich-state test user populations are also deployed in a (continually evolving) Test Universe; a web-enabled simulation platform for privacy-safe manual testing, which has been used by over 21,000 Meta engineers since its deployment in November 2022.
△ Less
Submitted 22 March, 2024;
originally announced March 2024.
-
Automated Unit Test Improvement using Large Language Models at Meta
Authors:
Nadia Alshahwan,
Jubin Chheda,
Anastasia Finegenova,
Beliz Gokkaya,
Mark Harman,
Inna Harper,
Alexandru Marginean,
Shubho Sengupta,
Eddy Wang
Abstract:
This paper describes Meta's TestGen-LLM tool, which uses LLMs to automatically improve existing human-written tests. TestGen-LLM verifies that its generated test classes successfully clear a set of filters that assure measurable improvement over the original test suite, thereby eliminating problems due to LLM hallucination. We describe the deployment of TestGen-LLM at Meta test-a-thons for the Ins…
▽ More
This paper describes Meta's TestGen-LLM tool, which uses LLMs to automatically improve existing human-written tests. TestGen-LLM verifies that its generated test classes successfully clear a set of filters that assure measurable improvement over the original test suite, thereby eliminating problems due to LLM hallucination. We describe the deployment of TestGen-LLM at Meta test-a-thons for the Instagram and Facebook platforms. In an evaluation on Reels and Stories products for Instagram, 75% of TestGen-LLM's test cases built correctly, 57% passed reliably, and 25% increased coverage. During Meta's Instagram and Facebook test-a-thons, it improved 11.5% of all classes to which it was applied, with 73% of its recommendations being accepted for production deployment by Meta software engineers. We believe this is the first report on industrial scale deployment of LLM-generated code backed by such assurances of code improvement.
△ Less
Submitted 14 February, 2024;
originally announced February 2024.
-
Observation-based unit test generation at Meta
Authors:
Nadia Alshahwan,
Mark Harman,
Alexandru Marginean,
Rotem Tal,
Eddy Wang
Abstract:
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution. We describe the development and deployment of TestGen at Meta. In particular, we focus on the scalability challenges overcome during development in order to deploy observation-based test carving at scale in industry. So far, TestGen has landed 518 tests into production…
▽ More
TestGen automatically generates unit tests, carved from serialized observations of complex objects, observed during app execution. We describe the development and deployment of TestGen at Meta. In particular, we focus on the scalability challenges overcome during development in order to deploy observation-based test carving at scale in industry. So far, TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults. Meta is currently in the process of more widespread deployment. Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86\% of the classes covered by end-to-end tests. Testing on 16 Kotlin Instagram app-launch-blocking tasks demonstrated that the TestGen tests would have trapped 13 of these before they became launch blocking.
△ Less
Submitted 8 February, 2024;
originally announced February 2024.
-
Assured LLM-Based Software Engineering
Authors:
Nadia Alshahwan,
Mark Harman,
Inna Harper,
Alexandru Marginean,
Shubho Sengupta,
Eddy Wang
Abstract:
In this paper we address the following question: How can we use Large Language Models (LLMs) to improve code independently of a human, while ensuring that the improved code
- does not regress the properties of the original code?
- improves the original in a verifiable and measurable way?
To address this question, we advocate Assured LLM-Based Software Engineering; a generate-and-test approac…
▽ More
In this paper we address the following question: How can we use Large Language Models (LLMs) to improve code independently of a human, while ensuring that the improved code
- does not regress the properties of the original code?
- improves the original in a verifiable and measurable way?
To address this question, we advocate Assured LLM-Based Software Engineering; a generate-and-test approach, inspired by Genetic Improvement. Assured LLMSE applies a series of semantic filters that discard code that fails to meet these twin guarantees. This overcomes the potential problem of LLM's propensity to hallucinate. It allows us to generate code using LLMs, independently of any human. The human plays the role only of final code reviewer, as they would do with code generated by other human engineers.
This paper is an outline of the content of the keynote by Mark Harman at the International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, Monday 15th April 2024, Lisbon, Portugal.
△ Less
Submitted 6 February, 2024;
originally announced February 2024.
-
Large Language Models for Software Engineering: Survey and Open Problems
Authors:
Angela Fan,
Beliz Gokkaya,
Mark Harman,
Mitya Lyubarskiy,
Shubho Sengupta,
Shin Yoo,
Jie M. Zhang
Abstract:
This paper provides a survey of the emerging area of Large Language Models (LLMs) for Software Engineering (SE). It also sets out open research challenges for the application of LLMs to technical problems faced by software engineers. LLMs' emergent properties bring novelty and creativity with applications right across the spectrum of Software Engineering activities including coding, design, requir…
▽ More
This paper provides a survey of the emerging area of Large Language Models (LLMs) for Software Engineering (SE). It also sets out open research challenges for the application of LLMs to technical problems faced by software engineers. LLMs' emergent properties bring novelty and creativity with applications right across the spectrum of Software Engineering activities including coding, design, requirements, repair, refactoring, performance improvement, documentation and analytics. However, these very same emergent properties also pose significant technical challenges; we need techniques that can reliably weed out incorrect solutions, such as hallucinations. Our survey reveals the pivotal role that hybrid techniques (traditional SE plus LLMs) have to play in the development and deployment of reliable, efficient and effective LLM-based SE.
△ Less
Submitted 11 November, 2023; v1 submitted 5 October, 2023;
originally announced October 2023.
-
Large Language Models in Fault Localisation
Authors:
Yonghao Wu,
Zheng Li,
Jie M. Zhang,
Mike Papadakis,
Mark Harman,
Yong Liu
Abstract:
Large Language Models (LLMs) have shown promise in multiple software engineering tasks including code generation, program repair, code summarisation, and test generation. Fault localisation is instrumental in enabling automated debugging and repair of programs and was prominently featured as a highlight during the launch event of ChatGPT-4. Nevertheless, the performance of LLMs compared to state-o…
▽ More
Large Language Models (LLMs) have shown promise in multiple software engineering tasks including code generation, program repair, code summarisation, and test generation. Fault localisation is instrumental in enabling automated debugging and repair of programs and was prominently featured as a highlight during the launch event of ChatGPT-4. Nevertheless, the performance of LLMs compared to state-of-the-art methods, as well as the impact of prompt design and context length on their efficacy, remains unclear. To fill this gap, this paper presents an in-depth investigation into the capability of ChatGPT-3.5 and ChatGPT-4, the two state-of-the-art LLMs, on fault localisation. Using the widely-adopted large-scale Defects4J dataset, we compare the two LLMs with the existing fault localisation techniques. We also investigate the consistency of LLMs in fault localisation, as well as how prompt engineering and the length of code context affect the fault localisation effectiveness.
Our findings demonstrate that within function-level context, ChatGPT-4 outperforms all the existing fault localisation methods. Additional error logs can further improve ChatGPT models' localisation accuracy and consistency, with an average 46.9% higher accuracy over the state-of-the-art baseline SmartFL on the Defects4J dataset in terms of TOP-1 metric. However, when the code context of the Defects4J dataset expands to the class-level, ChatGPT-4's performance suffers a significant drop, with 49.9% lower accuracy than SmartFL under TOP-1 metric. These observations indicate that although ChatGPT can effectively localise faults under specific conditions, limitations are evident. Further research is needed to fully harness the potential of LLMs like ChatGPT for practical fault localisation applications.
△ Less
Submitted 2 October, 2023; v1 submitted 29 August, 2023;
originally announced August 2023.
-
COCO: Testing Code Generation Systems via Concretized Instructions
Authors:
Ming Yan,
Junjie Chen,
Jie M. Zhang,
Xuejie Cao,
Chen Yang,
Mark Harman
Abstract:
Code generation systems have been extensively developed in recent years to generate source code based on natural language instructions. However, despite their advancements, these systems still face robustness issues where even slightly different instructions can result in significantly different code semantics. Robustness is critical for code generation systems, as it can have significant impacts…
▽ More
Code generation systems have been extensively developed in recent years to generate source code based on natural language instructions. However, despite their advancements, these systems still face robustness issues where even slightly different instructions can result in significantly different code semantics. Robustness is critical for code generation systems, as it can have significant impacts on software development, software quality, and trust in the generated code. Although existing testing techniques for general text-to-text software can detect some robustness issues, they are limited in effectiveness due to ignoring the characteristics of code generation systems. In this work, we propose a novel technique COCO to test the robustness of code generation systems. It exploits the usage scenario of code generation systems to make the original programming instruction more concrete by incorporating features known to be contained in the original code. A robust system should maintain code semantics for the concretized instruction, and COCO detects robustness inconsistencies when it does not. We evaluated COCO on eight advanced code generation systems, including commercial tools such as Copilot and ChatGPT, using two widely-used datasets. Our results demonstrate the effectiveness of COCO in testing the robustness of code generation systems, outperforming two techniques adopted from general text-to-text software testing by 466.66% and 104.02%, respectively. Furthermore, concretized instructions generated by COCO can help reduce robustness inconsistencies by 18.35% to 53.91% through fine-tuning.
△ Less
Submitted 25 August, 2023;
originally announced August 2023.
-
LLM is Like a Box of Chocolates: the Non-determinism of ChatGPT in Code Generation
Authors:
Shuyin Ouyang,
Jie M. Zhang,
Mark Harman,
Meng Wang
Abstract:
There has been a recent explosion of research on Large Language Models (LLMs) for software engineering tasks, in particular code generation. However, results from LLMs can be highly unstable; nondeterministically returning very different codes for the same prompt. Non-determinism is a potential menace to scientific conclusion validity. When non-determinism is high, scientific conclusions simply ca…
▽ More
There has been a recent explosion of research on Large Language Models (LLMs) for software engineering tasks, in particular code generation. However, results from LLMs can be highly unstable; nondeterministically returning very different codes for the same prompt. Non-determinism is a potential menace to scientific conclusion validity. When non-determinism is high, scientific conclusions simply cannot be relied upon unless researchers change their behaviour to control for it in their empirical analyses. This paper conducts an empirical study to demonstrate that non-determinism is, indeed, high, thereby underlining the need for this behavioural change. We choose to study ChatGPT because it is already highly prevalent in the code generation research literature. We report results from a study of 829 code generation problems from three code generation benchmarks (i.e., CodeContests, APPS, and HumanEval). Our results reveal high degrees of non-determinism: the ratio of coding tasks with zero equal test output across different requests is 72.73%, 60.40%, and 65.85% for CodeContests, APPS, and HumanEval, respectively. In addition, we find that setting the temperature to 0 does not guarantee determinism in code generation, although it indeed brings less non-determinism than the default configuration (temperature=1). These results confirm that there is, currently, a significant threat to scientific conclusion validity. In order to put LLM-based research on firmer scientific foundations, researchers need to take into account non-determinism in drawing their conclusions.
△ Less
Submitted 5 August, 2023;
originally announced August 2023.
-
Fairness Improvement with Multiple Protected Attributes: How Far Are We?
Authors:
Zhenpeng Chen,
Jie M. Zhang,
Federica Sarro,
Mark Harman
Abstract:
Existing research mostly improves the fairness of Machine Learning (ML) software regarding a single protected attribute at a time, but this is unrealistic given that many users have multiple protected attributes. This paper conducts an extensive study of fairness improvement regarding multiple protected attributes, covering 11 state-of-the-art fairness improvement methods. We analyze the effective…
▽ More
Existing research mostly improves the fairness of Machine Learning (ML) software regarding a single protected attribute at a time, but this is unrealistic given that many users have multiple protected attributes. This paper conducts an extensive study of fairness improvement regarding multiple protected attributes, covering 11 state-of-the-art fairness improvement methods. We analyze the effectiveness of these methods with different datasets, metrics, and ML models when considering multiple protected attributes. The results reveal that improving fairness for a single protected attribute can largely decrease fairness regarding unconsidered protected attributes. This decrease is observed in up to 88.3% of scenarios (57.5% on average). More surprisingly, we find little difference in accuracy loss when considering single and multiple protected attributes, indicating that accuracy can be maintained in the multiple-attribute paradigm. However, the effect on F1-score when handling two protected attributes is about twice that of a single attribute. This has important implications for future fairness research: reporting only accuracy as the ML performance metric, which is currently common in the literature, is inadequate.
△ Less
Submitted 4 April, 2024; v1 submitted 25 July, 2023;
originally announced August 2023.
-
Simulation-Driven Automated End-to-End Test and Oracle Inference
Authors:
Shreshth Tuli,
Kinga Bojarczuk,
Natalija Gucevska,
Mark Harman,
Xiao-Yu Wang,
Graham Wright
Abstract:
This is the first work to report on inferential testing at scale in industry. Specifically, it reports the experience of automated testing of integrity systems at Meta. We built an internal tool called ALPACAS for automated inference of end-to-end integrity tests. Integrity tests are designed to keep users safe online by checking that interventions take place when harmful behaviour occurs on a pla…
▽ More
This is the first work to report on inferential testing at scale in industry. Specifically, it reports the experience of automated testing of integrity systems at Meta. We built an internal tool called ALPACAS for automated inference of end-to-end integrity tests. Integrity tests are designed to keep users safe online by checking that interventions take place when harmful behaviour occurs on a platform. ALPACAS infers not only the test input, but also the oracle, by observing production interventions to prevent harmful behaviour. This approach allows Meta to automate the process of generating integrity tests for its platforms, such as Facebook and Instagram, which consist of hundreds of millions of lines of production code. We outline the design and deployment of ALPACAS, and report results for its coverage, number of tests produced at each stage of the test inference process, and their pass rates. Specifically, we demonstrate that using ALPACAS significantly improves coverage from a manual test design for the particular aspect of integrity end-to-end testing it was applied to. Further, from a pool of 3 million data points, ALPACAS automatically yields 39 production-ready end-to-end integrity tests. We also report that the ALPACAS-inferred test suite enjoys exceptionally low flakiness for end-to-end testing with its average in-production pass rate of 99.84%.
△ Less
Submitted 5 February, 2023;
originally announced February 2023.
-
Kee** Mutation Test Suites Consistent and Relevant with Long-Standing Mutants
Authors:
Milos Ojdanic,
Mike Papadakis,
Mark Harman
Abstract:
Mutation testing has been demonstrated to be one of the most powerful fault-revealing tools in the tester's tool kit. Much previous work implicitly assumed it to be sufficient to re-compute mutant suites per release. Sadly, this makes mutation results inconsistent; mutant scores from each release cannot be directly compared, making it harder to measure test improvement. Furthermore, regular code c…
▽ More
Mutation testing has been demonstrated to be one of the most powerful fault-revealing tools in the tester's tool kit. Much previous work implicitly assumed it to be sufficient to re-compute mutant suites per release. Sadly, this makes mutation results inconsistent; mutant scores from each release cannot be directly compared, making it harder to measure test improvement. Furthermore, regular code change means that a mutant suite's relevance will naturally degrade over time. We measure this degradation in relevance for 143,500 mutants in 4 non-trivial systems finding that, on overage, 52% degrade. We introduce a mutant brittleness measure and use it to audit software systems and their mutation suites. We also demonstrate how consistent-by-construction long-standing mutant suites can be identified with a 10x improvement in mutant relevance over an arbitrary test suite. Our results indicate that the research community should avoid the re-computation of mutant suites and focus, instead, on long-standing mutants, thereby improving the consistency and relevance of mutation testing.
△ Less
Submitted 22 December, 2022;
originally announced December 2022.
-
Fairness Testing: A Comprehensive Survey and Analysis of Trends
Authors:
Zhenpeng Chen,
Jie M. Zhang,
Max Hort,
Mark Harman,
Federica Sarro
Abstract:
Unfair behaviors of Machine Learning (ML) software have garnered increasing attention and concern among software engineers. To tackle this issue, extensive research has been dedicated to conducting fairness testing of ML software, and this paper offers a comprehensive survey of existing studies in this field. We collect 100 papers and organize them based on the testing workflow (i.e., how to test)…
▽ More
Unfair behaviors of Machine Learning (ML) software have garnered increasing attention and concern among software engineers. To tackle this issue, extensive research has been dedicated to conducting fairness testing of ML software, and this paper offers a comprehensive survey of existing studies in this field. We collect 100 papers and organize them based on the testing workflow (i.e., how to test) and testing components (i.e., what to test). Furthermore, we analyze the research focus, trends, and promising directions in the realm of fairness testing. We also identify widely-adopted datasets and open-source tools for fairness testing.
△ Less
Submitted 6 March, 2024; v1 submitted 20 July, 2022;
originally announced July 2022.
-
Bias Mitigation for Machine Learning Classifiers: A Comprehensive Survey
Authors:
Max Hort,
Zhenpeng Chen,
Jie M. Zhang,
Mark Harman,
Federica Sarro
Abstract:
This paper provides a comprehensive survey of bias mitigation methods for achieving fairness in Machine Learning (ML) models. We collect a total of 341 publications concerning bias mitigation for ML classifiers. These methods can be distinguished based on their intervention procedure (i.e., pre-processing, in-processing, post-processing) and the technique they apply. We investigate how existing bi…
▽ More
This paper provides a comprehensive survey of bias mitigation methods for achieving fairness in Machine Learning (ML) models. We collect a total of 341 publications concerning bias mitigation for ML classifiers. These methods can be distinguished based on their intervention procedure (i.e., pre-processing, in-processing, post-processing) and the technique they apply. We investigate how existing bias mitigation methods are evaluated in the literature. In particular, we consider datasets, metrics and benchmarking. Based on the gathered insights (e.g., What is the most popular fairness metric? How many datasets are used for evaluating bias mitigation methods?), we hope to support practitioners in making informed choices when develo** and evaluating new bias mitigation methods.
△ Less
Submitted 11 October, 2023; v1 submitted 14 July, 2022;
originally announced July 2022.
-
A Comprehensive Empirical Study of Bias Mitigation Methods for Machine Learning Classifiers
Authors:
Zhenpeng Chen,
Jie M. Zhang,
Federica Sarro,
Mark Harman
Abstract:
Software bias is an increasingly important operational concern for software engineers. We present a large-scale, comprehensive empirical study of 17 representative bias mitigation methods for Machine Learning (ML) classifiers, evaluated with 11 ML performance metrics (e.g., accuracy), 4 fairness metrics, and 20 types of fairness-performance trade-off assessment, applied to 8 widely-adopted softwar…
▽ More
Software bias is an increasingly important operational concern for software engineers. We present a large-scale, comprehensive empirical study of 17 representative bias mitigation methods for Machine Learning (ML) classifiers, evaluated with 11 ML performance metrics (e.g., accuracy), 4 fairness metrics, and 20 types of fairness-performance trade-off assessment, applied to 8 widely-adopted software decision tasks. The empirical coverage is much more comprehensive, covering the largest numbers of bias mitigation methods, evaluation metrics, and fairness-performance trade-off measures compared to previous work on this important software property. We find that (1) the bias mitigation methods significantly decrease ML performance in 53% of the studied scenarios (ranging between 42%~66% according to different ML performance metrics); (2) the bias mitigation methods significantly improve fairness measured by the 4 used metrics in 46% of all the scenarios (ranging between 24%~59% according to different fairness metrics); (3) the bias mitigation methods even lead to decrease in both fairness and ML performance in 25% of the scenarios; (4) the effectiveness of the bias mitigation methods depends on tasks, models, the choice of protected attributes, and the set of metrics used to assess fairness and ML performance; (5) there is no bias mitigation method that can achieve the best trade-off in all the scenarios. The best method that we find outperforms other methods in 30% of the scenarios. Researchers and practitioners need to choose the bias mitigation method best suited to their intended application scenario(s).
△ Less
Submitted 10 February, 2023; v1 submitted 7 July, 2022;
originally announced July 2022.
-
Leveraging Automated Unit Tests for Unsupervised Code Translation
Authors:
Baptiste Roziere,
Jie M. Zhang,
Francois Charton,
Mark Harman,
Gabriel Synnaeve,
Guillaume Lample
Abstract:
With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method developed in the context of natural language translation and one that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive…
▽ More
With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method developed in the context of natural language translation and one that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive to small changes; a single token can result in compilation failures or erroneous programs, unlike natural languages where small inaccuracies may not change the meaning of a sentence. To address this issue, we propose to leverage an automated unit-testing system to filter out invalid translations, thereby creating a fully tested parallel corpus. We found that fine-tuning an unsupervised model with this filtered data set significantly reduces the noise in the translations so-generated, comfortably outperforming the state-of-the-art for all language pairs studied. In particular, for Java $\to$ Python and Python $\to$ C++ we outperform the best previous methods by more than 16% and 24% respectively, reducing the error rate by more than 35%.
△ Less
Submitted 16 February, 2022; v1 submitted 13 October, 2021;
originally announced October 2021.
-
SCSS-Net: Solar Corona Structures Segmentation by Deep Learning
Authors:
Šimon Mackovjak,
Martin Harman,
Viera Maslej-Krešňáková,
Peter Butka
Abstract:
Structures in the solar corona are the main drivers of space weather processes that might directly or indirectly affect the Earth. Thanks to the most recent space-based solar observatories, with capabilities to acquire high-resolution images continuously, the structures in the solar corona can be monitored over the years with a time resolution of minutes. For this purpose, we have developed a meth…
▽ More
Structures in the solar corona are the main drivers of space weather processes that might directly or indirectly affect the Earth. Thanks to the most recent space-based solar observatories, with capabilities to acquire high-resolution images continuously, the structures in the solar corona can be monitored over the years with a time resolution of minutes. For this purpose, we have developed a method for automatic segmentation of solar corona structures observed in EUV spectrum that is based on a deep learning approach utilizing Convolutional Neural Networks. The available input datasets have been examined together with our own dataset based on the manual annotation of the target structures. Indeed, the input dataset is the main limitation of the developed model's performance. Our \textit{SCSS-Net} model provides results for coronal holes and active regions that could be compared with other generally used methods for automatic segmentation. Even more, it provides a universal procedure to identify structures in the solar corona with the help of the transfer learning technique. The outputs of the model can be then used for further statistical studies of connections between solar activity and the influence of space weather on Earth.
△ Less
Submitted 22 September, 2021;
originally announced September 2021.
-
An Empirical Study on Failed Error Propagation in Java Programs with Real Faults
Authors:
Gunel Jahangirova,
David Clark,
Mark Harman,
Paolo Tonella
Abstract:
During testing, developers can place oracles externally or internally with respect to a method. Given a faulty execution state, i.e., one that differs from the expected one, an oracle might be unable to expose the fault if it is placed at a program point with no access to the incorrect program state or where the program state is no longer corrupted. In such a case, the oracle is subject to failed…
▽ More
During testing, developers can place oracles externally or internally with respect to a method. Given a faulty execution state, i.e., one that differs from the expected one, an oracle might be unable to expose the fault if it is placed at a program point with no access to the incorrect program state or where the program state is no longer corrupted. In such a case, the oracle is subject to failed error propagation.
We conducted an empirical study to measure failed error propagation on Defects4J, the reference benchmark for Java programs with real faults, considering all 6 projects available (386 real bugs and 459 fixed methods). Our results indicate that the prevalence of failed error propagation is negligible when testing is performed at the unit level. However, when system-level inputs are provided, the prevalence of failed error propagation increases substantially. This indicates that it is enough for method postconditions to predicate only on the externally observable state/data and that intermediate steps should be checked when testing at system level.
△ Less
Submitted 21 November, 2020;
originally announced November 2020.
-
FrUITeR: A Framework for Evaluating UI Test Reuse
Authors:
Yixue Zhao,
Justin Chen,
Adriana Sejfia,
Marcelo Schmitt Laser,
Jie Zhang,
Federica Sarro,
Mark Harman,
Nenad Medvidovic
Abstract:
UI testing is tedious and time-consuming due to the manual effort required. Recent research has explored opportunities for reusing existing UI tests from an app to automatically generate new tests for other apps. However, the evaluation of such techniques currently remains manual, unscalable, and unreproducible, which can waste effort and impede progress in this emerging area. We introduce FrUITeR…
▽ More
UI testing is tedious and time-consuming due to the manual effort required. Recent research has explored opportunities for reusing existing UI tests from an app to automatically generate new tests for other apps. However, the evaluation of such techniques currently remains manual, unscalable, and unreproducible, which can waste effort and impede progress in this emerging area. We introduce FrUITeR, a framework that automatically evaluates UI test reuse in a reproducible way. We apply FrUITeR to existing test-reuse techniques on a uniform benchmark we established, resulting in 11,917 test reuse cases from 20 apps. We report several key findings aimed at improving UI test reuse that are missed by existing work.
△ Less
Submitted 3 November, 2020; v1 submitted 7 August, 2020;
originally announced August 2020.
-
Ownership at Large -- Open Problems and Challenges in Ownership Management
Authors:
John Ahlgren,
Maria Eugenia Berezin,
Kinga Bojarczuk,
Elena Dulskyte,
Inna Dvortsova,
Johann George,
Natalija Gucevska,
Mark Harman,
Shan He,
Ralf Lämmel,
Erik Meijer,
Silvia Sapora,
Justin Spahr-Summers
Abstract:
Software-intensive organizations rely on large numbers of software assets of different types, e.g., source-code files, tables in the data warehouse, and software configurations. Who is the most suitable owner of a given asset changes over time, e.g., due to reorganization and individual function changes. New forms of automation can help suggest more suitable owners for any given asset at a given p…
▽ More
Software-intensive organizations rely on large numbers of software assets of different types, e.g., source-code files, tables in the data warehouse, and software configurations. Who is the most suitable owner of a given asset changes over time, e.g., due to reorganization and individual function changes. New forms of automation can help suggest more suitable owners for any given asset at a given point in time. By such efforts on ownership health, accountability of ownership is increased. The problem of finding the most suitable owners for an asset is essentially a program comprehension problem: how do we automatically determine who would be best placed to understand, maintain, evolve (and thereby assume ownership of) a given asset. This paper introduces the Facebook Ownesty system, which uses a combination of ultra large scale data mining and machine learning and has been deployed at Facebook as part of the company's ownership management approach. Ownesty processes many millions of software assets (e.g., source-code files) and it takes into account workflow and organizational aspects. The paper sets out open problems and challenges on ownership for the research community with advances expected from the fields of software engineering, programming languages, and machine learning.
△ Less
Submitted 15 April, 2020;
originally announced April 2020.
-
WES: Agent-based User Interaction Simulation on Real Infrastructure
Authors:
John Ahlgren,
Maria Eugenia Berezin,
Kinga Bojarczuk,
Elena Dulskyte,
Inna Dvortsova,
Johann George,
Natalija Gucevska,
Mark Harman,
Ralf Lämmel,
Erik Meijer,
Silvia Sapora,
Justin Spahr-Summers
Abstract:
We introduce the Web-Enabled Simulation (WES) research agenda, and describe FACEBOOK's WW system. We describe the application of WW to reliability, integrity and privacy at FACEBOOK , where it is used to simulate social media interactions on an infrastructure consisting of hundreds of millions of lines of code. The WES agenda draws on research from many areas of study, including Search Based Softw…
▽ More
We introduce the Web-Enabled Simulation (WES) research agenda, and describe FACEBOOK's WW system. We describe the application of WW to reliability, integrity and privacy at FACEBOOK , where it is used to simulate social media interactions on an infrastructure consisting of hundreds of millions of lines of code. The WES agenda draws on research from many areas of study, including Search Based Software Engineering, Machine Learning, Programming Languages, Multi Agent Systems, Graph Theory, Game AI, and AI Assisted Game Play. We conclude with a set of open problems and research challenges to motivate wider investigation.
△ Less
Submitted 11 April, 2020;
originally announced April 2020.
-
FlakiMe: Laboratory-Controlled Test Flakiness Impact Assessment. A Case Study on Mutation Testing and Program Repair
Authors:
Maxime Cordy,
Renaud Rwemalika,
Mike Papadakis,
Mark Harman
Abstract:
Much research on software testing makes an implicit assumption that test failures are deterministic such that they always witness the presence of the same defects. However, this assumption is not always true because some test failures are due to so-called flaky tests, i.e., tests with non-deterministic outcomes. Unfortunately, flaky tests have major implications for testing and test-dependent acti…
▽ More
Much research on software testing makes an implicit assumption that test failures are deterministic such that they always witness the presence of the same defects. However, this assumption is not always true because some test failures are due to so-called flaky tests, i.e., tests with non-deterministic outcomes. Unfortunately, flaky tests have major implications for testing and test-dependent activities such as mutation testing and automated program repair. To deal with this issue, we introduce a test flakiness assessment and experimentation platform, called FlakiMe, that supports the seeding of a (controllable) degree of flakiness into the behaviour of a given test suite. Thereby, FlakiMe equips researchers with ways to investigate the impact of test flakiness on their techniques under laboratory-controlled conditions. We use FlakiME to report results and insights from case studies that assesses the impact of flakiness on mutation testing and program repair. These results indicate that a 5% of flakiness failures is enough to affect the mutation score, but the effect size is modest (2% - 4% ), while it completely annihilates the ability of program repair to patch 50% of the subject programs. We also observe that flakiness has case-specific effects, which mainly disrupts the repair of bugs that are covered by many tests. Moreover, we find that a minimal amount of user feedback is sufficient for alleviating the effects of flakiness.
△ Less
Submitted 6 December, 2019;
originally announced December 2019.
-
Automatic Testing and Improvement of Machine Translation
Authors:
Zeyu Sun,
Jie M. Zhang,
Mark Harman,
Mike Papadakis,
Lu Zhang
Abstract:
This paper presents TransRepair, a fully automatic approach for testing and repairing the consistency of machine translation systems. TransRepair combines mutation with metamorphic testing to detect inconsistency bugs (without access to human oracles). It then adopts probability-reference or cross-reference to post-process the translations, in a grey-box or black-box manner, to repair the inconsis…
▽ More
This paper presents TransRepair, a fully automatic approach for testing and repairing the consistency of machine translation systems. TransRepair combines mutation with metamorphic testing to detect inconsistency bugs (without access to human oracles). It then adopts probability-reference or cross-reference to post-process the translations, in a grey-box or black-box manner, to repair the inconsistencies. Our evaluation on two state-of-the-art translators, Google Translate and Transformer, indicates that TransRepair has a high precision (99%) on generating input pairs with consistent translations. With these tests, using automatic consistency metrics and manual assessment, we find that Google Translate and Transformer have approximately 36% and 40% inconsistency bugs. Black-box repair fixes 28% and 19% bugs on average for Google Translate and Transformer. Grey-box repair fixes 30% bugs on average for Transformer. Manual inspection indicates that the translations repaired by our approach improve consistency in 87% of cases (degrading it in 2%), and that our repairs have better translation acceptability in 27% of the cases (worse in 8%).
△ Less
Submitted 25 December, 2019; v1 submitted 7 October, 2019;
originally announced October 2019.
-
A Survey of Constrained Combinatorial Testing
Authors:
Huayao Wu,
Changhai Nie,
Justyna Petke,
Yue Jia,
Mark Harman
Abstract:
Combinatorial Testing (CT) is a potentially powerful testing technique, whereas its failure revealing ability might be dramatically reduced if it fails to handle constraints in an adequate and efficient manner. To ensure the wider applicability of CT in the presence of constrained problem domains, large and diverse efforts have been invested towards the techniques and applications of constrained c…
▽ More
Combinatorial Testing (CT) is a potentially powerful testing technique, whereas its failure revealing ability might be dramatically reduced if it fails to handle constraints in an adequate and efficient manner. To ensure the wider applicability of CT in the presence of constrained problem domains, large and diverse efforts have been invested towards the techniques and applications of constrained combinatorial testing. In this paper, we provide a comprehensive survey of representations, influences, and techniques that pertain to constraints in CT, covering 129 papers published between 1987 and 2018. This survey not only categorises the various constraint handling techniques, but also reviews comparatively less well-studied, yet potentially important, constraint identification and maintenance techniques. Since real-world programs are usually constrained, this survey can be of interest to researchers and practitioners who are looking to use and study constrained combinatorial testing techniques.
△ Less
Submitted 7 August, 2019;
originally announced August 2019.
-
Machine Learning Testing: Survey, Landscapes and Horizons
Authors:
Jie M. Zhang,
Mark Harman,
Lei Ma,
Yang Liu
Abstract:
This paper provides a comprehensive survey of Machine Learning Testing (ML testing) research. It covers 144 papers on testing properties (e.g., correctness, robustness, and fairness), testing components (e.g., the data, learning program, and framework), testing workflow (e.g., test generation and test evaluation), and application scenarios (e.g., autonomous driving, machine translation). The paper…
▽ More
This paper provides a comprehensive survey of Machine Learning Testing (ML testing) research. It covers 144 papers on testing properties (e.g., correctness, robustness, and fairness), testing components (e.g., the data, learning program, and framework), testing workflow (e.g., test generation and test evaluation), and application scenarios (e.g., autonomous driving, machine translation). The paper also analyses trends concerning datasets, research trends, and research focus, concluding with research challenges and promising research directions in ML testing.
△ Less
Submitted 21 December, 2019; v1 submitted 19 June, 2019;
originally announced June 2019.
-
Sub-Turing Islands in the Wild
Authors:
Earl T. Barr,
David W. Binkley,
Mark Harman,
Mohamed Nassim Seghir
Abstract:
Recently, there has been growing debate as to whether or not static analysis can be truly sound. In spite of this concern, research on techniques seeking to at least partially answer undecidable questions has a long history. However, little attention has been given to the more empirical question of how often an exact solution might be given to a question despite the question being, at least in the…
▽ More
Recently, there has been growing debate as to whether or not static analysis can be truly sound. In spite of this concern, research on techniques seeking to at least partially answer undecidable questions has a long history. However, little attention has been given to the more empirical question of how often an exact solution might be given to a question despite the question being, at least in theory, undecidable. This paper investigates this issue by exploring sub-Turing islands -- regions of code for which a question of interest is decidable. We define such islands and then consider how to identify them. We implemented Cook, a prototype for finding sub-Turing islands and applied it to a corpus of 1100 Android applications, containing over 2 million methods. Results reveal that 55\% of the all methods are sub-Turing. Our results also provide empirical, scientific evidence for the scalability of sub-Turing island identification. Sub-Turing identification has many downstream applications, because islands are so amenable to static analysis. We illustrate two downstream uses of the analysis. In the first, we found that over 37\% of the verification conditions associated with runtime exceptions fell within sub-Turing islands and thus are statically decidable. A second use of our analysis is during code review where it provides guidance to developers. The sub-Turing islands from our study turns out to contain significantly fewer bugs than `theswamp' (non sub-Turing methods). The greater bug density in the swamp is unsurprising; the fact that bugs remain prevalent in islands is, however, surprising: these are bugs whose repair can be fully automated.
△ Less
Submitted 29 May, 2019;
originally announced May 2019.
-
Model Validation Using Mutated Training Labels: An Exploratory Study
Authors:
Jie M. Zhang,
Mark Harman,
Benjamin Guedj,
Earl T. Barr,
John Shawe-Taylor
Abstract:
We introduce an exploratory study on Mutation Validation (MV), a model validation method using mutated training labels for supervised learning. MV mutates training data labels, retrains the model against the mutated data, then uses the metamorphic relation that captures the consequent training performance changes to assess model fit. It does not use a validation set or test set. The intuition unde…
▽ More
We introduce an exploratory study on Mutation Validation (MV), a model validation method using mutated training labels for supervised learning. MV mutates training data labels, retrains the model against the mutated data, then uses the metamorphic relation that captures the consequent training performance changes to assess model fit. It does not use a validation set or test set. The intuition underpinning MV is that overfitting models tend to fit noise in the training data. We explore 8 different learning algorithms, 18 datasets, and 5 types of hyperparameter tuning tasks. Our results demonstrate that MV is accurate in model selection: the model recommendation hit rate is 92\% for MV and less than 60\% for out-of-sample-validation. MV also provides more stable hyperparameter tuning results than out-of-sample-validation across different runs.
△ Less
Submitted 20 October, 2021; v1 submitted 24 May, 2019;
originally announced May 2019.
-
Indexing Operators to Extend the Reach of Symbolic Execution
Authors:
Earl T. Barr,
David Clark,
Mark Harman,
Alexandru Marginean
Abstract:
Traditional program analysis analyses a program language, that is, all programs that can be written in the language. There is a difference, however, between all possible programs that can be written and the corpus of actual programs written in a language. We seek to exploit this difference: for a given program, we apply a bespoke program transformation Indexify to convert expressions that current…
▽ More
Traditional program analysis analyses a program language, that is, all programs that can be written in the language. There is a difference, however, between all possible programs that can be written and the corpus of actual programs written in a language. We seek to exploit this difference: for a given program, we apply a bespoke program transformation Indexify to convert expressions that current SMT solvers do not, in general, handle, such as constraints on strings, into equisatisfiable expressions that they do handle. To this end, Indexify replaces operators in hard-to-handle expressions with homomorphic versions that behave the same on a finite subset of the domain of the original operator, and return bottom denoting unknown outside of that subset. By focusing on what literals and expressions are most useful for analysing a given program, Indexify constructs a small, finite theory that extends the power of a solver on the expressions a target program builds.
Indexify's bespoke nature necessarily means that its evaluation must be experimental, resting on a demonstration of its effectiveness in practice. We have developed Indexif}, a tool for Indexify. We demonstrate its utility and effectiveness by applying it to two real world benchmarks --- string expressions in coreutils and floats in fdlibm53. Indexify reduces time-to-completion on coreutils from Klee's 49.5m on average to 6.0m. It increases branch coverage on coreutils from 30.10% for Klee and 14.79% for Zesti to 66.83%. When indexifying floats in fdlibm53, Indexifyl increases branch coverage from 34.45% to 71.56% over Klee. For a restricted class of inputs, Indexify permits the symbolic execution of program paths unreachable with previous techniques: it covers more than twice as many branches in coreutils as Klee.
△ Less
Submitted 26 June, 2018;
originally announced June 2018.
-
A Study of Bug Resolution Characteristics in Popular Programming Languages
Authors:
Jie M. Zhang,
Feng Li,
Dan Hao,
Meng Wang,
Hao Tang,
Lu Zhang,
Mark Harman
Abstract:
This paper presents a large-scale study that investigates the bug resolution characteristics among popular Github projects written in different programming languages. We explore correlations but, of course, we cannot infer causation. Specifically, we analyse bug resolution data from approximately 70 million Source Line of Code, drawn from 3 million commits to 600 GitHub projects, primarily written…
▽ More
This paper presents a large-scale study that investigates the bug resolution characteristics among popular Github projects written in different programming languages. We explore correlations but, of course, we cannot infer causation. Specifically, we analyse bug resolution data from approximately 70 million Source Line of Code, drawn from 3 million commits to 600 GitHub projects, primarily written in 10 programming languages. We find notable variations in apparent bug resolution time and patch (fix) size. While interpretation of results from such large-scale empirical studies is inherently difficult, we believe that the differences in medians are sufficiently large to warrant further investigation, replication, re-analysis and follow up research. For example, in our corpus, the median apparent bug resolution time (elapsed time from raise to resolve) for Ruby was 4X that for Go and 2.5X for Java. We also found that patches tend to touch more files for the corpus of strongly typed and for statically typed programs. However, we also found evidence for a lower elapsed resolution time for bug resolution committed to projects constructed from statically typed languages. These findings, if replicated in subsequent follow on studies, may shed further empirical light on the debate about the importance of static ty**.
△ Less
Submitted 4 January, 2020; v1 submitted 3 January, 2018;
originally announced January 2018.
-
Using Genetic Programming to Model Software
Authors:
W. B. Langdon,
M. Harman
Abstract:
We study a generic program to investigate the scope for automatically customising it for a vital current task, which was not considered when it was first written. In detail, we show genetic programming (GP) can evolve models of aspects of BLAST's output when it is used to map Solexa Next-Gen DNA sequences to the human genome.
We study a generic program to investigate the scope for automatically customising it for a vital current task, which was not considered when it was first written. In detail, we show genetic programming (GP) can evolve models of aspects of BLAST's output when it is used to map Solexa Next-Gen DNA sequences to the human genome.
△ Less
Submitted 24 June, 2013;
originally announced June 2013.