-
Mind the Gap: The Difference Between Coverage and Mutation Score Can Guide Testing Efforts
Authors:
Kush Jain,
Goutamkumar Tulajappa Kalburgi,
Claire Le Goues,
Alex Groce
Abstract:
An "adequate" test suite should effectively find all inconsistencies between a system's requirements/specifications and its implementation. Practitioners frequently use code coverage to approximate adequacy, while academics argue that mutation score may better approximate true (oracular) adequacy coverage. High code coverage is increasingly attainable even on large systems via automatic test gener…
▽ More
An "adequate" test suite should effectively find all inconsistencies between a system's requirements/specifications and its implementation. Practitioners frequently use code coverage to approximate adequacy, while academics argue that mutation score may better approximate true (oracular) adequacy coverage. High code coverage is increasingly attainable even on large systems via automatic test generation, including fuzzing. In light of all of these options for measuring and improving testing effort, how should a QA engineer spend their time? We propose a new framework for reasoning about the extent, limits, and nature of a given testing effort based on an idea we call the oracle gap, or the difference between source code coverage and mutation score for a given software element. We conduct (1) a large-scale observational study of the oracle gap across popular Maven projects, (2) a study that varies testing and oracle quality across several of those projects and (3) a small-scale observational study of highly critical, well-tested code across comparable blockchain projects. We show that the oracle gap surfaces important information about the extent and quality of a test effort beyond either adequacy metric alone. In particular, it provides a way for practitioners to identify source files where it is likely a weak oracle tests important code.
△ Less
Submitted 5 September, 2023;
originally announced September 2023.
-
Contextual Predictive Mutation Testing
Authors:
Kush Jain,
Uri Alon,
Alex Groce,
Claire Le Goues
Abstract:
Mutation testing is a powerful technique for assessing and improving test suite quality that artificially introduces bugs and checks whether the test suites catch them. However, it is also computationally expensive and thus does not scale to large systems and projects. One promising recent approach to tackling this scalability problem uses machine learning to predict whether the tests will detect…
▽ More
Mutation testing is a powerful technique for assessing and improving test suite quality that artificially introduces bugs and checks whether the test suites catch them. However, it is also computationally expensive and thus does not scale to large systems and projects. One promising recent approach to tackling this scalability problem uses machine learning to predict whether the tests will detect the synthetic bugs, without actually running those tests. However, existing predictive mutation testing approaches still misclassify 33% of detection outcomes on a randomly sampled set of mutant-test suite pairs. We introduce MutationBERT, an approach for predictive mutation testing that simultaneously encodes the source method mutation and test method, capturing key context in the input representation. Thanks to its higher precision, MutationBERT saves 33% of the time spent by a prior approach on checking/verifying live mutants. MutationBERT, also outperforms the state-of-the-art in both same project and cross project settings, with meaningful improvements in precision, recall, and F1 score. We validate our input representation, and aggregation approaches for lifting predictions from the test matrix level to the test suite level, finding similar improvements in performance. MutationBERT not only enhances the state-of-the-art in predictive mutation testing, but also presents practical benefits for real-world applications, both in saving developer time and finding hard to detect mutants.
△ Less
Submitted 5 September, 2023;
originally announced September 2023.
-
The Test of Tests: A Framework For Differentially Private Hypothesis Testing
Authors:
Zeki Kazan,
Kaiyan Shi,
Adam Groce,
Andrew Bray
Abstract:
We present a generic framework for creating differentially private versions of any hypothesis test in a black-box way. We analyze the resulting tests analytically and experimentally. Most crucially, we show good practical performance for small data sets, showing that at epsilon = 1 we only need 5-6 times as much data as in the fully public setting. We compare our work to the one existing framework…
▽ More
We present a generic framework for creating differentially private versions of any hypothesis test in a black-box way. We analyze the resulting tests analytically and experimentally. Most crucially, we show good practical performance for small data sets, showing that at epsilon = 1 we only need 5-6 times as much data as in the fully public setting. We compare our work to the one existing framework of this type, as well as to several individually-designed private hypothesis tests. Our framework is higher power than other generic solutions and at least competitive with (and often better than) individually-designed tests.
△ Less
Submitted 8 February, 2023;
originally announced February 2023.
-
Mutation Analysis: Answering the Fuzzing Challenge
Authors:
Rahul Gopinath,
Philipp Görz,
Alex Groce
Abstract:
Fuzzing is one of the fastest growing fields in software testing. The idea behind fuzzing is to check the behavior of software against a large number of randomly generated inputs, trying to cover all interesting parts of the input space, while observing the tested software for anomalous behaviour. One of the biggest challenges facing fuzzer users is how to validate software behavior, and how to im…
▽ More
Fuzzing is one of the fastest growing fields in software testing. The idea behind fuzzing is to check the behavior of software against a large number of randomly generated inputs, trying to cover all interesting parts of the input space, while observing the tested software for anomalous behaviour. One of the biggest challenges facing fuzzer users is how to validate software behavior, and how to improve the quality of oracles used. While mutation analysis is the premier technique for evaluating the quality of software test oracles, mutation score is rarely used as a metric for evaluating fuzzer quality. Unless mutation analysis researchers can solve multiple problems that make applying mutation analysis to fuzzing challenging, mutation analysis may be permanently sidelined in one of the most important areas of testing and security research. This paper attempts to understand the main challenges in applying mutation analysis for evaluating fuzzers, so that researchers can focus on solving these challenges.
△ Less
Submitted 12 February, 2022; v1 submitted 26 January, 2022;
originally announced January 2022.
-
Using Relative Lines of Code to Guide Automated Test Generation for Python
Authors:
Josie Holmes,
Iftekhar Ahmed,
Caius Brindescu,
Rahul Gopinath,
He Zhang,
Alex Groce
Abstract:
Raw lines of code (LOC) is a metric that does not, at first glance, seem extremely useful for automated test generation. It is both highly language-dependent and not extremely meaningful, semantically, within a language: one coder can produce the same effect with many fewer lines than another. However, relative LOC, between components of the same project, turns out to be a highly useful metric for…
▽ More
Raw lines of code (LOC) is a metric that does not, at first glance, seem extremely useful for automated test generation. It is both highly language-dependent and not extremely meaningful, semantically, within a language: one coder can produce the same effect with many fewer lines than another. However, relative LOC, between components of the same project, turns out to be a highly useful metric for automated testing. In this paper, we make use of a heuristic based on LOC counts for tested functions to dramatically improve the effectiveness of automated test generation. This approach is particularly valuable in languages where collecting code coverage data to guide testing has a very high overhead.We apply the heuristic to property-based Python testing using the TSTL (Template Scripting Testing Language) tool. In our experiments, the simple LOC heuristic can improve branch and statement coverage by large margins (often more than 20%, up to 40% or more), and improve fault detection by an even larger margin (usually more than 75%, and up to 400% or more). The LOC heuristic is also easy to combine with other approaches, and is comparable to, and possibly more effective than, two well-established approaches for guiding random testing.
△ Less
Submitted 11 March, 2021;
originally announced March 2021.
-
Differentially Private Confidence Intervals
Authors:
Wenxin Du,
Canyon Foot,
Monica Moniot,
Andrew Bray,
Adam Groce
Abstract:
Confidence intervals for the population mean of normally distributed data are some of the most standard statistical outputs one might want from a database. In this work we give practical differentially private algorithms for this task. We provide five algorithms and then compare them to each other and to prior work. We give concrete, experimental analysis of their accuracy and find that our algori…
▽ More
Confidence intervals for the population mean of normally distributed data are some of the most standard statistical outputs one might want from a database. In this work we give practical differentially private algorithms for this task. We provide five algorithms and then compare them to each other and to prior work. We give concrete, experimental analysis of their accuracy and find that our algorithms provide much more accurate confidence intervals than prior work. For example, in one setting (with ε = 0.1 and n = 2782) our algorithm yields an interval that is only 1/15th the size of the standard set by prior work.
△ Less
Submitted 7 January, 2020;
originally announced January 2020.
-
What are the Actual Flaws in Important Smart Contracts (and How Can We Find Them)?
Authors:
Alex Groce,
Josselin Feist,
Gustavo Grieco,
Michael Colburn
Abstract:
An important problem in smart contract security is understanding the likelihood and criticality of discovered, or potential, weaknesses in contracts. In this paper we provide a summary of Ethereum smart contract audits performed for 23 professional stakeholders, avoiding the common problem of reporting issues mostly prevalent in low-quality contracts. These audits were performed at a leading compa…
▽ More
An important problem in smart contract security is understanding the likelihood and criticality of discovered, or potential, weaknesses in contracts. In this paper we provide a summary of Ethereum smart contract audits performed for 23 professional stakeholders, avoiding the common problem of reporting issues mostly prevalent in low-quality contracts. These audits were performed at a leading company in blockchain security, using both open-source and proprietary tools, as well as human code analysis performed by professional security engineers. We categorize 246 individual defects, making it possible to compare the severity and frequency of different vulnerability types, compare smart contract and non-smart contract flaws, and to estimate the efficacy of automated vulnerability detection approaches.
△ Less
Submitted 10 January, 2020; v1 submitted 18 November, 2019;
originally announced November 2019.
-
Slither: A Static Analysis Framework For Smart Contracts
Authors:
Josselin Feist,
Gustavo Grieco,
Alex Groce
Abstract:
This paper describes Slither, a static analysis framework designed to provide rich information about Ethereum smart contracts. It works by converting Solidity smart contracts into an intermediate representation called SlithIR. SlithIR uses Static Single Assignment (SSA) form and a reduced instruction set to ease implementation of analyses while preserving semantic information that would be lost in…
▽ More
This paper describes Slither, a static analysis framework designed to provide rich information about Ethereum smart contracts. It works by converting Solidity smart contracts into an intermediate representation called SlithIR. SlithIR uses Static Single Assignment (SSA) form and a reduced instruction set to ease implementation of analyses while preserving semantic information that would be lost in transforming Solidity to bytecode. Slither allows for the application of commonly used program analysis techniques like dataflow and taint tracking. Our framework has four main use cases: (1) automated detection of vulnerabilities, (2) automated detection of code optimization opportunities, (3) improvement of the user's understanding of the contracts, and (4) assistance with code review.
In this paper, we present an overview of Slither, detail the design of its intermediate representation, and evaluate its capabilities on real-world contracts. We show that Slither's bug detection is fast, accurate, and outperforms other static analysis tools at finding issues in Ethereum smart contracts in terms of speed, robustness, and balance of detection and false positives. We compared tools using a large dataset of smart contracts and manually reviewed results for 1000 of the most used contracts.
△ Less
Submitted 26 August, 2019;
originally announced August 2019.
-
Manticore: A User-Friendly Symbolic Execution Framework for Binaries and Smart Contracts
Authors:
Mark Mossberg,
Felipe Manzano,
Eric Hennenfent,
Alex Groce,
Gustavo Grieco,
Josselin Feist,
Trent Brunson,
Artem Dinaburg
Abstract:
An effective way to maximize code coverage in software tests is through dynamic symbolic execution$-$a technique that uses constraint solving to systematically explore a program's state space. We introduce an open-source dynamic symbolic execution framework called Manticore for analyzing binaries and Ethereum smart contracts. Manticore's flexible architecture allows it to support both traditional…
▽ More
An effective way to maximize code coverage in software tests is through dynamic symbolic execution$-$a technique that uses constraint solving to systematically explore a program's state space. We introduce an open-source dynamic symbolic execution framework called Manticore for analyzing binaries and Ethereum smart contracts. Manticore's flexible architecture allows it to support both traditional and exotic execution environments, and its API allows users to customize their analysis. Here, we discuss Manticore's architecture and demonstrate the capabilities we have used to find bugs and verify the correctness of code for our commercial clients.
△ Less
Submitted 18 November, 2019; v1 submitted 8 July, 2019;
originally announced July 2019.
-
Oracle Separations Between Quantum and Non-interactive Zero-Knowledge Classes
Authors:
Benjamin Morrison,
Adam Groce
Abstract:
We study the relationship between problems solvable by quantum algorithms in polynomial time and those for which zero-knowledge proofs exist. In prior work, Aaronson [arxiv:quant-ph/0111102] showed an oracle separation between BQP and SZK, i.e. an oracle $A$ such that $\mathrm{SZK}^A \not\subseteq \mathrm{BQP}^A$. In this paper we give a simple extension of Aaronson's result to non-interactive zer…
▽ More
We study the relationship between problems solvable by quantum algorithms in polynomial time and those for which zero-knowledge proofs exist. In prior work, Aaronson [arxiv:quant-ph/0111102] showed an oracle separation between BQP and SZK, i.e. an oracle $A$ such that $\mathrm{SZK}^A \not\subseteq \mathrm{BQP}^A$. In this paper we give a simple extension of Aaronson's result to non-interactive zero-knowledge proofs with perfect security. This class, NIPZK, is the most restrictive zero-knowledge class. We show that even for this class we can construct an $A$ with $\mathrm{NIPZK}^A \not\subseteq \mathrm{BQP}^A$.
△ Less
Submitted 6 July, 2019;
originally announced July 2019.
-
Differentially Private Nonparametric Hypothesis Testing
Authors:
Simon Couch,
Zeki Kazan,
Kaiyan Shi,
Andrew Bray,
Adam Groce
Abstract:
Hypothesis tests are a crucial statistical tool for data mining and are the workhorse of scientific research in many fields. Here we study differentially private tests of independence between a categorical and a continuous variable. We take as our starting point traditional nonparametric tests, which require no distributional assumption (e.g., normality) about the data distribution. We present pri…
▽ More
Hypothesis tests are a crucial statistical tool for data mining and are the workhorse of scientific research in many fields. Here we study differentially private tests of independence between a categorical and a continuous variable. We take as our starting point traditional nonparametric tests, which require no distributional assumption (e.g., normality) about the data distribution. We present private analogues of the Kruskal-Wallis, Mann-Whitney, and Wilcoxon signed-rank tests, as well as the parametric one-sample t-test. These tests use novel test statistics developed specifically for the private setting. We compare our tests to prior work, both on parametric and nonparametric tests. We find that in all cases our new nonparametric tests achieve large improvements in statistical power, even when the assumptions of parametric tests are met.
△ Less
Submitted 22 March, 2019;
originally announced March 2019.
-
Improved Differentially Private Analysis of Variance
Authors:
Marika Swanberg,
Ira Globus-Harris,
Iris Griffith,
Anna Ritz,
Adam Groce,
Andrew Bray
Abstract:
Hypothesis testing is one of the most common types of data analysis and forms the backbone of scientific research in many disciplines. Analysis of variance (ANOVA) in particular is used to detect dependence between a categorical and a numerical variable. Here we show how one can carry out this hypothesis test under the restrictions of differential privacy. We show that the $F$-statistic, the optim…
▽ More
Hypothesis testing is one of the most common types of data analysis and forms the backbone of scientific research in many disciplines. Analysis of variance (ANOVA) in particular is used to detect dependence between a categorical and a numerical variable. Here we show how one can carry out this hypothesis test under the restrictions of differential privacy. We show that the $F$-statistic, the optimal test statistic in the public setting, is no longer optimal in the private setting, and we develop a new test statistic $F_1$ with much higher statistical power. We show how to rigorously compute a reference distribution for the $F_1$ statistic and give an algorithm that outputs accurate $p$-values. We implement our test and experimentally optimize several parameters. We then compare our test to the only previous work on private ANOVA testing, using the same effect size as that work. We see an order of magnitude improvement, with our test requiring only 7% as much data to detect the effect.
△ Less
Submitted 1 March, 2019;
originally announced March 2019.
-
A Differentially Private Wilcoxon Signed-Rank Test
Authors:
Simon Couch,
Zeki Kazan,
Kaiyan Shi,
Andrew Bray,
Adam Groce
Abstract:
Hypothesis tests are a crucial statistical tool for data mining and are the workhorse of scientific research in many fields. Here we present a differentially private analogue of the classic Wilcoxon signed-rank hypothesis test, which is used when comparing sets of paired (e.g., before-and-after) data values. We present not only a private estimate of the test statistic, but a method to accurately c…
▽ More
Hypothesis tests are a crucial statistical tool for data mining and are the workhorse of scientific research in many fields. Here we present a differentially private analogue of the classic Wilcoxon signed-rank hypothesis test, which is used when comparing sets of paired (e.g., before-and-after) data values. We present not only a private estimate of the test statistic, but a method to accurately compute a p-value and assess statistical significance. We evaluate our test on both simulated and real data. Compared to the only existing private test for this situation, that of Task and Clifton, we find that our test requires less than half as much data to achieve the same statistical power.
△ Less
Submitted 5 September, 2018;
originally announced September 2018.
-
Provenance and Pseudo-Provenance for Seeded Learning-Based Automated Test Generation
Authors:
Alex Groce,
Josie Holmes
Abstract:
Many methods for automated software test generation, including some that explicitly use machine learning (and some that use ML more broadly conceived) derive new tests from existing tests (often referred to as seeds). Often, the seed tests from which new tests are derived are manually constructed, or at least simpler than the tests that are produced as the final outputs of such test generators. We…
▽ More
Many methods for automated software test generation, including some that explicitly use machine learning (and some that use ML more broadly conceived) derive new tests from existing tests (often referred to as seeds). Often, the seed tests from which new tests are derived are manually constructed, or at least simpler than the tests that are produced as the final outputs of such test generators. We propose annotation of generated tests with a provenance (trail) showing how individual generated tests of interest (especially failing tests) derive from seed tests, and how the population of generated tests relates to the original seed tests. In some cases, post-processing of generated tests can invalidate provenance information, in which case we also propose a method for attempting to construct "pseudo-provenance" describing how the tests could have been (partly) generated from seeds.
△ Less
Submitted 15 November, 2017; v1 submitted 5 November, 2017;
originally announced November 2017.
-
Differentially Private ANOVA Testing
Authors:
Zachary Campbell,
Andrew Bray,
Anna Ritz,
Adam Groce
Abstract:
Modern society generates an incredible amount of data about individuals, and releasing summary statistics about this data in a manner that provably protects individual privacy would offer a valuable resource for researchers in many fields. We present the first algorithm for analysis of variance (ANOVA) that preserves differential privacy, allowing this important statistical test to be conducted (a…
▽ More
Modern society generates an incredible amount of data about individuals, and releasing summary statistics about this data in a manner that provably protects individual privacy would offer a valuable resource for researchers in many fields. We present the first algorithm for analysis of variance (ANOVA) that preserves differential privacy, allowing this important statistical test to be conducted (and the results released) on databases of sensitive information. In addition to our private algorithm for the F test statistic, we show a rigorous way to compute p-values that accounts for the added noise needed to preserve privacy. Finally, we present experimental results quantifying the statistical power of this differentially private version of the test, finding that a sample of several thousand observations is frequently enough to detect variation between groups. The differentially private ANOVA algorithm is a promising approach for releasing a common test statistic that is valuable in fields in the sciences and social sciences.
△ Less
Submitted 20 February, 2018; v1 submitted 3 November, 2017;
originally announced November 2017.
-
Proceedings 2nd International Workshop on Causal Reasoning for Embedded and safety-critical Systems Technologies
Authors:
Alex Groce,
Stefan Leue
Abstract:
The second international CREST workshop continued the focus of the first CREST workshop: addressing approaches to causal reasoning in engineering complex embedded and safety-critical systems. Relevant approaches to causal reasoning have been (usually independently) proposed by a variety of communities: AI, concurrency, model-based diagnosis, software engineering, security engineering, and forma…
▽ More
The second international CREST workshop continued the focus of the first CREST workshop: addressing approaches to causal reasoning in engineering complex embedded and safety-critical systems. Relevant approaches to causal reasoning have been (usually independently) proposed by a variety of communities: AI, concurrency, model-based diagnosis, software engineering, security engineering, and formal methods. The goal of CREST is to bring together researchers and practitioners from these communities to exchange ideas, especially between communities, in order to advance the science of determining root cause(s) for failures of critical systems. The growing complexity of failures such as power grid blackouts, airplane crashes, security and privacy violations, and malfunctioning medical devices or automotive systems makes the goals of CREST more relevant than ever before.
△ Less
Submitted 7 October, 2017;
originally announced October 2017.
-
Data Poisoning: Lightweight Soft Fault Injection for Python
Authors:
Mohammad Amin Alipour,
Alex Groce
Abstract:
This paper introduces and explores the idea of data poisoning, a light-weight peer-architecture technique to inject faults into Python programs. This method requires very small modification to the original program, which facilitates evaluation of sensitivity of systems that are prototyped or modeled in Python. We propose different fault scenarios that can be injected to programs using data poisoni…
▽ More
This paper introduces and explores the idea of data poisoning, a light-weight peer-architecture technique to inject faults into Python programs. This method requires very small modification to the original program, which facilitates evaluation of sensitivity of systems that are prototyped or modeled in Python. We propose different fault scenarios that can be injected to programs using data poisoning. We use Dijkstra's Self Stabilizing Ring Algorithm to illustrate the approach.
△ Less
Submitted 4 November, 2016;
originally announced November 2016.
-
Bounded Model Checking and Feature Omission Diversity
Authors:
Mohammad Amin Alipour,
Alex Groce
Abstract:
In this paper we introduce a novel way to speed up the discovery of counterexamples in bounded model checking, based on parallel runs over versions of a system in which features have been randomly disabled. As shown in previous work, adding constraints to a bounded model checking problem can reduce the size of the verification problem and dramatically decrease the time required to find counterexam…
▽ More
In this paper we introduce a novel way to speed up the discovery of counterexamples in bounded model checking, based on parallel runs over versions of a system in which features have been randomly disabled. As shown in previous work, adding constraints to a bounded model checking problem can reduce the size of the verification problem and dramatically decrease the time required to find counterexample. Adapting a technique developed in software testing to this problem provides a simple way to produce useful partial verification problems, with a resulting decrease in average time until a counterexample is produced. If no counterexample is found, partial verification results can also be useful in practice.
△ Less
Submitted 20 September, 2016;
originally announced October 2016.
-
Finding Model-Checkable Needles in Large Source Code Haystacks: Modular Bug-Finding via Static Analysis and Dynamic Invariant Discovery
Authors:
Mohammad Amin Alipour,
Alex Groce,
Chaoqiang Zhang,
Anahita Sanadaji,
Gokul Caushik
Abstract:
In this paper, we present a novel marriage of static and dynamic analysis. Given a large code base with many functions and a mature test suite, we propose using static analysis to find functions 1) with assertions or other evident correctness properties (e.g., array bounds requirements or pointer access) and 2) with simple enough control flow and data use to be amenable to predicate-abstraction ba…
▽ More
In this paper, we present a novel marriage of static and dynamic analysis. Given a large code base with many functions and a mature test suite, we propose using static analysis to find functions 1) with assertions or other evident correctness properties (e.g., array bounds requirements or pointer access) and 2) with simple enough control flow and data use to be amenable to predicate-abstraction based or bounded model checking without human intervention. Because most such functions in realistic software systems in fact rely on many input preconditions not specified by the language's type system (or annotated in any way), we propose using dynamically discovered invariants based on a program's test suite to characterize likely preconditions, in order to reduce the problem of false positives. While providing little in the way of verification, this approach may provide an additional quick and highly scalable bug-finding method for programs that are usually considered "too large to model check." We present a simple example showing that the technique can be useful for a more typically "model-checkable" code base, even in the presence of a poorly designed test suite and bad invariants.
△ Less
Submitted 20 September, 2016;
originally announced September 2016.
-
An Entry Point for Formal Methods: Specification and Analysis of Event Logs
Authors:
Howard Barringer,
Alex Groce,
Klaus Havelund,
Margaret Smith
Abstract:
Formal specification languages have long languished, due to the grave scalability problems faced by complete verification methods. Runtime verification promises to use formal specifications to automate part of the more scalable art of testing, but has not been widely applied to real systems, and often falters due to the cost and complexity of instrumentation for online monitoring. In this paper…
▽ More
Formal specification languages have long languished, due to the grave scalability problems faced by complete verification methods. Runtime verification promises to use formal specifications to automate part of the more scalable art of testing, but has not been widely applied to real systems, and often falters due to the cost and complexity of instrumentation for online monitoring. In this paper we discuss work in progress to apply an event-based specification system to the logging mechanism of the Mars Science Laboratory mission at JPL. By focusing on log analysis, we exploit the "instrumentation" already implemented and required for communicating with the spacecraft. We argue that this work both shows a practical method for using formal specifications in testing and opens interesting research avenues, including a challenging specification learning problem.
△ Less
Submitted 8 March, 2010;
originally announced March 2010.