How Much is Unseen Depends Chiefly on Information About the Seen

Abstract: It might seem counter-intuitive at first: We find that, in expectation, the proportion of data points in an unknown population that belong to classes that do not appear in the training data is almost entirely determined by the number $f_k$ of classes that do appear in the training data the same number of times. While in theory we show that the difference of the induced estimator decays exponentially in the size of the sample, in practice the high variance prevents us from using it directly for an estimator of the sample coverage. However, our precise characterization of the dependency between $f_k$'s induces a large search space of different representations of the expected value, which can be deterministically instantiated as estimators. Hence, we turn to optimization and develop a genetic algorithm that, given only the sample, searches for an estimator with minimal mean-squared error (MSE). In our experiments, our genetic algorithm discovers estimators that have a substantially smaller MSE than the state-of-the-art Good-Turing estimator. This holds for over 96% of runs when there are at least as many samples as classes. Our estimators' MSE is roughly 80% of the Good-Turing estimator's.

Submitted 8 February, 2024; originally announced February 2024.

Comments: 8 pages with 5 pages of appendix, 5 figures, 3 tables
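
For orientation, the Good-Turing baseline named in the abstract above can be sketched in a few lines. This is the classic textbook estimate of the unseen proportion, $f_1/n$, where $f_1$ is the number of classes observed exactly once and $n$ is the sample size; it is not the paper's genetic algorithm. The function name and the toy sample are illustrative assumptions.

```python
from collections import Counter


def good_turing_missing_mass(sample):
    """Classic Good-Turing estimate of the unseen proportion: f_1 / n.

    f_1 is the number of classes seen exactly once in the sample,
    n is the sample size. (Hypothetical helper, not from the paper.)
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must not be empty")
    # How often each class occurs in the sample.
    class_counts = Counter(sample)
    # f_k: how many classes occur exactly k times.
    freq_of_freq = Counter(class_counts.values())
    return freq_of_freq[1] / n


if __name__ == "__main__":
    # Classes "b" and "d" are singletons, so the estimate is 2 / 7.
    print(good_turing_missing_mass(list("aabcccd")))
```

The estimated sample coverage is one minus this value.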

arXiv:2402.01944 [pdf, other]

Guarantees in Security: A Philosophical Perspective

Authors: Marcel Böhme

Abstract: Research in cybersecurity may seem reactive, specific, ephemeral, and indeed ineffective. Despite decades of innovation in defense, even the most critical software systems turn out to be vulnerable to attacks. Time and again. Offense and defense forever on repeat. Even provable security, meant to provide an indubitable guarantee of security, does not stop attackers from finding security flaws. As we reflect on our achievements, we are left wondering: Can security be solved once and for all? In this paper, we take a philosophical perspective and develop the first theory of cybersecurity that explains what *fundamentally* prevents us from making reliable statements about the security of a software system. We substantiate each argument by demonstrating how the corresponding challenge is routinely exploited to attack a system despite credible assurances about the absence of security flaws. To make meaningful progress in the presence of these challenges, we introduce a philosophy of cybersecurity.

Submitted 26 June, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

Comments: 13 pages. Major rewrite. Feedback appreciated

arXiv:2402.00641 [pdf, other]

Testing side-channel security of cryptographic implementations against future microarchitectures

Authors: Gilles Barthe, Marcel Böhme, Sunjay Cauligi, Chitchanok Chuengsatiansup, Daniel Genkin, Marco Guarnieri, David Mateos Romero, Peter Schwabe, David Wu, Yuval Yarom

Abstract: How will future microarchitectures impact the security of existing cryptographic implementations? As we cannot keep reducing the size of transistors, chip vendors have started developing new microarchitectural optimizations to speed up computation. A recent study (Sanchez Vicarte et al., ISCA 2021) suggests that these optimizations might open the Pandora's box of microarchitectural attacks. However, there is little guidance on how to evaluate the security impact of future optimization proposals. To help chip vendors explore the impact of microarchitectural optimizations on cryptographic implementations, we develop (i) an expressive domain-specific language, called LmSpec, that allows them to specify the leakage model for the given optimization and (ii) a testing framework, called LmTest, to automatically detect leaks under the specified leakage model within the given implementation. Using this framework, we conduct an empirical study of 18 proposed microarchitectural optimizations on 25 implementations of eight cryptographic primitives in five popular libraries. We find that every implementation would contain secret-dependent leaks, sometimes sufficient to recover a victim's secret key, if these optimizations were realized. Ironically, some leaks are possible only because of coding idioms used to prevent leaks under the standard constant-time model.

Submitted 1 February, 2024; originally announced February 2024.

arXiv:2306.17193 [pdf, other]

Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection

Authors: Niklas Risse, Marcel Böhme

Abstract: Recent results of machine learning for automatic vulnerability detection (ML4VD) have been very promising. Given only the source code of a function $f$, ML4VD techniques can decide if $f$ contains a security flaw with up to 70% accuracy. However, as evident in our own experiments, the same top-performing models are unable to distinguish between functions that contain a vulnerability and functions where the vulnerability is patched. So, how can we explain this contradiction and how can we improve the way we evaluate ML4VD techniques to get a better picture of their actual capabilities? In this paper, we identify overfitting to unrelated features and out-of-distribution generalization as two problems, which are not captured by the traditional approach of evaluating ML4VD techniques. As a remedy, we propose a novel benchmarking methodology to help researchers better evaluate the true capabilities and limits of ML4VD techniques. Specifically, we propose (i) to augment the training and validation dataset according to our cross-validation algorithm, where a semantic preserving transformation is applied during the augmentation of either the training set or the testing set, and (ii) to augment the testing set with code snippets where the vulnerabilities are patched.
Using six ML4VD techniques and two datasets, we find (a) that state-of-the-art models severely overfit to unrelated features for predicting the vulnerabilities in the testing data, (b) that the performance gained by data augmentation does not generalize beyond the specific augmentations applied during training, and (c) that state-of-the-art ML4VD techniques are unable to distinguish vulnerable functions from their patches.

Submitted 6 June, 2024; v1 submitted 28 June, 2023; originally announced June 2023.

arXiv:2304.10070 [pdf, ps, other]

SBFT Tool Competition 2023 -- Fuzzing Track

Authors: Dongge Liu, Jonathan Metzman, Marcel Böhme, Oliver Chang, Abhishek Arya

Abstract: This report outlines the objectives, methodology, challenges, and results of the first Fuzzing Competition held at SBFT 2023. The competition utilized FuzzBench to assess the code-coverage performance and bug-finding efficacy of eight participating fuzzers over 23 hours. The competition was organized in three phases. In the first phase, participants were asked to integrate their fuzzers into FuzzBench and were allowed to privately run local experiments against the publicly available benchmarks. In the second phase, we publicly ran all submitted fuzzers on the publicly available benchmarks and allowed participants to fix any remaining bugs in their fuzzers. In the third phase, we publicly ran all submitted fuzzers plus three widely-used baseline fuzzers on a hidden set and the publicly available set of benchmark programs to establish the final results.

Submitted 15 May, 2023; v1 submitted 19 April, 2023; originally announced April 2023.

Comments: 4 pages, will be published in SBFT workshop of ICSE'23

arXiv:2304.10044 [pdf, other]

Finding Bug-Inducing Program Environments

Authors: Zahra Mirzamomen, Marcel Böhme

Abstract: Some bugs cannot be exposed by program inputs, but only by certain program environments. During execution, most programs access various resources, like databases, files, or devices, that are external to the program and thus part of the program's environment. In this paper, we present a coverage-guided, mutation-based environment synthesis approach of bug-inducing program environments. Specifically, we observe that programs interact with their environment via dedicated system calls and propose to intercept these system calls (i) to capture the resources accessed during the first execution of an input as initial program environment, and (ii) mutate copies of these resources during subsequent executions of that input to generate slightly changed program environments. Any generated environment that is observed to increase coverage is added to the corpus of environment seeds and becomes subject to further fuzzing. Bug-inducing program environments are reported to the user. Experiments demonstrate the effectiveness of our approach. We implemented a prototype called AFLChaos which found bugs in the resource-handling code of five (5) of the seven (7) open source projects in our benchmark set (incl. OpenSSL). Automatically, AFLChaos generated environments consisting of bug-inducing databases used for storing information, bug-inducing multimedia files used for streaming, bug-inducing cryptographic keys used for encryption, and bug-inducing configuration files used to configure the program.
To support open science, we publish the experimental infrastructure, our tool, and all data.

Submitted 19 April, 2023; originally announced April 2023.

arXiv:2212.09519 [pdf, other]

Explainable Fuzzer Evaluation

Authors: Dylan Wolff, Marcel Böhme, Abhik Roychoudhury

Abstract: While the aim of fuzzer evaluation is to establish fuzzer performance in general, an evaluation is always conducted on a specific benchmark. In this paper, we investigate the degree to which the benchmarking result depends on the properties of the benchmark and propose a methodology to quantify the impact of benchmark properties on the benchmarking result in relation to the impact of the choice of fuzzer. We found that the measured performance and ranking of a fuzzer depend substantially on properties of the programs and the seed corpora used during evaluation. For instance, if the benchmark contained larger programs or seed corpora with a higher initial coverage, AFL's ranking would improve while LibFuzzer's ranking would worsen. We describe our methodology as explainable fuzzer evaluation because it explains why the specific evaluation setup yields the observed superiority or ranking of the fuzzers and how it might change for different benchmarks. We envision that our analysis can be used to assess the degree to which evaluation results are overfitted to the benchmark and to identify the specific conditions under which different fuzzers perform better than others.

Submitted 19 December, 2022; originally announced December 2022.

arXiv:2205.14964 [pdf, other]

Effectiveness and Scalability of Fuzzing Techniques in CI/CD Pipelines

Authors: Thijs Klooster, Fatih Turkmen, Gerben Broenink, Ruben ten Hove, Marcel Böhme

Abstract: Fuzzing has proven to be a fundamental technique for automated software testing but also a costly one. With the increased adoption of CI/CD practices in software development, a natural question to ask is `What are the best ways to integrate fuzzing into CI/CD pipelines considering the velocity in code changes and the automated delivery/deployment practices?'. Indeed, a recent study by Böhme and Zhu shows that four in every five bugs have been introduced by recent code changes (i.e. regressions). In this paper, we take a close look at the integration of fuzzers into CI/CD pipelines from both automated software testing and continuous development angles. Firstly, we study an optimization opportunity to triage commits that do not require fuzzing and find, through experimental analysis, that the average fuzzing effort in CI/CD can be reduced by ~63% in three of the nine libraries we analyzed (>40% for six libraries). Secondly, we investigate the impact of fuzzing campaign duration on the CI/CD process: A shorter fuzzing campaign such as 15 minutes (as opposed to the wisdom of 24 hours in the field) facilitates a faster pipeline and can still uncover important bugs, but may also reduce its capability to detect sophisticated bugs. Lastly, we discuss a prioritization strategy that automatically assigns resources to fuzzing campaigns based on a set of predefined priority strategies.
Our findings suggest that continuous fuzzing (as part of the automated testing in CI/CD) is indeed beneficial and there are many optimization opportunities to improve the effectiveness and scalability of fuzz testing.

Submitted 7 June, 2022; v1 submitted 30 May, 2022; originally announced May 2022.

Comments: 12 pages, 5 figures

arXiv:2204.02545 [pdf, other]

Stateful Greybox Fuzzing

Authors: Jinsheng Ba, Marcel Böhme, Zahra Mirzamomen, Abhik Roychoudhury

Abstract: Many protocol implementations are reactive systems, where the protocol process is in continuous interaction with other processes and the environment. If a bug can be exposed only in a certain state, a fuzzer needs to provide a specific sequence of events as inputs that would take the protocol into this state before the bug is manifested. We call these "stateful" bugs. Usually, when we are testing a protocol implementation, we do not have a detailed formal specification of the protocol to rely upon. Without knowledge of the protocol, it is inherently difficult for a fuzzer to discover such stateful bugs. A key challenge then is to cover the state space without an explicit specification of the protocol. In this work, we posit that manual annotations for state identification can be avoided for stateful protocol fuzzing. Specifically, we rely on a programmatic intuition that the state variables used in protocol implementations often appear in enum type variables whose values (the state names) come from named constants. In our analysis of the Top-50 most widely used open-source protocol implementations, we found that every implementation uses state variables that are assigned named constants (with easy to comprehend names such as INIT, READY) to represent the current state. In this work, we propose to automatically identify such state variables and track the sequence of values assigned to them during fuzzing to produce a "map" of the explored state space.
Our experiments confirm that our stateful fuzzer discovers stateful bugs twice as fast as the baseline greybox fuzzer that we extended. Starting from the initial state, our fuzzer exercises one order of magnitude more state/transition sequences and covers code two times faster than the baseline fuzzer. Several zero-day bugs in prominent protocol implementations were found by our fuzzer, and 8 CVEs have been assigned.

Submitted 16 May, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

Journal ref: 31st USENIX Security Symposium (USENIX Security 2022)

arXiv:2110.02682 [pdf, other]

How good does a Defect Predictor need to be to guide Search-Based Software Testing?

Authors: Anjana Perera, Burak Turhan, Aldeida Aleti, Marcel Böhme

Abstract: Defect predictors, static bug detectors and humans inspecting the code can locate the parts of the program that are buggy before they are discovered through testing. Automated test generators such as search-based software testing (SBST) techniques can use this information to direct their search for test cases to likely buggy code, thus speeding up the process of detecting existing bugs. However, often the predictions given by these tools or humans are imprecise, which can misguide the SBST technique and may deteriorate its performance. In this paper, we study the impact of imprecision in defect prediction on the bug detection effectiveness of SBST. Our study finds that the recall of the defect predictor, i.e., the probability of correctly identifying buggy code, has a significant impact on bug detection effectiveness of SBST with a large effect size. On the other hand, the effect of precision, a measure for false alarms, is not of meaningful practical significance as indicated by a very small effect size. In particular, the SBST technique finds 7.5 fewer bugs on average (out of 420 bugs) for every 5% decrement in recall. In the context of combining defect prediction and SBST, our recommendation for practice is to increase the recall of defect predictors at the expense of precision, while maintaining a precision of at least 75%.
To account for the imprecision of defect predictors, in particular low recall values, SBST techniques should be designed to search for test cases that also cover the predicted non-buggy parts of the program, while prioritising the parts that have been predicted as buggy.

Submitted 6 October, 2021; originally announced October 2021.

Comments: 12 pages, 4 figures

ACM Class: D.2.5

arXiv:2109.12645 [pdf, other]

doi 10.1145/3324884.3416612

Defect Prediction Guided Search-Based Software Testing

Authors: Anjana Perera, Aldeida Aleti, Marcel Böhme, Burak Turhan

Abstract: Today, most automated test generators, such as search-based software testing (SBST) techniques, focus on achieving high code coverage. However, high code coverage is not sufficient to maximise the number of bugs found, especially when given a limited testing budget. In this paper, we propose an automated test generation technique that is also guided by the estimated degree of defectiveness of the source code. Parts of the code that are likely to be more defective receive more testing budget than the less defective parts. To measure the degree of defectiveness, we leverage Schwa, a notable defect prediction technique. We implement our approach into EvoSuite, a state-of-the-art SBST tool for Java. Our experiments on the Defects4J benchmark demonstrate the improved efficiency of defect prediction guided test generation and confirm our hypothesis that spending more time budget on likely defective parts increases the number of bugs found in the same time budget.

Submitted 26 September, 2021; originally announced September 2021.

Comments: 13 pages, 8 figures

ACM Class: D.2.5

Journal ref: In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE '20), 2020

arXiv:2101.03008 [pdf, other]

Locating Faults with Program Slicing: An Empirical Analysis

Authors: Ezekiel Soremekun, Lukas Kirschner, Marcel Böhme, Andreas Zeller

Abstract: Statistical fault localization is an easily deployed technique for quickly determining candidates for faulty code locations. If a human programmer has to search for the fault beyond the top candidate locations, though, more traditional techniques of following dependencies along dynamic slices may be better suited. In a large study of 457 bugs (369 single faults and 88 multiple faults) in 46 open source C programs, we compare the effectiveness of statistical fault localization against dynamic slicing. For single faults, we find that dynamic slicing was eight percentage points more effective than the best performing statistical debugging formula; for 66% of the bugs, dynamic slicing finds the fault earlier than the best performing statistical debugging formula. In our evaluation, dynamic slicing is more effective for programs with a single fault, but statistical debugging performs better on multiple faults. Best results, however, are obtained by a hybrid approach: If programmers first examine at most the top five most suspicious locations from statistical debugging, and then switch to dynamic slices, on average, they will need to examine 15% (30 lines) of the code. These findings hold for the 18 most effective statistical debugging formulas and our results are independent of the number of faults (i.e. single or multiple faults) and error type (i.e. artificial or real errors).

Submitted 8 January, 2021; originally announced January 2021.

arXiv:2009.03730 [pdf, other]

Large-scale Neural Solvers for Partial Differential Equations

Authors: Patrick Stiller, Friedrich Bethke, Maximilian Böhme, Richard Pausch, Sunna Torge, Alexander Debus, Jan Vorberger, Michael Bussmann, Nico Hoffmann

Abstract: Solving partial differential equations (PDE) is an indispensable part of many branches of science as many processes can be modelled in terms of PDEs. However, recent numerical solvers require manual discretization of the underlying equation as well as sophisticated, tailored code for distributed computing. Scanning the parameters of the underlying model significantly increases the runtime as the simulations have to be cold-started for each parameter configuration. Machine Learning based surrogate models denote promising ways for learning complex relationships among input, parameter and solution. However, recent generative neural networks require lots of training data, i.e. full simulation runs, making them costly. In contrast, we examine the applicability of continuous, mesh-free neural solvers for partial differential equations, physics-informed neural networks (PINNs) solely requiring initial/boundary values and validation points for training but no simulation data. The induced curse of dimensionality is approached by learning a domain decomposition that steers the number of neurons per unit volume and significantly improves runtime. Distributed training on large-scale cluster systems also promises great utilization of large quantities of GPUs which we assess by a comprehensive evaluation study. Finally, we discuss the accuracy of GatedPINN with respect to analytical solutions -- as well as state-of-the-art numerical solvers, such as spectral solvers.

Submitted 8 September, 2020; originally announced September 2020.

arXiv:1912.07758 [pdf, other]

Human-In-The-Loop Automatic Program Repair

Authors: Marcel Böhme, Charaka Geethal, Van-Thuan Pham

Abstract: We introduce Learn2fix, the first human-in-the-loop, semi-automatic repair technique when no bug oracle--except for the user who is reporting the bug--is available. Our approach negotiates with the user the condition under which the bug is observed. Only when a budget of queries to the user is exhausted, it attempts to repair the bug. A query can be thought of as the following question: "When executing this alternative test input, the program produces the following output; is the bug observed?" Through systematic queries, Learn2fix trains an automatic bug oracle that becomes increasingly more accurate in predicting the user's response. Our key challenge is to maximize the oracle's accuracy in predicting which tests are bug-exposing given a small budget of queries. From the alternative tests that were labeled by the user, test-driven automatic repair produces the patch. Our experiments demonstrate that Learn2fix learns a sufficiently accurate automatic oracle with a reasonably low labeling effort (fewer than 20 queries). Given Learn2fix's test suite, the GenProg test-driven repair tool produces a higher-quality patch (i.e., passing a larger proportion of validation tests) than using manual test suites provided with the repair benchmark.

Submitted 16 December, 2019; originally announced December 2019.

Comments: Accepted as full paper (10+2 pages) at ICST'20 (https://icst2020.info/) *** Tool and Replication Package at: https://github.com/mboehme/learn2fix

arXiv:1911.04687 [pdf, other]

MCPA: Program Analysis as Machine Learning

Authors: Marcel Böhme

Abstract: Static program analysis today takes an analytical approach which is quite suitable for a well-scoped system. Data- and control-flow is taken into account. Special cases such as pointers, procedures, and undefined behavior must be handled. A program is analyzed precisely on the statement level. However, the analytical approach is ill-equipped to handle implementations of complex, large-scale, heterogeneous software systems we see in the real world. Existing static analysis techniques that scale trade correctness (i.e., soundness or completeness) for scalability and build on strong assumptions (e.g., language-specificity). Scalable static analyses are well known to report errors that do *not* exist (false positives) or fail to report errors that *do* exist (false negatives). Then, how do we know the degree to which the analysis outcome is correct? In this paper, we propose an approach to scale-oblivious greybox program analysis with bounded error which applies efficient approximation schemes (FPRAS) from the foundations of machine learning: PAC learnability. Given two parameters $δ$ and $ε$, with probability at least $(1-δ)$, our Monte Carlo Program Analysis (MCPA) approach produces an outcome that has an average error at most $ε$. The parameters $δ>0$ and $ε>0$ can be chosen arbitrarily close to zero (0) such that the program analysis outcome is said to be probably-approximately correct (PAC).
We demonstrate the pertinent concepts of MCPA using three applications: $(ε,δ)$-approximate quantitative analysis, $(ε,δ)$-approximate software verification, and $(ε,δ)$-approximate patch verification.

Submitted 12 November, 2019; originally announced November 2019.

Comments: 10+2 pages. Feedback and (industry/research) collaborations welcome
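
To make the $(ε,δ)$ guarantee in the abstract above concrete, here is a generic Monte Carlo sketch. It is not the MCPA algorithm: it swaps in the standard two-sided Hoeffding bound, under which $n \ge \ln(2/δ)/(2ε^2)$ independent runs suffice to estimate the probability of a binary program property within $ε$ with probability at least $1-δ$. The helper names and the `run_once` callback are hypothetical.

```python
import math
import random


def hoeffding_sample_size(epsilon, delta):
    """Number of i.i.d. runs so that the empirical frequency of a binary
    event is within epsilon of its true probability with probability at
    least 1 - delta (two-sided Hoeffding bound; an assumption here, not
    necessarily the bound used by MCPA)."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def estimate_probability(run_once, epsilon, delta, rng):
    """Estimate P(run_once(rng) is True) from the required number of runs.

    run_once is a hypothetical callback that executes the program on one
    randomly drawn input and reports whether the property was observed.
    Returns the estimate and the number of runs used.
    """
    n = hoeffding_sample_size(epsilon, delta)
    hits = sum(1 for _ in range(n) if run_once(rng))
    return hits / n, n


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy "program property" that holds on about 30% of random inputs.
    estimate, runs = estimate_probability(
        lambda r: r.random() < 0.3, epsilon=0.05, delta=0.01, rng=rng
    )
    print(runs, estimate)
```

Note that the required number of runs depends only on $ε$ and $δ$, not on the size of the analyzed program.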

arXiv:1811.09447 [pdf, other]

Smart Greybox Fuzzing

Authors: Van-Thuan Pham, Marcel Böhme, Andrew E. Santosa, Alexandru Răzvan Căciulescu, Abhik Roychoudhury

Abstract: Coverage-based greybox fuzzing (CGF) is one of the most successful methods for automated vulnerability detection. Given a seed file (as a sequence of bits), CGF randomly flips, deletes or bits to generate new files. CGF iteratively constructs (and fuzzes) a seed corpus by retaining those generated files which enhance coverage. However, random bitflips are unlikely to produce valid files (or valid chunks in files), for applications processing complex file formats. In this work, we introduce smart greybox fuzzing (SGF) which leverages a high-level structural representation of the seed file to generate new files. We define innovative mutation operators that work on the virtual file structure rather than on the bit level which allows SGF to explore completely new input domains while maintaining file validity. We introduce a novel validity-based power schedule that enables SGF to spend more time generating files that are more likely to pass the parsing stage of the program, which can expose vulnerabilities much deeper in the processing logic. Our evaluation demonstrates the effectiveness of SGF. On several libraries that parse structurally complex files, our tool AFLSmart explores substantially more paths (up to 200%) and exposes more vulnerabilities than baseline AFL. Our tool AFLSmart has discovered 42 zero-day vulnerabilities in widely-used, well-tested tools and libraries; so far 17 CVEs were assigned.

Submitted 23 November, 2018; originally announced November 2018.

Comments: Accepted IEEE Transactions on Software Engineering, 2020

arXiv:1807.10255 [pdf, ps, other]

Assurances in Software Testing: A Roadmap

Authors: Marcel Böhme

Abstract: As researchers, we already understand how to make testing more effective and efficient at finding bugs. However, as fuzzing (i.e., automated testing) becomes more widely adopted in practice, practitioners are asking: Which assurances does a fuzzing campaign provide that exposes no bugs? When is it safe to stop the fuzzer with a reasonable residual risk? How much longer should the fuzzer be run to achieve sufficient coverage? It is time for us to move beyond the innovation of increasingly sophisticated testing techniques, to build a body of knowledge around the explication and quantification of the testing process, and to develop sound methodologies to estimate and extrapolate these quantities with measurable accuracy. In our vision of the future, practitioners leverage a rich statistical toolset to assess residual risk, to obtain statistical guarantees, and to analyze the cost-benefit trade-off for ongoing fuzzing campaigns. We propose a general framework as a first starting point to tackle this fundamental challenge and discuss a large number of concrete opportunities for future research.

Submitted 17 December, 2018; v1 submitted 26 July, 2018; originally announced July 2018.

Comments: Accepted at ICSE'19 NIER. Extended version. 5 pages + references

arXiv:1803.02130 [pdf, other]

STADS: Software Testing as Species Discovery

Authors: Marcel Böhme

Abstract: A fundamental challenge of software testing is the statistically well-grounded extrapolation from program behaviors observed during testing. For instance, a security researcher who has run the fuzzer for a week currently has no means (i) to estimate the total number of feasible program branches, given that only a fraction has been covered so far, (ii) to estimate the additional time required to cover 10% more branches, or (iii) to assess the residual risk that a vulnerability exists when no vulnerability has been discovered. Failing to discover a vulnerability does not mean that none exists, even if the fuzzer was run for a week (or a year). Hence, testing provides no formal correctness guarantees. In this article, I establish an unexpected connection with the otherwise unrelated scientific field of ecology, and introduce a statistical framework that models Software Testing and Analysis as Discovery of Species (STADS). For instance, in order to study the species diversity of arthropods in a tropical rain forest, ecologists would first sample a large number of individuals from that forest, determine their species, and extrapolate from the properties observed in the sample to properties of the whole forest. The estimation (i) of the total number of species, (ii) of the additional sampling effort required to discover 10% more species, or (iii) of the probability of discovering a new species are classical problems in ecology.
The STADS framework draws from over three decades of research in ecological biostatistics to address the fundamental extrapolation challenge for automated test generation. Our preliminary empirical study demonstrates good estimator performance even for a fuzzer with adaptive sampling bias, namely AFL, a state-of-the-art vulnerability detection tool. The STADS framework provides statistical correctness guarantees with quantifiable accuracy.

Submitted 3 April, 2018; v1 submitted 6 March, 2018; originally announced March 2018.

Comments: To appear with minor revisions in ACM Transactions on Software Engineering and Methodology (TOSEM); 52 pages; journal-first

Showing 1–18 of 18 results for author: Böhme, M