-
Measuring Feature Dependency of Neural Networks by Collapsing Feature Dimensions in the Data Manifold
Authors:
Yinzhu **,
Matthew B. Dwyer,
P. Thomas Fletcher
Abstract:
This paper introduces a new technique to measure the feature dependency of neural network models. The motivation is to better understand a model by querying whether it is using information from human-understandable features, e.g., anatomical shape, volume, or image texture. Our method is based on the principle that if a model is dependent on a feature, then removal of that feature should significa…
▽ More
This paper introduces a new technique to measure the feature dependency of neural network models. The motivation is to better understand a model by querying whether it is using information from human-understandable features, e.g., anatomical shape, volume, or image texture. Our method is based on the principle that if a model is dependent on a feature, then removal of that feature should significantly harm its performance. A targeted feature is "removed" by collapsing the dimension in the data distribution that corresponds to that feature. We perform this by moving data points along the feature dimension to a baseline feature value while staying on the data manifold, as estimated by a deep generative model. Then we observe how the model's performance changes on the modified test data set, with the target feature dimension removed. We test our method on deep neural network models trained on synthetic image data with known ground truth, an Alzheimer's disease prediction task using MRI and hippocampus segmentations from the OASIS-3 dataset, and a cell nuclei classification task using the Lizard dataset.
△ Less
Submitted 18 April, 2024;
originally announced April 2024.
-
Harnessing Neuron Stability to Improve DNN Verification
Authors:
Hai Duong,
Dong Xu,
ThanhVu Nguyen,
Matthew B. Dwyer
Abstract:
Deep Neural Networks (DNN) have emerged as an effective approach to tackling real-world problems. However, like human-written software, DNNs are susceptible to bugs and attacks. This has generated significant interests in develo** effective and scalable DNN verification techniques and tools. In this paper, we present VeriStable, a novel extension of recently proposed DPLL-based constraint DNN ve…
▽ More
Deep Neural Networks (DNN) have emerged as an effective approach to tackling real-world problems. However, like human-written software, DNNs are susceptible to bugs and attacks. This has generated significant interests in develo** effective and scalable DNN verification techniques and tools. In this paper, we present VeriStable, a novel extension of recently proposed DPLL-based constraint DNN verification approach. VeriStable leverages the insight that while neuron behavior may be non-linear across the entire DNN input space, at intermediate states computed during verification many neurons may be constrained to have linear behavior - these neurons are stable. Efficiently detecting stable neurons reduces combinatorial complexity without compromising the precision of abstractions. Moreover, the structure of clauses arising in DNN verification problems shares important characteristics with industrial SAT benchmarks. We adapt and incorporate multi-threading and restart optimizations targeting those characteristics to further optimize DPLL-based DNN verification. We evaluate the effectiveness of VeriStable across a range of challenging benchmarks including fully-connected feedforward networks (FNNs), convolutional neural networks (CNNs) and residual networks (ResNets) applied to the standard MNIST and CIFAR datasets. Preliminary results show that VeriStable is competitive and outperforms state-of-the-art DNN verification tools, including $α$-$β$-CROWN and MN-BaB, the first and second performers of the VNN-COMP, respectively.
△ Less
Submitted 19 January, 2024;
originally announced January 2024.
-
Artifact: Measuring and Mitigating Gaps in Structural Testing
Authors:
Soneya Binta Hossain,
Matthew B. Dwyer,
Sebastian Elbaum,
Anh Nguyen-Tuong
Abstract:
The artifact used for evaluating the experimental results of Measuring and Mitigating Gaps in Structural Testing is publicly available on GitHub, Software Heritage and figshare, and is reusable. The artifact consists of necessary data, tools, scripts, and detailed documentation for running the experiments and reproducing the results shown in the paper. We have also provided a VirtualBox VM image a…
▽ More
The artifact used for evaluating the experimental results of Measuring and Mitigating Gaps in Structural Testing is publicly available on GitHub, Software Heritage and figshare, and is reusable. The artifact consists of necessary data, tools, scripts, and detailed documentation for running the experiments and reproducing the results shown in the paper. We have also provided a VirtualBox VM image allowing users to quickly setup and reproduce the results. Users are expected to be familiar using the VirtualBox software and Linux platform for evaluating or reusing the artifact.
△ Less
Submitted 1 August, 2023;
originally announced August 2023.
-
Neural-Based Test Oracle Generation: A Large-scale Evaluation and Lessons Learned
Authors:
Soneya Binta Hossain,
Antonio Filieri,
Matthew B. Dwyer,
Sebastian Elbaum,
Willem Visser
Abstract:
Defining test oracles is crucial and central to test development, but manual construction of oracles is expensive. While recent neural-based automated test oracle generation techniques have shown promise, their real-world effectiveness remains a compelling question requiring further exploration and understanding. This paper investigates the effectiveness of TOGA, a recently developed neural-based…
▽ More
Defining test oracles is crucial and central to test development, but manual construction of oracles is expensive. While recent neural-based automated test oracle generation techniques have shown promise, their real-world effectiveness remains a compelling question requiring further exploration and understanding. This paper investigates the effectiveness of TOGA, a recently developed neural-based method for automatic test oracle generation by Dinella et al. TOGA utilizes EvoSuite-generated test inputs and generates both exception and assertion oracles. In a Defects4j study, TOGA outperformed specification, search, and neural-based techniques, detecting 57 bugs, including 30 unique bugs not detected by other methods. To gain a deeper understanding of its applicability in real-world settings, we conducted a series of external, extended, and conceptual replication studies of TOGA.
In a large-scale study involving 25 real-world Java systems, 223.5K test cases, and 51K injected faults, we evaluate TOGA's ability to improve fault-detection effectiveness relative to the state-of-the-practice and the state-of-the-art. We find that TOGA misclassifies the type of oracle needed 24% of the time and that when it classifies correctly around 62% of the time it is not confident enough to generate any assertion oracle. When it does generate an assertion oracle, more than 47% of them are false positives, and the true positive assertions only increase fault detection by 0.3% relative to prior work. These findings expose limitations of the state-of-the-art neural-based oracle generation technique, provide valuable insights for improvement, and offer lessons for evaluating future automated oracle generation methods.
△ Less
Submitted 25 August, 2023; v1 submitted 29 July, 2023;
originally announced July 2023.
-
PCV: A Point Cloud-Based Network Verifier
Authors:
Arup Kumar Sarker,
Farzana Yasmin Ahmad,
Matthew B. Dwyer
Abstract:
3D vision with real-time LiDAR-based point cloud data became a vital part of autonomous system research, especially perception and prediction modules use for object classification, segmentation, and detection. Despite their success, point cloud-based network models are vulnerable to multiple adversarial attacks, where the certain factor of changes in the validation set causes significant performan…
▽ More
3D vision with real-time LiDAR-based point cloud data became a vital part of autonomous system research, especially perception and prediction modules use for object classification, segmentation, and detection. Despite their success, point cloud-based network models are vulnerable to multiple adversarial attacks, where the certain factor of changes in the validation set causes significant performance drop in well-trained networks. Most of the existing verifiers work perfectly on 2D convolution. Due to complex architecture, dimension of hyper-parameter, and 3D convolution, no verifiers can perform the basic layer-wise verification. It is difficult to conclude the robustness of a 3D vision model without performing the verification. Because there will be always corner cases and adversarial input that can compromise the model's effectiveness.
In this project, we describe a point cloud-based network verifier that successfully deals state of the art 3D classifier PointNet verifies the robustness by generating adversarial inputs. We have used extracted properties from the trained PointNet and changed certain factors for perturbation input. We calculate the impact on model accuracy versus property factor and can test PointNet network's robustness against a small collection of perturbing input states resulting from adversarial attacks like the suggested hybrid reverse signed attack. The experimental results reveal that the resilience property of PointNet is affected by our hybrid reverse signed perturbation strategy
△ Less
Submitted 30 January, 2023; v1 submitted 27 January, 2023;
originally announced January 2023.
-
A Brief Survey on Oracle-based Test Adequacy Metrics
Authors:
Soneya Binta Hossain,
Matthew B. Dwyer
Abstract:
Code coverage is a popular and widespread test adequacy metric that measures the percentage of program codes executed by a test suite. Despite its popularity, code coverage has several limitations. One of the major limitations is that it does not provide any insights into the quality or quantity of test oracles, a core component of testing. Due to this limitation, several studies have suggested th…
▽ More
Code coverage is a popular and widespread test adequacy metric that measures the percentage of program codes executed by a test suite. Despite its popularity, code coverage has several limitations. One of the major limitations is that it does not provide any insights into the quality or quantity of test oracles, a core component of testing. Due to this limitation, several studies have suggested that coverage is a poor test adequacy metric; therefore, it should not be used as an indicator of a test suite's fault detection effectiveness. To address this limitation, researchers have proposed extensions to traditional structural code coverage to explicitly consider the quality of test oracles. We refer to these extensions as oracle-based code coverage. This survey paper studies oracle-based coverage techniques published since their inception in 2007. We discuss each metric's definition, methodology, experimental studies, and research findings. Even though oracle-based coverage metrics are proven to be more effective than traditional coverage in detecting faults, they have received little attention in the software engineering community. We present all existing oracle-based adequacy metrics in this paper and compare the critical features against each other. We observe that different oracle-based adequacy metrics operate on different coverage domains and use diverse underlying analysis techniques, enabling a software tester to choose the appropriate metric based on the testing requirements. Our paper provides valuable information regarding the limitations of oracle-based methods, addressing which may help their broader adoption in software testing automation.
△ Less
Submitted 15 February, 2023; v1 submitted 12 December, 2022;
originally announced December 2022.
-
White-box Testing of NLP models with Mask Neuron Coverage
Authors:
Arshdeep Sekhon,
Yangfeng Ji,
Matthew B. Dwyer,
Yanjun Qi
Abstract:
Recent literature has seen growing interest in using black-box strategies like CheckList for testing the behavior of NLP models. Research on white-box testing has developed a number of methods for evaluating how thoroughly the internal behavior of deep models is tested, but they are not applicable to NLP models. We propose a set of white-box testing methods that are customized for transformer-base…
▽ More
Recent literature has seen growing interest in using black-box strategies like CheckList for testing the behavior of NLP models. Research on white-box testing has developed a number of methods for evaluating how thoroughly the internal behavior of deep models is tested, but they are not applicable to NLP models. We propose a set of white-box testing methods that are customized for transformer-based NLP models. These include Mask Neuron Coverage (MNCOVER) that measures how thoroughly the attention layers in models are exercised during testing. We show that MNCOVER can refine testing suites generated by CheckList by substantially reduce them in size, for more than 60\% on average, while retaining failing tests -- thereby concentrating the fault detection power of the test suite. Further we show how MNCOVER can be used to guide CheckList input generation, evaluate alternative NLP testing methods, and drive data augmentation to improve accuracy.
△ Less
Submitted 10 May, 2022;
originally announced May 2022.
-
Algorithm Selection for Software Verification using Graph Neural Networks
Authors:
Will Leeson,
Matthew B Dwyer
Abstract:
The field of software verification has produced a wide array of algorithmic techniques that can prove a variety of properties of a given program. It has been demonstrated that the performance of these techniques can vary up to 4 orders of magnitude on the same verification problem. Even for verification experts, it is difficult to decide which tool will perform best on a given problem. For general…
▽ More
The field of software verification has produced a wide array of algorithmic techniques that can prove a variety of properties of a given program. It has been demonstrated that the performance of these techniques can vary up to 4 orders of magnitude on the same verification problem. Even for verification experts, it is difficult to decide which tool will perform best on a given problem. For general users, deciding the best tool for their verification problem is effectively impossible.
In this work, we present Graves, a selection strategy based on graph neural networks (GNNs). Graves generates a graph representation of a program from which a GNN predicts a score for a verifier that indicates its performance on the program.
We evaluate Graves on a set of 10 verification tools and over 8000 verification problems and find that it improves the state-of-the-art in verification algorithm selection by 12%, or 8 percentage points. Further, it is able to verify 9% more problems than any existing verifier on our test set. Through a qualitative study on model interpretability, we find strong evidence that the Graves' model learns to base its predictions on factors that relate to the unique features of the algorithmic techniques.
△ Less
Submitted 5 September, 2023; v1 submitted 27 January, 2022;
originally announced January 2022.
-
DNNV: A Framework for Deep Neural Network Verification
Authors:
David Shriver,
Sebastian Elbaum,
Matthew B. Dwyer
Abstract:
Despite the large number of sophisticated deep neural network (DNN) verification algorithms, DNN verifier developers, users, and researchers still face several challenges. First, verifier developers must contend with the rapidly changing DNN field to support new DNN operations and property types. Second, verifier users have the burden of selecting a verifier input format to specify their problem.…
▽ More
Despite the large number of sophisticated deep neural network (DNN) verification algorithms, DNN verifier developers, users, and researchers still face several challenges. First, verifier developers must contend with the rapidly changing DNN field to support new DNN operations and property types. Second, verifier users have the burden of selecting a verifier input format to specify their problem. Due to the many input formats, this decision can greatly restrict the verifiers that a user may run. Finally, researchers face difficulties in re-using benchmarks to evaluate and compare verifiers, due to the large number of input formats required to run different verifiers. Existing benchmarks are rarely in formats supported by verifiers other than the one for which the benchmark was introduced. In this work we present DNNV, a framework for reducing the burden on DNN verifier researchers, developers, and users. DNNV standardizes input and output formats, includes a simple yet expressive DSL for specifying DNN properties, and provides powerful simplification and reduction operations to facilitate the application, development, and comparison of DNN verifiers. We show how DNNV increases the support of verifiers for existing benchmarks from 30% to 74%.
△ Less
Submitted 26 May, 2021;
originally announced May 2021.
-
Distribution-Aware Testing of Neural Networks Using Generative Models
Authors:
Swaroopa Dola,
Matthew B. Dwyer,
Mary Lou Soffa
Abstract:
The reliability of software that has a Deep Neural Network (DNN) as a component is urgently important today given the increasing number of critical applications being deployed with DNNs. The need for reliability raises a need for rigorous testing of the safety and trustworthiness of these systems. In the last few years, there have been a number of research efforts focused on testing DNNs. However…
▽ More
The reliability of software that has a Deep Neural Network (DNN) as a component is urgently important today given the increasing number of critical applications being deployed with DNNs. The need for reliability raises a need for rigorous testing of the safety and trustworthiness of these systems. In the last few years, there have been a number of research efforts focused on testing DNNs. However the test generation techniques proposed so far lack a check to determine whether the test inputs they are generating are valid, and thus invalid inputs are produced. To illustrate this situation, we explored three recent DNN testing techniques. Using deep generative model based input validation, we show that all the three techniques generate significant number of invalid test inputs. We further analyzed the test coverage achieved by the test inputs generated by the DNN testing techniques and showed how invalid test inputs can falsely inflate test coverage metrics.
To overcome the inclusion of invalid inputs in testing, we propose a technique to incorporate the valid input space of the DNN model under test in the test generation process. Our technique uses a deep generative model-based algorithm to generate only valid inputs. Results of our empirical studies show that our technique is effective in eliminating invalid tests and boosting the number of valid test inputs generated.
△ Less
Submitted 26 February, 2021;
originally announced February 2021.
-
Optimal Runtime Verification of Finite State Properties over Lossy Event Streams
Authors:
Peeyush Kushwaha,
Rahul Purandare,
Matthew B. Dwyer
Abstract:
Monitoring programs for finite state properties is challenging due to high memory and execution time overheads it incurs. Some events if skipped or lost naturally can reduce both overheads, but lead to uncertainty about the current monitor state. In this work, we present a theoretical framework to model these lossy event streams and provide a construction for a monitor which observes them without…
▽ More
Monitoring programs for finite state properties is challenging due to high memory and execution time overheads it incurs. Some events if skipped or lost naturally can reduce both overheads, but lead to uncertainty about the current monitor state. In this work, we present a theoretical framework to model these lossy event streams and provide a construction for a monitor which observes them without producing false positives. The constructed monitor is optimally sound among all complete monitors. We model several loss types of practical relevance using our framework and provide construction of smaller approximate monitors for properties with a large number of states.
△ Less
Submitted 8 April, 2020;
originally announced April 2020.
-
Refactoring Neural Networks for Verification
Authors:
David Shriver,
Dong Xu,
Sebastian Elbaum,
Matthew B. Dwyer
Abstract:
Deep neural networks (DNN) are growing in capability and applicability. Their effectiveness has led to their use in safety critical and autonomous systems, yet there is a dearth of cost-effective methods available for reasoning about the behavior of a DNN. In this paper, we seek to expand the applicability and scalability of existing DNN verification techniques through DNN refactoring. A DNN refac…
▽ More
Deep neural networks (DNN) are growing in capability and applicability. Their effectiveness has led to their use in safety critical and autonomous systems, yet there is a dearth of cost-effective methods available for reasoning about the behavior of a DNN. In this paper, we seek to expand the applicability and scalability of existing DNN verification techniques through DNN refactoring. A DNN refactoring defines (a) the transformation of the DNN's architecture, i.e., the number and size of its layers, and (b) the distillation of the learned relationships between the input features and function outputs of the original to train the transformed network. Unlike with traditional code refactoring, DNN refactoring does not guarantee functional equivalence of the two networks, but rather it aims to preserve the accuracy of the original network while producing a simpler network that is amenable to more efficient property verification. We present an automated framework for DNN refactoring, and demonstrate its potential effectiveness through three case studies on networks used in autonomous systems.
△ Less
Submitted 6 August, 2019;
originally announced August 2019.
-
SymInfer: Inferring Program Invariants using Symbolic States
Authors:
ThanhVu Nguyen,
Matthew B. Dwyer,
Willem Visser
Abstract:
We introduce a new technique for inferring program invariants that uses symbolic states generated by symbolic execution. Symbolic states, which consist of path conditions and constraints on local variables, are a compact description of sets of concrete program states and they can be used for both invariant inference and invariant verification. Our technique uses a counterexample-based algorithm th…
▽ More
We introduce a new technique for inferring program invariants that uses symbolic states generated by symbolic execution. Symbolic states, which consist of path conditions and constraints on local variables, are a compact description of sets of concrete program states and they can be used for both invariant inference and invariant verification. Our technique uses a counterexample-based algorithm that creates concrete states from symbolic states, infers candidate invariants from concrete states, and then verifies or refutes candidate invariants using symbolic states. The refutation case produces concrete counterexamples that prevent spurious results and allow the technique to obtain more precise invariants. This process stops when the algorithm reaches a stable set of invariants.
We present SymInfer, a tool that implements these ideas to automatically generate invariants at arbitrary locations in a Java program. The tool obtains symbolic states from Symbolic PathFinder and uses existing algorithms to infer complex (potentially nonlinear) numerical invariants. Our preliminary results show that SymInfer is effective in using symbolic states to generate precise and useful invariants for proving program safety and analyzing program runtime complexity. We also show that SymInfer outperforms existing invariant generation systems.
△ Less
Submitted 27 March, 2019;
originally announced March 2019.