Search | arXiv e-print repository

Predicting Safety Misbehaviours in Autonomous Driving Systems using Uncertainty Quantification

Authors: Ruben Grewal, Paolo Tonella, Andrea Stocco

Abstract: The automated real-time recognition of unexpected situations plays a crucial role in the safety of autonomous vehicles, especially in unsupported and unpredictable scenarios. This paper evaluates different Bayesian uncertainty quantification methods from the deep learning domain for the anticipatory testing of safety-critical misbehaviours during system-level simulation-based testing. Specifically… ▽ More The automated real-time recognition of unexpected situations plays a crucial role in the safety of autonomous vehicles, especially in unsupported and unpredictable scenarios. This paper evaluates different Bayesian uncertainty quantification methods from the deep learning domain for the anticipatory testing of safety-critical misbehaviours during system-level simulation-based testing. Specifically, we compute uncertainty scores as the vehicle executes, following the intuition that high uncertainty scores are indicative of unsupported runtime conditions that can be used to distinguish safe from failure-inducing driving behaviors. In our study, we conducted an evaluation of the effectiveness and computational overhead associated with two Bayesian uncertainty quantification methods, namely MC- Dropout and Deep Ensembles, for misbehaviour avoidance. Overall, for three benchmarks from the Udacity simulator comprising both out-of-distribution and unsafe conditions introduced via mutation testing, both methods successfully detected a high number of out-of-bounds episodes providing early warnings several seconds in advance, outperforming two state-of-the-art misbehaviour prediction methods based on autoencoders and attention maps in terms of effectiveness and efficiency. Notably, Deep Ensembles detected most misbehaviours without any false alarms and did so even when employing a relatively small number of models, making them computationally feasible for real-time detection. Our findings suggest that incorporating uncertainty quantification methods is a viable approach for building fail-safe mechanisms in deep neural network-based autonomous vehicles. △ Less

Submitted 29 April, 2024; originally announced April 2024.

Comments: In Proceedings of 17th IEEE International Conference on Software Testing, Verification and Validation 2024 (ICST '24)

arXiv:2403.13729 [pdf, other]

Reinforcement Learning for Online Testing of Autonomous Driving Systems: a Replication and Extension Study

Authors: Luca Giamattei, Matteo Biagiola, Roberto Pietrantuono, Stefano Russo, Paolo Tonella

Abstract: In a recent study, Reinforcement Learning (RL) used in combination with many-objective search, has been shown to outperform alternative techniques (random search and many-objective search) for online testing of Deep Neural Network-enabled systems. The empirical evaluation of these techniques was conducted on a state-of-the-art Autonomous Driving System (ADS). This work is a replication and extensi… ▽ More In a recent study, Reinforcement Learning (RL) used in combination with many-objective search, has been shown to outperform alternative techniques (random search and many-objective search) for online testing of Deep Neural Network-enabled systems. The empirical evaluation of these techniques was conducted on a state-of-the-art Autonomous Driving System (ADS). This work is a replication and extension of that empirical study. Our replication shows that RL does not outperform pure random test generation in a comparison conducted under the same settings of the original study, but with no confounding factor coming from the way collisions are measured. Our extension aims at eliminating some of the possible reasons for the poor performance of RL observed in our replication: (1) the presence of reward components providing contrasting or useless feedback to the RL agent; (2) the usage of an RL algorithm (Q-learning) which requires discretization of an intrinsically continuous state space. Results show that our new RL agent is able to converge to an effective policy that outperforms random testing. Results also highlight other possible improvements, which open to further investigations on how to best leverage RL for online ADS testing. △ Less

Submitted 20 March, 2024; originally announced March 2024.

arXiv:2312.15302 [pdf, other]

doi 10.1109/TSE.2024.3407840

GenMorph: Automatically Generating Metamorphic Relations via Genetic Programming

Authors: Jon Ayerdi, Valerio Terragni, Gunel Jahangirova, Aitor Arrieta, Paolo Tonella

Abstract: Metamorphic testing is a popular approach that aims to alleviate the oracle problem in software testing. At the core of this approach are Metamorphic Relations (MRs), specifying properties that hold among multiple test inputs and corresponding outputs. Deriving MRs is mostly a manual activity, since their automated generation is a challenging and largely unexplored problem. This paper presents G… ▽ More Metamorphic testing is a popular approach that aims to alleviate the oracle problem in software testing. At the core of this approach are Metamorphic Relations (MRs), specifying properties that hold among multiple test inputs and corresponding outputs. Deriving MRs is mostly a manual activity, since their automated generation is a challenging and largely unexplored problem. This paper presents GenMorph, a technique to automatically generate MRs for Java methods that involve inputs and outputs that are boolean, numerical, or ordered sequences. GenMorph uses an evolutionary algorithm to search for effective test oracles, i.e., oracles that trigger no false alarms and expose software faults in the method under test. The proposed search algorithm is guided by two fitness functions that measure the number of false alarms and the number of missed faults for the generated MRs. Our results show that GenMorph generates effective MRs for 18 out of 23 methods (mutation score >20%). Furthermore, it can increase Randoop's fault detection capability in 7 out of 23 methods, and Evosuite's in 14 out of 23 methods. When compared with AutoMR, a state-of-the-art MR generator, GenMorph also outperformed its fault detection capability in 9 out of 10 methods. △ Less

Submitted 5 June, 2024; v1 submitted 23 December, 2023; originally announced December 2023.

arXiv:2307.10590 [pdf, other]

Boundary State Generation for Testing and Improvement of Autonomous Driving Systems

Authors: Matteo Biagiola, Paolo Tonella

Abstract: Recent advances in Deep Neural Networks (DNNs) and sensor technologies are enabling autonomous driving systems (ADSs) with an ever-increasing level of autonomy. However, assessing their dependability remains a critical concern. State-of-the-art ADS testing approaches modify the controllable attributes of a simulated driving environment until the ADS misbehaves. Such approaches have two main drawba… ▽ More Recent advances in Deep Neural Networks (DNNs) and sensor technologies are enabling autonomous driving systems (ADSs) with an ever-increasing level of autonomy. However, assessing their dependability remains a critical concern. State-of-the-art ADS testing approaches modify the controllable attributes of a simulated driving environment until the ADS misbehaves. Such approaches have two main drawbacks: (1) modifications to the simulated environment might not be easily transferable to the in-field test setting (e.g., changing the road shape); (2) environment instances in which the ADS is successful are discarded, despite the possibility that they could contain hidden driving conditions in which the ADS may misbehave. In this paper, we present GenBo (GENerator of BOundary state pairs), a novel test generator for ADS testing. GenBo mutates the driving conditions of the ego vehicle (position, velocity and orientation), collected in a failure-free environment instance, and efficiently generates challenging driving conditions at the behavior boundary (i.e., where the model starts to misbehave) in the same environment. We use such boundary conditions to augment the initial training dataset and retrain the DNN model under test. Our evaluation results show that the retrained model has up to 16 higher success rate on a separate set of evaluation tracks with respect to the original DNN model. △ Less

Submitted 20 July, 2023; originally announced July 2023.

arXiv:2306.07400 [pdf, other]

Neural Embeddings for Web Testing

Authors: Andrea Stocco, Alexandra Willi, Luigi Libero Lucio Starace, Matteo Biagiola, Paolo Tonella

Abstract: Web test automation techniques employ web crawlers to automatically produce a web app model that is used for test generation. Existing crawlers rely on app-specific, threshold-based, algorithms to assess state equivalence. Such algorithms are hard to tune in the general case and cannot accurately identify and remove near-duplicate web pages from crawl models. Failing to retrieve an accurate web ap… ▽ More Web test automation techniques employ web crawlers to automatically produce a web app model that is used for test generation. Existing crawlers rely on app-specific, threshold-based, algorithms to assess state equivalence. Such algorithms are hard to tune in the general case and cannot accurately identify and remove near-duplicate web pages from crawl models. Failing to retrieve an accurate web app model results in automated test generation solutions that produce redundant test cases and inadequate test suites that do not cover the web app functionalities adequately. In this paper, we propose WEBEMBED, a novel abstraction function based on neural network embeddings and threshold-free classifiers that can be used to produce accurate web app models during model-based test generation. Our evaluation on nine web apps shows that WEBEMBED outperforms state-of-the-art techniques by detecting near-duplicates more accurately, inferring better web app models that exhibit 22% more precision, and 24% more recall on average. Consequently, the test suites generated from these models achieve higher code coverage, with improvements ranging from 2% to 59% on an app-wise basis and averaging at 23%. △ Less

Submitted 12 June, 2023; originally announced June 2023.

Comments: 12 pages; in revision

arXiv:2305.12751 [pdf, other]

doi 10.1145/3631970

Testing of Deep Reinforcement Learning Agents with Surrogate Models

Authors: Matteo Biagiola, Paolo Tonella

Abstract: Deep Reinforcement Learning (DRL) has received a lot of attention from the research community in recent years. As the technology moves away from game playing to practical contexts, such as autonomous vehicles and robotics, it is crucial to evaluate the quality of DRL agents. In this paper, we propose a search-based approach to test such agents. Our approach, implemented in a tool called Indago, tr… ▽ More Deep Reinforcement Learning (DRL) has received a lot of attention from the research community in recent years. As the technology moves away from game playing to practical contexts, such as autonomous vehicles and robotics, it is crucial to evaluate the quality of DRL agents. In this paper, we propose a search-based approach to test such agents. Our approach, implemented in a tool called Indago, trains a classifier on failure and non-failure environment (i.e., pass) configurations resulting from the DRL training process. The classifier is used at testing time as a surrogate model for the DRL agent execution in the environment, predicting the extent to which a given environment configuration induces a failure of the DRL agent under test. The failure prediction acts as a fitness function, guiding the generation towards failure environment configurations, while saving computation time by deferring the execution of the DRL agent in the environment to those configurations that are more likely to expose failures. Experimental results show that our search-based approach finds 50% more failures of the DRL agent than state-of-the-art techniques. Moreover, such failures are, on average, 78% more diverse; similarly, the behaviors of the DRL agent induced by failure configurations are 74% more diverse. △ Less

Submitted 11 November, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

ACM Class: D.2.5

arXiv:2305.08060 [pdf, other]

doi 10.1007/s10664-024-10458-4

Two is Better Than One: Digital Siblings to Improve Autonomous Driving Testing

Authors: Matteo Biagiola, Andrea Stocco, Vincenzo Riccio, Paolo Tonella

Abstract: Simulation-based testing represents an important step to ensure the reliability of autonomous driving software. In practice, when companies rely on third-party general-purpose simulators, either for in-house or outsourced testing, the generalizability of testing results to real autonomous vehicles is at stake. In this paper, we enhance simulation-based testing by introducing the notion of digital… ▽ More Simulation-based testing represents an important step to ensure the reliability of autonomous driving software. In practice, when companies rely on third-party general-purpose simulators, either for in-house or outsourced testing, the generalizability of testing results to real autonomous vehicles is at stake. In this paper, we enhance simulation-based testing by introducing the notion of digital siblings, a multi-simulator approach that tests a given autonomous vehicle on multiple general-purpose simulators built with different technologies, that operate collectively as an ensemble in the testing process. We exemplify our approach on a case study focused on testing the lane-kee** component of an autonomous vehicle. We use two open-source simulators as digital siblings, and we empirically compare such a multi-simulator approach against a digital twin of a physical scaled autonomous vehicle on a large set of test cases. Our approach requires generating and running test cases for each individual simulator, in the form of sequences of road points. Then, test cases are migrated between simulators, using feature maps to characterize the exercised driving conditions. Finally, the joint predicted failure probability is computed, and a failure is reported only in cases of agreement among the siblings. Our empirical evaluation shows that the ensemble failure predictor by the digital siblings is superior to each individual simulator at predicting the failures of the digital twin. We discuss the findings of our case study and detail how our approach can help researchers interested in automated testing of autonomous driving software. △ Less

Submitted 24 April, 2024; v1 submitted 14 May, 2023; originally announced May 2023.

ACM Class: D.2.5

arXiv:2304.02654 [pdf, other]

Adopting Two Supervisors for Efficient Use of Large-Scale Remote Deep Neural Networks

Authors: Michael Weiss, Paolo Tonella

Abstract: Recent decades have seen the rise of large-scale Deep Neural Networks (DNNs) to achieve human-competitive performance in a variety of artificial intelligence tasks. Often consisting of hundreds of millions, if not hundreds of billion parameters, these DNNs are too large to be deployed to, or efficiently run on resource-constrained devices such as mobile phones or IoT microcontrollers. Systems rely… ▽ More Recent decades have seen the rise of large-scale Deep Neural Networks (DNNs) to achieve human-competitive performance in a variety of artificial intelligence tasks. Often consisting of hundreds of millions, if not hundreds of billion parameters, these DNNs are too large to be deployed to, or efficiently run on resource-constrained devices such as mobile phones or IoT microcontrollers. Systems relying on large-scale DNNs thus have to call the corresponding model over the network, leading to substantial costs for hosting and running the large-scale remote model, costs which are often charged on a per-use basis. In this paper, we propose BiSupervised, a novel architecture, where, before relying on a large remote DNN, a system attempts to make a prediction on a small-scale local model. A DNN supervisor monitors said prediction process and identifies easy inputs for which the local prediction can be trusted. For these inputs, the remote model does not have to be invoked, thus saving costs, while only marginally impacting the overall system accuracy. Our architecture furthermore foresees a second supervisor to monitor the remote predictions and identify inputs for which not even these can be trusted, allowing to raise an exception or run a fallback strategy instead. We evaluate the cost savings, and the ability to detect incorrectly predicted inputs on four diverse case studies: IMDB movie review sentiment classification, Github issue triaging, Imagenet image classification, and SQuADv2 free-text question answering △ Less

Submitted 5 April, 2023; originally announced April 2023.

Comments: A corresponding registered Report (i.e., paper without results) was accepted at ACM TOSEM. This preprint includes the (not yet peer-reviewed) results

arXiv:2301.11568 [pdf, other]

Repairing DNN Architecture: Are We There Yet?

Authors: **han Kim, Nargiz Humbatova, Gunel Jahangirova, Paolo Tonella, Shin Yoo

Abstract: As Deep Neural Networks (DNNs) are rapidly being adopted within large software systems, software developers are increasingly required to design, train, and deploy such models into the systems they develop. Consequently, testing and improving the robustness of these models have received a lot of attention lately. However, relatively little effort has been made to address the difficulties developers… ▽ More As Deep Neural Networks (DNNs) are rapidly being adopted within large software systems, software developers are increasingly required to design, train, and deploy such models into the systems they develop. Consequently, testing and improving the robustness of these models have received a lot of attention lately. However, relatively little effort has been made to address the difficulties developers experience when designing and training such models: if the evaluation of a model shows poor performance after the initial training, what should the developer change? We survey and evaluate existing state-of-the-art techniques that can be used to repair model performance, using a benchmark of both real-world mistakes developers made while designing DNN models and artificial faulty models generated by mutating the model code. The empirical evaluation shows that random baseline is comparable with or sometimes outperforms existing state-of-the-art techniques. However, for larger and more complicated models, all repair techniques fail to find fixes. Our findings call for further research to develop more sophisticated techniques for Deep Learning repair. △ Less

Submitted 27 January, 2023; originally announced January 2023.

Comments: Paper accepted to ICST 2023

arXiv:2212.11368 [pdf, ps, other]

When and Why Test Generators for Deep Learning Produce Invalid Inputs: an Empirical Study

Authors: Vincenzo Riccio, Paolo Tonella

Abstract: Testing Deep Learning (DL) based systems inherently requires large and representative test sets to evaluate whether DL systems generalise beyond their training datasets. Diverse Test Input Generators (TIGs) have been proposed to produce artificial inputs that expose issues of the DL systems by triggering misbehaviours. Unfortunately, such generated inputs may be invalid, i.e., not recognisable as… ▽ More Testing Deep Learning (DL) based systems inherently requires large and representative test sets to evaluate whether DL systems generalise beyond their training datasets. Diverse Test Input Generators (TIGs) have been proposed to produce artificial inputs that expose issues of the DL systems by triggering misbehaviours. Unfortunately, such generated inputs may be invalid, i.e., not recognisable as part of the input domain, thus providing an unreliable quality assessment. Automated validators can ease the burden of manually checking the validity of inputs for human testers, although input validity is a concept difficult to formalise and, thus, automate. In this paper, we investigate to what extent TIGs can generate valid inputs, according to both automated and human validators. We conduct a large empirical study, involving 2 different automated validators, 220 human assessors, 5 different TIGs and 3 classification tasks. Our results show that 84% artificially generated inputs are valid, according to automated validators, but their expected label is not always preserved. Automated validators reach a good consensus with humans (78% accuracy), but still have limitations when dealing with feature-rich datasets. △ Less

Submitted 21 December, 2022; originally announced December 2022.

Comments: To be published in Proceedings of the 45th ACM/IEEE International Conference on Software Engineering (ICSE 2023)

ACM Class: D.2.5

arXiv:2212.07118 [pdf, other]

Uncertainty Quantification for Deep Neural Networks: An Empirical Comparison and Usage Guidelines

Authors: Michael Weiss, Paolo Tonella

Abstract: Deep Neural Networks (DNN) are increasingly used as components of larger software systems that need to process complex data, such as images, written texts, audio/video signals. DNN predictions cannot be assumed to be always correct for several reasons, among which the huge input space that is dealt with, the ambiguity of some inputs data, as well as the intrinsic properties of learning algorithms,… ▽ More Deep Neural Networks (DNN) are increasingly used as components of larger software systems that need to process complex data, such as images, written texts, audio/video signals. DNN predictions cannot be assumed to be always correct for several reasons, among which the huge input space that is dealt with, the ambiguity of some inputs data, as well as the intrinsic properties of learning algorithms, which can provide only statistical warranties. Hence, developers have to cope with some residual error probability. An architectural pattern commonly adopted to manage failure-prone components is the supervisor, an additional component that can estimate the reliability of the predictions made by untrusted (e.g., DNN) components and can activate an automated healing procedure when these are likely to fail, ensuring that the Deep Learning based System (DLS) does not cause damages, despite its main functionality being suspended. In this paper, we consider DLS that implement a supervisor by means of uncertainty estimation. After overviewing the main approaches to uncertainty estimation and discussing their pros and cons, we motivate the need for a specific empirical assessment method that can deal with the experimental setting in which supervisors are used, where the accuracy of the DNN matters only as long as the supervisor lets the DLS continue to operate. Then we present a large empirical study conducted to compare the alternative approaches to uncertainty estimation. We distilled a set of guidelines for developers that are useful to incorporate a supervisor based on uncertainty monitoring into a DLS. △ Less

Submitted 14 December, 2022; originally announced December 2022.

Comments: Accepted for publication at the Journal of Software: Testing, Verification and Reliability. arXiv admin note: substantial text overlap with arXiv:2102.00902

arXiv:2207.10495 [pdf, other]

Generating and Detecting True Ambiguity: A Forgotten Danger in DNN Supervision Testing

Authors: Michael Weiss, André García Gómez, Paolo Tonella

Abstract: Deep Neural Networks (DNNs) are becoming a crucial component of modern software systems, but they are prone to fail under conditions that are different from the ones observed during training (out-of-distribution inputs) or on inputs that are truly ambiguous, i.e., inputs that admit multiple classes with nonzero probability in their labels. Recent work proposed DNN supervisors to detect high-uncert… ▽ More Deep Neural Networks (DNNs) are becoming a crucial component of modern software systems, but they are prone to fail under conditions that are different from the ones observed during training (out-of-distribution inputs) or on inputs that are truly ambiguous, i.e., inputs that admit multiple classes with nonzero probability in their labels. Recent work proposed DNN supervisors to detect high-uncertainty inputs before their possible misclassification leads to any harm. To test and compare the capabilities of DNN supervisors, researchers proposed test generation techniques, to focus the testing effort on high-uncertainty inputs that should be recognized as anomalous by supervisors. However, existing test generators aim to produce out-of-distribution inputs. No existing model- and supervisor independent technique targets the generation of truly ambiguous test inputs, i.e., inputs that admit multiple classes according to expert human judgment. In this paper, we propose a novel way to generate ambiguous inputs to test DNN supervisors and used it to empirically compare several existing supervisor techniques. In particular, we propose AmbiGuess to generate ambiguous samples for image classification problems. AmbiGuess is based on gradient-guided sampling in the latent space of a regularized adversarial autoencoder. Moreover, we conducted what is -- to the best of our knowledge -- the most extensive comparative study of DNN supervisors, considering their capabilities to detect 4 distinct types of high-uncertainty inputs, including truly ambiguous ones. We find that the tested supervisors' capabilities are complementary: Those best suited to detect true ambiguity perform worse on invalid, out-of-distribution and adversarial inputs and vice-versa. △ Less

Submitted 8 September, 2023; v1 submitted 21 July, 2022; originally announced July 2022.

Comments: Accepted for publication at Springers "Empirical Software Engineering" (EMSE)

arXiv:2205.00664 [pdf, other]

doi 10.1145/3533767.3534375

Simple Techniques Work Surprisingly Well for Neural Network Test Prioritization and Active Learning (Replicability Study)

Authors: Michael Weiss, Paolo Tonella

Abstract: Test Input Prioritizers (TIP) for Deep Neural Networks (DNN) are an important technique to handle the typically very large test datasets efficiently, saving computation and labeling costs. This is particularly true for large-scale, deployed systems, where inputs observed in production are recorded to serve as potential test or training data for the next versions of the system. Feng et. al. propose… ▽ More Test Input Prioritizers (TIP) for Deep Neural Networks (DNN) are an important technique to handle the typically very large test datasets efficiently, saving computation and labeling costs. This is particularly true for large-scale, deployed systems, where inputs observed in production are recorded to serve as potential test or training data for the next versions of the system. Feng et. al. propose DeepGini, a very fast and simple TIP, and show that it outperforms more elaborate techniques such as neuron- and surprise coverage. In a large-scale study (4 case studies, 8 test datasets, 32'200 trained models) we verify their findings. However, we also find that other comparable or even simpler baselines from the field of uncertainty quantification, such as the predicted softmax likelihood or the entropy of the predicted softmax likelihoods perform equally well as DeepGini. △ Less

Submitted 24 May, 2022; v1 submitted 2 May, 2022; originally announced May 2022.

Comments: Accepted at ISSTA 2022

arXiv:2112.11255 [pdf, other]

doi 10.1109/TSE.2022.3202311

Mind the Gap! A Study on the Transferability of Virtual vs Physical-world Testing of Autonomous Driving Systems

Authors: Andrea Stocco, Brian Pulfer, Paolo Tonella

Abstract: Safe deployment of self-driving cars (SDC) necessitates thorough simulated and in-field testing. Most testing techniques consider virtualized SDCs within a simulation environment, whereas less effort has been directed towards assessing whether such techniques transfer to and are effective with a physical real-world vehicle. In this paper, we shed light on the problem of generalizing testing result… ▽ More Safe deployment of self-driving cars (SDC) necessitates thorough simulated and in-field testing. Most testing techniques consider virtualized SDCs within a simulation environment, whereas less effort has been directed towards assessing whether such techniques transfer to and are effective with a physical real-world vehicle. In this paper, we shed light on the problem of generalizing testing results obtained in a driving simulator to a physical platform and provide a characterization and quantification of the sim2real gap affecting SDC testing. In our empirical study, we compare SDC testing when deployed on a physical small-scale vehicle vs its digital twin. Due to the unavailability of driving quality indicators from the physical platform, we use neural rendering to estimate them through visual odometry, hence allowing full comparability with the digital twin. Then, we investigate the transferability of behavior and failure exposure between virtual and real-world environments, targeting both unintended abnormal test data and intended adversarial examples. Our study shows that, despite the usage of a faithful digital twin, there are still critical shortcomings that contribute to the reality gap between the virtual and physical world, threatening existing testing solutions that only consider virtual SDCs. On the positive side, our results present the test configurations for which physical testing can be avoided, either because their outcome does transfer between virtual and physical environments, or because the uncertainty profiles in the simulator can help predict their outcome in the real world. △ Less

Submitted 25 August, 2022; v1 submitted 21 December, 2021; originally announced December 2021.

Comments: 13 pages; Accepted for publication in the IEEE Transactions of Software Engineering (TSE)

arXiv:2109.07514 [pdf, other]

DeepMetis: Augmenting a Deep Learning Test Set to Increase its Mutation Score

Authors: Vincenzo Riccio, Nargiz Humbatova, Gunel Jahangirova, Paolo Tonella

Abstract: Deep Learning (DL) components are routinely integrated into software systems that need to perform complex tasks such as image or natural language processing. The adequacy of the test data used to test such systems can be assessed by their ability to expose artificially injected faults (mutations) that simulate real DL faults. In this paper, we describe an approach to automatically generate new tes… ▽ More Deep Learning (DL) components are routinely integrated into software systems that need to perform complex tasks such as image or natural language processing. The adequacy of the test data used to test such systems can be assessed by their ability to expose artificially injected faults (mutations) that simulate real DL faults. In this paper, we describe an approach to automatically generate new test inputs that can be used to augment the existing test set so that its capability to detect DL mutations increases. Our tool DeepMetis implements a search based input generation strategy. To account for the non-determinism of the training and the mutation processes, our fitness function involves multiple instances of the DL model under test. Experimental results show that \tool is effective at augmenting the given test set, increasing its capability to detect mutants by 63% on average. A leave-one-out experiment shows that the augmented test set is capable of exposing unseen mutants, which simulate the occurrence of yet undetected faults. △ Less

Submitted 15 September, 2021; originally announced September 2021.

Comments: To be published in Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering (ASE 2021)

ACM Class: D.2.5

arXiv:2107.06997 [pdf, other]

DeepHyperion: Exploring the Feature Space of Deep Learning-Based Systems through Illumination Search

Authors: Tahereh Zohdinasab, Vincenzo Riccio, Alessio Gambi, Paolo Tonella

Abstract: Deep Learning (DL) has been successfully applied to a wide range of application domains, including safety-critical ones. Several DL testing approaches have been recently proposed in the literature but none of them aims to assess how different interpretable features of the generated inputs affect the system's behaviour. In this paper, we resort to Illumination Search to find the highest-performing… ▽ More Deep Learning (DL) has been successfully applied to a wide range of application domains, including safety-critical ones. Several DL testing approaches have been recently proposed in the literature but none of them aims to assess how different interpretable features of the generated inputs affect the system's behaviour. In this paper, we resort to Illumination Search to find the highest-performing test cases (i.e., misbehaving and closest to misbehaving), spread across the cells of a map representing the feature space of the system. We introduce a methodology that guides the users of our approach in the tasks of identifying and quantifying the dimensions of the feature space for a given domain. We developed DeepHyperion, a search-based tool for DL systems that illuminates, i.e., explores at large, the feature space, by providing developers with an interpretable feature map where automatically generated inputs are placed along with information about the exposed behaviours. △ Less

Submitted 5 July, 2021; originally announced July 2021.

Comments: To be published in Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA '21), July 11-17, 2021, Virtual, Denmark. ACM, New York, NY, USA, 12 pages

ACM Class: D.2.5

arXiv:2103.05939 [pdf, other]

A Review and Refinement of Surprise Adequacy

Authors: Michael Weiss, Rwiddhi Chakraborty, Paolo Tonella

Abstract: Surprise Adequacy (SA) is one of the emerging and most promising adequacy criteria for Deep Learning (DL) testing. As an adequacy criterion, it has been used to assess the strength of DL test suites. In addition, it has also been used to find inputs to a Deep Neural Network (DNN) which were not sufficiently represented in the training data, or to select samples for DNN retraining. However, computa… ▽ More Surprise Adequacy (SA) is one of the emerging and most promising adequacy criteria for Deep Learning (DL) testing. As an adequacy criterion, it has been used to assess the strength of DL test suites. In addition, it has also been used to find inputs to a Deep Neural Network (DNN) which were not sufficiently represented in the training data, or to select samples for DNN retraining. However, computation of the SA metric for a test suite can be prohibitively expensive, as it involves a quadratic number of distance calculations. Hence, we developed and released a performance-optimized, but functionally equivalent, implementation of SA, reducing the evaluation time by up to 97\%. We also propose refined variants of the SA omputation algorithm, aiming to further increase the evaluation speed. We then performed an empirical study on MNIST, focused on the out-of-distribution detection capabilities of SA, which allowed us to reproduce parts of the results presented when SA was first released. The experiments show that our refined variants are substantially faster than plain SA, while producing comparable outcomes. Our experimental results exposed also an overlooked issue of SA: it can be highly sensitive to the non-determinism associated with the DNN training procedure. △ Less

Submitted 10 March, 2021; originally announced March 2021.

Comments: Accepted at DeepTest 2021 (ICSE Workshop)

arXiv:2103.02901 [pdf, other]

GAssert: A Fully Automated Tool to Improve Assertion Oracles

Authors: Valerio Terragni, Gunel Jahangirova, Paolo Tonella, Mauro Pezzè

Abstract: This demo presents the implementation and usage details of GASSERT, the first tool to automatically improve assertion oracles. Assertion oracles are executable boolean expressions placed inside the program that should pass (return true) for all correct executions and fail (return false) for all incorrect executions. Because designing perfect assertion oracles is difficult, assertions are prone to… ▽ More This demo presents the implementation and usage details of GASSERT, the first tool to automatically improve assertion oracles. Assertion oracles are executable boolean expressions placed inside the program that should pass (return true) for all correct executions and fail (return false) for all incorrect executions. Because designing perfect assertion oracles is difficult, assertions are prone to both false positives (the assertion fails but should pass) and false negatives (the assertion passes but should fail). Given a Java method containing an assertion oracle to improve, GASSERT returns an improved assertion with fewer false positives and false negatives than the initial assertion. Internally, GASSERT implements a novel co-evolutionary algorithm that explores the space of possible assertions guided by two fitness functions that reward assertions with fewer false positives, fewer false negatives, and smaller size. △ Less

Submitted 4 March, 2021; originally announced March 2021.

Comments: 4 pages, published at the 43nd IEEE/ACM International Conference on Software Engineering, Demonstration Track ICSE-DEMO 2021

arXiv:2102.00902 [pdf, ps, other]

Fail-Safe Execution of Deep Learning based Systems through Uncertainty Monitoring

Authors: Michael Weiss, Paolo Tonella

Abstract: Modern software systems rely on Deep Neural Networks (DNN) when processing complex, unstructured inputs, such as images, videos, natural language texts or audio signals. Provided the intractably large size of such input spaces, the intrinsic limitations of learning algorithms, and the ambiguity about the expected predictions for some of the inputs, not only there is no guarantee that DNN's predict… ▽ More Modern software systems rely on Deep Neural Networks (DNN) when processing complex, unstructured inputs, such as images, videos, natural language texts or audio signals. Provided the intractably large size of such input spaces, the intrinsic limitations of learning algorithms, and the ambiguity about the expected predictions for some of the inputs, not only there is no guarantee that DNN's predictions are always correct, but rather developers must safely assume a low, though not negligible, error probability. A fail-safe Deep Learning based System (DLS) is one equipped to handle DNN faults by means of a supervisor, capable of recognizing predictions that should not be trusted and that should activate a healing procedure bringing the DLS to a safe state. In this paper, we propose an approach to use DNN uncertainty estimators to implement such a supervisor. We first discuss the advantages and disadvantages of existing approaches to measure uncertainty for DNNs and propose novel metrics for the empirical assessment of the supervisor that rely on such approaches. We then describe our publicly available tool UNCERTAINTY-WIZARD, which allows transparent estimation of uncertainty for regular tf.keras DNNs. Lastly, we discuss a large-scale study conducted on four different subjects to empirically validate the approach, reporting the lessons-learned as guidance for software engineers who intend to monitor uncertainty for fail-safe execution of DLS. △ Less

Submitted 1 February, 2021; originally announced February 2021.

Comments: Accepted at IEEE International Conference on Software Testing, Verification and Validation 2021

arXiv:2101.02636 [pdf, other]

doi 10.1145/3502868

Deep Reinforcement Learning for Black-Box Testing of Android Apps

Authors: Andrea Romdhana, Alessio Merlo, Mariano Ceccato, Paolo Tonella

Abstract: The state space of Android apps is huge and its thorough exploration during testing remains a major challenge. In fact, the best exploration strategy is highly dependent on the features of the app under test. Reinforcement Learning (RL) is a machine learning technique that learns the optimal strategy to solve a task by trial and error, guided by positive or negative reward, rather than by explicit… ▽ More The state space of Android apps is huge and its thorough exploration during testing remains a major challenge. In fact, the best exploration strategy is highly dependent on the features of the app under test. Reinforcement Learning (RL) is a machine learning technique that learns the optimal strategy to solve a task by trial and error, guided by positive or negative reward, rather than by explicit supervision. Deep RL is a recent extension of RL that takes advantage of the learning capabilities of neural networks. Such capabilities make Deep RL suitable for complex exploration spaces such as the one of Android apps. However, state of the art, publicly available tools only support basic, tabular RL. We have developed ARES, a Deep RL approach for black-box testing of Android apps. Experimental results show that it achieves higher coverage and fault revelation than the baselines, which include state of the art RL based tools, such as TimeMachine and Q-Testing. We also investigated qualitatively the reasons behind such performance and we have identified the key features of Android apps that make Deep RL particularly effective on them to be the presence of chained and blocking activities. △ Less

Submitted 15 January, 2021; v1 submitted 7 January, 2021; originally announced January 2021.

Journal ref: ACM Transactions on Software Engineering and Methodology, 2022

arXiv:2101.00982 [pdf, ps, other]

Uncertainty-Wizard: Fast and User-Friendly Neural Network Uncertainty Quantification

Authors: Michael Weiss, Paolo Tonella

Abstract: Uncertainty and confidence have been shown to be useful metrics in a wide variety of techniques proposed for deep learning testing, including test data selection and system supervision.We present uncertainty-wizard, a tool that allows to quantify such uncertainty and confidence in artificial neural networks. It is built on top of the industry-leading tf.keras deep learning API and it provides a ne… ▽ More Uncertainty and confidence have been shown to be useful metrics in a wide variety of techniques proposed for deep learning testing, including test data selection and system supervision.We present uncertainty-wizard, a tool that allows to quantify such uncertainty and confidence in artificial neural networks. It is built on top of the industry-leading tf.keras deep learning API and it provides a near-transparent and easy to understand interface. At the same time, it includes major performance optimizations that we benchmarked on two different machines and different configurations. △ Less

Submitted 28 January, 2021; v1 submitted 29 December, 2020; originally announced January 2021.

Comments: Accepted for publication at the IEEE International Conference on Software Testing, Verification and Validation 2021

arXiv:2011.10787 [pdf, other]

An Empirical Study on Failed Error Propagation in Java Programs with Real Faults

Authors: Gunel Jahangirova, David Clark, Mark Harman, Paolo Tonella

Abstract: During testing, developers can place oracles externally or internally with respect to a method. Given a faulty execution state, i.e., one that differs from the expected one, an oracle might be unable to expose the fault if it is placed at a program point with no access to the incorrect program state or where the program state is no longer corrupted. In such a case, the oracle is subject to failed… ▽ More During testing, developers can place oracles externally or internally with respect to a method. Given a faulty execution state, i.e., one that differs from the expected one, an oracle might be unable to expose the fault if it is placed at a program point with no access to the incorrect program state or where the program state is no longer corrupted. In such a case, the oracle is subject to failed error propagation. We conducted an empirical study to measure failed error propagation on Defects4J, the reference benchmark for Java programs with real faults, considering all 6 projects available (386 real bugs and 459 fixed methods). Our results indicate that the prevalence of failed error propagation is negligible when testing is performed at the unit level. However, when system-level inputs are provided, the prevalence of failed error propagation increases substantially. This indicates that it is enough for method postconditions to predicate only on the externally observable state/data and that intermediate steps should be checked when testing at system level. △ Less

Submitted 21 November, 2020; originally announced November 2020.

arXiv:2007.02787 [pdf, other]

Model-based Exploration of the Frontier of Behaviours for Deep Learning System Testing

Authors: Vincenzo Riccio, Paolo Tonella

Abstract: With the increasing adoption of Deep Learning (DL) for critical tasks, such as autonomous driving, the evaluation of the quality of systems that rely on DL has become crucial. Once trained, DL systems produce an output for any arbitrary numeric vector provided as input, regardless of whether it is within or outside the validity domain of the system under test. Hence, the quality of such systems is… ▽ More With the increasing adoption of Deep Learning (DL) for critical tasks, such as autonomous driving, the evaluation of the quality of systems that rely on DL has become crucial. Once trained, DL systems produce an output for any arbitrary numeric vector provided as input, regardless of whether it is within or outside the validity domain of the system under test. Hence, the quality of such systems is determined by the intersection between their validity domain and the regions where their outputs exhibit a misbehaviour. In this paper, we introduce the notion of frontier of behaviours, i.e., the inputs at which the DL system starts to misbehave. If the frontier of misbehaviours is outside the validity domain of the system, the quality check is passed. Otherwise, the inputs at the intersection represent quality deficiencies of the system. We developed DeepJanus, a search-based tool that generates frontier inputs for DL systems. The experimental results obtained for the lane kee** component of a self-driving car show that the frontier of a well trained system contains almost exclusively unrealistic roads that violate the best practices of civil engineering, while the frontier of a poorly trained one includes many valid inputs that point to serious deficiencies of the system. △ Less

Submitted 6 July, 2020; originally announced July 2020.

Comments: To be published in the Proceedings of the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2020); 13 pages, 6 figures

ACM Class: D.2.5

arXiv:2002.01785 [pdf, other]

A Framework for In-Vivo Testing of Mobile Applications

Authors: Mariano Ceccato, Davide Corradini, Luca Gazzola, Fitsum Meshesha Kifetew, Leonardo Mariani, Matteo Orrù, Paolo Tonella

Abstract: The ecosystem in which mobile applications run is highly heterogeneous and configurable. All layers upon which mobile apps are built offer wide possibilities of variations, from the device and the hardware, to the operating system and middleware, up to the user preferences and settings. Testing all possible configurations exhaustively, before releasing the app, is unaffordable. As a consequence, t… ▽ More The ecosystem in which mobile applications run is highly heterogeneous and configurable. All layers upon which mobile apps are built offer wide possibilities of variations, from the device and the hardware, to the operating system and middleware, up to the user preferences and settings. Testing all possible configurations exhaustively, before releasing the app, is unaffordable. As a consequence, the app may exhibit different, including faulty, behaviours when executed in the field, under specific configurations. In this paper, we describe a framework that can be instantiated to support in-vivo testing of a mobile app. The framework monitors the configuration in the field and triggers in-vivo testing when an untested configuration is recognized. Experimental results show that the overhead introduced by monitoring is unnoticeable to negligible (i.e., 0-6%) depending on the device being used (high- vs. low-end). In-vivo test execution required on average 3s: if performed upon screen lock activation, it introduces just a slight delay before locking the device. △ Less

Submitted 5 February, 2020; originally announced February 2020.

Comments: Research paper accepted to ICST'20, 10+1 pages

arXiv:1910.11015 [pdf, other]

Taxonomy of Real Faults in Deep Learning Systems

Authors: Nargiz Humbatova, Gunel Jahangirova, Gabriele Bavota, Vincenzo Riccio, Andrea Stocco, Paolo Tonella

Abstract: The growing application of deep neural networks in safety-critical domains makes the analysis of faults that occur in such systems of enormous importance. In this paper we introduce a large taxonomy of faults in deep learning (DL) systems. We have manually analysed 1059 artefacts gathered from GitHub commits and issues of projects that use the most popular DL frameworks (TensorFlow, Keras and PyTo… ▽ More The growing application of deep neural networks in safety-critical domains makes the analysis of faults that occur in such systems of enormous importance. In this paper we introduce a large taxonomy of faults in deep learning (DL) systems. We have manually analysed 1059 artefacts gathered from GitHub commits and issues of projects that use the most popular DL frameworks (TensorFlow, Keras and PyTorch) and from related Stack Overflow posts. Structured interviews with 20 researchers and practitioners describing the problems they have encountered in their experience have enriched our taxonomy with a variety of additional faults that did not emerge from the other two sources. Our final taxonomy was validated with a survey involving an additional set of 21 developers, confirming that almost all fault categories (13/15) were experienced by at least 50% of the survey participants. △ Less

Submitted 7 November, 2019; v1 submitted 24 October, 2019; originally announced October 2019.

arXiv:1910.04443 [pdf, other]

Misbehaviour Prediction for Autonomous Driving Systems

Authors: Andrea Stocco, Michael Weiss, Marco Calzana, Paolo Tonella

Abstract: Deep Neural Networks (DNNs) are the core component of modern autonomous driving systems. To date, it is still unrealistic that a DNN will generalize correctly in all driving conditions. Current testing techniques consist of offline solutions that identify adversarial or corner cases for improving the training phase, and little has been done for enabling online healing of DNN-based vehicles. In thi… ▽ More Deep Neural Networks (DNNs) are the core component of modern autonomous driving systems. To date, it is still unrealistic that a DNN will generalize correctly in all driving conditions. Current testing techniques consist of offline solutions that identify adversarial or corner cases for improving the training phase, and little has been done for enabling online healing of DNN-based vehicles. In this paper, we address the problem of estimating the confidence of DNNs in response to unexpected execution contexts with the purpose of predicting potential safety-critical misbehaviours such as out of bound episodes or collisions. Our approach SelfOracle is based on a novel concept of self-assessment oracle, which monitors the DNN confidence at runtime, to predict unsupported driving scenarios in advance. SelfOracle uses autoencoder and time-series-based anomaly detection to reconstruct the driving scenarios seen by the car, and determine the confidence boundary of normal/unsupported conditions. In our empirical assessment, we evaluated the effectiveness of different variants of SelfOracle at predicting injected anomalous driving contexts, using DNN models and simulation environment from Udacity. Results show that, overall, SelfOracle can predict 77% misbehaviours, up to 6 seconds in advance, outperforming the online input validation approach of DeepRoad by a factor almost equal to 3. △ Less

Submitted 10 October, 2019; originally announced October 2019.

Comments: 11 pages

arXiv:1905.00357 [pdf, other]

doi 10.1145/3338906.3338948

Web Test Dependency Detection

Authors: Matteo Biagiola, Andrea Stocco, Ali Mesbah, Filippo Ricca, Paolo Tonella

Abstract: E2E web test suites are prone to test dependencies due to the heterogeneous multi-tiered nature of modern web apps, which makes it difficult for developers to create isolated program states for each test case. In this paper, we present the first approach for detecting and validating test dependencies present in E2E web test suites. Our approach employs string analysis to extract an approximated se… ▽ More E2E web test suites are prone to test dependencies due to the heterogeneous multi-tiered nature of modern web apps, which makes it difficult for developers to create isolated program states for each test case. In this paper, we present the first approach for detecting and validating test dependencies present in E2E web test suites. Our approach employs string analysis to extract an approximated set of dependencies from the test code. It then filters potential false dependencies through natural language processing of test names. Finally, it validates all dependencies, and uses a novel recovery algorithm to ensure no true dependencies are missed in the final test dependency graph. Our approach is implemented in a tool called TEDD and evaluated on the test suites of six open-source web apps. Our results show that TEDD can correctly detect and validate test dependencies up to 72% faster than the baseline with the original test ordering in which the graph contains all possible dependencies. The test dependency graphs produced by TEDD enable test execution parallelization, with a speed-up factor of up to 7x. △ Less

Submitted 10 October, 2019; v1 submitted 1 May, 2019; originally announced May 2019.

Comments: 11 pages, published in the Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2019), pp. 154-164

arXiv:1704.02774 [pdf, other]

doi 10.1109/ICPC.2017.2

How Professional Hackers Understand Protected Code while Performing Attack Tasks

Authors: Mariano Ceccato, Paolo Tonella, Cataldo Basile, Bart Coppens, Bjorn De Sutter, Paolo Falcarin, Marco Torchiano

Abstract: Code protections aim at blocking (or at least delaying) reverse engineering and tampering attacks to critical assets within programs. Knowing the way hackers understand protected code and perform attacks is important to achieve a stronger protection of the software assets, based on realistic assumptions about the hackers' behaviour. However, building such knowledge is difficult because hackers can… ▽ More Code protections aim at blocking (or at least delaying) reverse engineering and tampering attacks to critical assets within programs. Knowing the way hackers understand protected code and perform attacks is important to achieve a stronger protection of the software assets, based on realistic assumptions about the hackers' behaviour. However, building such knowledge is difficult because hackers can hardly be involved in controlled experiments and empirical studies. The FP7 European project Aspire has given the authors of this paper the unique opportunity to have access to the professional penetration testers employed by the three industrial partners. In particular, we have been able to perform a qualitative analysis of three reports of professional penetration test performed on protected industrial code. Our qualitative analysis of the reports consists of open coding, carried out by 7 annotators and resulting in 459 annotations, followed by concept extraction and model inference. We identified the main activities: understanding, building attack, choosing and customizing tools, and working around or defeating protections. We built a model of how such activities take place. We used such models to identify a set of research directions for the creation of stronger code protections. △ Less

Submitted 26 May, 2017; v1 submitted 10 April, 2017; originally announced April 2017.

Comments: Post-print for ICPC 2017 conference

arXiv:1704.02307 [pdf, other]

doi 10.1109/SCAM.2016.17

Assessment of Source Code Obfuscation Techniques

Authors: Alessio Viticchié, Leonardo Regano, Marco Torchiano, Cataldo Basile, Mariano Ceccato, Paolo Tonella, Roberto Tiella

Abstract: Obfuscation techniques are a general category of software protections widely adopted to prevent malicious tampering of the code by making applications more difficult to understand and thus harder to modify. Obfuscation techniques are divided in code and data obfuscation, depending on the protected asset. While preliminary empirical studies have been conducted to determine the impact of code obfusc… ▽ More Obfuscation techniques are a general category of software protections widely adopted to prevent malicious tampering of the code by making applications more difficult to understand and thus harder to modify. Obfuscation techniques are divided in code and data obfuscation, depending on the protected asset. While preliminary empirical studies have been conducted to determine the impact of code obfuscation, our work aims at assessing the effectiveness and efficiency in preventing attacks of a specific data obfuscation technique - VarMerge. We conducted an experiment with student participants performing two attack tasks on clear and obfuscated versions of two applications written in C. The experiment showed a significant effect of data obfuscation on both the time required to complete and the successful attack efficiency. An application with VarMerge reduces by six times the number of successful attacks per unit of time. This outcome provides a practical clue that can be used when applying software protections based on data obfuscation. △ Less

Submitted 7 April, 2017; originally announced April 2017.

Comments: Post-print, SCAM 2016

arXiv:cs/0607006 [pdf]

Applying and Combining Three Different Aspect Mining Techniques

Authors: Mariano Ceccato, Marius Marin, Kim Mens, Leon Moonen, Paolo Tonella, Tom Tourwe

Abstract: Understanding a software system at source-code level requires understanding the different concerns that it addresses, which in turn requires a way to identify these concerns in the source code. Whereas some concerns are explicitly represented by program entities (like classes, methods and variables) and thus are easy to identify, crosscutting concerns are not captured by a single program entity… ▽ More Understanding a software system at source-code level requires understanding the different concerns that it addresses, which in turn requires a way to identify these concerns in the source code. Whereas some concerns are explicitly represented by program entities (like classes, methods and variables) and thus are easy to identify, crosscutting concerns are not captured by a single program entity but are scattered over many program entities and are tangled with the other concerns. Because of their crosscutting nature, such crosscutting concerns are difficult to identify, and reduce the understandability of the system as a whole. In this paper, we report on a combined experiment in which we try to identify crosscutting concerns in the JHotDraw framework automatically. We first apply three independently developed aspect mining techniques to JHotDraw and evaluate and compare their results. Based on this analysis, we present three interesting combinations of these three techniques, and show how these combinations provide a more complete coverage of the detected concerns as compared to the original techniques individually. Our results are a first step towards improving the understandability of a system that contains crosscutting concerns, and can be used as a basis for refactoring the identified crosscutting concerns into aspects. △ Less

Submitted 2 July, 2006; originally announced July 2006.

Comments: 28 pages

Report number: TUD-SERG-2006-002

Showing 1–30 of 30 results for author: Tonella, P