Search | arXiv e-print repository

Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Graph Execution

Authors: Raffi Khatchadourian, Tatiana Castro Vélez, Mehdi Bagherzadeh, Nan Jia, Anita Raja

Abstract: Efficiency is essential to support responsiveness w.r.t. ever-growing datasets, especially for Deep Learning (DL) systems. DL frameworks have traditionally embraced deferred execution-style DL code -- supporting symbolic, graph-based Deep Neural Network (DNN) computation. While scalable, such development tends to produce code that is error-prone, non-intuitive, and difficult to debug. Consequently… ▽ More Efficiency is essential to support responsiveness w.r.t. ever-growing datasets, especially for Deep Learning (DL) systems. DL frameworks have traditionally embraced deferred execution-style DL code -- supporting symbolic, graph-based Deep Neural Network (DNN) computation. While scalable, such development tends to produce code that is error-prone, non-intuitive, and difficult to debug. Consequently, more natural, less error-prone imperative DL frameworks encouraging eager execution have emerged at the expense of run-time performance. Though hybrid approaches aim for the "best of both worlds," using them effectively requires subtle considerations to make code amenable to safe, accurate, and efficient graph execution -- avoiding performance bottlenecks and semantically inequivalent results. We present our ongoing work on an automated refactoring approach that assists developers in specifying whether and how their otherwise eagerly-executed imperative DL code could be reliably and efficiently executed as graphs at run-time in a semantics-preserving fashion. The approach, based on a novel tensor analysis specifically for imperative DL code, consists of refactoring preconditions for automatically determining when it is safe and potentially advantageous to migrate imperative DL code to graph execution and modifying decorator parameters or eagerly executing code already running as graphs. The approach is being implemented as a PyDev Eclipse IDE plug-in and uses the WALA Ariadne analysis framework. We discuss our ongoing work towards optimizing imperative DL code to its full potential. △ Less

Submitted 10 October, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

Comments: To appear in the NIER track of the IEEE/ACM International Conference on Automated Software Engineering, ASE '23, Kirchberg, Luxembourg, September 2023

arXiv:2206.07813 [pdf, other]

doi 10.1109/TSE.2023.3269804

A Search-Based Testing Approach for Deep Reinforcement Learning Agents

Authors: Amirhossein Zolfagharian, Manel Abdellatif, Lionel Briand, Mojtaba Bagherzadeh, Ramesh S

Abstract: Deep Reinforcement Learning (DRL) algorithms have been increasingly employed during the last decade to solve various decision-making problems such as autonomous driving and robotics. However, these algorithms have faced great challenges when deployed in safety-critical environments since they often exhibit erroneous behaviors that can lead to potentially critical errors. One way to assess the safe… ▽ More Deep Reinforcement Learning (DRL) algorithms have been increasingly employed during the last decade to solve various decision-making problems such as autonomous driving and robotics. However, these algorithms have faced great challenges when deployed in safety-critical environments since they often exhibit erroneous behaviors that can lead to potentially critical errors. One way to assess the safety of DRL agents is to test them to detect possible faults leading to critical failures during their execution. This raises the question of how we can efficiently test DRL policies to ensure their correctness and adherence to safety requirements. Most existing works on testing DRL agents use adversarial attacks that perturb states or actions of the agent. However, such attacks often lead to unrealistic states of the environment. Their main goal is to test the robustness of DRL agents rather than testing the compliance of agents' policies with respect to requirements. Due to the huge state space of DRL environments, the high cost of test execution, and the black-box nature of DRL algorithms, the exhaustive testing of DRL agents is impossible. In this paper, we propose a Search-based Testing Approach of Reinforcement Learning Agents (STARLA) to test the policy of a DRL agent by effectively searching for failing executions of the agent within a limited testing budget. We use machine learning models and a dedicated genetic algorithm to narrow the search towards faulty episodes. We apply STARLA on Deep-Q-Learning agents which are widely used as benchmarks and show that it significantly outperforms Random Testing by detecting more faults related to the agent's policy. We also investigate how to extract rules that characterize faulty episodes of the DRL agent using our search results. Such rules can be used to understand the conditions under which the agent fails and thus assess its deployment risks. △ Less

Submitted 4 August, 2023; v1 submitted 15 June, 2022; originally announced June 2022.

ACM Class: D.2.5; I.2.0

Journal ref: in IEEE Transactions on Software Engineering, vol. 49, no. 7, pp. 3715-3735, July 2023

arXiv:2201.09953 [pdf, other]

doi 10.1145/3524842.3528455

Challenges in Migrating Imperative Deep Learning Programs to Graph Execution: An Empirical Study

Authors: Tatiana Castro Vélez, Raffi Khatchadourian, Mehdi Bagherzadeh, Anita Raja

Abstract: Efficiency is essential to support responsiveness w.r.t. ever-growing datasets, especially for Deep Learning (DL) systems. DL frameworks have traditionally embraced deferred execution-style DL code that supports symbolic, graph-based Deep Neural Network (DNN) computation. While scalable, such development tends to produce DL code that is error-prone, non-intuitive, and difficult to debug. Consequen… ▽ More Efficiency is essential to support responsiveness w.r.t. ever-growing datasets, especially for Deep Learning (DL) systems. DL frameworks have traditionally embraced deferred execution-style DL code that supports symbolic, graph-based Deep Neural Network (DNN) computation. While scalable, such development tends to produce DL code that is error-prone, non-intuitive, and difficult to debug. Consequently, more natural, less error-prone imperative DL frameworks encouraging eager execution have emerged but at the expense of run-time performance. While hybrid approaches aim for the "best of both worlds," the challenges in applying them in the real world are largely unknown. We conduct a data-driven analysis of challenges -- and resultant bugs -- involved in writing reliable yet performant imperative DL code by studying 250 open-source projects, consisting of 19.7 MLOC, along with 470 and 446 manually examined code patches and bug reports, respectively. The results indicate that hybridization: (i) is prone to API misuse, (ii) can result in performance degradation -- the opposite of its intention, and (iii) has limited application due to execution mode incompatibility. We put forth several recommendations, best practices, and anti-patterns for effectively hybridizing imperative DL code, potentially benefiting DL practitioners, API designers, tool developers, and educators. △ Less

Submitted 5 April, 2022; v1 submitted 24 January, 2022; originally announced January 2022.

Comments: International Conference on Mining Software Repositories, MSR 2022. ACM/IEEE, ACM, May 2022

ACM Class: D.2.m

Journal ref: ACM/IEEE International Conference on Mining Software Repositories, May 2022

arXiv:2112.12591 [pdf, other]

doi 10.1109/TSE.2023.3243522

Black-Box Testing of Deep Neural Networks Through Test Case Diversity

Authors: Zohreh Aghababaeyan, Manel Abdellatif, Lionel Briand, Ramesh S, Mojtaba Bagherzadeh

Abstract: Deep Neural Networks (DNNs) have been extensively used in many areas including image processing, medical diagnostics, and autonomous driving. However, DNNs can exhibit erroneous behaviours that may lead to critical errors, especially when used in safety-critical systems. Inspired by testing techniques for traditional software systems, researchers have proposed neuron coverage criteria, as an analo… ▽ More Deep Neural Networks (DNNs) have been extensively used in many areas including image processing, medical diagnostics, and autonomous driving. However, DNNs can exhibit erroneous behaviours that may lead to critical errors, especially when used in safety-critical systems. Inspired by testing techniques for traditional software systems, researchers have proposed neuron coverage criteria, as an analogy to source code coverage, to guide the testing of DNN models. Despite very active research on DNN coverage, several recent studies have questioned the usefulness of such criteria in guiding DNN testing. Further, from a practical standpoint, these criteria are white-box as they require access to the internals or training data of DNN models, which is in many contexts not feasible or convenient. In this paper, we investigate black-box input diversity metrics as an alternative to white-box coverage criteria. To this end, we first select and adapt three diversity metrics and study, in a controlled manner, their capacity to measure actual diversity in input sets. We then analyse their statistical association with fault detection using four datasets and five DNN models. We further compare diversity with state-of-the-art white-box coverage criteria. Our experiments show that relying on the diversity of image features embedded in test input sets is a more reliable indicator than coverage criteria to effectively guide the testing of DNNs. Indeed, we found that one of our selected black-box diversity metrics far outperforms existing coverage criteria in terms of fault-revealing capability and computational time. Results also confirm the suspicions that state-of-the-art coverage metrics are not adequate to guide the construction of test input sets to detect as many faults as possible with natural inputs. △ Less

Submitted 6 February, 2023; v1 submitted 20 December, 2021; originally announced December 2021.

Journal ref: IEEE Transactions on Software Engineering (TSE) (2023) 1-26

arXiv:2112.02758 [pdf, other]

doi 10.1109/ICSE-Companion55297.2022.9793736

A Tool for Rejuvenating Feature Logging Levels via Git Histories and Degree of Interest

Authors: Yiming Tang, Allan Spektor, Raffi Khatchadourian, Mehdi Bagherzadeh

Abstract: Logging is a significant programming practice. Due to the highly transactional nature of modern software applications, massive amount of logs are generated every day, which may overwhelm developers. Logging information overload can be dangerous to software applications. Using log levels, developers can print the useful information while hiding the verbose logs during software runtime. As software… ▽ More Logging is a significant programming practice. Due to the highly transactional nature of modern software applications, massive amount of logs are generated every day, which may overwhelm developers. Logging information overload can be dangerous to software applications. Using log levels, developers can print the useful information while hiding the verbose logs during software runtime. As software evolves, the log levels of logging statements associated with the surrounding software feature implementation may also need to be altered. Maintaining log levels necessitates a significant amount of manual effort. In this paper, we demonstrate an automated approach that can rejuvenate feature log levels by matching the interest level of developers in the surrounding features. The approach is implemented as an open-source Eclipse plugin, using two external plug-ins (JGit and Mylyn). It was tested on 18 open-source Java projects consisting of ~3 million lines of code and ~4K log statements. Our tool successfully analyzes 99.22% of logging statements, increases log level distributions by ~20%, and increases the focus of logs in bug fix contexts ~83% of the time. For further details, interested readers can watch our demonstration video (https://www.youtube.com/watch?v=qIULoAXoDv4). △ Less

Submitted 16 February, 2022; v1 submitted 5 December, 2021; originally announced December 2021.

Comments: 4 pages, ICSE '22 (tool demo track)

Journal ref: International Conference on Software Engineering, ICSE 2022. ACM/IEEE, ACM, May 2022

arXiv:2109.13168 [pdf, other]

doi 10.1109/TSE.2022.3184842

Scalable and Accurate Test Case Prioritization in Continuous Integration Contexts

Authors: Ahmadreza Saboor Yaraghi, Mojtaba Bagherzadeh, Nafiseh Kahani, Lionel Briand

Abstract: Continuous Integration (CI) requires efficient regression testing to ensure software quality without significantly delaying its CI builds. This warrants the need for techniques to reduce regression testing time, such as Test Case Prioritization (TCP) techniques that prioritize the execution of test cases to detect faults as early as possible. Many recent TCP studies employ various Machine Learning… ▽ More Continuous Integration (CI) requires efficient regression testing to ensure software quality without significantly delaying its CI builds. This warrants the need for techniques to reduce regression testing time, such as Test Case Prioritization (TCP) techniques that prioritize the execution of test cases to detect faults as early as possible. Many recent TCP studies employ various Machine Learning (ML) techniques to deal with the dynamic and complex nature of CI. However, most of them use a limited number of features for training ML models and evaluate the models on subjects for which the application of TCP makes little practical sense, due to their small regression testing time and low number of failed builds. In this work, we first define, at a conceptual level, a data model that captures data sources and their relations in a typical CI environment. Second, based on this data model, we define a comprehensive set of features that covers all features previously used by related studies. Third, we develop methods and tools to collect the defined features for 25 open-source software systems with enough failed builds and whose regression testing takes at least five minutes. Fourth, relying on the collected dataset containing a comprehensive feature set, we answer four research questions concerning data collection time, the effectiveness of ML-based TCP, the impact of the features on effectiveness, the decay of ML-based TCP models over time, and the trade-off between data collection time and the effectiveness of ML-based TCP techniques. △ Less

Submitted 5 April, 2022; v1 submitted 27 September, 2021; originally announced September 2021.

Comments: 27 pages, LaTeX; Minor writing corrections in the abstract; Major revision

arXiv:2109.03138 [pdf, other]

Interests, Difficulties, Sentiments, and Tool Usages of Concurrency Developers: A Large-Scale Study on Stack Overflow

Authors: Mehdi Bagherzadeh, Syed Ahmed, Srilakshmi Sripathi, Raffi Khatchadourian

Abstract: Context: Software developers are increasingly facing the challenges of writing code that is not only concurrent but also correct. Objective: To help these developers, it is necessary to understand concurrency topics they are interested in, their difficulty in finding answers for questions in these topics, their sentiment for these topics, and how they use concurrency tools and techniques to guar… ▽ More Context: Software developers are increasingly facing the challenges of writing code that is not only concurrent but also correct. Objective: To help these developers, it is necessary to understand concurrency topics they are interested in, their difficulty in finding answers for questions in these topics, their sentiment for these topics, and how they use concurrency tools and techniques to guarantee correctness. Method: We conduct a large-scale study on the entirety of Stack Overflow to understand interests, difficulties, sentiment, and tool usages of concurrency developers. We discuss the implications of our findings for the practice, research, and education of concurrent software development, and investigate the relation of our findings with the findings of the previous work. Results: A few findings of our study are: (1) questions that concurrency developers ask can be grouped into a hierarchy with 27 concurrency topics under 8 major categories, (2) thread safety is among the most popular concurrency topics and client-server concurrency is among the least popular, (3) irreproducible behavior is among the most difficult topics and memory consistency is among the least difficult, (4) data scra** is among the most positive concurrency topics and irreproducible behavior is among the most negative, (5) root cause identification has the most number of questions for usage of data race tools and alternative use has the least. Conclusion: The results of our study can not only help concurrency developers but also concurrency educators and researchers to better decide where to focus their efforts, by trading off one concurrency topic against another. △ Less

Submitted 9 September, 2021; v1 submitted 7 September, 2021; originally announced September 2021.

ACM Class: D.1.3

arXiv:2106.13891 [pdf, other]

doi 10.1007/s10664-021-10066-6

Test Case Selection and Prioritization Using Machine Learning: A Systematic Literature Review

Authors: Rongqi Pan, Mojtaba Bagherzadeh, Taher A. Ghaleb, Lionel Briand

Abstract: Regression testing is an essential activity to assure that software code changes do not adversely affect existing functionalities. With the wide adoption of Continuous Integration (CI) in software projects, which increases the frequency of running software builds, running all tests can be time-consuming and resource-intensive. To alleviate that problem, Test case Selection and Prioritization (TSP)… ▽ More Regression testing is an essential activity to assure that software code changes do not adversely affect existing functionalities. With the wide adoption of Continuous Integration (CI) in software projects, which increases the frequency of running software builds, running all tests can be time-consuming and resource-intensive. To alleviate that problem, Test case Selection and Prioritization (TSP) techniques have been proposed to improve regression testing by selecting and prioritizing test cases in order to provide early feedback to developers. In recent years, researchers have relied on Machine Learning (ML) techniques to achieve effective TSP (ML-based TSP). Such techniques help combine information about test cases, from partial and imperfect sources, into accurate prediction models. This work conducts a systematic literature review focused on ML-based TSP techniques, aiming to perform an in-depth analysis of the state of the art, thus gaining insights regarding future avenues of research. To that end, we analyze 29 primary studies published from 2006 to 2020, which have been identified through a systematic and documented process. This paper addresses five research questions addressing variations in ML-based TSP techniques and feature sets for training and testing ML models, alternative metrics used for evaluating the techniques, the performance of techniques, and the reproducibility of the published studies. We summarize the results related to our research questions in a high-level summary that can be used as a taxonomy for classifying future TSP studies. △ Less

Submitted 5 October, 2021; v1 submitted 25 June, 2021; originally announced June 2021.

Journal ref: Empirical Software Engineering (EMSE). (2021) 1-34

arXiv:2104.07736 [pdf, other]

doi 10.1016/j.scico.2021.102724

Automated Evolution of Feature Logging Statement Levels Using Git Histories and Degree of Interest

Authors: Yiming Tang, Allan Spektor, Raffi Khatchadourian, Mehdi Bagherzadeh

Abstract: Logging -- used for system events and security breaches to describe more informational yet essential aspects of software features -- is pervasive. Given the high transactionality of today's software, logging effectiveness can be reduced by information overload. Log levels help alleviate this problem by correlating a priority to logs that can be later filtered. As software evolves, however, levels… ▽ More Logging -- used for system events and security breaches to describe more informational yet essential aspects of software features -- is pervasive. Given the high transactionality of today's software, logging effectiveness can be reduced by information overload. Log levels help alleviate this problem by correlating a priority to logs that can be later filtered. As software evolves, however, levels of logs documenting surrounding feature implementations may also require modification as features once deemed important may have decreased in urgency and vice-versa. We present an automated approach that assists developers in evolving levels of such (feature) logs. The approach, based on mining Git histories and manipulating a degree of interest (DOI) model, transforms source code to revitalize feature log levels based on the "interestingness" of the surrounding code. Built upon JGit and Mylyn, the approach is implemented as an Eclipse IDE plug-in and evaluated on 18 Java projects with $\sim$3 million lines of code and $\sim$4K log statements. Our tool successfully analyzes 99.22% of logging statements, increases log level distributions by $\sim$20%, and increases the focus of logs in bug fix contexts $\sim$83% of the time. Moreover, pull (patch) requests were integrated into large and popular open-source projects. The results indicate that the approach is promising in assisting developers in evolving feature log levels. △ Less

Submitted 12 September, 2021; v1 submitted 15 April, 2021; originally announced April 2021.

Comments: 37 pages, 1 figure

Journal ref: Science of Computer Programming, 2022-02, Vol. 214, p.102724

arXiv:2103.17194 [pdf, other]

doi 10.1109/TSE.2020.3008850

Execution of Partial State Machine Models

Authors: Mojtaba Bagherzadeh, Nafiseh Kahani, Karim Jahed, Juergen Dingel

Abstract: The iterative and incremental nature of software development using models typically makes a model of a system incomplete (i.e., partial) until a more advanced and complete stage of development is reached. Existing model execution approaches (interpretation of models or code generation) do not support the execution of partial models. Supporting the execution of partial models at the early stages of… ▽ More The iterative and incremental nature of software development using models typically makes a model of a system incomplete (i.e., partial) until a more advanced and complete stage of development is reached. Existing model execution approaches (interpretation of models or code generation) do not support the execution of partial models. Supporting the execution of partial models at the early stages of software development allows early detection of defects, which can be fixed more easily and at a lower cost. This paper proposes a conceptual framework for the execution of partial models, which consists of three steps: static analysis, automatic refinement, and input-driven execution. First, a static analysis that respects the execution semantics of models is applied to detect problematic elements of models that cause problems for the execution. Second, using model transformation techniques, the models are refined automatically, mainly by adding decision points where missing information can be supplied. Third, refined models are executed, and when the execution reaches the decision points, it uses inputs obtained either interactively or by a script that captures how to deal with partial elements. We created an execution engine called PMExec for the execution of partial models of UML-RT (i.e., a modeling language for the development of soft real-time systems) that embodies our proposed framework. We evaluated PMExec based on several use-cases that show that the static analysis, refinement, and application of user input can be carried out with reasonable performance and that the overhead of approach, which is mostly due to the refinement and the increase in model complexity it causes, is manageable. We also discuss the properties of the refinement formally and show how the refinement preserves the original behaviors of the model. △ Less

Submitted 31 March, 2021; originally announced March 2021.

Journal ref: IEEE Transactions on Software Engineering (TSE). (2020) 1-28

arXiv:2011.01834 [pdf, other]

Reinforcement Learning for Test Case Prioritization

Authors: Mojtaba Bagherzadeh, Nafiseh Kahani, Lionel Briand

Abstract: Continuous Integration (CI) significantly reduces integration problems, speeds up development time, and shortens release time. However, it also introduces new challenges for quality assurance activities, including regression testing, which is the focus of this work. Though various approaches for test case prioritization have shown to be very promising in the context of regression testing, specific… ▽ More Continuous Integration (CI) significantly reduces integration problems, speeds up development time, and shortens release time. However, it also introduces new challenges for quality assurance activities, including regression testing, which is the focus of this work. Though various approaches for test case prioritization have shown to be very promising in the context of regression testing, specific techniques must be designed to deal with the dynamic nature and timing constraints of CI. Recently, Reinforcement Learning (RL) has shown great potential in various challenging scenarios that require continuous adaptation, such as game playing, real-time ads bidding, and recommender systems. Inspired by this line of work and building on initial efforts in supporting test case prioritization with RL techniques, we perform here a comprehensive investigation of RL-based test case prioritization in a CI context. To this end, taking test case prioritization as a ranking problem, we model the sequential interactions between the CI environment and a test case prioritization agent as an RL problem, using three alternative ranking models. We then rely on carefully selected and tailored state-of-the-art RL techniques to automatically and continuously learn a test case prioritization strategy, whose objective is to be as close as possible to the optimal one. Our extensive experimental analysis shows that the best RL solutions provide a significant accuracy improvement over previous RL-based work, with prioritization strategies getting close to being optimal, thus paving the way for using RL to prioritize test cases in a CI context. △ Less

Submitted 25 March, 2021; v1 submitted 3 November, 2020; originally announced November 2020.

Journal ref: IEEE Transactions on Software Engineering (TSE). (2021) 1-21

arXiv:2002.00863 [pdf, other]

doi 10.1109/TR.2021.3074750

Supporting DNN Safety Analysis and Retraining through Heatmap-based Unsupervised Learning

Authors: Hazem Fahmy, Fabrizio Pastore, Mojtaba Bagherzadeh, Lionel Briand

Abstract: Deep neural networks (DNNs) are increasingly important in safety-critical systems, for example in their perception layer to analyze images. Unfortunately, there is a lack of methods to ensure the functional safety of DNN-based components. We observe three major challenges with existing practices regarding DNNs in safety-critical systems: (1) scenarios that are underrepresented in the test set may… ▽ More Deep neural networks (DNNs) are increasingly important in safety-critical systems, for example in their perception layer to analyze images. Unfortunately, there is a lack of methods to ensure the functional safety of DNN-based components. We observe three major challenges with existing practices regarding DNNs in safety-critical systems: (1) scenarios that are underrepresented in the test set may lead to serious safety violation risks, but may, however, remain unnoticed; (2) characterizing such high-risk scenarios is critical for safety analysis; (3) retraining DNNs to address these risks is poorly supported when causes of violations are difficult to determine. To address these problems in the context of DNNs analyzing images, we propose HUDD, an approach that automatically supports the identification of root causes for DNN errors. HUDD identifies root causes by applying a clustering algorithm to heatmaps capturing the relevance of every DNN neuron on the DNN outcome. Also, HUDD retrains DNNs with images that are automatically selected based on their relatedness to the identified image clusters. We evaluated HUDD with DNNs from the automotive domain. HUDD was able to identify all the distinct root causes of DNN errors, thus supporting safety analysis. Also, our retraining approach has shown to be more effective at improving DNN accuracy than existing approaches. △ Less

Submitted 22 April, 2021; v1 submitted 3 February, 2020; originally announced February 2020.

Showing 1–12 of 12 results for author: Bagherzadeh, M