-
Data-Driven Evidence-Based Syntactic Sugar Design
Authors:
David OBrien,
Robert Dyer,
Tien N. Nguyen,
Hridesh Rajan
Abstract:
Programming languages are essential tools for developers, and their evolution plays a crucial role in supporting the activities of developers. One instance of programming language evolution is the introduction of syntactic sugars, which are additional syntax elements that provide alternative, more readable code constructs. However, the process of designing and evolving a programming language has t…
▽ More
Programming languages are essential tools for developers, and their evolution plays a crucial role in supporting the activities of developers. One instance of programming language evolution is the introduction of syntactic sugars, which are additional syntax elements that provide alternative, more readable code constructs. However, the process of designing and evolving a programming language has traditionally been guided by anecdotal experiences and intuition. Recent advances in tools and methodologies for mining open-source repositories have enabled developers to make data-driven software engineering decisions. In light of this, this paper proposes an approach for motivating data-driven programming evolution by applying frequent subgraph mining techniques to a large dataset of 166,827,154 open-source Java methods. The dataset is mined by generalizing Java control-flow graphs to capture broad programming language usages and instances of duplication. Frequent subgraphs are then extracted to identify potentially impactful opportunities for new syntactic sugars. Our diverse results demonstrate the benefits of the proposed technique by identifying new syntactic sugars involving a variety of programming constructs that could be implemented in Java, thus simplifying frequent code idioms. This approach can potentially provide valuable insights for Java language designers, and serve as a proof-of-concept for data-driven programming language design and evolution.
△ Less
Submitted 1 February, 2024;
originally announced February 2024.
-
Inferring Data Preconditions from Deep Learning Models for Trustworthy Prediction in Deployment
Authors:
Shibbir Ahmed,
Hongyang Gao,
Hridesh Rajan
Abstract:
Deep learning models are trained with certain assumptions about the data during the development stage and then used for prediction in the deployment stage. It is important to reason about the trustworthiness of the model's predictions with unseen data during deployment. Existing methods for specifying and verifying traditional software are insufficient for this task, as they cannot handle the comp…
▽ More
Deep learning models are trained with certain assumptions about the data during the development stage and then used for prediction in the deployment stage. It is important to reason about the trustworthiness of the model's predictions with unseen data during deployment. Existing methods for specifying and verifying traditional software are insufficient for this task, as they cannot handle the complexity of DNN model architecture and expected outcomes. In this work, we propose a novel technique that uses rules derived from neural network computations to infer data preconditions for a DNN model to determine the trustworthiness of its predictions. Our approach, DeepInfer involves introducing a novel abstraction for a trained DNN model that enables weakest precondition reasoning using Dijkstra's Predicate Transformer Semantics. By deriving rules over the inductive type of neural network abstract representation, we can overcome the matrix dimensionality issues that arise from the backward non-linear computation from the output layer to the input layer. We utilize the weakest precondition computation using rules of each kind of activation function to compute layer-wise precondition from the given postcondition on the final output of a deep neural network. We extensively evaluated DeepInfer on 29 real-world DNN models using four different datasets collected from five different sources and demonstrated the utility, effectiveness, and performance improvement over closely related work. DeepInfer efficiently detects correct and incorrect predictions of high-accuracy models with high recall (0.98) and high F-1 score (0.84) and has significantly improved over prior technique, SelfChecker. The average runtime overhead of DeepInfer is low, 0.22 sec for all unseen datasets. We also compared runtime overhead using the same hardware settings and found that DeepInfer is 3.27 times faster than SelfChecker.
△ Less
Submitted 25 January, 2024;
originally announced January 2024.
-
Mutation-based Fault Localization of Deep Neural Networks
Authors:
Ali Ghanbari,
Deepak-George Thomas,
Muhammad Arbab Arshad,
Hridesh Rajan
Abstract:
Deep neural networks (DNNs) are susceptible to bugs, just like other types of software systems. A significant uptick in using DNN, and its applications in wide-ranging areas, including safety-critical systems, warrant extensive research on software engineering tools for improving the reliability of DNN-based systems. One such tool that has gained significant attention in the recent years is DNN fa…
▽ More
Deep neural networks (DNNs) are susceptible to bugs, just like other types of software systems. A significant uptick in using DNN, and its applications in wide-ranging areas, including safety-critical systems, warrant extensive research on software engineering tools for improving the reliability of DNN-based systems. One such tool that has gained significant attention in the recent years is DNN fault localization. This paper revisits mutation-based fault localization in the context of DNN models and proposes a novel technique, named deepmufl, applicable to a wide range of DNN models. We have implemented deepmufl and have evaluated its effectiveness using 109 bugs obtained from StackOverflow. Our results show that deepmufl detects 53/109 of the bugs by ranking the buggy layer in top-1 position, outperforming state-of-the-art static and dynamic DNN fault localization systems that are also designed to target the class of bugs supported by deepmufl. Moreover, we observed that we can halve the fault localization time for a pre-trained model using mutation selection, yet losing only 7.55% of the bugs localized in top-1 position.
△ Less
Submitted 10 September, 2023;
originally announced September 2023.
-
What Kinds of Contracts Do ML APIs Need?
Authors:
Samantha Syeda Khairunnesa,
Shibbir Ahmed,
Sayem Mohammad Imtiaz,
Hridesh Rajan,
Gary T. Leavens
Abstract:
Recent work has shown that Machine Learning (ML) programs are error-prone and called for contracts for ML code. Contracts, as in the design by contract methodology, help document APIs and aid API users in writing correct code. The question is: what kinds of contracts would provide the most help to API users? We are especially interested in what kinds of contracts help API users catch errors at ear…
▽ More
Recent work has shown that Machine Learning (ML) programs are error-prone and called for contracts for ML code. Contracts, as in the design by contract methodology, help document APIs and aid API users in writing correct code. The question is: what kinds of contracts would provide the most help to API users? We are especially interested in what kinds of contracts help API users catch errors at earlier stages in the ML pipeline. We describe an empirical study of posts on Stack Overflow of the four most often-discussed ML libraries: TensorFlow, Scikit-learn, Keras, and PyTorch. For these libraries, our study extracted 413 informal (English) API specifications. We used these specifications to understand the following questions. What are the root causes and effects behind ML contract violations? Are there common patterns of ML contract violations? When does understanding ML contracts require an advanced level of ML software expertise? Could checking contracts at the API level help detect the violations in early ML pipeline stages? Our key findings are that the most commonly needed contracts for ML APIs are either checking constraints on single arguments of an API or on the order of API calls. The software engineering community could employ existing contract mining approaches to mine these contracts to promote an increased understanding of ML APIs. We also noted a need to combine behavioral and temporal contract mining approaches. We report on categories of required ML contracts, which may help designers of contract languages.
△ Less
Submitted 26 July, 2023;
originally announced July 2023.
-
An Effective Data-Driven Approach for Localizing Deep Learning Faults
Authors:
Mohammad Wardat,
Breno Dantas Cruz,
Wei Le,
Hridesh Rajan
Abstract:
Deep Learning (DL) applications are being used to solve problems in critical domains (e.g., autonomous driving or medical diagnosis systems). Thus, developers need to debug their systems to ensure that the expected behavior is delivered. However, it is hard and expensive to debug DNNs. When the failure symptoms or unsatisfied accuracies are reported after training, we lose the traceability as to w…
▽ More
Deep Learning (DL) applications are being used to solve problems in critical domains (e.g., autonomous driving or medical diagnosis systems). Thus, developers need to debug their systems to ensure that the expected behavior is delivered. However, it is hard and expensive to debug DNNs. When the failure symptoms or unsatisfied accuracies are reported after training, we lose the traceability as to which part of the DNN program is responsible for the failure. Even worse, sometimes, a deep learning program has different types of bugs. To address the challenges of debugging DNN models, we propose a novel data-driven approach that leverages model features to learn problem patterns. Our approach extracts these features, which represent semantic information of faults during DNN training. Our technique uses these features as a training dataset to learn and infer DNN fault patterns. Also, our methodology automatically links bug symptoms to their root causes, without the need for manually crafted map**s, so that developers can take the necessary steps to fix faults. We evaluate our approach using real-world and mutated models. Our results demonstrate that our technique can effectively detect and diagnose different bug types. Finally, our technique achieved better accuracy, precision, and recall than prior work for mutated models. Also, our approach achieved comparable results for real-world models in terms of accuracy and performance to the state-of-the-art.
△ Less
Submitted 17 July, 2023;
originally announced July 2023.
-
Fix Fairness, Don't Ruin Accuracy: Performance Aware Fairness Repair using AutoML
Authors:
Giang Nguyen,
Sumon Biswas,
Hridesh Rajan
Abstract:
Machine learning (ML) is increasingly being used in critical decision-making software, but incidents have raised questions about the fairness of ML predictions. To address this issue, new tools and methods are needed to mitigate bias in ML-based software. Previous studies have proposed bias mitigation algorithms that only work in specific situations and often result in a loss of accuracy. Our prop…
▽ More
Machine learning (ML) is increasingly being used in critical decision-making software, but incidents have raised questions about the fairness of ML predictions. To address this issue, new tools and methods are needed to mitigate bias in ML-based software. Previous studies have proposed bias mitigation algorithms that only work in specific situations and often result in a loss of accuracy. Our proposed solution is a novel approach that utilizes automated machine learning (AutoML) techniques to mitigate bias. Our approach includes two key innovations: a novel optimization function and a fairness-aware search space. By improving the default optimization function of AutoML and incorporating fairness objectives, we are able to mitigate bias with little to no loss of accuracy. Additionally, we propose a fairness-aware search space pruning method for AutoML to reduce computational cost and repair time. Our approach, built on the state-of-the-art Auto-Sklearn tool, is designed to reduce bias in real-world scenarios. In order to demonstrate the effectiveness of our approach, we evaluated our approach on four fairness problems and 16 different ML models, and our results show a significant improvement over the baseline and existing bias mitigation techniques. Our approach, Fair-AutoML, successfully repaired 60 out of 64 buggy cases, while existing bias mitigation techniques only repaired up to 44 out of 64 cases.
△ Less
Submitted 28 August, 2023; v1 submitted 15 June, 2023;
originally announced June 2023.
-
Characterizing Bugs in Python and R Data Analytics Programs
Authors:
Shibbir Ahmed,
Mohammad Wardat,
Hamid Bagheri,
Breno Dantas Cruz,
Hridesh Rajan
Abstract:
R and Python are among the most popular languages used in many critical data analytics tasks. However, we still do not fully understand the capabilities of these two languages w.r.t. bugs encountered in data analytics tasks. What type of bugs are common? What are the main root causes? What is the relation between bugs and root causes? How to mitigate these bugs? We present a comprehensive study of…
▽ More
R and Python are among the most popular languages used in many critical data analytics tasks. However, we still do not fully understand the capabilities of these two languages w.r.t. bugs encountered in data analytics tasks. What type of bugs are common? What are the main root causes? What is the relation between bugs and root causes? How to mitigate these bugs? We present a comprehensive study of 5,068 Stack Overflow posts, 1,800 bug fix commits from GitHub repositories, and several GitHub issues of the most used libraries to understand bugs in R and Python. Our key findings include: while both R and Python have bugs due to inexperience with data analysis, Python see significantly larger data preprocessing bugs compared to R. Developers experience significantly more data flow bugs in R because intermediate results are often implicit. We also found changes and bugs in packages and libraries cause more bugs in R compared to Python while package or library misselection and conflicts cause more bugs in Python than R. While R has a slightly higher readability barrier for data analysts, the statistical power of R leads to a less number of bad performance bugs. In terms of data visualization, R packages have significantly more bugs than Python libraries. We also identified a strong correlation between comparable packages in R and Python despite their linguistic and methodological differences. Lastly, we contribute a large dataset of manually verified R and Python bugs.
△ Less
Submitted 14 June, 2023;
originally announced June 2023.
-
Fairify: Fairness Verification of Neural Networks
Authors:
Sumon Biswas,
Hridesh Rajan
Abstract:
Fairness of machine learning (ML) software has become a major concern in the recent past. Although recent research on testing and improving fairness have demonstrated impact on real-world software, providing fairness guarantee in practice is still lacking. Certification of ML models is challenging because of the complex decision-making process of the models. In this paper, we proposed Fairify, an…
▽ More
Fairness of machine learning (ML) software has become a major concern in the recent past. Although recent research on testing and improving fairness have demonstrated impact on real-world software, providing fairness guarantee in practice is still lacking. Certification of ML models is challenging because of the complex decision-making process of the models. In this paper, we proposed Fairify, an SMT-based approach to verify individual fairness property in neural network (NN) models. Individual fairness ensures that any two similar individuals get similar treatment irrespective of their protected attributes e.g., race, sex, age. Verifying this fairness property is hard because of the global checking and non-linear computation nodes in NN. We proposed sound approach to make individual fairness verification tractable for the developers. The key idea is that many neurons in the NN always remain inactive when a smaller part of the input domain is considered. So, Fairify leverages whitebox access to the models in production and then apply formal analysis based pruning. Our approach adopts input partitioning and then prunes the NN for each partition to provide fairness certification or counterexample. We leveraged interval arithmetic and activation heuristic of the neurons to perform the pruning as necessary. We evaluated Fairify on 25 real-world neural networks collected from four different sources, and demonstrated the effectiveness, scalability and performance over baseline and closely related work. Fairify is also configurable based on the domain and size of the NN. Our novel formulation of the problem can answer targeted verification queries with relaxations and counterexamples, which have practical implications.
△ Less
Submitted 13 December, 2022; v1 submitted 8 December, 2022;
originally announced December 2022.
-
Decomposing a Recurrent Neural Network into Modules for Enabling Reusability and Replacement
Authors:
Sayem Mohammad Imtiaz,
Fraol Batole,
Astha Singh,
Rangeet Pan,
Breno Dantas Cruz,
Hridesh Rajan
Abstract:
Can we take a recurrent neural network (RNN) trained to translate between languages and augment it to support a new natural language without retraining the model from scratch? Can we fix the faulty behavior of the RNN by replacing portions associated with the faulty behavior? Recent works on decomposing a fully connected neural network (FCNN) and convolutional neural network (CNN) into modules hav…
▽ More
Can we take a recurrent neural network (RNN) trained to translate between languages and augment it to support a new natural language without retraining the model from scratch? Can we fix the faulty behavior of the RNN by replacing portions associated with the faulty behavior? Recent works on decomposing a fully connected neural network (FCNN) and convolutional neural network (CNN) into modules have shown the value of engineering deep models in this manner, which is standard in traditional SE but foreign for deep learning models. However, prior works focus on the image-based multiclass classification problems and cannot be applied to RNN due to (a) different layer structures, (b) loop structures, (c) different types of input-output architectures, and (d) usage of both nonlinear and logistic activation functions. In this work, we propose the first approach to decompose an RNN into modules. We study different types of RNNs, i.e., Vanilla, LSTM, and GRU. Further, we show how such RNN modules can be reused and replaced in various scenarios. We evaluate our approach against 5 canonical datasets (i.e., Math QA, Brown Corpus, Wiki-toxicity, Clinc OOS, and Tatoeba) and 4 model variants for each dataset. We found that decomposing a trained model has a small cost (Accuracy: -0.6%, BLEU score: +0.10%). Also, the decomposed modules can be reused and replaced without needing to retrain.
△ Less
Submitted 9 February, 2023; v1 submitted 8 December, 2022;
originally announced December 2022.
-
Towards Understanding Fairness and its Composition in Ensemble Machine Learning
Authors:
Usman Gohar,
Sumon Biswas,
Hridesh Rajan
Abstract:
Machine Learning (ML) software has been widely adopted in modern society, with reported fairness implications for minority groups based on race, sex, age, etc. Many recent works have proposed methods to measure and mitigate algorithmic bias in ML models. The existing approaches focus on single classifier-based ML models. However, real-world ML models are often composed of multiple independent or d…
▽ More
Machine Learning (ML) software has been widely adopted in modern society, with reported fairness implications for minority groups based on race, sex, age, etc. Many recent works have proposed methods to measure and mitigate algorithmic bias in ML models. The existing approaches focus on single classifier-based ML models. However, real-world ML models are often composed of multiple independent or dependent learners in an ensemble (e.g., Random Forest), where the fairness composes in a non-trivial way. How does fairness compose in ensembles? What are the fairness impacts of the learners on the ultimate fairness of the ensemble? Can fair learners result in an unfair ensemble? Furthermore, studies have shown that hyperparameters influence the fairness of ML models. Ensemble hyperparameters are more complex since they affect how learners are combined in different categories of ensembles. Understanding the impact of ensemble hyperparameters on fairness will help programmers design fair ensembles. Today, we do not understand these fully for different ensemble algorithms. In this paper, we comprehensively study popular real-world ensembles: bagging, boosting, stacking and voting. We have developed a benchmark of 168 ensemble models collected from Kaggle on four popular fairness datasets. We use existing fairness metrics to understand the composition of fairness. Our results show that ensembles can be designed to be fairer without using mitigation techniques. We also identify the interplay between fairness composition and data characteristics to guide fair ensemble design. Finally, our benchmark can be leveraged for further research on fair ensembles. To the best of our knowledge, this is one of the first and largest studies on fairness composition in ensembles yet presented in the literature.
△ Less
Submitted 25 April, 2023; v1 submitted 8 December, 2022;
originally announced December 2022.
-
An Empirical Study on the Bugs Found while Reusing Pre-trained Natural Language Processing Models
Authors:
Rangeet Pan,
Sumon Biswas,
Mohna Chakraborty,
Breno Dantas Cruz,
Hridesh Rajan
Abstract:
In NLP, reusing pre-trained models instead of training from scratch has gained popularity; however, NLP models are mostly black boxes, very large, and often require significant resources. To ease, models trained with large corpora are made available, and developers reuse them for different problems. In contrast, developers mostly build their models from scratch for traditional DL-related problems.…
▽ More
In NLP, reusing pre-trained models instead of training from scratch has gained popularity; however, NLP models are mostly black boxes, very large, and often require significant resources. To ease, models trained with large corpora are made available, and developers reuse them for different problems. In contrast, developers mostly build their models from scratch for traditional DL-related problems. By doing so, they have control over the choice of algorithms, data processing, model structure, tuning hyperparameters, etc. Whereas in NLP, due to the reuse of the pre-trained models, NLP developers are limited to little to no control over such design decisions. They either apply tuning or transfer learning on pre-trained models to meet their requirements. Also, NLP models and their corresponding datasets are significantly larger than the traditional DL models and require heavy computation. Such reasons often lead to bugs in the system while reusing the pre-trained models. While bugs in traditional DL software have been intensively studied, the nature of extensive reuse and black-box structure motivates us to understand the different types of bugs that occur while reusing NLP models? What are the root causes of those bugs? How do these bugs affect the system? To answer these questions, We studied the bugs reported while reusing the 11 popular NLP models. We mined 9,214 issues from GitHub repositories and identified 984 bugs. We created a taxonomy with bug types, root causes, and impacts. Our observations led to several findings, including limited access to model internals resulting in a lack of robustness, lack of input validation leading to the propagation of algorithmic and data bias, and high-resource consumption causing more crashes, etc. Our observations suggest several bug patterns, which would greatly facilitate further efforts in reducing bugs in pre-trained models and code reuse.
△ Less
Submitted 30 November, 2022;
originally announced December 2022.
-
On Scheduling Ring-All-Reduce Learning Jobs in Multi-Tenant GPU Clusters with Communication Contention
Authors:
Menglu Yu,
Bo Ji,
Hridesh Rajan,
Jia Liu
Abstract:
Powered by advances in deep learning (DL) techniques, machine learning and artificial intelligence have achieved astonishing successes. However, the rapidly growing needs for DL also led to communication- and resource-intensive distributed training jobs for large-scale DL training, which are typically deployed over GPU clusters. To sustain the ever-increasing demand for DL training, the so-called…
▽ More
Powered by advances in deep learning (DL) techniques, machine learning and artificial intelligence have achieved astonishing successes. However, the rapidly growing needs for DL also led to communication- and resource-intensive distributed training jobs for large-scale DL training, which are typically deployed over GPU clusters. To sustain the ever-increasing demand for DL training, the so-called "ring-all-reduce" (RAR) technologies have recently emerged as a favorable computing architecture to efficiently process network communication and computation load in GPU clusters. The most salient feature of RAR is that it removes the need for dedicated parameter servers, thus alleviating the potential communication bottleneck. However, when multiple RAR-based DL training jobs are deployed over GPU clusters, communication bottlenecks could still occur due to contentions between DL training jobs. So far, there remains a lack of theoretical understanding on how to design contention-aware resource scheduling algorithms for RAR-based DL training jobs, which motivates us to fill this gap in this work. Our main contributions are three-fold: i) We develop a new analytical model that characterizes both communication overhead related to the worker distribution of the job and communication contention related to the co-location of different jobs; ii) Based on the proposed analytical model, we formulate the problem as a non-convex integer program to minimize the makespan of all RAR-based DL training jobs. To address the unique structure in this problem that is not amenable for optimization algorithm design, we reformulate the problem into an integer linear program that enables provable approximation algorithm design called SJF-BCO (Smallest Job First with Balanced Contention and Overhead); and iii) We conduct extensive experiments to show the superiority of SJF-BCO over existing schedulers.
△ Less
Submitted 14 August, 2022; v1 submitted 15 July, 2022;
originally announced July 2022.
-
GADGET: Online Resource Optimization for Scheduling Ring-All-Reduce Learning Jobs
Authors:
Menglu Yu,
Ye Tian,
Bo Ji,
Chuan Wu,
Hridesh Rajan,
Jia Liu
Abstract:
Fueled by advances in distributed deep learning (DDL), recent years have witnessed a rapidly growing demand for resource-intensive distributed/parallel computing to process DDL computing jobs. To resolve network communication bottleneck and load balancing issues in distributed computing, the so-called ``ring-all-reduce'' decentralized architecture has been increasingly adopted to remove the need f…
▽ More
Fueled by advances in distributed deep learning (DDL), recent years have witnessed a rapidly growing demand for resource-intensive distributed/parallel computing to process DDL computing jobs. To resolve network communication bottleneck and load balancing issues in distributed computing, the so-called ``ring-all-reduce'' decentralized architecture has been increasingly adopted to remove the need for dedicated parameter servers. To date, however, there remains a lack of theoretical understanding on how to design resource optimization algorithms for efficiently scheduling ring-all-reduce DDL jobs in computing clusters. This motivates us to fill this gap by proposing a series of new resource scheduling designs for ring-all-reduce DDL jobs. Our contributions in this paper are three-fold: i) We propose a new resource scheduling analytical model for ring-all-reduce deep learning, which covers a wide range of objectives in DDL performance optimization (e.g., excessive training avoidance, energy efficiency, fairness); ii) Based on the proposed performance analytical model, we develop an efficient resource scheduling algorithm called GADGET (greedy ring-all-reduce distributed graph embedding technique), which enjoys a provable strong performance guarantee; iii) We conduct extensive trace-driven experiments to demonstrate the effectiveness of the GADGET approach and its superiority over the state of the art.
△ Less
Submitted 2 February, 2022;
originally announced February 2022.
-
DeepDiagnosis: Automatically Diagnosing Faults and Recommending Actionable Fixes in Deep Learning Programs
Authors:
Mohammad Wardat,
Breno Dantas Cruz,
Wei Le,
Hridesh Rajan
Abstract:
Deep Neural Networks (DNNs) are used in a wide variety of applications. However, as in any software application, DNN-based apps are afflicted with bugs. Previous work observed that DNN bug fix patterns are different from traditional bug fix patterns. Furthermore, those buggy models are non-trivial to diagnose and fix due to inexplicit errors with several options to fix them. To support developers…
▽ More
Deep Neural Networks (DNNs) are used in a wide variety of applications. However, as in any software application, DNN-based apps are afflicted with bugs. Previous work observed that DNN bug fix patterns are different from traditional bug fix patterns. Furthermore, those buggy models are non-trivial to diagnose and fix due to inexplicit errors with several options to fix them. To support developers in locating and fixing bugs, we propose DeepDiagnosis, a novel debugging approach that localizes the faults, reports error symptoms and suggests fixes for DNN programs. In the first phase, our technique monitors a training model, periodically checking for eight types of error conditions. Then, in case of problems, it reports messages containing sufficient information to perform actionable repairs to the model. In the evaluation, we thoroughly examine 444 models -53 real-world from GitHub and Stack Overflow, and 391 curated by AUTOTRAINER. DeepDiagnosis provides superior accuracy when compared to UMLUAT and DeepLocalize. Our technique is faster than AUTOTRAINER for fault localization. The results show that our approach can support additional types of models, while state-of-the-art was only able to handle classification ones. Our technique was able to report bugs that do not manifest as numerical errors during training. Also, it can provide actionable insights for fix whereas DeepLocalize can only report faults that lead to numerical errors during training. DeepDiagnosis manifests the best capabilities of fault detection, bug localization, and symptoms identification when compared to other approaches.
△ Less
Submitted 7 December, 2021;
originally announced December 2021.
-
Manas: Mining Software Repositories to Assist AutoML
Authors:
Giang Nguyen,
Md Johir Islam,
Rangeet Pan,
Hridesh Rajan
Abstract:
Today deep learning is widely used for building software. A software engineering problem with deep learning is that finding an appropriate convolutional neural network (CNN) model for the task can be a challenge for developers. Recent work on AutoML, more precisely neural architecture search (NAS), embodied by tools like Auto-Keras aims to solve this problem by essentially viewing it as a search p…
▽ More
Today deep learning is widely used for building software. A software engineering problem with deep learning is that finding an appropriate convolutional neural network (CNN) model for the task can be a challenge for developers. Recent work on AutoML, more precisely neural architecture search (NAS), embodied by tools like Auto-Keras aims to solve this problem by essentially viewing it as a search problem where the starting point is a default CNN model, and mutation of this CNN model allows exploration of the space of CNN models to find a CNN model that will work best for the problem. These works have had significant success in producing high-accuracy CNN models. There are two problems, however. First, NAS can be very costly, often taking several hours to complete. Second, CNN models produced by NAS can be very complex that makes it harder to understand them and costlier to train them. We propose a novel approach for NAS, where instead of starting from a default CNN model, the initial model is selected from a repository of models extracted from GitHub. The intuition being that developers solving a similar problem may have developed a better starting point compared to the default model. We also analyze common layer patterns of CNN models in the wild to understand changes that the developers make to improve their models. Our approach uses commonly occurring changes as mutation operators in NAS. We have extended Auto-Keras to implement our approach. Our evaluation using 8 top voted problems from Kaggle for tasks including image classification and image regression shows that given the same search time, without loss of accuracy, Manas produces models with 42.9% to 99.6% fewer number of parameters than Auto-Keras' models. Benchmarked on GPU, Manas' models train 30.3% to 641.6% faster than Auto-Keras' models.
△ Less
Submitted 13 February, 2022; v1 submitted 6 December, 2021;
originally announced December 2021.
-
The Art and Practice of Data Science Pipelines: A Comprehensive Study of Data Science Pipelines In Theory, In-The-Small, and In-The-Large
Authors:
Sumon Biswas,
Mohammad Wardat,
Hridesh Rajan
Abstract:
Increasingly larger number of software systems today are including data science components for descriptive, predictive, and prescriptive analytics. The collection of data science stages from acquisition, to cleaning/curation, to modeling, and so on are referred to as data science pipelines. To facilitate research and practice on data science pipelines, it is essential to understand their nature. W…
▽ More
Increasingly larger number of software systems today are including data science components for descriptive, predictive, and prescriptive analytics. The collection of data science stages from acquisition, to cleaning/curation, to modeling, and so on are referred to as data science pipelines. To facilitate research and practice on data science pipelines, it is essential to understand their nature. What are the typical stages of a data science pipeline? How are they connected? Do the pipelines differ in the theoretical representations and that in the practice? Today we do not fully understand these architectural characteristics of data science pipelines. In this work, we present a three-pronged comprehensive study to answer this for the state-of-the-art, data science in-the-small, and data science in-the-large. Our study analyzes three datasets: a collection of 71 proposals for data science pipelines and related concepts in theory, a collection of over 105 implementations of curated data science pipelines from Kaggle competitions to understand data science in-the-small, and a collection of 21 mature data science projects from GitHub to understand data science in-the-large. Our study has led to three representations of data science pipelines that capture the essence of our subjects in theory, in-the-small, and in-the-large.
△ Less
Submitted 14 February, 2022; v1 submitted 2 December, 2021;
originally announced December 2021.
-
Decomposing Convolutional Neural Networks into Reusable and Replaceable Modules
Authors:
Rangeet Pan,
Hridesh Rajan
Abstract:
Training from scratch is the most common way to build a Convolutional Neural Network (CNN) based model. What if we can build new CNN models by reusing parts from previously build CNN models? What if we can improve a CNN model by replacing (possibly faulty) parts with other parts? In both cases, instead of training, can we identify the part responsible for each output class (module) in the model(s)…
▽ More
Training from scratch is the most common way to build a Convolutional Neural Network (CNN) based model. What if we can build new CNN models by reusing parts from previously build CNN models? What if we can improve a CNN model by replacing (possibly faulty) parts with other parts? In both cases, instead of training, can we identify the part responsible for each output class (module) in the model(s) and reuse or replace only the desired output classes to build a model? Prior work has proposed decomposing dense-based networks into modules (one for each output class) to enable reusability and replaceability in various scenarios. However, this work is limited to the dense layers and based on the one-to-one relationship between the nodes in consecutive layers. Due to the shared architecture in the CNN model, prior work cannot be adapted directly. In this paper, we propose to decompose a CNN model used for image classification problems into modules for each output class. These modules can further be reused or replaced to build a new model. We have evaluated our approach with CIFAR-10, CIFAR-100, and ImageNet tiny datasets with three variations of ResNet models and found that enabling decomposition comes with a small cost (1.77% and 0.85% for top-1 and top-5 accuracy, respectively). Also, building a model by reusing or replacing modules can be done with a 2.3% and 0.5% average loss of accuracy. Furthermore, reusing and replacing these modules reduces CO2e emission by ~37 times compared to training the model from scratch.
△ Less
Submitted 20 December, 2021; v1 submitted 11 October, 2021;
originally announced October 2021.
-
A global convergence theory for deep ReLU implicit networks via over-parameterization
Authors:
Tianxiang Gao,
Hailiang Liu,
Jia Liu,
Hridesh Rajan,
Hongyang Gao
Abstract:
Implicit deep learning has received increasing attention recently due to the fact that it generalizes the recursive prediction rules of many commonly used neural network architectures. Its prediction rule is provided implicitly based on the solution of an equilibrium equation. Although a line of recent empirical studies has demonstrated its superior performances, the theoretical understanding of i…
▽ More
Implicit deep learning has received increasing attention recently due to the fact that it generalizes the recursive prediction rules of many commonly used neural network architectures. Its prediction rule is provided implicitly based on the solution of an equilibrium equation. Although a line of recent empirical studies has demonstrated its superior performances, the theoretical understanding of implicit neural networks is limited. In general, the equilibrium equation may not be well-posed during the training. As a result, there is no guarantee that a vanilla (stochastic) gradient descent (SGD) training nonlinear implicit neural networks can converge. This paper fills the gap by analyzing the gradient flow of Rectified Linear Unit (ReLU) activated implicit neural networks. For an $m$-width implicit neural network with ReLU activation and $n$ training samples, we show that a randomly initialized gradient descent converges to a global minimum at a linear rate for the square loss function if the implicit neural network is \textit{over-parameterized}. It is worth noting that, unlike existing works on the convergence of (S)GD on finite-layer over-parameterized neural networks, our convergence results hold for implicit neural networks, where the number of layers is \textit{infinite}.
△ Less
Submitted 18 February, 2022; v1 submitted 11 October, 2021;
originally announced October 2021.
-
Fair Preprocessing: Towards Understanding Compositional Fairness of Data Transformers in Machine Learning Pipeline
Authors:
Sumon Biswas,
Hridesh Rajan
Abstract:
In recent years, many incidents have been reported where machine learning models exhibited discrimination among people based on race, sex, age, etc. Research has been conducted to measure and mitigate unfairness in machine learning models. For a machine learning task, it is a common practice to build a pipeline that includes an ordered set of data preprocessing stages followed by a classifier. How…
▽ More
In recent years, many incidents have been reported where machine learning models exhibited discrimination among people based on race, sex, age, etc. Research has been conducted to measure and mitigate unfairness in machine learning models. For a machine learning task, it is a common practice to build a pipeline that includes an ordered set of data preprocessing stages followed by a classifier. However, most of the research on fairness has considered a single classifier based prediction task. What are the fairness impacts of the preprocessing stages in machine learning pipeline? Furthermore, studies showed that often the root cause of unfairness is ingrained in the data itself, rather than the model. But no research has been conducted to measure the unfairness caused by a specific transformation made in the data preprocessing stage. In this paper, we introduced the causal method of fairness to reason about the fairness impact of data preprocessing stages in ML pipeline. We leveraged existing metrics to define the fairness measures of the stages. Then we conducted a detailed fairness evaluation of the preprocessing stages in 37 pipelines collected from three different sources. Our results show that certain data transformers are causing the model to exhibit unfairness. We identified a number of fairness patterns in several categories of data transformers. Finally, we showed how the local fairness of a preprocessing stage composes in the global fairness of the pipeline. We used the fairness composition to choose appropriate downstream transformer that mitigates unfairness in the machine learning pipeline.
△ Less
Submitted 19 July, 2021; v1 submitted 2 June, 2021;
originally announced June 2021.
-
DeepLocalize: Fault Localization for Deep Neural Networks
Authors:
Mohammad Wardat,
Wei Le,
Hridesh Rajan
Abstract:
Deep neural networks (DNNs) are becoming an integral part of most software systems. Previous work has shown that DNNs have bugs. Unfortunately, existing debugging techniques do not support localizing DNN bugs because of the lack of understanding of model behaviors. The entire DNN model appears as a black box. To address these problems, we propose an approach that automatically determines whether t…
▽ More
Deep neural networks (DNNs) are becoming an integral part of most software systems. Previous work has shown that DNNs have bugs. Unfortunately, existing debugging techniques do not support localizing DNN bugs because of the lack of understanding of model behaviors. The entire DNN model appears as a black box. To address these problems, we propose an approach that automatically determines whether the model is buggy or not, and identifies the root causes. Our key insight is that historic trends in values propagated between layers can be analyzed to identify faults, and localize faults. To that end, we first enable dynamic analysis of deep learning applications: by converting it into an imperative representation and alternatively using a callback mechanism. Both mechanisms allows us to insert probes that enable dynamic analysis over the traces produced by the DNN while it is being trained on the training data. We then conduct dynamic analysis over the traces to identify the faulty layer that causes the error. We propose an algorithm for identifying root causes by capturing any numerical error and monitoring the model during training and finding the relevance of every layer on the DNN outcome. We have collected a benchmark containing 40 buggy models and patches that contain real errors in deep learning applications from Stack Overflow and GitHub. Our benchmark can be used to evaluate automated debugging tools and repair techniques. We have evaluated our approach using this DNN bug-and-patch benchmark, and the results showed that our approach is much more effective than the existing debugging approach used in the state of the practice Keras library. For 34 out of 40 cases, our approach was able to detect faults whereas the best debugging approach provided by Keras detected 32 out of 40 faults. Our approach was able to localize 21 out of 40 bugs whereas Keras did not localize any faults.
△ Less
Submitted 4 March, 2021;
originally announced March 2021.
-
Do the Machine Learning Models on a Crowd Sourced Platform Exhibit Bias? An Empirical Study on Model Fairness
Authors:
Sumon Biswas,
Hridesh Rajan
Abstract:
Machine learning models are increasingly being used in important decision-making software such as approving bank loans, recommending criminal sentencing, hiring employees, and so on. It is important to ensure the fairness of these models so that no discrimination is made based on protected attribute (e.g., race, sex, age) while decision making. Algorithms have been developed to measure unfairness…
▽ More
Machine learning models are increasingly being used in important decision-making software such as approving bank loans, recommending criminal sentencing, hiring employees, and so on. It is important to ensure the fairness of these models so that no discrimination is made based on protected attribute (e.g., race, sex, age) while decision making. Algorithms have been developed to measure unfairness and mitigate them to a certain extent. In this paper, we have focused on the empirical evaluation of fairness and mitigations on real-world machine learning models. We have created a benchmark of 40 top-rated models from Kaggle used for 5 different tasks, and then using a comprehensive set of fairness metrics, evaluated their fairness. Then, we have applied 7 mitigation techniques on these models and analyzed the fairness, mitigation results, and impacts on performance. We have found that some model optimization techniques result in inducing unfairness in the models. On the other hand, although there are some fairness control mechanisms in machine learning libraries, they are not documented. The mitigation algorithm also exhibit common patterns such as mitigation in the post-processing is often costly (in terms of performance) and mitigation in the pre-processing stage is preferred in most cases. We have also presented different trade-off choices of fairness mitigation decisions. Our study suggests future research directions to reduce the gap between theoretical fairness aware algorithms and the software engineering methods to leverage them in practice.
△ Less
Submitted 22 September, 2020; v1 submitted 21 May, 2020;
originally announced May 2020.
-
BCFA: Bespoke Control Flow Analysis for CFA at Scale
Authors:
Ramanathan Ramu,
Ganesha B Upadhyaya,
Hoan Anh Nguyen,
Hridesh Rajan
Abstract:
Many data-driven software engineering tasks such as discovering programming patterns, mining API specifications, etc., perform source code analysis over control flow graphs (CFGs) at scale. Analyzing millions of CFGs can be expensive and performance of the analysis heavily depends on the underlying CFG traversal strategy. State-of-the-art analysis frameworks use a fixed traversal strategy. We argu…
▽ More
Many data-driven software engineering tasks such as discovering programming patterns, mining API specifications, etc., perform source code analysis over control flow graphs (CFGs) at scale. Analyzing millions of CFGs can be expensive and performance of the analysis heavily depends on the underlying CFG traversal strategy. State-of-the-art analysis frameworks use a fixed traversal strategy. We argue that a single traversal strategy does not fit all kinds of analyses and CFGs and propose bespoke control flow analysis (BCFA). Given a control flow analysis (CFA) and a large number of CFGs, BCFA selects the most efficient traversal strategy for each CFG. BCFA extracts a set of properties of the CFA by analyzing the code of the CFA and combines it with properties of the CFG, such as branching factor and cyclicity, for selecting the optimal traversal strategy. We have implemented BCFA in Boa, and evaluated BCFA using a set of representative static analyses that mainly involve traversing CFGs and two large datasets containing 287 thousand and 162 million CFGs. Our results show that BCFA can speedup the large scale analyses by 1%-28%. Further, BCFA has low overheads; less than 0.2%, and low misprediction rate; less than 0.01%.
△ Less
Submitted 3 May, 2020;
originally announced May 2020.
-
Repairing Deep Neural Networks: Fix Patterns and Challenges
Authors:
Md Johirul Islam,
Rangeet Pan,
Giang Nguyen,
Hridesh Rajan
Abstract:
Significant interest in applying Deep Neural Network (DNN) has fueled the need to support engineering of software that uses DNNs. Repairing software that uses DNNs is one such unmistakable SE need where automated tools could be beneficial; however, we do not fully understand challenges to repairing and patterns that are utilized when manually repairing DNNs. What challenges should automated repair…
▽ More
Significant interest in applying Deep Neural Network (DNN) has fueled the need to support engineering of software that uses DNNs. Repairing software that uses DNNs is one such unmistakable SE need where automated tools could be beneficial; however, we do not fully understand challenges to repairing and patterns that are utilized when manually repairing DNNs. What challenges should automated repair tools address? What are the repair patterns whose automation could help developers? Which repair patterns should be assigned a higher priority for building automated bug repair tools? This work presents a comprehensive study of bug fix patterns to address these questions. We have studied 415 repairs from Stack overflow and 555 repairs from Github for five popular deep learning libraries Caffe, Keras, Tensorflow, Theano, and Torch to understand challenges in repairs and bug repair patterns. Our key findings reveal that DNN bug fix patterns are distinctive compared to traditional bug fix patterns; the most common bug fix patterns are fixing data dimension and neural network connectivity; DNN bug fixes have the potential to introduce adversarial vulnerabilities; DNN bug fixes frequently introduce new bugs; and DNN bug localization, reuse of trained model, and co** with frequent releases are major challenges faced by developers when fixing bugs. We also contribute a benchmark of 667 DNN (bug, repair) instances.
△ Less
Submitted 2 May, 2020;
originally announced May 2020.
-
What Do Developers Ask About ML Libraries? A Large-scale Study Using Stack Overflow
Authors:
Md Johirul Islam,
Hoan Anh Nguyen,
Rangeet Pan,
Hridesh Rajan
Abstract:
Modern software systems are increasingly including machine learning (ML) as an integral component. However, we do not yet understand the difficulties faced by software developers when learning about ML libraries and using them within their systems. To that end, this work reports on a detailed (manual) examination of 3,243 highly-rated Q&A posts related to ten ML libraries, namely Tensorflow, Keras…
▽ More
Modern software systems are increasingly including machine learning (ML) as an integral component. However, we do not yet understand the difficulties faced by software developers when learning about ML libraries and using them within their systems. To that end, this work reports on a detailed (manual) examination of 3,243 highly-rated Q&A posts related to ten ML libraries, namely Tensorflow, Keras, scikit-learn, Weka, Caffe, Theano, MLlib, Torch, Mahout, and H2O, on Stack Overflow, a popular online technical Q&A forum. We classify these questions into seven typical stages of an ML pipeline to understand the correlation between the library and the stage. Then we study the questions and perform statistical analysis to explore the answer to four research objectives (finding the most difficult stage, understanding the nature of problems, nature of libraries and studying whether the difficulties stayed consistent over time). Our findings reveal the urgent need for software engineering (SE) research in this area. Both static and dynamic analyses are mostly absent and badly needed to help developers find errors earlier. While there has been some early research on debugging, much more work is needed. API misuses are prevalent and API design improvements are sorely needed. Last and somewhat surprisingly, a tug of war between providing higher levels of abstractions and the need to understand the behavior of the trained model is prevalent.
△ Less
Submitted 27 June, 2019;
originally announced June 2019.
-
A Comprehensive Study on Deep Learning Bug Characteristics
Authors:
Md Johirul Islam,
Giang Nguyen,
Rangeet Pan,
Hridesh Rajan
Abstract:
Deep learning has gained substantial popularity in recent years. Developers mainly rely on libraries and tools to add deep learning capabilities to their software. What kinds of bugs are frequently found in such software? What are the root causes of such bugs? What impacts do such bugs have? Which stages of deep learning pipeline are more bug prone? Are there any antipatterns? Understanding such c…
▽ More
Deep learning has gained substantial popularity in recent years. Developers mainly rely on libraries and tools to add deep learning capabilities to their software. What kinds of bugs are frequently found in such software? What are the root causes of such bugs? What impacts do such bugs have? Which stages of deep learning pipeline are more bug prone? Are there any antipatterns? Understanding such characteristics of bugs in deep learning software has the potential to foster the development of better deep learning platforms, debugging mechanisms, development practices, and encourage the development of analysis and verification frameworks. Therefore, we study 2716 high-quality posts from Stack Overflow and 500 bug fix commits from Github about five popular deep learning libraries Caffe, Keras, Tensorflow, Theano, and Torch to understand the types of bugs, root causes of bugs, impacts of bugs, bug-prone stage of deep learning pipeline as well as whether there are some common antipatterns found in this buggy software. The key findings of our study include: data bug and logic bug are the most severe bug types in deep learning software appearing more than 48% of the times, major root causes of these bugs are Incorrect Model Parameter (IPS) and Structural Inefficiency (SI) showing up more than 43% of the times. We have also found that the bugs in the usage of deep learning libraries have some common antipatterns that lead to a strong correlation of bug types among the libraries.
△ Less
Submitted 3 June, 2019;
originally announced June 2019.
-
Identifying Classes Susceptible to Adversarial Attacks
Authors:
Rangeet Pan,
Md Johirul Islam,
Shibbir Ahmed,
Hridesh Rajan
Abstract:
Despite numerous attempts to defend deep learning based image classifiers, they remain susceptible to the adversarial attacks. This paper proposes a technique to identify susceptible classes, those classes that are more easily subverted. To identify the susceptible classes we use distance-based measures and apply them on a trained model. Based on the distance among original classes, we create mapp…
▽ More
Despite numerous attempts to defend deep learning based image classifiers, they remain susceptible to the adversarial attacks. This paper proposes a technique to identify susceptible classes, those classes that are more easily subverted. To identify the susceptible classes we use distance-based measures and apply them on a trained model. Based on the distance among original classes, we create map** among original classes and adversarial classes that helps to reduce the randomness of a model to a significant amount in an adversarial setting. We analyze the high dimensional geometry among the feature classes and identify the k most susceptible target classes in an adversarial attack. We conduct experiments using MNIST, Fashion MNIST, CIFAR-10 (ImageNet and ResNet-32) datasets. Finally, we evaluate our techniques in order to determine which distance-based measure works best and how the randomness of a model changes with perturbation.
△ Less
Submitted 30 May, 2019;
originally announced May 2019.
-
Inferring Concise Specifications of APIs
Authors:
John L. Singleton,
Gary T. Leavens,
Hridesh Rajan,
David R. Cok
Abstract:
Modern software relies on libraries and uses them via application programming interfaces (APIs). Correct API usage as well as many software engineering tasks are enabled when APIs have formal specifications. In this work, we analyze the implementation of each method in an API to infer a formal postcondition. Conventional wisdom is that, if one has preconditions, then one can use the strongest post…
▽ More
Modern software relies on libraries and uses them via application programming interfaces (APIs). Correct API usage as well as many software engineering tasks are enabled when APIs have formal specifications. In this work, we analyze the implementation of each method in an API to infer a formal postcondition. Conventional wisdom is that, if one has preconditions, then one can use the strongest postcondition predicate transformer (SP) to infer postconditions. However, SP yields postconditions that are exponentially large, which makes them difficult to use, either by humans or by tools. Our key idea is an algorithm that converts such exponentially large specifications into a form that is more concise and thus more usable. This is done by leveraging the structure of the specifications that result from the use of SP. We applied our technique to infer postconditions for over 2,300 methods in seven popular Java libraries. Our technique was able to infer specifications for 75.7% of these methods, each of which was verified using an Extended Static Checker. We also found that 84.6% of resulting specifications were less than 1/4 page (20 lines) in length. Our technique was able to reduce the length of SMT proofs needed for verifying implementations by 76.7% and reduced prover execution time by 26.7%.
△ Less
Submitted 16 May, 2019;
originally announced May 2019.
-
A Cyberinfrastructure for BigData Transportation Engineering
Authors:
Md Johirul Islam,
Anuj Sharma,
Hridesh Rajan
Abstract:
Big Data-driven transportation engineering has the potential to improve utilization of road infrastructure, decrease traffic fatalities, improve fuel consumption, decrease construction worker injuries, among others. Despite these benefits, research on Big Data-driven transportation engineering is difficult today due to the computational expertise required to get started. This work proposes BoaT, a…
▽ More
Big Data-driven transportation engineering has the potential to improve utilization of road infrastructure, decrease traffic fatalities, improve fuel consumption, decrease construction worker injuries, among others. Despite these benefits, research on Big Data-driven transportation engineering is difficult today due to the computational expertise required to get started. This work proposes BoaT, a transportation-specific programming language, and it's Big Data infrastructure that is aimed at decreasing this barrier to entry. Our evaluation that uses over two dozen research questions from six categories show that research is easier to realize as a BoaT computer program, an order of magnitude faster when this program is run, and exhibits 12-14x decrease in storage requirements.
△ Less
Submitted 30 April, 2018;
originally announced May 2018.