-
Can ChatGPT Support Developers? An Empirical Evaluation of Large Language Models for Code Generation
Authors:
Kailun **,
Chung-Yu Wang,
Hung Viet Pham,
Hadi Hemmati
Abstract:
Large language models (LLMs) have demonstrated notable proficiency in code generation, with numerous prior studies showing their promising capabilities in various development scenarios. However, these studies mainly provide evaluations in research settings, which leaves a significant gap in understanding how effectively LLMs can support developers in the real world. To address this, we conducted an empirical analysis of conversations in DevGPT, a dataset collected from developers' conversations with ChatGPT (captured with the Share Link feature on platforms such as GitHub). Our empirical findings indicate that the current practice of using LLM-generated code is typically limited to either demonstrating high-level concepts or providing examples in documentation, rather than serving as production-ready code. These findings indicate that much future work is needed to improve LLMs in code generation before they can be integral parts of modern software development.
Submitted 16 March, 2024; v1 submitted 18 February, 2024;
originally announced February 2024.
-
I came, I saw, I certified: some perspectives on the safety assurance of cyber-physical systems
Authors:
Mithila Sivakumar,
Alvine B. Belle,
Kimya Khakzad Shahandashti,
Oluwafemi Odu,
Hadi Hemmati,
Segla Kpodjedo,
Song Wang,
Opeyemi O. Adesina
Abstract:
The execution failure of cyber-physical systems (e.g., autonomous driving systems, unmanned aerial systems, and robotic systems) could result in the loss of life, severe injuries, large-scale environmental damage, property destruction, and major economic loss. Hence, such systems usually require a strong justification that they will effectively support critical requirements (e.g., safety, security, and reliability) for which they were designed. Thus, it is often mandatory to develop compelling assurance cases to support that justification and allow regulatory bodies to certify such systems. In such contexts, detecting assurance deficits, relying on patterns to improve the structure of assurance cases, improving existing assurance case notations, and (semi-)automating the generation of assurance cases are key to developing compelling assurance cases and fostering consumer acceptance. We therefore explore challenges related to such assurance enablers and outline some potential directions that could be explored to tackle them.
Submitted 29 January, 2024;
originally announced January 2024.
-
Log-based Anomaly Detection of Enterprise Software: An Empirical Study
Authors:
Nadun Wijesinghe,
Hadi Hemmati
Abstract:
Most enterprise applications use logging as a mechanism to diagnose anomalies, which could help with reducing system downtime. Anomaly detection using software execution logs has been explored in several prior studies, using both classical and deep neural network-based machine learning models. In recent years, the research has largely focused on using variations of sequence-based deep neural networks (e.g., Long Short-Term Memory and Transformer-based models) for log-based anomaly detection on open-source data. However, they have not been applied as often to industrial datasets. In addition, the studied open-source datasets are typically very large in size with logging statements that do not change much over time, which may not be the case with a dataset from an industrial service that is relatively new. In this paper, we evaluate several state-of-the-art anomaly detection models on an industrial dataset from our research partner, which is much smaller and more loosely structured than most large-scale open-source benchmark datasets. Results show that while all models are capable of detecting anomalies, certain models are better suited for less-structured datasets. We also see that model effectiveness changes when a common data leak associated with a random train-test split in some prior work is removed. A qualitative study of the defects' characteristics identified by the developers on the industrial dataset further shows strengths and weaknesses of the models in detecting different types of anomalies. Finally, we explore the effect of limited training data by gradually increasing the training set size, to evaluate whether model effectiveness depends on the training set size.
Submitted 31 October, 2023;
originally announced October 2023.
-
Prompt Engineering or Fine Tuning: An Empirical Assessment of Large Language Models in Automated Software Engineering Tasks
Authors:
Jiho Shin,
Clark Tang,
Tahmineh Mohati,
Maleknaz Nayebi,
Song Wang,
Hadi Hemmati
Abstract:
In this paper, we investigate the effectiveness of a state-of-the-art LLM, i.e., GPT-4, with three different prompt engineering techniques (i.e., basic prompting, in-context learning, and task-specific prompting) against 18 fine-tuned LLMs on three typical ASE tasks, i.e., code generation, code summarization, and code translation. Our quantitative analysis of these prompting strategies suggests that prompt engineering GPT-4 cannot necessarily and significantly outperform fine-tuning smaller/older LLMs in all three tasks. For comment generation, GPT-4 with the best prompting strategy (i.e., task-specific prompt) outperformed the first-ranked fine-tuned model by 8.33% points on average in BLEU. However, for code generation, the first-ranked fine-tuned model outperforms GPT-4 with best prompting by 16.61% and 28.3% points, on average in BLEU. For code translation, GPT-4 and fine-tuned baselines tie as they outperform each other on different translation tasks. To explore the impact of different prompting strategies, we conducted a user study with 27 graduate students and 10 industry practitioners. From our qualitative analysis, we find that GPT-4 with conversational prompts (i.e., when a human provides feedback and instructions back and forth with a model to achieve best results) showed drastic improvement compared to GPT-4 with automatic prompting strategies. Moreover, we observe that participants tend to request improvements, add more context, or give specific instructions as conversational prompts, which goes beyond typical and generic prompting strategies. Our study suggests that, in its current state, GPT-4 with conversational prompting has great potential for ASE tasks, but fully automated prompt engineering with no human in the loop requires more study and improvement.
Submitted 10 October, 2023;
originally announced October 2023.
-
Assessing Evaluation Metrics for Neural Test Oracle Generation
Authors:
Jiho Shin,
Hadi Hemmati,
Moshi Wei,
Song Wang
Abstract:
In this work, we revisit existing oracle generation studies plus ChatGPT to empirically investigate the current standing of their performance in both NLG-based and test adequacy metrics. Specifically, we train and run four state-of-the-art test oracle generation models on five NLG-based and two test adequacy metrics for our analysis. We apply two different correlation analyses between these two different sets of metrics. Surprisingly, we found no significant correlation between the NLG-based metrics and test adequacy metrics. For instance, oracles generated from ChatGPT on the project activemq-artemis had the highest performance on all the NLG-based metrics among the studied NOGs; however, it had the largest number of projects with a decrease in test adequacy metrics compared to all the studied NOGs. We further conduct a qualitative analysis to explore the reasons behind our observations. We found that oracles with high NLG-based metrics but low test adequacy metrics tend to have complex or multiple chained method invocations within the oracle's parameters, making it hard for the model to generate them completely, which affects the test adequacy metrics. On the other hand, oracles with low NLG-based metrics but high test adequacy metrics tend to call different assertion types or a different method that functions similarly to the ones in the ground truth. Overall, this work complements prior studies on test oracle generation with an extensive performance evaluation with both NLG and test adequacy metrics and provides guidelines for better assessment of deep learning applications in software test generation in the future.
Submitted 11 October, 2023;
originally announced October 2023.
-
Gray-box Adversarial Attack of Deep Reinforcement Learning-based Trading Agents
Authors:
Foozhan Ataiefard,
Hadi Hemmati
Abstract:
In recent years, deep reinforcement learning (Deep RL) has been successfully implemented as a smart agent in many systems such as complex games, self-driving cars, and chat-bots. One of the interesting use cases of Deep RL is its application as an automated stock trading agent. In general, any automated trading agent is prone to manipulations by adversaries in the trading environment. Thus, studying their robustness is vital for their success in practice. However, the typical mechanism to study RL robustness, which is based on white-box gradient-based adversarial sample generation techniques (like FGSM), is obsolete for this use case, since the models are protected behind secure international exchange APIs, such as NASDAQ. In this research, we demonstrate that a "gray-box" approach for attacking a Deep RL-based trading agent is possible by trading in the same stock market, with no extra access to the trading agent. In our proposed approach, an adversary agent uses a hybrid Deep Neural Network as its policy, consisting of convolutional layers and fully-connected layers. On average, over three simulated trading market configurations, the adversary policy proposed in this research is able to reduce the reward values by 214.17%, which results in reducing the potential profits of the baseline by 139.4%, ensemble method by 93.7%, and an automated trading software developed by our industrial partner by 85.5%, while consuming significantly less budget than the victims (427.77%, 187.16%, and 66.97%, respectively).
Submitted 25 September, 2023;
originally announced September 2023.
-
Method-Level Bug Severity Prediction using Source Code Metrics and LLMs
Authors:
Ehsan Mashhadi,
Hossein Ahmadvand,
Hadi Hemmati
Abstract:
In the past couple of decades, significant research efforts have been devoted to the prediction of software bugs. However, most existing work in this domain treats all bugs the same, which is not the case in practice. It is important for a defect prediction method to estimate the severity of the identified bugs so that the higher-severity ones get immediate attention. In this study, we investigate source code metrics, source code representation using large language models (LLMs), and their combination in predicting bug severity labels of two prominent datasets. We leverage several source code metrics at method-level granularity to train eight different machine-learning models. Our results suggest that Decision Tree and Random Forest models outperform other models on several of our evaluation metrics. We then use the pre-trained CodeBERT LLM to study the source code representations' effectiveness in predicting bug severity. CodeBERT fine-tuning improves the bug severity prediction results significantly, in the range of 29%-140% for several evaluation metrics, compared to the best classic prediction model on source code metrics. Finally, we integrate source code metrics into CodeBERT as an additional input, using our two proposed architectures, which both enhance the CodeBERT model effectiveness.
Submitted 6 September, 2023;
originally announced September 2023.
-
Domain Adaptation for Deep Unit Test Case Generation
Authors:
Jiho Shin,
Sepehr Hashtroudi,
Hadi Hemmati,
Song Wang
Abstract:
Recently, deep learning-based test case generation approaches have been proposed to automate the generation of unit test cases. In this study, we leverage Transformer-based code models to generate unit tests with the help of Domain Adaptation (DA) at a project level. Specifically, we use CodeT5, which is a relatively small language model trained on source code data, and fine-tune it on the test generation task; we then further fine-tune it on each target project's data to learn the project-specific knowledge (project-level DA). We use the Methods2test dataset to fine-tune CodeT5 for the test generation task and the Defects4j dataset for project-level domain adaptation and evaluation. We compare our approach with (a) CodeT5 fine-tuned on the test generation without DA, (b) the A3Test tool, and (c) GPT-4, on 5 projects from the Defects4j dataset. The results show that using DA can increase the line coverage of the generated tests by 18.62%, 19.88%, and 18.02% on average compared to the above (a), (b), and (c) baselines, respectively. The results also consistently show improvements using other metrics such as BLEU and CodeBLEU. In addition, we show that our approach can be seen as a complementary solution alongside existing search-based test generation tools such as EvoSuite, increasing the overall line coverage and mutation score by an average of 34.42% and 6.8%, respectively.
Submitted 19 January, 2024; v1 submitted 15 August, 2023;
originally announced August 2023.
-
A First Look at Fairness of Machine Learning Based Code Reviewer Recommendation
Authors:
Mohammad Mahdi Mohajer,
Alvine Boaye Belle,
Nima Shiri Harzevili,
Junjie Wang,
Hadi Hemmati,
Song Wang,
Zhen Ming Jiang
Abstract:
The fairness of machine learning (ML) approaches is critical to the reliability of modern artificial intelligence systems. Despite extensive study on this topic, the fairness of ML models in the software engineering (SE) domain has not been well explored yet. As a result, many ML-powered software systems, particularly those utilized in the software engineering community, continue to be prone to fairness issues. Taking one of the typical SE tasks, i.e., code reviewer recommendation, as a subject, this paper conducts the first study toward investigating the issue of fairness of ML applications in the SE domain. Our empirical study demonstrates that current state-of-the-art ML-based code reviewer recommendation techniques exhibit unfairness and discriminatory behaviors. Specifically, male reviewers get on average 7.25% more recommendations than female code reviewers compared to their distribution in the reviewer set. This paper also discusses the reasons why the studied ML-based code reviewer recommendation systems are unfair and provides solutions to mitigate the unfairness. Our study further indicates that the existing mitigation methods can enhance fairness by 100% in projects with a similar distribution of protected and privileged groups, but their effectiveness in improving fairness on imbalanced or skewed data is limited. Finally, we suggest a solution to overcome the drawbacks of existing mitigation techniques and tackle bias in datasets that are imbalanced or skewed.
Submitted 20 July, 2023;
originally announced July 2023.
-
FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categories and Test Code Repair
Authors:
Sakina Fatima,
Hadi Hemmati,
Lionel Briand
Abstract:
Flaky tests are problematic because they non-deterministically pass or fail for the same software version under test, causing confusion and wasting development effort. While machine learning models have been used to predict flakiness and its root causes, there is much less work on providing support to fix the problem. To address this gap, in this paper, we focus on predicting the type of fix that is required to remove flakiness and then repair the test code on that basis. We do this for a subset of flaky test cases where the root cause of flakiness is in the test case itself and not in the production code. Our key idea is to guide the repair process with additional knowledge about the test's flakiness in the form of its predicted fix category. Thus, we first propose a framework that automatically generates labeled datasets for 13 fix categories and trains models to predict the fix category of a flaky test by analyzing the test code only. Our experimental results using code models and few-shot learning show that we can correctly predict most of the fix categories. To show the usefulness of such fix category labels for automatically repairing flakiness, in addition to informing testers, we augment a Large Language Model (LLM) like GPT with such extra knowledge to ask the LLM for repair suggestions. The results show that our suggested fix category labels, complemented with in-context learning, significantly enhance the capability of GPT 3.5 Turbo in generating fixes for flaky tests. Based on the execution and analysis of a sample of GPT-repaired flaky tests, we estimate that a large percentage of such repairs (roughly between 70% and 90%) can be expected to pass. For the failing repaired tests, on average, 16% of the test code needs to be further changed for them to pass.
Submitted 19 May, 2024; v1 submitted 21 June, 2023;
originally announced July 2023.
-
A Systematic Literature Review of Explainable AI for Software Engineering
Authors:
Ahmad Haji Mohammadkhani,
Nitin Sai Bommi,
Mariem Daboussi,
Onkar Sabnis,
Chakkrit Tantithamthavorn,
Hadi Hemmati
Abstract:
Context: In recent years, leveraging machine learning (ML) techniques has become one of the main solutions to tackle many software engineering (SE) tasks in research studies (ML4SE). This has been achieved by utilizing state-of-the-art models that tend to be more complex and black-box, which has led to less explainable solutions that reduce trust in and uptake of ML4SE solutions by professionals in the industry.
Objective: One potential remedy is to offer explainable AI (XAI) methods to provide the missing explainability. In this paper, we aim to explore to what extent XAI has been studied in the SE community (XAI4SE) and provide a comprehensive view of the current state of the art as well as challenges and a roadmap for future work.
Method: We conduct a systematic literature review on 24 (out of 869 primary studies that were selected by keyword search) most relevant published studies in XAI4SE. We have three research questions that were answered by meta-analysis of the collected data per paper.
Results: Our study reveals that among the identified studies, software maintenance (68%), and particularly defect prediction, has the highest share among the SE stages and tasks being studied. Additionally, we found that XAI methods were mainly applied to classic ML models rather than more complex models. We also noticed a clear lack of standard evaluation metrics for XAI methods in the literature, which has caused confusion among researchers and a lack of benchmarks for comparisons.
Conclusions: XAI has been identified as a helpful tool by most studies, which we cover in the systematic review. However, XAI4SE is a relatively new domain with a lot of untapped potential, including the SE tasks to help with, the ML4SE methods to explain, and the types of explanations to offer. This study encourages researchers to work on the identified challenges and roadmap reported in the paper.
Submitted 12 February, 2023;
originally announced February 2023.
-
Improving Automated Program Repair with Domain Adaptation
Authors:
Armin Zirak,
Hadi Hemmati
Abstract:
Automated Program Repair (APR) is defined as the process of fixing a bug/defect in the source code by an automated tool. APR tools have recently achieved promising results by leveraging state-of-the-art Neural Language Processing (NLP) techniques. APR tools such as TFix and CodeXGLUE, which combine text-to-text transformers with software-specific techniques, currently outperform alternatives. However, in most APR studies the train and test sets are chosen from the same set of projects. In reality, however, APR models are meant to be generalizable to new and different projects. Therefore, there is a potential threat that reported APR models with high effectiveness perform poorly when the characteristics of the new project or its bugs differ from the training set's (domain shift).
In this study, we first define and measure the domain shift problem in automated program repair. Then, we propose a domain adaptation framework that can adapt an APR model for a given target project. We conduct an empirical study with three domain adaptation methods (FullFineTuning, TuningWithLightWeightAdapterLayers, and CurriculumLearning) using two state-of-the-art domain adaptation tools (TFix and CodeXGLUE) and two APR models on 611 bugs from 19 projects. The results show that our proposed framework can improve the effectiveness of TFix by 13.05% and CodeXGLUE by 23.4%. Another contribution of this study is the proposal of a data synthesis method to address the lack of labelled data in APR. We leverage transformers to create a bug generator model. We use the generated synthetic data to domain adapt TFix and CodeXGLUE on the projects with no data (zero-shot learning), which results in an average improvement of 5.76% and 24.42% for TFix and CodeXGLUE, respectively.
Submitted 21 December, 2022;
originally announced December 2022.
-
MDA: Availability-Aware Federated Learning Client Selection
Authors:
Amin Eslami Abyane,
Steve Drew,
Hadi Hemmati
Abstract:
Recently, a new distributed learning scheme called Federated Learning (FL) has been introduced. FL is designed so that the server never collects user-owned data, which makes it well suited to preserving privacy. FL's process starts with the server sending a model to clients; the clients then train that model using their data and send the updated model back to the server. Afterward, the server aggregates all the updates and modifies the global model. This process is repeated until the model converges. This study focuses on an FL setting called cross-device FL, which trains based on a large number of clients. Since many devices may be unavailable in cross-device FL, and communication between the server and all clients is extremely costly, only a fraction of clients gets selected for training at each round. In vanilla FL, clients are selected randomly, which results in an acceptable accuracy but is not ideal from the overall training time perspective, since some clients are slow and can cause some training rounds to be slow. If only fast clients were selected, learning would speed up, but it would be biased toward the fast clients' data and accuracy would degrade. Consequently, new client selection techniques have been proposed to improve the training time by considering individual clients' resources and speed. This paper introduces the first availability-aware selection strategy, called MDA. The results show that our approach makes learning faster than vanilla FL by up to 6.5%. Moreover, we show that resource heterogeneity-aware techniques are effective but can become even better when combined with our approach, making it faster than the state-of-the-art selectors by up to 16%. Lastly, our approach selects more unique clients for training compared to client selectors that only select fast clients, which reduces our technique's bias.
Submitted 25 November, 2022;
originally announced November 2022.
-
Explainable AI for Pre-Trained Code Models: What Do They Learn? When They Do Not Work?
Authors:
Ahmad Haji Mohammadkhani,
Chakkrit Tantithamthavorn,
Hadi Hemmati
Abstract:
In recent years, there has been a wide interest in designing deep neural network-based models that automate downstream software engineering tasks on source code, such as code document generation, code search, and program repair. Although the main objective of these studies is to improve the effectiveness of the downstream task, many studies only attempt to employ the next best neural network model, without a proper in-depth analysis of why a particular solution works or does not, on particular tasks or scenarios. In this paper, using an example eXplainable AI (XAI) method (attention mechanism), we study two recent large language models (LLMs) for code (CodeBERT and GraphCodeBERT) on a set of software engineering downstream tasks: code document generation (CDG), code refinement (CR), and code translation (CT). Through quantitative and qualitative studies, we identify what CodeBERT and GraphCodeBERT learn (put the highest attention on, in terms of source code token types), on these tasks. We also show some of the common patterns when the model does not work as expected (performs poorly even on easy problems) and suggest recommendations that may alleviate the observed challenges.
Submitted 28 August, 2023; v1 submitted 23 November, 2022;
originally announced November 2022.
-
Improving the Performance of DNN-based Software Services using Automated Layer Caching
Authors:
Mohammadamin Abedi,
Yanni Iouannou,
Pooyan Jamshidi,
Hadi Hemmati
Abstract:
Deep Neural Networks (DNNs) have become an essential component in many application domains including web-based services. A variety of these services require high throughput and (close to) real-time features, for instance, to respond or react to users' requests or to process a stream of incoming data on time. However, the trend in DNN design is toward larger models with many layers and parameters to achieve more accurate results. Although these models are often pre-trained, the computational complexity in such large models can still be relatively significant, hindering low inference latency. Implementing a caching mechanism is a typical systems engineering solution for speeding up a service response time. However, traditional caching is often not suitable for DNN-based services. In this paper, we propose an end-to-end automated solution to improve the performance of DNN-based services in terms of their computational complexity and inference latency. Our caching method adopts the ideas of self-distillation of DNN models and early exits. The proposed solution is an automated online layer caching mechanism that allows early exiting of a large model during inference time if the cache model in one of the early exits is confident enough for final prediction. One of the main contributions of this paper is that we have implemented the idea as an online cache, meaning that the cache models do not need access to training data and operate solely on the incoming data at run-time, making the approach suitable for applications using pre-trained models. Our experimental results on two downstream tasks (face and object classification) show that, on average, caching can reduce the computational complexity of those services by up to 58% (in terms of FLOPs count) and improve their inference latency by up to 46% with low to zero reduction in accuracy.
Submitted 18 September, 2022;
originally announced September 2022.
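The early-exit idea described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `predict_with_early_exit`, the use of the maximum class probability as the confidence measure, and the 0.9 threshold are all assumptions made for illustration, since the abstract only says the cache model must be "confident enough".

```python
from typing import Callable, Dict, List, Sequence, Tuple


def predict_with_early_exit(
    x,
    backbone_layers: Sequence[Callable],
    cache_models: Dict[int, Callable[..., List[float]]],
    final_head: Callable[..., List[float]],
    threshold: float = 0.9,  # assumed confidence threshold
) -> Tuple[int, int]:
    """Return (predicted_class, number_of_backbone_layers_executed).

    cache_models maps a layer index to a hypothetical cache model that
    turns that layer's activation into a list of class probabilities.
    """
    h = x
    for i, layer in enumerate(backbone_layers):
        h = layer(h)
        cache = cache_models.get(i)
        if cache is None:
            continue
        probs = cache(h)
        confidence = max(probs)  # assumed confidence measure
        if confidence >= threshold:
            # Early exit: skip all remaining backbone layers.
            return probs.index(confidence), i + 1
    # No cache model was confident enough: run the full model.
    probs = final_head(h)
    return probs.index(max(probs)), len(backbone_layers)


# Toy usage: three "layers" and one cache model attached after the second.
layers = [lambda h: h + 1, lambda h: h + 1, lambda h: h + 1]
head = lambda h: [0.2, 0.8]
confident_cache = {1: lambda h: [0.05, 0.95]}
unsure_cache = {1: lambda h: [0.6, 0.4]}

early = predict_with_early_exit(0, layers, confident_cache, head)
full = predict_with_early_exit(0, layers, unsure_cache, head)
```

In this toy run, the confident cache model ends inference after two layers, while the unsure one falls through to the final head after all three. The saved layers are where the FLOPs and latency reductions would come from.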
-
Test2Vec: An Execution Trace Embedding for Test Case Prioritization
Authors:
Emad Jabbar,
Soheila Zangeneh,
Hadi Hemmati,
Robert Feldt
Abstract:
Most automated software testing tasks can benefit from the abstract representation of test cases. Traditionally, this is done by encoding test cases based on their code coverage. Specification-level criteria can replace code coverage to better represent test cases' behavior, but they are often not cost-effective. In this paper, we hypothesize that execution traces of the test cases can be a good alternative to abstract their behavior for automated testing tasks. We propose a novel embedding approach, Test2Vec, that maps test execution traces to a latent space. We evaluate this representation in the test case prioritization (TP) task. Our default TP method is based on the similarity of the embedded vectors to historical failing test vectors. We also study an alternative based on the diversity of test vectors. Finally, we propose a method to decide which TP to choose for a given test suite. The experiment is based on several real and seeded faults with over a million execution traces. Results show that our proposed TP improves on the best alternatives by 41.80% in terms of the median normalized rank of the first failing test case (FFR). It outperforms traditional code coverage-based approaches by 25.05% and 59.25% in terms of median APFD and median normalized FFR, respectively.
Submitted 28 June, 2022;
originally announced June 2022.
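The similarity-based prioritization and the normalized FFR metric mentioned in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not Test2Vec itself: the embeddings are taken as given, cosine similarity and the "maximum similarity to any historical failing vector" score are assumed choices, and normalized FFR is assumed to be the 1-based rank of the first failing test divided by the suite size.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prioritize(
    test_vectors: Dict[str, Sequence[float]],
    failing_history: List[Sequence[float]],
) -> List[str]:
    """Order tests by their highest similarity to any past failing vector.

    Assumes failing_history is non-empty.
    """
    def score(name: str) -> float:
        return max(cosine(test_vectors[name], f) for f in failing_history)

    return sorted(test_vectors, key=score, reverse=True)


def normalized_ffr(order: List[str], failing: Set[str]) -> float:
    """Rank of the first failing test divided by suite size (assumed definition)."""
    for rank, name in enumerate(order, start=1):
        if name in failing:
            return rank / len(order)
    raise ValueError("no failing test in the suite")


# Toy usage with hypothetical two-dimensional embeddings.
vectors = {"t1": [1.0, 0.0], "t2": [0.0, 1.0], "t3": [1.0, 1.0]}
history = [[0.0, 2.0]]
order = prioritize(vectors, history)
ffr = normalized_ffr(order, {"t3"})
```

With these toy vectors, `t2` is ranked first because it points in the same direction as the historical failure, and the failing test `t3` lands in second place out of three. Lower normalized FFR is better.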
-
An Empirical Study on Bug Severity Estimation Using Source Code Metrics and Static Analysis
Authors:
Ehsan Mashhadi,
Shaiful Chowdhury,
Somayeh Modaberi,
Hadi Hemmati,
Gias Uddin
Abstract:
In the past couple of decades, significant research efforts have been devoted to the prediction of software bugs (i.e., defects). These works leverage a diverse set of metrics, tools, and techniques to predict which classes, methods, lines, or commits are buggy. However, most existing work in this domain treats all bugs the same, which is not the case in practice. The more severe the bug, the higher its consequences. Therefore, it is important for a defect prediction method to estimate the severity of the identified bugs, so that the higher-severity ones get immediate attention. In this paper, we provide a quantitative and qualitative study on two popular datasets (Defects4J and Bugs.jar), using 10 common source code metrics and two popular static analysis tools (SpotBugs and Infer), analyzing their capability in predicting defects and their severity. We studied 3,358 buggy methods with different severity labels from 19 Java open-source projects. Results show that although code metrics are powerful in predicting buggy code (the Lines of Code, Maintainability Index, FanOut, and Effort metrics are the best), they cannot estimate the severity level of the bugs. In addition, we observed that static analysis tools have weak performance in predicting both bugs (F1 score range of 3.1%-7.1%) and their severity label (F1 score under 2%). We also manually studied the characteristics of the severe bugs to identify possible reasons behind the weak performance of code metrics and static analysis tools in estimating the severity. Our categorization also shows that Security bugs have high severity in most cases, while Edge/Boundary faults have low severity. Finally, we show that code metrics and static analysis methods can be complementary in terms of estimating bug severity.
Submitted 26 June, 2022;
originally announced June 2022.
-
Towards Understanding Quality Challenges of the Federated Learning for Neural Networks: A First Look from the Lens of Robustness
Authors:
Amin Eslami Abyane,
Derui Zhu,
Roberto Souza,
Lei Ma,
Hadi Hemmati
Abstract:
Federated learning (FL) is a distributed learning paradigm that preserves users' data privacy while leveraging the entire dataset of all participants. In FL, multiple models are trained independently on the clients and aggregated centrally to update a global model in an iterative process. Although this approach is excellent at preserving privacy, FL still suffers from quality issues such as attacks or Byzantine faults. Recent attempts have been made to address such quality challenges through robust aggregation techniques for FL. However, the effectiveness of state-of-the-art (SOTA) robust FL techniques is still unclear and lacks a comprehensive study. Therefore, to better understand the current quality status and challenges of these SOTA FL techniques in the presence of attacks and faults, we perform a large-scale empirical study to investigate the SOTA FL's quality from multiple angles of attacks, simulated faults (via mutation operators), and aggregation (defense) methods. In particular, we study FL's performance on image classification tasks and use DNNs as our model type. Furthermore, we perform our study on two generic image datasets and one real-world federated medical image dataset. We also investigate the effect of the proportion of affected clients and the dataset distribution factors on the robustness of FL. After a large-scale analysis with 496 configurations, we find that most mutators on each user have a negligible effect on the final model in the generic datasets, and only one of them is effective in the medical dataset. Furthermore, we show that model poisoning attacks are more effective than data poisoning attacks. Moreover, choosing the most robust FL aggregator depends on the attacks and datasets. Finally, we illustrate that a simple ensemble of aggregators achieves a more robust solution than any single aggregator and is the best choice in 75% of the cases.
Submitted 9 January, 2023; v1 submitted 4 January, 2022;
originally announced January 2022.
-
Robustness Analysis of Deep Learning Frameworks on Mobile Platforms
Authors:
Amin Eslami Abyane,
Hadi Hemmati
Abstract:
With the recent increase in the computational power of modern mobile devices, machine learning-based heavy tasks such as face detection and speech recognition are now integral parts of such devices. This requires frameworks to execute machine learning models (e.g., Deep Neural Networks) on mobile devices. Although there exist studies on the accuracy and performance of these frameworks, the quality of on-device deep learning frameworks, in terms of their robustness, has not been systematically studied yet. In this paper, we empirically compare two on-device deep learning frameworks with three adversarial attacks on three different model architectures. We also use both the quantized and unquantized variants of each architecture. The results show that, in general, neither of the deep learning frameworks is better than the other in terms of robustness, and there is no significant difference between the PC and mobile frameworks either. However, in cases such as the Boundary attack, the mobile version is more robust than the PC version. In addition, quantization improves robustness in all cases when moving from PC to mobile.
Submitted 20 September, 2021;
originally announced September 2021.
-
Applying CodeBERT for Automated Program Repair of Java Simple Bugs
Authors:
Ehsan Mashhadi,
Hadi Hemmati
Abstract:
Software debugging and program repair are among the most time-consuming and labor-intensive tasks in software engineering and would benefit greatly from automation. In this paper, we propose a novel automated program repair approach based on CodeBERT, which is a transformer-based neural architecture pre-trained on a large corpus of source code. We fine-tune our model on the ManySStuBs4J small and large datasets to automatically generate the fix code. The results show that our technique accurately predicts the fixed code implemented by the developers in 19-72% of the cases, depending on the type of dataset, in less than a second per bug. We also observe that our method can generate varied-length fixes (short and long) and can fix different types of bugs, even if only a few instances of those types of bugs exist in the training dataset.
Submitted 30 March, 2021; v1 submitted 22 March, 2021;
originally announced March 2021.
-
Dosimetric characterization of a new 192Ir pulse dose rate brachytherapy source with the Monte Carlo simulation and thermoluminescent dosimeter
Authors:
Vahid Lohrabian,
Alireza Kamali-Asl,
Hossein Arabi,
Hamidreza Hemmati,
Majid Pournezam Esfahani
Abstract:
In this study, recommendations of the AAPM TG-43 (U1) report have been followed to characterize the new 192Ir pulse dose rate source, provided by the Applied Radiation Research School, Nuclear Science and Technology Research Institute in Iran. The dose rate constant, radial dose function, geometry factors, and anisotropy function were calculated according to the relevant American Association of Physicists in Medicine (AAPM) and TG-43 (U1) reports. In this study, the 192Ir source was characterized using Monte Carlo simulation in a water phantom, and in addition, experimental measurements were carried out using thermoluminescent dosimeters (TLD-100) in plexiglass (PMMA) phantoms. The dose-rate constant for the 192Ir PDR source was found to be equal to 1.131 cGy h⁻¹ U⁻¹ and 1.173 cGy h⁻¹ U⁻¹ with TLD measurement and Monte Carlo simulation, respectively. Also in this study, the geometry function, radial dose function g(r), and the anisotropy function have been calculated at distances from 0.1 to 16 cm. The dose-rate constant of these calculations has been compared with measured values for an actual 192Ir seed. The results of dosimetry parameters, presented in tabulated and graphical formats, exhibited good agreement with those reported from other commercially available PDR 192Ir sources. The results obtained in this study are in close agreement with the characteristics of the commercially available 192Ir sources. The results obtained in this study can be treated as an initial assessment of this source to be employed in the conventional treatment planning systems subsequent to complementary investigations.
Submitted 11 February, 2021;
originally announced February 2021.
-
A Search-Based Testing Framework for Deep Neural Networks of Source Code Embedding
Authors:
Maryam Vahdat Pour,
Zhuo Li,
Lei Ma,
Hadi Hemmati
Abstract:
Over the past few years, deep neural networks (DNNs) have been continuously expanding their real-world applications for source code processing tasks across the software engineering domain, e.g., clone detection, code search, and comment generation. Although quite a few recent works have been performed on testing of DNNs in the context of image and speech processing, limited progress has been achieved so far on DNN testing in the context of source code processing, which exhibits rather unique characteristics and challenges.
In this paper, we propose a search-based testing framework for DNNs of source code embedding and its downstream processing tasks like Code Search. To generate new test inputs, we adopt popular source code refactoring tools to generate semantically equivalent variants. For more effective testing, we leverage DNN mutation testing to guide the testing direction. To demonstrate the usefulness of our technique, we perform a large-scale evaluation on popular DNNs of source code processing based on multiple state-of-the-art code embedding methods (i.e., Code2vec, Code2seq and CodeBERT). The testing results show that our generated adversarial samples can on average reduce the performance of these DNNs by 5.41% to 9.58%. Through retraining the DNNs with our generated adversarial samples, the robustness of the DNNs can improve by 23.05% on average. The evaluation results also show that our adversarial test generation strategy has the least negative impact (median of 3.56%) on the performance of the DNNs for regular test data, compared to the other methods.
Submitted 19 January, 2021;
originally announced January 2021.
-
GloBug: Using Global Data in Fault Localization
Authors:
Nima Miryeganeh,
Sepehr Hashtroudi,
Hadi Hemmati
Abstract:
Fault Localization (FL) is an important first step in software debugging and is mostly manual in current practice. Many methods have been proposed over the years to automate the FL process, including information retrieval (IR)-based techniques. These methods localize the fault based on the similarity between the bug report and the source code. Newer variations of IR-based FL (IRFL) techniques also look into the history of bug reports and leverage them during the localization. However, all existing IRFL techniques limit themselves to the current project's data (local data). In this study, we introduce GloBug, which is an IRFL framework consisting of methods that use models pre-trained on the global data (extracted from open-source benchmark projects). In GloBug, we investigate two heuristics: a) the effect of global data on a state-of-the-art IRFL technique, namely BugLocator, and b) the application of a Word Embedding technique (Doc2Vec) together with global data. Our large-scale experiment on 51 software projects shows that using global data improves BugLocator by 6.6% and 4.8% on average in terms of MRR (Mean Reciprocal Rank) and MAP (Mean Average Precision), with over 14% in a majority (64% and 54% in terms of MRR and MAP, respectively) of the cases. This amount of improvement is significant compared to the improvement rates that five other state-of-the-art IRFL tools provide over BugLocator. In addition, training the models globally is a one-time offline task with no overhead on BugLocator's run-time fault localization. Our study, however, shows that a Word Embedding-based global solution did not further improve the results.
Submitted 14 January, 2021;
originally announced January 2021.
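The MRR and MAP figures quoted in the abstract above follow the standard information-retrieval definitions, which can be sketched as below. The data layout (one ranked file list and one set of truly buggy files per bug report) and the file names are hypothetical, chosen only to make the example concrete. Average precision is normalized here by the number of relevant files, which is the common convention but is an assumption about the paper's exact setup.

```python
from typing import List, Sequence, Set, Tuple

# One query per bug report: (ranked candidate files, truly buggy files).
Query = Tuple[Sequence[str], Set[str]]


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant item."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_reciprocal_rank(queries: List[Query]) -> float:
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def mean_average_precision(queries: List[Query]) -> float:
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


# Toy usage with two hypothetical bug reports.
queries: List[Query] = [
    (["a.java", "b.java", "c.java"], {"b.java"}),
    (["x.java", "y.java", "z.java"], {"x.java", "z.java"}),
]
mrr = mean_reciprocal_rank(queries)
map_score = mean_average_precision(queries)
```

For the two toy reports, the reciprocal ranks are 1/2 and 1, giving an MRR of 0.75. The average precisions are 1/2 and 5/6, giving a MAP of 2/3.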
-
A Pragmatic Approach for Hyper-Parameter Tuning in Search-based Test Case Generation
Authors:
Shayan Zamani,
Hadi Hemmati
Abstract:
Search-based test case generation, which is the application of meta-heuristic search for generating test cases, has been studied extensively in the recent literature. Since, in theory, the performance of meta-heuristic search methods is highly dependent on their hyper-parameters, there is a need to study hyper-parameter tuning in this domain. In this paper, we propose a new metric ("Tuning Gain"), which estimates how cost-effective tuning a particular class is. We then predict "Tuning Gain" using static features of source code classes. Finally, we prioritize classes for tuning based on the estimated "Tuning Gains" and spend the tuning budget only on the highly ranked classes. To evaluate our approach, we exhaustively analyze 1,200 hyper-parameter configurations of a well-known search-based test generation tool (EvoSuite) for 250 classes of 19 projects from benchmarks such as SF110 and the SBST2018 tool competition. We used a tuning approach called Meta-GA and compared the tuning results with and without the proposed class prioritization. The results show that for a low tuning budget, prioritizing classes outperforms the alternatives in terms of extra covered branches (10 times more than a traditional global tuning). In addition, we report the impact of different features of our approach, such as search space size, tuning budgets, tuning algorithms, and the number of classes to tune, on the final results.
Submitted 14 January, 2021;
originally announced January 2021.
-
Deep State Inference: Toward Behavioral Model Inference of Black-box Software Systems
Authors:
Foozhan Ataiefard,
Mohammad Jafar Mashhadi,
Hadi Hemmati,
Niel Walkinshaw
Abstract:
Many software engineering tasks, such as testing and anomaly detection, can benefit from the ability to infer a behavioral model of the software. Most existing inference approaches assume access to code to collect execution sequences. In this paper, we investigate a black-box scenario, where the system under analysis cannot be instrumented in this granular fashion. This scenario is particularly prevalent with control systems' log analysis in the form of continuous signals. In this situation, an execution trace amounts to a multivariate time series of input and output signals, where different states of the system correspond to different "phases" in the time series. The main challenge is to detect when these phase changes take place. Unfortunately, most existing solutions are either univariate, make assumptions on the data distribution, or have limited learning power. Therefore, we propose a hybrid deep neural network that accepts as input a multivariate time series and applies a set of convolutional and recurrent layers to learn the non-linear correlations between signals and the patterns over time. We show how this approach can be used to accurately detect state changes, and how the inferred models can be successfully applied to transfer-learning scenarios, to accurately process traces from different products with similar execution characteristics. Our experimental results on two UAV autopilot case studies indicate that our approach is highly accurate (over 90% F1 score for state classification) and significantly improves on the baselines (by up to 102% for change point detection). Using transfer learning, we also show that up to 90% of the maximum achievable F1 scores in the open-source case study can be achieved by reusing the trained models from the industrial case and only fine-tuning them using as few as 5 labeled samples, which reduces the manual labeling effort by 98%.
Submitted 12 October, 2021; v1 submitted 13 January, 2021;
originally announced January 2021.
-
Perfectly-reflecting guided-mode-resonant photonic lattices possessing Mie modal memory
Authors:
Yeong Hwan Ko,
Nasrin Razmjooei,
Hafez Hemmati,
Robert Magnusson
Abstract:
Resonant periodic nanostructures provide perfect reflection across small or large spectral bandwidths depending on the choice of materials and design parameters. This effect has been known for decades, observed theoretically and experimentally via one-dimensional and two-dimensional structures commonly known as resonant gratings, metamaterials, and metasurfaces. The physical cause of this extraordinary phenomenon is guided-mode resonance mediated by lateral Bloch modes excited by evanescent diffraction orders in the subwavelength regime. In recent years, hundreds of papers have declared Fabry-Perot or Mie resonance to be the basis of the perfect reflection possessed by periodic metasurfaces. Treating a simple one-dimensional cylindrical-rod lattice, here we show clearly and unambiguously that Mie resonance does not cause perfect reflection. In fact, the spectral placement of the Bloch-mode-mediated zero-order reflectance is primarily controlled by the lattice period by way of its direct effect on the homogenized effective-medium refractive index of the lattice. In general, perfect reflection appears away from Mie resonance. However, when the lateral leaky-mode field profiles approach the isolated-particle Mie field profiles, the resonance locus tends towards the Mie resonance wavelength. The fact that the lattice fields remember the isolated particle fields is referred to here as Mie modal memory. On erasure of the Mie memory by an index-matched sublayer, we show that perfect reflection survives with the resonance locus approaching the homogenized effective-medium waveguide locus. The results presented here will aid in clarifying the physical basis of general resonant photonic lattices.
Submitted 16 December, 2020;
originally announced December 2020.
-
Hybrid Deep Neural Networks to Infer State Models of Black-Box Systems
Authors:
Mohammad Jafar Mashhadi,
Hadi Hemmati
Abstract:
Inferring a behavior model of a running software system is quite useful for several automated software engineering tasks, such as program comprehension, anomaly detection, and testing. Most existing dynamic model inference techniques are white-box, i.e., they require source code to be instrumented to get run-time traces. However, in many systems, instrumenting the entire source code is not possible (e.g., when using black-box third-party libraries) or might be very costly. Unfortunately, most black-box techniques that detect states over time are either univariate, make assumptions on the data distribution, or have limited power for learning over a long period of past behavior. To overcome the above issues, in this paper, we propose a hybrid deep neural network that accepts as input a set of time series, one per input/output signal of the system, and applies a set of convolutional and recurrent layers to learn the non-linear correlations between signals and the patterns over time. We have applied our approach to a real UAV auto-pilot solution from our industry partner with half a million lines of C code. We ran 888 random recent system-level test cases and inferred states over time. Our comparison with several traditional time series change point detection techniques showed that our approach improves their performance by up to 102% in terms of finding state change points, measured by F1 score. We also showed that our state classification algorithm provides on average a 90.45% F1 score, which improves on traditional classification algorithms by up to 17%.
Submitted 26 August, 2020;
originally announced August 2020.
-
Applicability of the Rytov full effective-medium formalism to the physical description and design of resonant metasurfaces
Authors:
Hafez Hemmati,
Robert Magnusson
Abstract:
Periodic photonic lattices constitute a fundamental pillar of physics supporting a plethora of scientific concepts and applications. The advent of metamaterials and metastructures is grounded in a deep understanding of their properties. Based on the original 1956 formulation by Rytov, it is well known that a photonic lattice with deep subwavelength periodicity can be approximated with a homogeneous space having an effective refractive index. Whereas the attendant effective-medium theory (EMT) commonly used in the literature is based on the zeroth root, the closed-form transcendental equations possess an infinite number of roots. Thus far, these higher-order solutions have been totally ignored; even Rytov himself discarded them and proceeded to approximate solutions for the deep-subwavelength regime. In spite of the fact that the Rytov EMT models an infinite half-space lattice, it is highly relevant to modeling practical thin-film periodic structures with finite thickness, as we show. Therefore, here, we establish a theoretical framework to systematically describe subwavelength resonance behavior and to predict the optical response of resonant photonic lattices using the full Rytov solutions. Expeditious results are obtained with direct, new physical insights available for resonant lattice properties. We show that the full Rytov formulation implicitly contains refractive-index solutions pertaining directly to evanescent waves that drive the laterally propagating Bloch modes foundational to resonant lattice properties. In fact, the resonant reradiated Bloch modes experience wavelength-dependent refractive indices that are solutions of the Rytov expressions. This insight is useful in modeling guided-mode resonant devices including wideband reflectors, bandpass filters, and polarizers.
Submitted 8 June, 2020;
originally announced June 2020.
-
An IR-based Approach Towards Automated Integration of Geo-spatial Datasets in Map-based Software Systems
Authors:
Nima Miryeganeh,
Mehdi Amoui,
Hadi Hemmati
Abstract:
Data is arguably the most valuable asset of the modern world. In this era, the success of any data-intensive solution relies on the quality of the data that drives it. Among the vast amounts of data that are captured, managed, and analyzed every day, geospatial data are one of the most interesting classes of data; they hold geographical information about real-world phenomena and can be visualized as digital maps. Geospatial data are the source of many enterprise solutions that provide local information and insights. In order to increase the quality of such solutions, companies continuously aggregate geospatial datasets from various sources. However, the lack of a global standard model for geospatial datasets makes the task of merging and integrating datasets difficult and error-prone. Traditionally, domain experts manually validate the data integration process by checking merged new data sources and/or new versions of previous data for conflicts and other requirement violations. However, this approach is not scalable and hinders rapid release when dealing with frequently changing big datasets. Thus, more automated approaches with limited interaction with domain experts are required. As a first step to tackle this problem, in this paper, we leverage Information Retrieval (IR) and geospatial search techniques to propose a systematic and automated conflict identification approach. To evaluate our approach, we conduct a case study in which we measure the accuracy of our approach in several real-world scenarios, and we interview software developers at Localintel Inc. (our industry partner) to get their feedback.
Submitted 27 June, 2019; v1 submitted 13 June, 2019;
originally announced June 2019.
-
Revisiting Hyper-Parameter Tuning for Search-based Test Data Generation
Authors:
Shayan Zamani,
Hadi Hemmati
Abstract:
Search-based software testing (SBST) has been studied extensively in the recent literature. Since, in theory, the performance of meta-heuristic search methods is highly dependent on their parameters, there is a need to study SBST tuning. In this study, we partially replicate a previous paper on SBST tool tuning and revisit some of the claims of that paper. In particular, unlike the previous work, our results show that the tuning impact is limited to only a small portion of the classes in a project. We also question the choice of evaluation metric in the previous paper and show that, even for the classes impacted by tuning, the practical difference between the best and an average configuration is minor. Finally, we exhaustively explore the search space of hyper-parameters and show that half of the studied configurations perform the same as or better than the baseline paper's default configuration.
Submitted 5 June, 2019;
originally announced June 2019.
-
Interactive Semi-automated Specification Mining for Debugging: An Experience Report
Authors:
Mohammad Jafar Mashhadi,
Taha R. Siddiqui,
Hadi Hemmati,
Howard Loewen
Abstract:
Context: Specification mining techniques are typically used to extract the specification of a software system in the absence of (up-to-date) specification documents. This is useful for program comprehension, testing, and anomaly detection. However, specification mining can also potentially be used for debugging, where a faulty behavior is abstracted to give developers context about the bug and help them locate it. Objective: In this project, we investigate this idea in an industrial setting. We propose a very basic semi-automated specification mining approach for debugging and apply it to real reported issues from an AutoPilot software system from our industry partner, MicroPilot Inc. The objective is to assess the feasibility and usefulness of the approach in a real-world setting. Method: The approach is developed as a prototype tool, working on C code, which accepts a set of relevant state fields and functions per issue and generates an extended finite state machine that represents the faulty behavior, abstracted with respect to the relevant context (the selected fields and functions). Results: We qualitatively evaluate the approach through a set of interviews (including observational studies) with the company's developers on their real-world reported bugs. The results show that a) our approach is feasible, b) it can be automated to some extent, and c) it brings advantages over using only their code-level debugging tools. We also compare this approach with traditional fully automated state-merging algorithms and report several issues that arise when applying those techniques in a real-world debugging context. Conclusion: The main conclusion of this study is that the idea of an "interactive" specification mining tool, rather than a fully automated one, is NOT impractical and is indeed useful for the debugging use case.
Submitted 6 May, 2019;
originally announced May 2019.
-
An Empirical Study on Practicality of Specification Mining Algorithms on a Real-world Application
Authors:
Mohammad Jafar Mashhadi,
Hadi Hemmati
Abstract:
Dynamic model inference techniques have been at the center of many research projects recently. There are now multiple open source implementations of state-of-the-art algorithms, which provide basic abstraction and merging capabilities. Most of these tools and algorithms have been developed with one particular application in mind, which is program comprehension. The output models can abstract away the details of the program and represent the software behavior in a concise and easy-to-understand form. However, one application context that is less studied is using such inferred models for debugging, where the behavior to abstract is a faulty behavior (e.g., a set of execution traces including a failed test case). We tried to apply some of the existing model inference techniques (implemented in a promising tool called MINT) in a real-world industrial context to support program comprehension for debugging. Our initial experiments have shown many limitations in terms of both the implementation and the algorithms. The paper discusses the root causes of the failures and proposes ideas for future improvement.
Submitted 28 March, 2019; v1 submitted 27 March, 2019;
originally announced March 2019.
-
Advancing Tests of Relativistic Gravity via Laser Ranging to Phobos
Authors:
Slava G. Turyshev,
William Farr,
William M. Folkner,
Andre R. Girerd,
Hamid Hemmati,
Thomas W. Murphy, Jr.,
James G. Williams,
John J. Degnan
Abstract:
Phobos Laser Ranging (PLR) is a concept for a space mission designed to advance tests of relativistic gravity in the solar system. PLR's primary objective is to measure the curvature of space around the Sun, represented by the Eddington parameter $γ$, with an accuracy of two parts in $10^7$, thereby improving today's best result by two orders of magnitude. Other mission goals include measurements of the time-rate-of-change of the gravitational constant, $G$, and of the gravitational inverse square law at 1.5 AU distances--with up to two orders-of-magnitude improvement for each. The science parameters will be estimated using laser ranging measurements of the distance between an Earth station and an active laser transponder on Phobos capable of reaching mm-level range resolution. A transponder on Phobos sending 0.25 mJ, 10 ps pulses at 1 kHz, and receiving asynchronous 1 kHz pulses from Earth via a 12 cm aperture will permit links that even at maximum range will exceed a photon per second. A total measurement precision of 50 ps demands a few hundred photons to average to 1 mm (3.3 ps) range precision. Existing satellite laser ranging (SLR) facilities--with appropriate augmentation--may be able to participate in PLR. Since Phobos' orbital period is about 8 hours, each observatory is guaranteed visibility of the Phobos instrument every Earth day. Given the current technology readiness level, PLR could be started in 2011 for launch in 2016 for 3 years of science operations. We discuss PLR's science objectives, instrument, and mission design. We also present the details of science simulations performed to support the mission's primary objectives.
Submitted 3 September, 2010; v1 submitted 25 March, 2010;
originally announced March 2010.
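The photon-averaging figure quoted in the Phobos Laser Ranging abstract (50 ps per measurement, "a few hundred photons" to reach 1 mm, i.e. 3.3 ps) can be checked with a short calculation. The sketch below is not taken from the paper: it assumes that individual photon timings are independent, so the timing error of the average shrinks as 1/sqrt(N), and that the abstract's 1 mm to 3.3 ps conversion is a one-way light-travel time. The function names are hypothetical.

```python
import math

# Speed of light in vacuum (exact by SI definition), in metres per second.
C_M_PER_S = 299_792_458.0


def range_to_time_ps(range_m: float) -> float:
    """One-way light-travel time, in picoseconds, for a distance in metres."""
    return range_m / C_M_PER_S * 1e12


def photons_needed(single_shot_ps: float, target_ps: float) -> int:
    """Photons to average so that single_shot_ps / sqrt(N) <= target_ps.

    Assumption (not from the abstract): independent timing errors, so the
    error of the mean scales as 1 / sqrt(N).
    """
    return math.ceil((single_shot_ps / target_ps) ** 2)


# 1 mm of range expressed as a timing precision (the abstract quotes 3.3 ps).
target_ps = range_to_time_ps(1e-3)

# Photons required when each measurement is good to 50 ps.
n_photons = photons_needed(50.0, target_ps)

print(f"1 mm corresponds to {target_ps:.2f} ps")
print(f"photons to average: {n_photons}")
```

Under these assumptions the result lands in the low hundreds, consistent with the abstract's "a few hundred photons"; at a link rate slightly above one photon per second, that corresponds to several minutes of integration.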