Search | arXiv e-print repository

doi 10.1145/3664646.3664764

Measuring Impacts of Poisoning on Model Parameters and Embeddings for Large Language Models of Code

Authors: Aftab Hussain, Md Rafiqul Islam Rabin, Mohammad Amin Alipour

Abstract: Large language models (LLMs) have revolutionized software development practices, yet concerns about their safety have arisen, particularly regarding hidden backdoors, aka trojans. Backdoor attacks involve the insertion of triggers into training data, allowing attackers to manipulate the behavior of the model maliciously. In this paper, we focus on analyzing the model parameters to detect potential… ▽ More Large language models (LLMs) have revolutionized software development practices, yet concerns about their safety have arisen, particularly regarding hidden backdoors, aka trojans. Backdoor attacks involve the insertion of triggers into training data, allowing attackers to manipulate the behavior of the model maliciously. In this paper, we focus on analyzing the model parameters to detect potential backdoor signals in code models. Specifically, we examine attention weights and biases, and context embeddings of the clean and poisoned CodeBERT and CodeT5 models. Our results suggest noticeable patterns in context embeddings of poisoned samples for both the poisoned models; however, attention weights and biases do not show any significant differences. This work contributes to ongoing efforts in white-box detection of backdoor signals in LLMs of code through the analysis of parameters and embeddings. △ Less

Submitted 19 May, 2024; originally announced May 2024.

Comments: This work has been accepted at the 1st ACM International Conference on AI-powered Software (AIware), co-located with the ACM International Conference on the Foundations of Software Engineering (FSE) 2024, Porto de Galinhas, Brazil. arXiv admin note: substantial text overlap with arXiv:2402.12936

arXiv:2405.02828 [pdf, other]

Trojans in Large Language Models of Code: A Critical Review through a Trigger-Based Taxonomy

Authors: Aftab Hussain, Md Rafiqul Islam Rabin, Toufique Ahmed, Bowen Xu, Premkumar Devanbu, Mohammad Amin Alipour

Abstract: Large language models (LLMs) have provided a lot of exciting new capabilities in software development. However, the opaque nature of these models makes them difficult to reason about and inspect. Their opacity gives rise to potential security risks, as adversaries can train and deploy compromised models to disrupt the software development process in the victims' organization. This work presents… ▽ More Large language models (LLMs) have provided a lot of exciting new capabilities in software development. However, the opaque nature of these models makes them difficult to reason about and inspect. Their opacity gives rise to potential security risks, as adversaries can train and deploy compromised models to disrupt the software development process in the victims' organization. This work presents an overview of the current state-of-the-art trojan attacks on large language models of code, with a focus on triggers -- the main design point of trojans -- with the aid of a novel unifying trigger taxonomy framework. We also aim to provide a uniform definition of the fundamental concepts in the area of trojans in Code LLMs. Finally, we draw implications of findings on how code models learn on trigger design. △ Less

Submitted 5 May, 2024; originally announced May 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2305.03803

arXiv:2402.16896 [pdf, other]

On Trojan Signatures in Large Language Models of Code

Authors: Aftab Hussain, Md Rafiqul Islam Rabin, Mohammad Amin Alipour

Abstract: Trojan signatures, as described by Fields et al. (2021), are noticeable differences in the distribution of the trojaned class parameters (weights) and the non-trojaned class parameters of the trojaned model, that can be used to detect the trojaned model. Fields et al. (2021) found trojan signatures in computer vision classification tasks with image models, such as, Resnet, WideResnet, Densenet, an… ▽ More Trojan signatures, as described by Fields et al. (2021), are noticeable differences in the distribution of the trojaned class parameters (weights) and the non-trojaned class parameters of the trojaned model, that can be used to detect the trojaned model. Fields et al. (2021) found trojan signatures in computer vision classification tasks with image models, such as, Resnet, WideResnet, Densenet, and VGG. In this paper, we investigate such signatures in the classifier layer parameters of large language models of source code. Our results suggest that trojan signatures could not generalize to LLMs of code. We found that trojaned code models are stubborn, even when the models were poisoned under more explicit settings (finetuned with pre-trained weights frozen). We analyzed nine trojaned models for two binary classification tasks: clone and defect detection. To the best of our knowledge, this is the first work to examine weight-based trojan signature revelation techniques for large-language models of code and furthermore to demonstrate that detecting trojans only from the weights in such models is a hard problem. △ Less

Submitted 7 March, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

Comments: This work has been accepted at the International Conference on Learning Representations 2024 Workshop on Secure and Trustworthy Large Language Models, SeT LLM @ ICLR 2024 (Vienna, Austria)

arXiv:2402.12936 [pdf, other]

Measuring Impacts of Poisoning on Model Parameters and Neuron Activations: A Case Study of Poisoning CodeBERT

Authors: Aftab Hussain, Md Rafiqul Islam Rabin, Navid Ayoobi, Mohammad Amin Alipour

Abstract: Large language models (LLMs) have revolutionized software development practices, yet concerns about their safety have arisen, particularly regarding hidden backdoors, aka trojans. Backdoor attacks involve the insertion of triggers into training data, allowing attackers to manipulate the behavior of the model maliciously. In this paper, we focus on analyzing the model parameters to detect potential… ▽ More Large language models (LLMs) have revolutionized software development practices, yet concerns about their safety have arisen, particularly regarding hidden backdoors, aka trojans. Backdoor attacks involve the insertion of triggers into training data, allowing attackers to manipulate the behavior of the model maliciously. In this paper, we focus on analyzing the model parameters to detect potential backdoor signals in code models. Specifically, we examine attention weights and biases, activation values, and context embeddings of the clean and poisoned CodeBERT models. Our results suggest noticeable patterns in activation values and context embeddings of poisoned samples for the poisoned CodeBERT model; however, attention weights and biases do not show any significant differences. This work contributes to ongoing efforts in white-box detection of backdoor signals in LLMs of code through the analysis of parameters and activations. △ Less

Submitted 5 March, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

arXiv:2402.02047 [pdf, other]

Calibration and Correctness of Language Models for Code

Authors: Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Amin Alipour, Susmit Jha, Prem Devanbu, Toufique Ahmed

Abstract: Machine learning models are widely used but can also often be wrong. Users would benefit from a reliable indication of whether a given output from a given model should be trusted, so a rational decision can be made whether to use the output or not. For example, outputs can be associated with a confidence measure; if this confidence measure is strongly associated with likelihood of correctness, the… ▽ More Machine learning models are widely used but can also often be wrong. Users would benefit from a reliable indication of whether a given output from a given model should be trusted, so a rational decision can be made whether to use the output or not. For example, outputs can be associated with a confidence measure; if this confidence measure is strongly associated with likelihood of correctness, then the model is said to be well-calibrated. In this case, for example, high-confidence outputs could be safely accepted, and low-confidence outputs rejected. Calibration has so far been studied in mostly non-generative (e.g., classification) settings, especially in Software Engineering. However, generated code can quite often be wrong: Developers need to know when they should e.g., directly use, use after careful review, or discard model-generated code; thus Calibration is vital in generative settings. However, the notion of correctness of generated code is non-trivial, and thus so is Calibration. In this paper we make several contributions. We develop a framework for evaluating the Calibration of code-generating models. We consider several tasks, correctness criteria, datasets, and approaches, and find that by and large generative code models are not well-calibrated out of the box. We then show how Calibration can be improved, using standard methods such as Platt scaling. Our contributions will lead to better-calibrated decision-making in the current use of code generated by language models, and offers a framework for future research to further improve calibration methods for generative models in Software Engineering. △ Less

Submitted 16 February, 2024; v1 submitted 3 February, 2024; originally announced February 2024.

arXiv:2312.04004 [pdf, other]

Occlusion-based Detection of Trojan-triggering Inputs in Large Language Models of Code

Authors: Aftab Hussain, Md Rafiqul Islam Rabin, Toufique Ahmed, Mohammad Amin Alipour, Bowen Xu

Abstract: Large language models (LLMs) are becoming an integrated part of software development. These models are trained on large datasets for code, where it is hard to verify each data point. Therefore, a potential attack surface can be to inject poisonous data into the training data to make models vulnerable, aka trojaned. It can pose a significant threat by hiding manipulative behaviors inside models, le… ▽ More Large language models (LLMs) are becoming an integrated part of software development. These models are trained on large datasets for code, where it is hard to verify each data point. Therefore, a potential attack surface can be to inject poisonous data into the training data to make models vulnerable, aka trojaned. It can pose a significant threat by hiding manipulative behaviors inside models, leading to compromising the integrity of the models in downstream tasks. In this paper, we propose an occlusion-based human-in-the-loop technique, OSeql, to distinguish trojan-triggering inputs of code. The technique is based on the observation that trojaned neural models of code rely heavily on the triggering part of input; hence, its removal would change the confidence of the models in their prediction substantially. Our results suggest that OSeql can detect the triggering inputs with almost 100% recall. We discuss the problem of false positives and how to address them. These results provide a baseline for future studies in this field. △ Less

Submitted 10 December, 2023; v1 submitted 6 December, 2023; originally announced December 2023.

arXiv:2311.14850 [pdf, other]

TrojanedCM: A Repository of Trojaned Large Language Models of Code

Authors: Aftab Hussain, Md Rafiqul Islam Rabin, Mohammad Amin Alipour

Abstract: With the rapid growth of research in trojaning deep neural models of source code, we observe that there is a need of develo** a benchmark trojaned models for testing various trojan detection and unlearning techniques. In this work, we aim to provide the scientific community with diverse trojaned code models, that cover a variety of state-of-the-art architectures, on which they can examine such t… ▽ More With the rapid growth of research in trojaning deep neural models of source code, we observe that there is a need of develo** a benchmark trojaned models for testing various trojan detection and unlearning techniques. In this work, we aim to provide the scientific community with diverse trojaned code models, that cover a variety of state-of-the-art architectures, on which they can examine such techniques. We thus present TrojanedCM, a publicly available repository of clean and poisoned models of source code. We provide poisoned models for two code classification tasks (defect detection and clone detection) and a code generation task (text-to-code generation). We finetuned popular pretrained code models such as CodeBERT, PLBART, CodeT5, CodeT5+, on poisoned datasets that we generated from benchmark datasets (Devign, BigCloneBench, CONCODE) for the above mentioned tasks. The repository also provides full access to the architecture and parameters of the models, allowing practitioners to investigate different white-box analysis techniques. In addition to the poisoned models, we also provide a poisoning framework using which practitioners can deploy various poisoning strategies for the different tasks and models of source code. All the material are accessible via this link: https://github.com/UH-SERG/TrojanedCM. △ Less

Submitted 11 December, 2023; v1 submitted 24 November, 2023; originally announced November 2023.

arXiv:2307.03319 [pdf, other]

Covering Uncommon Ground: Gap-Focused Question Generation for Answer Assessment

Authors: Roni Rabin, Alexandre Djerbetian, Roee Engelberg, Lidan Hackmon, Gal Elidan, Reut Tsarfaty, Amir Globerson

Abstract: Human communication often involves information gaps between the interlocutors. For example, in an educational dialogue, a student often provides an answer that is incomplete, and there is a gap between this answer and the perfect one expected by the teacher. Successful dialogue then hinges on the teacher asking about this gap in an effective manner, thus creating a rich and interactive educational… ▽ More Human communication often involves information gaps between the interlocutors. For example, in an educational dialogue, a student often provides an answer that is incomplete, and there is a gap between this answer and the perfect one expected by the teacher. Successful dialogue then hinges on the teacher asking about this gap in an effective manner, thus creating a rich and interactive educational experience. We focus on the problem of generating such gap-focused questions (GFQs) automatically. We define the task, highlight key desired aspects of a good GFQ, and propose a model that satisfies these. Finally, we provide an evaluation by human annotators of our generated questions compared against human generated ones, demonstrating competitive performance. △ Less

Submitted 6 July, 2023; originally announced July 2023.

arXiv:2305.03803 [pdf, other]

A Survey of Trojans in Neural Models of Source Code: Taxonomy and Techniques

Authors: Aftab Hussain, Md Rafiqul Islam Rabin, Toufique Ahmed, Navid Ayoobi, Bowen Xu, Prem Devanbu, Mohammad Amin Alipour

Abstract: In this work, we study literature in Explainable AI and Safe AI to understand poisoning of neural models of code. In order to do so, we first establish a novel taxonomy for Trojan AI for code, and present a new aspect-based classification of triggers in neural models of code. Next, we highlight recent works that help us deepen our conception of how these models understand software code. Then we pi… ▽ More In this work, we study literature in Explainable AI and Safe AI to understand poisoning of neural models of code. In order to do so, we first establish a novel taxonomy for Trojan AI for code, and present a new aspect-based classification of triggers in neural models of code. Next, we highlight recent works that help us deepen our conception of how these models understand software code. Then we pick some of the recent, state-of-art poisoning strategies that can be used to manipulate such models. The insights we draw can potentially help to foster future research in the area of Trojan AI for code. △ Less

Submitted 18 April, 2024; v1 submitted 5 May, 2023; originally announced May 2023.

arXiv:2303.04942 [pdf, other]

A Study of Variable-Role-based Feature Enrichment in Neural Models of Code

Authors: Aftab Hussain, Md Rafiqul Islam Rabin, Bowen Xu, David Lo, Mohammad Amin Alipour

Abstract: Although deep neural models substantially reduce the overhead of feature engineering, the features readily available in the inputs might significantly impact training cost and the performance of the models. In this paper, we explore the impact of an unsuperivsed feature enrichment approach based on variable roles on the performance of neural models of code. The notion of variable roles (as introdu… ▽ More Although deep neural models substantially reduce the overhead of feature engineering, the features readily available in the inputs might significantly impact training cost and the performance of the models. In this paper, we explore the impact of an unsuperivsed feature enrichment approach based on variable roles on the performance of neural models of code. The notion of variable roles (as introduced in the works of Sajaniemi et al. [Refs. 1,2]) has been found to help students' abilities in programming. In this paper, we investigate if this notion would improve the performance of neural models of code. To the best of our knowledge, this is the first work to investigate how Sajaniemi et al.'s concept of variable roles can affect neural models of code. In particular, we enrich a source code dataset by adding the role of individual variables in the dataset programs, and thereby conduct a study on the impact of variable role enrichment in training the Code2Seq model. In addition, we shed light on some challenges and opportunities in feature enrichment for neural code intelligence models. △ Less

Submitted 12 March, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

Comments: Accepted in the 1st International Workshop on Interpretability and Robustness in Neural Software Engineering (InteNSE'23), Co-located with ICSE

arXiv:2303.01739 [pdf, other]

doi 10.1109/InteNSE59150.2023.00005

Study of Distractors in Neural Models of Code

Authors: Md Rafiqul Islam Rabin, Aftab Hussain, Sahil Suneja, Mohammad Amin Alipour

Abstract: Finding important features that contribute to the prediction of neural models is an active area of research in explainable AI. Neural models are opaque and finding such features sheds light on a better understanding of their predictions. In contrast, in this work, we present an inverse perspective of distractor features: features that cast doubt about the prediction by affecting the model's confid… ▽ More Finding important features that contribute to the prediction of neural models is an active area of research in explainable AI. Neural models are opaque and finding such features sheds light on a better understanding of their predictions. In contrast, in this work, we present an inverse perspective of distractor features: features that cast doubt about the prediction by affecting the model's confidence in its prediction. Understanding distractors provide a complementary view of the features' relevance in the predictions of neural models. In this paper, we apply a reduction-based technique to find distractors and provide our preliminary results of their impacts and types. Our experiments across various tasks, models, and datasets of code reveal that the removal of tokens can have a significant impact on the confidence of models in their predictions and the categories of tokens can also play a vital role in the model's confidence. Our study aims to enhance the transparency of models by emphasizing those tokens that significantly influence the confidence of the models. △ Less

Submitted 3 March, 2023; originally announced March 2023.

Comments: The 1st International Workshop on Interpretability and Robustness in Neural Software Engineering, Co-located with ICSE (InteNSE'23)

arXiv:2205.14374 [pdf, other]

doi 10.1145/3520312.3534869

Syntax-Guided Program Reduction for Understanding Neural Code Intelligence Models

Authors: Md Rafiqul Islam Rabin, Aftab Hussain, Mohammad Amin Alipour

Abstract: Neural code intelligence (CI) models are opaque black-boxes and offer little insight on the features they use in making predictions. This opacity may lead to distrust in their prediction and hamper their wider adoption in safety-critical applications. Recently, input program reduction techniques have been proposed to identify key features in the input programs to improve the transparency of CI mod… ▽ More Neural code intelligence (CI) models are opaque black-boxes and offer little insight on the features they use in making predictions. This opacity may lead to distrust in their prediction and hamper their wider adoption in safety-critical applications. Recently, input program reduction techniques have been proposed to identify key features in the input programs to improve the transparency of CI models. However, this approach is syntax-unaware and does not consider the grammar of the programming language. In this paper, we apply a syntax-guided program reduction technique that considers the grammar of the input programs during reduction. Our experiments on multiple models across different types of input programs show that the syntax-guided program reduction technique is faster and provides smaller sets of key tokens in reduced programs. We also show that the key tokens could be used in generating adversarial examples for up to 65% of the input programs. △ Less

Submitted 14 June, 2022; v1 submitted 28 May, 2022; originally announced May 2022.

Comments: The 6th ACM SIGPLAN International Symposium on Machine Programming (MAPS'22); Related to arXiv:2202.06474

arXiv:2202.06474 [pdf, other]

Extracting Label-specific Key Input Features for Neural Code Intelligence Models

Authors: Md Rafiqul Islam Rabin

Abstract: The code intelligence (CI) models are often black-box and do not offer any insights on the input features that they learn for making correct predictions. This opacity may lead to distrust in their prediction and hamper their wider adoption in safety-critical applications. In recent, the program reduction technique is widely being used to identify key input features in order to explain the predicti… ▽ More The code intelligence (CI) models are often black-box and do not offer any insights on the input features that they learn for making correct predictions. This opacity may lead to distrust in their prediction and hamper their wider adoption in safety-critical applications. In recent, the program reduction technique is widely being used to identify key input features in order to explain the prediction of CI models. The approach removes irrelevant parts from an input program and keeps the minimal snippets that a CI model needs to maintain its prediction. However, the state-of-the-art approaches mainly use a syntax-unaware program reduction technique that does not follow the syntax of programs, which adds significant overhead to the reduction of input programs and explainability of models. In this paper, we apply a syntax-guided program reduction technique that follows the syntax of input programs during reduction. Our experiments on multiple models across different types of input programs show that the syntax-guided program reduction technique significantly outperforms the syntax-unaware program reduction technique in reducing the size of input programs. Extracting key input features from reduced programs reveals that the syntax-guided reduced programs contain more label-specific key input features and are more vulnerable to adversarial transformation when renaming the key tokens in programs. These label-specific key input features may help to understand the reasoning of models' prediction from different perspectives and increase the trustworthiness to correct classification given by CI models. △ Less

Submitted 13 February, 2022; originally announced February 2022.

Comments: Research Quest 2021, Research Methods in Computer Science, University of Houston (RQ'21)

arXiv:2111.01097 [pdf, ps, other]

doi 10.1109/ICMLA55696.2022.00140

Code2Snapshot: Using Code Snapshots for Learning Representations of Source Code

Authors: Md Rafiqul Islam Rabin, Mohammad Amin Alipour

Abstract: There are several approaches for encoding source code in the input vectors of neural models. These approaches attempt to include various syntactic and semantic features of input programs in their encoding. In this paper, we investigate Code2Snapshot, a novel representation of the source code that is based on the snapshots of input programs. We evaluate several variations of this representation and… ▽ More There are several approaches for encoding source code in the input vectors of neural models. These approaches attempt to include various syntactic and semantic features of input programs in their encoding. In this paper, we investigate Code2Snapshot, a novel representation of the source code that is based on the snapshots of input programs. We evaluate several variations of this representation and compare its performance with state-of-the-art representations that utilize the rich syntactic and semantic features of input programs. Our preliminary study on the utility of Code2Snapshot in the code summarization and code classification tasks suggests that simple snapshots of input programs have comparable performance to state-of-the-art representations. Interestingly, obscuring input programs have insignificant impacts on the Code2Snapshot performance, suggesting that, for some tasks, neural models may provide high performance by relying merely on the structure of input programs. △ Less

Submitted 1 February, 2023; v1 submitted 1 November, 2021; originally announced November 2021.

Comments: The 21st IEEE International Conference on Machine Learning and Applications (ICMLA'22)

arXiv:2106.08704 [pdf, other]

doi 10.1016/j.infsof.2022.107066

Memorization and Generalization in Neural Code Intelligence Models

Authors: Md Rafiqul Islam Rabin, Aftab Hussain, Mohammad Amin Alipour, Vincent J. Hellendoorn

Abstract: Deep Neural Networks (DNNs) are increasingly being used in software engineering and code intelligence tasks. These are powerful tools that are capable of learning highly generalizable patterns from large datasets through millions of parameters. At the same time, their large capacity can render them prone to memorizing data points. Recent work suggests that the memorization risk manifests especiall… ▽ More Deep Neural Networks (DNNs) are increasingly being used in software engineering and code intelligence tasks. These are powerful tools that are capable of learning highly generalizable patterns from large datasets through millions of parameters. At the same time, their large capacity can render them prone to memorizing data points. Recent work suggests that the memorization risk manifests especially strongly when the training dataset is noisy, involving many ambiguous or questionable samples, and memorization is the only recourse. The goal of this paper is to evaluate and compare the extent of memorization and generalization in neural code intelligence models. It aims to provide insights on how memorization may impact the learning behavior of neural models in code intelligence systems. To observe the extent of memorization in models, we add random noise to the original training dataset and use various metrics to quantify the impact of noise on various aspects of training and testing. We evaluate several state-of-the-art neural code intelligence models and benchmarks based on Java, Python, and Ruby codebases. Our results highlight important risks: millions of trainable parameters allow the neural networks to memorize anything, including noisy data, and provide a false sense of generalization. We observed all models manifest some forms of memorization. This can be potentially troublesome in most code intelligence tasks where they rely on rather noise-prone and repetitive data sources, such as code from GitHub. To the best of our knowledge, we provide the first study to quantify memorization effects in the domain of software engineering and code intelligence systems. This work raises awareness and provides new insights into important issues of training neural models in code intelligence systems that are usually overlooked by software engineering researchers. △ Less

Submitted 12 September, 2022; v1 submitted 16 June, 2021; originally announced June 2021.

Comments: Information and Software Technology, IST Journal 2022, Elsevier

arXiv:2106.03353 [pdf, other]

doi 10.1145/3468264.3468539

Understanding Neural Code Intelligence Through Program Simplification

Authors: Md Rafiqul Islam Rabin, Vincent J. Hellendoorn, Mohammad Amin Alipour

Abstract: A wide range of code intelligence (CI) tools, powered by deep neural networks, have been developed recently to improve programming productivity and perform program analysis. To reliably use such tools, developers often need to reason about the behavior of the underlying models and the factors that affect them. This is especially challenging for tools backed by deep neural networks. Various methods… ▽ More A wide range of code intelligence (CI) tools, powered by deep neural networks, have been developed recently to improve programming productivity and perform program analysis. To reliably use such tools, developers often need to reason about the behavior of the underlying models and the factors that affect them. This is especially challenging for tools backed by deep neural networks. Various methods have tried to reduce this opacity in the vein of "transparent/interpretable-AI". However, these approaches are often specific to a particular set of network architectures, even requiring access to the network's parameters. This makes them difficult to use for the average programmer, which hinders the reliable adoption of neural CI systems. In this paper, we propose a simple, model-agnostic approach to identify critical input features for models in CI systems, by drawing on software debugging research, specifically delta debugging. Our approach, SIVAND, uses simplification techniques that reduce the size of input programs of a CI model while preserving the predictions of the model. We show that this approach yields remarkably small outputs and is broadly applicable across many model architectures and problem domains. We find that the models in our experiments often rely heavily on just a few syntactic features in input programs. We believe that SIVAND's extracted features may help understand neural CI systems' predictions and learned behavior. △ Less

Submitted 9 September, 2021; v1 submitted 7 June, 2021; originally announced June 2021.

Comments: The 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE'21)

arXiv:2012.10662 [pdf, other]

doi 10.1145/3412841.3442047

Configuring Test Generators using Bug Reports: A Case Study of GCC Compiler and Csmith

Authors: Md Rafiqul Islam Rabin, Mohammad Amin Alipour

Abstract: The correctness of compilers is instrumental in the safety and reliability of other software systems, as bugs in compilers can produce executables that do not reflect the intent of programmers. Such errors are difficult to identify and debug. Random test program generators are commonly used in testing compilers, and they have been effective in uncovering bugs. However, the problem of guiding these… ▽ More The correctness of compilers is instrumental in the safety and reliability of other software systems, as bugs in compilers can produce executables that do not reflect the intent of programmers. Such errors are difficult to identify and debug. Random test program generators are commonly used in testing compilers, and they have been effective in uncovering bugs. However, the problem of guiding these test generators to produce test programs that are more likely to find bugs remains challenging. In this paper, we use the code snippets in the bug reports to guide the test generation. The main idea of this work is to extract insights from the bug reports about the language features that are more prone to inadequate implementation and using the insights to guide the test generators. We use the GCC C compiler to evaluate the effectiveness of this approach. In particular, we first cluster the test programs in the GCC bugs reports based on their features. We then use the centroids of the clusters to compute configurations for Csmith, a popular test generator for C compilers. We evaluated this approach on eight versions of GCC and found that our approach provides higher coverage and triggers more miscompilation failures than the state-of-the-art test generation techniques for GCC. △ Less

Submitted 18 March, 2021; v1 submitted 19 December, 2020; originally announced December 2020.

Comments: The 36th ACM/SIGAPP Symposium on Applied Computing, Software Verification and Testing Track (SAC-SVT'21)

arXiv:2008.13064 [pdf, other]

doi 10.1145/3416506.3423580

Towards Demystifying Dimensions of Source Code Embeddings

Authors: Md Rafiqul Islam Rabin, Arjun Mukherjee, Omprakash Gnawali, Mohammad Amin Alipour

Abstract: Source code representations are key in applying machine learning techniques for processing and analyzing programs. A popular approach in representing source code is neural source code embeddings that represents programs with high-dimensional vectors computed by training deep neural networks on a large volume of programs. Although successful, there is little known about the contents of these vector… ▽ More Source code representations are key in applying machine learning techniques for processing and analyzing programs. A popular approach in representing source code is neural source code embeddings that represents programs with high-dimensional vectors computed by training deep neural networks on a large volume of programs. Although successful, there is little known about the contents of these vectors and their characteristics. In this paper, we present our preliminary results towards better understanding the contents of code2vec neural source code embeddings. In particular, in a small case study, we use the code2vec embeddings to create binary SVM classifiers and compare their performance with the handcrafted features. Our results suggest that the handcrafted features can perform very close to the highly-dimensional code2vec embeddings, and the information gains are more evenly distributed in the code2vec embeddings compared to the handcrafted features. We also find that the code2vec embeddings are more resilient to the removal of dimensions with low information gains than the handcrafted features. We hope our results serve a step** stone toward principled analysis and evaluation of these code representations. △ Less

Submitted 28 September, 2020; v1 submitted 29 August, 2020; originally announced August 2020.

Comments: 1st ACM SIGSOFT International Workshop on Representation Learning for Software Engineering and Program Languages, Co-located with ESEC/FSE (RL+SE&PL'20)

arXiv:2008.01566 [pdf, other]

doi 10.1016/j.infsof.2021.106552

On the Generalizability of Neural Program Models with respect to Semantic-Preserving Program Transformations

Authors: Md Rafiqul Islam Rabin, Nghi D. Q. Bui, Ke Wang, Yijun Yu, Lingxiao Jiang, Mohammad Amin Alipour

Abstract: With the prevalence of publicly available source code repositories to train deep neural network models, neural program models can do well in source code analysis tasks such as predicting method names in given programs that cannot be easily done by traditional program analysis techniques. Although such neural program models have been tested on various existing datasets, the extent to which they gen… ▽ More With the prevalence of publicly available source code repositories to train deep neural network models, neural program models can do well in source code analysis tasks such as predicting method names in given programs that cannot be easily done by traditional program analysis techniques. Although such neural program models have been tested on various existing datasets, the extent to which they generalize to unforeseen source code is largely unknown. Since it is very challenging to test neural program models on all unforeseen programs, in this paper, we propose to evaluate the generalizability of neural program models with respect to semantic-preserving transformations: a generalizable neural program model should perform equally well on programs that are of the same semantics but of different lexical appearances and syntactical structures. We compare the results of various neural program models for the method name prediction task on programs before and after automated semantic-preserving transformations. We use three Java datasets of different sizes and three state-of-the-art neural network models for code, namely code2vec, code2seq, and GGNN, to build nine such neural program models for evaluation. Our results show that even with small semantically preserving changes to the programs, these neural program models often fail to generalize their performance. Our results also suggest that neural program models based on data and control dependencies in programs generalize better than neural program models based only on abstract syntax trees. On the positive side, we observe that as the size of the training dataset grows and diversifies the generalizability of correct predictions produced by the neural program models can be improved too. Our results on the generalizability of neural program models provide insights to measure their limitations and provide a step** stone for their improvement. △ Less

Submitted 18 March, 2021; v1 submitted 31 July, 2020; originally announced August 2020.

Comments: Information and Software Technology, IST Journal 2021, Elsevier. Related to arXiv:2004.07313

arXiv:2006.00804 [pdf, other]

COVID-19: Social Media Sentiment Analysis on Reopening

Authors: Mohammed Emtiaz Ahmed, Md Rafiqul Islam Rabin, Farah Naz Chowdhury

Abstract: The novel coronavirus (COVID-19) pandemic is the most talked topic in social media platforms in 2020. People are using social media such as Twitter to express their opinion and share information on a number of issues related to the COVID-19 in this stay at home order. In this paper, we investigate the sentiment and emotion of peoples in the United States on the subject of reopening. We choose the… ▽ More The novel coronavirus (COVID-19) pandemic is the most talked topic in social media platforms in 2020. People are using social media such as Twitter to express their opinion and share information on a number of issues related to the COVID-19 in this stay at home order. In this paper, we investigate the sentiment and emotion of peoples in the United States on the subject of reopening. We choose the social media platform Twitter for our analysis and study the Tweets to discover the sentimental perspective, emotional perspective, and triggering words towards the reopening. During this COVID-19 pandemic, researchers have made some analysis on various social media dataset regarding lockdown and stay at home. However, in our analysis, we are particularly interested to analyse public sentiment on reopening. Our major finding is that when all states resorted to lockdown in March, people showed dominant emotion of fear, but as reopening starts people have less fear. While this may be true, due to this reopening phase daily positive cases are rising compared to the lockdown situation. Overall, people have a less negative sentiment towards the situation of reopening. △ Less

Submitted 1 June, 2020; originally announced June 2020.

Comments: 8 pages, 4 figures, 1 table

arXiv:2004.07313 [pdf, other]

Evaluation of Generalizability of Neural Program Analyzers under Semantic-Preserving Transformations

Authors: Md Rafiqul Islam Rabin, Mohammad Amin Alipour

Abstract: The abundance of publicly available source code repositories, in conjunction with the advances in neural networks, has enabled data-driven approaches to program analysis. These approaches, called neural program analyzers, use neural networks to extract patterns in the programs for tasks ranging from development productivity to program reasoning. Despite the growing popularity of neural program ana… ▽ More The abundance of publicly available source code repositories, in conjunction with the advances in neural networks, has enabled data-driven approaches to program analysis. These approaches, called neural program analyzers, use neural networks to extract patterns in the programs for tasks ranging from development productivity to program reasoning. Despite the growing popularity of neural program analyzers, the extent to which their results are generalizable is unknown. In this paper, we perform a large-scale evaluation of the generalizability of two popular neural program analyzers using seven semantically-equivalent transformations of programs. Our results caution that in many cases the neural program analyzers fail to generalize well, sometimes to programs with negligible textual differences. The results provide the initial step** stones for quantifying robustness in neural program analyzers. △ Less

Submitted 18 March, 2021; v1 submitted 15 April, 2020; originally announced April 2020.

Comments: Related to arXiv:2008.01566

arXiv:1908.10711 [pdf, other]

Testing Neural Program Analyzers

Authors: Md Rafiqul Islam Rabin, Ke Wang, Mohammad Amin Alipour

Abstract: Deep neural networks have been increasingly used in software engineering and program analysis tasks. They usually take a program and make some predictions about it, e.g., bug prediction. We call these models neural program analyzers. The reliability of neural programs can impact the reliability of the encompassing analyses. In this paper, we describe our ongoing efforts to develop effective techni… ▽ More Deep neural networks have been increasingly used in software engineering and program analysis tasks. They usually take a program and make some predictions about it, e.g., bug prediction. We call these models neural program analyzers. The reliability of neural programs can impact the reliability of the encompassing analyses. In this paper, we describe our ongoing efforts to develop effective techniques for testing neural programs. We discuss the challenges involved in develo** such tools and our future plans. In our preliminary experiment on a neural model recently proposed in the literature, we found that the model is very brittle, and simple perturbations in the input can cause the model to make mistakes in its prediction. △ Less

Submitted 25 September, 2019; v1 submitted 25 August, 2019; originally announced August 2019.

Comments: ASE 2019 Late Breaking Results

arXiv:1908.10481 [pdf, other]

K-CONFIG: Using Failing Test Cases to Generate Test Cases in GCC Compilers

Authors: Md Rafiqul Islam Rabin, Mohammad Amin Alipour

Abstract: The correctness of compilers is instrumental in the safety and reliability of other software systems, as bugs in compilers can produce programs that do not reflect the intents of programmers. Compilers are complex software systems due to the complexity of optimization. GCC is an optimizing C compiler that has been used in building operating systems and many other system software. In this paper, we… ▽ More The correctness of compilers is instrumental in the safety and reliability of other software systems, as bugs in compilers can produce programs that do not reflect the intents of programmers. Compilers are complex software systems due to the complexity of optimization. GCC is an optimizing C compiler that has been used in building operating systems and many other system software. In this paper, we describe K-CONFIG, an approach that uses the bugs reported in the GCC repository to generate new test inputs. Our main insight is that the features appearing in the bug reports are likely to reappear in the future bugs, as the bugfixes can be incomplete or those features may be inherently challenging to implement hence more prone to errors. Our approach first clusters the failing test input extracted from the bug reports into clusters of similar test inputs. It then uses these clusters to create configurations for Csmith, the most popular test generator for C compilers. In our experiments on two versions of GCC, our approach could trigger up to 36 miscompilation failures, and 179 crashes, while Csmith with the default configuration did not trigger any failures. This work signifies the benefits of analyzing and using the reported bugs in the generation of new test inputs. △ Less

Submitted 27 August, 2019; originally announced August 2019.

Comments: ASE 2019 Late Breaking Results

Showing 1–23 of 23 results for author: Rabin, R