Showing 1–23 of 23 results for author: Rabin, R

  1. Measuring Impacts of Poisoning on Model Parameters and Embeddings for Large Language Models of Code

    Authors: Aftab Hussain, Md Rafiqul Islam Rabin, Mohammad Amin Alipour

    Abstract: Large language models (LLMs) have revolutionized software development practices, yet concerns about their safety have arisen, particularly regarding hidden backdoors, aka trojans. Backdoor attacks involve the insertion of triggers into training data, allowing attackers to manipulate the behavior of the model maliciously. In this paper, we focus on analyzing the model parameters to detect potential…

    Submitted 19 May, 2024; originally announced May 2024.

    Comments: This work has been accepted at the 1st ACM International Conference on AI-powered Software (AIware), co-located with the ACM International Conference on the Foundations of Software Engineering (FSE) 2024, Porto de Galinhas, Brazil. arXiv admin note: substantial text overlap with arXiv:2402.12936

  2. arXiv:2405.02828  [pdf, other]

    cs.SE cs.LG

    Trojans in Large Language Models of Code: A Critical Review through a Trigger-Based Taxonomy

    Authors: Aftab Hussain, Md Rafiqul Islam Rabin, Toufique Ahmed, Bowen Xu, Premkumar Devanbu, Mohammad Amin Alipour

    Abstract: Large language models (LLMs) have provided a lot of exciting new capabilities in software development. However, the opaque nature of these models makes them difficult to reason about and inspect. Their opacity gives rise to potential security risks, as adversaries can train and deploy compromised models to disrupt the software development process in the victims' organization. This work presents…

    Submitted 5 May, 2024; originally announced May 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2305.03803

  3. arXiv:2402.16896  [pdf, other]

    cs.CR cs.LG cs.SE

    On Trojan Signatures in Large Language Models of Code

    Authors: Aftab Hussain, Md Rafiqul Islam Rabin, Mohammad Amin Alipour

    Abstract: Trojan signatures, as described by Fields et al. (2021), are noticeable differences in the distribution of the trojaned class parameters (weights) and the non-trojaned class parameters of the trojaned model, that can be used to detect the trojaned model. Fields et al. (2021) found trojan signatures in computer vision classification tasks with image models, such as, Resnet, WideResnet, Densenet, an…

    Submitted 7 March, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

    Comments: This work has been accepted at the International Conference on Learning Representations 2024 Workshop on Secure and Trustworthy Large Language Models, SeT LLM @ ICLR 2024 (Vienna, Austria)

  4. arXiv:2402.12936  [pdf, other]

    cs.SE

    Measuring Impacts of Poisoning on Model Parameters and Neuron Activations: A Case Study of Poisoning CodeBERT

    Authors: Aftab Hussain, Md Rafiqul Islam Rabin, Navid Ayoobi, Mohammad Amin Alipour

    Abstract: Large language models (LLMs) have revolutionized software development practices, yet concerns about their safety have arisen, particularly regarding hidden backdoors, aka trojans. Backdoor attacks involve the insertion of triggers into training data, allowing attackers to manipulate the behavior of the model maliciously. In this paper, we focus on analyzing the model parameters to detect potential…

    Submitted 5 March, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

  5. arXiv:2402.02047  [pdf, other]

    cs.SE cs.LG

    Calibration and Correctness of Language Models for Code

    Authors: Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Amin Alipour, Susmit Jha, Prem Devanbu, Toufique Ahmed

    Abstract: Machine learning models are widely used but can also often be wrong. Users would benefit from a reliable indication of whether a given output from a given model should be trusted, so a rational decision can be made whether to use the output or not. For example, outputs can be associated with a confidence measure; if this confidence measure is strongly associated with likelihood of correctness, the…

    Submitted 16 February, 2024; v1 submitted 3 February, 2024; originally announced February 2024.

  6. arXiv:2312.04004  [pdf, other]

    cs.SE

    Occlusion-based Detection of Trojan-triggering Inputs in Large Language Models of Code

    Authors: Aftab Hussain, Md Rafiqul Islam Rabin, Toufique Ahmed, Mohammad Amin Alipour, Bowen Xu

    Abstract: Large language models (LLMs) are becoming an integrated part of software development. These models are trained on large datasets for code, where it is hard to verify each data point. Therefore, a potential attack surface can be to inject poisonous data into the training data to make models vulnerable, aka trojaned. It can pose a significant threat by hiding manipulative behaviors inside models, le…

    Submitted 10 December, 2023; v1 submitted 6 December, 2023; originally announced December 2023.

  7. arXiv:2311.14850  [pdf, other]

    cs.SE

    TrojanedCM: A Repository of Trojaned Large Language Models of Code

    Authors: Aftab Hussain, Md Rafiqul Islam Rabin, Mohammad Amin Alipour

    Abstract: With the rapid growth of research in trojaning deep neural models of source code, we observe that there is a need of developing a benchmark trojaned models for testing various trojan detection and unlearning techniques. In this work, we aim to provide the scientific community with diverse trojaned code models, that cover a variety of state-of-the-art architectures, on which they can examine such t…

    Submitted 11 December, 2023; v1 submitted 24 November, 2023; originally announced November 2023.

  8. arXiv:2307.03319  [pdf, other]

    cs.CL

    Covering Uncommon Ground: Gap-Focused Question Generation for Answer Assessment

    Authors: Roni Rabin, Alexandre Djerbetian, Roee Engelberg, Lidan Hackmon, Gal Elidan, Reut Tsarfaty, Amir Globerson

    Abstract: Human communication often involves information gaps between the interlocutors. For example, in an educational dialogue, a student often provides an answer that is incomplete, and there is a gap between this answer and the perfect one expected by the teacher. Successful dialogue then hinges on the teacher asking about this gap in an effective manner, thus creating a rich and interactive educational…

    Submitted 6 July, 2023; originally announced July 2023.

  9. arXiv:2305.03803  [pdf, other]

    cs.SE

    A Survey of Trojans in Neural Models of Source Code: Taxonomy and Techniques

    Authors: Aftab Hussain, Md Rafiqul Islam Rabin, Toufique Ahmed, Navid Ayoobi, Bowen Xu, Prem Devanbu, Mohammad Amin Alipour

    Abstract: In this work, we study literature in Explainable AI and Safe AI to understand poisoning of neural models of code. In order to do so, we first establish a novel taxonomy for Trojan AI for code, and present a new aspect-based classification of triggers in neural models of code. Next, we highlight recent works that help us deepen our conception of how these models understand software code. Then we pi…

    Submitted 18 April, 2024; v1 submitted 5 May, 2023; originally announced May 2023.

  10. arXiv:2303.04942  [pdf, other]

    cs.LG cs.PL cs.SE

    A Study of Variable-Role-based Feature Enrichment in Neural Models of Code

    Authors: Aftab Hussain, Md Rafiqul Islam Rabin, Bowen Xu, David Lo, Mohammad Amin Alipour

    Abstract: Although deep neural models substantially reduce the overhead of feature engineering, the features readily available in the inputs might significantly impact training cost and the performance of the models. In this paper, we explore the impact of an unsupervised feature enrichment approach based on variable roles on the performance of neural models of code. The notion of variable roles (as introdu…

    Submitted 12 March, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

    Comments: Accepted in the 1st International Workshop on Interpretability and Robustness in Neural Software Engineering (InteNSE'23), Co-located with ICSE

  11. Study of Distractors in Neural Models of Code

    Authors: Md Rafiqul Islam Rabin, Aftab Hussain, Sahil Suneja, Mohammad Amin Alipour

    Abstract: Finding important features that contribute to the prediction of neural models is an active area of research in explainable AI. Neural models are opaque and finding such features sheds light on a better understanding of their predictions. In contrast, in this work, we present an inverse perspective of distractor features: features that cast doubt about the prediction by affecting the model's confid…

    Submitted 3 March, 2023; originally announced March 2023.

    Comments: The 1st International Workshop on Interpretability and Robustness in Neural Software Engineering, Co-located with ICSE (InteNSE'23)

  12. arXiv:2205.14374  [pdf, other]

    cs.SE cs.LG cs.PL

    Syntax-Guided Program Reduction for Understanding Neural Code Intelligence Models

    Authors: Md Rafiqul Islam Rabin, Aftab Hussain, Mohammad Amin Alipour

    Abstract: Neural code intelligence (CI) models are opaque black-boxes and offer little insight on the features they use in making predictions. This opacity may lead to distrust in their prediction and hamper their wider adoption in safety-critical applications. Recently, input program reduction techniques have been proposed to identify key features in the input programs to improve the transparency of CI mod…

    Submitted 14 June, 2022; v1 submitted 28 May, 2022; originally announced May 2022.

    Comments: The 6th ACM SIGPLAN International Symposium on Machine Programming (MAPS'22); Related to arXiv:2202.06474

  13. arXiv:2202.06474  [pdf, other]

    cs.SE cs.LG cs.PL

    Extracting Label-specific Key Input Features for Neural Code Intelligence Models

    Authors: Md Rafiqul Islam Rabin

    Abstract: The code intelligence (CI) models are often black-box and do not offer any insights on the input features that they learn for making correct predictions. This opacity may lead to distrust in their prediction and hamper their wider adoption in safety-critical applications. In recent, the program reduction technique is widely being used to identify key input features in order to explain the predicti…

    Submitted 13 February, 2022; originally announced February 2022.

    Comments: Research Quest 2021, Research Methods in Computer Science, University of Houston (RQ'21)

  14. Code2Snapshot: Using Code Snapshots for Learning Representations of Source Code

    Authors: Md Rafiqul Islam Rabin, Mohammad Amin Alipour

    Abstract: There are several approaches for encoding source code in the input vectors of neural models. These approaches attempt to include various syntactic and semantic features of input programs in their encoding. In this paper, we investigate Code2Snapshot, a novel representation of the source code that is based on the snapshots of input programs. We evaluate several variations of this representation and…

    Submitted 1 February, 2023; v1 submitted 1 November, 2021; originally announced November 2021.

    Comments: The 21st IEEE International Conference on Machine Learning and Applications (ICMLA'22)

  15. Memorization and Generalization in Neural Code Intelligence Models

    Authors: Md Rafiqul Islam Rabin, Aftab Hussain, Mohammad Amin Alipour, Vincent J. Hellendoorn

    Abstract: Deep Neural Networks (DNNs) are increasingly being used in software engineering and code intelligence tasks. These are powerful tools that are capable of learning highly generalizable patterns from large datasets through millions of parameters. At the same time, their large capacity can render them prone to memorizing data points. Recent work suggests that the memorization risk manifests especiall…

    Submitted 12 September, 2022; v1 submitted 16 June, 2021; originally announced June 2021.

    Comments: Information and Software Technology, IST Journal 2022, Elsevier

  16. arXiv:2106.03353  [pdf, other]

    cs.SE cs.LG cs.PL

    Understanding Neural Code Intelligence Through Program Simplification

    Authors: Md Rafiqul Islam Rabin, Vincent J. Hellendoorn, Mohammad Amin Alipour

    Abstract: A wide range of code intelligence (CI) tools, powered by deep neural networks, have been developed recently to improve programming productivity and perform program analysis. To reliably use such tools, developers often need to reason about the behavior of the underlying models and the factors that affect them. This is especially challenging for tools backed by deep neural networks. Various methods…

    Submitted 9 September, 2021; v1 submitted 7 June, 2021; originally announced June 2021.

    Comments: The 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE'21)

  17. arXiv:2012.10662  [pdf, other]

    cs.SE cs.LG cs.PL

    Configuring Test Generators using Bug Reports: A Case Study of GCC Compiler and Csmith

    Authors: Md Rafiqul Islam Rabin, Mohammad Amin Alipour

    Abstract: The correctness of compilers is instrumental in the safety and reliability of other software systems, as bugs in compilers can produce executables that do not reflect the intent of programmers. Such errors are difficult to identify and debug. Random test program generators are commonly used in testing compilers, and they have been effective in uncovering bugs. However, the problem of guiding these…

    Submitted 18 March, 2021; v1 submitted 19 December, 2020; originally announced December 2020.

    Comments: The 36th ACM/SIGAPP Symposium on Applied Computing, Software Verification and Testing Track (SAC-SVT'21)

  18. arXiv:2008.13064  [pdf, other]

    cs.LG cs.PL cs.SE stat.ML

    Towards Demystifying Dimensions of Source Code Embeddings

    Authors: Md Rafiqul Islam Rabin, Arjun Mukherjee, Omprakash Gnawali, Mohammad Amin Alipour

    Abstract: Source code representations are key in applying machine learning techniques for processing and analyzing programs. A popular approach in representing source code is neural source code embeddings that represents programs with high-dimensional vectors computed by training deep neural networks on a large volume of programs. Although successful, there is little known about the contents of these vector…

    Submitted 28 September, 2020; v1 submitted 29 August, 2020; originally announced August 2020.

    Comments: 1st ACM SIGSOFT International Workshop on Representation Learning for Software Engineering and Program Languages, Co-located with ESEC/FSE (RL+SE&PL'20)

  19. On the Generalizability of Neural Program Models with respect to Semantic-Preserving Program Transformations

    Authors: Md Rafiqul Islam Rabin, Nghi D. Q. Bui, Ke Wang, Yijun Yu, Lingxiao Jiang, Mohammad Amin Alipour

    Abstract: With the prevalence of publicly available source code repositories to train deep neural network models, neural program models can do well in source code analysis tasks such as predicting method names in given programs that cannot be easily done by traditional program analysis techniques. Although such neural program models have been tested on various existing datasets, the extent to which they gen…

    Submitted 18 March, 2021; v1 submitted 31 July, 2020; originally announced August 2020.

    Comments: Information and Software Technology, IST Journal 2021, Elsevier. Related to arXiv:2004.07313

  20. arXiv:2006.00804  [pdf, other]

    cs.SI cs.CL cs.IR

    COVID-19: Social Media Sentiment Analysis on Reopening

    Authors: Mohammed Emtiaz Ahmed, Md Rafiqul Islam Rabin, Farah Naz Chowdhury

    Abstract: The novel coronavirus (COVID-19) pandemic is the most talked topic in social media platforms in 2020. People are using social media such as Twitter to express their opinion and share information on a number of issues related to the COVID-19 in this stay at home order. In this paper, we investigate the sentiment and emotion of peoples in the United States on the subject of reopening. We choose the…

    Submitted 1 June, 2020; originally announced June 2020.

    Comments: 8 pages, 4 figures, 1 table

  21. arXiv:2004.07313  [pdf, other]

    cs.SE cs.LG cs.PL

    Evaluation of Generalizability of Neural Program Analyzers under Semantic-Preserving Transformations

    Authors: Md Rafiqul Islam Rabin, Mohammad Amin Alipour

    Abstract: The abundance of publicly available source code repositories, in conjunction with the advances in neural networks, has enabled data-driven approaches to program analysis. These approaches, called neural program analyzers, use neural networks to extract patterns in the programs for tasks ranging from development productivity to program reasoning. Despite the growing popularity of neural program ana…

    Submitted 18 March, 2021; v1 submitted 15 April, 2020; originally announced April 2020.

    Comments: Related to arXiv:2008.01566

  22. arXiv:1908.10711  [pdf, other]

    cs.LG cs.PL cs.SE stat.ML

    Testing Neural Program Analyzers

    Authors: Md Rafiqul Islam Rabin, Ke Wang, Mohammad Amin Alipour

    Abstract: Deep neural networks have been increasingly used in software engineering and program analysis tasks. They usually take a program and make some predictions about it, e.g., bug prediction. We call these models neural program analyzers. The reliability of neural programs can impact the reliability of the encompassing analyses. In this paper, we describe our ongoing efforts to develop effective techni…

    Submitted 25 September, 2019; v1 submitted 25 August, 2019; originally announced August 2019.

    Comments: ASE 2019 Late Breaking Results

  23. arXiv:1908.10481  [pdf, other]

    cs.SE cs.LG cs.PL

    K-CONFIG: Using Failing Test Cases to Generate Test Cases in GCC Compilers

    Authors: Md Rafiqul Islam Rabin, Mohammad Amin Alipour

    Abstract: The correctness of compilers is instrumental in the safety and reliability of other software systems, as bugs in compilers can produce programs that do not reflect the intents of programmers. Compilers are complex software systems due to the complexity of optimization. GCC is an optimizing C compiler that has been used in building operating systems and many other system software. In this paper, we…

    Submitted 27 August, 2019; originally announced August 2019.

    Comments: ASE 2019 Late Breaking Results