Showing 1–15 of 15 results for author: Kaiser, G

  1. arXiv:2406.01006  [pdf, other]

    cs.CL cs.AI cs.SE

    SemCoder: Training Code Language Models with Comprehensive Semantics

    Authors: Yangruibo Ding, Jinjun Peng, Marcus J. Min, Gail Kaiser, Junfeng Yang, Baishakhi Ray

    Abstract: Code Large Language Models (Code LLMs) have excelled at tasks like code completion but often miss deeper semantics such as execution effects and dynamic states. This paper aims to bridge the gap between Code LLMs' reliance on static text data and the need for thorough semantic understanding for complex tasks like debugging and program repair. We introduce a novel strategy to train Code LLMs with c…

    Submitted 3 June, 2024; originally announced June 2024.

  2. arXiv:2403.18746  [pdf, other]

    cs.SE cs.CL

    CYCLE: Learning to Self-Refine the Code Generation

    Authors: Yangruibo Ding, Marcus J. Min, Gail Kaiser, Baishakhi Ray

    Abstract: Pre-trained code language models have achieved promising performance in code generation and improved the programming efficiency of human developers. However, their self-refinement capability is typically overlooked by the existing evaluations of code LMs, which focus only on the accuracy of the one-time prediction. For the cases when code LMs fail to implement the correct program, developers actua…

    Submitted 27 March, 2024; originally announced March 2024.

    Comments: Camera-ready for OOPSLA'24

  3. arXiv:2310.20067  [pdf, other]

    cs.CR cs.AI

    Vignat: Vulnerability identification by learning code semantics via graph attention networks

    Authors: Shuo Liu, Gail Kaiser

    Abstract: Vulnerability identification is crucial to protect software systems from attacks for cyber-security. However, huge projects have more than millions of lines of code, and the complex dependencies make it hard to carry out traditional static and dynamic methods. Furthermore, the semantic structure of various types of vulnerabilities differs greatly and may occur simultaneously, making general rule-b…

    Submitted 30 October, 2023; originally announced October 2023.

  4. arXiv:2310.14053  [pdf, other]

    cs.LG cs.CL cs.SE

    Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain

    Authors: Marcus J. Min, Yangruibo Ding, Luca Buratti, Saurabh Pujar, Gail Kaiser, Suman Jana, Baishakhi Ray

    Abstract: Code Large Language Models (Code LLMs) are being increasingly employed in real-life applications, so evaluating them is critical. While the conventional accuracy evaluates the performance of Code LLMs on a set of individual tasks, their self-consistency across different tasks is overlooked. Intuitively, a trustworthy model should be self-consistent when generating natural language specifications f…

    Submitted 26 February, 2024; v1 submitted 21 October, 2023; originally announced October 2023.

    Comments: ICLR 2024

    MSC Class: 68

    ACM Class: I.2; D.2

  5. arXiv:2308.05118  [pdf]

    q-bio.GN cs.LG

    Vector Embeddings by Sequence Similarity and Context for Improved Compression, Similarity Search, Clustering, Organization, and Manipulation of cDNA Libraries

    Authors: Daniel H. Um, David A. Knowles, Gail E. Kaiser

    Abstract: This paper demonstrates the utility of organized numerical representations of genes in research involving flat string gene formats (i.e., FASTA/FASTQ5). FASTA/FASTQ files have several current limitations, such as their large file sizes, slow processing speeds for mapping and alignment, and contextual dependencies. These challenges significantly hinder investigations and tasks that involve finding…

    Submitted 8 August, 2023; originally announced August 2023.

    Comments: 15 pages, 8 figures

  6. arXiv:2306.07487  [pdf, other]

    cs.SE

    TRACED: Execution-aware Pre-training for Source Code

    Authors: Yangruibo Ding, Ben Steenhoek, Kexin Pei, Gail Kaiser, Wei Le, Baishakhi Ray

    Abstract: Most existing pre-trained language models for source code focus on learning the static code text, typically augmented with static code structures (abstract syntax tree, dependency graphs, etc.). However, program semantics will not be fully exposed before the real execution. Without an understanding of the program execution, statically pre-trained models fail to comprehensively capture the dynamic…

    Submitted 12 June, 2023; originally announced June 2023.

    Comments: Accepted by ICSE 2024 (Early Cycle). Camera-ready is in preparation

  7. arXiv:2306.03234  [pdf, other]

    cs.SE

    CONCORD: Clone-aware Contrastive Learning for Source Code

    Authors: Yangruibo Ding, Saikat Chakraborty, Luca Buratti, Saurabh Pujar, Alessandro Morari, Gail Kaiser, Baishakhi Ray

    Abstract: Deep Learning (DL) models to analyze source code have shown immense promise during the past few years. More recently, self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks, such as clone and bug detection. While previous work successfully learned from different code abstractions (e.g., token, AST, graph), we argue that it…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: Camera-ready for 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 23)

  8. arXiv:2305.03843  [pdf, other]

    cs.SE cs.AI cs.PL

    REINFOREST: Reinforcing Semantic Code Similarity for Cross-Lingual Code Search Models

    Authors: Anthony Saieva, Saikat Chakraborty, Gail Kaiser

    Abstract: This paper introduces a novel code-to-code search technique that enhances the performance of Large Language Models (LLMs) by including both static and dynamic features as well as utilizing both similar and dissimilar examples during training. We present the first-ever code search method that encodes dynamic runtime information during training without the need to execute either the corpus under sea…

    Submitted 15 April, 2024; v1 submitted 5 May, 2023; originally announced May 2023.

  9. arXiv:2112.10893  [pdf, other]

    cs.SE cs.LG

    VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements

    Authors: Yangruibo Ding, Sahil Suneja, Yunhui Zheng, Jim Laredo, Alessandro Morari, Gail Kaiser, Baishakhi Ray

    Abstract: Automatically locating vulnerable statements in source code is crucial to assure software security and alleviate developers' debugging efforts. This becomes even more important in today's software ecosystem, where vulnerable code can flow easily and unwittingly within and across software repositories like GitHub. Across such millions of lines of code, traditional static and dynamic approaches stru…

    Submitted 12 January, 2022; v1 submitted 20 December, 2021; originally announced December 2021.

    Comments: Camera Ready for Research Track of 29th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 2022)

  10. arXiv:2109.06126  [pdf, other]

    cs.SE cs.LG cs.NE cs.RO

    Neural Network Guided Evolutionary Fuzzing for Finding Traffic Violations of Autonomous Vehicles

    Authors: Ziyuan Zhong, Gail Kaiser, Baishakhi Ray

    Abstract: Self-driving cars and trucks, autonomous vehicles (AVs), should not be accepted by regulatory bodies and the public until they have much higher confidence in their safety and reliability -- which can most practically and convincingly be achieved by testing. But existing testing methods are inadequate for checking the end-to-end behaviors of AV controllers against complex, real-world corner cases i…

    Submitted 21 July, 2022; v1 submitted 13 September, 2021; originally announced September 2021.

  11. arXiv:2104.13295  [pdf, other]

    cs.CR cs.SE

    Metamorphic Detection of Repackaged Malware

    Authors: Shirish Singh, Gail Kaiser

    Abstract: Machine learning-based malware detection systems are often vulnerable to evasion attacks, in which a malware developer manipulates their malicious software such that it is misclassified as benign. Such software hides some properties of the real class or adopts some properties of a different class by applying small perturbations. A special case of evasive malware hides by repackaging a bonafide ben…

    Submitted 27 April, 2021; originally announced April 2021.

  12. arXiv:2004.05249  [pdf, other]

    cs.SE cs.LG cs.PL

    Sequence Model Design for Code Completion in the Modern IDE

    Authors: Gareth Ari Aye, Gail E. Kaiser

    Abstract: Code completion plays a prominent role in modern integrated development environments (IDEs). Machine learning has become ubiquitous in analogous natural language writing and search software, surfacing more relevant autocompletions and search suggestions in fewer keystrokes. Prior research has reported training high-accuracy, deep neural networks for modeling source code, but little attention has b…

    Submitted 10 April, 2020; originally announced April 2020.

  13. arXiv:1905.07831  [pdf, other]

    cs.SE cs.CV cs.LG

    Testing DNN Image Classifiers for Confusion & Bias Errors

    Authors: Yuchi Tian, Ziyuan Zhong, Vicente Ordonez, Gail Kaiser, Baishakhi Ray

    Abstract: Image classifiers are an important component of today's software, from consumer and business applications to safety-critical domains. The advent of Deep Neural Networks (DNNs) is the key catalyst behind such wide-spread success. However, wide adoption comes with serious concerns about the robustness of software systems dependent on DNNs for image classification, as several severe erroneous behavio…

    Submitted 11 February, 2020; v1 submitted 19 May, 2019; originally announced May 2019.

  14. arXiv:1808.02911  [pdf, other]

    cs.SE cs.IR

    A Case Study on the Impact of Similarity Measure on Information Retrieval based Software Engineering Tasks

    Authors: Md Masudur Rahman, Saikat Chakraborty, Gail Kaiser, Baishakhi Ray

    Abstract: Information Retrieval (IR) plays a pivotal role in diverse Software Engineering (SE) tasks, e.g., bug localization and triaging, code retrieval, requirements analysis, etc. The choice of similarity measure is the core component of an IR technique. The performance of any IR method critically depends on selecting an appropriate similarity measure for the given application domain. Since different SE…

    Submitted 8 August, 2018; originally announced August 2018.

    Comments: 22 pages, on submission

  15. arXiv:1806.02432  [pdf, other]

    cs.SE cs.CR

    Obfuscation Resilient Search through Executable Classification

    Authors: Fang-Hsiang Su, Jonathan Bell, Gail Kaiser, Baishakhi Ray

    Abstract: Android applications are usually obfuscated before release, making it difficult to analyze them for malware presence or intellectual property violations. Obfuscators might hide the true intent of code by renaming variables and/or modifying program structures. It is challenging to search for executables relevant to an obfuscated application for developers to analyze efficiently. Prior approaches to…

    Submitted 11 June, 2018; v1 submitted 6 June, 2018; originally announced June 2018.

    Comments: MAPL, 2018 (Workshop co-located with PLDI 2018)