
Showing 1–12 of 12 results for author: Pujar, S

  1. arXiv:2402.17442  [pdf, other]

    cs.SE cs.AI cs.PL

    Ansible Lightspeed: A Code Generation Service for IT Automation

    Authors: Priyam Sahoo, Saurabh Pujar, Ganesh Nalawade, Richard Gebhardt, Louis Mandel, Luca Buratti

    Abstract: The availability of Large Language Models (LLMs) that can generate code has made it possible to create tools that improve developer productivity. Integrated development environments (IDEs), which developers use to write software, are often used as an interface to interact with LLMs. Although many such tools have been released, almost all of them focus on general-purpose programming languages. Dom…

    Submitted 27 February, 2024; originally announced February 2024.

  2. arXiv:2312.12575  [pdf, other]

    cs.CR

    LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks

    Authors: Saad Ullah, Mingji Han, Saurabh Pujar, Hammond Pearce, Ayse Coskun, Gianluca Stringhini

    Abstract: Large Language Models (LLMs) have been suggested for use in automated vulnerability repair, but benchmarks showing they can consistently identify security-related bugs are lacking. We thus develop SecLLMHolmes, a fully automated evaluation framework that performs the most detailed investigation to date on whether LLMs can reliably identify and reason about security-related bugs. We construct a set…

    Submitted 13 April, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

    Comments: Accepted for publication in IEEE Symposium on Security and Privacy 2024

  3. arXiv:2310.16937  [pdf, other]

    cs.CL

    Learning Transfers over Several Programming Languages

    Authors: Razan Baltaji, Saurabh Pujar, Louis Mandel, Martin Hirzel, Luca Buratti, Lav Varshney

    Abstract: Large language models (LLMs) have become remarkably good at improving developer productivity for high-resource programming languages. These models use two kinds of data: large amounts of unlabeled code samples for pre-training and relatively smaller amounts of labeled code samples for fine-tuning or in-context learning. Unfortunately, many programming languages are low-resource, lacking labeled sa…

    Submitted 25 March, 2024; v1 submitted 25 October, 2023; originally announced October 2023.

    Comments: 15 pages, 9 figures, 8 tables

    ACM Class: I.2.7; I.2.5

  4. arXiv:2310.14053  [pdf, other]

    cs.LG cs.CL cs.SE

    Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain

    Authors: Marcus J. Min, Yangruibo Ding, Luca Buratti, Saurabh Pujar, Gail Kaiser, Suman Jana, Baishakhi Ray

    Abstract: Code Large Language Models (Code LLMs) are being increasingly employed in real-life applications, so evaluating them is critical. While the conventional accuracy evaluates the performance of Code LLMs on a set of individual tasks, their self-consistency across different tasks is overlooked. Intuitively, a trustworthy model should be self-consistent when generating natural language specifications f…

    Submitted 26 February, 2024; v1 submitted 21 October, 2023; originally announced October 2023.

    Comments: ICLR 2024

    MSC Class: 68 ACM Class: I.2; D.2

  5. arXiv:2306.03234  [pdf, other]

    cs.SE

    CONCORD: Clone-aware Contrastive Learning for Source Code

    Authors: Yangruibo Ding, Saikat Chakraborty, Luca Buratti, Saurabh Pujar, Alessandro Morari, Gail Kaiser, Baishakhi Ray

    Abstract: Deep Learning (DL) models to analyze source code have shown immense promise during the past few years. More recently, self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks, such as clone and bug detection. While previous work successfully learned from different code abstractions (e.g., token, AST, graph), we argue that it…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: Camera-ready for 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 23)

  6. arXiv:2305.02783  [pdf, ps, other]

    cs.SE cs.AI cs.CL cs.PL

    Automated Code generation for Information Technology Tasks in YAML through Large Language Models

    Authors: Saurabh Pujar, Luca Buratti, Xiaojie Guo, Nicolas Dupuis, Burn Lewis, Sahil Suneja, Atin Sood, Ganesh Nalawade, Matthew Jones, Alessandro Morari, Ruchir Puri

    Abstract: The recent improvement in code generation capabilities due to the use of large language models has mainly benefited general purpose programming languages. Domain specific languages, such as the ones used for IT Automation, have received far less attention, despite involving many active developers and being an essential component of modern cloud platforms. This work focuses on the generation of Ans…

    Submitted 23 May, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

  7. arXiv:2303.13996  [pdf]

    q-bio.GN q-bio.QM

    The status of the human gene catalogue

    Authors: Paulo Amaral, Silvia Carbonell-Sala, Francisco M. De La Vega, Tiago Faial, Adam Frankish, Thomas Gingeras, Roderic Guigo, Jennifer L Harrow, Artemis G. Hatzigeorgiou, Rory Johnson, Terence D. Murphy, Mihaela Pertea, Kim D. Pruitt, Shashikant Pujar, Hazuki Takahashi, Igor Ulitsky, Ales Varabyou, Christine A. Wells, Mark Yandell, Piero Carninci, Steven L. Salzberg

    Abstract: Scientists have been trying to identify all of the genes in the human genome since the initial draft of the genome was published in 2001. Over the intervening years, much progress has been made in identifying protein-coding genes, and the estimated number has shrunk to fewer than 20,000, although the number of distinct protein-coding isoforms has expanded dramatically. The invention of high-throug…

    Submitted 24 March, 2023; originally announced March 2023.

    Comments: 14 pages

  8. arXiv:2110.03868  [pdf, other]

    cs.PL cs.AI cs.LG cs.SE

    Towards Learning (Dis)-Similarity of Source Code from Program Contrasts

    Authors: Yangruibo Ding, Luca Buratti, Saurabh Pujar, Alessandro Morari, Baishakhi Ray, Saikat Chakraborty

    Abstract: Understanding the functional (dis)-similarity of source code is significant for code modeling tasks such as software vulnerability and code clone detection. We present DISCO (DIS-similarity of COde), a novel self-supervised model focusing on identifying (dis)similar functionalities of source code. Different from existing works, our approach does not require a huge amount of randomly collected datas…

    Submitted 20 March, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: ACL 2022 Camera-Ready

  9. arXiv:2105.12655  [pdf, other]

    cs.SE cs.AI

    CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks

    Authors: Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Shyam Ramji, Ulrich Finkler, Susan Malaika, Frederick Reiss

    Abstract: Over the last several decades, software has been woven into the fabric of every aspect of our society. As software development surges and code infrastructure of enterprise applications ages, it is now more critical than ever to increase software development productivity and modernize legacy applications. Advances in deep learning and machine learning algorithms have enabled numerous breakthroughs,…

    Submitted 29 August, 2021; v1 submitted 24 May, 2021; originally announced May 2021.

    Comments: 22 pages including references

  10. arXiv:2102.07995  [pdf, other]

    cs.SE cs.AI cs.LG

    D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis

    Authors: Yunhui Zheng, Saurabh Pujar, Burn Lewis, Luca Buratti, Edward Epstein, Bo Yang, Jim Laredo, Alessandro Morari, Zhong Su

    Abstract: Static analysis tools are widely used for vulnerability detection as they understand programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of Machine Learning models to understand programming languages opens new possibilities when applied to static analysis. However, exist…

    Submitted 16 February, 2021; originally announced February 2021.

    Comments: Accepted to the 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP '21)

  11. arXiv:2006.12641  [pdf, ps, other]

    cs.CL cs.LG cs.PL

    Exploring Software Naturalness through Neural Language Models

    Authors: Luca Buratti, Saurabh Pujar, Mihaela Bornea, Scott McCarley, Yunhui Zheng, Gaetano Rossiello, Alessandro Morari, Jim Laredo, Veronika Thost, Yufan Zhuang, Giacomo Domeniconi

    Abstract: The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing. We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks. Present approaches to code analysis depend heavily on features derived from the Abstract Syntax Tree (AST) while our trans…

    Submitted 24 June, 2020; v1 submitted 22 June, 2020; originally announced June 2020.

  12. arXiv:1911.02984  [pdf, other]

    cs.CL cs.IR

    The TechQA Dataset

    Authors: Vittorio Castelli, Rishav Chakravarti, Saswati Dana, Anthony Ferritto, Radu Florian, Martin Franz, Dinesh Garg, Dinesh Khandelwal, Scott McCarley, Mike McCawley, Mohamed Nasr, Lin Pan, Cezar Pendus, John Pitrelli, Saurabh Pujar, Salim Roukos, Andrzej Sakrajda, Avirup Sil, Rosario Uceda-Sosa, Todd Ward, Rong Zhang

    Abstract: We introduce TechQA, a domain-adaptation question answering dataset for the technical support domain. The TechQA corpus highlights two real-world issues from the automated customer support domain. First, it contains actual questions posed by users on a technical forum, rather than questions generated specifically for a competition or a task. Second, it has a real-world size -- 600 training, 310 de…

    Submitted 7 November, 2019; originally announced November 2019.

    Comments: Long version of conference paper to be submitted