Search | arXiv e-print repository

Evaluating LLM-driven User-Intent Formalization for Verification-Aware Languages

Abstract: Verification-aware programming languages such as Dafny and F* provide means to formally specify and prove properties of programs. Although the problem of checking an implementation against a specification can be defined mechanically, there is no algorithmic way of ensuring the correctness of the user-intent formalization for programs -- that a specification adheres to the user's intent behind the… ▽ More Verification-aware programming languages such as Dafny and F* provide means to formally specify and prove properties of programs. Although the problem of checking an implementation against a specification can be defined mechanically, there is no algorithmic way of ensuring the correctness of the user-intent formalization for programs -- that a specification adheres to the user's intent behind the program. The intent or requirement is expressed informally in natural language and the specification is a formal artefact. The advent of large language models (LLMs) has made strides bridging the gap between informal intent and formal program implementations recently, driven in large parts due to benchmarks and automated metrics for evaluation. Recent work has proposed evaluating {\it user-intent formalization} problem for mainstream programming languages~\cite{endres-fse24}. However, such an approach does not readily extend to verification-aware languages that support rich specifications (containing quantifiers and ghost variables) that cannot be evaluated through dynamic execution. Previous work also required generating program mutants using LLMs to create the benchmark. We advocate an alternate approach of {\it symbolically testing specifications} to provide an intuitive metric for evaluating the quality of specifications for verification-aware languages. We demonstrate that our automated metric agrees closely with mostly GPT-4 generated and human-labeled dataset of roughly 150 Dafny specifications for the popular MBPP code-generation benchmark, yet demonstrates cases where the human labeling is not perfect. We believe our work provides a step** stone to enable the establishment of a benchmark and research agenda for the problem of user-intent formalization for programs. △ Less

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2406.04693 [pdf, other]

LLM-Vectorizer: LLM-based Verified Loop Vectorizer

Authors: Jubi Taneja, Avery Laird, Cong Yan, Madan Musuvathi, Shuvendu K. Lahiri

Abstract: Vectorization is a powerful optimization technique that significantly boosts the performance of high performance computing applications operating on large data arrays. Despite decades of research on auto-vectorization, compilers frequently miss opportunities to vectorize code. On the other hand, writing vectorized code manually using compiler intrinsics is still a complex, error-prone task that de… ▽ More Vectorization is a powerful optimization technique that significantly boosts the performance of high performance computing applications operating on large data arrays. Despite decades of research on auto-vectorization, compilers frequently miss opportunities to vectorize code. On the other hand, writing vectorized code manually using compiler intrinsics is still a complex, error-prone task that demands deep knowledge of specific architecture and compilers. In this paper, we evaluate the potential of large-language models (LLMs) to generate vectorized (Single Instruction Multiple Data) code from scalar programs that process individual array elements. We propose a novel finite-state machine multi-agents based approach that harnesses LLMs and test-based feedback to generate vectorized code. Our findings indicate that LLMs are capable of producing high performance vectorized code with run-time speedup ranging from 1.1x to 9.4x as compared to the state-of-the-art compilers such as Intel Compiler, GCC, and Clang. To verify the correctness of vectorized code, we use Alive2, a leading bounded translation validation tool for LLVM IR. We describe a few domain-specific techniques to improve the scalability of Alive2 on our benchmark dataset. Overall, our approach is able to verify 38.2% of vectorizations as correct on the TSVC benchmark dataset. △ Less

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2405.06276 [pdf, other]

Transition from laminar to turbulent pipe flow as a process of growing material instabilities

Authors: Saptarshi Kumar Lahiri, Konstantin Volokh

Abstract: In this work, we simulate the transition to turbulence in the pipe flow based on the modified NS theory incorporating the viscous fluid strength in the constitutive equations. The latter concept enriches theory by allowing for material instabilities in addition to the kinematic ones. We present results of comparative numerical simulations based on the classical NS model and the NS model enhanced w… ▽ More In this work, we simulate the transition to turbulence in the pipe flow based on the modified NS theory incorporating the viscous fluid strength in the constitutive equations. The latter concept enriches theory by allowing for material instabilities in addition to the kinematic ones. We present results of comparative numerical simulations based on the classical NS model and the NS model enhanced with the finite viscous strength. As expected, simulations based on the classical NS model exhibit stable laminar flow in contrast to experimental observations. Conversely, simulations based on the modified NS model with viscous strength exhibit instabilities and transition to turbulence per experimental observations. The transition to turbulence is triggered by the growing material instabilities. △ Less

Submitted 10 May, 2024; originally announced May 2024.

arXiv:2404.10362 [pdf, other]

3DGen: AI-Assisted Generation of Provably Correct Binary Format Parsers

Authors: Sarah Fakhoury, Markus Kuppe, Shuvendu K. Lahiri, Tahina Ramananandro, Nikhil Swamy

Abstract: Improper parsing of attacker-controlled input is a leading source of software security vulnerabilities, especially when programmers transcribe informal format descriptions in RFCs into efficient parsing logic in low-level, memory unsafe languages. Several researchers have proposed formal specification languages for data formats from which efficient code can be extracted. However, distilling inform… ▽ More Improper parsing of attacker-controlled input is a leading source of software security vulnerabilities, especially when programmers transcribe informal format descriptions in RFCs into efficient parsing logic in low-level, memory unsafe languages. Several researchers have proposed formal specification languages for data formats from which efficient code can be extracted. However, distilling informal requirements into formal specifications is challenging and, despite their benefits, new, formal languages are hard for people to learn and use. In this work, we present 3DGen, a framework that makes use of AI agents to transform mixed informal input, including natural language documents (i.e., RFCs) and example inputs into format specifications in a language called 3D. To support humans in understanding and trusting the generated specifications, 3DGen uses symbolic methods to also synthesize test inputs that can be validated against an external oracle. Symbolic test generation also helps in distinguishing multiple plausible solutions. Through a process of repeated refinement, 3DGen produces a 3D specification that conforms to a test suite, and which yields safe, efficient, provably correct, parsing code in C. We have evaluated 3DGen on 20 Internet standard formats, demonstrating the potential for AI-agents to produce formally verified C code at a non-trivial scale. A key enabler is the use of a domain-specific language to limit AI outputs to a class for which automated, symbolic analysis is tractable. △ Less

Submitted 6 May, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

arXiv:2404.10100 [pdf, other]

LLM-based Test-driven Interactive Code Generation: User Study and Empirical Evaluation

Authors: Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, Shuvendu K. Lahiri

Abstract: Large language models (LLMs) have shown great potential in automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, given NL is informal, it does not lend easily to checking that the generated code correctly satisfies the user intent. In this paper, we propose a novel interactive workflow TiCoder for guided intent clarification (i.e.,… ▽ More Large language models (LLMs) have shown great potential in automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, given NL is informal, it does not lend easily to checking that the generated code correctly satisfies the user intent. In this paper, we propose a novel interactive workflow TiCoder for guided intent clarification (i.e., partial formalization) through tests to support the generation of more accurate code suggestions. Through a mixed methods user study with 15 programmers, we present an empirical evaluation of the effectiveness of the workflow to improve code generation accuracy. We find that participants using the proposed workflow are significantly more likely to correctly evaluate AI generated code, and report significantly less task-induced cognitive load. Furthermore, we test the potential of the workflow at scale with four different state-of-the-art LLMs on two python datasets, using an idealized proxy for a user feedback. We observe an average absolute improvement of 38.43% in the pass@1 code generation accuracy for both datasets and across all LLMs within 5 user interactions, in addition to the automatic generation of accompanying unit tests. △ Less

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2311.07948 [pdf, other]

Finding Inductive Loop Invariants using Large Language Models

Authors: Adharsh Kamath, Aditya Senthilnathan, Saikat Chakraborty, Pantazis Deligiannis, Shuvendu K. Lahiri, Akash Lal, Aseem Rastogi, Subhajit Roy, Rahul Sharma

Abstract: Loop invariants are fundamental to reasoning about programs with loops. They establish properties about a given loop's behavior. When they additionally are inductive, they become useful for the task of formal verification that seeks to establish strong mathematical guarantees about program's runtime behavior. The inductiveness ensures that the invariants can be checked locally without consulting t… ▽ More Loop invariants are fundamental to reasoning about programs with loops. They establish properties about a given loop's behavior. When they additionally are inductive, they become useful for the task of formal verification that seeks to establish strong mathematical guarantees about program's runtime behavior. The inductiveness ensures that the invariants can be checked locally without consulting the entire program, thus are indispensable artifacts in a formal proof of correctness. Finding inductive loop invariants is an undecidable problem, and despite a long history of research towards practical solutions, it remains far from a solved problem. This paper investigates the capabilities of the Large Language Models (LLMs) in offering a new solution towards this old, yet important problem. To that end, we first curate a dataset of verification problems on programs with loops. Next, we design a prompt for exploiting LLMs, obtaining inductive loop invariants, that are checked for correctness using sound symbolic tools. Finally, we explore the effectiveness of using an efficient combination of a symbolic tool and an LLM on our dataset and compare it against a purely symbolic baseline. Our results demonstrate that LLMs can help improve the state-of-the-art in automated program verification. △ Less

Submitted 14 November, 2023; originally announced November 2023.

arXiv:2310.09342 [pdf, other]

Ranking LLM-Generated Loop Invariants for Program Verification

Authors: Saikat Chakraborty, Shuvendu K. Lahiri, Sarah Fakhoury, Madanlal Musuvathi, Akash Lal, Aseem Rastogi, Aditya Senthilnathan, Rahul Sharma, Nikhil Swamy

Abstract: Synthesizing inductive loop invariants is fundamental to automating program verification. In this work, we observe that Large Language Models (such as gpt-3.5 or gpt-4) are capable of synthesizing loop invariants for a class of programs in a 0-shot setting, yet require several samples to generate the correct invariants. This can lead to a large number of calls to a program verifier to establish an… ▽ More Synthesizing inductive loop invariants is fundamental to automating program verification. In this work, we observe that Large Language Models (such as gpt-3.5 or gpt-4) are capable of synthesizing loop invariants for a class of programs in a 0-shot setting, yet require several samples to generate the correct invariants. This can lead to a large number of calls to a program verifier to establish an invariant. To address this issue, we propose a {\it re-ranking} approach for the generated results of LLMs. We have designed a ranker that can distinguish between correct inductive invariants and incorrect attempts based on the problem definition. The ranker is optimized as a contrastive ranker. Experimental results demonstrate that this re-ranking mechanism significantly improves the ranking of correct invariants among the generated candidates, leading to a notable reduction in the number of calls to a verifier. The source code and the experimental data for this paper are available in \url{https://github.com/microsoft/NeuralInvariantRanker}. △ Less

Submitted 12 February, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

Comments: Findings of The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP-findings 2023)

arXiv:2310.01831 [pdf, other]

Can Large Language Models Transform Natural Language Intent into Formal Method Postconditions?

Authors: Madeline Endres, Sarah Fakhoury, Saikat Chakraborty, Shuvendu K. Lahiri

Abstract: Informal natural language that describes code functionality, such as code comments or function documentation, may contain substantial information about a programs intent. However, there is typically no guarantee that a programs implementation and natural language documentation are aligned. In the case of a conflict, leveraging information in code-adjacent natural language has the potential to enha… ▽ More Informal natural language that describes code functionality, such as code comments or function documentation, may contain substantial information about a programs intent. However, there is typically no guarantee that a programs implementation and natural language documentation are aligned. In the case of a conflict, leveraging information in code-adjacent natural language has the potential to enhance fault localization, debugging, and code trustworthiness. In practice, however, this information is often underutilized due to the inherent ambiguity of natural language which makes natural language intent challenging to check programmatically. The emergent abilities of Large Language Models (LLMs) have the potential to facilitate the translation of natural language intent to programmatically checkable assertions. However, it is unclear if LLMs can correctly translate informal natural language specifications into formal specifications that match programmer intent. Additionally, it is unclear if such translation could be useful in practice. In this paper, we describe nl2postcond, the problem of leveraging LLMs for transforming informal natural language to formal method postconditions, expressed as program assertions. We introduce and validate metrics to measure and compare different nl2postcond approaches, using the correctness and discriminative power of generated postconditions. We then use qualitative and quantitative methods to assess the quality of nl2postcond postconditions, finding that they are generally correct and able to discriminate incorrect code. Finally, we find that nl2postcond via LLMs has the potential to be helpful in practice; nl2postcond generated postconditions were able to catch 64 real-world historical bugs from Defects4J. △ Less

Submitted 15 April, 2024; v1 submitted 3 October, 2023; originally announced October 2023.

Comments: To appear at the Proceedings of the ACM on Software Engineering (PACMSE), Issue Foundations of Software Engineering (FSE) 2024

arXiv:2306.10763 [pdf, other]

Guiding Language Models of Code with Global Context using Monitors

Authors: Lakshya A Agrawal, Aditya Kanade, Navin Goyal, Shuvendu K. Lahiri, Sriram K. Rajamani

Abstract: Language models of code (LMs) work well when the surrounding code provides sufficient context. This is not true when it becomes necessary to use types, functionality or APIs defined elsewhere in the repository or a linked library, especially those not seen during training. LMs suffer from limited awareness of such global context and end up hallucinating. Integrated development environments (IDEs… ▽ More Language models of code (LMs) work well when the surrounding code provides sufficient context. This is not true when it becomes necessary to use types, functionality or APIs defined elsewhere in the repository or a linked library, especially those not seen during training. LMs suffer from limited awareness of such global context and end up hallucinating. Integrated development environments (IDEs) assist developers in understanding repository context using static analysis. We extend this assistance, enjoyed by developers, to LMs. We propose monitor-guided decoding (MGD) where a monitor uses static analysis to guide the decoding. We construct a repository-level dataset PragmaticCode for method-completion in Java and evaluate MGD on it. On models of varying parameter scale, by monitoring for type-consistent object dereferences, MGD consistently improves compilation rates and agreement with ground truth. Further, LMs with fewer parameters, when augmented with MGD, can outperform larger LMs. With MGD, SantaCoder-1.1B achieves better compilation rate and next-identifier match than the much larger text-davinci-003 model. We also conduct a generalizability study to evaluate the ability of MGD to generalize to multiple programming languages (Java, C# and Rust), coding scenarios (e.g., correct number of arguments to method calls), and to enforce richer semantic constraints (e.g., stateful API protocols). Our data and implementation are available at https://github.com/microsoft/monitors4codegen . △ Less

Submitted 3 November, 2023; v1 submitted 19 June, 2023; originally announced June 2023.

Comments: Accepted to NeurIPS 2023 and to appear as "Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context" at https://neurips.cc/virtual/2023/poster/70362 . Contents: 11 pages, 15 additional pages of appendix, 13 figures, 3 tables

ACM Class: I.2.2; I.2.7; I.2.5

arXiv:2304.03816 [pdf, other]

Towards Generating Functionally Correct Code Edits from Natural Language Issue Descriptions

Authors: Sarah Fakhoury, Saikat Chakraborty, Madan Musuvathi, Shuvendu K. Lahiri

Abstract: Large language models (LLMs), such as OpenAI's Codex, have demonstrated their potential to generate code from natural language descriptions across a wide range of programming tasks. Several benchmarks have recently emerged to evaluate the ability of LLMs to generate functionally correct code from natural language intent with respect to a set of hidden test cases. This has enabled the research comm… ▽ More Large language models (LLMs), such as OpenAI's Codex, have demonstrated their potential to generate code from natural language descriptions across a wide range of programming tasks. Several benchmarks have recently emerged to evaluate the ability of LLMs to generate functionally correct code from natural language intent with respect to a set of hidden test cases. This has enabled the research community to identify significant and reproducible advancements in LLM capabilities. However, there is currently a lack of benchmark datasets for assessing the ability of LLMs to generate functionally correct code edits based on natural language descriptions of intended changes. This paper aims to address this gap by motivating the problem NL2Fix of translating natural language descriptions of code changes (namely bug fixes described in Issue reports in repositories) into correct code fixes. To this end, we introduce Defects4J-NL2Fix, a dataset of 283 Java programs from the popular Defects4J dataset augmented with high-level descriptions of bug fixes, and empirically evaluate the performance of several state-of-the-art LLMs for the this task. Results show that these LLMS together are capable of generating plausible fixes for 64.6% of the bugs, and the best LLM-based technique can achieve up to 21.20% top-1 and 35.68% top-5 accuracy on this benchmark. △ Less

Submitted 7 April, 2023; originally announced April 2023.

arXiv:2208.05950 [pdf, other]

Interactive Code Generation via Test-Driven User-Intent Formalization

Authors: Shuvendu K. Lahiri, Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, Madanlal Musuvathi, Piali Choudhury, Curtis von Veh, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao

Abstract: Large language models (LLMs) have shown great potential in automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, when interacting with LLMs, users have no guarantees that the code suggestions produced correctly satisfy the intent they provided. In fact, it is hard to define a notion of correctness since natural language can be ambig… ▽ More Large language models (LLMs) have shown great potential in automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, when interacting with LLMs, users have no guarantees that the code suggestions produced correctly satisfy the intent they provided. In fact, it is hard to define a notion of correctness since natural language can be ambiguous and lacks a formal semantics. In this paper, we propose the workflow of {\it interactive test-driven code generation}, which leverages lightweight user feedback to (a) formalize the user intent using generated tests that can be useful for debugging, and (b) produce an improved set of code suggestions by pruning and ranking candidate code suggestions. We describe a language-agnostic abstract algorithm and a concrete implementation TiCoder. We perform an automated evaluation of TiCoder on the \emph{MBPP} and \emph{HumanEval} code generation benchmarks. Our results are promising with using the OpenAI Codex LLM: our best algorithm improves the \passk{1} code generation accuracy (in absolute percentages) between $22.49\%$ to $37.71\%$ for MBPP and between $24.79\%$ to $53.98\%$ for HumanEval using between 1 to 5 simulated user queries. △ Less

Submitted 3 October, 2023; v1 submitted 11 August, 2022; originally announced August 2022.

Comments: 18 pages

arXiv:2206.03865 [pdf, other]

Fault-Aware Neural Code Rankers

Authors: Jeevana Priya Inala, Chenglong Wang, Mei Yang, Andres Codas, Mark Encarnación, Shuvendu K Lahiri, Madanlal Musuvathi, Jianfeng Gao

Abstract: Large language models (LLMs) have demonstrated an impressive ability to generate code for various programming tasks. In many instances, LLMs can generate a correct program for a task when given numerous trials. Consequently, a recent trend is to do large scale sampling of programs using a model and then filtering/ranking the programs based on the program execution on a small number of known unit t… ▽ More Large language models (LLMs) have demonstrated an impressive ability to generate code for various programming tasks. In many instances, LLMs can generate a correct program for a task when given numerous trials. Consequently, a recent trend is to do large scale sampling of programs using a model and then filtering/ranking the programs based on the program execution on a small number of known unit tests to select one candidate solution. However, these approaches assume that the unit tests are given and assume the ability to safely execute the generated programs (which can do arbitrary dangerous operations such as file manipulations). Both of the above assumptions are impractical in real-world software development. In this paper, we propose CodeRanker, a neural ranker that can predict the correctness of a sampled program without executing it. Our CodeRanker is fault-aware i.e., it is trained to predict different kinds of execution information such as predicting the exact compile/runtime error type (e.g., an IndexError or a TypeError). We show that CodeRanker can significantly increase the pass@1 accuracy of various code generation models (including Codex, GPT-Neo, GPT-J) on APPS, HumanEval and MBPP datasets. △ Less

Submitted 9 December, 2022; v1 submitted 4 June, 2022; originally announced June 2022.

Comments: In the proceedings of Advances in Neural Information Processing Systems, 2022

arXiv:2111.11904 [pdf, other]

Can Pre-trained Language Models be Used to Resolve Textual and Semantic Merge Conflicts?

Authors: Jialu Zhang, Todd Mytkowicz, Mike Kaufman, Ruzica Piskac, Shuvendu K. Lahiri

Abstract: Program merging is standard practice when developers integrate their individual changes to a common code base. When the merge algorithm fails, this is called a merge conflict. The conflict either manifests in textual merge conflicts where the merge fails to produce code, or semantic merge conflicts where the merged code results in compiler or test breaks. Resolving these conflicts for large code p… ▽ More Program merging is standard practice when developers integrate their individual changes to a common code base. When the merge algorithm fails, this is called a merge conflict. The conflict either manifests in textual merge conflicts where the merge fails to produce code, or semantic merge conflicts where the merged code results in compiler or test breaks. Resolving these conflicts for large code projects is expensive because it requires developers to manually identify the sources of conflict and correct them. In this paper, we explore the feasibility of automatically repairing merge conflicts (both textual and semantic) using k-shot learning with large neural language models (LM) such as GPT-3. One of the challenges in leveraging such language models is to fit the examples and the queries within a small prompt (2048 tokens). We evaluate LMs and k-shot learning for two broad applications: (a) textual and semantic merge conflicts for a divergent fork Microsoft Edge, and (b) textual merge conflicts for a large number of JavaScript projects in GitHub. Our results are mixed: one one-hand, LMs provide the state-of-the-art (SOTA) performance on semantic merge conflict resolution for Edge compared to earlier symbolic approaches; on the other hand, LMs do not yet obviate the benefits of fine-tuning neural models (when sufficient data is available) or the design of special purpose domain-specific languages (DSL) for restricted patterns for program synthesis. △ Less

Submitted 23 November, 2021; originally announced November 2021.

arXiv:2110.00677 [pdf, other]

doi 10.1145/1122445.1122456

SolType: Refinement Types for Arithmetic Overflow in Solidity

Authors: Bryan Tan, Benjamin Mariano, Shuvendu K. Lahiri, Isil Dillig, Yu Feng

Abstract: As smart contracts gain adoption in financial transactions, it becomes increasingly important to ensure that they are free of bugs and security vulnerabilities. Of particular relevance in this context are arithmetic overflow bugs, as integers are often used to represent financial assets like account balances. Motivated by this observation, this paper presents SolType, a refinement type system for… ▽ More As smart contracts gain adoption in financial transactions, it becomes increasingly important to ensure that they are free of bugs and security vulnerabilities. Of particular relevance in this context are arithmetic overflow bugs, as integers are often used to represent financial assets like account balances. Motivated by this observation, this paper presents SolType, a refinement type system for Solidity that can be used to prevent arithmetic over- and under-flows in smart contracts. SolType allows developers to add refinement type annotations and uses them to prove that arithmetic operations do not lead to over- and under-flows. SolType incorporates a rich vocabulary of refinement terms that allow expressing relationships between integer values and aggregate properties of complex data structures. Furthermore, our implementation, called Solid, incorporates a type inference engine and can automatically infer useful type annotations, including non-trivial contract invariants. To evaluate the usefulness of our type system, we use Solid to prove arithmetic safety of a total of 120 smart contracts. When used in its fully automated mode (i.e., using Solid's type inference capabilities), Solid is able to eliminate 86.3% of redundant runtime checks used to guard against overflows. We also compare Solid against a state-of-the-art arithmetic safety verifier called VeriSmart and show that Solid has a significantly lower false positive rate, while being significantly faster in terms of verification time. △ Less

Submitted 20 December, 2021; v1 submitted 1 October, 2021; originally announced October 2021.

Comments: To appear in POPL '22. This is the extended version of the paper with the proofs, after the main text went through peer review. 51 pages, 15 figures

arXiv:2109.09262 [pdf, other]

doi 10.1145/3510003.3510141

TOGA: A Neural Method for Test Oracle Generation

Authors: Elizabeth Dinella, Gabriel Ryan, Todd Mytkowicz, Shuvendu K. Lahiri

Abstract: Testing is widely recognized as an important stage of the software development lifecycle. Effective software testing can provide benefits such as bug finding, preventing regressions, and documentation. In terms of documentation, unit tests express a unit's intended functionality, as conceived by the developer. A test oracle, typically expressed as an condition, documents the intended behavior of a… ▽ More Testing is widely recognized as an important stage of the software development lifecycle. Effective software testing can provide benefits such as bug finding, preventing regressions, and documentation. In terms of documentation, unit tests express a unit's intended functionality, as conceived by the developer. A test oracle, typically expressed as an condition, documents the intended behavior of a unit under a given test prefix. Synthesizing a functional test oracle is a challenging problem, as it must capture the intended functionality rather than the implemented functionality. In this paper, we propose TOGA (a neural method for Test Oracle GenerAtion), a unified transformer-based neural approach to infer both exceptional and assertion test oracles based on the context of the focal method. Our approach can handle units with ambiguous or missing documentation, and even units with a missing implementation. We evaluate our approach on both oracle inference accuracy and functional bug-finding. Our technique improves accuracy by 33\% over existing oracle inference approaches, achieving 96\% overall accuracy on a held out test dataset. Furthermore, we show that when integrated with a automated test generation tool (EvoSuite), our approach finds 57 real world bugs in large-scale Java programs, including 30 bugs that are not found by any other automated testing method in our evaluation. △ Less

Submitted 20 April, 2022; v1 submitted 19 September, 2021; originally announced September 2021.

Comments: To appear in ICSE 2022

ACM Class: D.2.5; D.2.1

arXiv:2105.07569 [pdf, other]

DeepMerge: Learning to Merge Programs

Authors: Elizabeth Dinella, Todd Mytkowicz, Alexey Svyatkovskiy, Christian Bird, Mayur Naik, Shuvendu K. Lahiri

Abstract: In collaborative software development, program merging is the mechanism to integrate changes from multiple programmers. Merge algorithms in modern version control systems report a conflict when changes interfere textually. Merge conflicts require manual intervention and frequently stall modern continuous integration pipelines. Prior work found that, although costly, a large majority of resolutions… ▽ More In collaborative software development, program merging is the mechanism to integrate changes from multiple programmers. Merge algorithms in modern version control systems report a conflict when changes interfere textually. Merge conflicts require manual intervention and frequently stall modern continuous integration pipelines. Prior work found that, although costly, a large majority of resolutions involve re-arranging text without writing any new code. Inspired by this observation we propose the first data-driven approach to resolve merge conflicts with a machine learning model. We realize our approach in a tool DeepMerge that uses a novel combination of (i) an edit-aware embedding of merge inputs and (ii) a variation of pointer networks, to construct resolutions from input segments. We also propose an algorithm to localize manual resolutions in a resolved file and employ it to curate a ground-truth dataset comprising 8,719 non-trivial resolutions in JavaScript programs. Our evaluation shows that, on a held out test set, DeepMerge can predict correct resolutions for 37% of non-trivial merges, compared to only 4% by a state-of-the-art semistructured merge technique. Furthermore, on the subset of merges with upto 3 lines (comprising 24% of the total dataset), DeepMerge can predict correct resolutions with 78% accuracy. △ Less

Submitted 6 September, 2021; v1 submitted 16 May, 2021; originally announced May 2021.

Comments: 11 pages

arXiv:2003.12912 [pdf, ps, other]

doi 10.1145/3379597.3387498

A Dataset of Dockerfiles

Authors: Jordan Henkel, Christian Bird, Shuvendu K. Lahiri, Thomas Reps

Abstract: Dockerfiles are one of the most prevalent kinds of DevOps artifacts used in industry. Despite their prevalence, there is a lack of sophisticated semantics-aware static analysis of Dockerfiles. In this paper, we introduce a dataset of approximately 178,000 unique Dockerfiles collected from GitHub. To enhance the usability of this data, we describe five representations we have devised for working wi… ▽ More Dockerfiles are one of the most prevalent kinds of DevOps artifacts used in industry. Despite their prevalence, there is a lack of sophisticated semantics-aware static analysis of Dockerfiles. In this paper, we introduce a dataset of approximately 178,000 unique Dockerfiles collected from GitHub. To enhance the usability of this data, we describe five representations we have devised for working with, mining from, and analyzing these Dockerfiles. Each Dockerfile representation builds upon the previous ones, and the final representation, created by three levels of nested parsing and abstraction, makes tasks such as mining and static checking tractable. The Dockerfiles, in each of the five representations, along with metadata and the tools used to shepard the data from one representation to the next are all available at: https://doi.org/10.5281/zenodo.3628771. △ Less

Submitted 28 March, 2020; originally announced March 2020.

Comments: Published as a Data Showcase in MSR'2020

arXiv:2002.03064 [pdf, other]

doi 10.1145/3377811.3380406

Learning from, Understanding, and Supporting DevOps Artifacts for Docker

Authors: Jordan Henkel, Christian Bird, Shuvendu K. Lahiri, Thomas Reps

Abstract: With the growing use of DevOps tools and frameworks, there is an increased need for tools and techniques that support more than code. The current state-of-the-art in static developer assistance for tools like Docker is limited to shallow syntactic validation. We identify three core challenges in the realm of learning from, understanding, and supporting developers writing DevOps artifacts: (i) nest… ▽ More With the growing use of DevOps tools and frameworks, there is an increased need for tools and techniques that support more than code. The current state-of-the-art in static developer assistance for tools like Docker is limited to shallow syntactic validation. We identify three core challenges in the realm of learning from, understanding, and supporting developers writing DevOps artifacts: (i) nested languages in DevOps artifacts, (ii) rule mining, and (iii) the lack of semantic rule-based analysis. To address these challenges we introduce a toolset, binnacle, that enabled us to ingest 900,000 GitHub repositories. Focusing on Docker, we extracted approximately 178,000 unique Dockerfiles, and also identified a Gold Set of Dockerfiles written by Docker experts. We addressed challenge (i) by reducing the number of effectively uninterpretable nodes in our ASTs by over 80% via a technique we call phased parsing. To address challenge (ii), we introduced a novel rule-mining technique capable of recovering two-thirds of the rules in a benchmark we curated. Through this automated mining, we were able to recover 16 new rules that were not found during manual rule collection. To address challenge (iii), we manually collected a set of rules for Dockerfiles from commits to the files in the Gold Set. These rules encapsulate best practices, avoid docker build failures, and improve image size and build latency. We created an analyzer that used these rules, and found that, on average, Dockerfiles on GitHub violated the rules five times more frequently than the Dockerfiles in our Gold Set. We also found that industrial Dockerfiles fared no better than those sourced from GitHub. The learned rules and analyzer in binnacle can be used to aid developers in the IDE when creating Dockerfiles, and in a post-hoc fashion to identify issues in, and to improve, existing Dockerfiles. △ Less

Submitted 7 February, 2020; originally announced February 2020.

Comments: Published in ICSE'2020

arXiv:2001.03416 [pdf, other]

doi 10.1016/j.jcp.2020.109761

A stable SPH with adaptive B-spline kernel

Authors: Saptarshi Kumar Lahiri, Kanishka Bhattacharya, Amit Shaw, L S Ramachandra

Abstract: Tensile instability, often observed in smoothed particle hydrodynamics (SPH), is a numerical artifact that manifests itself by unphysical clustering or separation of particles. The instability originates in estimating the derivatives of the smoothing functions which, when interact with material constitution may result in negative stiffness in the discretized system. In the present study, a stable… ▽ More Tensile instability, often observed in smoothed particle hydrodynamics (SPH), is a numerical artifact that manifests itself by unphysical clustering or separation of particles. The instability originates in estimating the derivatives of the smoothing functions which, when interact with material constitution may result in negative stiffness in the discretized system. In the present study, a stable formulation of SPH is developed where the kernel function is continuously adapted at every material point depending on its state of stress. Bspline basis function with a variable intermediate knot is used as the kernel function. The shape of the kernel function is then modified by changing the intermediate knot position such that the condition associated with instability does not arise. While implementing the algorithm the simplicity and computational efficiency of SPH are not compromised. One-dimensional dispersion analysis is performed to understand the effect adaptive kernel on the stability. Finally, the efficacy of the algorithm is demonstrated through some benchmark elastic dynamics problems. △ Less

Submitted 4 January, 2020; originally announced January 2020.

Comments: 34 Pages, 22 Figures

Journal ref: Journal of Computational Physics 422(2020) 109761

arXiv:1904.12098 [pdf, other]

Enabling Open-World Specification Mining via Unsupervised Learning

Authors: Jordan Henkel, Shuvendu K. Lahiri, Ben Liblit, Thomas Reps

Abstract: Many programming tasks require using both domain-specific code and well-established patterns (such as routines concerned with file IO). Together, several small patterns combine to create complex interactions. This compounding effect, mixed with domain-specific idiosyncrasies, creates a challenging environment for fully automatic specification inference. Mining specifications in this environment, w… ▽ More Many programming tasks require using both domain-specific code and well-established patterns (such as routines concerned with file IO). Together, several small patterns combine to create complex interactions. This compounding effect, mixed with domain-specific idiosyncrasies, creates a challenging environment for fully automatic specification inference. Mining specifications in this environment, without the aid of rule templates, user-directed feedback, or predefined API surfaces, is a major challenge. We call this challenge Open-World Specification Mining. In this paper, we present a framework for mining specifications and usage patterns in an Open-World setting. We design this framework to be miner-agnostic and instead focus on disentangling complex and noisy API interactions. To evaluate our framework, we introduce a benchmark of 71 clusters extracted from five open-source projects. Using this dataset, we show that interesting clusters can be recovered, in a fully automatic way, by leveraging unsupervised learning in the form of word embeddings. Once clusters have been recovered, the challenge of Open-World Specification Mining is simplified and any trace-based mining technique can be applied. In addition, we provide a comprehensive evaluation of three word-vector learners to showcase the value of sub-word information for embeddings learned in the software-engineering domain. △ Less

Submitted 26 April, 2019; originally announced April 2019.

arXiv:1812.08829 [pdf, other]

Formal Specification and Verification of Smart Contracts for Azure Blockchain

Authors: Yuepeng Wang, Shuvendu K. Lahiri, Shuo Chen, Rong Pan, Isil Dillig, Cody Born, Immad Naseer

Abstract: Ensuring correctness of smart contracts is paramount to ensuring trust in blockchain-based systems. This paper studies the safety and security of smart contracts in the \emph{Azure Blockchain Workbench}, an enterprise Blockchain-as-a-Service offering from Microsoft. As part of this study, we formalize \emph{semantic conformance} of smart contracts against a state machine model with access-control… ▽ More Ensuring correctness of smart contracts is paramount to ensuring trust in blockchain-based systems. This paper studies the safety and security of smart contracts in the \emph{Azure Blockchain Workbench}, an enterprise Blockchain-as-a-Service offering from Microsoft. As part of this study, we formalize \emph{semantic conformance} of smart contracts against a state machine model with access-control policy and develop a highly-automated formal verifier for Solidity that can produce proofs as well as counterexamples. We have applied our verifier {\sc VeriSol} to analyze {\it all} contracts shipped with the Azure Blockchain Workbench, which includes application samples as well as a governance contract for Proof of Authority (PoA). We have found previously unknown bugs in these published smart contracts. After fixing these bugs, {\sc VeriSol} was able to successfully perform full verification for all of these contracts. △ Less

Submitted 29 April, 2019; v1 submitted 20 December, 2018; originally announced December 2018.

Comments: 13 pages

ACM Class: F.3.1

arXiv:1803.06686 [pdf, other]

doi 10.1145/3236024.3236085

Code Vectors: Understanding Programs Through Embedded Abstracted Symbolic Traces

Authors: Jordan Henkel, Shuvendu K. Lahiri, Ben Liblit, Thomas Reps

Abstract: With the rise of machine learning, there is a great deal of interest in treating programs as data to be fed to learning algorithms. However, programs do not start off in a form that is immediately amenable to most off-the-shelf learning techniques. Instead, it is necessary to transform the program to a suitable representation before a learning technique can be applied. In this paper, we use abst… ▽ More With the rise of machine learning, there is a great deal of interest in treating programs as data to be fed to learning algorithms. However, programs do not start off in a form that is immediately amenable to most off-the-shelf learning techniques. Instead, it is necessary to transform the program to a suitable representation before a learning technique can be applied. In this paper, we use abstractions of traces obtained from symbolic execution of a program as a representation for learning word embeddings. We trained a variety of word embeddings under hundreds of parameterizations, and evaluated each learned embedding on a suite of different tasks. In our evaluation, we obtain 93% top-1 accuracy on a benchmark consisting of over 19,000 API-usage analogies extracted from the Linux kernel. In addition, we show that embeddings learned from (mainly) semantic abstractions provide nearly triple the accuracy of those learned from (mainly) syntactic abstractions. △ Less

Submitted 20 August, 2018; v1 submitted 18 March, 2018; originally announced March 2018.

arXiv:1710.07660 [pdf, other]

Verifying Equivalence of Database-Driven Applications

Authors: Yuepeng Wang, Isil Dillig, Shuvendu K. Lahiri, William R. Cook

Abstract: This paper addresses the problem of verifying equivalence between a pair of programs that operate over databases with different schemas. This problem is particularly important in the context of web applications, which typically undergo database refactoring either for performance or maintainability reasons. While web applications should have the same externally observable behavior before and after… ▽ More This paper addresses the problem of verifying equivalence between a pair of programs that operate over databases with different schemas. This problem is particularly important in the context of web applications, which typically undergo database refactoring either for performance or maintainability reasons. While web applications should have the same externally observable behavior before and after schema migration, there are no existing tools for proving equivalence of such programs. This paper takes a first step towards solving this problem by formalizing the equivalence and refinement checking problems for database-driven applications. We also propose a proof methodology based on the notion of bisimulation invariants over relational algebra with updates and describe a technique for synthesizing such bisimulation invariants. We have implemented the proposed technique in a tool called Mediator for verifying equivalence between database-driven applications written in our intermediate language and evaluate our tool on 21 benchmarks extracted from textbooks and real-world web applications. Our results show that the proposed methodology can successfully verify 20 of these benchmarks. △ Less

Submitted 20 October, 2017; originally announced October 2017.

arXiv:1609.08734 [pdf, ps, other]

Interprocedural Semantic Change-Impact Analysis using Equivalence Relations

Authors: Alex Gyori, Shuvendu K. Lahiri, Nimrod Partush

Abstract: Change-impact analysis (CIA) is the task of determining the set of program elements impacted by a program change. Precise CIA has great potential to avoid expensive testing and code reviews for (parts of) changes that are refactorings (semantics-preserving). Existing CIA is imprecise because it is coarse-grained, deals with only few refactoring patterns, or is unaware of the change semantics. We… ▽ More Change-impact analysis (CIA) is the task of determining the set of program elements impacted by a program change. Precise CIA has great potential to avoid expensive testing and code reviews for (parts of) changes that are refactorings (semantics-preserving). Existing CIA is imprecise because it is coarse-grained, deals with only few refactoring patterns, or is unaware of the change semantics. We formalize the notion of change impact in terms of the trace semantics of two program versions. We show how to leverage equivalence relations to make dataflow-based CIA aware of the change semantics, thereby improving precision in the presence of semantics-preserving changes. We propose an anytime algorithm that allows applying costly equivalence relation inference incrementally to refine the set of impacted statements. We have implemented a prototype in SymDiff, and evaluated it on 322 real-world changes from open-source projects and benchmark programs used by prior research. The evaluation results show an average 35% improvement in the size of the set of impacted statements compared to standard dataflow-based techniques. △ Less

Submitted 27 September, 2016; originally announced September 2016.

Comments: 12 pages

ACM Class: D.2.4; D.2.5

arXiv:cs/0612003 [pdf, ps, other]

doi 10.2168/LMCS-3(2:1)2007

Predicate Abstraction via Symbolic Decision Procedures

Authors: Shuvendu K. Lahiri, Thomas Ball, Byron Cook

Abstract: We present a new approach for performing predicate abstraction based on symbolic decision procedures. Intuitively, a symbolic decision procedure for a theory takes a set of predicates in the theory and symbolically executes a decision procedure on all the subsets over the set of predicates. The result of the symbolic decision procedure is a shared expression (represented by a directed acyclic gr… ▽ More We present a new approach for performing predicate abstraction based on symbolic decision procedures. Intuitively, a symbolic decision procedure for a theory takes a set of predicates in the theory and symbolically executes a decision procedure on all the subsets over the set of predicates. The result of the symbolic decision procedure is a shared expression (represented by a directed acyclic graph) that implicitly represents the answer to a predicate abstraction query. We present symbolic decision procedures for the logic of Equality and Uninterpreted Functions (EUF) and Difference logic (DIFF) and show that these procedures run in pseudo-polynomial (rather than exponential) time. We then provide a method to construct symbolic decision procedures for simple mixed theories (including the two theories mentioned above) using an extension of the Nelson-Oppen combination method. We present preliminary evaluation of our Procedure on predicate abstraction benchmarks from device driver verification in SLAM. △ Less

Submitted 24 April, 2007; v1 submitted 1 December, 2006; originally announced December 2006.

Comments: The final accepted paper for Logical Methods in Computer Science, special issue on CAV 2005. Editor Sriram Rajamani ([email protected]). Please perform make to build the paper. The pdf file is paper.pdf, and the comments for the referee's is present in referee_comments

ACM Class: F.3.1; F.4.1

Journal ref: Logical Methods in Computer Science, Volume 3, Issue 2 (April 24, 2007) lmcs:2218

arXiv:cs/0407006 [pdf, ps, other]

Predicate Abstraction with Indexed Predicates

Authors: Shuvendu K. Lahiri, Randal E. Bryant

Abstract: Predicate abstraction provides a powerful tool for verifying properties of infinite-state systems using a combination of a decision procedure for a subset of first-order logic and symbolic methods originally developed for finite-state model checking. We consider models containing first-order state variables, where the system state includes mutable functions and predicates. Such a model can descr… ▽ More Predicate abstraction provides a powerful tool for verifying properties of infinite-state systems using a combination of a decision procedure for a subset of first-order logic and symbolic methods originally developed for finite-state model checking. We consider models containing first-order state variables, where the system state includes mutable functions and predicates. Such a model can describe systems containing arbitrarily large memories, buffers, and arrays of identical processes. We describe a form of predicate abstraction that constructs a formula over a set of universally quantified variables to describe invariant properties of the first-order state variables. We provide a formal justification of the soundness of our approach and describe how it has been used to verify several hardware and software designs, including a directory-based cache coherence protocol. △ Less

Submitted 2 July, 2004; originally announced July 2004.

Comments: 27 pages, 4 figures, 1 table, short version appeared in International Conference on Verification, Model Checking and Abstract Interpretation (VMCAI'04), LNCS 2937, pages = 267--281

ACM Class: F.3.1

Showing 1–26 of 26 results for author: Lahiri, S K