-
Evaluating LLM-driven User-Intent Formalization for Verification-Aware Languages
Authors:
Shuvendu K. Lahiri
Abstract:
Verification-aware programming languages such as Dafny and F* provide means to formally specify and prove properties of programs. Although the problem of checking an implementation against a specification can be defined mechanically, there is no algorithmic way of ensuring the correctness of the user-intent formalization for programs -- that a specification adheres to the user's intent behind the…
▽ More
Verification-aware programming languages such as Dafny and F* provide means to formally specify and prove properties of programs. Although the problem of checking an implementation against a specification can be defined mechanically, there is no algorithmic way of ensuring the correctness of the user-intent formalization for programs -- that a specification adheres to the user's intent behind the program. The intent or requirement is expressed informally in natural language and the specification is a formal artefact. The advent of large language models (LLMs) has made strides bridging the gap between informal intent and formal program implementations recently, driven in large parts due to benchmarks and automated metrics for evaluation.
Recent work has proposed evaluating {\it user-intent formalization} problem for mainstream programming languages~\cite{endres-fse24}. However, such an approach does not readily extend to verification-aware languages that support rich specifications (containing quantifiers and ghost variables) that cannot be evaluated through dynamic execution. Previous work also required generating program mutants using LLMs to create the benchmark. We advocate an alternate approach of {\it symbolically testing specifications} to provide an intuitive metric for evaluating the quality of specifications for verification-aware languages. We demonstrate that our automated metric agrees closely with mostly GPT-4 generated and human-labeled dataset of roughly 150 Dafny specifications for the popular MBPP code-generation benchmark, yet demonstrates cases where the human labeling is not perfect. We believe our work provides a step** stone to enable the establishment of a benchmark and research agenda for the problem of user-intent formalization for programs.
△ Less
Submitted 14 June, 2024;
originally announced June 2024.
-
LLM-Vectorizer: LLM-based Verified Loop Vectorizer
Authors:
Jubi Taneja,
Avery Laird,
Cong Yan,
Madan Musuvathi,
Shuvendu K. Lahiri
Abstract:
Vectorization is a powerful optimization technique that significantly boosts the performance of high performance computing applications operating on large data arrays. Despite decades of research on auto-vectorization, compilers frequently miss opportunities to vectorize code. On the other hand, writing vectorized code manually using compiler intrinsics is still a complex, error-prone task that de…
▽ More
Vectorization is a powerful optimization technique that significantly boosts the performance of high performance computing applications operating on large data arrays. Despite decades of research on auto-vectorization, compilers frequently miss opportunities to vectorize code. On the other hand, writing vectorized code manually using compiler intrinsics is still a complex, error-prone task that demands deep knowledge of specific architecture and compilers.
In this paper, we evaluate the potential of large-language models (LLMs) to generate vectorized (Single Instruction Multiple Data) code from scalar programs that process individual array elements. We propose a novel finite-state machine multi-agents based approach that harnesses LLMs and test-based feedback to generate vectorized code. Our findings indicate that LLMs are capable of producing high performance vectorized code with run-time speedup ranging from 1.1x to 9.4x as compared to the state-of-the-art compilers such as Intel Compiler, GCC, and Clang.
To verify the correctness of vectorized code, we use Alive2, a leading bounded translation validation tool for LLVM IR. We describe a few domain-specific techniques to improve the scalability of Alive2 on our benchmark dataset. Overall, our approach is able to verify 38.2% of vectorizations as correct on the TSVC benchmark dataset.
△ Less
Submitted 7 June, 2024;
originally announced June 2024.
-
Transition from laminar to turbulent pipe flow as a process of growing material instabilities
Authors:
Saptarshi Kumar Lahiri,
Konstantin Volokh
Abstract:
In this work, we simulate the transition to turbulence in the pipe flow based on the modified NS theory incorporating the viscous fluid strength in the constitutive equations. The latter concept enriches theory by allowing for material instabilities in addition to the kinematic ones. We present results of comparative numerical simulations based on the classical NS model and the NS model enhanced w…
▽ More
In this work, we simulate the transition to turbulence in the pipe flow based on the modified NS theory incorporating the viscous fluid strength in the constitutive equations. The latter concept enriches theory by allowing for material instabilities in addition to the kinematic ones. We present results of comparative numerical simulations based on the classical NS model and the NS model enhanced with the finite viscous strength. As expected, simulations based on the classical NS model exhibit stable laminar flow in contrast to experimental observations. Conversely, simulations based on the modified NS model with viscous strength exhibit instabilities and transition to turbulence per experimental observations. The transition to turbulence is triggered by the growing material instabilities.
△ Less
Submitted 10 May, 2024;
originally announced May 2024.
-
3DGen: AI-Assisted Generation of Provably Correct Binary Format Parsers
Authors:
Sarah Fakhoury,
Markus Kuppe,
Shuvendu K. Lahiri,
Tahina Ramananandro,
Nikhil Swamy
Abstract:
Improper parsing of attacker-controlled input is a leading source of software security vulnerabilities, especially when programmers transcribe informal format descriptions in RFCs into efficient parsing logic in low-level, memory unsafe languages. Several researchers have proposed formal specification languages for data formats from which efficient code can be extracted. However, distilling inform…
▽ More
Improper parsing of attacker-controlled input is a leading source of software security vulnerabilities, especially when programmers transcribe informal format descriptions in RFCs into efficient parsing logic in low-level, memory unsafe languages. Several researchers have proposed formal specification languages for data formats from which efficient code can be extracted. However, distilling informal requirements into formal specifications is challenging and, despite their benefits, new, formal languages are hard for people to learn and use.
In this work, we present 3DGen, a framework that makes use of AI agents to transform mixed informal input, including natural language documents (i.e., RFCs) and example inputs into format specifications in a language called 3D. To support humans in understanding and trusting the generated specifications, 3DGen uses symbolic methods to also synthesize test inputs that can be validated against an external oracle. Symbolic test generation also helps in distinguishing multiple plausible solutions. Through a process of repeated refinement, 3DGen produces a 3D specification that conforms to a test suite, and which yields safe, efficient, provably correct, parsing code in C.
We have evaluated 3DGen on 20 Internet standard formats, demonstrating the potential for AI-agents to produce formally verified C code at a non-trivial scale. A key enabler is the use of a domain-specific language to limit AI outputs to a class for which automated, symbolic analysis is tractable.
△ Less
Submitted 6 May, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
LLM-based Test-driven Interactive Code Generation: User Study and Empirical Evaluation
Authors:
Sarah Fakhoury,
Aaditya Naik,
Georgios Sakkas,
Saikat Chakraborty,
Shuvendu K. Lahiri
Abstract:
Large language models (LLMs) have shown great potential in automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, given NL is informal, it does not lend easily to checking that the generated code correctly satisfies the user intent. In this paper, we propose a novel interactive workflow TiCoder for guided intent clarification (i.e.,…
▽ More
Large language models (LLMs) have shown great potential in automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, given NL is informal, it does not lend easily to checking that the generated code correctly satisfies the user intent. In this paper, we propose a novel interactive workflow TiCoder for guided intent clarification (i.e., partial formalization) through tests to support the generation of more accurate code suggestions. Through a mixed methods user study with 15 programmers, we present an empirical evaluation of the effectiveness of the workflow to improve code generation accuracy. We find that participants using the proposed workflow are significantly more likely to correctly evaluate AI generated code, and report significantly less task-induced cognitive load. Furthermore, we test the potential of the workflow at scale with four different state-of-the-art LLMs on two python datasets, using an idealized proxy for a user feedback. We observe an average absolute improvement of 38.43% in the pass@1 code generation accuracy for both datasets and across all LLMs within 5 user interactions, in addition to the automatic generation of accompanying unit tests.
△ Less
Submitted 15 April, 2024;
originally announced April 2024.
-
Finding Inductive Loop Invariants using Large Language Models
Authors:
Adharsh Kamath,
Aditya Senthilnathan,
Saikat Chakraborty,
Pantazis Deligiannis,
Shuvendu K. Lahiri,
Akash Lal,
Aseem Rastogi,
Subhajit Roy,
Rahul Sharma
Abstract:
Loop invariants are fundamental to reasoning about programs with loops. They establish properties about a given loop's behavior. When they additionally are inductive, they become useful for the task of formal verification that seeks to establish strong mathematical guarantees about program's runtime behavior. The inductiveness ensures that the invariants can be checked locally without consulting t…
▽ More
Loop invariants are fundamental to reasoning about programs with loops. They establish properties about a given loop's behavior. When they additionally are inductive, they become useful for the task of formal verification that seeks to establish strong mathematical guarantees about program's runtime behavior. The inductiveness ensures that the invariants can be checked locally without consulting the entire program, thus are indispensable artifacts in a formal proof of correctness. Finding inductive loop invariants is an undecidable problem, and despite a long history of research towards practical solutions, it remains far from a solved problem. This paper investigates the capabilities of the Large Language Models (LLMs) in offering a new solution towards this old, yet important problem. To that end, we first curate a dataset of verification problems on programs with loops. Next, we design a prompt for exploiting LLMs, obtaining inductive loop invariants, that are checked for correctness using sound symbolic tools. Finally, we explore the effectiveness of using an efficient combination of a symbolic tool and an LLM on our dataset and compare it against a purely symbolic baseline. Our results demonstrate that LLMs can help improve the state-of-the-art in automated program verification.
△ Less
Submitted 14 November, 2023;
originally announced November 2023.
-
Ranking LLM-Generated Loop Invariants for Program Verification
Authors:
Saikat Chakraborty,
Shuvendu K. Lahiri,
Sarah Fakhoury,
Madanlal Musuvathi,
Akash Lal,
Aseem Rastogi,
Aditya Senthilnathan,
Rahul Sharma,
Nikhil Swamy
Abstract:
Synthesizing inductive loop invariants is fundamental to automating program verification. In this work, we observe that Large Language Models (such as gpt-3.5 or gpt-4) are capable of synthesizing loop invariants for a class of programs in a 0-shot setting, yet require several samples to generate the correct invariants. This can lead to a large number of calls to a program verifier to establish an…
▽ More
Synthesizing inductive loop invariants is fundamental to automating program verification. In this work, we observe that Large Language Models (such as gpt-3.5 or gpt-4) are capable of synthesizing loop invariants for a class of programs in a 0-shot setting, yet require several samples to generate the correct invariants. This can lead to a large number of calls to a program verifier to establish an invariant. To address this issue, we propose a {\it re-ranking} approach for the generated results of LLMs. We have designed a ranker that can distinguish between correct inductive invariants and incorrect attempts based on the problem definition. The ranker is optimized as a contrastive ranker. Experimental results demonstrate that this re-ranking mechanism significantly improves the ranking of correct invariants among the generated candidates, leading to a notable reduction in the number of calls to a verifier. The source code and the experimental data for this paper are available in \url{https://github.com/microsoft/NeuralInvariantRanker}.
△ Less
Submitted 12 February, 2024; v1 submitted 13 October, 2023;
originally announced October 2023.
-
Can Large Language Models Transform Natural Language Intent into Formal Method Postconditions?
Authors:
Madeline Endres,
Sarah Fakhoury,
Saikat Chakraborty,
Shuvendu K. Lahiri
Abstract:
Informal natural language that describes code functionality, such as code comments or function documentation, may contain substantial information about a programs intent. However, there is typically no guarantee that a programs implementation and natural language documentation are aligned. In the case of a conflict, leveraging information in code-adjacent natural language has the potential to enha…
▽ More
Informal natural language that describes code functionality, such as code comments or function documentation, may contain substantial information about a programs intent. However, there is typically no guarantee that a programs implementation and natural language documentation are aligned. In the case of a conflict, leveraging information in code-adjacent natural language has the potential to enhance fault localization, debugging, and code trustworthiness. In practice, however, this information is often underutilized due to the inherent ambiguity of natural language which makes natural language intent challenging to check programmatically. The emergent abilities of Large Language Models (LLMs) have the potential to facilitate the translation of natural language intent to programmatically checkable assertions. However, it is unclear if LLMs can correctly translate informal natural language specifications into formal specifications that match programmer intent. Additionally, it is unclear if such translation could be useful in practice. In this paper, we describe nl2postcond, the problem of leveraging LLMs for transforming informal natural language to formal method postconditions, expressed as program assertions. We introduce and validate metrics to measure and compare different nl2postcond approaches, using the correctness and discriminative power of generated postconditions. We then use qualitative and quantitative methods to assess the quality of nl2postcond postconditions, finding that they are generally correct and able to discriminate incorrect code. Finally, we find that nl2postcond via LLMs has the potential to be helpful in practice; nl2postcond generated postconditions were able to catch 64 real-world historical bugs from Defects4J.
△ Less
Submitted 15 April, 2024; v1 submitted 3 October, 2023;
originally announced October 2023.
-
Guiding Language Models of Code with Global Context using Monitors
Authors:
Lakshya A Agrawal,
Aditya Kanade,
Navin Goyal,
Shuvendu K. Lahiri,
Sriram K. Rajamani
Abstract:
Language models of code (LMs) work well when the surrounding code provides sufficient context. This is not true when it becomes necessary to use types, functionality or APIs defined elsewhere in the repository or a linked library, especially those not seen during training. LMs suffer from limited awareness of such global context and end up hallucinating.
Integrated development environments (IDEs…
▽ More
Language models of code (LMs) work well when the surrounding code provides sufficient context. This is not true when it becomes necessary to use types, functionality or APIs defined elsewhere in the repository or a linked library, especially those not seen during training. LMs suffer from limited awareness of such global context and end up hallucinating.
Integrated development environments (IDEs) assist developers in understanding repository context using static analysis. We extend this assistance, enjoyed by developers, to LMs. We propose monitor-guided decoding (MGD) where a monitor uses static analysis to guide the decoding. We construct a repository-level dataset PragmaticCode for method-completion in Java and evaluate MGD on it. On models of varying parameter scale, by monitoring for type-consistent object dereferences, MGD consistently improves compilation rates and agreement with ground truth. Further, LMs with fewer parameters, when augmented with MGD, can outperform larger LMs. With MGD, SantaCoder-1.1B achieves better compilation rate and next-identifier match than the much larger text-davinci-003 model.
We also conduct a generalizability study to evaluate the ability of MGD to generalize to multiple programming languages (Java, C# and Rust), coding scenarios (e.g., correct number of arguments to method calls), and to enforce richer semantic constraints (e.g., stateful API protocols). Our data and implementation are available at https://github.com/microsoft/monitors4codegen .
△ Less
Submitted 3 November, 2023; v1 submitted 19 June, 2023;
originally announced June 2023.
-
Towards Generating Functionally Correct Code Edits from Natural Language Issue Descriptions
Authors:
Sarah Fakhoury,
Saikat Chakraborty,
Madan Musuvathi,
Shuvendu K. Lahiri
Abstract:
Large language models (LLMs), such as OpenAI's Codex, have demonstrated their potential to generate code from natural language descriptions across a wide range of programming tasks. Several benchmarks have recently emerged to evaluate the ability of LLMs to generate functionally correct code from natural language intent with respect to a set of hidden test cases. This has enabled the research comm…
▽ More
Large language models (LLMs), such as OpenAI's Codex, have demonstrated their potential to generate code from natural language descriptions across a wide range of programming tasks. Several benchmarks have recently emerged to evaluate the ability of LLMs to generate functionally correct code from natural language intent with respect to a set of hidden test cases. This has enabled the research community to identify significant and reproducible advancements in LLM capabilities. However, there is currently a lack of benchmark datasets for assessing the ability of LLMs to generate functionally correct code edits based on natural language descriptions of intended changes. This paper aims to address this gap by motivating the problem NL2Fix of translating natural language descriptions of code changes (namely bug fixes described in Issue reports in repositories) into correct code fixes. To this end, we introduce Defects4J-NL2Fix, a dataset of 283 Java programs from the popular Defects4J dataset augmented with high-level descriptions of bug fixes, and empirically evaluate the performance of several state-of-the-art LLMs for the this task. Results show that these LLMS together are capable of generating plausible fixes for 64.6% of the bugs, and the best LLM-based technique can achieve up to 21.20% top-1 and 35.68% top-5 accuracy on this benchmark.
△ Less
Submitted 7 April, 2023;
originally announced April 2023.
-
Interactive Code Generation via Test-Driven User-Intent Formalization
Authors:
Shuvendu K. Lahiri,
Sarah Fakhoury,
Aaditya Naik,
Georgios Sakkas,
Saikat Chakraborty,
Madanlal Musuvathi,
Piali Choudhury,
Curtis von Veh,
Jeevana Priya Inala,
Chenglong Wang,
Jianfeng Gao
Abstract:
Large language models (LLMs) have shown great potential in automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, when interacting with LLMs, users have no guarantees that the code suggestions produced correctly satisfy the intent they provided. In fact, it is hard to define a notion of correctness since natural language can be ambig…
▽ More
Large language models (LLMs) have shown great potential in automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, when interacting with LLMs, users have no guarantees that the code suggestions produced correctly satisfy the intent they provided. In fact, it is hard to define a notion of correctness since natural language can be ambiguous and lacks a formal semantics.
In this paper, we propose the workflow of {\it interactive test-driven code generation}, which leverages lightweight user feedback to (a) formalize the user intent using generated tests that can be useful for debugging, and (b) produce an improved set of code suggestions by pruning and ranking candidate code suggestions. We describe a language-agnostic abstract algorithm and a concrete implementation TiCoder. We perform an automated evaluation of TiCoder on the \emph{MBPP} and \emph{HumanEval} code generation benchmarks. Our results are promising with using the OpenAI Codex LLM: our best algorithm improves the \passk{1} code generation accuracy (in absolute percentages) between $22.49\%$ to $37.71\%$ for MBPP and between $24.79\%$ to $53.98\%$ for HumanEval using between 1 to 5 simulated user queries.
△ Less
Submitted 3 October, 2023; v1 submitted 11 August, 2022;
originally announced August 2022.
-
Fault-Aware Neural Code Rankers
Authors:
Jeevana Priya Inala,
Chenglong Wang,
Mei Yang,
Andres Codas,
Mark EncarnaciĆ³n,
Shuvendu K Lahiri,
Madanlal Musuvathi,
Jianfeng Gao
Abstract:
Large language models (LLMs) have demonstrated an impressive ability to generate code for various programming tasks. In many instances, LLMs can generate a correct program for a task when given numerous trials. Consequently, a recent trend is to do large scale sampling of programs using a model and then filtering/ranking the programs based on the program execution on a small number of known unit t…
▽ More
Large language models (LLMs) have demonstrated an impressive ability to generate code for various programming tasks. In many instances, LLMs can generate a correct program for a task when given numerous trials. Consequently, a recent trend is to do large scale sampling of programs using a model and then filtering/ranking the programs based on the program execution on a small number of known unit tests to select one candidate solution. However, these approaches assume that the unit tests are given and assume the ability to safely execute the generated programs (which can do arbitrary dangerous operations such as file manipulations). Both of the above assumptions are impractical in real-world software development. In this paper, we propose CodeRanker, a neural ranker that can predict the correctness of a sampled program without executing it. Our CodeRanker is fault-aware i.e., it is trained to predict different kinds of execution information such as predicting the exact compile/runtime error type (e.g., an IndexError or a TypeError). We show that CodeRanker can significantly increase the pass@1 accuracy of various code generation models (including Codex, GPT-Neo, GPT-J) on APPS, HumanEval and MBPP datasets.
△ Less
Submitted 9 December, 2022; v1 submitted 4 June, 2022;
originally announced June 2022.
-
Can Pre-trained Language Models be Used to Resolve Textual and Semantic Merge Conflicts?
Authors:
Jialu Zhang,
Todd Mytkowicz,
Mike Kaufman,
Ruzica Piskac,
Shuvendu K. Lahiri
Abstract:
Program merging is standard practice when developers integrate their individual changes to a common code base. When the merge algorithm fails, this is called a merge conflict. The conflict either manifests in textual merge conflicts where the merge fails to produce code, or semantic merge conflicts where the merged code results in compiler or test breaks. Resolving these conflicts for large code p…
▽ More
Program merging is standard practice when developers integrate their individual changes to a common code base. When the merge algorithm fails, this is called a merge conflict. The conflict either manifests in textual merge conflicts where the merge fails to produce code, or semantic merge conflicts where the merged code results in compiler or test breaks. Resolving these conflicts for large code projects is expensive because it requires developers to manually identify the sources of conflict and correct them.
In this paper, we explore the feasibility of automatically repairing merge conflicts (both textual and semantic) using k-shot learning with large neural language models (LM) such as GPT-3. One of the challenges in leveraging such language models is to fit the examples and the queries within a small prompt (2048 tokens). We evaluate LMs and k-shot learning for two broad applications: (a) textual and semantic merge conflicts for a divergent fork Microsoft Edge, and (b) textual merge conflicts for a large number of JavaScript projects in GitHub. Our results are mixed: one one-hand, LMs provide the state-of-the-art (SOTA) performance on semantic merge conflict resolution for Edge compared to earlier symbolic approaches; on the other hand, LMs do not yet obviate the benefits of fine-tuning neural models (when sufficient data is available) or the design of special purpose domain-specific languages (DSL) for restricted patterns for program synthesis.
△ Less
Submitted 23 November, 2021;
originally announced November 2021.
-
SolType: Refinement Types for Arithmetic Overflow in Solidity
Authors:
Bryan Tan,
Benjamin Mariano,
Shuvendu K. Lahiri,
Isil Dillig,
Yu Feng
Abstract:
As smart contracts gain adoption in financial transactions, it becomes increasingly important to ensure that they are free of bugs and security vulnerabilities. Of particular relevance in this context are arithmetic overflow bugs, as integers are often used to represent financial assets like account balances. Motivated by this observation, this paper presents SolType, a refinement type system for…
▽ More
As smart contracts gain adoption in financial transactions, it becomes increasingly important to ensure that they are free of bugs and security vulnerabilities. Of particular relevance in this context are arithmetic overflow bugs, as integers are often used to represent financial assets like account balances. Motivated by this observation, this paper presents SolType, a refinement type system for Solidity that can be used to prevent arithmetic over- and under-flows in smart contracts. SolType allows developers to add refinement type annotations and uses them to prove that arithmetic operations do not lead to over- and under-flows. SolType incorporates a rich vocabulary of refinement terms that allow expressing relationships between integer values and aggregate properties of complex data structures. Furthermore, our implementation, called Solid, incorporates a type inference engine and can automatically infer useful type annotations, including non-trivial contract invariants.
To evaluate the usefulness of our type system, we use Solid to prove arithmetic safety of a total of 120 smart contracts. When used in its fully automated mode (i.e., using Solid's type inference capabilities), Solid is able to eliminate 86.3% of redundant runtime checks used to guard against overflows. We also compare Solid against a state-of-the-art arithmetic safety verifier called VeriSmart and show that Solid has a significantly lower false positive rate, while being significantly faster in terms of verification time.
△ Less
Submitted 20 December, 2021; v1 submitted 1 October, 2021;
originally announced October 2021.
-
TOGA: A Neural Method for Test Oracle Generation
Authors:
Elizabeth Dinella,
Gabriel Ryan,
Todd Mytkowicz,
Shuvendu K. Lahiri
Abstract:
Testing is widely recognized as an important stage of the software development lifecycle. Effective software testing can provide benefits such as bug finding, preventing regressions, and documentation. In terms of documentation, unit tests express a unit's intended functionality, as conceived by the developer. A test oracle, typically expressed as an condition, documents the intended behavior of a…
▽ More
Testing is widely recognized as an important stage of the software development lifecycle. Effective software testing can provide benefits such as bug finding, preventing regressions, and documentation. In terms of documentation, unit tests express a unit's intended functionality, as conceived by the developer. A test oracle, typically expressed as an condition, documents the intended behavior of a unit under a given test prefix. Synthesizing a functional test oracle is a challenging problem, as it must capture the intended functionality rather than the implemented functionality.
In this paper, we propose TOGA (a neural method for Test Oracle GenerAtion), a unified transformer-based neural approach to infer both exceptional and assertion test oracles based on the context of the focal method. Our approach can handle units with ambiguous or missing documentation, and even units with a missing implementation. We evaluate our approach on both oracle inference accuracy and functional bug-finding. Our technique improves accuracy by 33\% over existing oracle inference approaches, achieving 96\% overall accuracy on a held out test dataset. Furthermore, we show that when integrated with a automated test generation tool (EvoSuite), our approach finds 57 real world bugs in large-scale Java programs, including 30 bugs that are not found by any other automated testing method in our evaluation.
△ Less
Submitted 20 April, 2022; v1 submitted 19 September, 2021;
originally announced September 2021.
-
DeepMerge: Learning to Merge Programs
Authors:
Elizabeth Dinella,
Todd Mytkowicz,
Alexey Svyatkovskiy,
Christian Bird,
Mayur Naik,
Shuvendu K. Lahiri
Abstract:
In collaborative software development, program merging is the mechanism to integrate changes from multiple programmers. Merge algorithms in modern version control systems report a conflict when changes interfere textually. Merge conflicts require manual intervention and frequently stall modern continuous integration pipelines. Prior work found that, although costly, a large majority of resolutions…
▽ More
In collaborative software development, program merging is the mechanism to integrate changes from multiple programmers. Merge algorithms in modern version control systems report a conflict when changes interfere textually. Merge conflicts require manual intervention and frequently stall modern continuous integration pipelines. Prior work found that, although costly, a large majority of resolutions involve re-arranging text without writing any new code. Inspired by this observation we propose the first data-driven approach to resolve merge conflicts with a machine learning model. We realize our approach in a tool DeepMerge that uses a novel combination of (i) an edit-aware embedding of merge inputs and (ii) a variation of pointer networks, to construct resolutions from input segments. We also propose an algorithm to localize manual resolutions in a resolved file and employ it to curate a ground-truth dataset comprising 8,719 non-trivial resolutions in JavaScript programs. Our evaluation shows that, on a held out test set, DeepMerge can predict correct resolutions for 37% of non-trivial merges, compared to only 4% by a state-of-the-art semistructured merge technique. Furthermore, on the subset of merges with upto 3 lines (comprising 24% of the total dataset), DeepMerge can predict correct resolutions with 78% accuracy.
△ Less
Submitted 6 September, 2021; v1 submitted 16 May, 2021;
originally announced May 2021.
-
A Dataset of Dockerfiles
Authors:
Jordan Henkel,
Christian Bird,
Shuvendu K. Lahiri,
Thomas Reps
Abstract:
Dockerfiles are one of the most prevalent kinds of DevOps artifacts used in industry. Despite their prevalence, there is a lack of sophisticated semantics-aware static analysis of Dockerfiles. In this paper, we introduce a dataset of approximately 178,000 unique Dockerfiles collected from GitHub. To enhance the usability of this data, we describe five representations we have devised for working wi…
▽ More
Dockerfiles are one of the most prevalent kinds of DevOps artifacts used in industry. Despite their prevalence, there is a lack of sophisticated semantics-aware static analysis of Dockerfiles. In this paper, we introduce a dataset of approximately 178,000 unique Dockerfiles collected from GitHub. To enhance the usability of this data, we describe five representations we have devised for working with, mining from, and analyzing these Dockerfiles. Each Dockerfile representation builds upon the previous ones, and the final representation, created by three levels of nested parsing and abstraction, makes tasks such as mining and static checking tractable. The Dockerfiles, in each of the five representations, along with metadata and the tools used to shepard the data from one representation to the next are all available at: https://doi.org/10.5281/zenodo.3628771.
△ Less
Submitted 28 March, 2020;
originally announced March 2020.
-
Learning from, Understanding, and Supporting DevOps Artifacts for Docker
Authors:
Jordan Henkel,
Christian Bird,
Shuvendu K. Lahiri,
Thomas Reps
Abstract:
With the growing use of DevOps tools and frameworks, there is an increased need for tools and techniques that support more than code. The current state-of-the-art in static developer assistance for tools like Docker is limited to shallow syntactic validation. We identify three core challenges in the realm of learning from, understanding, and supporting developers writing DevOps artifacts: (i) nest…
▽ More
With the growing use of DevOps tools and frameworks, there is an increased need for tools and techniques that support more than code. The current state-of-the-art in static developer assistance for tools like Docker is limited to shallow syntactic validation. We identify three core challenges in the realm of learning from, understanding, and supporting developers writing DevOps artifacts: (i) nested languages in DevOps artifacts, (ii) rule mining, and (iii) the lack of semantic rule-based analysis. To address these challenges we introduce a toolset, binnacle, that enabled us to ingest 900,000 GitHub repositories.
Focusing on Docker, we extracted approximately 178,000 unique Dockerfiles, and also identified a Gold Set of Dockerfiles written by Docker experts. We addressed challenge (i) by reducing the number of effectively uninterpretable nodes in our ASTs by over 80% via a technique we call phased parsing. To address challenge (ii), we introduced a novel rule-mining technique capable of recovering two-thirds of the rules in a benchmark we curated. Through this automated mining, we were able to recover 16 new rules that were not found during manual rule collection. To address challenge (iii), we manually collected a set of rules for Dockerfiles from commits to the files in the Gold Set. These rules encapsulate best practices, avoid docker build failures, and improve image size and build latency. We created an analyzer that used these rules, and found that, on average, Dockerfiles on GitHub violated the rules five times more frequently than the Dockerfiles in our Gold Set. We also found that industrial Dockerfiles fared no better than those sourced from GitHub.
The learned rules and analyzer in binnacle can be used to aid developers in the IDE when creating Dockerfiles, and in a post-hoc fashion to identify issues in, and to improve, existing Dockerfiles.
△ Less
Submitted 7 February, 2020;
originally announced February 2020.
-
A stable SPH with adaptive B-spline kernel
Authors:
Saptarshi Kumar Lahiri,
Kanishka Bhattacharya,
Amit Shaw,
L S Ramachandra
Abstract:
Tensile instability, often observed in smoothed particle hydrodynamics (SPH), is a numerical artifact that manifests itself by unphysical clustering or separation of particles. The instability originates in estimating the derivatives of the smoothing functions which, when interact with material constitution may result in negative stiffness in the discretized system. In the present study, a stable…
▽ More
Tensile instability, often observed in smoothed particle hydrodynamics (SPH), is a numerical artifact that manifests itself by unphysical clustering or separation of particles. The instability originates in estimating the derivatives of the smoothing functions which, when interact with material constitution may result in negative stiffness in the discretized system. In the present study, a stable formulation of SPH is developed where the kernel function is continuously adapted at every material point depending on its state of stress. Bspline basis function with a variable intermediate knot is used as the kernel function. The shape of the kernel function is then modified by changing the intermediate knot position such that the condition associated with instability does not arise. While implementing the algorithm the simplicity and computational efficiency of SPH are not compromised. One-dimensional dispersion analysis is performed to understand the effect adaptive kernel on the stability. Finally, the efficacy of the algorithm is demonstrated through some benchmark elastic dynamics problems.
△ Less
Submitted 4 January, 2020;
originally announced January 2020.
-
Enabling Open-World Specification Mining via Unsupervised Learning
Authors:
Jordan Henkel,
Shuvendu K. Lahiri,
Ben Liblit,
Thomas Reps
Abstract:
Many programming tasks require using both domain-specific code and well-established patterns (such as routines concerned with file IO). Together, several small patterns combine to create complex interactions. This compounding effect, mixed with domain-specific idiosyncrasies, creates a challenging environment for fully automatic specification inference. Mining specifications in this environment, w…
▽ More
Many programming tasks require using both domain-specific code and well-established patterns (such as routines concerned with file IO). Together, several small patterns combine to create complex interactions. This compounding effect, mixed with domain-specific idiosyncrasies, creates a challenging environment for fully automatic specification inference. Mining specifications in this environment, without the aid of rule templates, user-directed feedback, or predefined API surfaces, is a major challenge. We call this challenge Open-World Specification Mining.
In this paper, we present a framework for mining specifications and usage patterns in an Open-World setting. We design this framework to be miner-agnostic and instead focus on disentangling complex and noisy API interactions. To evaluate our framework, we introduce a benchmark of 71 clusters extracted from five open-source projects. Using this dataset, we show that interesting clusters can be recovered, in a fully automatic way, by leveraging unsupervised learning in the form of word embeddings. Once clusters have been recovered, the challenge of Open-World Specification Mining is simplified and any trace-based mining technique can be applied. In addition, we provide a comprehensive evaluation of three word-vector learners to showcase the value of sub-word information for embeddings learned in the software-engineering domain.
△ Less
Submitted 26 April, 2019;
originally announced April 2019.
-
Formal Specification and Verification of Smart Contracts for Azure Blockchain
Authors:
Yuepeng Wang,
Shuvendu K. Lahiri,
Shuo Chen,
Rong Pan,
Isil Dillig,
Cody Born,
Immad Naseer
Abstract:
Ensuring correctness of smart contracts is paramount to ensuring trust in blockchain-based systems. This paper studies the safety and security of smart contracts in the \emph{Azure Blockchain Workbench}, an enterprise Blockchain-as-a-Service offering from Microsoft. As part of this study, we formalize \emph{semantic conformance} of smart contracts against a state machine model with access-control…
▽ More
Ensuring correctness of smart contracts is paramount to ensuring trust in blockchain-based systems. This paper studies the safety and security of smart contracts in the \emph{Azure Blockchain Workbench}, an enterprise Blockchain-as-a-Service offering from Microsoft. As part of this study, we formalize \emph{semantic conformance} of smart contracts against a state machine model with access-control policy and develop a highly-automated formal verifier for Solidity that can produce proofs as well as counterexamples. We have applied our verifier {\sc VeriSol} to analyze {\it all} contracts shipped with the Azure Blockchain Workbench, which includes application samples as well as a governance contract for Proof of Authority (PoA). We have found previously unknown bugs in these published smart contracts. After fixing these bugs, {\sc VeriSol} was able to successfully perform full verification for all of these contracts.
△ Less
Submitted 29 April, 2019; v1 submitted 20 December, 2018;
originally announced December 2018.
-
Code Vectors: Understanding Programs Through Embedded Abstracted Symbolic Traces
Authors:
Jordan Henkel,
Shuvendu K. Lahiri,
Ben Liblit,
Thomas Reps
Abstract:
With the rise of machine learning, there is a great deal of interest in treating programs as data to be fed to learning algorithms. However, programs do not start off in a form that is immediately amenable to most off-the-shelf learning techniques. Instead, it is necessary to transform the program to a suitable representation before a learning technique can be applied.
In this paper, we use abst…
▽ More
With the rise of machine learning, there is a great deal of interest in treating programs as data to be fed to learning algorithms. However, programs do not start off in a form that is immediately amenable to most off-the-shelf learning techniques. Instead, it is necessary to transform the program to a suitable representation before a learning technique can be applied.
In this paper, we use abstractions of traces obtained from symbolic execution of a program as a representation for learning word embeddings. We trained a variety of word embeddings under hundreds of parameterizations, and evaluated each learned embedding on a suite of different tasks. In our evaluation, we obtain 93% top-1 accuracy on a benchmark consisting of over 19,000 API-usage analogies extracted from the Linux kernel. In addition, we show that embeddings learned from (mainly) semantic abstractions provide nearly triple the accuracy of those learned from (mainly) syntactic abstractions.
△ Less
Submitted 20 August, 2018; v1 submitted 18 March, 2018;
originally announced March 2018.
-
Verifying Equivalence of Database-Driven Applications
Authors:
Yuepeng Wang,
Isil Dillig,
Shuvendu K. Lahiri,
William R. Cook
Abstract:
This paper addresses the problem of verifying equivalence between a pair of programs that operate over databases with different schemas. This problem is particularly important in the context of web applications, which typically undergo database refactoring either for performance or maintainability reasons. While web applications should have the same externally observable behavior before and after…
▽ More
This paper addresses the problem of verifying equivalence between a pair of programs that operate over databases with different schemas. This problem is particularly important in the context of web applications, which typically undergo database refactoring either for performance or maintainability reasons. While web applications should have the same externally observable behavior before and after schema migration, there are no existing tools for proving equivalence of such programs. This paper takes a first step towards solving this problem by formalizing the equivalence and refinement checking problems for database-driven applications. We also propose a proof methodology based on the notion of bisimulation invariants over relational algebra with updates and describe a technique for synthesizing such bisimulation invariants. We have implemented the proposed technique in a tool called Mediator for verifying equivalence between database-driven applications written in our intermediate language and evaluate our tool on 21 benchmarks extracted from textbooks and real-world web applications. Our results show that the proposed methodology can successfully verify 20 of these benchmarks.
△ Less
Submitted 20 October, 2017;
originally announced October 2017.
-
Interprocedural Semantic Change-Impact Analysis using Equivalence Relations
Authors:
Alex Gyori,
Shuvendu K. Lahiri,
Nimrod Partush
Abstract:
Change-impact analysis (CIA) is the task of determining the set of program elements impacted by a program change. Precise CIA has great potential to avoid expensive testing and code reviews for (parts of) changes that are refactorings (semantics-preserving). Existing CIA is imprecise because it is coarse-grained, deals with only few refactoring patterns, or is unaware of the change semantics.
We…
▽ More
Change-impact analysis (CIA) is the task of determining the set of program elements impacted by a program change. Precise CIA has great potential to avoid expensive testing and code reviews for (parts of) changes that are refactorings (semantics-preserving). Existing CIA is imprecise because it is coarse-grained, deals with only few refactoring patterns, or is unaware of the change semantics.
We formalize the notion of change impact in terms of the trace semantics of two program versions. We show how to leverage equivalence relations to make dataflow-based CIA aware of the change semantics, thereby improving precision in the presence of semantics-preserving changes. We propose an anytime algorithm that allows applying costly equivalence relation inference incrementally to refine the set of impacted statements. We have implemented a prototype in SymDiff, and evaluated it on 322 real-world changes from open-source projects and benchmark programs used by prior research. The evaluation results show an average 35% improvement in the size of the set of impacted statements compared to standard dataflow-based techniques.
△ Less
Submitted 27 September, 2016;
originally announced September 2016.
-
Predicate Abstraction via Symbolic Decision Procedures
Authors:
Shuvendu K. Lahiri,
Thomas Ball,
Byron Cook
Abstract:
We present a new approach for performing predicate abstraction based on symbolic decision procedures. Intuitively, a symbolic decision procedure for a theory takes a set of predicates in the theory and symbolically executes a decision procedure on all the subsets over the set of predicates. The result of the symbolic decision procedure is a shared expression (represented by a directed acyclic gr…
▽ More
We present a new approach for performing predicate abstraction based on symbolic decision procedures. Intuitively, a symbolic decision procedure for a theory takes a set of predicates in the theory and symbolically executes a decision procedure on all the subsets over the set of predicates. The result of the symbolic decision procedure is a shared expression (represented by a directed acyclic graph) that implicitly represents the answer to a predicate abstraction query.
We present symbolic decision procedures for the logic of Equality and Uninterpreted Functions (EUF) and Difference logic (DIFF) and show that these procedures run in pseudo-polynomial (rather than exponential) time. We then provide a method to construct symbolic decision procedures for simple mixed theories (including the two theories mentioned above) using an extension of the Nelson-Oppen combination method. We present preliminary evaluation of our Procedure on predicate abstraction benchmarks from device driver verification in SLAM.
△ Less
Submitted 24 April, 2007; v1 submitted 1 December, 2006;
originally announced December 2006.
-
Predicate Abstraction with Indexed Predicates
Authors:
Shuvendu K. Lahiri,
Randal E. Bryant
Abstract:
Predicate abstraction provides a powerful tool for verifying properties of infinite-state systems using a combination of a decision procedure for a subset of first-order logic and symbolic methods originally developed for finite-state model checking. We consider models containing first-order state variables, where the system state includes mutable functions and predicates. Such a model can descr…
▽ More
Predicate abstraction provides a powerful tool for verifying properties of infinite-state systems using a combination of a decision procedure for a subset of first-order logic and symbolic methods originally developed for finite-state model checking. We consider models containing first-order state variables, where the system state includes mutable functions and predicates. Such a model can describe systems containing arbitrarily large memories, buffers, and arrays of identical processes. We describe a form of predicate abstraction that constructs a formula over a set of universally quantified variables to describe invariant properties of the first-order state variables. We provide a formal justification of the soundness of our approach and describe how it has been used to verify several hardware and software designs, including a directory-based cache coherence protocol.
△ Less
Submitted 2 July, 2004;
originally announced July 2004.