Search | arXiv e-print repository

Generative Explanations for Program Synthesizers

Authors: Amirmohammad Nazari, Souti Chattopadhyay, Swabha Swayamdipta, Mukund Raghothaman

Abstract: Despite great advances in program synthesis techniques, they remain algorithmic black boxes. Although they guarantee that when synthesis is successful, the implementation satisfies the specification, they provide no additional information regarding how the implementation works or the manner in which the specification is realized. One possibility to answer these questions is to use large language m… ▽ More Despite great advances in program synthesis techniques, they remain algorithmic black boxes. Although they guarantee that when synthesis is successful, the implementation satisfies the specification, they provide no additional information regarding how the implementation works or the manner in which the specification is realized. One possibility to answer these questions is to use large language models (LLMs) to construct human-readable explanations. Unfortunately, experiments reveal that LLMs frequently produce nonsensical or misleading explanations when applied to the unidiomatic code produced by program synthesizers. In this paper, we develop an approach to reliably augment the implementation with explanatory names. We recover fine-grained input-output data from the synthesis algorithm to enhance the prompt supplied to the LLM, and use a combination of a program verifier and a second language model to validate the proposed explanations before presenting them to the user. Together, these techniques massively improve the accuracy of the proposed names, from 24% to 79% respectively. Through a pair of small user studies, we find that users significantly prefer the explanations produced by our technique (76% of responses indicating the appropriateness of the presenting names) to the baseline (with only 2% of responses approving of the suggestions), and that the proposed names measurably help users in understanding the synthesized implementation. △ Less

Submitted 5 March, 2024; originally announced March 2024.

arXiv:2403.01314 [pdf, other]

Superflows: A New Tool for Forensic Network Flow Analysis

Authors: Michael Collins, Jyotirmoy V. Deshmukh, Dristi Dinesh, Mukund Raghothaman, Srivatsan Ravi, Yuan Xia

Abstract: Network security analysts gather data from diverse sources, from high-level summaries of network flow and traffic volumes to low-level details such as service logs from servers and the contents of individual packets. They validate and check this data against traffic patterns and historical indicators of compromise. Based on the results of this analysis, a decision is made to either automatically m… ▽ More Network security analysts gather data from diverse sources, from high-level summaries of network flow and traffic volumes to low-level details such as service logs from servers and the contents of individual packets. They validate and check this data against traffic patterns and historical indicators of compromise. Based on the results of this analysis, a decision is made to either automatically manage the traffic or report it to an analyst for further investigation. Unfortunately, due rapidly increasing traffic volumes, there are far more events to check than operational teams can handle for effective forensic analysis. However, just as packets are grouped into flows that share a commonality, we argue that a high-level construct for grou** network flows into a set a flows that share a hypothesis is needed to significantly improve the quality of operational network response by increasing Events Per Analysts Hour (EPAH). In this paper, we propose a formalism for describing a superflow construct, which we characterize as an aggregation of one or more flows based on an analyst-specific hypothesis about traffic behavior. We demonstrate simple superflow constructions and representations, and perform a case study to explain how the formalism can be used to reduce the volume of data for forensic analysis. △ Less

Submitted 2 March, 2024; originally announced March 2024.

arXiv:2312.17527 [pdf, ps, other]

Data-Driven Template-Free Invariant Generation

Authors: Yuan Xia, Jyotirmoy V. Deshmukh, Mukund Raghothaman, Srivatsan Ravi

Abstract: Automatic verification of concurrent programs faces state explosion due to the exponential possible interleavings of its sequential components coupled with large or infinite state spaces. An alternative is deductive verification, where given a candidate invariant, we establish inductive invariance and show that any state satisfying the invariant is also safe. However, learning (inductive) program… ▽ More Automatic verification of concurrent programs faces state explosion due to the exponential possible interleavings of its sequential components coupled with large or infinite state spaces. An alternative is deductive verification, where given a candidate invariant, we establish inductive invariance and show that any state satisfying the invariant is also safe. However, learning (inductive) program invariants is difficult. To this end, we propose a data-driven procedure to synthesize program invariants, where it is assumed that the program invariant is an expression that characterizes a (hopefully tight) over-approximation of the reachable program states. The main ideas of our approach are: (1) We treat a candidate invariant as a classifier separating states observed in (sampled) program traces from those speculated to be unreachable. (2) We develop an enumerative, template-free approach to learn such classifiers from positive and negative examples. At its core, our enumerative approach employs decision trees to generate expressions that do not over-fit to the observed states (and thus generalize). (3) We employ a runtime framework to monitor program executions that may refute the candidate invariant; every refutation triggers a revision of the candidate invariant. Our runtime framework can be viewed as an instance of statistical model checking, which gives us probabilistic guarantees on the candidate invariant. We also show that such in some cases, our counterexample-guided inductive synthesis approach converges (in probability) to an overapproximation of the reachable set of states. Our experimental results show that our framework excels in learning useful invariants using only a fraction of the set of reachable states for a wide variety of concurrent programs. △ Less

Submitted 29 December, 2023; originally announced December 2023.

arXiv:2312.06001 [pdf, other]

The SyGuS Language Standard Version 2.1

Authors: Saswat Padhi, Elizabeth Polgreen, Mukund Raghothaman, Andrew Reynolds, Abhishek Udupa

Abstract: The classical formulation of the program-synthesis problem is to find a program that meets a correctness specification given as a logical formula. Syntax-guided synthesis (SyGuS) is a standardized format for specifying the correctness specification with a syntactic template that constrains the space of allowed implementations. The input to SyGuS consists of a background theory, a semantic correc… ▽ More The classical formulation of the program-synthesis problem is to find a program that meets a correctness specification given as a logical formula. Syntax-guided synthesis (SyGuS) is a standardized format for specifying the correctness specification with a syntactic template that constrains the space of allowed implementations. The input to SyGuS consists of a background theory, a semantic correctness specification for the desired program given by a logical formula, and a syntactic set of candidate implementations given by a grammar. The computational problem then is to find an implementation from the set of candidate expressions that satisfies the specification in the given theory. The formulation of the problem builds on SMT-LIB. This document defines the SyGuS 2.1 standard, which is intended to be used as the standard input and output language for solvers targeting the syntax-guided synthesis problem. It borrows many concepts and language constructs from the standard format for Satisfiability Modulo Theories (SMT) solvers, the SMT-LIB 2.6 standard. △ Less

Submitted 10 December, 2023; originally announced December 2023.

Comments: 36 pages, introduced in the SYNT workshop at CAV 2021

arXiv:2102.06753 [pdf, ps, other]

Data-Driven Synthesis of Provably Sound Side Channel Analyses

Authors: **gbo Wang, Chungha Sung, Mukund Raghothaman, Chao Wang

Abstract: We propose a data-driven method for synthesizing a static analyzer to detect side-channel information leaks in cryptographic software. Compared to the conventional way of manually crafting such a static analyzer, which can be labor intensive, error prone and suboptimal, our learning-based technique is not only automated but also provably sound. Our analyzer consists of a set of type-inference rule… ▽ More We propose a data-driven method for synthesizing a static analyzer to detect side-channel information leaks in cryptographic software. Compared to the conventional way of manually crafting such a static analyzer, which can be labor intensive, error prone and suboptimal, our learning-based technique is not only automated but also provably sound. Our analyzer consists of a set of type-inference rules learned from the training data, i.e., example code snippets annotated with ground truth. Internally, we use syntax-guided synthesis (SyGuS) to generate new features and decision tree learning (DTL) to generate type-inference rules based on these features. We guarantee soundness by formally proving each learned rule via a technique called Datalog query containment checking. We have implemented our technique in the LLVM compiler and used it to detect power side channels in C programs. Our results show that, in addition to being automated and provably sound during synthesis, the learned analyzer also has the same empirical accuracy as two state-of-the-art, manually crafted analyzers while being 300X and 900X faster, respectively. △ Less

Submitted 12 February, 2021; originally announced February 2021.

arXiv:1906.00163 [pdf, other]

Synthesizing Datalog Programs Using Numerical Relaxation

Authors: Xujie Si, Mukund Raghothaman, Kihong Heo, Mayur Naik

Abstract: The problem of learning logical rules from examples arises in diverse fields, including program synthesis, logic programming, and machine learning. Existing approaches either involve solving computationally difficult combinatorial problems, or performing parameter estimation in complex statistical models. In this paper, we present Difflog, a technique to extend the logic programming language Dat… ▽ More The problem of learning logical rules from examples arises in diverse fields, including program synthesis, logic programming, and machine learning. Existing approaches either involve solving computationally difficult combinatorial problems, or performing parameter estimation in complex statistical models. In this paper, we present Difflog, a technique to extend the logic programming language Datalog to the continuous setting. By attaching real-valued weights to individual rules of a Datalog program, we naturally associate numerical values with individual conclusions of the program. Analogous to the strategy of numerical relaxation in optimization problems, we can now first determine the rule weights which cause the best agreement between the training labels and the induced values of output tuples, and subsequently recover the classical discrete-valued target program from the continuous optimum. We evaluate Difflog on a suite of 34 benchmark problems from recent literature in knowledge discovery, formal verification, and database query-by-example, and demonstrate significant improvements in learning complex programs with recursive rules, invented predicates, and relations of arbitrary arity. △ Less

Submitted 25 June, 2019; v1 submitted 1 June, 2019; originally announced June 2019.

Comments: Per editor's instructions, this is only an early preprint of the paper which will be presented at IJCAI 2019

arXiv:1807.03865 [pdf, other]

Streamable Regular Transductions

Authors: Rajeev Alur, Dana Fisman, Konstantinos Mamouras, Mukund Raghothaman, Caleb Stanford

Abstract: Motivated by real-time monitoring and data processing applications, we develop a formal theory of quantitative queries for streaming data that can be evaluated efficiently. We consider the model of unambiguous Cost Register Automata (CRAs), which are machines that combine finite-state control (for identifying regular patterns) with a finite set of data registers (for computing numerical aggregates… ▽ More Motivated by real-time monitoring and data processing applications, we develop a formal theory of quantitative queries for streaming data that can be evaluated efficiently. We consider the model of unambiguous Cost Register Automata (CRAs), which are machines that combine finite-state control (for identifying regular patterns) with a finite set of data registers (for computing numerical aggregates). The definition of CRAs is parameterized by the collection of numerical operations that can be applied to the registers. These machines give rise to the class of streamable regular transductions (SR), and to the class of streamable linear regular transductions (SLR) when the register updates are copyless, i.e. every register appears at most once the right-hand-side expressions of the updates. We give a logical characterization of the class SR (resp., SLR) using MSO-definable transformations from strings to DAGs (resp., trees) without backward edges. Additionally, we establish that the two classes SR and SLR are closed under operations that are relevant for designing query languages. Finally, we study the relationship with weighted automata (WA), and show that CRAs over a suitably chosen set of operations correspond to WA, thus establishing that WA are a special case of CRAs. △ Less

Submitted 3 November, 2019; v1 submitted 10 July, 2018; originally announced July 2018.

Comments: 53 pages

arXiv:1707.04148 [pdf, ps, other]

On Repair with Probabilistic Attribute Grammars

Authors: Manos Koukoutos, Mukund Raghothaman, Etienne Kneuss, Viktor Kuncak

Abstract: Program synthesis and repair have emerged as an exciting area of research, driven by the potential for revolutionary advances in programmer productivity. Among most promising ideas emerging for synthesis are syntax-driven search, probabilistic models of code, and the use of input-output examples. Our paper shows how to combine these techniques and use them for program repair, which is among the mo… ▽ More Program synthesis and repair have emerged as an exciting area of research, driven by the potential for revolutionary advances in programmer productivity. Among most promising ideas emerging for synthesis are syntax-driven search, probabilistic models of code, and the use of input-output examples. Our paper shows how to combine these techniques and use them for program repair, which is among the most relevant applications of synthesis to general-purpose code. Our approach combines semantic specifications, in the form of pre- and post-conditions and input-output examples with syntactic specifications in the form of term grammars and AST-level statistics extracted from code corpora. We show that synthesis in this framework can be viewed as an instance of graph search, permitting the use of well-understood families of techniques such as A*. We implement our algorithm in a framework for verification, synthesis and repair of functional programs, demonstrating that our approach can repair programs that are beyond the reach of previous tools. △ Less

Submitted 13 July, 2017; originally announced July 2017.

arXiv:1511.08497 [pdf, other]

doi 10.1145/2884781.2884808

SWIM: Synthesizing What I Mean

Authors: Mukund Raghothaman, Yi Wei, Youssef Hamadi

Abstract: Modern programming frameworks come with large libraries, with diverse applications such as for matching regular expressions, parsing XML files and sending email. Programmers often use search engines such as Google and Bing to learn about existing APIs. In this paper, we describe SWIM, a tool which suggests code snippets given API-related natural language queries such as "generate md5 hash code".… ▽ More Modern programming frameworks come with large libraries, with diverse applications such as for matching regular expressions, parsing XML files and sending email. Programmers often use search engines such as Google and Bing to learn about existing APIs. In this paper, we describe SWIM, a tool which suggests code snippets given API-related natural language queries such as "generate md5 hash code". We translate user queries into the APIs of interest using clickthrough data from the Bing search engine. Then, based on patterns learned from open-source code repositories, we synthesize idiomatic code describing the use of these APIs. We introduce \emph{structured call sequences} to capture API-usage patterns. Structured call sequences are a generalized form of method call sequences, with if-branches and while-loops to represent conditional and repeated API usage patterns, and are simple to extract and amenable to synthesis. We evaluated SWIM with 30 common C# API-related queries received by Bing. For 70% of the queries, the first suggested snippet was a relevant solution, and a relevant solution was present in the top 10 results for all benchmarked queries. The online portion of the workflow is also very responsive, at an average of 1.5 seconds per snippet. △ Less

Submitted 13 February, 2016; v1 submitted 26 November, 2015; originally announced November 2015.

Comments: Final draft of paper to be presented at ICSE 2016

arXiv:1505.04409 [pdf, other]

Automatic Completion of Distributed Protocols with Symmetry

Authors: Rajeev Alur, Mukund Raghothaman, Christos Stergiou, Stavros Tripakis, Abhishek Udupa

Abstract: A distributed protocol is typically modeled as a set of communicating processes, where each process is described as an extended state machine along with fairness assumptions, and its correctness is specified using safety and liveness requirements. Designing correct distributed protocols is a challenging task. Aimed at simplifying this task, we allow the designer to leave some of the guards and upd… ▽ More A distributed protocol is typically modeled as a set of communicating processes, where each process is described as an extended state machine along with fairness assumptions, and its correctness is specified using safety and liveness requirements. Designing correct distributed protocols is a challenging task. Aimed at simplifying this task, we allow the designer to leave some of the guards and updates to state variables in the description of extended state machines as unknown functions. The protocol completion problem then is to find interpretations for these unknown functions while guaranteeing correctness. In many distributed protocols, process behaviors are naturally symmetric, and thus, synthesized expressions are further required to obey symmetry constraints. Our counterexample-guided synthesis algorithm consists of repeatedly invoking two phases. In the first phase, candidates for unknown expressions are generated using the SMT solver Z3. This phase requires carefully orchestrating constraints to enforce the desired symmetry in read/write accesses. In the second phase, the resulting completed protocol is checked for correctness using a custom-built model checker that handles fairness assumptions, safety and liveness requirements, and exploits symmetry. When model checking fails, our tool examines a set of counterexamples to safety/liveness properties to generate constraints on unknown functions that must be satisfied by subsequent completions. For evaluation, we show that our prototype is able to automatically discover interesting missing details in distributed protocols for mutual exclusion, self stabilization, and cache coherence. △ Less

Submitted 17 May, 2015; originally announced May 2015.

Comments: Full version of paper presented at CAV 2015

arXiv:1405.5590 [pdf, other]

Language to Specify Syntax-Guided Synthesis Problems

Authors: Mukund Raghothaman, Abhishek Udupa

Abstract: We present a language to specify syntax guided synthesis (SyGuS) problems. Syntax guidance is a prominent theme in contemporary program synthesis approaches, and SyGuS was first described in [1]. This paper describes concretely the input format of a SyGuS solver. [1] Rajeev Alur, Rastislav Bodik, Garvit Juniwal, Milo M. K. Martin, Mukund Raghothaman, Sanjit A. Seshia, Rishabh Singh, Armando Sola… ▽ More We present a language to specify syntax guided synthesis (SyGuS) problems. Syntax guidance is a prominent theme in contemporary program synthesis approaches, and SyGuS was first described in [1]. This paper describes concretely the input format of a SyGuS solver. [1] Rajeev Alur, Rastislav Bodik, Garvit Juniwal, Milo M. K. Martin, Mukund Raghothaman, Sanjit A. Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. Syntax-guided synthesis. In FMCAD, pages 1--17, 2013. △ Less

Submitted 23 October, 2016; v1 submitted 21 May, 2014; originally announced May 2014.

Comments: Fixed small typo in the SyGuS grammar specification pointed out by Sergey Mechtaev

arXiv:1402.7150 [pdf, other]

Synthesizing Finite-state Protocols from Scenarios and Requirements

Authors: Rajeev Alur, Milo Martin, Mukund Raghothaman, Christos Stergiou, Stavros Tripakis, Abhishek Udupa

Abstract: Scenarios, or Message Sequence Charts, offer an intuitive way of describing the desired behaviors of a distributed protocol. In this paper we propose a new way of specifying finite-state protocols using scenarios: we show that it is possible to automatically derive a distributed implementation from a set of scenarios augmented with a set of safety and liveness requirements, provided the given scen… ▽ More Scenarios, or Message Sequence Charts, offer an intuitive way of describing the desired behaviors of a distributed protocol. In this paper we propose a new way of specifying finite-state protocols using scenarios: we show that it is possible to automatically derive a distributed implementation from a set of scenarios augmented with a set of safety and liveness requirements, provided the given scenarios adequately \emph{cover} all the states of the desired implementation. We first derive incomplete state machines from the given scenarios, and then synthesis corresponds to completing the transition relation of individual processes so that the global product meets the specified requirements. This completion problem, in general, has the same complexity, PSPACE, as the verification problem, but unlike the verification problem, is NP-complete for a constant number of processes. We present two algorithms for solving the completion problem, one based on a heuristic search in the space of possible completions and one based on OBDD-based symbolic fixpoint computation. We evaluate the proposed methodology for protocol specification and the effectiveness of the synthesis algorithms using the classical alternating-bit protocol. △ Less

Submitted 28 February, 2014; originally announced February 2014.

Comments: This is the working draft of a paper currently in submission. (February 10, 2014)

arXiv:1402.3021 [pdf, other]

Regular Combinators for String Transformations

Authors: Rajeev Alur, Adam Freilich, Mukund Raghothaman

Abstract: We focus on (partial) functions that map input strings to a monoid such as the set of integers with addition and the set of output strings with concatenation. The notion of regularity for such functions has been defined using two-way finite-state transducers, (one-way) cost register automata, and MSO-definable graph transformations. In this paper, we give an algebraic and machine-independent chara… ▽ More We focus on (partial) functions that map input strings to a monoid such as the set of integers with addition and the set of output strings with concatenation. The notion of regularity for such functions has been defined using two-way finite-state transducers, (one-way) cost register automata, and MSO-definable graph transformations. In this paper, we give an algebraic and machine-independent characterization of this class analogous to the definition of regular languages by regular expressions. When the monoid is commutative, we prove that every regular function can be constructed from constant functions using the combinators of choice, split sum, and iterated sum, that are analogs of union, concatenation, and Kleene-*, respectively, but enforce unique (or unambiguous) parsing. Our main result is for the general case of non-commutative monoids, which is of particular interest for capturing regular string-to-string transformations for document processing. We prove that the following additional combinators suffice for constructing all regular functions: (1) the left-additive versions of split sum and iterated sum, which allow transformations such as string reversal; (2) sum of functions, which allows transformations such as copying of strings; and (3) function composition, or alternatively, a new concept of chained sum, which allows output values from adjacent blocks to mix. △ Less

Submitted 12 February, 2014; originally announced February 2014.

Comments: This is the full version, with omitted proofs and constructions, of the conference paper currently in submission

arXiv:1304.7029 [pdf, other]

Decision Problems for Additive Regular Functions

Authors: Rajeev Alur, Mukund Raghothaman

Abstract: Additive Cost Register Automata (ACRA) map strings to integers using a finite set of registers that are updated using assignments of the form "x := y + c" at every step. The corresponding class of additive regular functions has multiple equivalent characterizations, appealing closure properties, and a decidable equivalence problem. In this paper, we solve two decision problems for this model. Firs… ▽ More Additive Cost Register Automata (ACRA) map strings to integers using a finite set of registers that are updated using assignments of the form "x := y + c" at every step. The corresponding class of additive regular functions has multiple equivalent characterizations, appealing closure properties, and a decidable equivalence problem. In this paper, we solve two decision problems for this model. First, we define the register complexity of an additive regular function to be the minimum number of registers that an ACRA needs to compute it. We characterize the register complexity by a necessary and sufficient condition regarding the largest subset of registers whose values can be made far apart from one another. We then use this condition to design a PSPACE algorithm to compute the register complexity of a given ACRA, and establish a matching lower bound. Our results also lead to a machine-independent characterization of the register complexity of additive regular functions. Second, we consider two-player games over ACRAs, where the objective of one of the players is to reach a target set while minimizing the cost. We show the corresponding decision problem to be EXPTIME-complete when costs are non-negative integers, but undecidable when costs are integers. △ Less

Submitted 25 April, 2013; originally announced April 2013.

Comments: Conference version to appear in ICALP 2013

arXiv:1111.0670 [pdf, ps, other]

Regular Functions, Cost Register Automata, and Generalized Min-Cost Problems

Authors: Rajeev Alur, Loris D'Antoni, Jyotirmoy V. Deshmukh, Mukund Raghothaman, Yifei Yuan

Abstract: Motivated by the successful application of the theory of regular languages to formal verification of finite-state systems, there is a renewed interest in develo** a theory of analyzable functions from strings to numerical values that can provide a foundation for analyzing {\em quantitative} properties of finite-state systems. In this paper, we propose a deterministic model for associating costs… ▽ More Motivated by the successful application of the theory of regular languages to formal verification of finite-state systems, there is a renewed interest in develo** a theory of analyzable functions from strings to numerical values that can provide a foundation for analyzing {\em quantitative} properties of finite-state systems. In this paper, we propose a deterministic model for associating costs with strings that is parameterized by operations of interest (such as addition, scaling, and $\min$), a notion of {\em regularity} that provides a yardstick to measure expressiveness, and study decision problems and theoretical properties of resulting classes of cost functions. Our definition of regularity relies on the theory of string-to-tree transducers, and allows associating costs with events that are conditional upon regular properties of future events. Our model of {\em cost register automata} allows computation of regular functions using multiple "write-only" registers whose values can be combined using the allowed set of operations. We show that classical shortest-path algorithms as well as algorithms designed for computing {\em discounted costs}, can be adopted for solving the min-cost problems for the more general classes of functions specified in our model. Cost register automata with $\min$ and increment give a deterministic model that is equivalent to {\em weighted automata}, an extensively studied nondeterministic model, and this connection results in new insights and new open problems. △ Less

Submitted 21 February, 2012; v1 submitted 2 November, 2011; originally announced November 2011.

Comments: ICALP12 submission, technical report/extended version. 33 pages+title page

Showing 1–15 of 15 results for author: Raghothaman, M