-
Generative Explanations for Program Synthesizers
Authors:
Amirmohammad Nazari,
Souti Chattopadhyay,
Swabha Swayamdipta,
Mukund Raghothaman
Abstract:
Despite great advances in program synthesis techniques, they remain algorithmic black boxes. Although they guarantee that when synthesis is successful, the implementation satisfies the specification, they provide no additional information regarding how the implementation works or the manner in which the specification is realized. One possibility to answer these questions is to use large language m…
▽ More
Despite great advances in program synthesis techniques, they remain algorithmic black boxes. Although they guarantee that when synthesis is successful, the implementation satisfies the specification, they provide no additional information regarding how the implementation works or the manner in which the specification is realized. One possibility to answer these questions is to use large language models (LLMs) to construct human-readable explanations. Unfortunately, experiments reveal that LLMs frequently produce nonsensical or misleading explanations when applied to the unidiomatic code produced by program synthesizers.
In this paper, we develop an approach to reliably augment the implementation with explanatory names. We recover fine-grained input-output data from the synthesis algorithm to enhance the prompt supplied to the LLM, and use a combination of a program verifier and a second language model to validate the proposed explanations before presenting them to the user. Together, these techniques massively improve the accuracy of the proposed names, from 24% to 79% respectively. Through a pair of small user studies, we find that users significantly prefer the explanations produced by our technique (76% of responses indicating the appropriateness of the presenting names) to the baseline (with only 2% of responses approving of the suggestions), and that the proposed names measurably help users in understanding the synthesized implementation.
△ Less
Submitted 5 March, 2024;
originally announced March 2024.
-
Superflows: A New Tool for Forensic Network Flow Analysis
Authors:
Michael Collins,
Jyotirmoy V. Deshmukh,
Dristi Dinesh,
Mukund Raghothaman,
Srivatsan Ravi,
Yuan Xia
Abstract:
Network security analysts gather data from diverse sources, from high-level summaries of network flow and traffic volumes to low-level details such as service logs from servers and the contents of individual packets. They validate and check this data against traffic patterns and historical indicators of compromise. Based on the results of this analysis, a decision is made to either automatically m…
▽ More
Network security analysts gather data from diverse sources, from high-level summaries of network flow and traffic volumes to low-level details such as service logs from servers and the contents of individual packets. They validate and check this data against traffic patterns and historical indicators of compromise. Based on the results of this analysis, a decision is made to either automatically manage the traffic or report it to an analyst for further investigation. Unfortunately, due rapidly increasing traffic volumes, there are far more events to check than operational teams can handle for effective forensic analysis. However, just as packets are grouped into flows that share a commonality, we argue that a high-level construct for grou** network flows into a set a flows that share a hypothesis is needed to significantly improve the quality of operational network response by increasing Events Per Analysts Hour (EPAH).
In this paper, we propose a formalism for describing a superflow construct, which we characterize as an aggregation of one or more flows based on an analyst-specific hypothesis about traffic behavior. We demonstrate simple superflow constructions and representations, and perform a case study to explain how the formalism can be used to reduce the volume of data for forensic analysis.
△ Less
Submitted 2 March, 2024;
originally announced March 2024.
-
Data-Driven Template-Free Invariant Generation
Authors:
Yuan Xia,
Jyotirmoy V. Deshmukh,
Mukund Raghothaman,
Srivatsan Ravi
Abstract:
Automatic verification of concurrent programs faces state explosion due to the exponential possible interleavings of its sequential components coupled with large or infinite state spaces. An alternative is deductive verification, where given a candidate invariant, we establish inductive invariance and show that any state satisfying the invariant is also safe. However, learning (inductive) program…
▽ More
Automatic verification of concurrent programs faces state explosion due to the exponential possible interleavings of its sequential components coupled with large or infinite state spaces. An alternative is deductive verification, where given a candidate invariant, we establish inductive invariance and show that any state satisfying the invariant is also safe. However, learning (inductive) program invariants is difficult. To this end, we propose a data-driven procedure to synthesize program invariants, where it is assumed that the program invariant is an expression that characterizes a (hopefully tight) over-approximation of the reachable program states. The main ideas of our approach are: (1) We treat a candidate invariant as a classifier separating states observed in (sampled) program traces from those speculated to be unreachable. (2) We develop an enumerative, template-free approach to learn such classifiers from positive and negative examples. At its core, our enumerative approach employs decision trees to generate expressions that do not over-fit to the observed states (and thus generalize). (3) We employ a runtime framework to monitor program executions that may refute the candidate invariant; every refutation triggers a revision of the candidate invariant. Our runtime framework can be viewed as an instance of statistical model checking, which gives us probabilistic guarantees on the candidate invariant. We also show that such in some cases, our counterexample-guided inductive synthesis approach converges (in probability) to an overapproximation of the reachable set of states. Our experimental results show that our framework excels in learning useful invariants using only a fraction of the set of reachable states for a wide variety of concurrent programs.
△ Less
Submitted 29 December, 2023;
originally announced December 2023.
-
The SyGuS Language Standard Version 2.1
Authors:
Saswat Padhi,
Elizabeth Polgreen,
Mukund Raghothaman,
Andrew Reynolds,
Abhishek Udupa
Abstract:
The classical formulation of the program-synthesis problem is to find a program that meets a correctness specification given as a logical formula. Syntax-guided synthesis (SyGuS) is a standardized format for specifying the correctness specification with a syntactic template that constrains the space of allowed implementations.
The input to SyGuS consists of a background theory, a semantic correc…
▽ More
The classical formulation of the program-synthesis problem is to find a program that meets a correctness specification given as a logical formula. Syntax-guided synthesis (SyGuS) is a standardized format for specifying the correctness specification with a syntactic template that constrains the space of allowed implementations.
The input to SyGuS consists of a background theory, a semantic correctness specification for the desired program given by a logical formula, and a syntactic set of candidate implementations given by a grammar. The computational problem then is to find an implementation from the set of candidate expressions that satisfies the specification in the given theory. The formulation of the problem builds on SMT-LIB.
This document defines the SyGuS 2.1 standard, which is intended to be used as the standard input and output language for solvers targeting the syntax-guided synthesis problem. It borrows many concepts and language constructs from the standard format for Satisfiability Modulo Theories (SMT) solvers, the SMT-LIB 2.6 standard.
△ Less
Submitted 10 December, 2023;
originally announced December 2023.
-
Data-Driven Synthesis of Provably Sound Side Channel Analyses
Authors:
**gbo Wang,
Chungha Sung,
Mukund Raghothaman,
Chao Wang
Abstract:
We propose a data-driven method for synthesizing a static analyzer to detect side-channel information leaks in cryptographic software. Compared to the conventional way of manually crafting such a static analyzer, which can be labor intensive, error prone and suboptimal, our learning-based technique is not only automated but also provably sound. Our analyzer consists of a set of type-inference rule…
▽ More
We propose a data-driven method for synthesizing a static analyzer to detect side-channel information leaks in cryptographic software. Compared to the conventional way of manually crafting such a static analyzer, which can be labor intensive, error prone and suboptimal, our learning-based technique is not only automated but also provably sound. Our analyzer consists of a set of type-inference rules learned from the training data, i.e., example code snippets annotated with ground truth. Internally, we use syntax-guided synthesis (SyGuS) to generate new features and decision tree learning (DTL) to generate type-inference rules based on these features. We guarantee soundness by formally proving each learned rule via a technique called Datalog query containment checking. We have implemented our technique in the LLVM compiler and used it to detect power side channels in C programs. Our results show that, in addition to being automated and provably sound during synthesis, the learned analyzer also has the same empirical accuracy as two state-of-the-art, manually crafted analyzers while being 300X and 900X faster, respectively.
△ Less
Submitted 12 February, 2021;
originally announced February 2021.
-
Synthesizing Datalog Programs Using Numerical Relaxation
Authors:
Xujie Si,
Mukund Raghothaman,
Kihong Heo,
Mayur Naik
Abstract:
The problem of learning logical rules from examples arises in diverse fields, including program synthesis, logic programming, and machine learning. Existing approaches either involve solving computationally difficult combinatorial problems, or performing parameter estimation in complex statistical models.
In this paper, we present Difflog, a technique to extend the logic programming language Dat…
▽ More
The problem of learning logical rules from examples arises in diverse fields, including program synthesis, logic programming, and machine learning. Existing approaches either involve solving computationally difficult combinatorial problems, or performing parameter estimation in complex statistical models.
In this paper, we present Difflog, a technique to extend the logic programming language Datalog to the continuous setting. By attaching real-valued weights to individual rules of a Datalog program, we naturally associate numerical values with individual conclusions of the program. Analogous to the strategy of numerical relaxation in optimization problems, we can now first determine the rule weights which cause the best agreement between the training labels and the induced values of output tuples, and subsequently recover the classical discrete-valued target program from the continuous optimum.
We evaluate Difflog on a suite of 34 benchmark problems from recent literature in knowledge discovery, formal verification, and database query-by-example, and demonstrate significant improvements in learning complex programs with recursive rules, invented predicates, and relations of arbitrary arity.
△ Less
Submitted 25 June, 2019; v1 submitted 1 June, 2019;
originally announced June 2019.
-
Streamable Regular Transductions
Authors:
Rajeev Alur,
Dana Fisman,
Konstantinos Mamouras,
Mukund Raghothaman,
Caleb Stanford
Abstract:
Motivated by real-time monitoring and data processing applications, we develop a formal theory of quantitative queries for streaming data that can be evaluated efficiently. We consider the model of unambiguous Cost Register Automata (CRAs), which are machines that combine finite-state control (for identifying regular patterns) with a finite set of data registers (for computing numerical aggregates…
▽ More
Motivated by real-time monitoring and data processing applications, we develop a formal theory of quantitative queries for streaming data that can be evaluated efficiently. We consider the model of unambiguous Cost Register Automata (CRAs), which are machines that combine finite-state control (for identifying regular patterns) with a finite set of data registers (for computing numerical aggregates). The definition of CRAs is parameterized by the collection of numerical operations that can be applied to the registers. These machines give rise to the class of streamable regular transductions (SR), and to the class of streamable linear regular transductions (SLR) when the register updates are copyless, i.e. every register appears at most once the right-hand-side expressions of the updates. We give a logical characterization of the class SR (resp., SLR) using MSO-definable transformations from strings to DAGs (resp., trees) without backward edges. Additionally, we establish that the two classes SR and SLR are closed under operations that are relevant for designing query languages. Finally, we study the relationship with weighted automata (WA), and show that CRAs over a suitably chosen set of operations correspond to WA, thus establishing that WA are a special case of CRAs.
△ Less
Submitted 3 November, 2019; v1 submitted 10 July, 2018;
originally announced July 2018.
-
On Repair with Probabilistic Attribute Grammars
Authors:
Manos Koukoutos,
Mukund Raghothaman,
Etienne Kneuss,
Viktor Kuncak
Abstract:
Program synthesis and repair have emerged as an exciting area of research, driven by the potential for revolutionary advances in programmer productivity. Among most promising ideas emerging for synthesis are syntax-driven search, probabilistic models of code, and the use of input-output examples. Our paper shows how to combine these techniques and use them for program repair, which is among the mo…
▽ More
Program synthesis and repair have emerged as an exciting area of research, driven by the potential for revolutionary advances in programmer productivity. Among most promising ideas emerging for synthesis are syntax-driven search, probabilistic models of code, and the use of input-output examples. Our paper shows how to combine these techniques and use them for program repair, which is among the most relevant applications of synthesis to general-purpose code. Our approach combines semantic specifications, in the form of pre- and post-conditions and input-output examples with syntactic specifications in the form of term grammars and AST-level statistics extracted from code corpora. We show that synthesis in this framework can be viewed as an instance of graph search, permitting the use of well-understood families of techniques such as A*. We implement our algorithm in a framework for verification, synthesis and repair of functional programs, demonstrating that our approach can repair programs that are beyond the reach of previous tools.
△ Less
Submitted 13 July, 2017;
originally announced July 2017.
-
SWIM: Synthesizing What I Mean
Authors:
Mukund Raghothaman,
Yi Wei,
Youssef Hamadi
Abstract:
Modern programming frameworks come with large libraries, with diverse applications such as for matching regular expressions, parsing XML files and sending email. Programmers often use search engines such as Google and Bing to learn about existing APIs. In this paper, we describe SWIM, a tool which suggests code snippets given API-related natural language queries such as "generate md5 hash code".…
▽ More
Modern programming frameworks come with large libraries, with diverse applications such as for matching regular expressions, parsing XML files and sending email. Programmers often use search engines such as Google and Bing to learn about existing APIs. In this paper, we describe SWIM, a tool which suggests code snippets given API-related natural language queries such as "generate md5 hash code".
We translate user queries into the APIs of interest using clickthrough data from the Bing search engine. Then, based on patterns learned from open-source code repositories, we synthesize idiomatic code describing the use of these APIs. We introduce \emph{structured call sequences} to capture API-usage patterns. Structured call sequences are a generalized form of method call sequences, with if-branches and while-loops to represent conditional and repeated API usage patterns, and are simple to extract and amenable to synthesis.
We evaluated SWIM with 30 common C# API-related queries received by Bing. For 70% of the queries, the first suggested snippet was a relevant solution, and a relevant solution was present in the top 10 results for all benchmarked queries. The online portion of the workflow is also very responsive, at an average of 1.5 seconds per snippet.
△ Less
Submitted 13 February, 2016; v1 submitted 26 November, 2015;
originally announced November 2015.
-
Automatic Completion of Distributed Protocols with Symmetry
Authors:
Rajeev Alur,
Mukund Raghothaman,
Christos Stergiou,
Stavros Tripakis,
Abhishek Udupa
Abstract:
A distributed protocol is typically modeled as a set of communicating processes, where each process is described as an extended state machine along with fairness assumptions, and its correctness is specified using safety and liveness requirements. Designing correct distributed protocols is a challenging task. Aimed at simplifying this task, we allow the designer to leave some of the guards and upd…
▽ More
A distributed protocol is typically modeled as a set of communicating processes, where each process is described as an extended state machine along with fairness assumptions, and its correctness is specified using safety and liveness requirements. Designing correct distributed protocols is a challenging task. Aimed at simplifying this task, we allow the designer to leave some of the guards and updates to state variables in the description of extended state machines as unknown functions. The protocol completion problem then is to find interpretations for these unknown functions while guaranteeing correctness. In many distributed protocols, process behaviors are naturally symmetric, and thus, synthesized expressions are further required to obey symmetry constraints. Our counterexample-guided synthesis algorithm consists of repeatedly invoking two phases. In the first phase, candidates for unknown expressions are generated using the SMT solver Z3. This phase requires carefully orchestrating constraints to enforce the desired symmetry in read/write accesses. In the second phase, the resulting completed protocol is checked for correctness using a custom-built model checker that handles fairness assumptions, safety and liveness requirements, and exploits symmetry. When model checking fails, our tool examines a set of counterexamples to safety/liveness properties to generate constraints on unknown functions that must be satisfied by subsequent completions. For evaluation, we show that our prototype is able to automatically discover interesting missing details in distributed protocols for mutual exclusion, self stabilization, and cache coherence.
△ Less
Submitted 17 May, 2015;
originally announced May 2015.
-
Language to Specify Syntax-Guided Synthesis Problems
Authors:
Mukund Raghothaman,
Abhishek Udupa
Abstract:
We present a language to specify syntax guided synthesis (SyGuS) problems. Syntax guidance is a prominent theme in contemporary program synthesis approaches, and SyGuS was first described in [1]. This paper describes concretely the input format of a SyGuS solver.
[1] Rajeev Alur, Rastislav Bodik, Garvit Juniwal, Milo M. K. Martin, Mukund Raghothaman, Sanjit A. Seshia, Rishabh Singh, Armando Sola…
▽ More
We present a language to specify syntax guided synthesis (SyGuS) problems. Syntax guidance is a prominent theme in contemporary program synthesis approaches, and SyGuS was first described in [1]. This paper describes concretely the input format of a SyGuS solver.
[1] Rajeev Alur, Rastislav Bodik, Garvit Juniwal, Milo M. K. Martin, Mukund Raghothaman, Sanjit A. Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. Syntax-guided synthesis. In FMCAD, pages 1--17, 2013.
△ Less
Submitted 23 October, 2016; v1 submitted 21 May, 2014;
originally announced May 2014.
-
Synthesizing Finite-state Protocols from Scenarios and Requirements
Authors:
Rajeev Alur,
Milo Martin,
Mukund Raghothaman,
Christos Stergiou,
Stavros Tripakis,
Abhishek Udupa
Abstract:
Scenarios, or Message Sequence Charts, offer an intuitive way of describing the desired behaviors of a distributed protocol. In this paper we propose a new way of specifying finite-state protocols using scenarios: we show that it is possible to automatically derive a distributed implementation from a set of scenarios augmented with a set of safety and liveness requirements, provided the given scen…
▽ More
Scenarios, or Message Sequence Charts, offer an intuitive way of describing the desired behaviors of a distributed protocol. In this paper we propose a new way of specifying finite-state protocols using scenarios: we show that it is possible to automatically derive a distributed implementation from a set of scenarios augmented with a set of safety and liveness requirements, provided the given scenarios adequately \emph{cover} all the states of the desired implementation. We first derive incomplete state machines from the given scenarios, and then synthesis corresponds to completing the transition relation of individual processes so that the global product meets the specified requirements. This completion problem, in general, has the same complexity, PSPACE, as the verification problem, but unlike the verification problem, is NP-complete for a constant number of processes. We present two algorithms for solving the completion problem, one based on a heuristic search in the space of possible completions and one based on OBDD-based symbolic fixpoint computation. We evaluate the proposed methodology for protocol specification and the effectiveness of the synthesis algorithms using the classical alternating-bit protocol.
△ Less
Submitted 28 February, 2014;
originally announced February 2014.
-
Regular Combinators for String Transformations
Authors:
Rajeev Alur,
Adam Freilich,
Mukund Raghothaman
Abstract:
We focus on (partial) functions that map input strings to a monoid such as the set of integers with addition and the set of output strings with concatenation. The notion of regularity for such functions has been defined using two-way finite-state transducers, (one-way) cost register automata, and MSO-definable graph transformations. In this paper, we give an algebraic and machine-independent chara…
▽ More
We focus on (partial) functions that map input strings to a monoid such as the set of integers with addition and the set of output strings with concatenation. The notion of regularity for such functions has been defined using two-way finite-state transducers, (one-way) cost register automata, and MSO-definable graph transformations. In this paper, we give an algebraic and machine-independent characterization of this class analogous to the definition of regular languages by regular expressions. When the monoid is commutative, we prove that every regular function can be constructed from constant functions using the combinators of choice, split sum, and iterated sum, that are analogs of union, concatenation, and Kleene-*, respectively, but enforce unique (or unambiguous) parsing. Our main result is for the general case of non-commutative monoids, which is of particular interest for capturing regular string-to-string transformations for document processing. We prove that the following additional combinators suffice for constructing all regular functions: (1) the left-additive versions of split sum and iterated sum, which allow transformations such as string reversal; (2) sum of functions, which allows transformations such as copying of strings; and (3) function composition, or alternatively, a new concept of chained sum, which allows output values from adjacent blocks to mix.
△ Less
Submitted 12 February, 2014;
originally announced February 2014.
-
Decision Problems for Additive Regular Functions
Authors:
Rajeev Alur,
Mukund Raghothaman
Abstract:
Additive Cost Register Automata (ACRA) map strings to integers using a finite set of registers that are updated using assignments of the form "x := y + c" at every step. The corresponding class of additive regular functions has multiple equivalent characterizations, appealing closure properties, and a decidable equivalence problem. In this paper, we solve two decision problems for this model. Firs…
▽ More
Additive Cost Register Automata (ACRA) map strings to integers using a finite set of registers that are updated using assignments of the form "x := y + c" at every step. The corresponding class of additive regular functions has multiple equivalent characterizations, appealing closure properties, and a decidable equivalence problem. In this paper, we solve two decision problems for this model. First, we define the register complexity of an additive regular function to be the minimum number of registers that an ACRA needs to compute it. We characterize the register complexity by a necessary and sufficient condition regarding the largest subset of registers whose values can be made far apart from one another. We then use this condition to design a PSPACE algorithm to compute the register complexity of a given ACRA, and establish a matching lower bound. Our results also lead to a machine-independent characterization of the register complexity of additive regular functions. Second, we consider two-player games over ACRAs, where the objective of one of the players is to reach a target set while minimizing the cost. We show the corresponding decision problem to be EXPTIME-complete when costs are non-negative integers, but undecidable when costs are integers.
△ Less
Submitted 25 April, 2013;
originally announced April 2013.
-
Regular Functions, Cost Register Automata, and Generalized Min-Cost Problems
Authors:
Rajeev Alur,
Loris D'Antoni,
Jyotirmoy V. Deshmukh,
Mukund Raghothaman,
Yifei Yuan
Abstract:
Motivated by the successful application of the theory of regular languages to formal verification of finite-state systems, there is a renewed interest in develo** a theory of analyzable functions from strings to numerical values that can provide a foundation for analyzing {\em quantitative} properties of finite-state systems. In this paper, we propose a deterministic model for associating costs…
▽ More
Motivated by the successful application of the theory of regular languages to formal verification of finite-state systems, there is a renewed interest in develo** a theory of analyzable functions from strings to numerical values that can provide a foundation for analyzing {\em quantitative} properties of finite-state systems. In this paper, we propose a deterministic model for associating costs with strings that is parameterized by operations of interest (such as addition, scaling, and $\min$), a notion of {\em regularity} that provides a yardstick to measure expressiveness, and study decision problems and theoretical properties of resulting classes of cost functions. Our definition of regularity relies on the theory of string-to-tree transducers, and allows associating costs with events that are conditional upon regular properties of future events. Our model of {\em cost register automata} allows computation of regular functions using multiple "write-only" registers whose values can be combined using the allowed set of operations. We show that classical shortest-path algorithms as well as algorithms designed for computing {\em discounted costs}, can be adopted for solving the min-cost problems for the more general classes of functions specified in our model. Cost register automata with $\min$ and increment give a deterministic model that is equivalent to {\em weighted automata}, an extensively studied nondeterministic model, and this connection results in new insights and new open problems.
△ Less
Submitted 21 February, 2012; v1 submitted 2 November, 2011;
originally announced November 2011.