-
Ansible Lightspeed: A Code Generation Service for IT Automation
Authors:
Priyam Sahoo,
Saurabh Pujar,
Ganesh Nalawade,
Richard Gebhardt,
Louis Mandel,
Luca Buratti
Abstract:
The availability of Large Language Models (LLMs) which can generate code, has made it possible to create tools that improve developer productivity. Integrated development environments or IDEs which developers use to write software are often used as an interface to interact with LLMs. Although many such tools have been released, almost all of them focus on general-purpose programming languages. Dom…
▽ More
The availability of Large Language Models (LLMs) which can generate code, has made it possible to create tools that improve developer productivity. Integrated development environments or IDEs which developers use to write software are often used as an interface to interact with LLMs. Although many such tools have been released, almost all of them focus on general-purpose programming languages. Domain-specific languages, such as those crucial for IT automation, have not received much attention. Ansible is one such YAML-based IT automation-specific language. Red Hat Ansible Lightspeed with IBM Watson Code Assistant, further referred to as Ansible Lightspeed, is an LLM-based service designed explicitly for natural language to Ansible code generation.
In this paper, we describe the design and implementation of the Ansible Lightspeed service and analyze feedback from thousands of real users. We examine diverse performance indicators, classified according to both immediate and extended utilization patterns along with user sentiments. The analysis shows that the user acceptance rate of Ansible Lightspeed suggestions is higher than comparable tools that are more general and not specific to a programming language. This remains true even after we use much more stringent criteria for what is considered an accepted model suggestion, discarding suggestions which were heavily edited after being accepted. The relatively high acceptance rate results in higher-than-expected user retention and generally positive user feedback. This paper provides insights on how a comparatively small, dedicated model performs on a domain-specific language and more importantly, how it is received by users.
△ Less
Submitted 27 February, 2024;
originally announced February 2024.
-
Learning Transfers over Several Programming Languages
Authors:
Razan Baltaji,
Saurabh Pujar,
Louis Mandel,
Martin Hirzel,
Luca Buratti,
Lav Varshney
Abstract:
Large language models (LLMs) have become remarkably good at improving developer productivity for high-resource programming languages. These models use two kinds of data: large amounts of unlabeled code samples for pre-training and relatively smaller amounts of labeled code samples for fine-tuning or in-context learning. Unfortunately, many programming languages are low-resource, lacking labeled sa…
▽ More
Large language models (LLMs) have become remarkably good at improving developer productivity for high-resource programming languages. These models use two kinds of data: large amounts of unlabeled code samples for pre-training and relatively smaller amounts of labeled code samples for fine-tuning or in-context learning. Unfortunately, many programming languages are low-resource, lacking labeled samples for most tasks and often even lacking unlabeled samples. Therefore, users of low-resource languages (e.g., legacy or new languages) miss out on the benefits of LLMs. Cross-lingual transfer uses data from a source language to improve model performance on a target language. It has been well-studied for natural languages, but has received little attention for programming languages. This paper reports extensive experiments on four tasks using a transformer-based LLM and 11 to 41 programming languages to explore the following questions. First, how well does cross-lingual transfer work for a given task across different language pairs. Second, given a task and target language, how should one choose a source language. Third, which characteristics of a language pair are predictive of transfer performance, and how does that depend on the given task. Our empirical study with 1,808 experiments reveals practical and scientific insights, such as Kotlin and JavaScript being the most transferable source languages and different tasks relying on substantially different features. Overall, we find that learning transfers well across several programming languages.
△ Less
Submitted 25 March, 2024; v1 submitted 25 October, 2023;
originally announced October 2023.
-
Density-Based Semantics for Reactive Probabilistic Programming
Authors:
Guillaume Baudart,
Louis Mandel,
Christine Tasson
Abstract:
Synchronous languages are now a standard industry tool for critical embedded systems. Designers write high-level specifications by composing streams of values using block diagrams. These languages have been extended with Bayesian reasoning to program state-space models which compute a stream of distributions given a stream of observations. However, the semantics of probabilistic models is only def…
▽ More
Synchronous languages are now a standard industry tool for critical embedded systems. Designers write high-level specifications by composing streams of values using block diagrams. These languages have been extended with Bayesian reasoning to program state-space models which compute a stream of distributions given a stream of observations. However, the semantics of probabilistic models is only defined for scheduled equations -- a significant limitation compared to dataflow synchronous languages and block diagrams which do not require any ordering.
In this paper we propose two schedule agnostic semantics for a probabilistic synchronous language. The key idea is to interpret probabilistic expressions as a stream of un-normalized density functions which maps random variable values to a result and positive score. The co-iterative semantics interprets programs as state machines and equations are computed using a fixpoint operator. The relational semantics directly manipulates streams and is thus a better fit to reason about program equivalence. We use the relational semantics to prove the correctness of a program transformation required to run an optimized inference algorithm for state-space models with constant parameters.
△ Less
Submitted 7 September, 2023; v1 submitted 3 August, 2023;
originally announced August 2023.
-
Verifying Performance Properties of Probabilistic Inference
Authors:
Eric Atkinson,
Ellie Y. Cheng,
Guillaume Baudart,
Louis Mandel,
Michael Carbin
Abstract:
In this extended abstract, we discuss the opportunity to formally verify that inference systems for probabilistic programming guarantee good performance. In particular, we focus on hybrid inference systems that combine exact and approximate inference to try to exploit the advantages of each. Their performance depends critically on a) the division between exact and approximate inference, and b) the…
▽ More
In this extended abstract, we discuss the opportunity to formally verify that inference systems for probabilistic programming guarantee good performance. In particular, we focus on hybrid inference systems that combine exact and approximate inference to try to exploit the advantages of each. Their performance depends critically on a) the division between exact and approximate inference, and b) the computational resources consumed by exact inference.
We describe several projects in this direction. Semi-symbolic Inference (SSI) is a type of hybrid inference system that provides limited guarantees by construction on the exact/approximate division. In addition to these limited guarantees, we also describe ongoing work to extend guarantees to a more complex class of programs, requiring a program analysis to ensure the guarantees. Finally, we also describe work on verifying that inference systems using delayed sampling -- another type of hybrid inference -- execute in bounded memory. Together, these projects show that verification can deliver the performance guarantees that probabilistic programming languages need.
△ Less
Submitted 14 July, 2023;
originally announced July 2023.
-
Semi-Symbolic Inference for Efficient Streaming Probabilistic Programming
Authors:
Eric Atkinson,
Charles Yuan,
Guillaume Baudart,
Louis Mandel,
Michael Carbin
Abstract:
Efficient inference is often possible in a streaming context using Rao-Blackwellized particle filters (RBPFs), which exactly solve inference problems when possible and fall back on sampling approximations when necessary. While RBPFs can be implemented by hand to provide efficient inference, the goal of streaming probabilistic programming is to automatically generate such efficient inference implem…
▽ More
Efficient inference is often possible in a streaming context using Rao-Blackwellized particle filters (RBPFs), which exactly solve inference problems when possible and fall back on sampling approximations when necessary. While RBPFs can be implemented by hand to provide efficient inference, the goal of streaming probabilistic programming is to automatically generate such efficient inference implementations given input probabilistic programs.
In this work, we propose semi-symbolic inference, a technique for executing probabilistic programs using a runtime inference system that automatically implements Rao-Blackwellized particle filtering. To perform exact and approximate inference together, the semi-symbolic inference system manipulates symbolic distributions to perform exact inference when possible and falls back on approximate sampling when necessary. This approach enables the system to implement the same RBPF a developer would write by hand. To ensure this, we identify closed families of distributions -- such as linear-Gaussian and finite discrete models -- on which the inference system guarantees exact inference. We have implemented the runtime inference system in the ProbZelus streaming probabilistic programming language. Despite an average $1.6\times$ slowdown compared to the state of the art on existing benchmarks, our evaluation shows that speedups of $3\times$-$87\times$ are obtainable on a new set of challenging benchmarks we have designed to exploit closed families.
△ Less
Submitted 5 November, 2022; v1 submitted 15 September, 2022;
originally announced September 2022.
-
Translating Canonical SQL to Imperative Code in Coq
Authors:
Véronique Benzaken,
Évelyne Contejean,
Mohammed Houssem Hachmaoui,
Chantal Keller,
Louis Mandel,
Avraham Shinnar,
Jérôme Siméon
Abstract:
SQL is by far the most widely used and implemented query language. Yet, on some key features, such as correlated queries and NULL value semantics, many implementations diverge or contain bugs. We leverage recent advances in the formalization of SQL and query compilers to develop DBCert, the first mechanically verified compiler from SQL queries written in a canonical form to imperative code. Buildi…
▽ More
SQL is by far the most widely used and implemented query language. Yet, on some key features, such as correlated queries and NULL value semantics, many implementations diverge or contain bugs. We leverage recent advances in the formalization of SQL and query compilers to develop DBCert, the first mechanically verified compiler from SQL queries written in a canonical form to imperative code. Building DBCert required several new contributions which are described in this paper. First, we specify and mechanize a complete translation from SQL to the Nested Relational Algebra which can be used for query optimization. Second, we define Imp, a small imperative language sufficient to express SQL and which can target several execution languages including JavaScript. Finally, we develop a mechanized translation from the nested relational algebra to Imp, using the nested relational calculus as an intermediate step.
△ Less
Submitted 22 March, 2022; v1 submitted 16 March, 2022;
originally announced March 2022.
-
Automatic Guide Generation for Stan via NumPyro
Authors:
Guillaume Baudart,
Louis Mandel
Abstract:
Stan is a very popular probabilistic language with a state-of-the-art HMC sampler but it only offers a limited choice of algorithms for black-box variational inference. In this paper, we show that using our recently proposed compiler from Stan to Pyro, Stan users can easily try the set of algorithms implemented in Pyro for black-box variational inference. We evaluate our approach on PosteriorDB, a…
▽ More
Stan is a very popular probabilistic language with a state-of-the-art HMC sampler but it only offers a limited choice of algorithms for black-box variational inference. In this paper, we show that using our recently proposed compiler from Stan to Pyro, Stan users can easily try the set of algorithms implemented in Pyro for black-box variational inference. We evaluate our approach on PosteriorDB, a database of Stan models with corresponding data and reference posterior samples. Results show that the eight algorithms available in Pyro offer a range of possible compromises between complexity and accuracy. This paper illustrates that compiling Stan to another probabilistic language can be used to leverage new features for Stan users, and give access to a large set of examples for language developers who implement these new features.
△ Less
Submitted 22 October, 2021;
originally announced October 2021.
-
Statically Bounded-Memory Delayed Sampling for Probabilistic Streams
Authors:
Eric Atkinson,
Guillaume Baudart,
Louis Mandel,
Charles Yuan,
Michael Carbin
Abstract:
Probabilistic programming languages aid developers performing Bayesian inference. These languages provide programming constructs and tools for probabilistic modeling and automated inference. Prior work introduced a probabilistic programming language, ProbZelus, to extend probabilistic programming functionality to unbounded streams of data. This work demonstrated that the delayed sampling inference…
▽ More
Probabilistic programming languages aid developers performing Bayesian inference. These languages provide programming constructs and tools for probabilistic modeling and automated inference. Prior work introduced a probabilistic programming language, ProbZelus, to extend probabilistic programming functionality to unbounded streams of data. This work demonstrated that the delayed sampling inference algorithm could be extended to work in a streaming context. ProbZelus showed that while delayed sampling could be effectively deployed on some programs, depending on the probabilistic model under consideration, delayed sampling is not guaranteed to use a bounded amount of memory over the course of the execution of the program.
In this paper, we present conditions on a probabilistic program's execution under which delayed sampling will execute in bounded memory. The two conditions are dataflow properties of the core operations of delayed sampling: the $m$-consumed property and the unseparated paths property. A program executes in bounded memory under delayed sampling if, and only if, it satisfies the $m$-consumed and unseparated paths properties. We propose a static analysis that abstracts over these properties to soundly ensure that any program that passes the analysis satisfies these properties, and thus executes in bounded memory under delayed sampling.
△ Less
Submitted 13 December, 2021; v1 submitted 25 September, 2021;
originally announced September 2021.
-
Learning GraphQL Query Costs (Extended Version)
Authors:
Georgios Mavroudeas,
Guillaume Baudart,
Alan Cha,
Martin Hirzel,
Jim A. Laredo,
Malik Magdon-Ismail,
Louis Mandel,
Erik Wittern
Abstract:
GraphQL is a query language for APIs and a runtime for executing those queries, fetching the requested data from existing microservices, REST APIs, databases, or other sources. Its expressiveness and its flexibility have made it an attractive candidate for API providers in many industries, especially through the web. A major drawback to blindly servicing a client's query in GraphQL is that the cos…
▽ More
GraphQL is a query language for APIs and a runtime for executing those queries, fetching the requested data from existing microservices, REST APIs, databases, or other sources. Its expressiveness and its flexibility have made it an attractive candidate for API providers in many industries, especially through the web. A major drawback to blindly servicing a client's query in GraphQL is that the cost of a query can be unexpectedly large, creating computation and resource overload for the provider, and API rate-limit overages and infrastructure overload for the client. To mitigate these drawbacks, it is necessary to efficiently estimate the cost of a query before executing it. Estimating query cost is challenging, because GraphQL queries have a nested structure, GraphQL APIs follow different design conventions, and the underlying data sources are hidden. Estimates based on worst-case static query analysis have had limited success because they tend to grossly overestimate cost. We propose a machine-learning approach to efficiently and accurately estimate the query cost. We also demonstrate the power of this approach by testing it on query-response data from publicly available commercial APIs. Our framework is efficient and predicts query costs with high accuracy, consistently outperforming the static analysis by a large margin.
△ Less
Submitted 26 August, 2021; v1 submitted 25 August, 2021;
originally announced August 2021.
-
A Principled Approach to GraphQL Query Cost Analysis
Authors:
Alan Cha,
Erik Wittern,
Guillaume Baudart,
James C. Davis,
Louis Mandel,
Jim A. Laredo
Abstract:
The landscape of web APIs is evolving to meet new client requirements and to facilitate how providers fulfill them. A recent web API model is GraphQL, which is both a query language and a runtime. Using GraphQL, client queries express the data they want to retrieve or mutate, and servers respond with exactly those data or changes. GraphQL's expressiveness is risky for service providers because cli…
▽ More
The landscape of web APIs is evolving to meet new client requirements and to facilitate how providers fulfill them. A recent web API model is GraphQL, which is both a query language and a runtime. Using GraphQL, client queries express the data they want to retrieve or mutate, and servers respond with exactly those data or changes. GraphQL's expressiveness is risky for service providers because clients can succinctly request stupendous amounts of data, and responding to overly complex queries can be costly or disrupt service availability. Recent empirical work has shown that many service providers are at risk. Using traditional API management methods is not sufficient, and practitioners lack principled means of estimating and measuring the cost of the GraphQL queries they receive. In this work, we present a linear-time GraphQL query analysis that can measure the cost of a query without executing it. Our approach can be applied in a separate API management layer and used with arbitrary GraphQL backends. In contrast to existing static approaches, our analysis supports common GraphQL conventions that affect query cost, and our analysis is provably correct based on our formal specification of GraphQL semantics. We demonstrate the potential of our approach using a novel GraphQL query-response corpus for two commercial GraphQL APIs. Our query analysis consistently obtains upper cost bounds, tight enough relative to the true response sizes to be actionable for service providers. In contrast, existing static GraphQL query analyses exhibit over-estimates and under-estimates because they fail to support GraphQL conventions.
△ Less
Submitted 11 September, 2020;
originally announced September 2020.
-
Reactive Probabilistic Programming
Authors:
Guillaume Baudart,
Louis Mandel,
Eric Atkinson,
Benjamin Sherman,
Marc Pouzet,
Michael Carbin
Abstract:
Synchronous modeling is at the heart of programming languages like Lustre, Esterel, or Scade used routinely for implementing safety critical control software, e.g., fly-by-wire and engine control in planes. However, to date these languages have had limited modern support for modeling uncertainty -- probabilistic aspects of software's environment or behavior -- even though modeling uncertainty is a…
▽ More
Synchronous modeling is at the heart of programming languages like Lustre, Esterel, or Scade used routinely for implementing safety critical control software, e.g., fly-by-wire and engine control in planes. However, to date these languages have had limited modern support for modeling uncertainty -- probabilistic aspects of software's environment or behavior -- even though modeling uncertainty is a primary activity when designing a control system.
In this paper we present ProbZelus the first synchronous probabilistic programming language. ProbZelus conservatively provides the facilities of a synchronous language to write control software, with probabilistic constructs to model uncertainties and perform inference-in-the-loop.
We present the design and implementation of the language. We propose a measure-theoretic semantics of probabilistic stream functions and a simple type discipline to separate deterministic and probabilistic expressions. We demonstrate a semantics-preserving compilation into a first-order functional language that lends itself to a simple presentation of inference algorithms for streaming models. We also redesign the delayed sampling inference algorithm to provide efficient streaming inference. Together with an evaluation on several reactive applications, our results demonstrate that ProbZelus enables the design of reactive probabilistic applications and efficient, bounded memory inference.
△ Less
Submitted 9 April, 2020; v1 submitted 20 August, 2019;
originally announced August 2019.
-
An Empirical Study of GraphQL Schemas
Authors:
Erik Wittern,
Alan Cha,
James C. Davis,
Guillaume Baudart,
Louis Mandel
Abstract:
GraphQL is a query language for APIs and a runtime to execute queries. Using GraphQL queries, clients define precisely what data they wish to retrieve or mutate on a server, leading to fewer round trips and reduced response sizes. Although interest in GraphQL is on the rise, with increasing adoption at major organizations, little is known about what GraphQL interfaces look like in practice. This l…
▽ More
GraphQL is a query language for APIs and a runtime to execute queries. Using GraphQL queries, clients define precisely what data they wish to retrieve or mutate on a server, leading to fewer round trips and reduced response sizes. Although interest in GraphQL is on the rise, with increasing adoption at major organizations, little is known about what GraphQL interfaces look like in practice. This lack of knowledge makes it hard for providers to understand what practices promote idiomatic, easy-to-use APIs, and what pitfalls to avoid. To address this gap, we study the design of GraphQL interfaces in practice by analyzing their schemas - the descriptions of their exposed data types and the possible operations on the underlying data. We base our study on two novel corpuses of GraphQL schemas, one of 16 commercial GraphQL schemas and the other of 8,399 GraphQL schemas mined from GitHub projects. We make both corpuses available to other researchers. Using these corpuses, we characterize the size of schemas and their use of GraphQL features and assess the use of both prescribed and organic naming conventions. We also report that a majority of APIs are susceptible to denial of service through complex queries, posing real security risks previously discussed only in theory. We also assess ways in which GraphQL APIs attempt to address these concerns.
△ Less
Submitted 30 July, 2019;
originally announced July 2019.
-
Yaps: Python Frontend to Stan
Authors:
Guillaume Baudart,
Martin Hirzel,
Kiran Kate,
Louis Mandel,
Avraham Shinnar
Abstract:
Stan is a popular probabilistic programming language with a self-contained syntax and semantics that is close to graphical models. Unfortunately, existing embeddings of Stan in Python use multi-line strings. That approach forces users to switch between two different language styles, with no support for syntax highlighting or simple error reporting within the Stan code. This paper tackles the quest…
▽ More
Stan is a popular probabilistic programming language with a self-contained syntax and semantics that is close to graphical models. Unfortunately, existing embeddings of Stan in Python use multi-line strings. That approach forces users to switch between two different language styles, with no support for syntax highlighting or simple error reporting within the Stan code. This paper tackles the question of whether Stan could use Python syntax while retaining its self-contained semantics. The answer is yes, that can be accomplished by reinterpreting the Python syntax. This paper introduces Yaps, a new frontend to Stan based on reinterpreted Python. We tested Yaps on over a thousand Stan models and made it available open-source.
△ Less
Submitted 5 December, 2018;
originally announced December 2018.
-
Compiling Stan to Generative Probabilistic Languages and Extension to Deep Probabilistic Programming
Authors:
Guillaume Baudart,
Javier Burroni,
Martin Hirzel,
Louis Mandel,
Avraham Shinnar
Abstract:
Stan is a probabilistic programming language that is popular in the statistics community, with a high-level syntax for expressing probabilistic models. Stan differs by nature from generative probabilistic programming languages like Church, Anglican, or Pyro. This paper presents a comprehensive compilation scheme to compile any Stan model to a generative language and proves its correctness. We use…
▽ More
Stan is a probabilistic programming language that is popular in the statistics community, with a high-level syntax for expressing probabilistic models. Stan differs by nature from generative probabilistic programming languages like Church, Anglican, or Pyro. This paper presents a comprehensive compilation scheme to compile any Stan model to a generative language and proves its correctness. We use our compilation scheme to build two new backends for the Stanc3 compiler targeting Pyro and NumPyro. Experimental results show that the NumPyro backend yields a 2.3x speedup compared to Stan in geometric mean over 26 benchmarks. Building on Pyro we extend Stan with support for explicit variational inference guides and deep probabilistic models. That way, users familiar with Stan get access to new features without having to learn a fundamentally new language.
△ Less
Submitted 11 April, 2021; v1 submitted 30 September, 2018;
originally announced October 2018.
-
Deep Probabilistic Programming Languages: A Qualitative Study
Authors:
Guillaume Baudart,
Martin Hirzel,
Louis Mandel
Abstract:
Deep probabilistic programming languages try to combine the advantages of deep learning with those of probabilistic programming languages. If successful, this would be a big step forward in machine learning and programming languages. Unfortunately, as of now, this new crop of languages is hard to use and understand. This paper addresses this problem directly by explaining deep probabilistic progra…
▽ More
Deep probabilistic programming languages try to combine the advantages of deep learning with those of probabilistic programming languages. If successful, this would be a big step forward in machine learning and programming languages. Unfortunately, as of now, this new crop of languages is hard to use and understand. This paper addresses this problem directly by explaining deep probabilistic programming languages and indirectly by characterizing their current strengths and weaknesses.
△ Less
Submitted 17 April, 2018;
originally announced April 2018.