Search | arXiv e-print repository

Automatic Rao-Blackwellization for Sequential Monte Carlo with Belief Propagation

Authors: Waïss Azizian, Guillaume Baudart, Marc Lelarge

Abstract: Exact Bayesian inference on state-space models (SSM) is in general intractable, and unfortunately, basic Sequential Monte Carlo (SMC) methods do not yield correct approximations for complex models. In this paper, we propose a mixed inference algorithm that computes closed-form solutions using belief propagation as much as possible, and falls back to sampling-based SMC methods when exact computations fail. This algorithm thus implements automatic Rao-Blackwellization and is even exact for Gaussian tree models.

Submitted 15 December, 2023; originally announced December 2023.

arXiv:2308.01676 [pdf, other]

Density-Based Semantics for Reactive Probabilistic Programming

Authors: Guillaume Baudart, Louis Mandel, Christine Tasson

Abstract: Synchronous languages are now a standard industry tool for critical embedded systems. Designers write high-level specifications by composing streams of values using block diagrams. These languages have been extended with Bayesian reasoning to program state-space models which compute a stream of distributions given a stream of observations. However, the semantics of probabilistic models is only defined for scheduled equations -- a significant limitation compared to dataflow synchronous languages and block diagrams which do not require any ordering. In this paper we propose two schedule-agnostic semantics for a probabilistic synchronous language. The key idea is to interpret probabilistic expressions as a stream of un-normalized density functions which map random variable values to a result and a positive score. The co-iterative semantics interprets programs as state machines and equations are computed using a fixpoint operator. The relational semantics directly manipulates streams and is thus a better fit to reason about program equivalence. We use the relational semantics to prove the correctness of a program transformation required to run an optimized inference algorithm for state-space models with constant parameters.

Submitted 7 September, 2023; v1 submitted 3 August, 2023; originally announced August 2023.

arXiv:2307.07355 [pdf, ps, other]

Verifying Performance Properties of Probabilistic Inference

Authors: Eric Atkinson, Ellie Y. Cheng, Guillaume Baudart, Louis Mandel, Michael Carbin

Abstract: In this extended abstract, we discuss the opportunity to formally verify that inference systems for probabilistic programming guarantee good performance. In particular, we focus on hybrid inference systems that combine exact and approximate inference to try to exploit the advantages of each. Their performance depends critically on a) the division between exact and approximate inference, and b) the computational resources consumed by exact inference. We describe several projects in this direction. Semi-symbolic Inference (SSI) is a type of hybrid inference system that provides limited guarantees by construction on the exact/approximate division. In addition to these limited guarantees, we also describe ongoing work to extend guarantees to a more complex class of programs, requiring a program analysis to ensure the guarantees. Finally, we also describe work on verifying that inference systems using delayed sampling -- another type of hybrid inference -- execute in bounded memory. Together, these projects show that verification can deliver the performance guarantees that probabilistic programming languages need.

Submitted 14 July, 2023; originally announced July 2023.

arXiv:2209.07490 [pdf, other]

Semi-Symbolic Inference for Efficient Streaming Probabilistic Programming

Authors: Eric Atkinson, Charles Yuan, Guillaume Baudart, Louis Mandel, Michael Carbin

Abstract: Efficient inference is often possible in a streaming context using Rao-Blackwellized particle filters (RBPFs), which exactly solve inference problems when possible and fall back on sampling approximations when necessary. While RBPFs can be implemented by hand to provide efficient inference, the goal of streaming probabilistic programming is to automatically generate such efficient inference implementations given input probabilistic programs. In this work, we propose semi-symbolic inference, a technique for executing probabilistic programs using a runtime inference system that automatically implements Rao-Blackwellized particle filtering. To perform exact and approximate inference together, the semi-symbolic inference system manipulates symbolic distributions to perform exact inference when possible and falls back on approximate sampling when necessary. This approach enables the system to implement the same RBPF a developer would write by hand. To ensure this, we identify closed families of distributions -- such as linear-Gaussian and finite discrete models -- on which the inference system guarantees exact inference. We have implemented the runtime inference system in the ProbZelus streaming probabilistic programming language. Despite an average $1.6\times$ slowdown compared to the state of the art on existing benchmarks, our evaluation shows that speedups of $3\times$-$87\times$ are obtainable on a new set of challenging benchmarks we have designed to exploit closed families.

Submitted 5 November, 2022; v1 submitted 15 September, 2022; originally announced September 2022.

arXiv:2110.11790 [pdf, other]

Automatic Guide Generation for Stan via NumPyro

Authors: Guillaume Baudart, Louis Mandel

Abstract: Stan is a very popular probabilistic language with a state-of-the-art HMC sampler but it only offers a limited choice of algorithms for black-box variational inference. In this paper, we show that using our recently proposed compiler from Stan to Pyro, Stan users can easily try the set of algorithms implemented in Pyro for black-box variational inference. We evaluate our approach on PosteriorDB, a database of Stan models with corresponding data and reference posterior samples. Results show that the eight algorithms available in Pyro offer a range of possible compromises between complexity and accuracy. This paper illustrates that compiling Stan to another probabilistic language can be used to leverage new features for Stan users, and give access to a large set of examples for language developers who implement these new features.

Submitted 22 October, 2021; originally announced October 2021.

Comments: PROBPROG 2021

arXiv:2109.12473 [pdf, other]

doi 10.1145/3485492

Statically Bounded-Memory Delayed Sampling for Probabilistic Streams

Authors: Eric Atkinson, Guillaume Baudart, Louis Mandel, Charles Yuan, Michael Carbin

Abstract: Probabilistic programming languages aid developers performing Bayesian inference. These languages provide programming constructs and tools for probabilistic modeling and automated inference. Prior work introduced a probabilistic programming language, ProbZelus, to extend probabilistic programming functionality to unbounded streams of data. This work demonstrated that the delayed sampling inference algorithm could be extended to work in a streaming context. ProbZelus showed that while delayed sampling could be effectively deployed on some programs, depending on the probabilistic model under consideration, delayed sampling is not guaranteed to use a bounded amount of memory over the course of the execution of the program. In this paper, we present conditions on a probabilistic program's execution under which delayed sampling will execute in bounded memory. The two conditions are dataflow properties of the core operations of delayed sampling: the $m$-consumed property and the unseparated paths property. A program executes in bounded memory under delayed sampling if, and only if, it satisfies the $m$-consumed and unseparated paths properties. We propose a static analysis that abstracts over these properties to soundly ensure that any program that passes the analysis satisfies these properties, and thus executes in bounded memory under delayed sampling.

Submitted 13 December, 2021; v1 submitted 25 September, 2021; originally announced September 2021.

Comments: The following is a summary of the changes in each revision. [v2] corrected the URL for the code repository. [v3] corrected the definition of the m-consumed semantic property. [v4] fixed a typo. [v5] added this comment

Journal ref: Proc. ACM Program. Lang. 5, OOPSLA, Article 115 (October 2021)

arXiv:2108.11139 [pdf, other]

Learning GraphQL Query Costs (Extended Version)

Authors: Georgios Mavroudeas, Guillaume Baudart, Alan Cha, Martin Hirzel, Jim A. Laredo, Malik Magdon-Ismail, Louis Mandel, Erik Wittern

Abstract: GraphQL is a query language for APIs and a runtime for executing those queries, fetching the requested data from existing microservices, REST APIs, databases, or other sources. Its expressiveness and its flexibility have made it an attractive candidate for API providers in many industries, especially through the web. A major drawback to blindly servicing a client's query in GraphQL is that the cost of a query can be unexpectedly large, creating computation and resource overload for the provider, and API rate-limit overages and infrastructure overload for the client. To mitigate these drawbacks, it is necessary to efficiently estimate the cost of a query before executing it. Estimating query cost is challenging, because GraphQL queries have a nested structure, GraphQL APIs follow different design conventions, and the underlying data sources are hidden. Estimates based on worst-case static query analysis have had limited success because they tend to grossly overestimate cost. We propose a machine-learning approach to efficiently and accurately estimate the query cost. We also demonstrate the power of this approach by testing it on query-response data from publicly available commercial APIs. Our framework is efficient and predicts query costs with high accuracy, consistently outperforming the static analysis by a large margin.

Submitted 26 August, 2021; v1 submitted 25 August, 2021; originally announced August 2021.

arXiv:2009.05632 [pdf, other]

A Principled Approach to GraphQL Query Cost Analysis

Authors: Alan Cha, Erik Wittern, Guillaume Baudart, James C. Davis, Louis Mandel, Jim A. Laredo

Abstract: The landscape of web APIs is evolving to meet new client requirements and to facilitate how providers fulfill them. A recent web API model is GraphQL, which is both a query language and a runtime. Using GraphQL, client queries express the data they want to retrieve or mutate, and servers respond with exactly those data or changes. GraphQL's expressiveness is risky for service providers because clients can succinctly request stupendous amounts of data, and responding to overly complex queries can be costly or disrupt service availability. Recent empirical work has shown that many service providers are at risk. Using traditional API management methods is not sufficient, and practitioners lack principled means of estimating and measuring the cost of the GraphQL queries they receive. In this work, we present a linear-time GraphQL query analysis that can measure the cost of a query without executing it. Our approach can be applied in a separate API management layer and used with arbitrary GraphQL backends. In contrast to existing static approaches, our analysis supports common GraphQL conventions that affect query cost, and our analysis is provably correct based on our formal specification of GraphQL semantics. We demonstrate the potential of our approach using a novel GraphQL query-response corpus for two commercial GraphQL APIs. Our query analysis consistently obtains upper cost bounds, tight enough relative to the true response sizes to be actionable for service providers. In contrast, existing static GraphQL query analyses exhibit over-estimates and under-estimates because they fail to support GraphQL conventions.

Submitted 11 September, 2020; originally announced September 2020.

Comments: Published at the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE) 2020

arXiv:2007.01977 [pdf, other]

Lale: Consistent Automated Machine Learning

Authors: Guillaume Baudart, Martin Hirzel, Kiran Kate, Parikshit Ram, Avraham Shinnar

Abstract: Automated machine learning makes it easier for data scientists to develop pipelines by searching over possible choices for hyperparameters, algorithms, and even pipeline topologies. Unfortunately, the syntax for automated machine learning tools is inconsistent with manual machine learning, with each other, and with error checks. Furthermore, few tools support advanced features such as topology search or higher-order operators. This paper introduces Lale, a library of high-level Python interfaces that simplifies and unifies automated machine learning in a consistent way.

Submitted 3 July, 2020; originally announced July 2020.

Comments: KDD Workshop on Automation in Machine Learning (AutoML@KDD), August 2020

arXiv:2006.16984 [pdf, other]

Mining Documentation to Extract Hyperparameter Schemas

Authors: Guillaume Baudart, Peter D. Kirchner, Martin Hirzel, Kiran Kate

Abstract: AI automation tools need machine-readable hyperparameter schemas to define their search spaces. At the same time, AI libraries often come with good human-readable documentation. While such documentation contains most of the necessary information, it is unfortunately not ready to be consumed by tools. This paper describes how to automatically mine Python docstrings in AI libraries to extract JSON Schemas for their hyperparameters. We evaluate our approach on 119 transformers and estimators from three different libraries and find that it is effective at extracting machine-readable schemas. Our vision is to reduce the burden to manually create and maintain such schemas for AI automation tools and broaden the reach of automation to larger libraries and richer schemas.

Submitted 2 July, 2020; v1 submitted 30 June, 2020; originally announced June 2020.

arXiv:1908.07563 [pdf, other]

Reactive Probabilistic Programming

Authors: Guillaume Baudart, Louis Mandel, Eric Atkinson, Benjamin Sherman, Marc Pouzet, Michael Carbin

Abstract: Synchronous modeling is at the heart of programming languages like Lustre, Esterel, or Scade used routinely for implementing safety-critical control software, e.g., fly-by-wire and engine control in planes. However, to date these languages have had limited modern support for modeling uncertainty -- probabilistic aspects of software's environment or behavior -- even though modeling uncertainty is a primary activity when designing a control system. In this paper we present ProbZelus, the first synchronous probabilistic programming language. ProbZelus conservatively provides the facilities of a synchronous language to write control software, with probabilistic constructs to model uncertainties and perform inference-in-the-loop. We present the design and implementation of the language. We propose a measure-theoretic semantics of probabilistic stream functions and a simple type discipline to separate deterministic and probabilistic expressions. We demonstrate a semantics-preserving compilation into a first-order functional language that lends itself to a simple presentation of inference algorithms for streaming models. We also redesign the delayed sampling inference algorithm to provide efficient streaming inference. Together with an evaluation on several reactive applications, our results demonstrate that ProbZelus enables the design of reactive probabilistic applications and efficient, bounded-memory inference.

Submitted 9 April, 2020; v1 submitted 20 August, 2019; originally announced August 2019.

Comments: Version with appendices of the PLDI 2020 paper "Reactive Probabilistic Programming"

arXiv:1907.13012 [pdf, other]

An Empirical Study of GraphQL Schemas

Authors: Erik Wittern, Alan Cha, James C. Davis, Guillaume Baudart, Louis Mandel

Abstract: GraphQL is a query language for APIs and a runtime to execute queries. Using GraphQL queries, clients define precisely what data they wish to retrieve or mutate on a server, leading to fewer round trips and reduced response sizes. Although interest in GraphQL is on the rise, with increasing adoption at major organizations, little is known about what GraphQL interfaces look like in practice. This lack of knowledge makes it hard for providers to understand what practices promote idiomatic, easy-to-use APIs, and what pitfalls to avoid. To address this gap, we study the design of GraphQL interfaces in practice by analyzing their schemas - the descriptions of their exposed data types and the possible operations on the underlying data. We base our study on two novel corpuses of GraphQL schemas, one of 16 commercial GraphQL schemas and the other of 8,399 GraphQL schemas mined from GitHub projects. We make both corpuses available to other researchers. Using these corpuses, we characterize the size of schemas and their use of GraphQL features and assess the use of both prescribed and organic naming conventions. We also report that a majority of APIs are susceptible to denial of service through complex queries, posing real security risks previously discussed only in theory. We also assess ways in which GraphQL APIs attempt to address these concerns.

Submitted 30 July, 2019; originally announced July 2019.

arXiv:1812.04125 [pdf, other]

Yaps: Python Frontend to Stan

Authors: Guillaume Baudart, Martin Hirzel, Kiran Kate, Louis Mandel, Avraham Shinnar

Abstract: Stan is a popular probabilistic programming language with a self-contained syntax and semantics that is close to graphical models. Unfortunately, existing embeddings of Stan in Python use multi-line strings. That approach forces users to switch between two different language styles, with no support for syntax highlighting or simple error reporting within the Stan code. This paper tackles the question of whether Stan could use Python syntax while retaining its self-contained semantics. The answer is yes: that can be accomplished by reinterpreting the Python syntax. This paper introduces Yaps, a new frontend to Stan based on reinterpreted Python. We tested Yaps on over a thousand Stan models and made it available open-source.

Submitted 5 December, 2018; originally announced December 2018.

arXiv:1810.00873 [pdf, other]

Compiling Stan to Generative Probabilistic Languages and Extension to Deep Probabilistic Programming

Authors: Guillaume Baudart, Javier Burroni, Martin Hirzel, Louis Mandel, Avraham Shinnar

Abstract: Stan is a probabilistic programming language that is popular in the statistics community, with a high-level syntax for expressing probabilistic models. Stan differs by nature from generative probabilistic programming languages like Church, Anglican, or Pyro. This paper presents a comprehensive compilation scheme to compile any Stan model to a generative language and proves its correctness. We use our compilation scheme to build two new backends for the Stanc3 compiler targeting Pyro and NumPyro. Experimental results show that the NumPyro backend yields a 2.3x speedup compared to Stan in geometric mean over 26 benchmarks. Building on Pyro we extend Stan with support for explicit variational inference guides and deep probabilistic models. That way, users familiar with Stan get access to new features without having to learn a fundamentally new language.

Submitted 11 April, 2021; v1 submitted 30 September, 2018; originally announced October 2018.

arXiv:1804.06458 [pdf, other]

Deep Probabilistic Programming Languages: A Qualitative Study

Authors: Guillaume Baudart, Martin Hirzel, Louis Mandel

Abstract: Deep probabilistic programming languages try to combine the advantages of deep learning with those of probabilistic programming languages. If successful, this would be a big step forward in machine learning and programming languages. Unfortunately, as of now, this new crop of languages is hard to use and understand. This paper addresses this problem directly by explaining deep probabilistic programming languages and indirectly by characterizing their current strengths and weaknesses.

Submitted 17 April, 2018; originally announced April 2018.

Showing 1–15 of 15 results for author: Baudart, G