Search | arXiv e-print repository

Rényi Neural Processes

Authors: Xuesong Wang, He Zhao, Edwin V. Bonilla

Abstract: Neural Processes (NPs) are variational frameworks that aim to represent stochastic processes with deep neural networks. Despite their obvious benefits in uncertainty estimation for complex distributions via data-driven priors, NPs enforce network parameter sharing between the conditional prior and posterior distributions, thereby risking introducing a misspecified prior. We hereby propose Rényi Ne… ▽ More Neural Processes (NPs) are variational frameworks that aim to represent stochastic processes with deep neural networks. Despite their obvious benefits in uncertainty estimation for complex distributions via data-driven priors, NPs enforce network parameter sharing between the conditional prior and posterior distributions, thereby risking introducing a misspecified prior. We hereby propose Rényi Neural Processes (RNP) to relax the influence of the misspecified prior and optimize a tighter bound of the marginal likelihood. More specifically, by replacing the standard KL divergence with the Rényi divergence between the posterior and the approximated prior, we ameliorate the impact of the misspecified prior via a parameter α so that the resulting posterior focuses more on tail samples and reduce density on overconfident regions. Our experiments showed log-likelihood improvements on several existing NP families. We demonstrated the superior performance of our approach on various benchmarks including regression and image inpainting tasks. We also validate the effectiveness of RNPs on real-world tabular regression problems. △ Less

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.15167 [pdf, other]

ProDAG: Projection-induced variational inference for directed acyclic graphs

Authors: Ryan Thompson, Edwin V. Bonilla, Robert Kohn

Abstract: Directed acyclic graph (DAG) learning is a rapidly expanding field of research. Though the field has witnessed remarkable advances over the past few years, it remains statistically and computationally challenging to learn a single (point estimate) DAG from data, let alone provide uncertainty quantification. Our article addresses the difficult task of quantifying graph uncertainty by develo** a v… ▽ More Directed acyclic graph (DAG) learning is a rapidly expanding field of research. Though the field has witnessed remarkable advances over the past few years, it remains statistically and computationally challenging to learn a single (point estimate) DAG from data, let alone provide uncertainty quantification. Our article addresses the difficult task of quantifying graph uncertainty by develo** a variational Bayes inference framework based on novel distributions that have support directly on the space of DAGs. The distributions, which we use to form our prior and variational posterior, are induced by a projection operation, whereby an arbitrary continuous distribution is projected onto the space of sparse weighted acyclic adjacency matrices (matrix representations of DAGs) with probability mass on exact zeros. Though the projection constitutes a combinatorial optimization problem, it is solvable at scale via recently developed techniques that reformulate acyclicity as a continuous constraint. We empirically demonstrate that our method, ProDAG, can deliver accurate inference, and often outperforms existing state-of-the-art alternatives. △ Less

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2402.15255 [pdf, other]

Optimal Transport for Structure Learning Under Missing Data

Authors: Vy Vo, He Zhao, Trung Le, Edwin V. Bonilla, Dinh Phung

Abstract: Causal discovery in the presence of missing data introduces a chicken-and-egg dilemma. While the goal is to recover the true causal structure, robust imputation requires considering the dependencies or, preferably, causal relations among variables. Merely filling in missing values with existing imputation methods and subsequently applying structure learning on the complete data is empirically show… ▽ More Causal discovery in the presence of missing data introduces a chicken-and-egg dilemma. While the goal is to recover the true causal structure, robust imputation requires considering the dependencies or, preferably, causal relations among variables. Merely filling in missing values with existing imputation methods and subsequently applying structure learning on the complete data is empirically shown to be sub-optimal. To address this problem, we propose a score-based algorithm for learning causal structures from missing data based on optimal transport. This optimal transport viewpoint diverges from existing score-based approaches that are dominantly based on expectation maximization. We formulate structure learning as a density fitting problem, where the goal is to find the causal model that induces a distribution of minimum Wasserstein distance with the observed data distribution. Our framework is shown to recover the true causal graphs more effectively than competing methods in most simulations and real-data settings. Empirical evidence also shows the superior scalability of our approach, along with the flexibility to incorporate any off-the-shelf causal discovery methods for complete data. △ Less

Submitted 1 June, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

Journal ref: Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024

arXiv:2402.03614 [pdf, other]

Bayesian Vector AutoRegression with Factorised Granger-Causal Graphs

Authors: He Zhao, Vassili Kitsios, Terence J. O'Kane, Edwin V. Bonilla

Abstract: We study the problem of automatically discovering Granger causal relations from observational multivariate time-series data.Vector autoregressive (VAR) models have been time-tested for this problem, including Bayesian variants and more recent developments using deep neural networks. Most existing VAR methods for Granger causality use sparsity-inducing penalties/priors or post-hoc thresholds to int… ▽ More We study the problem of automatically discovering Granger causal relations from observational multivariate time-series data.Vector autoregressive (VAR) models have been time-tested for this problem, including Bayesian variants and more recent developments using deep neural networks. Most existing VAR methods for Granger causality use sparsity-inducing penalties/priors or post-hoc thresholds to interpret their coefficients as Granger causal graphs. Instead, we propose a new Bayesian VAR model with a hierarchical factorised prior distribution over binary Granger causal graphs, separately from the VAR coefficients. We develop an efficient algorithm to infer the posterior over binary Granger causal graphs. Comprehensive experiments on synthetic, semi-synthetic, and climate data show that our method is more uncertainty aware, has less hyperparameters, and achieves better performance than competing approaches, especially in low-data regimes where there are less observations. △ Less

Submitted 23 May, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

arXiv:2402.02644 [pdf, other]

Variational DAG Estimation via State Augmentation With Stochastic Permutations

Authors: Edwin V. Bonilla, Pantelis Elinas, He Zhao, Maurizio Filippone, Vassili Kitsios, Terry O'Kane

Abstract: Estimating the structure of a Bayesian network, in the form of a directed acyclic graph (DAG), from observational data is a statistically and computationally hard problem with essential applications in areas such as causal discovery. Bayesian approaches are a promising direction for solving this task, as they allow for uncertainty quantification and deal with well-known identifiability issues. Fro… ▽ More Estimating the structure of a Bayesian network, in the form of a directed acyclic graph (DAG), from observational data is a statistically and computationally hard problem with essential applications in areas such as causal discovery. Bayesian approaches are a promising direction for solving this task, as they allow for uncertainty quantification and deal with well-known identifiability issues. From a probabilistic inference perspective, the main challenges are (i) representing distributions over graphs that satisfy the DAG constraint and (ii) estimating a posterior over the underlying combinatorial space. We propose an approach that addresses these challenges by formulating a joint distribution on an augmented space of DAGs and permutations. We carry out posterior estimation via variational inference, where we exploit continuous relaxations of discrete distributions. We show that our approach performs competitively when compared with a wide range of Bayesian and non-Bayesian benchmarks on a range of synthetic and real datasets. △ Less

Submitted 28 May, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

arXiv:2310.15627 [pdf, other]

Contextual Directed Acyclic Graphs

Authors: Ryan Thompson, Edwin V. Bonilla, Robert Kohn

Abstract: Estimating the structure of directed acyclic graphs (DAGs) from observational data remains a significant challenge in machine learning. Most research in this area concentrates on learning a single DAG for the entire population. This paper considers an alternative setting where the graph structure varies across individuals based on available "contextual" features. We tackle this contextual DAG prob… ▽ More Estimating the structure of directed acyclic graphs (DAGs) from observational data remains a significant challenge in machine learning. Most research in this area concentrates on learning a single DAG for the entire population. This paper considers an alternative setting where the graph structure varies across individuals based on available "contextual" features. We tackle this contextual DAG problem via a neural network that maps the contextual features to a DAG, represented as a weighted adjacency matrix. The neural network is equipped with a novel projection layer that ensures the output matrices are sparse and satisfy a recently developed characterization of acyclicity. We devise a scalable computational framework for learning contextual DAGs and provide a convergence guarantee and an analytical gradient for backpropagating through the projection layer. Our experiments suggest that the new approach can recover the true context-specific graph where existing approaches fail. △ Less

Submitted 20 February, 2024; v1 submitted 24 October, 2023; originally announced October 2023.

Comments: To appear in the Proceedings of the 27th International Conference on Artificial Intelligence and Statistics

arXiv:2305.18435 [pdf, other]

Statistically Efficient Bayesian Sequential Experiment Design via Reinforcement Learning with Cross-Entropy Estimators

Authors: Tom Blau, Iadine Chades, Amir Dezfouli, Daniel Steinberg, Edwin V. Bonilla

Abstract: Reinforcement learning can learn amortised design policies for designing sequences of experiments. However, current amortised methods rely on estimators of expected information gain (EIG) that require an exponential number of samples on the magnitude of the EIG to achieve an unbiased estimation. We propose the use of an alternative estimator based on the cross-entropy of the joint model distributi… ▽ More Reinforcement learning can learn amortised design policies for designing sequences of experiments. However, current amortised methods rely on estimators of expected information gain (EIG) that require an exponential number of samples on the magnitude of the EIG to achieve an unbiased estimation. We propose the use of an alternative estimator based on the cross-entropy of the joint model distribution and a flexible proposal distribution. This proposal distribution approximates the true posterior of the model parameters given the experimental history and the design policy. Our method overcomes the exponential-sample complexity of previous approaches and provide more accurate estimates of high EIG values. More importantly, it allows learning of superior design policies, and is compatible with continuous and discrete design spaces, non-differentiable likelihoods and even implicit probabilistic models. △ Less

Submitted 4 February, 2024; v1 submitted 28 May, 2023; originally announced May 2023.

arXiv:2302.09921 [pdf, other]

Free-Form Variational Inference for Gaussian Process State-Space Models

Authors: Xuhui Fan, Edwin V. Bonilla, Terence J. O'Kane, Scott A. Sisson

Abstract: Gaussian process state-space models (GPSSMs) provide a principled and flexible approach to modeling the dynamics of a latent state, which is observed at discrete-time points via a likelihood model. However, inference in GPSSMs is computationally and statistically challenging due to the large number of latent variables in the model and the strong temporal dependencies between them. In this paper, w… ▽ More Gaussian process state-space models (GPSSMs) provide a principled and flexible approach to modeling the dynamics of a latent state, which is observed at discrete-time points via a likelihood model. However, inference in GPSSMs is computationally and statistically challenging due to the large number of latent variables in the model and the strong temporal dependencies between them. In this paper, we propose a new method for inference in Bayesian GPSSMs, which overcomes the drawbacks of previous approaches, namely over-simplified assumptions, and high computational requirements. Our method is based on free-form variational inference via stochastic gradient Hamiltonian Monte Carlo within the inducing-variable formalism. Furthermore, by exploiting our proposed variational distribution, we provide a collapsed extension of our method where the inducing variables are marginalized analytically. We also showcase results when combining our framework with particle MCMC methods. We show that, on six real-world datasets, our approach can learn transition dynamics and latent states more accurately than competing methods. △ Less

Submitted 16 July, 2023; v1 submitted 20 February, 2023; originally announced February 2023.

Comments: Updating to final version to appear in the proceedings

arXiv:2211.00335 [pdf, other]

Recurrent Neural Networks and Universal Approximation of Bayesian Filters

Authors: Adrian N. Bishop, Edwin V. Bonilla

Abstract: We consider the Bayesian optimal filtering problem: i.e. estimating some conditional statistics of a latent time-series signal from an observation sequence. Classical approaches often rely on the use of assumed or estimated transition and observation models. Instead, we formulate a generic recurrent neural network framework and seek to learn directly a recursive map** from observational inputs t… ▽ More We consider the Bayesian optimal filtering problem: i.e. estimating some conditional statistics of a latent time-series signal from an observation sequence. Classical approaches often rely on the use of assumed or estimated transition and observation models. Instead, we formulate a generic recurrent neural network framework and seek to learn directly a recursive map** from observational inputs to the desired estimator statistics. The main focus of this article is the approximation capabilities of this framework. We provide approximation error bounds for filtering in general non-compact domains. We also consider strong time-uniform approximation error bounds that guarantee good long-time performance. We discuss and illustrate a number of practical concerns and implications of these results. △ Less

Submitted 15 March, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

Journal ref: In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS) 2023, Valencia, Spain. PMLR: Volume 206

arXiv:2202.12508 [pdf, other]

Addressing Over-Smoothing in Graph Neural Networks via Deep Supervision

Authors: Pantelis Elinas, Edwin V. Bonilla

Abstract: Learning useful node and graph representations with graph neural networks (GNNs) is a challenging task. It is known that deep GNNs suffer from over-smoothing where, as the number of layers increases, node representations become nearly indistinguishable and model performance on the downstream task degrades significantly. To address this problem, we propose deeply-supervised GNNs (DSGNNs), i.e., GNN… ▽ More Learning useful node and graph representations with graph neural networks (GNNs) is a challenging task. It is known that deep GNNs suffer from over-smoothing where, as the number of layers increases, node representations become nearly indistinguishable and model performance on the downstream task degrades significantly. To address this problem, we propose deeply-supervised GNNs (DSGNNs), i.e., GNNs enhanced with deep supervision where representations learned at all layers are used for training. We show empirically that DSGNNs are resilient to over-smoothing and can outperform competitive benchmarks on node and graph property prediction problems. △ Less

Submitted 25 February, 2022; originally announced February 2022.

arXiv:2202.00821 [pdf, other]

Optimizing Sequential Experimental Design with Deep Reinforcement Learning

Authors: Tom Blau, Edwin V. Bonilla, Iadine Chades, Amir Dezfouli

Abstract: Bayesian approaches developed to solve the optimal design of sequential experiments are mathematically elegant but computationally challenging. Recently, techniques using amortization have been proposed to make these Bayesian approaches practical, by training a parameterized policy that proposes designs efficiently at deployment time. However, these methods may not sufficiently explore the design… ▽ More Bayesian approaches developed to solve the optimal design of sequential experiments are mathematically elegant but computationally challenging. Recently, techniques using amortization have been proposed to make these Bayesian approaches practical, by training a parameterized policy that proposes designs efficiently at deployment time. However, these methods may not sufficiently explore the design space, require access to a differentiable probabilistic model and can only optimize over continuous design spaces. Here, we address these limitations by showing that the problem of optimizing policies can be reduced to solving a Markov decision process (MDP). We solve the equivalent MDP with modern deep reinforcement learning techniques. Our experiments show that our approach is also computationally efficient at deployment time and exhibits state-of-the-art performance on both continuous and discrete design spaces, even when the probabilistic model is a black box. △ Less

Submitted 17 June, 2022; v1 submitted 1 February, 2022; originally announced February 2022.

Journal ref: International Conference on Machine Learning (2022)

arXiv:2107.01650 [pdf, other]

Learning ODEs via Diffeomorphisms for Fast and Robust Integration

Authors: Weiming Zhi, Tin Lai, Lionel Ott, Edwin V. Bonilla, Fabio Ramos

Abstract: Advances in differentiable numerical integrators have enabled the use of gradient descent techniques to learn ordinary differential equations (ODEs). In the context of machine learning, differentiable solvers are central for Neural ODEs (NODEs), a class of deep learning models with continuous depth, rather than discrete layers. However, these integrators can be unsatisfactorily slow and inaccurate… ▽ More Advances in differentiable numerical integrators have enabled the use of gradient descent techniques to learn ordinary differential equations (ODEs). In the context of machine learning, differentiable solvers are central for Neural ODEs (NODEs), a class of deep learning models with continuous depth, rather than discrete layers. However, these integrators can be unsatisfactorily slow and inaccurate when learning systems of ODEs from long sequences, or when solutions of the system vary at widely different timescales in each dimension. In this paper we propose an alternative approach to learning ODEs from data: we represent the underlying ODE as a vector field that is related to another base vector field by a differentiable bijection, modelled by an invertible neural network. By restricting the base ODE to be amenable to integration, we can drastically speed up and improve the robustness of integration. We demonstrate the efficacy of our method in training and evaluating continuous neural networks models, as well as in learning benchmark ODE systems. We observe improvements of up to two orders of magnitude when integrating learned ODEs with GPUs computation. △ Less

Submitted 4 July, 2021; originally announced July 2021.

arXiv:2106.06245 [pdf, other]

Model Selection for Bayesian Autoencoders

Authors: Ba-Hien Tran, Simone Rossi, Dimitrios Milios, Pietro Michiardi, Edwin V. Bonilla, Maurizio Filippone

Abstract: We develop a novel method for carrying out model selection for Bayesian autoencoders (BAEs) by means of prior hyper-parameter optimization. Inspired by the common practice of type-II maximum likelihood optimization and its equivalence to Kullback-Leibler divergence minimization, we propose to optimize the distributional sliced-Wasserstein distance (DSWD) between the output of the autoencoder and t… ▽ More We develop a novel method for carrying out model selection for Bayesian autoencoders (BAEs) by means of prior hyper-parameter optimization. Inspired by the common practice of type-II maximum likelihood optimization and its equivalence to Kullback-Leibler divergence minimization, we propose to optimize the distributional sliced-Wasserstein distance (DSWD) between the output of the autoencoder and the empirical data distribution. The advantages of this formulation are that we can estimate the DSWD based on samples and handle high-dimensional problems. We carry out posterior estimation of the BAE parameters via stochastic gradient Hamiltonian Monte Carlo and turn our BAE into a generative model by fitting a flexible Dirichlet mixture model in the latent space. Consequently, we obtain a powerful alternative to variational autoencoders, which are the preferred choice in modern applications of autoencoders for representation learning with uncertainty. We evaluate our approach qualitatively and quantitatively using a vast experimental campaign on a number of unsupervised learning tasks and show that, in small-data regimes where priors matter, our approach provides state-of-the-art results, outperforming multiple competitive baselines. △ Less

Submitted 11 June, 2021; originally announced June 2021.

arXiv:2105.04211 [pdf, other]

SigGPDE: Scaling Sparse Gaussian Processes on Sequential Data

Authors: Maud Lemercier, Cristopher Salvi, Thomas Cass, Edwin V. Bonilla, Theodoros Damoulas, Terry Lyons

Abstract: Making predictions and quantifying their uncertainty when the input data is sequential is a fundamental learning challenge, recently attracting increasing attention. We develop SigGPDE, a new scalable sparse variational inference framework for Gaussian Processes (GPs) on sequential data. Our contribution is twofold. First, we construct inducing variables underpinning the sparse approximation so th… ▽ More Making predictions and quantifying their uncertainty when the input data is sequential is a fundamental learning challenge, recently attracting increasing attention. We develop SigGPDE, a new scalable sparse variational inference framework for Gaussian Processes (GPs) on sequential data. Our contribution is twofold. First, we construct inducing variables underpinning the sparse approximation so that the resulting evidence lower bound (ELBO) does not require any matrix inversion. Second, we show that the gradients of the GP signature kernel are solutions of a hyperbolic partial differential equation (PDE). This theoretical insight allows us to build an efficient back-propagation algorithm to optimize the ELBO. We showcase the significant computational gains of SigGPDE compared to existing methods, while achieving state-of-the-art performance for classification tasks on large datasets of up to 1 million multivariate time series. △ Less

Submitted 12 October, 2021; v1 submitted 10 May, 2021; originally announced May 2021.

Comments: Published at ICML 2021

MSC Class: 60L10; 60L20

arXiv:2102.09009 [pdf, other]

BORE: Bayesian Optimization by Density-Ratio Estimation

Authors: Louis C. Tiao, Aaron Klein, Matthias Seeger, Edwin V. Bonilla, Cedric Archambeau, Fabio Ramos

Abstract: Bayesian optimization (BO) is among the most effective and widely-used blackbox optimization methods. BO proposes solutions according to an explore-exploit trade-off criterion encoded in an acquisition function, many of which are computed from the posterior predictive of a probabilistic surrogate model. Prevalent among these is the expected improvement (EI) function. The need to ensure analytical… ▽ More Bayesian optimization (BO) is among the most effective and widely-used blackbox optimization methods. BO proposes solutions according to an explore-exploit trade-off criterion encoded in an acquisition function, many of which are computed from the posterior predictive of a probabilistic surrogate model. Prevalent among these is the expected improvement (EI) function. The need to ensure analytical tractability of the predictive often poses limitations that can hinder the efficiency and applicability of BO. In this paper, we cast the computation of EI as a binary classification problem, building on the link between class-probability estimation and density-ratio estimation, and the lesser-known link between density-ratios and EI. By circumventing the tractability constraints, this reformulation provides numerous advantages, not least in terms of expressiveness, versatility, and scalability. △ Less

Submitted 17 February, 2021; originally announced February 2021.

Comments: preprint, under review

arXiv:2006.05805 [pdf, other]

Distribution Regression for Sequential Data

Authors: Maud Lemercier, Cristopher Salvi, Theodoros Damoulas, Edwin V. Bonilla, Terry Lyons

Abstract: Distribution regression refers to the supervised learning problem where labels are only available for groups of inputs instead of individual inputs. In this paper, we develop a rigorous mathematical framework for distribution regression where inputs are complex data streams. Leveraging properties of the expected signature and a recent signature kernel trick for sequential data from stochastic anal… ▽ More Distribution regression refers to the supervised learning problem where labels are only available for groups of inputs instead of individual inputs. In this paper, we develop a rigorous mathematical framework for distribution regression where inputs are complex data streams. Leveraging properties of the expected signature and a recent signature kernel trick for sequential data from stochastic analysis, we introduce two new learning techniques, one feature-based and the other kernel-based. Each is suited to a different data regime in terms of the number of data streams and the dimensionality of the individual streams. We provide theoretical results on the universality of both approaches and demonstrate empirically their robustness to irregularly sampled multivariate time-series, achieving state-of-the-art performance on both synthetic and real-world examples from thermodynamics, mathematical finance and agricultural science. △ Less

Submitted 29 September, 2021; v1 submitted 10 June, 2020; originally announced June 2020.

Comments: Published at AISTATS 2021

MSC Class: 60L10; 60L20

arXiv:2003.03080 [pdf, other]

Sparse Gaussian Processes Revisited: Bayesian Approaches to Inducing-Variable Approximations

Authors: Simone Rossi, Markus Heinonen, Edwin V. Bonilla, Zheyang Shen, Maurizio Filippone

Abstract: Variational inference techniques based on inducing variables provide an elegant framework for scalable posterior estimation in Gaussian process (GP) models. Besides enabling scalability, one of their main advantages over sparse approximations using direct marginal likelihood maximization is that they provide a robust alternative for point estimation of the inducing inputs, i.e. the location of the… ▽ More Variational inference techniques based on inducing variables provide an elegant framework for scalable posterior estimation in Gaussian process (GP) models. Besides enabling scalability, one of their main advantages over sparse approximations using direct marginal likelihood maximization is that they provide a robust alternative for point estimation of the inducing inputs, i.e. the location of the inducing variables. In this work we challenge the common wisdom that optimizing the inducing inputs in the variational framework yields optimal performance. We show that, by revisiting old model approximations such as the fully-independent training conditionals endowed with powerful sampling-based inference methods, treating both inducing locations and GP hyper-parameters in a Bayesian way can improve performance significantly. Based on stochastic gradient Hamiltonian Monte Carlo, we develop a fully Bayesian approach to scalable GP and deep GP models, and demonstrate its state-of-the-art performance through an extensive experimental campaign across several regression and classification problems. △ Less

Submitted 23 February, 2021; v1 submitted 6 March, 2020; originally announced March 2020.

arXiv:1912.10200 [pdf, other]

Quantile Propagation for Wasserstein-Approximate Gaussian Processes

Authors: Rui Zhang, Christian J. Walder, Edwin V. Bonilla, Marian-Andrei Rizoiu, Lexing Xie

Abstract: Approximate inference techniques are the cornerstone of probabilistic methods based on Gaussian process priors. Despite this, most work approximately optimizes standard divergence measures such as the Kullback-Leibler (KL) divergence, which lack the basic desiderata for the task at hand, while chiefly offering merely technical convenience. We develop a new approximate inference method for Gaussian… ▽ More Approximate inference techniques are the cornerstone of probabilistic methods based on Gaussian process priors. Despite this, most work approximately optimizes standard divergence measures such as the Kullback-Leibler (KL) divergence, which lack the basic desiderata for the task at hand, while chiefly offering merely technical convenience. We develop a new approximate inference method for Gaussian process models which overcomes the technical challenges arising from abandoning these convenient divergences. Our method---dubbed Quantile Propagation (QP)---is similar to expectation propagation (EP) but minimizes the $L_2$ Wasserstein distance (WD) instead of the KL divergence. The WD exhibits all the required properties of a distance metric, while respecting the geometry of the underlying sample space. We show that QP matches quantile functions rather than moments as in EP and has the same mean update but a smaller variance update than EP, thereby alleviating EP's tendency to over-estimate posterior variances. Crucially, despite the significant complexity of dealing with the WD, QP has the same favorable locality property as EP, and thereby admits an efficient algorithm. Experiments on classification and Poisson regression show that QP outperforms both EP and variational Bayes. △ Less

Submitted 5 November, 2020; v1 submitted 21 December, 2019; originally announced December 2019.

Comments: 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada

arXiv:1906.03161 [pdf, other]

Structured Variational Inference in Continuous Cox Process Models

Authors: Virginia Aglietti, Edwin V. Bonilla, Theodoros Damoulas, Sally Cripps

Abstract: We propose a scalable framework for inference in an inhomogeneous Poisson process modeled by a continuous sigmoidal Cox process that assumes the corresponding intensity function is given by a Gaussian process (GP) prior transformed with a scaled logistic sigmoid function. We present a tractable representation of the likelihood through augmentation with a superposition of Poisson processes. This vi… ▽ More We propose a scalable framework for inference in an inhomogeneous Poisson process modeled by a continuous sigmoidal Cox process that assumes the corresponding intensity function is given by a Gaussian process (GP) prior transformed with a scaled logistic sigmoid function. We present a tractable representation of the likelihood through augmentation with a superposition of Poisson processes. This view enables a structured variational approximation capturing dependencies across variables in the model. Our framework avoids discretization of the domain, does not require accurate numerical integration over the input space and is not limited to GPs with squared exponential kernels. We evaluate our approach on synthetic and real-world data showing that its benefits are particularly pronounced on multivariate input settings where it overcomes the limitations of mean-field methods and sampling schemes. We provide the state of-the-art in terms of speed, accuracy and uncertainty quantification trade-offs. △ Less

Submitted 7 June, 2019; originally announced June 2019.

arXiv:1906.01852 [pdf, other]

Variational Inference for Graph Convolutional Networks in the Absence of Graph Data and Adversarial Settings

Authors: Pantelis Elinas, Edwin V. Bonilla, Louis Tiao

Abstract: We propose a framework that lifts the capabilities of graph convolutional networks (GCNs) to scenarios where no input graph is given and increases their robustness to adversarial attacks. We formulate a joint probabilistic model that considers a prior distribution over graphs along with a GCN-based likelihood and develop a stochastic variational inference algorithm to estimate the graph posterior… ▽ More We propose a framework that lifts the capabilities of graph convolutional networks (GCNs) to scenarios where no input graph is given and increases their robustness to adversarial attacks. We formulate a joint probabilistic model that considers a prior distribution over graphs along with a GCN-based likelihood and develop a stochastic variational inference algorithm to estimate the graph posterior and the GCN parameters jointly. To address the problem of propagating gradients through latent variables drawn from discrete distributions, we use their continuous relaxations known as Concrete distributions. We show that, on real datasets, our approach can outperform state-of-the-art Bayesian and non-Bayesian graph neural network algorithms on the task of semi-supervised classification in the absence of graph data and when the network structure is subjected to adversarial perturbations. △ Less

Submitted 20 October, 2020; v1 submitted 5 June, 2019; originally announced June 2019.

arXiv:1903.03986 [pdf, other]

Scalable Grouped Gaussian Processes via Direct Cholesky Functional Representations

Authors: Astrid Dahl, Edwin V. Bonilla

Abstract: We consider multi-task regression models where observations are assumed to be a linear combination of several latent node and weight functions, all drawn from Gaussian process (GP) priors that allow nonzero covariance between grouped latent functions. We show that when these grouped functions are conditionally independent given a group-dependent pivot, it is possible to parameterize the prior thro… ▽ More We consider multi-task regression models where observations are assumed to be a linear combination of several latent node and weight functions, all drawn from Gaussian process (GP) priors that allow nonzero covariance between grouped latent functions. We show that when these grouped functions are conditionally independent given a group-dependent pivot, it is possible to parameterize the prior through sparse Cholesky factors directly, hence avoiding their computation during inference. Furthermore, we establish that kernels that are multiplicatively separable over input points give rise to such sparse parameterizations naturally without any additional assumptions. Finally, we extend the use of these sparse structures to approximate posteriors within variational inference, further improving scalability on the number of functions. We test our approach on multi-task datasets concerning distributed solar forecasting and show that it outperforms several multi-task GP baselines and that our sparse specifications achieve the same or better accuracy than non-sparse counterparts. △ Less

Submitted 22 July, 2019; v1 submitted 10 March, 2019; originally announced March 2019.

Comments: 14 pages, 4 figures

MSC Class: 62G08 ACM Class: G.3; I.2.6; J.2

arXiv:1806.02543 [pdf, other]

Grouped Gaussian Processes for Solar Power Prediction

Authors: Astrid Dahl, Edwin V. Bonilla

Abstract: We consider multi-task regression models where the observations are assumed to be a linear combination of several latent node functions and weight functions, which are both drawn from Gaussian process priors. Driven by the problem of develo** scalable methods for forecasting distributed solar and other renewable power generation, we propose coupled priors over groups of (node or weight) processe… ▽ More We consider multi-task regression models where the observations are assumed to be a linear combination of several latent node functions and weight functions, which are both drawn from Gaussian process priors. Driven by the problem of develo** scalable methods for forecasting distributed solar and other renewable power generation, we propose coupled priors over groups of (node or weight) processes to exploit spatial dependence between functions. We estimate forecast models for solar power at multiple distributed sites and ground wind speed at multiple proximate weather stations. Our results show that our approach maintains or improves point-prediction accuracy relative to competing solar benchmarks and improves over wind forecast benchmark models on all measures. Our approach consistently dominates the equivalent model without coupled priors, achieving faster gains in forecast accuracy. At the same time our approach provides better quantification of predictive uncertainties. △ Less

Submitted 3 December, 2018; v1 submitted 7 June, 2018; originally announced June 2018.

Comments: 15 pages, 4 figures. Replacement extends last version with further experimental results and additional figures

ACM Class: G.3; I.2.6; J.2

arXiv:1806.01771 [pdf, other]

Cycle-Consistent Adversarial Learning as Approximate Bayesian Inference

Authors: Louis C. Tiao, Edwin V. Bonilla, Fabio Ramos

Abstract: We formalize the problem of learning interdomain correspondences in the absence of paired data as Bayesian inference in a latent variable model (LVM), where one seeks the underlying hidden representations of entities from one domain as entities from the other domain. First, we introduce implicit latent variable models, where the prior over hidden representations can be specified flexibly as an imp… ▽ More We formalize the problem of learning interdomain correspondences in the absence of paired data as Bayesian inference in a latent variable model (LVM), where one seeks the underlying hidden representations of entities from one domain as entities from the other domain. First, we introduce implicit latent variable models, where the prior over hidden representations can be specified flexibly as an implicit distribution. Next, we develop a new variational inference (VI) algorithm for this model based on minimization of the symmetric Kullback-Leibler (KL) divergence between a variational joint and the exact joint distribution. Lastly, we demonstrate that the state-of-the-art cycle-consistent adversarial learning (CYCLEGAN) models can be derived as a special case within our proposed VI framework, thus establishing its connection to approximate Bayesian inference methods. △ Less

Submitted 24 August, 2018; v1 submitted 5 June, 2018; originally announced June 2018.

Comments: Presented at the ICML 2018 Workshop on Theoretical Foundations and Applications of Deep Generative Models. Stockholm, Sweden, 2018

arXiv:1805.10522 [pdf, other]

Calibrating Deep Convolutional Gaussian Processes

Authors: Gia-Lac Tran, Edwin V. Bonilla, John P. Cunningham, Pietro Michiardi, Maurizio Filippone

Abstract: The wide adoption of Convolutional Neural Networks (CNNs) in applications where decision-making under uncertainty is fundamental, has brought a great deal of attention to the ability of these models to accurately quantify the uncertainty in their predictions. Previous work on combining CNNs with Gaussian processes (GPs) has been developed under the assumption that the predictive probabilities of t… ▽ More The wide adoption of Convolutional Neural Networks (CNNs) in applications where decision-making under uncertainty is fundamental, has brought a great deal of attention to the ability of these models to accurately quantify the uncertainty in their predictions. Previous work on combining CNNs with Gaussian processes (GPs) has been developed under the assumption that the predictive probabilities of these models are well-calibrated. In this paper we show that, in fact, current combinations of CNNs and GPs are miscalibrated. We proposes a novel combination that considerably outperforms previous approaches on this aspect, while achieving state-of-the-art performance on image classification tasks. △ Less

Submitted 26 May, 2018; originally announced May 2018.

Comments: 12 pages

arXiv:1702.08530 [pdf, ps, other]

Semi-parametric Network Structure Discovery Models

Authors: Amir Dezfouli, Edwin V. Bonilla, Richard Nock

Abstract: We propose a network structure discovery model for continuous observations that generalizes linear causal models by incorporating a Gaussian process (GP) prior on a network-independent component, and random sparsity and weight matrices as the network-dependent parameters. This approach provides flexible modeling of network-independent trends in the observations as well as uncertainty quantificatio… ▽ More We propose a network structure discovery model for continuous observations that generalizes linear causal models by incorporating a Gaussian process (GP) prior on a network-independent component, and random sparsity and weight matrices as the network-dependent parameters. This approach provides flexible modeling of network-independent trends in the observations as well as uncertainty quantification around the discovered network structure. We establish a connection between our model and multi-task GPs and develop an efficient stochastic variational inference algorithm for it. Furthermore, we formally show that our approach is numerically stable and in fact numerically easy to carry out almost everywhere on the support of the random variables involved. Finally, we evaluate our model on three applications, showing that it outperforms previous approaches. We provide a qualitative and quantitative analysis of the structures discovered for domains such as the study of the full genome regulation of the yeast Saccharomyces cerevisiae. △ Less

Submitted 27 February, 2017; originally announced February 2017.

ACM Class: I.2.6; I.5.1

arXiv:1610.05392 [pdf, other]

AutoGP: Exploring the Capabilities and Limitations of Gaussian Process Models

Authors: Karl Krauth, Edwin V. Bonilla, Kurt Cutajar, Maurizio Filippone

Abstract: We investigate the capabilities and limitations of Gaussian process models by jointly exploring three complementary directions: (i) scalable and statistically efficient inference; (ii) flexible kernels; and (iii) objective functions for hyperparameter learning alternative to the marginal likelihood. Our approach outperforms all previously reported GP methods on the standard MNIST dataset; performs… ▽ More We investigate the capabilities and limitations of Gaussian process models by jointly exploring three complementary directions: (i) scalable and statistically efficient inference; (ii) flexible kernels; and (iii) objective functions for hyperparameter learning alternative to the marginal likelihood. Our approach outperforms all previously reported GP methods on the standard MNIST dataset; performs comparatively to previous kernel-based methods using the RECTANGLES-IMAGE dataset; and breaks the 1% error-rate barrier in GP models using the MNIST8M dataset, showing along the way the scalability of our method at unprecedented scale for GP models (8 million observations) in classification problems. Overall, our approach represents a significant breakthrough in kernel methods and GP models, bridging the gap between deep learning approaches and kernel machines. △ Less

Submitted 5 March, 2017; v1 submitted 17 October, 2016; originally announced October 2016.

Comments: Edited results on RECTANGLES-IMAGE and related comments; minor additional edits

arXiv:1610.04386 [pdf, other]

Random Feature Expansions for Deep Gaussian Processes

Authors: Kurt Cutajar, Edwin V. Bonilla, Pietro Michiardi, Maurizio Filippone

Abstract: The composition of multiple Gaussian Processes as a Deep Gaussian Process (DGP) enables a deep probabilistic nonparametric approach to flexibly tackle complex machine learning problems with sound quantification of uncertainty. Existing inference approaches for DGP models have limited scalability and are notoriously cumbersome to construct. In this work, we introduce a novel formulation of DGPs bas… ▽ More The composition of multiple Gaussian Processes as a Deep Gaussian Process (DGP) enables a deep probabilistic nonparametric approach to flexibly tackle complex machine learning problems with sound quantification of uncertainty. Existing inference approaches for DGP models have limited scalability and are notoriously cumbersome to construct. In this work, we introduce a novel formulation of DGPs based on random feature expansions that we train using stochastic variational inference. This yields a practical learning framework which significantly advances the state-of-the-art in inference for DGPs, and enables accurate quantification of uncertainty. We extensively showcase the scalability and performance of our proposal on several datasets with up to 8 million observations, and various DGP architectures with up to 30 hidden layers. △ Less

Submitted 1 March, 2017; v1 submitted 14 October, 2016; originally announced October 2016.

arXiv:1609.04289 [pdf, other]

Gray-box inference for structured Gaussian process models

Authors: Pietro Galliani, Amir Dezfouli, Edwin V. Bonilla, Novi Quadrianto

Abstract: We develop an automated variational inference method for Bayesian structured prediction problems with Gaussian process (GP) priors and linear-chain likelihoods. Our approach does not need to know the details of the structured likelihood model and can scale up to a large number of observations. Furthermore, we show that the required expected likelihood term and its gradients in the variational obje… ▽ More We develop an automated variational inference method for Bayesian structured prediction problems with Gaussian process (GP) priors and linear-chain likelihoods. Our approach does not need to know the details of the structured likelihood model and can scale up to a large number of observations. Furthermore, we show that the required expected likelihood term and its gradients in the variational objective (ELBO) can be estimated efficiently by using expectations over very low-dimensional Gaussian distributions. Optimization of the ELBO is fully parallelizable over sequences and amenable to stochastic optimization, which we use along with control variate techniques and state-of-the-art incremental optimization to make our framework useful in practice. Results on a set of natural language processing tasks show that our method can be as good as (and sometimes better than) hard-coded approaches including SVM-struct and CRFs, and overcomes the scalability limitations of previous inference algorithms based on sampling. Overall, this is a fundamental step to develo** automated inference methods for Bayesian structured prediction. △ Less

Submitted 14 September, 2016; originally announced September 2016.

arXiv:1609.00577 [pdf, other]

Generic Inference in Latent Gaussian Process Models

Authors: Edwin V. Bonilla, Karl Krauth, Amir Dezfouli

Abstract: We develop an automated variational method for inference in models with Gaussian process (GP) priors and general likelihoods. The method supports multiple outputs and multiple latent functions and does not require detailed knowledge of the conditional likelihood, only needing its evaluation as a black-box function. Using a mixture of Gaussians as the variational distribution, we show that the evid… ▽ More We develop an automated variational method for inference in models with Gaussian process (GP) priors and general likelihoods. The method supports multiple outputs and multiple latent functions and does not require detailed knowledge of the conditional likelihood, only needing its evaluation as a black-box function. Using a mixture of Gaussians as the variational distribution, we show that the evidence lower bound and its gradients can be estimated efficiently using samples from univariate Gaussian distributions. Furthermore, the method is scalable to large datasets which is achieved by using an augmented prior via the inducing-variable approach underpinning most sparse GP approximations, along with parallel computation and stochastic optimization. We evaluate our approach quantitatively and qualitatively with experiments on small datasets, medium-scale datasets and large datasets, showing its competitiveness under different likelihood models and sparsity levels. On the large-scale experiments involving prediction of airline delays and classification of handwritten digits, we show that our method is on par with the state-of-the-art hard-coded approaches for scalable GP regression and classification. △ Less

Submitted 5 November, 2018; v1 submitted 2 September, 2016; originally announced September 2016.

Comments: 61 pages

Showing 1–29 of 29 results for author: Bonilla, E V