-
The VampPrior Mixture Model
Authors:
Andrew Stirn,
David A. Knowles
Abstract:
Current clustering priors for deep latent variable models (DLVMs) require defining the number of clusters a-priori and are susceptible to poor initializations. Addressing these deficiencies could greatly benefit deep learning-based scRNA-seq analysis by performing integration and clustering simultaneously. We adapt the VampPrior (Tomczak & Welling, 2018) into a Dirichlet process Gaussian mixture m…
▽ More
Current clustering priors for deep latent variable models (DLVMs) require defining the number of clusters a-priori and are susceptible to poor initializations. Addressing these deficiencies could greatly benefit deep learning-based scRNA-seq analysis by performing integration and clustering simultaneously. We adapt the VampPrior (Tomczak & Welling, 2018) into a Dirichlet process Gaussian mixture model, resulting in the VampPrior Mixture Model (VMM), a novel prior for DLVMs. We propose an inference procedure that alternates between variational inference and Empirical Bayes to cleanly distinguish variational and prior parameters. Using the VMM in a Variational Autoencoder attains highly competitive clustering performance on benchmark datasets. Augmenting scVI (Lopez et al., 2018), a popular scRNA-seq integration method, with the VMM significantly improves its performance and automatically arranges cells into biologically meaningful clusters.
△ Less
Submitted 4 March, 2024; v1 submitted 6 February, 2024;
originally announced February 2024.
-
Faithful Heteroscedastic Regression with Neural Networks
Authors:
Andrew Stirn,
Hans-Hermann Wessels,
Megan Schertzer,
Laura Pereira,
Neville E. Sanjana,
David A. Knowles
Abstract:
Heteroscedastic regression models a Gaussian variable's mean and variance as a function of covariates. Parametric methods that employ neural networks for these parameter maps can capture complex relationships in the data. Yet, optimizing network parameters via log likelihood gradients can yield suboptimal mean and uncalibrated variance estimates. Current solutions side-step this optimization probl…
▽ More
Heteroscedastic regression models a Gaussian variable's mean and variance as a function of covariates. Parametric methods that employ neural networks for these parameter maps can capture complex relationships in the data. Yet, optimizing network parameters via log likelihood gradients can yield suboptimal mean and uncalibrated variance estimates. Current solutions side-step this optimization problem with surrogate objectives or Bayesian treatments. Instead, we make two simple modifications to optimization. Notably, their combination produces a heteroscedastic model with mean estimates that are provably as accurate as those from its homoscedastic counterpart (i.e.~fitting the mean under squared error loss). For a wide variety of network and task complexities, we find that mean estimates from existing heteroscedastic solutions can be significantly less accurate than those from an equivalently expressive mean-only model. Our approach provably retains the accuracy of an equally flexible mean-only model while also offering best-in-class variance calibration. Lastly, we show how to leverage our method to recover the underlying heteroscedastic noise variance.
△ Less
Submitted 18 December, 2022;
originally announced December 2022.
-
Variational Variance: Simple, Reliable, Calibrated Heteroscedastic Noise Variance Parameterization
Authors:
Andrew Stirn,
David A. Knowles
Abstract:
Brittle optimization has been observed to adversely impact model likelihoods for regression and VAEs when simultaneously fitting neural network map**s from a (random) variable onto the mean and variance of a dependent Gaussian variable. Previous works have bolstered optimization and improved likelihoods, but fail other basic posterior predictive checks (PPCs). Under the PPC framework, we propose…
▽ More
Brittle optimization has been observed to adversely impact model likelihoods for regression and VAEs when simultaneously fitting neural network map**s from a (random) variable onto the mean and variance of a dependent Gaussian variable. Previous works have bolstered optimization and improved likelihoods, but fail other basic posterior predictive checks (PPCs). Under the PPC framework, we propose critiques to test predictive mean and variance calibration and the predictive distribution's ability to generate sensible data. We find that our attractively simple solution, to treat heteroscedastic variance variationally, sufficiently regularizes variance to pass these PPCs. We consider a diverse gamut of existing and novel priors and find our methods preserve or outperform existing model likelihoods while significantly improving parameter calibration and sample quality for regression and VAEs.
△ Less
Submitted 30 October, 2020; v1 submitted 8 June, 2020;
originally announced June 2020.
-
A New Distribution on the Simplex with Auto-Encoding Applications
Authors:
Andrew Stirn,
Tony Jebara,
David A Knowles
Abstract:
We construct a new distribution for the simplex using the Kumaraswamy distribution and an ordered stick-breaking process. We explore and develop the theoretical properties of this new distribution and prove that it exhibits symmetry under the same conditions as the well-known Dirichlet. Like the Dirichlet, the new distribution is adept at capturing sparsity but, unlike the Dirichlet, has an exact…
▽ More
We construct a new distribution for the simplex using the Kumaraswamy distribution and an ordered stick-breaking process. We explore and develop the theoretical properties of this new distribution and prove that it exhibits symmetry under the same conditions as the well-known Dirichlet. Like the Dirichlet, the new distribution is adept at capturing sparsity but, unlike the Dirichlet, has an exact and closed form reparameterization--making it well suited for deep variational Bayesian modeling. We demonstrate the distribution's utility in a variety of semi-supervised auto-encoding tasks. In all cases, the resulting models achieve competitive performance commensurate with their simplicity, use of explicit probability models, and abstinence from adversarial training.
△ Less
Submitted 14 December, 2019; v1 submitted 28 May, 2019;
originally announced May 2019.
-
Stochastic gradient variational Bayes for gamma approximating distributions
Authors:
David A. Knowles
Abstract:
While stochastic variational inference is relatively well known for scaling inference in Bayesian probabilistic models, related methods also offer ways to circumnavigate the approximation of analytically intractable expectations. The key challenge in either setting is controlling the variance of gradient estimates: recent work has shown that for continuous latent variables, particularly multivaria…
▽ More
While stochastic variational inference is relatively well known for scaling inference in Bayesian probabilistic models, related methods also offer ways to circumnavigate the approximation of analytically intractable expectations. The key challenge in either setting is controlling the variance of gradient estimates: recent work has shown that for continuous latent variables, particularly multivariate Gaussians, this can be achieved by using the gradient of the log posterior. In this paper we apply the same idea to gamma distributed latent variables given gamma variational distributions, enabling straightforward "black box" variational inference in models where sparsity and non-negativity are appropriate. We demonstrate the method on a recently proposed gamma process model for network data, as well as a novel sparse factor analysis. We outperform generic sampling algorithms and the approach of using Gaussian variational distributions on transformed variables.
△ Less
Submitted 4 September, 2015;
originally announced September 2015.
-
An Empirical Study of Stochastic Variational Algorithms for the Beta Bernoulli Process
Authors:
Amar Shah,
David A. Knowles,
Zoubin Ghahramani
Abstract:
Stochastic variational inference (SVI) is emerging as the most promising candidate for scaling inference in Bayesian probabilistic models to large datasets. However, the performance of these methods has been assessed primarily in the context of Bayesian topic models, particularly latent Dirichlet allocation (LDA). Deriving several new algorithms, and using synthetic, image and genomic datasets, we…
▽ More
Stochastic variational inference (SVI) is emerging as the most promising candidate for scaling inference in Bayesian probabilistic models to large datasets. However, the performance of these methods has been assessed primarily in the context of Bayesian topic models, particularly latent Dirichlet allocation (LDA). Deriving several new algorithms, and using synthetic, image and genomic datasets, we investigate whether the understanding gleaned from LDA applies in the setting of sparse latent factor models, specifically beta process factor analysis (BPFA). We demonstrate that the big picture is consistent: using Gibbs sampling within SVI to maintain certain posterior dependencies is extremely effective. However, we find that different posterior dependencies are important in BPFA relative to LDA. Particularly, approximations able to model intra-local variable dependence perform best.
△ Less
Submitted 26 June, 2015;
originally announced June 2015.
-
Beta diffusion trees and hierarchical feature allocations
Authors:
Creighton Heaukulani,
David A. Knowles,
Zoubin Ghahramani
Abstract:
We define the beta diffusion tree, a random tree structure with a set of leaves that defines a collection of overlap** subsets of objects, known as a feature allocation. A generative process for the tree structure is defined in terms of particles (representing the objects) diffusing in some continuous space, analogously to the Dirichlet diffusion tree (Neal, 2003), which defines a tree structure…
▽ More
We define the beta diffusion tree, a random tree structure with a set of leaves that defines a collection of overlap** subsets of objects, known as a feature allocation. A generative process for the tree structure is defined in terms of particles (representing the objects) diffusing in some continuous space, analogously to the Dirichlet diffusion tree (Neal, 2003), which defines a tree structure over partitions (i.e., non-overlap** subsets) of the objects. Unlike in the Dirichlet diffusion tree, multiple copies of a particle may exist and diffuse along multiple branches in the beta diffusion tree, and an object may therefore belong to multiple subsets of particles. We demonstrate how to build a hierarchically-clustered factor analysis model with the beta diffusion tree and how to perform inference over the random tree structures with a Markov chain Monte Carlo algorithm. We conclude with several numerical experiments on missing data problems with data sets of gene expression microarrays, international development statistics, and intranational socioeconomic measurements.
△ Less
Submitted 3 April, 2015; v1 submitted 14 August, 2014;
originally announced August 2014.
-
A reversible infinite HMM using normalised random measures
Authors:
Konstantina Palla,
David A. Knowles,
Zoubin Ghahramani
Abstract:
We present a nonparametric prior over reversible Markov chains. We use completely random measures, specifically gamma processes, to construct a countably infinite graph with weighted edges. By enforcing symmetry to make the edges undirected we define a prior over random walks on graphs that results in a reversible Markov chain. The resulting prior over infinite transition matrices is closely relat…
▽ More
We present a nonparametric prior over reversible Markov chains. We use completely random measures, specifically gamma processes, to construct a countably infinite graph with weighted edges. By enforcing symmetry to make the edges undirected we define a prior over random walks on graphs that results in a reversible Markov chain. The resulting prior over infinite transition matrices is closely related to the hierarchical Dirichlet process but enforces reversibility. A reinforcement scheme has recently been proposed with similar properties, but the de Finetti measure is not well characterised. We take the alternative approach of explicitly constructing the mixing measure, which allows more straightforward and efficient inference at the cost of no longer having a closed form predictive distribution. We use our process to construct a reversible infinite HMM which we apply to two real datasets, one from epigenomics and one ion channel recording.
△ Less
Submitted 17 March, 2014;
originally announced March 2014.
-
On Using Control Variates with Stochastic Approximation for Variational Bayes and its Connection to Stochastic Linear Regression
Authors:
Tim Salimans,
David A. Knowles
Abstract:
Recently, we and several other authors have written about the possibilities of using stochastic approximation techniques for fitting variational approximations to intractable Bayesian posterior distributions. Naive implementations of stochastic approximation suffer from high variance in this setting. Several authors have therefore suggested using control variates to reduce this variance, while we…
▽ More
Recently, we and several other authors have written about the possibilities of using stochastic approximation techniques for fitting variational approximations to intractable Bayesian posterior distributions. Naive implementations of stochastic approximation suffer from high variance in this setting. Several authors have therefore suggested using control variates to reduce this variance, while we have taken a different but analogous approach to reducing the variance which we call stochastic linear regression. In this note we take the former perspective and derive the ideal set of control variates for stochastic approximation variational Bayes under a certain set of assumptions. We then show that using these control variates is closely related to using the stochastic linear regression approximation technique we proposed earlier. A simple example shows that our method for constructing control variates leads to stochastic estimators with much lower variance compared to other approaches.
△ Less
Submitted 12 January, 2014; v1 submitted 6 January, 2014;
originally announced January 2014.
-
The Supervised IBP: Neighbourhood Preserving Infinite Latent Feature Models
Authors:
Novi Quadrianto,
Viktoriia Sharmanska,
David A. Knowles,
Zoubin Ghahramani
Abstract:
We propose a probabilistic model to infer supervised latent variables in the Hamming space from observed data. Our model allows simultaneous inference of the number of binary latent variables, and their values. The latent variables preserve neighbourhood structure of the data in a sense that objects in the same semantic concept have similar latent values, and objects in different concepts have dis…
▽ More
We propose a probabilistic model to infer supervised latent variables in the Hamming space from observed data. Our model allows simultaneous inference of the number of binary latent variables, and their values. The latent variables preserve neighbourhood structure of the data in a sense that objects in the same semantic concept have similar latent values, and objects in different concepts have dissimilar latent values. We formulate the supervised infinite latent variable problem based on an intuitive principle of pulling objects together if they are of the same type, and pushing them apart if they are not. We then combine this principle with a flexible Indian Buffet Process prior on the latent variables. We show that the inferred supervised latent variables can be directly used to perform a nearest neighbour search for the purpose of retrieval. We introduce a new application of dynamically extending hash codes, and show how to effectively couple the structure of the hash codes with continuously growing structure of the neighbourhood preserving infinite latent feature space.
△ Less
Submitted 26 September, 2013;
originally announced September 2013.
-
A dependent partition-valued process for multitask clustering and time evolving network modelling
Authors:
Konstantina Palla,
David A. Knowles,
Zoubin Ghahramani
Abstract:
The fundamental aim of clustering algorithms is to partition data points. We consider tasks where the discovered partition is allowed to vary with some covariate such as space or time. One approach would be to use fragmentation-coagulation processes, but these, being Markov processes, are restricted to linear or tree structured covariate spaces. We define a partition-valued process on an arbitrary…
▽ More
The fundamental aim of clustering algorithms is to partition data points. We consider tasks where the discovered partition is allowed to vary with some covariate such as space or time. One approach would be to use fragmentation-coagulation processes, but these, being Markov processes, are restricted to linear or tree structured covariate spaces. We define a partition-valued process on an arbitrary covariate space using Gaussian processes. We use the process to construct a multitask clustering model which partitions datapoints in a similar way across multiple data sources, and a time series model of network data which allows cluster assignments to vary over time. We describe sampling algorithms for inference and apply our method to defining cancer subtypes based on different types of cellular characteristics, finding regulatory modules from gene expression data from multiple human populations, and discovering time varying community structure in a social network.
△ Less
Submitted 31 October, 2013; v1 submitted 13 March, 2013;
originally announced March 2013.
-
Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression
Authors:
Tim Salimans,
David A. Knowles
Abstract:
We propose a general algorithm for approximating nonstandard Bayesian posterior distributions. The algorithm minimizes the Kullback-Leibler divergence of an approximating distribution to the intractable posterior distribution. Our method can be used to approximate any posterior distribution, provided that it is given in closed form up to the proportionality constant. The approximation can be any d…
▽ More
We propose a general algorithm for approximating nonstandard Bayesian posterior distributions. The algorithm minimizes the Kullback-Leibler divergence of an approximating distribution to the intractable posterior distribution. Our method can be used to approximate any posterior distribution, provided that it is given in closed form up to the proportionality constant. The approximation can be any distribution in the exponential family or any mixture of such distributions, which means that it can be made arbitrarily precise. Several examples illustrate the speed and accuracy of our approximation method in practice.
△ Less
Submitted 28 July, 2014; v1 submitted 28 June, 2012;
originally announced June 2012.
-
Gaussian Process Regression Networks
Authors:
Andrew Gordon Wilson,
David A. Knowles,
Zoubin Ghahramani
Abstract:
We introduce a new regression framework, Gaussian process regression networks (GPRN), which combines the structural properties of Bayesian neural networks with the non-parametric flexibility of Gaussian processes. This model accommodates input dependent signal and noise correlations between multiple response variables, input dependent length-scales and amplitudes, and heavy-tailed predictive distr…
▽ More
We introduce a new regression framework, Gaussian process regression networks (GPRN), which combines the structural properties of Bayesian neural networks with the non-parametric flexibility of Gaussian processes. This model accommodates input dependent signal and noise correlations between multiple response variables, input dependent length-scales and amplitudes, and heavy-tailed predictive distributions. We derive both efficient Markov chain Monte Carlo and variational Bayes inference procedures for this model. We apply GPRN as a multiple output regression and multivariate volatility model, demonstrating substantially improved performance over eight popular multiple output (multi-task) Gaussian process models and three multivariate volatility models on benchmark datasets, including a 1000 dimensional gene expression dataset.
△ Less
Submitted 19 October, 2011;
originally announced October 2011.
-
Pitman-Yor Diffusion Trees
Authors:
David A. Knowles,
Zoubin Ghahramani
Abstract:
We introduce the Pitman Yor Diffusion Tree (PYDT) for hierarchical clustering, a generalization of the Dirichlet Diffusion Tree (Neal, 2001) which removes the restriction to binary branching structure. The generative process is described and shown to result in an exchangeable distribution over data points. We prove some theoretical properties of the model and then present two inference methods: a…
▽ More
We introduce the Pitman Yor Diffusion Tree (PYDT) for hierarchical clustering, a generalization of the Dirichlet Diffusion Tree (Neal, 2001) which removes the restriction to binary branching structure. The generative process is described and shown to result in an exchangeable distribution over data points. We prove some theoretical properties of the model and then present two inference methods: a collapsed MCMC sampler which allows us to model uncertainty over tree structures, and a computationally efficient greedy Bayesian EM search algorithm. Both algorithms use message passing on the tree structure. The utility of the model and algorithms is demonstrated on synthetic and real world data, both continuous and binary.
△ Less
Submitted 16 June, 2011; v1 submitted 13 June, 2011;
originally announced June 2011.