-
Deep Bayesian Active Learning with Image Data
Authors:
Yarin Gal,
Riashat Islam,
Zoubin Ghahramani
Abstract:
Even though active learning forms an important pillar of machine learning, deep learning tools are not prevalent within it. Deep learning poses several difficulties when used in an active learning setting. First, active learning (AL) methods generally rely on being able to learn and update models from small amounts of data. Recent advances in deep learning, on the other hand, are notorious for their dependence on large amounts of data. Second, many AL acquisition functions rely on model uncertainty, yet deep learning methods rarely represent such model uncertainty. In this paper we combine recent advances in Bayesian deep learning into the active learning framework in a practical way. We develop an active learning framework for high dimensional data, a task which has been extremely challenging so far, with very sparse existing literature. Taking advantage of specialised models such as Bayesian convolutional neural networks, we demonstrate our active learning techniques with image data, obtaining a significant improvement on existing active learning approaches. We demonstrate this on both the MNIST dataset, as well as for skin cancer diagnosis from lesion images (ISIC2016 task).
Submitted 8 March, 2017;
originally announced March 2017.
-
A Theoretically Grounded Application of Dropout in Recurrent Neural Networks
Authors:
Yarin Gal,
Zoubin Ghahramani
Abstract:
Recurrent neural networks (RNNs) stand at the forefront of many recent developments in deep learning. Yet a major difficulty with these models is their tendency to overfit, with dropout shown to fail when applied to recurrent layers. Recent results at the intersection of Bayesian modelling and deep learning offer a Bayesian interpretation of common deep learning techniques such as dropout. This grounding of dropout in approximate Bayesian inference suggests an extension of the theoretical results, offering insights into the use of dropout with RNN models. We apply this new variational inference based dropout technique in LSTM and GRU models, assessing it on language modelling and sentiment analysis tasks. The new approach outperforms existing techniques, and to the best of our knowledge improves on the single model state-of-the-art in language modelling with the Penn Treebank (73.4 test perplexity). This extends our arsenal of variational tools in deep learning.
Submitted 5 October, 2016; v1 submitted 16 December, 2015;
originally announced December 2015.
-
Dirichlet Fragmentation Processes
Authors:
Hong Ge,
Yarin Gal,
Zoubin Ghahramani
Abstract:
Tree structures are ubiquitous in data across many domains, and many datasets are naturally modelled by unobserved tree structures. In this paper, first we review the theory of random fragmentation processes [Bertoin, 2006], and a number of existing methods for modelling trees, including the popular nested Chinese restaurant process (nCRP). Then we define a general class of probability distributions over trees: the Dirichlet fragmentation process (DFP) through a novel combination of the theory of Dirichlet processes and random fragmentation processes. This DFP presents a stick-breaking construction, and relates to the nCRP in the same way the Dirichlet process relates to the Chinese restaurant process. Furthermore, we develop a novel hierarchical mixture model with the DFP, and empirically compare the new model to similar models in machine learning. Experiments show the DFP mixture model to be convincingly better than existing state-of-the-art approaches for hierarchical clustering and density modelling.
Submitted 15 September, 2015;
originally announced September 2015.
-
Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference
Authors:
Yarin Gal,
Zoubin Ghahramani
Abstract:
Convolutional neural networks (CNNs) work well on large datasets. But labelled data is hard to collect, and in some applications larger amounts of data are not available. The problem then is how to use CNNs with small data -- as CNNs overfit quickly. We present an efficient Bayesian CNN, offering better robustness to over-fitting on small data than traditional approaches. This is by placing a probability distribution over the CNN's kernels. We approximate our model's intractable posterior with Bernoulli variational distributions, requiring no additional model parameters.
On the theoretical side, we cast dropout network training as approximate inference in Bayesian neural networks. This allows us to implement our model using existing tools in deep learning with no increase in time complexity, while highlighting a negative result in the field. We show a considerable improvement in classification accuracy compared to standard techniques and improve on published state-of-the-art results for CIFAR-10.
Submitted 18 January, 2016; v1 submitted 6 June, 2015;
originally announced June 2015.
-
Dropout as a Bayesian Approximation: Appendix
Authors:
Yarin Gal,
Zoubin Ghahramani
Abstract:
We show that a neural network with arbitrary depth and non-linearities, with dropout applied before every weight layer, is mathematically equivalent to an approximation to a well known Bayesian model. This interpretation might offer an explanation to some of dropout's key properties, such as its robustness to over-fitting. Our interpretation allows us to reason about uncertainty in deep learning, and allows the introduction of the Bayesian machinery into existing deep learning frameworks in a principled way.
This document is an appendix for the main paper "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning" by Gal and Ghahramani, 2015.
Submitted 25 May, 2016; v1 submitted 6 June, 2015;
originally announced June 2015.
-
Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning
Authors:
Yarin Gal,
Zoubin Ghahramani
Abstract:
Deep learning tools have gained tremendous attention in applied machine learning. However such tools for regression and classification do not capture model uncertainty. In comparison, Bayesian models offer a mathematically grounded framework to reason about model uncertainty, but usually come with a prohibitive computational cost. In this paper we develop a new theoretical framework casting dropout training in deep neural networks (NNs) as approximate Bayesian inference in deep Gaussian processes. A direct result of this theory gives us tools to model uncertainty with dropout NNs -- extracting information from existing models that has been thrown away so far. This mitigates the problem of representing uncertainty in deep learning without sacrificing either computational complexity or test accuracy. We perform an extensive study of the properties of dropout's uncertainty. Various network architectures and non-linearities are assessed on tasks of regression and classification, using MNIST as an example. We show a considerable improvement in predictive log-likelihood and RMSE compared to existing state-of-the-art methods, and finish by using dropout's uncertainty in deep reinforcement learning.
Submitted 4 October, 2016; v1 submitted 6 June, 2015;
originally announced June 2015.
-
Improving the Gaussian Process Sparse Spectrum Approximation by Representing Uncertainty in Frequency Inputs
Authors:
Yarin Gal,
Richard Turner
Abstract:
Standard sparse pseudo-input approximations to the Gaussian process (GP) cannot handle complex functions well. Sparse spectrum alternatives attempt to answer this but are known to over-fit. We suggest the use of variational inference for the sparse spectrum approximation to avoid both issues. We model the covariance function with a finite Fourier series approximation and treat it as a random variable. The random covariance function has a posterior, on which a variational distribution is placed. The variational distribution transforms the random covariance function to fit the data. We study the properties of our approximate inference, compare it to alternative ones, and extend it to the distributed and stochastic domains. Our approximation captures complex functions better than standard approaches and avoids over-fitting.
Submitted 20 March, 2015; v1 submitted 9 March, 2015;
originally announced March 2015.
-
Latent Gaussian Processes for Distribution Estimation of Multivariate Categorical Data
Authors:
Yarin Gal,
Yutian Chen,
Zoubin Ghahramani
Abstract:
Multivariate categorical data occur in many applications of machine learning. One of the main difficulties with these vectors of categorical variables is sparsity. The number of possible observations grows exponentially with vector length, but dataset diversity might be poor in comparison. Recent models have gained significant improvement in supervised tasks with this data. These models embed observations in a continuous space to capture similarities between them. Building on these ideas we propose a Bayesian model for the unsupervised task of distribution estimation of multivariate categorical data. We model vectors of categorical variables as generated from a non-linear transformation of a continuous latent space. Non-linearity captures multi-modality in the distribution. The continuous representation addresses sparsity. Our model ties together many existing models, linking the linear categorical latent Gaussian model, the Gaussian process latent variable model, and Gaussian process classification. We derive inference for our model based on recent developments in sampling based variational inference. We show empirically that the model outperforms its linear and discrete counterparts in imputation tasks of sparse data.
Submitted 7 March, 2015;
originally announced March 2015.
-
Semantics, Modelling, and the Problem of Representation of Meaning -- a Brief Survey of Recent Literature
Authors:
Yarin Gal
Abstract:
Over the past 50 years many have debated what representation should be used to capture the meaning of natural language utterances. Recently new needs of such representations have been raised in research. Here I survey some of the interesting representations suggested to answer for these new needs.
Submitted 28 February, 2014;
originally announced February 2014.
-
Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models - a Gentle Tutorial
Authors:
Yarin Gal,
Mark van der Wilk
Abstract:
In this tutorial we explain the inference procedures developed for sparse Gaussian process (GP) regression and the Gaussian process latent variable model (GPLVM). Due to page limits, the derivations given in Titsias (2009) and Titsias & Lawrence (2010) are brief, hence getting a full picture of them requires collecting results from several different sources and a substantial amount of algebra to fill in the gaps. Our main goal is thus to collect all the results and full derivations into one place to help speed up understanding of this work. In doing so we present a re-parametrisation of the inference that allows it to be carried out in parallel. A secondary goal for this document is, therefore, to accompany our paper and open-source implementation of the parallel inference scheme for the models. We hope that this document will bridge the gap between the equations as implemented in code and those published in the original papers, in order to make it easier to extend existing work. We assume prior knowledge of Gaussian processes and variational inference, but we also include references for further reading where appropriate.
Submitted 29 September, 2014; v1 submitted 6 February, 2014;
originally announced February 2014.
-
Distributed Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models
Authors:
Yarin Gal,
Mark van der Wilk,
Carl E. Rasmussen
Abstract:
Gaussian processes (GPs) are a powerful tool for probabilistic inference over functions. They have been applied to both regression and non-linear dimensionality reduction, and offer desirable properties such as uncertainty estimates, robustness to over-fitting, and principled ways for tuning hyper-parameters. However the scalability of these models to big datasets remains an active topic of research. We introduce a novel re-parametrisation of variational inference for sparse GP regression and latent variable models that allows for an efficient distributed algorithm. This is done by exploiting the decoupling of the data given the inducing points to re-formulate the evidence lower bound in a Map-Reduce setting. We show that the inference scales well with data and computational resources, while preserving a balanced distribution of the load among the nodes. We further demonstrate the utility in scaling Gaussian processes to big data. We show that GP performance improves with increasing amounts of data in regression (on flight data with 2 million records) and latent variable modelling (on MNIST). The results show that GPs perform better than many common models often used for big data.
Submitted 29 September, 2014; v1 submitted 6 February, 2014;
originally announced February 2014.
-
Networks of Influence Diagrams: A Formalism for Representing Agents' Beliefs and Decision-Making Processes
Authors:
Yaakov Gal,
Avi Pfeffer
Abstract:
This paper presents Networks of Influence Diagrams (NID), a compact, natural and highly expressive language for reasoning about agents' beliefs and decision-making processes. NIDs are graphical structures in which agents' mental models are represented as nodes in a network; a mental model for an agent may itself use descriptions of the mental models of other agents. NIDs are demonstrated by examples, showing how they can be used to describe conflicting and cyclic belief structures, and certain forms of bounded rationality. In an opponent modeling domain, NIDs were able to outperform other computational agents whose strategies were not known in advance. NIDs are equivalent in representation to Bayesian games but they are more compact and structured than this formalism. In particular, the equilibrium definition for NIDs makes an explicit distinction between agents' optimal strategies, and how they actually behave in reality.
Submitted 14 January, 2014;
originally announced January 2014.