-
Learning in Deep Factor Graphs with Gaussian Belief Propagation
Authors:
Seth Nabarro,
Mark van der Wilk,
Andrew J Davison
Abstract:
We propose an approach to do learning in Gaussian factor graphs. We treat all relevant quantities (inputs, outputs, parameters, latents) as random variables in a graphical model, and view both training and prediction as inference problems with different observed nodes. Our experiments show that these problems can be efficiently solved with belief propagation (BP), whose updates are inherently local, presenting exciting opportunities for distributed and asynchronous training. Our approach can be scaled to deep networks and provides a natural means to do continual learning: use the BP-estimated parameter marginals of the current task as parameter priors for the next. On a video denoising task we demonstrate the benefit of learnable parameters over a classical factor graph approach and we show encouraging performance of deep factor graphs for continual image classification.
Submitted 28 February, 2024; v1 submitted 24 November, 2023;
originally announced November 2023.
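Illustrative note: the continual-learning idea in the abstract above (reuse the parameter marginals from one task as priors for the next) can be sketched with a single scalar parameter and a conjugate Gaussian update. This is a minimal sketch, not the paper's Gaussian belief propagation procedure; the function name `gaussian_update`, the toy data, and the noise variance are all assumptions made for illustration.

```python
def gaussian_update(prior_mean, prior_var, observations, noise_var):
    """Posterior over a scalar mean parameter, given a Gaussian prior and
    i.i.d. Gaussian observations with known noise variance (hypothetical toy model)."""
    prior_precision = 1.0 / prior_var
    data_precision = len(observations) / noise_var
    post_precision = prior_precision + data_precision
    # Precision-weighted combination of the prior mean and the data.
    post_mean = (prior_precision * prior_mean + sum(observations) / noise_var) / post_precision
    return post_mean, 1.0 / post_precision


# Two "tasks" observed one after the other (made-up numbers).
task_a = [1.0, 1.2, 0.8]
task_b = [1.1, 0.9]
noise_var = 0.5

# Task A: start from a broad prior.
mean_a, var_a = gaussian_update(0.0, 10.0, task_a, noise_var)

# Task B: the posterior from task A becomes the prior.
mean_ab, var_ab = gaussian_update(mean_a, var_a, task_b, noise_var)

# Reference: a single update on both tasks at once.
mean_batch, var_batch = gaussian_update(0.0, 10.0, task_a + task_b, noise_var)
```

In this exactly conjugate toy case the sequential result matches the single batch update, which is the property that makes posterior-as-prior a natural continual-learning scheme. With approximate inference the two would generally differ.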
-
Data augmentation in Bayesian neural networks and the cold posterior effect
Authors:
Seth Nabarro,
Stoil Ganev,
Adrià Garriga-Alonso,
Vincent Fortuin,
Mark van der Wilk,
Laurence Aitchison
Abstract:
Bayesian neural networks that incorporate data augmentation implicitly use a ``randomly perturbed log-likelihood [which] does not have a clean interpretation as a valid likelihood function'' (Izmailov et al. 2021). Here, we provide several approaches to developing principled Bayesian neural networks incorporating data augmentation. We introduce a ``finite orbit'' setting which allows likelihoods to be computed exactly, and give tight multi-sample bounds in the more usual ``full orbit'' setting. These models cast light on the origin of the cold posterior effect. In particular, we find that the cold posterior effect persists even in these principled models incorporating data augmentation. This suggests that the cold posterior effect cannot be dismissed as an artifact of data augmentation using incorrect likelihoods.
Submitted 9 December, 2021; v1 submitted 10 June, 2021;
originally announced June 2021.
-
Hardware-accelerated Simulation-based Inference of Stochastic Epidemiology Models for COVID-19
Authors:
Sourabh Kulkarni,
Mario Michael Krell,
Seth Nabarro,
Csaba Andras Moritz
Abstract:
Epidemiology models are central in understanding and controlling large scale pandemics. Several epidemiology models require simulation-based inference such as Approximate Bayesian Computation (ABC) to fit their parameters to observations. ABC inference is highly amenable to efficient hardware acceleration. In this work, we develop parallel ABC inference of a stochastic epidemiology model for COVID-19. The statistical inference framework is implemented and compared on Intel Xeon CPU, NVIDIA Tesla V100 GPU and the Graphcore Mk1 IPU, and the results are discussed in the context of their computational architectures. Results show that GPUs are 4x and IPUs are 30x faster than Xeon CPUs. Extensive performance analysis indicates that the difference between IPU and GPU can be attributed to higher communication bandwidth, closeness of memory to compute, and higher compute power in the IPU. The proposed framework scales across 16 IPUs, with scaling overhead not exceeding 8% for the experiments performed. We present an example of our framework in practice, performing inference on the epidemiology model across three countries, and giving a brief overview of the results.
Submitted 22 December, 2020;
originally announced December 2020.
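Illustrative note: Approximate Bayesian Computation, as named in the abstract above, fits parameters by simulating from the model and keeping parameter draws whose simulated output lands close to the observation. The sketch below is plain rejection ABC on a toy growth model, assuming a uniform prior and an absolute-difference distance. The model, the function names, and all numbers are hypothetical and are not taken from the paper.

```python
import random


def simulate_infections(rate, days, rng):
    """Toy stochastic growth model: each day, every existing case produces
    one new case with probability `rate`. Returns the final case count."""
    cases = 1
    for _ in range(days):
        new_cases = sum(1 for _ in range(cases) if rng.random() < rate)
        cases += new_cases
    return cases


def abc_rejection(observed, days, n_samples, tolerance, seed=0):
    """Rejection ABC: draw a rate from the prior, simulate, and keep the draw
    if the simulated count is within `tolerance` of the observed count."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_samples):
        rate = rng.uniform(0.0, 0.5)  # assumed prior: Uniform(0, 0.5)
        simulated = simulate_infections(rate, days, rng)
        if abs(simulated - observed) <= tolerance:
            accepted.append(rate)
    return accepted


if __name__ == "__main__":
    samples = abc_rejection(observed=20, days=10, n_samples=2000, tolerance=5)
    if samples:
        print("accepted draws:", len(samples))
        print("approximate posterior mean rate:", sum(samples) / len(samples))
```

Each prior draw is simulated independently of the others, which is why this style of inference maps well onto parallel hardware.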
-
Parallel Training of Deep Networks with Local Updates
Authors:
Michael Laskin,
Luke Metz,
Seth Nabarro,
Mark Saroufim,
Badreddine Noune,
Carlo Luschi,
Jascha Sohl-Dickstein,
Pieter Abbeel
Abstract:
Deep learning models trained on large data sets have been widely successful in both vision and language domains. As state-of-the-art deep learning architectures have continued to grow in parameter count so have the compute budgets and times required to train them, increasing the need for compute-efficient methods that parallelize training. Two common approaches to parallelize the training of deep networks have been data and model parallelism. While useful, data and model parallelism suffer from diminishing returns in terms of compute efficiency for large batch sizes. In this paper, we investigate how to continue scaling compute efficiently beyond the point of diminishing returns for large batches through local parallelism, a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation. Local parallelism enables fully asynchronous layer-wise parallelism with a low memory footprint, and requires little communication overhead compared with model parallelism. We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
Submitted 15 June, 2021; v1 submitted 7 December, 2020;
originally announced December 2020.
-
Spatiotemporal Prediction of Ambulance Demand using Gaussian Process Regression
Authors:
Seth Nabarro,
Tristan Fletcher,
John Shawe-Taylor
Abstract:
Accurately predicting when and where ambulance call-outs occur can reduce response times and ensure the patient receives urgent care sooner. Here we present a novel method for ambulance demand prediction using Gaussian process regression (GPR) in time and geographic space. The method exhibits superior accuracy to MEDIC, a method which has been used in industry. The use of GPR has additional benefits such as the quantification of uncertainty with each prediction, the choice of kernel functions to encode prior knowledge and the ability to capture spatial correlation. Measures to increase the utility of GPR in the current context, with large training sets and a Poisson-distributed output, are outlined.
Submitted 28 June, 2018;
originally announced June 2018.
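Illustrative note: the uncertainty quantification that the abstract above attributes to Gaussian process regression can be seen in a small one-dimensional example. This is a minimal sketch assuming a squared-exponential kernel, a Gaussian likelihood, and made-up data. It does not cover the space-time inputs or the Poisson-distributed output discussed in the paper, and every name in it is hypothetical.

```python
import math


def rbf(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between two scalar inputs."""
    return variance * math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def solve_cholesky(L, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_predict(x_train, y_train, x_test, noise_var=0.1):
    """Posterior mean and variance of the latent function at one test input."""
    n = len(x_train)
    K = [[rbf(x_train[i], x_train[j]) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = solve_cholesky(L, y_train)
    k_star = [rbf(x, x_test) for x in x_train]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_cholesky(L, k_star)
    var = rbf(x_test, x_test) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, var


# Made-up training data: demand level at three time points.
x_train = [0.0, 1.0, 2.0]
y_train = [0.5, 1.0, 0.3]

mean_near, var_near = gp_predict(x_train, y_train, 1.0)
mean_far, var_far = gp_predict(x_train, y_train, 100.0)
```

Near the training inputs the predictive variance shrinks, while far from them it returns to the prior variance and the mean returns to zero. The kernel is where prior knowledge, such as periodicity or spatial correlation, would be encoded.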