Search | arXiv e-print repository

Diffusion Model With Optimal Covariance Matching

Authors: Zijing Ou, Mingtian Zhang, Andi Zhang, Tim Z. Xiao, Yingzhen Li, David Barber

Abstract: The probabilistic diffusion model has become highly effective across various domains. Typically, sampling from a diffusion model involves using a denoising distribution characterized by a Gaussian with a learned mean and either fixed or learned covariances. In this paper, we leverage the recently proposed full covariance moment matching technique and introduce a novel method for learning covariances. Unlike traditional data-driven covariance approximation approaches, our method involves directly regressing the optimal analytic covariance using a new, unbiased objective named Optimal Covariance Matching (OCM). This approach can significantly reduce the approximation error in covariance prediction. We demonstrate how our method can substantially enhance the sampling efficiency of both Markovian (DDPM) and non-Markovian (DDIM) diffusion model families.

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2402.17512 [pdf, other]

Latent Attention for Linear Time Transformers

Authors: Rares Dolga, Marius Cobzarenco, David Barber

Abstract: The time complexity of the standard attention mechanism in a transformer scales quadratically with the length of the sequence. We introduce a method to reduce this to linear scaling with time, based on defining attention via latent vectors. The method is readily usable as a drop-in replacement for the standard attention mechanism. Our "Latte Transformer" model can be implemented for both bidirectional and unidirectional tasks, with the causal version allowing a recurrent implementation which is memory and time-efficient during inference of language generation tasks. Whilst next token prediction scales linearly with the sequence length for a standard transformer, a Latte Transformer requires constant time to compute the next token. The empirical performance of our method is comparable to standard attention, yet allows scaling to context windows much larger than practical in standard attention.

Submitted 4 March, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

arXiv:2402.12177 [pdf, ps, other]

Mafin: Enhancing Black-Box Embeddings with Model Augmented Fine-Tuning

Authors: Mingtian Zhang, Shawn Lan, Peter Hayes, David Barber

Abstract: Retrieval Augmented Generation (RAG) has emerged as an effective solution for mitigating hallucinations in Large Language Models (LLMs). The retrieval stage in RAG typically involves a pre-trained embedding model, which converts queries and passages into vectors to capture their semantics. However, a standard pre-trained embedding model may exhibit sub-optimal performance when applied to specific domain knowledge, necessitating fine-tuning. This paper addresses scenarios where the embeddings are only available from a black-box model. We introduce Model augmented fine-tuning (Mafin) -- a novel approach for fine-tuning a black-box embedding model by augmenting it with a trainable embedding model. Our results demonstrate that Mafin significantly enhances the performance of the black-box embeddings by only requiring the training of a small augmented model. We validate the effectiveness of our method on both labeled and unlabeled datasets, illustrating its broad applicability and efficiency.

Submitted 12 March, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

arXiv:2402.08114 [pdf, other]

Active Preference Learning for Large Language Models

Authors: William Muldrew, Peter Hayes, Mingtian Zhang, David Barber

Abstract: As large language models (LLMs) become more capable, fine-tuning techniques for aligning with human intent are increasingly important. A key consideration for aligning these models is how to most effectively use human resources, or model resources in the case where LLMs themselves are used as oracles. Reinforcement learning from Human or AI preferences (RLHF/RLAIF) is the most prominent example of such a technique, but is complex and often unstable. Direct Preference Optimization (DPO) has recently been proposed as a simpler and more stable alternative. In this work, we develop an active learning strategy for DPO to make better use of preference labels. We propose a practical acquisition function for prompt/completion pairs based on the predictive entropy of the language model and a measure of certainty of the implicit preference model optimized by DPO. We demonstrate how our approach improves both the rate of learning and final performance of fine-tuning on pairwise preference data.

Submitted 28 June, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

Comments: 13 pages, 5 figures, 6 tables

arXiv:2402.03008 [pdf, other]

Diffusive Gibbs Sampling

Authors: Wenlin Chen, Mingtian Zhang, Brooks Paige, José Miguel Hernández-Lobato, David Barber

Abstract: The inadequate mixing of conventional Markov Chain Monte Carlo (MCMC) methods for multi-modal distributions presents a significant challenge in practical applications such as Bayesian inference and molecular dynamics. Addressing this, we propose Diffusive Gibbs Sampling (DiGS), an innovative family of sampling methods designed for effective sampling from distributions characterized by distant and disconnected modes. DiGS integrates recent developments in diffusion models, leveraging Gaussian convolution to create an auxiliary noisy distribution that bridges isolated modes in the original space and applying Gibbs sampling to alternately draw samples from both spaces. A novel Metropolis-within-Gibbs scheme is proposed to enhance mixing in the denoising sampling step. DiGS exhibits a better mixing property for sampling multi-modal distributions than state-of-the-art methods such as parallel tempering, attaining substantially improved performance across various tasks, including mixtures of Gaussians, Bayesian neural networks and molecular dynamics.

Submitted 29 May, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

Comments: Accepted for publication at ICML 2024. Code available: https://github.com/Wenlin-Chen/DiGS

arXiv:2309.03851 [pdf, other]

CenTime: Event-Conditional Modelling of Censoring in Survival Analysis

Authors: Ahmed H. Shahin, An Zhao, Alexander C. Whitehead, Daniel C. Alexander, Joseph Jacob, David Barber

Abstract: Survival analysis is a valuable tool for estimating the time until specific events, such as death or cancer recurrence, based on baseline observations. This is particularly useful in healthcare to prognostically predict clinically important events based on patient data. However, existing approaches often have limitations; some focus only on ranking patients by survivability, neglecting to estimate the actual event time, while others treat the problem as a classification task, ignoring the inherent time-ordered structure of the events. Furthermore, the effective utilization of censored samples - training data points where the exact event time is unknown - is essential for improving the predictive accuracy of the model. In this paper, we introduce CenTime, a novel approach to survival analysis that directly estimates the time to event. Our method features an innovative event-conditional censoring mechanism that performs robustly even when uncensored data is scarce. We demonstrate that our approach forms a consistent estimator for the event model parameters, even in the absence of uncensored data. Furthermore, CenTime is easily integrated with deep learning models with no restrictions on batch size or the number of uncensored samples. We compare our approach with standard survival analysis methods, including the Cox proportional-hazard model and DeepHit. Our results indicate that CenTime offers state-of-the-art performance in predicting time-to-death while maintaining comparable ranking performance. Our implementation is publicly available at https://github.com/ahmedhshahin/CenTime.

Submitted 10 January, 2024; v1 submitted 7 September, 2023; originally announced September 2023.

arXiv:2305.11650 [pdf, other]

Moment Matching Denoising Gibbs Sampling

Authors: Mingtian Zhang, Alex Hawkins-Hooker, Brooks Paige, David Barber

Abstract: Energy-Based Models (EBMs) offer a versatile framework for modeling complex data distributions. However, training and sampling from EBMs continue to pose significant challenges. The widely-used Denoising Score Matching (DSM) method for scalable EBM training suffers from inconsistency issues, causing the energy model to learn a `noisy' data distribution. In this work, we propose an efficient sampling framework: (pseudo)-Gibbs sampling with moment matching, which enables effective sampling from the underlying clean model when given a `noisy' model that has been well-trained via DSM. We explore the benefits of our approach compared to related methods and demonstrate how to scale the method to high-dimensional datasets.

Submitted 19 March, 2024; v1 submitted 19 May, 2023; originally announced May 2023.

arXiv:2305.11023 [pdf, other]

Generalized Multiple Intent Conditioned Slot Filling

Authors: Harshil Shah, Arthur Wilcke, Marius Cobzarenco, Cristi Cobzarenco, Edward Challis, David Barber

Abstract: Natural language understanding includes the tasks of intent detection (identifying a user's objectives) and slot filling (extracting the entities relevant to those objectives). Prior slot filling methods assume that each intent type cannot occur more than once within a message, however this is often not a valid assumption for real-world settings. In this work, we generalize slot filling by removing the constraint of unique intents in a message. We cast this as a JSON generation task and approach it using a language model. We create a pre-training dataset by combining DBpedia and existing slot filling datasets that we convert for JSON generation. We also generate an in-domain dataset using GPT-3. We train T5 models for this task (with and without exemplars in the prompt) and find that both training datasets improve performance, and that the model is able to generalize to intent types not seen during training.

Submitted 18 May, 2023; originally announced May 2023.

arXiv:2303.10789 [pdf, other]

A hybrid CNN-RNN approach for survival analysis in a Lung Cancer Screening study

Authors: Yaozhi Lu, Shahab Aslani, An Zhao, Ahmed Shahin, David Barber, Mark Emberton, Daniel C. Alexander, Joseph Jacob

Abstract: In this study, we present a hybrid CNN-RNN approach to investigate long-term survival of subjects in a lung cancer screening study. Subjects who died of cardiovascular and respiratory causes were identified whereby the CNN model was used to capture imaging features in the CT scans and the RNN model was used to investigate time series and thus global information. The models were trained on subjects who underwent cardiovascular and respiratory deaths and a control cohort matched to participant age, gender, and smoking history. The combined model can achieve an AUC of 0.76 which outperforms humans at cardiovascular mortality prediction. The corresponding F1 and Matthews Correlation Coefficient are 0.63 and 0.42 respectively. The generalisability of the model is further validated on an 'external' cohort. The same models were applied to survival analysis with the Cox Proportional Hazard model. It was demonstrated that incorporating the follow-up history can lead to improvement in survival prediction. The Cox neural network can achieve an IPCW C-index of 0.75 on the internal dataset and 0.69 on an external dataset. Delineating imaging features associated with long-term survival can help focus preventative interventions appropriately, particularly for under-recognised pathologies thereby potentially reducing patient morbidity.

Submitted 19 March, 2023; originally announced March 2023.

arXiv:2303.08631 [pdf, other]

Smoothed Q-learning

Authors: David Barber

Abstract: In Reinforcement Learning the Q-learning algorithm provably converges to the optimal solution. However, as others have demonstrated, Q-learning can also overestimate the values and thereby spend too long exploring unhelpful states. Double Q-learning is a provably convergent alternative that mitigates some of the overestimation issues, though sometimes at the expense of slower convergence. We introduce an alternative algorithm that replaces the max operation with an average, resulting also in a provably convergent off-policy algorithm which can mitigate overestimation yet retain similar convergence as standard Q-learning.

Submitted 15 March, 2023; originally announced March 2023.
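
The averaging idea in the Smoothed Q-learning abstract above can be illustrated with a minimal tabular sketch. The abstract does not say which average is used, so the softmax weighting, the `temperature` parameter, and every function name below are assumptions made for illustration, not the paper's algorithm.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats (assumed weighting)."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def q_learning_target(q_next, reward, gamma):
    """Standard Q-learning target: bootstrap from the max next-state value."""
    return reward + gamma * max(q_next)


def smoothed_target(q_next, reward, gamma, weights):
    """Averaged target: `weights` is a probability vector over next actions."""
    average = sum(w * q for w, q in zip(weights, q_next))
    return reward + gamma * average


def td_update(q_sa, target, alpha):
    """Move the current estimate a step of size alpha towards the target."""
    return q_sa + alpha * (target - q_sa)


if __name__ == "__main__":
    q_next = [1.0, 2.0, 3.0]  # hypothetical next-state action values
    weights = softmax(q_next)
    print(q_learning_target(q_next, 0.0, 0.9))
    print(smoothed_target(q_next, 0.0, 0.9, weights))
```

Because a weighted average of the next-state values never exceeds their maximum, the averaged target is never larger than the standard one, which is the intuition behind reduced overestimation.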

arXiv:2209.07396 [pdf, other]

Towards Healing the Blindness of Score Matching

Authors: Mingtian Zhang, Oscar Key, Peter Hayes, David Barber, Brooks Paige, François-Xavier Briol

Abstract: Score-based divergences have been widely used in machine learning and statistics applications. Despite their empirical success, a blindness problem has been observed when using these for multi-modal distributions. In this work, we discuss the blindness problem and propose a new family of divergences that can mitigate the blindness problem. We illustrate our proposed divergence in the context of density estimation and report improved performance compared to traditional approaches.

Submitted 15 October, 2022; v1 submitted 15 September, 2022; originally announced September 2022.

arXiv:2206.09496 [pdf, other]

Integrated Weak Learning

Authors: Peter Hayes, Mingtian Zhang, Raza Habib, Jordan Burgess, Emine Yilmaz, David Barber

Abstract: We introduce Integrated Weak Learning, a principled framework that integrates weak supervision into the training process of machine learning models. Our approach jointly trains the end-model and a label model that aggregates multiple sources of weak supervision. We introduce a label model that can learn to aggregate weak supervision sources differently for different datapoints and takes into consideration the performance of the end-model during training. We show that our approach outperforms existing weak learning techniques across a set of 6 benchmark classification datasets. When both a small amount of labeled data and weak supervision are present the increase in performance is both consistent and large, reliably getting a 2-5 point test F1 score gain over non-integrated methods.

Submitted 19 June, 2022; originally announced June 2022.

Comments: 14 pages, 4 figures

arXiv:2205.14539 [pdf, other]

Improving VAE-based Representation Learning

Authors: Mingtian Zhang, Tim Z. Xiao, Brooks Paige, David Barber

Abstract: Latent variable models like the Variational Auto-Encoder (VAE) are commonly used to learn representations of images. However, for downstream tasks like semantic classification, the representations learned by VAE are less competitive than other non-latent variable models. This has led to some speculations that latent variable models may be fundamentally unsuitable for representation learning. In this work, we study what properties are required for good representations and how different VAE structure choices could affect the learned properties. We show that by using a decoder that prefers to learn local features, the remaining global features can be well captured by the latent, which significantly improves performance of a downstream classification task. We further apply the proposed model to semi-supervised learning tasks and demonstrate improvements in data efficiency.

Submitted 28 May, 2022; originally announced May 2022.

arXiv:2205.11640 [pdf, other]

Generalization Gap in Amortized Inference

Authors: Mingtian Zhang, Peter Hayes, David Barber

Abstract: The ability of likelihood-based probabilistic models to generalize to unseen data is central to many machine learning applications such as lossless compression. In this work, we study the generalization of a popular class of probabilistic model - the Variational Auto-Encoder (VAE). We discuss the two generalization gaps that affect VAEs and show that overfitting is usually dominated by amortized inference. Based on this observation, we propose a new training objective that improves the generalization of amortized inference. We demonstrate how our method can improve performance in the context of image modeling and lossless compression.

Submitted 15 October, 2022; v1 submitted 23 May, 2022; originally announced May 2022.

arXiv:2203.11391 [pdf, other]

Survival Analysis for Idiopathic Pulmonary Fibrosis using CT Images and Incomplete Clinical Data

Authors: Ahmed H. Shahin, Joseph Jacob, Daniel C. Alexander, David Barber

Abstract: Idiopathic Pulmonary Fibrosis (IPF) is an inexorably progressive fibrotic lung disease with a variable and unpredictable rate of progression. CT scans of the lungs inform clinical assessment of IPF patients and contain pertinent information related to disease progression. In this work, we propose a multi-modal method that uses neural networks and memory banks to predict the survival of IPF patients using clinical and imaging data. The majority of clinical IPF patient records have missing data (e.g. missing lung function tests). To this end, we propose a probabilistic model that captures the dependencies between the observed clinical variables and imputes missing ones. This principled approach to missing data imputation can be naturally combined with a deep survival analysis model. We show that the proposed framework yields significantly better survival analysis results than baselines in terms of concordance index and integrated Brier score. Our work also provides insights into novel image-based biomarkers that are linked to mortality.

Submitted 21 March, 2022; originally announced March 2022.

Comments: Accepted as a full paper at the Medical Imaging with Deep Learning conference (MIDL 2022)

arXiv:2201.05213 [pdf, other]

Parallel Neural Local Lossless Compression

Authors: Mingtian Zhang, James Townsend, Ning Kang, David Barber

Abstract: The recently proposed Neural Local Lossless Compression (NeLLoC), which is based on a local autoregressive model, has achieved state-of-the-art (SOTA) out-of-distribution (OOD) generalization performance in the image compression task. In addition to the encouragement of OOD generalization, the local model also allows parallel inference in the decoding stage. In this paper, we propose two parallelization schemes for local autoregressive models. We discuss the practicalities of implementing the schemes and provide experimental evidence of significant gains in compression runtime compared to the previous, non-parallel implementation.

Submitted 26 June, 2022; v1 submitted 13 January, 2022; originally announced January 2022.

arXiv:2112.00174 [pdf, other]

Adaptive Optimization with Examplewise Gradients

Authors: Julius Kunze, James Townsend, David Barber

Abstract: We propose a new, more general approach to the design of stochastic gradient-based optimization methods for machine learning. In this new framework, optimizers assume access to a batch of gradient estimates per iteration, rather than a single estimate. This better reflects the information that is actually available in typical machine learning setups. To demonstrate the usefulness of this generalized approach, we develop Eve, an adaptation of the Adam optimizer which uses examplewise gradients to obtain more accurate second-moment estimates. We provide preliminary experiments, without hyperparameter tuning, which show that the new optimizer slightly outperforms Adam on a small scale benchmark and performs the same or worse on larger scale benchmarks. Further work is needed to refine the algorithm and tune hyperparameters.

Submitted 30 November, 2021; originally announced December 2021.

Comments: 9 pages, 1 figure, 3 tables

arXiv:2109.12043 [pdf, other]

Sample Efficient Model Evaluation

Authors: Emine Yilmaz, Peter Hayes, Raza Habib, Jordan Burgess, David Barber

Abstract: Labelling data is a major practical bottleneck in training and testing classifiers. Given a collection of unlabelled data points, we address how to select which subset to label to best estimate test metrics such as accuracy, $F_1$ score or micro/macro $F_1$. We consider two sampling based approaches, namely the well-known Importance Sampling and we introduce a novel application of Poisson Sampling. For both approaches we derive the minimal error sampling distributions and how to approximate and use them to form estimators and confidence intervals. We show that Poisson Sampling outperforms Importance Sampling both theoretically and experimentally.

Submitted 24 September, 2021; originally announced September 2021.
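
As a rough illustration of the Poisson Sampling idea named in the abstract above, the sketch below labels each pool item independently with its own inclusion probability and then forms an inverse-probability-weighted (Horvitz-Thompson) accuracy estimate. The function names and the inclusion probabilities are assumptions for illustration; the paper's minimal-error sampling distributions and confidence intervals are not reproduced here.

```python
import random


def poisson_sample(inclusion_probs, rng):
    """Select item i independently with probability inclusion_probs[i]."""
    return [i for i, p in enumerate(inclusion_probs) if rng.random() < p]


def horvitz_thompson_accuracy(correct, inclusion_probs, sampled_idx):
    """Estimate pool accuracy from the labelled subset only.

    correct[i] is 1 if the classifier is right on item i, else 0; only the
    entries at sampled_idx are read, mimicking labels bought for that subset.
    Each labelled item is weighted by 1 / (its inclusion probability).
    """
    n_total = len(correct)
    weighted = sum(correct[i] / inclusion_probs[i] for i in sampled_idx)
    return weighted / n_total


if __name__ == "__main__":
    rng = random.Random(0)
    correct = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical correctness flags
    probs = [0.5] * len(correct)        # hypothetical inclusion probabilities
    idx = poisson_sample(probs, rng)
    print(idx, horvitz_thompson_accuracy(correct, probs, idx))
```

Unlike fixed-size sampling, the number of labelled items here is itself random, and the inverse-probability weights are what keep the estimate unbiased for any strictly positive inclusion probabilities.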

arXiv:2103.16210 [pdf, other]

Locally-Contextual Nonlinear CRFs for Sequence Labeling

Authors: Harshil Shah, Tim Xiao, David Barber

Abstract: Linear chain conditional random fields (CRFs) combined with contextual word embeddings have achieved state of the art performance on sequence labeling tasks. In many of these tasks, the identity of the neighboring words is often the most useful contextual information when predicting the label of a given word. However, contextual embeddings are usually trained in a task-agnostic manner. This means that although they may encode information about the neighboring words, it is not guaranteed. It can therefore be beneficial to design the sequence labeling architecture to directly extract this information from the embeddings. We propose locally-contextual nonlinear CRFs for sequence labeling. Our approach directly incorporates information from the neighboring embeddings when predicting the label for a given word, and parametrizes the potential functions using deep neural networks. Our model serves as a drop-in replacement for the linear chain CRF, consistently outperforming it in our ablation study. On a variety of tasks, our results are competitive with those of the best published methods. In particular, we outperform the previous state of the art on chunking on CoNLL 2000 and named entity recognition on OntoNotes 5.0 English.

Submitted 30 March, 2021; originally announced March 2021.

arXiv:2010.13476 [pdf, other]

Reducing the Computational Cost of Deep Generative Models with Binary Neural Networks

Authors: Thomas Bird, Friso H. Kingma, David Barber

Abstract: Deep generative models provide a powerful set of tools to understand real-world data. But as these models improve, they increase in size and complexity, so their computational cost in memory and execution time grows. Using binary weights in neural networks is one method which has shown promise in reducing this cost. However, whether binary neural networks can be used in generative models is an open problem. In this work we show, for the first time, that we can successfully train generative models which utilize binary neural networks. This reduces the computational cost of the models massively. We develop a new class of binary weight normalization, and provide insights for architecture designs of these binarized generative models. We demonstrate that two state-of-the-art deep generative models, the ResNet VAE and Flow++ models, can be binarized effectively using these techniques. We train binary models that achieve loss values close to those of the regular models but are 90%-94% smaller in size, and also allow significant speed-ups in execution time.

Submitted 3 May, 2021; v1 submitted 26 October, 2020; originally announced October 2020.

Comments: Accepted to ICLR 2021

arXiv:2010.12464 [pdf, other]

Representation Learning for High-Dimensional Data Collection under Local Differential Privacy

Authors: Alex Mansbridge, Gregory Barbour, Davide Piras, Michael Murray, Christopher Frye, Ilya Feige, David Barber

Abstract: The collection of individuals' data has become commonplace in many industries. Local differential privacy (LDP) offers a rigorous approach to preserving privacy whereby the individual privatises their data locally, allowing only their perturbed datum to leave their possession. LDP thus provides a provable privacy guarantee to the individual against both adversaries and database administrators. Existing LDP mechanisms have successfully been applied to low-dimensional data, but in high dimensions the privacy-inducing noise largely destroys the utility of the data. In this work, our contributions are two-fold: first, by adapting state-of-the-art techniques from representation learning, we introduce a novel approach to learning LDP mechanisms. These mechanisms add noise to powerful representations on the low-dimensional manifold underlying the data, thereby overcoming the prohibitive noise requirements of LDP in high dimensions. Second, we introduce a novel denoising approach for downstream model learning. The training of performant machine learning models using collected LDP data is a common goal for data collectors, and downstream model performance forms a proxy for the LDP data utility. Our approach significantly outperforms current state-of-the-art LDP mechanisms.

Submitted 14 May, 2022; v1 submitted 23 October, 2020; originally announced October 2020.

arXiv:2010.03467 [pdf, other]

Learning Deep-Latent Hierarchies by Stacking Wasserstein Autoencoders

Authors: Benoit Gaujac, Ilya Feige, David Barber

Abstract: Probabilistic models with hierarchical-latent-variable structures provide state-of-the-art results amongst non-autoregressive, unsupervised density-based models. However, the most common approach to training such models based on Variational Autoencoders (VAEs) often fails to leverage deep-latent hierarchies; successful approaches require complex inference and optimisation schemes. Optimal Transport is an alternative, non-likelihood-based framework for training generative models with appealing theoretical properties, in principle allowing easier training convergence between distributions. In this work we propose a novel approach to training models with deep-latent hierarchies based on Optimal Transport, without the need for highly bespoke models and inference networks. We show that our method enables the generative model to fully leverage its deep-latent hierarchy, avoiding the well known "latent variable collapse" issue of VAEs; therefore, providing qualitatively better sample generations as well as more interpretable latent representation than the original Wasserstein Autoencoder with Maximum Mean Discrepancy divergence.

Submitted 7 October, 2020; originally announced October 2020.

arXiv:2010.03459 [pdf, other]

Learning disentangled representations with the Wasserstein Autoencoder

Authors: Benoit Gaujac, Ilya Feige, David Barber

Abstract: Disentangled representation learning has undoubtedly benefited from objective function surgery. However, a delicate balancing act of tuning is still required in order to trade off reconstruction fidelity versus disentanglement. Building on previous successes of penalizing the total correlation in the latent variables, we propose TCWAE (Total Correlation Wasserstein Autoencoder). Working in the WAE paradigm naturally enables the separation of the total-correlation term, thus providing disentanglement control over the learned representation, while offering more flexibility in the choice of reconstruction cost. We propose two variants using different KL estimators and perform extensive quantitative comparisons on data sets with known generative factors, showing competitive results relative to state-of-the-art techniques. We further study the trade off between disentanglement and reconstruction on more-difficult data sets with unknown generative factors, where the flexibility of the WAE paradigm in the reconstruction term improves reconstructions.

Submitted 7 October, 2020; originally announced October 2020.

arXiv:2005.00146 [pdf, other]

Addressing Catastrophic Forgetting in Few-Shot Problems

Authors: Pauching Yap, Hippolyt Ritter, David Barber

Abstract: Neural networks are known to suffer from catastrophic forgetting when trained on sequential datasets. While there have been numerous attempts to solve this problem in large-scale supervised classification, little has been done to overcome catastrophic forgetting in few-shot classification problems. We demonstrate that the popular gradient-based model-agnostic meta-learning algorithm (MAML) indeed suffers from catastrophic forgetting and introduce a Bayesian online meta-learning framework that tackles this problem. Our framework utilises Bayesian online learning and meta-learning along with Laplace approximation and variational inference to overcome catastrophic forgetting in few-shot classification problems. The experimental evaluations demonstrate that our framework can effectively achieve this goal in comparison with various baselines. As an additional utility, we also demonstrate empirically that our framework is capable of meta-learning on sequentially arriving few-shot tasks from a stationary task distribution.

Submitted 21 June, 2021; v1 submitted 30 April, 2020; originally announced May 2020.

Comments: ICML 2021

arXiv:2001.04942 [pdf, other]

Private Machine Learning via Randomised Response

Authors: David Barber

Abstract: We introduce a general learning framework for private machine learning based on randomised response. Our assumption is that all actors are potentially adversarial and as such we trust only to release a single noisy version of an individual's datapoint. We discuss a general approach that forms a consistent way to estimate the true underlying machine learning model and demonstrate this in the case of logistic regression.

Submitted 24 February, 2020; v1 submitted 14 January, 2020; originally announced January 2020.
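
The abstract above does not spell out the mechanism, so the following is only a minimal sketch of classic binary randomised response, not the paper's framework. Each individual releases one noisy bit, and the collector debiases the observed mean to get a consistent estimate of the true rate. The names `randomise`, `estimate_rate` and `p_keep` are hypothetical and chosen for illustration.

```python
import random


def randomise(bit, p_keep, rng):
    """Release the true bit with probability p_keep, otherwise its flip."""
    return bit if rng.random() < p_keep else 1 - bit


def estimate_rate(reports, p_keep):
    """Debias the mean of the noisy reports.

    E[report] = (1 - p_keep) + (2 * p_keep - 1) * rate, so we invert that
    affine map. Requires p_keep != 0.5 (otherwise reports carry no signal).
    """
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_keep)) / (2 * p_keep - 1)


if __name__ == "__main__":
    rng = random.Random(0)
    true_bits = [1 if rng.random() < 0.3 else 0 for _ in range(20000)]
    reports = [randomise(b, 0.75, rng) for b in true_bits]
    print(estimate_rate(reports, 0.75))  # close to the true rate of 0.3
```

The closer `p_keep` is to 0.5, the stronger the privacy of each released bit and the noisier the debiased estimate, which is the trade-off any such scheme has to manage.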

arXiv:1912.09953 [pdf, other]

HiLLoC: Lossless Image Compression with Hierarchical Latent Variable Models

Authors: James Townsend, Thomas Bird, Julius Kunze, David Barber

Abstract: We make the following striking observation: fully convolutional VAE models trained on 32x32 ImageNet can generalize well, not just to 64x64 but also to far larger photographs, with no changes to the model. We use this property, applying fully convolutional models to lossless compression, demonstrating a method to scale the VAE-based 'Bits-Back with ANS' algorithm for lossless compression to large color photographs, and achieving state of the art for compression of full size ImageNet images. We release Craystack, an open source library for convenient prototyping of lossless compression using probabilistic models, along with full implementations of all of our compression results.

Submitted 20 December, 2019; originally announced December 2019.

arXiv:1907.11891 [pdf, other]

Variational f-divergence Minimization

Authors: Mingtian Zhang, Thomas Bird, Raza Habib, Tianlin Xu, David Barber

Abstract: Probabilistic models are often trained by maximum likelihood, which corresponds to minimizing a specific f-divergence between the model and data distribution. In light of recent successes in training Generative Adversarial Networks, alternative non-likelihood training criteria have been proposed. Whilst not necessarily statistically efficient, these alternatives may better match user requirements such as sharp image generation. A general variational method for training probabilistic latent variable models using maximum likelihood is well established; however, how to train latent variable models using other f-divergences is comparatively unknown. We discuss a variational approach that, when combined with the recently introduced Spread Divergence, can be applied to train a large class of latent variable models using any f-divergence.

Submitted 27 July, 2019; originally announced July 2019.

arXiv:1902.04340 [pdf, other]

doi 10.3390/e21080758

Gaussian Mean Field Regularizes by Limiting Learned Information

Authors: Julius Kunze, Louis Kirsch, Hippolyt Ritter, David Barber

Abstract: Variational inference with a factorized Gaussian posterior estimate is a widely used approach for learning parameters and hidden variables. Empirically, a regularizing effect can be observed that is poorly understood. In this work, we show how mean field inference improves generalization by limiting mutual information between learned parameters and the data through noise. We quantify a maximum capacity when the posterior variance is either fixed or learned and connect it to generalization error, even when the KL-divergence in the objective is rescaled. Our experiments demonstrate that bounding information between parameters and data effectively regularizes neural networks on both supervised and unsupervised tasks.

Submitted 12 February, 2019; originally announced February 2019.

arXiv:1901.04866 [pdf, other]

Practical Lossless Compression with Latent Variables using Bits Back Coding

Authors: James Townsend, Tom Bird, David Barber

Abstract: Deep latent variable models have seen recent success in many data domains. Lossless compression is an application of these models which, despite having the potential to be highly useful, has yet to be implemented in a practical manner. We present 'Bits Back with ANS' (BB-ANS), a scheme to perform lossless compression with latent variable models at a near optimal rate. We demonstrate this scheme by using it to compress the MNIST dataset with a variational auto-encoder model (VAE), achieving compression rates superior to standard methods with only a simple VAE. Given that the scheme is highly amenable to parallelization, we conclude that with a sufficiently high quality generative model this scheme could be used to achieve substantial improvements in compression rate with acceptable running time. We make our implementation available open source at https://github.com/bits-back/bits-back .

Submitted 15 January, 2019; originally announced January 2019.

arXiv:1811.08968 [pdf, other]

Spread Divergence

Authors: Mingtian Zhang, Peter Hayes, Tom Bird, Raza Habib, David Barber

Abstract: For distributions $\mathbb{P}$ and $\mathbb{Q}$ with different supports or undefined densities, the divergence $\textrm{D}(\mathbb{P}||\mathbb{Q})$ may not exist. We define a Spread Divergence $\tilde{\textrm{D}}(\mathbb{P}||\mathbb{Q})$ on modified $\mathbb{P}$ and $\mathbb{Q}$ and describe sufficient conditions for the existence of such a divergence. We demonstrate how to maximize the discriminatory power of a given divergence by parameterizing and learning the spread. We also give examples of using a Spread Divergence to train implicit generative models, including linear models (Independent Components Analysis) and non-linear models (Deep Generative Networks).

Submitted 4 December, 2022; v1 submitted 21 November, 2018; originally announced November 2018.

Journal ref: Volume 119: International Conference on Machine Learning, 13-18 July 2020, Virtual

arXiv:1811.05249 [pdf, other]

Modular Networks: Learning to Decompose Neural Computation

Authors: Louis Kirsch, Julius Kunze, David Barber

Abstract: Scaling model capacity has been vital in the success of deep learning. For a typical network, necessary compute resources and training time grow dramatically with model size. Conditional computation is a promising way to increase the number of parameters with a relatively small increase in resources. We propose a training algorithm that flexibly chooses neural modules based on the data to be processed. Both the decomposition and modules are learned end-to-end. In contrast to existing approaches, training does not rely on regularization to enforce diversity in module use. We apply modular networks both to image recognition and language modeling tasks, where we achieve superior performance compared to several baselines. Introspection reveals that modules specialize in interpretable contexts.

Submitted 13 November, 2018; originally announced November 2018.

Comments: NIPS 2018

arXiv:1809.04855 [pdf, other]

Stochastic Variational Optimization

Authors: Thomas Bird, Julius Kunze, David Barber

Abstract: Variational Optimization forms a differentiable upper bound on an objective. We show that approaches such as Natural Evolution Strategies and Gaussian Perturbation are special cases of Variational Optimization in which the expectations are approximated by Gaussian sampling. These approaches are of particular interest because they are parallelizable. We calculate the approximate bias and variance of the corresponding gradient estimators and demonstrate that using antithetic sampling or a baseline is crucial to mitigate their problems. We contrast these methods with an alternative parallelizable method, namely Directional Derivatives. We conclude that, for differentiable objectives, using Directional Derivatives is preferable to using Variational Optimization to perform parallel Stochastic Gradient Descent.

Submitted 13 September, 2018; originally announced September 2018.
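
As a rough illustration of the Gaussian-sampling case mentioned in the abstract above, the sketch below estimates the gradient of the bound U(theta) = E[f(x)] with x ~ N(theta, sigma^2), which upper-bounds min f because an average can never be below the minimum. It uses antithetic pairs (theta + sigma*eps, theta - sigma*eps). This is a one-dimensional toy under stated assumptions, not the authors' implementation. The function name and parameters are hypothetical.

```python
import random


def antithetic_vo_gradient(f, theta, sigma, num_pairs, rng):
    """Estimate dU/dtheta for U(theta) = E[f(x)], x ~ N(theta, sigma^2).

    The score-function identity gives dU/dtheta = E[f(theta + sigma*eps) * eps / sigma].
    Averaging that term over the antithetic pair (eps, -eps) yields
    (f(theta + sigma*eps) - f(theta - sigma*eps)) * eps / (2 * sigma).
    """
    total = 0.0
    for _ in range(num_pairs):
        eps = rng.gauss(0.0, 1.0)
        diff = f(theta + sigma * eps) - f(theta - sigma * eps)
        total += diff * eps / (2.0 * sigma)
    return total / num_pairs


if __name__ == "__main__":
    rng = random.Random(0)
    # For f(x) = x^2, U(theta) = theta^2 + sigma^2, so the exact gradient is 2*theta.
    print(antithetic_vo_gradient(lambda x: x * x, 1.5, 0.5, 20000, rng))
```

Each pair can be evaluated independently, which is why estimators of this kind parallelize easily.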

arXiv:1809.03137 [pdf, other]

Tracking by Animation: Unsupervised Learning of Multi-Object Attentive Trackers

Authors: Zhen He, Jian Li, Daxue Liu, Hangen He, David Barber

Abstract: Online Multi-Object Tracking (MOT) from videos is a challenging computer vision task which has been extensively studied for decades. Most of the existing MOT algorithms are based on the Tracking-by-Detection (TBD) paradigm combined with popular machine learning approaches which largely reduce the human effort to tune algorithm parameters. However, the commonly used supervised learning approaches require the labeled data (e.g., bounding boxes), which is expensive for videos. Also, the TBD framework is usually suboptimal since it is not end-to-end, i.e., it considers the task as detection and tracking, but not jointly. To achieve both label-free and end-to-end learning of MOT, we propose a Tracking-by-Animation framework, where a differentiable neural model first tracks objects from input frames and then animates these objects into reconstructed frames. Learning is then driven by the reconstruction error through backpropagation. We further propose a Reprioritized Attentive Tracking to improve the robustness of data association. Experiments conducted on both synthetic and real video datasets show the potential of the proposed model. Our project page is publicly available at: https://github.com/zhen-he/tracking-by-animation

Submitted 8 April, 2019; v1 submitted 10 September, 2018; originally announced September 2018.

Comments: CVPR 2019

arXiv:1806.05178 [pdf, other]

Generating Sentences Using a Dynamic Canvas

Authors: Harshil Shah, Bowen Zheng, David Barber

Abstract: We introduce the Attentive Unsupervised Text (W)riter (AUTR), which is a word level generative model for natural language. It uses a recurrent neural network with a dynamic attention and canvas memory mechanism to iteratively construct sentences. By viewing the state of the memory at intermediate stages and where the model is placing its attention, we gain insight into how it constructs sentences. We demonstrate that AUTR learns a meaningful latent representation for each sentence, and achieves competitive log-likelihood lower bounds whilst being computationally efficient. It is effective at generating and reconstructing sentences, as well as imputing missing words.

Submitted 13 June, 2018; originally announced June 2018.

Comments: AAAI 2018

arXiv:1806.05138 [pdf, other]

Generative Neural Machine Translation

Authors: Harshil Shah, David Barber

Abstract: We introduce Generative Neural Machine Translation (GNMT), a latent variable architecture which is designed to model the semantics of the source and target sentences. We modify an encoder-decoder translation model by adding a latent variable as a language agnostic representation which is encouraged to learn the meaning of the sentence. GNMT achieves competitive BLEU scores on pure translation tasks, and is superior when there are missing words in the source sentence. We augment the model to facilitate multilingual translation and semi-supervised learning without adding parameters. This framework significantly reduces overfitting when there is limited paired data available, and is effective for translating between pairs of languages not seen during training.

Submitted 13 June, 2018; originally announced June 2018.

arXiv:1806.04480 [pdf, other]

Improving latent variable descriptiveness with AutoGen

Authors: Alex Mansbridge, Roberto Fierimonte, Ilya Feige, David Barber

Abstract: Powerful generative models, particularly in Natural Language Modelling, are commonly trained by maximizing a variational lower bound on the data log likelihood. These models often suffer from poor use of their latent variable, with ad-hoc annealing factors used to encourage retention of information in the latent variable. We discuss an alternative and general approach to latent variable modelling, based on an objective that combines the data log likelihood as well as the likelihood of a perfect reconstruction through an autoencoder. Tying these together ensures by design that the latent variable captures information about the observations, whilst retaining the ability to generate well. Interestingly, though this approach is a priori unrelated to VAEs, the lower bound attained is identical to the standard VAE bound but with the addition of a simple pre-factor; thus, providing a formal interpretation of the commonly used, ad-hoc pre-factors in training VAEs.

Submitted 12 June, 2018; originally announced June 2018.

Comments: 8 pages, 2 figures, 5 tables

arXiv:1806.04465 [pdf, other]

Gaussian mixture models with Wasserstein distance

Authors: Benoit Gaujac, Ilya Feige, David Barber

Abstract: Generative models with both discrete and continuous latent variables are highly motivated by the structure of many real-world data sets. They present, however, subtleties in training often manifesting in the discrete latent being under leveraged. In this paper, we show that such models are more amenable to training when using the Optimal Transport framework of Wasserstein Autoencoders. We find our discrete latent variable to be fully leveraged by the model when trained, without any modifications to the objective function or significant fine tuning. Our model generates comparable samples to other approaches while using relatively simple neural networks, since the discrete latent variable carries much of the descriptive burden. Furthermore, the discrete latent provides significant control over generation.

Submitted 12 June, 2018; originally announced June 2018.

Comments: 8 pages, 5 figures

arXiv:1805.07810 [pdf, other]

Online Structured Laplace Approximations For Overcoming Catastrophic Forgetting

Authors: Hippolyt Ritter, Aleksandar Botev, David Barber

Abstract: We introduce the Kronecker factored online Laplace approximation for overcoming catastrophic forgetting in neural networks. The method is grounded in a Bayesian online learning framework, where we recursively approximate the posterior after every task with a Gaussian, leading to a quadratic penalty on changes to the weights. The Laplace approximation requires calculating the Hessian around a mode, which is typically intractable for modern architectures. In order to make our method scalable, we leverage recent block-diagonal Kronecker factored approximations to the curvature. Our algorithm achieves over 90% test accuracy across a sequence of 50 instantiations of the permuted MNIST dataset, substantially outperforming related methods for overcoming catastrophic forgetting.

Submitted 20 May, 2018; originally announced May 2018.

Comments: 13 pages, 6 figures

arXiv:1711.01577 [pdf, other]

Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning

Authors: Zhen He, Shaobing Gao, Liang Xiao, Daxue Liu, Hangen He, David Barber

Abstract: Long Short-Term Memory (LSTM) is a popular approach to boosting the ability of Recurrent Neural Networks to store longer term temporal information. The capacity of an LSTM network can be increased by widening and adding layers. However, usually the former introduces additional parameters, while the latter increases the runtime. As an alternative we propose the Tensorized LSTM in which the hidden states are represented by tensors and updated via a cross-layer convolution. By increasing the tensor size, the network can be widened efficiently without additional parameters since the parameters are shared across different locations in the tensor; by delaying the output, the network can be deepened implicitly with little additional runtime since deep computations for each timestep are merged into temporal computations of the sequence. Experiments conducted on five challenging sequence learning tasks show the potential of the proposed model.

Submitted 12 December, 2017; v1 submitted 5 November, 2017; originally announced November 2017.

Comments: Accepted by NIPS 2017

arXiv:1705.08439 [pdf, other]

Thinking Fast and Slow with Deep Learning and Tree Search

Authors: Thomas Anthony, Zheng Tian, David Barber

Abstract: Sequential decision making problems, such as structured prediction, robotic control, and game playing, require a combination of planning policies and generalisation of those plans. In this paper, we present Expert Iteration (ExIt), a novel reinforcement learning algorithm which decomposes the problem into separate planning and generalisation tasks. Planning new policies is performed by tree search, while a deep neural network generalises those plans. Subsequently, tree search is improved by using the neural network policy to guide search, increasing the strength of new plans. In contrast, standard deep Reinforcement Learning algorithms rely on a neural network not only to generalise plans, but to discover them too. We show that ExIt outperforms REINFORCE for training a neural network to play the board game Hex, and our final tree search agent, trained tabula rasa, defeats MoHex 1.0, the most recent Olympiad Champion player to be publicly released.

Submitted 3 December, 2017; v1 submitted 23 May, 2017; originally announced May 2017.

Comments: v1 to v2: - Add a value function in MCTS - Some MCTS hyper-parameters changed - Repetition of experiments: improved accuracy and errors shown. (note the reduction in effect size for the tpt/cat experiment) - Results from a longer training run, including changes in expert strength in training - Comparison to MoHex. v3: clarify independence of ExIt and AG0. v4: see appendix E

arXiv:1703.08561 [pdf, other]

AutonoVi: Autonomous Vehicle Planning with Dynamic Maneuvers and Traffic Constraints

Authors: Andrew Best, Sahil Narang, Daniel Barber, Dinesh Manocha

Abstract: We present AutonoVi, a novel algorithm for autonomous vehicle navigation that supports dynamic maneuvers and satisfies traffic constraints and norms. Our approach is based on optimization-based maneuver planning that supports dynamic lane-changes, swerving, and braking in all traffic scenarios and guides the vehicle to its goal position. We take into account various traffic constraints, including collision avoidance with other vehicles, pedestrians, and cyclists using control velocity obstacles. We use a data-driven approach to model the vehicle dynamics for control and collision avoidance. Furthermore, our trajectory computation algorithm takes into account traffic rules and behaviors, such as stopping at intersections and stoplights, based on an arc-spline representation. We have evaluated our algorithm in a simulated environment and tested its interactive performance in urban and highway driving scenarios with tens of vehicles, pedestrians, and cyclists. These scenarios include jaywalking pedestrians, sudden stops from high speeds, safely passing cyclists, a vehicle suddenly swerving into the roadway, and high-density traffic where the vehicle must change lanes to progress more effectively.

Submitted 29 March, 2017; v1 submitted 24 March, 2017; originally announced March 2017.

Comments: 9 pages, 6 figures

arXiv:1607.01981 [pdf, other]

Nesterov's Accelerated Gradient and Momentum as approximations to Regularised Update Descent

Authors: Aleksandar Botev, Guy Lever, David Barber

Abstract: We present a unifying framework for adapting the update direction in gradient-based iterative optimization methods. As natural special cases we re-derive classical momentum and Nesterov's accelerated gradient method, lending a new intuitive interpretation to the latter algorithm. We show that a new algorithm, which we term Regularised Gradient Descent, can converge more quickly than either Nesterov's algorithm or the classical momentum algorithm.

Submitted 11 July, 2016; v1 submitted 7 July, 2016; originally announced July 2016.

arXiv:1212.4507 [pdf, ps, other]

Variational Optimization

Authors: Joe Staines, David Barber

Abstract: We discuss a general technique that can be used to form a differentiable bound on the optima of non-differentiable or discrete objective functions. We form a unified description of these methods and consider under which circumstances the bound is concave. In particular we consider two concrete applications of the method, namely sparse learning and support vector classification.

Submitted 20 December, 2012; v1 submitted 18 December, 2012; originally announced December 2012.

MSC Class: 65K10 ACM Class: G.1.6

arXiv:1206.6459 [pdf]

Bayesian Conditional Cointegration

Authors: Chris Bracegirdle, David Barber

Abstract: Cointegration is an important topic for time-series, and describes a relationship between two series in which a linear combination is stationary. Classically, the test for cointegration is based on a two stage process in which first the linear relation between the series is estimated by Ordinary Least Squares. Subsequently a unit root test is performed on the residuals. A well-known deficiency of this classical approach is that it can lead to erroneous conclusions about the presence of cointegration. As an alternative, we present a framework for estimating whether cointegration exists using Bayesian inference which is empirically superior to the classical approach. Finally, we apply our technique to model segmented cointegration in which cointegration may exist only for limited time. In contrast to previous approaches our model makes no restriction on the number of possible cointegration segments.

Submitted 27 June, 2012; originally announced June 2012.

Comments: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)

arXiv:1206.3237 [pdf]

Clique Matrices for Statistical Graph Decomposition and Parameterising Restricted Positive Definite Matrices

Authors: David Barber

Abstract: We introduce Clique Matrices as an alternative representation of undirected graphs, being a generalisation of the incidence matrix representation. Here we use clique matrices to decompose a graph into a set of possibly overlapping clusters, defined as well-connected subsets of vertices. The decomposition is based on a statistical description which encourages clusters to be well connected and few in number. Inference is carried out using a variational approximation. Clique matrices also play a natural role in parameterising positive definite matrices under zero constraints on elements of the matrix. We show that clique matrices can parameterise all positive definite matrices restricted according to a decomposable graph and form a structured Factor Analysis approximation in the non-decomposable case.

Submitted 13 June, 2012; originally announced June 2012.

Comments: Appears in Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI2008)

Report number: UAI-P-2008-PG-26-33

arXiv:1202.3720 [pdf]

Efficient Inference in Markov Control Problems

Authors: Thomas Furmston, David Barber

Abstract: Markov control algorithms that perform smooth, non-greedy updates of the policy have been shown to be very general and versatile, with policy gradient and Expectation Maximisation algorithms being particularly popular. For these algorithms, marginal inference of the reward weighted trajectory distribution is required to perform policy updates. We discuss a new exact inference algorithm for these marginals in the finite horizon case that is more efficient than the standard approach based on classical forward-backward recursions. We also provide a principled extension to infinite horizon Markov Decision Problems that explicitly accounts for an infinite horizon. This extension provides a novel algorithm for both policy gradients and Expectation Maximisation in infinite horizon problems.

Submitted 14 February, 2012; originally announced February 2012.

Report number: UAI-P-2011-PG-221-229

arXiv:1107.3090 [pdf, other]

On the Computational Complexity of Stochastic Controller Optimization in POMDPs

Authors: Nikos Vlassis, Michael L. Littman, David Barber

Abstract: We show that the problem of finding an optimal stochastic 'blind' controller in a Markov decision process is an NP-hard problem. The corresponding decision problem is NP-hard, in PSPACE, and SQRT-SUM-hard, hence placing it in NP would imply breakthroughs in long-standing open problems in computer science. Our result establishes that the more general problem of stochastic controller optimization in POMDPs is also NP-hard. Nonetheless, we outline a special case that is convex and admits efficient global solutions.

Submitted 4 October, 2012; v1 submitted 15 July, 2011; originally announced July 2011.

Comments: Corrected error in the proof of Theorem 2, and revised Section 5

ACM Class: F.2.1

arXiv:1105.5455 [pdf, ps]

doi 10.1613/jair.567

Variational Cumulant Expansions for Intractable Distributions

Authors: D. Barber, P. de van Laar

Abstract: Intractable distributions present a common difficulty in inference within the probabilistic knowledge representation framework and variational methods have recently been popular in providing an approximate solution. In this article, we describe a perturbational approach in the form of a cumulant expansion which, to lowest order, recovers the standard Kullback-Leibler variational bound. Higher-order terms describe corrections on the variational approach without incurring much further computational cost. The relationship to other perturbational approaches such as TAP is also elucidated. We demonstrate the method on a particular class of undirected graphical models, Boltzmann machines, for which our simulation results confirm improved accuracy and enhanced stability during learning.

Submitted 26 May, 2011; originally announced May 2011.

Journal ref: Journal Of Artificial Intelligence Research, Volume 10, pages 435-455, 1999

Showing 1–48 of 48 results for author: Barber, D