-
Bayesian Flow Networks
Authors:
Alex Graves,
Rupesh Kumar Srivastava,
Timothy Atkinson,
Faustino Gomez
Abstract:
This paper introduces Bayesian Flow Networks (BFNs), a new class of generative model in which the parameters of a set of independent distributions are modified with Bayesian inference in the light of noisy data samples, then passed as input to a neural network that outputs a second, interdependent distribution. Starting from a simple prior and iteratively updating the two distributions yields a generative procedure similar to the reverse process of diffusion models; however it is conceptually simpler in that no forward process is required. Discrete and continuous-time loss functions are derived for continuous, discretised and discrete data, along with sample generation procedures. Notably, the network inputs for discrete data lie on the probability simplex, and are therefore natively differentiable, paving the way for gradient-based sample guidance and few-step generation in discrete domains such as language modelling. The loss function directly optimises data compression and places no restrictions on the network architecture. In our experiments BFNs achieve competitive log-likelihoods for image modelling on dynamically binarized MNIST and CIFAR-10, and outperform all known discrete diffusion models on the text8 character-level language modelling task.
Submitted 3 February, 2024; v1 submitted 14 August, 2023;
originally announced August 2023.
-
Jammed solids with pins: Thresholds, Force networks and Elasticity
Authors:
Andy L. Zhang,
Sean A. Ridout,
Celia Parts,
Aarushi Sachdeva,
Cacey S. Bester,
Katharina Vollmayr-Lee,
Brian C. Utter,
Ted Brzinski,
Amy L. Graves
Abstract:
The role of fixed degrees of freedom in soft/granular matter systems has broad applicability and theoretical interest. Here we address questions of the geometrical role that a scaffolding of fixed particles plays in tuning the threshold volume fraction and force network in the vicinity of jamming. Our 2d simulated system consists of soft particles and fixed "pins", both of which harmonically repel overlaps. On one hand, we find that many of the critical scalings associated with jamming in the absence of pins continue to hold in the presence of even dense pin lattices. On the other hand, the presence of pins lowers the jamming threshold, in a universal way at low pin densities and a geometry-dependent manner at high pin densities, producing packings with lower densities and fewer contacts between particles. The onset of strong lattice dependence coincides with the development of bond-orientational order. Furthermore, the presence of pins dramatically modifies the network of forces, with both unusually weak and unusually strong forces becoming more abundant. The spatial organization of this force network depends on pin geometry and is described in detail. Using persistent homology we demonstrate that pins modify the topology of the network. Finally, we observe clear signatures of this developing bond-orientational order and broad force distribution in the elastic moduli which characterize the linear response of these packings to strain.
Submitted 25 August, 2022; v1 submitted 29 May, 2022;
originally announced May 2022.
-
A Practical Sparse Approximation for Real Time Recurrent Learning
Authors:
Jacob Menick,
Erich Elsen,
Utku Evci,
Simon Osindero,
Karen Simonyan,
Alex Graves
Abstract:
Current methods for training recurrent neural networks are based on backpropagation through time, which requires storing a complete history of network states, and prohibits updating the weights `online' (after every timestep). Real Time Recurrent Learning (RTRL) eliminates the need for history storage and allows for online weight updates, but does so at the expense of computational costs that are quartic in the state size. This renders RTRL training intractable for all but the smallest networks, even ones that are made highly sparse.
We introduce the Sparse n-step Approximation (SnAp) to the RTRL influence matrix, which only keeps entries that are nonzero within n steps of the recurrent core. SnAp with n=1 is no more expensive than backpropagation, and we find that it substantially outperforms other RTRL approximations with comparable costs such as Unbiased Online Recurrent Optimization. For highly sparse networks, SnAp with n=2 remains tractable and can outperform backpropagation through time in terms of learning speed when updates are done online. SnAp becomes equivalent to RTRL when n is large.
Submitted 12 June, 2020;
originally announced June 2020.
-
Structured randomness: Jamming of soft discs and pins
Authors:
Prairie Wentworth-Nice,
Sean A. Ridout,
Brian Jenike,
Ari Liloia,
Amy L. Graves
Abstract:
Simulations are used to find the zero temperature jamming threshold, $φ_j$, for soft, bidisperse disks in the presence of small fixed particles, or "pins", arranged in a lattice. The presence of pins leads, as one expects, to a decrease in $φ_j$. Structural properties of the system near the jamming threshold are calculated as a function of the pin density. While the correlation length exponent remains $ν= 1/2$ at low pin densities, the system is mechanically stable with more bonds, yet fewer contacts than the Maxwell criterion implies in the absence of pins. In addition, as pin density increases, novel bond orientational order and long-range spatial order appear, which are correlated with the square symmetry of the pin lattice.
Submitted 29 April, 2020; v1 submitted 9 April, 2020;
originally announced April 2020.
-
Hitting the Ground Running: Computational physics education to prepare students for computational physics research
Authors:
Amy Lisa Graves,
Adam D. Light
Abstract:
Momentum exists in the physics community for integrating computation into the undergraduate curriculum. One of many benefits would be preparation for computational research. Our investigation poses the question of which computational skills might be best learned in the curriculum (prior to research) versus during research. Based on a survey of computational physicists, we present evidence that many relevant skills are developed naturally in a research context while others stand out as best learned in advance.
Submitted 8 April, 2020;
originally announced April 2020.
-
Proton-induced reactions on Fe, Cu, & Ti from threshold to 55 MeV
Authors:
Andrew S. Voyles,
Amanda M. Lewis,
Jonathan T. Morrell,
M. Shamsuzzoha Basunia,
Lee A. Bernstein,
Jonathan W. Engle,
Stephen A. Graves,
Eric F. Matthews
Abstract:
Theoretical models often differ significantly from measured data in their predictions of the magnitude of nuclear reactions that produce radionuclides for medical, research, and national security applications. In this paper, we compare a priori predictions from several state-of-the-art reaction modeling packages (CoH, EMPIRE, TALYS, and ALICE) to cross sections measured using the stacked-target activation method. The experiment was performed using the LBNL 88-Inch Cyclotron with beams of 25 and 55 MeV protons on a stack of iron, copper, and titanium foils. 34 excitation functions were measured for 4 < Ep < 55 MeV, including the first measurement of the independent cross sections for natFe(p,x) 49,51Cr, 51,52m,52g,56Mn, and 58m,58gCo. All of the models failed to reproduce the isomer-to-ground state ratio for reaction channels at compound and pre-compound energies, suggesting issues in modeling the deposition or distribution of angular momentum in these residual nuclei.
Submitted 22 October, 2019;
originally announced October 2019.
-
Excitation functions for (p,x) reactions of niobium in the energy range of E$_{\text{p}}$ = 40-90 MeV
Authors:
Andrew S. Voyles,
Lee A. Bernstein,
Eva R. Birnbaum,
Jonathan W. Engle,
Stephen A. Graves,
Toshihiko Kawano,
Amanda M. Lewis,
Francois M. Nortier
Abstract:
A stack of thin Nb foils was irradiated with the 100 MeV proton beam at Los Alamos National Laboratory's Isotope Production Facility, to investigate the $^{93}$Nb(p,4n)$^{90}$Mo nuclear reaction as a monitor for intermediate energy proton experiments and to benchmark state-of-the-art reaction model codes. A set of 38 measured cross sections for $^{\text{nat}}$Nb(p,x) and $^{\text{nat}}$Cu(p,x) reactions between 40-90 MeV, as well as 5 independent measurements of isomer branching ratios, are reported. These are useful in medical and basic science radionuclide production at intermediate energies. The $^{\text{nat}}$Cu(p,x)$^{56}$Co, $^{\text{nat}}$Cu(p,x)$^{62}$Zn, and $^{\text{nat}}$Cu(p,x)$^{65}$Zn reactions were used to determine proton fluence, and all activities were quantified using HPGe spectrometry. Variance minimization techniques were employed to reduce systematic uncertainties in proton energy and fluence, improving the reliability of these measurements. The measured cross sections are shown to be in excellent agreement with literature values, and have been measured with improved precision compared with previous measurements. This work also reports the first measurement of the $^{\text{nat}}$Nb(p,x)$^{82\text{m}}$Rb reaction, and of the independent cross sections for $^{\text{nat}}$Cu(p,x)$^{52\text{g}}$Mn and $^{\text{nat}}$Nb(p,x)$^{85\text{g}}$Y in the 40-90 MeV region. The effects of $^{\text{nat}}$Si(p,x)$^{22,24}$Na contamination, arising from silicone adhesive in the Kapton tape used to encapsulate the aluminum monitor foils, are also discussed as a cautionary note to future stacked-target cross section measurements. \emph{A priori} predictions of the reaction modeling codes CoH, EMPIRE, and TALYS are compared with experimentally measured values and used to explore the differences between codes for the $^{\text{nat}}$Nb(p,x) and $^{\text{nat}}$Cu(p,x) reactions.
Submitted 21 June, 2018; v1 submitted 18 April, 2018;
originally announced April 2018.
-
Associative Compression Networks for Representation Learning
Authors:
Alex Graves,
Jacob Menick,
Aaron van den Oord
Abstract:
This paper introduces Associative Compression Networks (ACNs), a new framework for variational autoencoding with neural networks. The system differs from existing variational autoencoders (VAEs) in that the prior distribution used to model each code is conditioned on a similar code from the dataset. In compression terms this equates to sequentially transmitting the dataset using an ordering determined by proximity in latent space. Since the prior need only account for local, rather than global variations in the latent space, the coding cost is greatly reduced, leading to rich, informative codes. Crucially, the codes remain informative when powerful, autoregressive decoders are used, which we argue is fundamentally difficult with normal VAEs. Experimental results on MNIST, CIFAR-10, ImageNet and CelebA show that ACNs discover high-level latent features such as object class, writing style, pose and facial expression, which can be used to cluster and classify the data, as well as to generate diverse and convincing samples. We conclude that ACNs are a promising new direction for representation learning: one that steps away from IID modelling, and towards learning a structured description of the dataset as a whole.
Submitted 26 April, 2018; v1 submitted 6 April, 2018;
originally announced April 2018.
-
The Kanerva Machine: A Generative Distributed Memory
Authors:
Yan Wu,
Greg Wayne,
Alex Graves,
Timothy Lillicrap
Abstract:
We present an end-to-end trained memory system that quickly adapts to new data and generates samples like them. Inspired by Kanerva's sparse distributed memory, it has a robust distributed reading and writing mechanism. The memory is analytically tractable, which enables optimal on-line compression via a Bayesian update-rule. We formulate it as a hierarchical conditional generative model, where memory provides a rich data-dependent prior distribution. Consequently, the top-down memory and bottom-up perception are combined to produce the code representing an observation. Empirically, we demonstrate that the adaptive memory significantly improves generative models trained on both the Omniglot and CIFAR datasets. Compared with the Differentiable Neural Computer (DNC) and its variants, our memory model has greater capacity and is significantly easier to train.
Submitted 18 June, 2018; v1 submitted 5 April, 2018;
originally announced April 2018.
-
Parallel WaveNet: Fast High-Fidelity Speech Synthesis
Authors:
Aaron van den Oord,
Yazhe Li,
Igor Babuschkin,
Karen Simonyan,
Oriol Vinyals,
Koray Kavukcuoglu,
George van den Driessche,
Edward Lockhart,
Luis C. Cobo,
Florian Stimberg,
Norman Casagrande,
Dominik Grewe,
Seb Noury,
Sander Dieleman,
Erich Elsen,
Nal Kalchbrenner,
Heiga Zen,
Alex Graves,
Helen King,
Tom Walters,
Dan Belov,
Demis Hassabis
Abstract:
The recently-developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous system. However, because WaveNet relies on sequential generation of one audio sample at a time, it is poorly suited to today's massively parallel computers, and therefore hard to deploy in a real-time production setting. This paper introduces Probability Density Distillation, a new method for training a parallel feed-forward network from a trained WaveNet with no significant difference in quality. The resulting system is capable of generating high-fidelity speech samples more than 20 times faster than real time, and is deployed online by Google Assistant, including serving multiple English and Japanese voices.
Submitted 28 November, 2017;
originally announced November 2017.
-
Noisy Networks for Exploration
Authors:
Meire Fortunato,
Mohammad Gheshlaghi Azar,
Bilal Piot,
Jacob Menick,
Ian Osband,
Alex Graves,
Vlad Mnih,
Remi Munos,
Demis Hassabis,
Olivier Pietquin,
Charles Blundell,
Shane Legg
Abstract:
We introduce NoisyNet, a deep reinforcement learning agent with parametric noise added to its weights, and show that the induced stochasticity of the agent's policy can be used to aid efficient exploration. The parameters of the noise are learned with gradient descent along with the remaining network weights. NoisyNet is straightforward to implement and adds little computational overhead. We find that replacing the conventional exploration heuristics for A3C, DQN and dueling agents (entropy reward and $ε$-greedy respectively) with NoisyNet yields substantially higher scores for a wide range of Atari games, in some cases advancing the agent from sub to super-human performance.
Submitted 9 July, 2019; v1 submitted 30 June, 2017;
originally announced June 2017.
-
Swimming against the tide: Gender bias in the physics classroom
Authors:
Amy L. Graves,
Estuko Hoshino-Browne,
Kristine P. H. Lui
Abstract:
This study examines physics students' evaluations of identical, video-recorded lectures performed by female and male actors playing the role of professors. The results indicate that evaluations by male students show statistically significant overall biases with male professors rated more positively than female professors. Female students tended to be egalitarian, except in two areas. Female students evaluated female professors' interpersonal/communicative skills more positively than male professors'. They evaluated female professors' scientific knowledge and skills less positively than that of male professors just as male students did. These findings are relevant to two areas of research on bias in evaluation: rater-ratee similarity bias and stereotype confirmation bias. Results from this study have important implications for efforts focused on educating students and mentoring faculty members in order to increase the representation of women in the physical sciences.
Submitted 1 June, 2017; v1 submitted 26 May, 2017;
originally announced May 2017.
-
Automated Curriculum Learning for Neural Networks
Authors:
Alex Graves,
Marc G. Bellemare,
Jacob Menick,
Remi Munos,
Koray Kavukcuoglu
Abstract:
We introduce a method for automatically selecting the path, or syllabus, that a neural network follows through a curriculum so as to maximise learning efficiency. A measure of the amount that the network learns from each data sample is provided as a reward signal to a nonstationary multi-armed bandit algorithm, which then determines a stochastic syllabus. We consider a range of signals derived from two distinct indicators of learning progress: rate of increase in prediction accuracy, and rate of increase in network complexity. Experimental results for LSTM networks on three curricula demonstrate that our approach can significantly accelerate learning, in some cases halving the time required to attain a satisfactory performance level.
Submitted 10 April, 2017;
originally announced April 2017.
-
Neural Machine Translation in Linear Time
Authors:
Nal Kalchbrenner,
Lasse Espeholt,
Karen Simonyan,
Aaron van den Oord,
Alex Graves,
Koray Kavukcuoglu
Abstract:
We present a novel neural network for processing sequences. The ByteNet is a one-dimensional convolutional neural network that is composed of two parts, one to encode the source sequence and the other to decode the target sequence. The two network parts are connected by stacking the decoder on top of the encoder and preserving the temporal resolution of the sequences. To address the differing lengths of the source and the target, we introduce an efficient mechanism by which the decoder is dynamically unfolded over the representation of the encoder. The ByteNet uses dilation in the convolutional layers to increase its receptive field. The resulting network has two core properties: it runs in time that is linear in the length of the sequences and it sidesteps the need for excessive memorization. The ByteNet decoder attains state-of-the-art performance on character-level language modelling and outperforms the previous best results obtained with recurrent networks. The ByteNet also achieves state-of-the-art performance on character-to-character machine translation on the English-to-German WMT translation task, surpassing comparable neural translation models that are based on recurrent networks with attentional pooling and run in quadratic time. We find that the latent alignment structure contained in the representations reflects the expected alignment between the tokens.
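As a rough illustration of how dilation enlarges a receptive field, the sketch below computes the receptive field of a stack of stride-1 one-dimensional convolutions. The function name `receptive_field`, the kernel size and the dilation schedules are illustrative assumptions, not values taken from the paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in input positions, of stacked stride-1 1-D convolutions."""
    # A layer with dilation d widens the field by (kernel_size - 1) * d positions.
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical schedules (assumptions): five layers with kernel size 3.
undilated = receptive_field(3, [1, 1, 1, 1, 1])
doubling = receptive_field(3, [1, 2, 4, 8, 16])
print(undilated, doubling)
```

With the same depth and kernel size, the doubling schedule covers far more input positions than the undilated stack, which is the effect the abstract refers to.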
Submitted 15 March, 2017; v1 submitted 31 October, 2016;
originally announced October 2016.
-
Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes
Authors:
Jack W Rae,
Jonathan J Hunt,
Tim Harley,
Ivo Danihelka,
Andrew Senior,
Greg Wayne,
Alex Graves,
Timothy P Lillicrap
Abstract:
Neural networks augmented with external memory have the ability to learn algorithmic solutions to complex tasks. These models appear promising for applications such as language modeling and machine translation. However, they scale poorly in both space and time as the amount of memory grows --- limiting their applicability to real-world domains. Here, we present an end-to-end differentiable memory access scheme, which we call Sparse Access Memory (SAM), that retains the representational power of the original approaches whilst training efficiently with very large memories. We show that SAM achieves asymptotic lower bounds in space and time complexity, and find that an implementation runs $1,\!000\times$ faster and with $3,\!000\times$ less physical memory than non-sparse models. SAM learns with comparable data efficiency to existing models on a range of synthetic tasks and one-shot Omniglot character recognition, and can scale to tasks requiring $100,\!000$s of time steps and memories. As well, we show how our approach can be adapted for models that maintain temporal associations between memories, as with the recently introduced Differentiable Neural Computer.
Submitted 27 October, 2016;
originally announced October 2016.
-
Video Pixel Networks
Authors:
Nal Kalchbrenner,
Aaron van den Oord,
Karen Simonyan,
Ivo Danihelka,
Oriol Vinyals,
Alex Graves,
Koray Kavukcuoglu
Abstract:
We propose a probabilistic video model, the Video Pixel Network (VPN), that estimates the discrete joint distribution of the raw pixel values in a video. The model and the neural architecture reflect the time, space and color structure of video tensors and encode it as a four-dimensional dependency chain. The VPN approaches the best possible performance on the Moving MNIST benchmark, a leap over the previous state of the art, and the generated videos show only minor deviations from the ground truth. The VPN also produces detailed samples on the action-conditional Robotic Pushing benchmark and generalizes to the motion of novel objects.
Submitted 3 October, 2016;
originally announced October 2016.
-
WaveNet: A Generative Model for Raw Audio
Authors:
Aaron van den Oord,
Sander Dieleman,
Heiga Zen,
Karen Simonyan,
Oriol Vinyals,
Alex Graves,
Nal Kalchbrenner,
Andrew Senior,
Koray Kavukcuoglu
Abstract:
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
Submitted 19 September, 2016; v1 submitted 12 September, 2016;
originally announced September 2016.
-
Decoupled Neural Interfaces using Synthetic Gradients
Authors:
Max Jaderberg,
Wojciech Marian Czarnecki,
Simon Osindero,
Oriol Vinyals,
Alex Graves,
David Silver,
Koray Kavukcuoglu
Abstract:
Training directed neural networks typically requires forward-propagating data through a computation graph, followed by backpropagating error signal, to produce weight updates. All layers, or more generally, modules, of the network are therefore locked, in the sense that they must wait for the remainder of the network to execute forwards and propagate error backwards before they can be updated. In this work we break this constraint by decoupling modules by introducing a model of the future computation of the network graph. These models predict what the result of the modelled subgraph will produce using only local information. In particular we focus on modelling error gradients: by using the modelled synthetic gradient in place of true backpropagated error gradients we decouple subgraphs, and can update them independently and asynchronously i.e. we realise decoupled neural interfaces. We show results for feed-forward models, where every layer is trained asynchronously, recurrent neural networks (RNNs) where predicting one's future gradient extends the time over which the RNN can effectively model, and also a hierarchical RNN system with ticking at different timescales. Finally, we demonstrate that in addition to predicting gradients, the same framework can be used to predict inputs, resulting in models which are decoupled in both the forward and backwards pass -- amounting to independent networks which co-learn such that they can be composed into a single functioning corporation.
Submitted 3 July, 2017; v1 submitted 18 August, 2016;
originally announced August 2016.
-
Stochastic Backpropagation through Mixture Density Distributions
Authors:
Alex Graves
Abstract:
The ability to backpropagate stochastic gradients through continuous latent distributions has been crucial to the emergence of variational autoencoders and stochastic gradient variational Bayes. The key ingredient is an unbiased and low-variance way of estimating gradients with respect to distribution parameters from gradients evaluated at distribution samples. The "reparameterization trick" provides a class of transforms yielding such estimators for many continuous distributions, including the Gaussian and other members of the location-scale family. However the trick does not readily extend to mixture density models, due to the difficulty of reparameterizing the discrete distribution over mixture weights. This report describes an alternative transform, applicable to any continuous multivariate distribution with a differentiable density function from which samples can be drawn, and uses it to derive an unbiased estimator for mixture density weight derivatives. Combined with the reparameterization trick applied to the individual mixture components, this estimator makes it straightforward to train variational autoencoders with mixture-distributed latent variables, or to perform stochastic variational inference with a mixture density variational posterior.
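For readers unfamiliar with the reparameterization trick mentioned here, a minimal sketch for the Gaussian (location-scale) case follows. It does not implement the report's mixture-weight estimator; the test objective f(z) = z^2, the function name `pathwise_gradients` and all parameter values are assumptions chosen so that the exact gradients (2*mu and 2*sigma) are known.

```python
import numpy as np


def pathwise_gradients(mu, sigma, n_samples=200_000, seed=0):
    """Estimate d/dmu and d/dsigma of E[z**2] for z ~ N(mu, sigma**2)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)  # parameter-free noise
    z = mu + sigma * eps                  # reparameterized sample
    df_dz = 2.0 * z                       # gradient of the assumed objective f(z) = z**2
    grad_mu = np.mean(df_dz)              # chain rule: dz/dmu = 1
    grad_sigma = np.mean(df_dz * eps)     # chain rule: dz/dsigma = eps
    return grad_mu, grad_sigma


g_mu, g_sigma = pathwise_gradients(mu=1.5, sigma=0.5)
print(g_mu, g_sigma)  # should be close to 2*mu = 3.0 and 2*sigma = 1.0
```

The gradients with respect to the distribution parameters are obtained purely from gradients evaluated at samples, which is the property the abstract describes.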
Submitted 19 July, 2016;
originally announced July 2016.
-
Conditional Image Generation with PixelCNN Decoders
Authors:
Aaron van den Oord,
Nal Kalchbrenner,
Oriol Vinyals,
Lasse Espeholt,
Alex Graves,
Koray Kavukcuoglu
Abstract:
This work explores conditional image generation with a new image density model based on the PixelCNN architecture. The model can be conditioned on any vector, including descriptive labels or tags, or latent embeddings created by other networks. When conditioned on class labels from the ImageNet database, the model is able to generate diverse, realistic scenes representing distinct animals, objects, landscapes and structures. When conditioned on an embedding produced by a convolutional network given a single image of an unseen face, it generates a variety of new portraits of the same person with different facial expressions, poses and lighting conditions. We also show that conditional PixelCNN can serve as a powerful decoder in an image autoencoder. Additionally, the gated convolutional layers in the proposed model improve the log-likelihood of PixelCNN to match the state-of-the-art performance of PixelRNN on ImageNet, with greatly reduced computational cost.
Submitted 18 June, 2016; v1 submitted 16 June, 2016;
originally announced June 2016.
-
Strategic Attentive Writer for Learning Macro-Actions
Authors:
Alexander Vezhnevets,
Volodymyr Mnih,
John Agapiou,
Simon Osindero,
Alex Graves,
Oriol Vinyals,
Koray Kavukcuoglu
Abstract:
We present a novel deep recurrent neural network architecture that learns to build implicit plans in an end-to-end manner purely by interacting with an environment in a reinforcement learning setting. The network builds an internal plan, which is continuously updated upon observation of the next input from the environment. It can also partition this internal representation into contiguous sub-sequences by learning for how long the plan can be committed to, i.e. followed without re-planning. Combining these properties, the proposed model, dubbed STRategic Attentive Writer (STRAW), can learn high-level, temporally abstracted macro-actions of varying lengths that are learnt solely from data without any prior information. These macro-actions enable both structured exploration and economic computation. We experimentally demonstrate that STRAW delivers strong improvements on several ATARI games by employing temporally extended planning strategies (e.g. Ms. Pacman and Frostbite). It is at the same time a general algorithm that can be applied to any sequence data. To that end, we also show that when trained on a text prediction task, STRAW naturally predicts frequent n-grams (instead of macro-actions), demonstrating the generality of the approach.
Submitted 15 June, 2016;
originally announced June 2016.
-
Memory-Efficient Backpropagation Through Time
Authors:
Audrūnas Gruslys,
Remi Munos,
Ivo Danihelka,
Marc Lanctot,
Alex Graves
Abstract:
We propose a novel approach to reduce memory consumption of the backpropagation through time (BPTT) algorithm when training recurrent neural networks (RNNs). Our approach uses dynamic programming to balance a trade-off between caching of intermediate results and recomputation. The algorithm is capable of tightly fitting within almost any user-set memory budget while finding an optimal execution policy minimizing the computational cost. Computational devices have limited memory capacity, and maximizing computational performance given a fixed memory budget is a practical use-case. We provide asymptotic computational upper bounds for various regimes. The algorithm is particularly effective for long sequences. For sequences of length 1000, our algorithm saves 95% of memory usage while using only one third more time per iteration than the standard BPTT.
Submitted 10 June, 2016;
originally announced June 2016.
-
Adaptive Computation Time for Recurrent Neural Networks
Authors:
Alex Graves
Abstract:
This paper introduces Adaptive Computation Time (ACT), an algorithm that allows recurrent neural networks to learn how many computational steps to take between receiving an input and emitting an output. ACT requires minimal changes to the network architecture, is deterministic and differentiable, and does not add any noise to the parameter gradients. Experimental results are provided for four synthetic problems: determining the parity of binary vectors, applying binary logic operations, adding integers, and sorting real numbers. Overall, performance is dramatically improved by the use of ACT, which successfully adapts the number of computational steps to the requirements of the problem. We also present character-level language modelling results on the Hutter prize Wikipedia dataset. In this case ACT does not yield large gains in performance; however it does provide intriguing insight into the structure of the data, with more computation allocated to harder-to-predict transitions, such as spaces between words and ends of sentences. This suggests that ACT or other adaptive computation methods could provide a generic method for inferring segment boundaries in sequence data.
Submitted 21 February, 2017; v1 submitted 29 March, 2016;
originally announced March 2016.
-
Associative Long Short-Term Memory
Authors:
Ivo Danihelka,
Greg Wayne,
Benigno Uria,
Nal Kalchbrenner,
Alex Graves
Abstract:
We investigate a new method to augment recurrent neural networks with extra memory without increasing the number of network parameters. The system has an associative memory based on complex-valued vectors and is closely related to Holographic Reduced Representations and Long Short-Term Memory networks. Holographic Reduced Representations have limited capacity: as they store more information, each retrieval becomes noisier due to interference. Our system in contrast creates redundant copies of stored information, which enables retrieval with reduced noise. Experiments demonstrate faster learning on multiple memorization tasks.
Submitted 19 May, 2016; v1 submitted 9 February, 2016;
originally announced February 2016.
-
Asynchronous Methods for Deep Reinforcement Learning
Authors:
Volodymyr Mnih,
Adrià Puigdomènech Badia,
Mehdi Mirza,
Alex Graves,
Timothy P. Lillicrap,
Tim Harley,
David Silver,
Koray Kavukcuoglu
Abstract:
We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.
Submitted 16 June, 2016; v1 submitted 4 February, 2016;
originally announced February 2016.
-
Pinning Susceptibility: The effect of dilute, quenched disorder on jamming
Authors:
Amy L. Graves,
Samer Nashed,
Elliot Padgett,
Carl P. Goodrich,
Andrea J. Liu,
James P. Sethna
Abstract:
We study the effect of dilute pinning on the jamming transition. Pinning reduces the average contact number needed to jam unpinned particles and shifts the jamming threshold to lower densities, leading to a pinning susceptibility, $χ_p$. Our main results are that this susceptibility obeys scaling form and diverges in the thermodynamic limit as $χ_p \propto |φ- φ_c^\infty|^{-γ_p}$ where $φ_c^\infty$ is the jamming threshold in the absence of pins. Finite-size scaling arguments yield these values with associated statistical (systematic) errors $γ_p = 1.018 \pm 0.026 (0.291) $ in $d=2$ and $γ_p =1.534 \pm 0.120 (0.822)$ in $d=3$. Logarithmic corrections raise the exponent in $d=2$ to close to the $d=3$ value, although the systematic errors are very large.
Submitted 18 May, 2016; v1 submitted 2 September, 2015;
originally announced September 2015.
-
Grid Long Short-Term Memory
Authors:
Nal Kalchbrenner,
Ivo Danihelka,
Alex Graves
Abstract:
This paper introduces Grid Long Short-Term Memory, a network of LSTM cells arranged in a multidimensional grid that can be applied to vectors, sequences or higher dimensional data such as images. The network differs from existing deep LSTM architectures in that the cells are connected between network layers as well as along the spatiotemporal dimensions of the data. The network provides a unified way of using LSTM for both deep and sequential computation. We apply the model to algorithmic tasks such as 15-digit integer addition and sequence memorization, where it is able to significantly outperform the standard LSTM. We then give results for two empirical tasks. We find that 2D Grid LSTM achieves 1.47 bits per character on the Wikipedia character prediction benchmark, which is state-of-the-art among neural approaches. In addition, we use the Grid LSTM to define a novel two-dimensional translation model, the Reencoder, and show that it outperforms a phrase-based reference system on a Chinese-to-English translation task.
Submitted 7 January, 2016; v1 submitted 6 July, 2015;
originally announced July 2015.
-
DRAW: A Recurrent Neural Network For Image Generation
Authors:
Karol Gregor,
Ivo Danihelka,
Alex Graves,
Danilo Jimenez Rezende,
Daan Wierstra
Abstract:
This paper introduces the Deep Recurrent Attentive Writer (DRAW) neural network architecture for image generation. DRAW networks combine a novel spatial attention mechanism that mimics the foveation of the human eye, with a sequential variational auto-encoding framework that allows for the iterative construction of complex images. The system substantially improves on the state of the art for generative models on MNIST, and, when trained on the Street View House Numbers dataset, it generates images that cannot be distinguished from real data with the naked eye.
Submitted 20 May, 2015; v1 submitted 16 February, 2015;
originally announced February 2015.
-
Neural Turing Machines
Authors:
Alex Graves,
Greg Wayne,
Ivo Danihelka
Abstract:
We extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes. The combined system is analogous to a Turing Machine or Von Neumann architecture but is differentiable end-to-end, allowing it to be efficiently trained with gradient descent. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples.
Submitted 10 December, 2014; v1 submitted 20 October, 2014;
originally announced October 2014.
-
Recurrent Models of Visual Attention
Authors:
Volodymyr Mnih,
Nicolas Heess,
Alex Graves,
Koray Kavukcuoglu
Abstract:
Applying convolutional neural networks to large images is computationally expensive because the amount of computation scales linearly with the number of image pixels. We present a novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution. Like convolutional neural networks, the proposed model has a degree of translation invariance built-in, but the amount of computation it performs can be controlled independently of the input image size. While the model is non-differentiable, it can be trained using reinforcement learning methods to learn task-specific policies. We evaluate our model on several image classification tasks, where it significantly outperforms a convolutional neural network baseline on cluttered images, and on a dynamic visual control problem, where it learns to track a simple object without an explicit training signal for doing so.
Submitted 24 June, 2014;
originally announced June 2014.
-
Playing Atari with Deep Reinforcement Learning
Authors:
Volodymyr Mnih,
Koray Kavukcuoglu,
David Silver,
Alex Graves,
Ioannis Antonoglou,
Daan Wierstra,
Martin Riedmiller
Abstract:
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
Submitted 19 December, 2013;
originally announced December 2013.
-
Generating Sequences With Recurrent Neural Networks
Authors:
Alex Graves
Abstract:
This paper shows how Long Short-term Memory recurrent neural networks can be used to generate complex sequences with long-range structure, simply by predicting one data point at a time. The approach is demonstrated for text (where the data are discrete) and online handwriting (where the data are real-valued). It is then extended to handwriting synthesis by allowing the network to condition its predictions on a text sequence. The resulting system is able to generate highly realistic cursive handwriting in a wide variety of styles.
Submitted 5 June, 2014; v1 submitted 4 August, 2013;
originally announced August 2013.
-
Speech Recognition with Deep Recurrent Neural Networks
Authors:
Alex Graves,
Abdel-rahman Mohamed,
Geoffrey Hinton
Abstract:
Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.
Submitted 22 March, 2013;
originally announced March 2013.
-
Sequence Transduction with Recurrent Neural Networks
Authors:
Alex Graves
Abstract:
Many machine learning tasks can be expressed as the transformation, or transduction, of input sequences into output sequences: speech recognition, machine translation, protein secondary structure prediction and text-to-speech to name but a few. One of the key challenges in sequence transduction is learning to represent both the input and output sequences in a way that is invariant to sequential distortions such as shrinking, stretching and translating. Recurrent neural networks (RNNs) are a powerful sequence learning architecture that has proven capable of learning such representations. However RNNs traditionally require a pre-defined alignment between the input and output sequences to perform transduction. This is a severe limitation since finding the alignment is the most difficult aspect of many sequence transduction problems. Indeed, even determining the length of the output sequence is often challenging. This paper introduces an end-to-end, probabilistic sequence transduction system, based entirely on RNNs, that is in principle able to transform any input sequence into any finite, discrete output sequence. Experimental results for phoneme recognition are provided on the TIMIT speech corpus.
Submitted 14 November, 2012;
originally announced November 2012.
-
Statistical Common Author Networks (SCAN)
Authors:
F. G. Serpa,
Adam M. Graves,
Artjay Javier
Abstract:
A new method for visualizing the relatedness of scientific areas is developed that is based on measuring the overlap of researchers between areas. It is found that closely related areas have a high propensity to share a larger number of common authors. A methodology for comparing areas of vastly different sizes and for handling name homonymy is constructed, allowing for the robust deployment of this method on real data sets. A statistical analysis of the probability distributions of the common author overlap that accounts for noise is carried out, along with the production of network maps with weighted links proportional to the overlap strength. This is demonstrated on two case studies, complexity science and neutrino physics, where the level of relatedness of areas within each area is expected to vary greatly. It is found that the results returned by this method closely match the intuitive expectation that the broad, multidisciplinary area of complexity science possesses areas that are weakly related to each other while the much narrower area of neutrino physics shows very strongly related areas.
Submitted 8 March, 2013; v1 submitted 15 August, 2012;
originally announced August 2012.
-
Phoneme recognition in TIMIT with BLSTM-CTC
Authors:
Santiago Fernández,
Alex Graves,
Juergen Schmidhuber
Abstract:
We compare the performance of a recurrent neural network with the best results published so far on phoneme recognition in the TIMIT database. These published results have been obtained with a combination of classifiers. However, in this paper we apply a single recurrent neural network to the same task. Our recurrent neural network attains an error rate of 24.6%. This result is not significantly different from that obtained by the other best methods, but they rely on a combination of classifiers for achieving comparable performance.
Submitted 21 April, 2008;
originally announced April 2008.
-
Multi-Dimensional Recurrent Neural Networks
Authors:
Alex Graves,
Santiago Fernandez,
Juergen Schmidhuber
Abstract:
Recurrent neural networks (RNNs) have proved effective at one dimensional sequence learning tasks, such as speech and online handwriting recognition. Some of the properties that make RNNs suitable for such tasks, for example robustness to input warping, and the ability to access contextual information, are also desirable in multidimensional domains. However, there has so far been no direct way of applying RNNs to data with more than one spatio-temporal dimension. This paper introduces multi-dimensional recurrent neural networks (MDRNNs), thereby extending the potential applicability of RNNs to vision, video processing, medical imaging and many other areas, while avoiding the scaling problems that have plagued other multi-dimensional models. Experimental results are provided for two image segmentation tasks.
Submitted 14 May, 2007;
originally announced May 2007.