Showing 1–28 of 28 results for author: Mnih, V

  1. arXiv:2312.09187  [pdf, other]

    cs.LG

    Vision-Language Models as a Source of Rewards

    Authors: Kate Baumli, Satinder Baveja, Feryal Behbahani, Harris Chan, Gheorghe Comanici, Sebastian Flennerhag, Maxime Gazeau, Kristian Holsheimer, Dan Horgan, Michael Laskin, Clare Lyle, Hussain Masoom, Kay McKinney, Volodymyr Mnih, Alexander Neitz, Fabio Pardo, Jack Parker-Holder, John Quan, Tim Rocktäschel, Himanshu Sahni, Tom Schaul, Yannick Schroecker, Stephen Spencer, Richie Steigerwald, Luyu Wang, et al. (1 additional author not shown)

    Abstract: Building generalist agents that can accomplish many goals in rich open-ended environments is one of the research frontiers for reinforcement learning. A key limiting factor for building generalist agents with RL has been the need for a large number of reward functions for achieving different goals. We investigate the feasibility of using off-the-shelf vision-language models, or VLMs, as sources of…

    Submitted 21 February, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

    Comments: 10 pages, 5 figures

  2. arXiv:2210.14215  [pdf, other]

    cs.LG cs.AI

    In-context Reinforcement Learning with Algorithm Distillation

    Authors: Michael Laskin, Luyu Wang, Junhyuk Oh, Emilio Parisotto, Stephen Spencer, Richie Steigerwald, DJ Strouse, Steven Hansen, Angelos Filos, Ethan Brooks, Maxime Gazeau, Himanshu Sahni, Satinder Singh, Volodymyr Mnih

    Abstract: We propose Algorithm Distillation (AD), a method for distilling reinforcement learning (RL) algorithms into neural networks by modeling their training histories with a causal sequence model. Algorithm Distillation treats learning to reinforcement learn as an across-episode sequential prediction problem. A dataset of learning histories is generated by a source RL algorithm, and then a causal transf…

    Submitted 25 October, 2022; originally announced October 2022.
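
The Algorithm Distillation abstract frames learning to reinforcement learn as across-episode sequential prediction over learning histories produced by a source RL algorithm. The minimal Python sketch below shows only the data-preparation side of that idea, under our own assumptions: a learning history is a list of episodes in training order, each step is a hypothetical `(observation, action, reward)` tuple, and `make_ad_examples` and `context_len` are illustrative names, not the paper's. The causal transformer itself is omitted.

```python
from typing import List, Tuple

# Hypothetical step format: (observation, action, reward).
Step = Tuple[int, int, float]


def flatten_history(episodes: List[List[Step]]) -> List[Step]:
    """Concatenate episodes in training order so contexts can cross episode boundaries."""
    return [step for episode in episodes for step in episode]


def make_ad_examples(episodes: List[List[Step]], context_len: int):
    """Build (context, observation, target_action) triples for causal next-action prediction.

    The context holds up to `context_len` steps preceding step t; near an episode
    boundary it reaches back into earlier episodes of the same learning history.
    """
    steps = flatten_history(episodes)
    examples = []
    for t, (obs, action, _reward) in enumerate(steps):
        context = steps[max(0, t - context_len):t]
        examples.append((context, obs, action))
    return examples


# Two tiny episodes from a hypothetical source algorithm that improves over training.
history = [
    [(0, 1, 0.0), (1, 0, 0.0)],  # early episode: no reward
    [(0, 0, 1.0), (1, 1, 1.0)],  # later episode: rewarded
]
examples = make_ad_examples(history, context_len=3)
print(len(examples))    # 4
print(examples[2][0])   # [(0, 1, 0.0), (1, 0, 0.0)]
```

The third example's context consists entirely of the earlier episode, which is what allows a sequence model to condition on how the source algorithm's behaviour changed from one episode to the next.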

  3. arXiv:2210.10913  [pdf, other]

    cs.LG cs.AI

    Palm up: Playing in the Latent Manifold for Unsupervised Pretraining

    Authors: Hao Liu, Tom Zahavy, Volodymyr Mnih, Satinder Singh

    Abstract: Large and diverse datasets have been the cornerstones of many impressive advancements in artificial intelligence. Intelligent creatures, however, learn by interacting with the environment, which changes the input sensory signals and the state of the environment. In this work, we aim to bring the best of both worlds and propose an algorithm that exhibits an exploratory behavior whilst it utilizes l…

    Submitted 21 October, 2022; v1 submitted 19 October, 2022; originally announced October 2022.

    Comments: Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2022)

  4. arXiv:2110.15331  [pdf, other]

    cs.LG cs.AI

    Wasserstein Distance Maximizing Intrinsic Control

    Authors: Ishan Durugkar, Steven Hansen, Stephen Spencer, Volodymyr Mnih

    Abstract: This paper deals with the problem of learning a skill-conditioned policy that acts meaningfully in the absence of a reward signal. Mutual information based objectives have shown some success in learning skills that reach a diverse set of states in this setting. These objectives include a KL-divergence term, which is maximized by visiting distinct states even if those states are not far apart in th…

    Submitted 28 October, 2021; originally announced October 2021.

  5. arXiv:2107.14226  [pdf, other]

    cs.LG cs.AI stat.ML

    Learning more skills through optimistic exploration

    Authors: DJ Strouse, Kate Baumli, David Warde-Farley, Vlad Mnih, Steven Hansen

    Abstract: Unsupervised skill learning objectives (Gregor et al., 2016, Eysenbach et al., 2018) allow agents to learn rich repertoires of behavior in the absence of extrinsic rewards. They work by simultaneously training a policy to produce distinguishable latent-conditioned trajectories, and a discriminator to evaluate distinguishability by trying to infer latents from trajectories. The hope is for the agen…

    Submitted 12 May, 2022; v1 submitted 29 July, 2021; originally announced July 2021.

    Comments: Accepted at ICLR 2022 (spotlight)

  6. arXiv:2106.00669  [pdf, other]

    cs.AI cs.LG stat.ML

    Discovering Diverse Nearly Optimal Policies with Successor Features

    Authors: Tom Zahavy, Brendan O'Donoghue, Andre Barreto, Volodymyr Mnih, Sebastian Flennerhag, Satinder Singh

    Abstract: Finding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose Diverse Successive Policies, a method for discovering policies that are diverse in the space of Successor Features, while ass…

    Submitted 4 January, 2022; v1 submitted 1 June, 2021; originally announced June 2021.

  7. arXiv:2012.07827  [pdf, other]

    cs.AI cs.LG

    Relative Variational Intrinsic Control

    Authors: Kate Baumli, David Warde-Farley, Steven Hansen, Volodymyr Mnih

    Abstract: In the absence of external rewards, agents can still learn useful behaviors by identifying and mastering a set of diverse skills within their environment. Existing skill learning methods use mutual information objectives to incentivize each skill to be diverse and distinguishable from the rest. However, if care is not taken to constrain the ways in which the skills are diverse, trivially diverse s…

    Submitted 14 December, 2020; originally announced December 2020.

    Comments: Accepted by AAAI 2021

  8. arXiv:2001.08116  [pdf, other]

    cs.LG cs.AI stat.ML

    Q-Learning in enormous action spaces via amortized approximate maximization

    Authors: Tom Van de Wiele, David Warde-Farley, Andriy Mnih, Volodymyr Mnih

    Abstract: Applying Q-learning to high-dimensional or continuous action spaces can be difficult due to the required maximization over the set of possible actions. Motivated by techniques from amortized inference, we replace the expensive maximization over all actions with a maximization over a small subset of possible actions sampled from a learned proposal distribution. The resulting approach, which we dub…

    Submitted 22 January, 2020; originally announced January 2020.

    Comments: A previous version of this work appeared at the Deep Reinforcement Learning Workshop, NeurIPS 2018
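
The Q-learning entry above replaces the maximization over all actions with a maximization over a small subset of actions sampled from a learned proposal distribution. The following Python sketch illustrates only that action-selection step for a 1-D continuous action space. The toy `q_fn`, the Gaussian `proposal`, the extra uniform samples, and all parameter values are our assumptions; training the proposal itself is not shown.

```python
import random
from typing import Callable, List


def amortized_argmax(
    q: Callable[[float], float],
    propose: Callable[[random.Random], float],
    n_proposal: int,
    n_uniform: int,
    low: float,
    high: float,
    rng: random.Random,
) -> float:
    """Approximate argmax_a Q(s, a) by evaluating Q only on a small sampled set of actions."""
    # Candidates from the (hypothetical) learned proposal distribution...
    candidates: List[float] = [propose(rng) for _ in range(n_proposal)]
    # ...plus a few uniform samples so the whole action range keeps some coverage.
    candidates += [rng.uniform(low, high) for _ in range(n_uniform)]
    # Clip into the valid action range before scoring.
    candidates = [min(max(a, low), high) for a in candidates]
    return max(candidates, key=q)


def q_fn(a: float) -> float:
    # Toy Q-function for one fixed state; its true maximizer is a = 0.7.
    return -(a - 0.7) ** 2


def proposal(rng: random.Random) -> float:
    # Hypothetical proposal that has partly learned where good actions lie.
    return rng.gauss(0.6, 0.2)


rng = random.Random(0)
best = amortized_argmax(q_fn, proposal, n_proposal=16, n_uniform=4, low=-1.0, high=1.0, rng=rng)
print(best)
```

Because the candidate set has a fixed size, the cost of the max no longer grows with the size of the action space; the trade-off is that the result is only as good as the proposal's coverage of high-value actions.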

  9. arXiv:1906.11883  [pdf, other]

    cs.CV cs.LG

    Unsupervised Learning of Object Keypoints for Perception and Control

    Authors: Tejas Kulkarni, Ankush Gupta, Catalin Ionescu, Sebastian Borgeaud, Malcolm Reynolds, Andrew Zisserman, Volodymyr Mnih

    Abstract: The study of object representations in computer vision has primarily focused on developing representations that are useful for image classification, object detection, or semantic segmentation as downstream tasks. In this work we aim to learn object representations that are useful for control and reinforcement learning (RL). To this end, we introduce Transporter, a neural network architecture for d…

    Submitted 19 November, 2019; v1 submitted 19 June, 2019; originally announced June 2019.

    Comments: In NeurIPS 2019. Code https://github.com/deepmind/deepmind-research/tree/master/transporter

  10. arXiv:1906.05030  [pdf, other]

    cs.LG cs.AI stat.ML

    Fast Task Inference with Variational Intrinsic Successor Features

    Authors: Steven Hansen, Will Dabney, Andre Barreto, Tom Van de Wiele, David Warde-Farley, Volodymyr Mnih

    Abstract: It has been established that diverse behaviors spanning the controllable subspace of a Markov decision process can be trained by rewarding a policy for being distinguishable from other policies (Gregor et al., 2016; Eysenbach et al., 2018; Warde-Farley et al., 2018). However, one limitation of this formulation is generalizing behaviors beyond the finite set being explicitly learned, as is nee…

    Submitted 27 January, 2020; v1 submitted 12 June, 2019; originally announced June 2019.

    Comments: Accepted for publication at ICLR 2020

  11. arXiv:1811.11359  [pdf, other]

    cs.LG cs.AI stat.ML

    Unsupervised Control Through Non-Parametric Discriminative Rewards

    Authors: David Warde-Farley, Tom Van de Wiele, Tejas Kulkarni, Catalin Ionescu, Steven Hansen, Volodymyr Mnih

    Abstract: Learning to control an environment without hand-crafted rewards or expert data remains challenging and is at the frontier of reinforcement learning research. We present an unsupervised learning algorithm to train agents to achieve perceptually-specified goals using only a stream of observations and actions. Our agent simultaneously learns a goal-conditioned policy and a goal achievement reward fun…

    Submitted 27 November, 2018; originally announced November 2018.

    Comments: 10 pages + references & 5 page appendix

  12. arXiv:1802.10567  [pdf, other]

    cs.LG cs.RO stat.ML

    Learning by Playing - Solving Sparse Reward Tasks from Scratch

    Authors: Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Van de Wiele, Volodymyr Mnih, Nicolas Heess, Jost Tobias Springenberg

    Abstract: We propose Scheduled Auxiliary Control (SAC-X), a new learning paradigm in the context of Reinforcement Learning (RL). SAC-X enables learning of complex behaviors - from scratch - in the presence of multiple sparse reward signals. To this end, the agent is equipped with a set of general auxiliary tasks, that it attempts to learn simultaneously via off-policy RL. The key idea behind our method is t…

    Submitted 28 February, 2018; originally announced February 2018.

    Comments: A video of the rich set of learned behaviours can be found at https://youtu.be/mPKyvocNe_M

  13. arXiv:1802.01561  [pdf, other]

    cs.LG cs.AI

    IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

    Authors: Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, Koray Kavukcuoglu

    Abstract: In this work we aim to solve a large collection of tasks using a single reinforcement learning agent with a single set of parameters. A key challenge is to handle the increased amount of data and extended training time. We have developed a new distributed agent IMPALA (Importance Weighted Actor-Learner Architecture) that not only uses resources more efficiently in single-machine training but also…

    Submitted 28 June, 2018; v1 submitted 5 February, 2018; originally announced February 2018.
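
The IMPALA abstract is cut off before it explains how the learner corrects for the lag between actor and learner policies; in the full paper this importance-weighted off-policy correction is called V-trace. Below is a minimal single-trajectory sketch of V-trace value targets in plain Python, written from the backward-recursive form of that correction. The argument names, the default truncation levels `rho_bar = c_bar = 1.0`, and the list-based interface are our choices, not a reference implementation.

```python
import math
from typing import List


def vtrace_targets(
    rewards: List[float],
    values: List[float],
    bootstrap_value: float,
    log_rhos: List[float],
    gamma: float = 0.99,
    rho_bar: float = 1.0,
    c_bar: float = 1.0,
) -> List[float]:
    """V-trace value targets v_s for one trajectory of length T.

    log_rhos[t] = log pi(a_t | x_t) - log mu(a_t | x_t), comparing the learner policy pi
    with the behaviour policy mu that generated the data. Computed backwards via
        v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1})),
        delta_s = rho_s * (r_s + gamma * V(x_{s+1}) - V(x_s)),
    with truncated ratios rho_s = min(rho_bar, ratio_s) and c_s = min(c_bar, ratio_s).
    """
    T = len(rewards)
    next_values = values[1:] + [bootstrap_value]
    targets = [0.0] * T
    acc = 0.0  # v_{s+1} - V(x_{s+1}); zero past the end, where we bootstrap from V
    for s in reversed(range(T)):
        ratio = math.exp(log_rhos[s])
        rho = min(rho_bar, ratio)
        c = min(c_bar, ratio)
        delta = rho * (rewards[s] + gamma * next_values[s] - values[s])
        acc = delta + gamma * c * acc
        targets[s] = values[s] + acc
    return targets


# On-policy data (all log-ratios zero): V-trace reduces to n-step bootstrapped returns.
targets = vtrace_targets([1.0, 2.0], [0.5, 1.0], bootstrap_value=3.0, log_rhos=[0.0, 0.0], gamma=0.9)
print(targets)  # approximately [5.23, 4.7], i.e. 1 + 0.9*2 + 0.81*3 and 2 + 0.9*3
```

Truncating the ratios at `rho_bar` and `c_bar` bounds the variance of the correction; when the behaviour policy's actions are very unlikely under the learner policy, the targets fall back toward the current value estimates.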

  14. arXiv:1709.05380  [pdf, other]

    cs.AI cs.LG math.OC stat.ML

    The Uncertainty Bellman Equation and Exploration

    Authors: Brendan O'Donoghue, Ian Osband, Remi Munos, Volodymyr Mnih

    Abstract: We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any time-step to the expected value at subsequent time-steps. In this paper we consider a similar uncertainty Bellman equation (UBE), which connects the uncertainty at any time-step to the expected uncertainties at subsequent time-s…

    Submitted 22 October, 2018; v1 submitted 15 September, 2017; originally announced September 2017.
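
The UBE entry relates the uncertainty at one time-step to the expected uncertainties at later time-steps, by analogy with the Bellman equation for values. A hedged tabular sketch of such a recursion, solved by fixed-point iteration, follows. The local uncertainty term `nu` is taken as given, and the `gamma ** 2` discount factor (chosen because a quantity discounted by gamma has its variance scaled by gamma squared) is our assumption about the discounted form; consult the paper for its exact statement.

```python
from typing import List

Matrix = List[List[float]]


def solve_ube(
    nu: Matrix,                   # nu[s][a]: local one-step uncertainty (assumed given)
    P: List[List[List[float]]],   # P[s][a][s2]: transition probabilities
    pi: Matrix,                   # pi[s2][a2]: policy being evaluated
    gamma: float,
    iters: int = 500,
) -> Matrix:
    """Fixed-point iteration for the tabular recursion
        u(s, a) = nu(s, a) + gamma**2 * sum_s2 P(s2 | s, a) * sum_a2 pi(a2 | s2) * u(s2, a2).
    The gamma**2 factor is an assumption for the discounted case.
    """
    n_s, n_a = len(nu), len(nu[0])
    u = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        # Expected uncertainty of the next state-action pair under pi, per next state.
        v = [sum(pi[s2][a2] * u[s2][a2] for a2 in range(n_a)) for s2 in range(n_s)]
        u = [
            [nu[s][a] + gamma ** 2 * sum(P[s][a][s2] * v[s2] for s2 in range(n_s))
             for a in range(n_a)]
            for s in range(n_s)
        ]
    return u


# One state, one action, self-loop: the fixed point is u = nu / (1 - gamma**2).
u = solve_ube(nu=[[1.0]], P=[[[1.0]]], pi=[[1.0]], gamma=0.9)
print(u[0][0])  # approximately 5.263, i.e. 1 / (1 - 0.81)
```

The structure mirrors policy evaluation exactly; only the per-step term (uncertainty instead of reward) and the discount differ, which is the analogy the abstract draws.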

  15. arXiv:1706.10295  [pdf, other]

    cs.LG stat.ML

    Noisy Networks for Exploration

    Authors: Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, Shane Legg

    Abstract: We introduce NoisyNet, a deep reinforcement learning agent with parametric noise added to its weights, and show that the induced stochasticity of the agent's policy can be used to aid efficient exploration. The parameters of the noise are learned with gradient descent along with the remaining network weights. NoisyNet is straightforward to implement and adds little computational overhead. We find…

    Submitted 9 July, 2019; v1 submitted 30 June, 2017; originally announced June 2017.

    Comments: ICLR 2018
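
The NoisyNet abstract describes adding learned parametric noise to network weights so that the policy's own stochasticity drives exploration. The following plain-Python sketch shows the forward pass of one noisy linear layer with factorised Gaussian noise, f(x) = sgn(x)·sqrt(|x|). The initialization constants (`sigma0 = 0.5`, uniform bounds of 1/sqrt(n_in)) follow our reading of the paper and should be treated as assumptions; the gradient updates for `mu_*` and `sigma_*` are omitted.

```python
import math
import random
from typing import List


def f(x: float) -> float:
    # Scaling function for factorised Gaussian noise: sign(x) * sqrt(|x|).
    return math.copysign(math.sqrt(abs(x)), x)


class NoisyLinear:
    """Linear layer y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b).

    mu_* and sigma_* would be learned by gradient descent; only the forward pass is shown.
    Noise is factorised: eps_w[i][j] = f(eps_out[i]) * f(eps_in[j]).
    """

    def __init__(self, n_in: int, n_out: int, sigma0: float = 0.5, seed: int = 0):
        self.rng = random.Random(seed)
        self.n_in, self.n_out = n_in, n_out
        bound = 1.0 / math.sqrt(n_in)
        self.mu_w = [[self.rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
        self.mu_b = [self.rng.uniform(-bound, bound) for _ in range(n_out)]
        self.sigma_w = [[sigma0 / math.sqrt(n_in)] * n_in for _ in range(n_out)]
        self.sigma_b = [sigma0 / math.sqrt(n_in)] * n_out
        self.resample()

    def resample(self) -> None:
        # Only n_in + n_out Gaussian draws are needed instead of n_in * n_out.
        eps_in = [f(self.rng.gauss(0.0, 1.0)) for _ in range(self.n_in)]
        eps_out = [f(self.rng.gauss(0.0, 1.0)) for _ in range(self.n_out)]
        self.eps_w = [[eps_out[i] * eps_in[j] for j in range(self.n_in)] for i in range(self.n_out)]
        self.eps_b = eps_out

    def __call__(self, x: List[float], noisy: bool = True) -> List[float]:
        out = []
        for i in range(self.n_out):
            acc = self.mu_b[i] + (self.sigma_b[i] * self.eps_b[i] if noisy else 0.0)
            for j in range(self.n_in):
                w = self.mu_w[i][j] + (self.sigma_w[i][j] * self.eps_w[i][j] if noisy else 0.0)
                acc += w * x[j]
            out.append(acc)
        return out


layer = NoisyLinear(n_in=3, n_out=2)
x = [1.0, -0.5, 2.0]
noisy_a = layer(x)
layer.resample()                  # draw fresh noise, e.g. once per environment step
noisy_b = layer(x)
mean_out = layer(x, noisy=False)  # noise-free output from the mean weights
print(noisy_a, noisy_b, mean_out)
```

Since `sigma_w` and `sigma_b` are parameters, training can shrink or grow the noise per weight, which is how the agent can learn how much to explore.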

  16. arXiv:1611.05397  [pdf, other]

    cs.LG cs.NE

    Reinforcement Learning with Unsupervised Auxiliary Tasks

    Authors: Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, Koray Kavukcuoglu

    Abstract: Deep reinforcement learning agents have achieved state-of-the-art results by directly maximising cumulative reward. However, environments contain a much wider variety of possible training signals. In this paper, we introduce an agent that also maximises many other pseudo-reward functions simultaneously by reinforcement learning. All of these tasks share a common representation that, like unsupervi…

    Submitted 16 November, 2016; originally announced November 2016.

  17. arXiv:1611.01626  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    Combining policy gradient and Q-learning

    Authors: Brendan O'Donoghue, Remi Munos, Koray Kavukcuoglu, Volodymyr Mnih

    Abstract: Policy gradient is an efficient technique for improving a policy in a reinforcement learning setting. However, vanilla online variants are on-policy only and not able to take advantage of off-policy data. In this paper we describe a new technique that combines policy gradient with off-policy Q-learning, drawing experience from a replay buffer. This is motivated by making a connection between the f…

    Submitted 7 April, 2017; v1 submitted 5 November, 2016; originally announced November 2016.

  18. arXiv:1611.01224  [pdf, other]

    cs.LG

    Sample Efficient Actor-Critic with Experience Replay

    Authors: Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, Nando de Freitas

    Abstract: This paper presents an actor-critic deep reinforcement learning agent with experience replay that is stable, sample efficient, and performs remarkably well on challenging environments, including the discrete 57-game Atari domain and several continuous control problems. To achieve this, the paper introduces several innovations, including truncated importance sampling with bias correction, stochasti…

    Submitted 10 July, 2017; v1 submitted 3 November, 2016; originally announced November 2016.

    Comments: 20 pages. Prepared for ICLR 2017

  19. arXiv:1610.06258  [pdf, other]

    stat.ML cs.LG cs.NE

    Using Fast Weights to Attend to the Recent Past

    Authors: Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Z. Leibo, Catalin Ionescu

    Abstract: Until recently, research on artificial neural networks was largely restricted to systems with only two types of variable: Neural activities that represent the current or recent input and weights that learn to capture regularities among inputs, outputs and payoffs. There is no good reason for this restriction. Synapses have dynamics at many different time-scales and this suggests that artificial ne…

    Submitted 4 December, 2016; v1 submitted 19 October, 2016; originally announced October 2016.

    Comments: Added [Schmidhuber 1993] citation to the last paragraph of the introduction. Fixed a typo in appendix A.1: uniform initialization changed to 1/√H
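
The fast-weights abstract argues for a third kind of variable that changes faster than slow weights but persists longer than activities. A minimal sketch of such a variable is the decaying Hebbian memory A(t) = λA(t−1) + ηh(t)h(t)ᵀ used in the paper; reading it out with a query q returns a sum over past hidden states weighted by recency and by similarity to q, i.e. attention to the recent past. The function names, parameter names `decay` (λ) and `eta` (η), and the list-of-lists matrices are our own; the recurrent inner loop and layer normalization that the paper combines with this memory are not shown.

```python
from typing import List

Matrix = List[List[float]]


def fast_weight_update(A: Matrix, h: List[float], decay: float, eta: float) -> Matrix:
    """A(t) = decay * A(t-1) + eta * h(t) h(t)^T: a decaying outer-product (Hebbian) memory."""
    n = len(h)
    return [[decay * A[i][j] + eta * h[i] * h[j] for j in range(n)] for i in range(n)]


def attend(A: Matrix, q: List[float]) -> List[float]:
    """Return A q, i.e. sum over past tau of eta * decay**(t - tau) * h(tau) * <h(tau), q>."""
    return [sum(A[i][j] * q[j] for j in range(len(q))) for i in range(len(A))]


# Store two orthogonal hidden states, oldest first.
A = [[0.0, 0.0], [0.0, 0.0]]
for h in ([1.0, 0.0], [0.0, 1.0]):
    A = fast_weight_update(A, h, decay=0.5, eta=1.0)

print(attend(A, [1.0, 0.0]))  # [0.5, 0.0]: the older state is recalled at half strength
print(attend(A, [0.0, 1.0]))  # [0.0, 1.0]: the most recent state is recalled fully
```

The readout never stores the past states explicitly; the outer-product memory costs O(n²) regardless of sequence length, which is the main difference from attention over an explicit history.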

  20. arXiv:1606.04695  [pdf, other]

    cs.AI cs.LG

    Strategic Attentive Writer for Learning Macro-Actions

    Authors: Alexander Vezhnevets, Volodymyr Mnih, John Agapiou, Simon Osindero, Alex Graves, Oriol Vinyals, Koray Kavukcuoglu

    Abstract: We present a novel deep recurrent neural network architecture that learns to build implicit plans in an end-to-end manner by purely interacting with an environment in a reinforcement learning setting. The network builds an internal plan, which is continuously updated upon observation of the next input from the environment. It can also partition this internal representation into contiguous sub-seque…

    Submitted 15 June, 2016; originally announced June 2016.

  21. arXiv:1602.07714  [pdf, other]

    cs.LG cs.AI cs.NE stat.ML

    Learning values across many orders of magnitude

    Authors: Hado van Hasselt, Arthur Guez, Matteo Hessel, Volodymyr Mnih, David Silver

    Abstract: Most learning algorithms are not invariant to the scale of the function that is being approximated. We propose to adaptively normalize the targets used in learning. This is useful in value-based reinforcement learning, where the magnitude of appropriate value approximations can change over time when we update the policy of behavior. Our main motivation is prior work on learning to play Atari games…

    Submitted 16 August, 2016; v1 submitted 24 February, 2016; originally announced February 2016.

    Comments: Paper accepted for publication at NIPS 2016. This version includes the appendix
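
The entry above proposes adaptively normalizing learning targets; the full paper calls its method PopArt (Preserving Outputs Precisely, while Adaptively Rescaling Targets). Here is a plain-Python sketch of its two pieces for a linear value head: exponentially averaged target statistics, and a rescaling of the head's weights so that unnormalized predictions are unchanged whenever those statistics move. The step size `beta`, the variance floor, and the class name `PopArtHead` are assumptions made for illustration.

```python
import math
from typing import List


class PopArtHead:
    """Linear value head trained on adaptively normalised targets (PopArt-style sketch).

    Normalised output: n(x) = w . x + b; unnormalised output: sigma * n(x) + mu.
    When mu and sigma change, w and b are rescaled so the unnormalised output is unchanged.
    """

    def __init__(self, n_features: int, beta: float = 0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.mu, self.nu = 0.0, 1.0  # running first and second moments of the targets
        self.beta = beta

    @property
    def sigma(self) -> float:
        return math.sqrt(max(self.nu - self.mu ** 2, 1e-8))

    def unnormalised(self, x: List[float]) -> float:
        n = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return self.sigma * n + self.mu

    def update_stats(self, target: float) -> None:
        old_mu, old_sigma = self.mu, self.sigma
        self.mu = (1 - self.beta) * self.mu + self.beta * target
        self.nu = (1 - self.beta) * self.nu + self.beta * target ** 2
        new_sigma = self.sigma
        # Preserve outputs: sigma_new * (w'.x + b') + mu_new == sigma_old * (w.x + b) + mu_old.
        self.w = [wi * old_sigma / new_sigma for wi in self.w]
        self.b = (old_sigma * self.b + old_mu - self.mu) / new_sigma

    def normalised_target(self, target: float) -> float:
        # Regression target for the head, on the normalised scale.
        return (target - self.mu) / self.sigma


head = PopArtHead(n_features=2)
head.w, head.b = [0.3, -0.2], 0.1     # pretend these were already learned
x = [1.0, 2.0]
before = head.unnormalised(x)
head.update_stats(100.0)               # a target far larger than anything seen so far
after = head.unnormalised(x)
print(before, after)                   # equal up to floating-point rounding
print(head.normalised_target(100.0))
```

Without the output-preserving step, a sudden jump in target scale would silently change every prediction the head makes; with it, only the subsequent gradient updates move the predictions.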

  22. arXiv:1602.01783  [pdf, other]

    cs.LG

    Asynchronous Methods for Deep Reinforcement Learning

    Authors: Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu

    Abstract: We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural n…

    Submitted 16 June, 2016; v1 submitted 4 February, 2016; originally announced February 2016.

    Journal ref: ICML 2016

  23. arXiv:1511.06295  [pdf, other]

    cs.LG

    Policy Distillation

    Authors: Andrei A. Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, Raia Hadsell

    Abstract: Policies for complex visual tasks have been successfully learned with deep reinforcement learning, using an approach called deep Q-networks (DQN), but relatively large (task-specific) networks and extensive training are needed to achieve good performance. In this work, we present a novel method called policy distillation that can be used to extract the policy of a reinforcement learning agent and…

    Submitted 7 January, 2016; v1 submitted 19 November, 2015; originally announced November 2015.

    Comments: Submitted to ICLR 2016
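
Policy distillation transfers a trained agent's policy into another network. One loss the full paper considers is a KL divergence between a temperature-sharpened softmax of the teacher DQN's Q-values and the student's output distribution; a single-state version in plain Python follows. The function names and the example temperature are our own choices.

```python
import math
from typing import List


def softmax(z: List[float], temperature: float = 1.0) -> List[float]:
    scaled = [v / temperature for v in z]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(teacher_q: List[float], student_logits: List[float], tau: float) -> float:
    """KL(softmax(teacher_q / tau) || softmax(student_logits)) for one state.

    A low temperature tau turns the teacher's Q-values into a peaked target distribution.
    """
    p = softmax(teacher_q, tau)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


teacher_q = [1.0, 2.0, 3.0]
print(distillation_kl(teacher_q, [0.0, 0.0, 10.0], tau=0.01))  # small: student agrees on the best action
print(distillation_kl(teacher_q, [10.0, 0.0, 0.0], tau=0.01))  # large: student prefers a poor action
```

Sharpening the teacher matters because raw Q-values for different actions are often close in magnitude; a softmax at temperature 1 would hand the student an almost uniform target.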

  24. arXiv:1507.04296  [pdf, other]

    cs.LG cs.AI cs.DC cs.NE

    Massively Parallel Methods for Deep Reinforcement Learning

    Authors: Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, Shane Legg, Volodymyr Mnih, Koray Kavukcuoglu, David Silver

    Abstract: We present the first massively distributed architecture for deep reinforcement learning. This architecture uses four main components: parallel actors that generate new behaviour; parallel learners that are trained from stored experience; a distributed neural network to represent the value function or behaviour policy; and a distributed store of experience. We used our architecture to implement the…

    Submitted 16 July, 2015; v1 submitted 15 July, 2015; originally announced July 2015.

    Comments: Presented at the Deep Learning Workshop, International Conference on Machine Learning, Lille, France, 2015

  25. arXiv:1412.7755  [pdf, other]

    cs.LG cs.CV cs.NE

    Multiple Object Recognition with Visual Attention

    Authors: Jimmy Ba, Volodymyr Mnih, Koray Kavukcuoglu

    Abstract: We present an attention-based model for recognizing multiple objects in images. The proposed model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image. We show that the model learns to both localize and recognize multiple objects despite being given only class labels during training. We evaluate the model on the challengi…

    Submitted 23 April, 2015; v1 submitted 24 December, 2014; originally announced December 2014.

  26. arXiv:1406.6247  [pdf, other]

    cs.LG cs.CV stat.ML

    Recurrent Models of Visual Attention

    Authors: Volodymyr Mnih, Nicolas Heess, Alex Graves, Koray Kavukcuoglu

    Abstract: Applying convolutional neural networks to large images is computationally expensive because the amount of computation scales linearly with the number of image pixels. We present a novel recurrent neural network model that is capable of extracting information from an image or video by adaptively selecting a sequence of regions or locations and only processing the selected regions at high resolution…

    Submitted 24 June, 2014; originally announced June 2014.

  27. arXiv:1312.5602  [pdf, other]

    cs.LG

    Playing Atari with Deep Reinforcement Learning

    Authors: Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller

    Abstract: We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning E…

    Submitted 19 December, 2013; originally announced December 2013.

    Comments: NIPS Deep Learning Workshop 2013
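
The DQN entry describes a convolutional network trained with a variant of Q-learning. The sketch below keeps only the update logic and substitutes a lookup table for the convolutional network. Minibatches are drawn from a replay buffer (the paper's experience replay) and regressed toward y = r for terminal transitions and y = r + γ·max_a' Q(s', a') otherwise. The toy two-state problem, the learning rate, and the function names are our assumptions.

```python
import random
from collections import deque
from typing import Deque, Dict, List, Tuple

# (state, action, reward, next_state, done)
Transition = Tuple[int, int, float, int, bool]
QTable = Dict[int, List[float]]  # stand-in for the convolutional Q-network


def q_learning_targets(batch: List[Transition], q: QTable, gamma: float) -> List[float]:
    """y = r for terminal transitions, otherwise y = r + gamma * max_a' Q(s', a')."""
    return [r if done else r + gamma * max(q[s2]) for (_s, _a, r, s2, done) in batch]


def train_step(q: QTable, replay: Deque[Transition], batch_size: int,
               gamma: float, lr: float, rng: random.Random) -> None:
    """Sample a minibatch from replay and move Q(s, a) toward its target."""
    batch = rng.sample(list(replay), min(batch_size, len(replay)))
    targets = q_learning_targets(batch, q, gamma)  # all targets use the pre-update Q
    for (s, a, _r, _s2, _done), y in zip(batch, targets):
        # For a table, a gradient step on 0.5 * (y - Q(s, a))**2 is exactly this update.
        q[s][a] += lr * (y - q[s][a])


# Toy problem: action 0 in state 0 leads to state 1; action 1 in state 1 ends with reward 1.
q: QTable = {0: [0.0, 0.0], 1: [0.0, 0.0]}
replay: Deque[Transition] = deque(maxlen=1000)
replay.extend([(0, 0, 0.0, 1, False), (1, 1, 1.0, 0, True)])
rng = random.Random(0)
for _ in range(200):
    train_step(q, replay, batch_size=2, gamma=0.9, lr=0.5, rng=rng)
print(q)  # q[1][1] approaches 1.0 and q[0][0] approaches 0.9
```

Sampling past transitions uniformly, rather than learning only from the latest one, breaks the correlation between consecutive updates; with a neural network in place of the table, that decorrelation is a large part of what keeps training stable.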

  28. arXiv:1202.3748  [pdf]

    cs.LG stat.ML

    Conditional Restricted Boltzmann Machines for Structured Output Prediction

    Authors: Volodymyr Mnih, Hugo Larochelle, Geoffrey E. Hinton

    Abstract: Conditional Restricted Boltzmann Machines (CRBMs) are rich probabilistic models that have recently been applied to a wide range of problems, including collaborative filtering, classification, and modeling motion capture data. While much progress has been made in training non-conditional RBMs, these algorithms are not applicable to conditional models and there has been almost no work on training an…

    Submitted 14 February, 2012; originally announced February 2012.

    Report number: UAI-P-2011-PG-514-522