
Showing 1–50 of 53 results for author: Van Hasselt, H

Searching in archive cs.
  1. arXiv:2407.01800  [pdf, other]

    cs.LG cs.AI

    Normalization and effective learning rates in reinforcement learning

    Authors: Clare Lyle, Zeyu Zheng, Khimya Khetarpal, James Martens, Hado van Hasselt, Razvan Pascanu, Will Dabney

    Abstract: Normalization layers have recently experienced a renaissance in the deep reinforcement learning and continual learning literature, with several works highlighting diverse benefits such as improving loss landscape conditioning and combatting overestimation bias. However, normalization brings with it a subtle but important side effect: an equivalence between growth in the norm of the network paramet…

    Submitted 1 July, 2024; originally announced July 2024.

  2. arXiv:2402.18762  [pdf, other]

    cs.LG

    Disentangling the Causes of Plasticity Loss in Neural Networks

    Authors: Clare Lyle, Zeyu Zheng, Khimya Khetarpal, Hado van Hasselt, Razvan Pascanu, James Martens, Will Dabney

    Abstract: Underpinning the past decades of work on the design, initialization, and optimization of neural networks is a seemingly innocuous assumption: that the network is trained on a stationary data distribution. In settings where this assumption is violated, e.g. deep reinforcement learning, learning algorithms become unstable and brittle with respect to hyperparameters and even random seeds. O…

    Submitted 28 February, 2024; originally announced February 2024.

  3. arXiv:2312.01072  [pdf, other]

    cs.LG cs.AI

    A Survey of Temporal Credit Assignment in Deep Reinforcement Learning

    Authors: Eduardo Pignatelli, Johan Ferret, Matthieu Geist, Thomas Mesnard, Hado van Hasselt, Laura Toni

    Abstract: The Credit Assignment Problem (CAP) refers to the longstanding challenge of Reinforcement Learning (RL) agents to associate actions with their long-term consequences. Solving the CAP is a crucial step towards the successful deployment of RL in the real world since most decision problems provide feedback that is noisy, delayed, and with little or no information about the causes. These conditions ma…

    Submitted 2 December, 2023; originally announced December 2023.

    Comments: 56 pages, 2 figures, 4 tables

  4. arXiv:2307.11046  [pdf, other]

    cs.LG cs.AI

    A Definition of Continual Reinforcement Learning

    Authors: David Abel, André Barreto, Benjamin Van Roy, Doina Precup, Hado van Hasselt, Satinder Singh

    Abstract: In a standard view of the reinforcement learning problem, an agent's goal is to efficiently identify a policy that maximizes long-term reward. However, this perspective is based on a restricted view of learning as finding a solution, rather than treating learning as endless adaptation. In contrast, continual reinforcement learning refers to the setting in which the best agents never stop learning.…

    Submitted 1 December, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

    Comments: NeurIPS 2023

  5. arXiv:2307.11044  [pdf, other]

    cs.LG cs.AI

    On the Convergence of Bounded Agents

    Authors: David Abel, André Barreto, Hado van Hasselt, Benjamin Van Roy, Doina Precup, Satinder Singh

    Abstract: When has an agent converged? Standard models of the reinforcement learning problem give rise to a straightforward definition of convergence: An agent converges when its behavior or performance in each environment state stops changing. However, as we shift the focus of our learning problem from the environment's state to the agent's state, the concept of an agent's convergence becomes significantly…

    Submitted 20 July, 2023; originally announced July 2023.

  6. arXiv:2303.04012  [pdf, other]

    cs.LG cs.AI stat.ML

    Exploration via Epistemic Value Estimation

    Authors: Simon Schmitt, John Shawe-Taylor, Hado van Hasselt

    Abstract: How to efficiently explore in reinforcement learning is an open problem. Many exploration algorithms employ the epistemic uncertainty of their own value predictions -- for instance to compute an exploration bonus or upper confidence bound. Unfortunately the required uncertainty is difficult to estimate in general with function approximation. We propose epistemic value estimation (EVE): a recipe…

    Submitted 7 March, 2023; originally announced March 2023.

  7. arXiv:2302.04250  [pdf, other]

    cs.LG stat.ML

    Learning How to Infer Partial MDPs for In-Context Adaptation and Exploration

    Authors: Chentian Jiang, Nan Rosemary Ke, Hado van Hasselt

    Abstract: To generalize across tasks, an agent should acquire knowledge from past tasks that facilitate adaptation and exploration in future tasks. We focus on the problem of in-context adaptation and exploration, where an agent only relies on context, i.e., history of states, actions and/or rewards, rather than gradient-based updates. Posterior sampling (extension of Thompson sampling) is a promising appro…

    Submitted 4 May, 2023; v1 submitted 8 February, 2023; originally announced February 2023.

    Comments: In proceedings of the Reincarnating Reinforcement Learning (RRL) Workshop at ICLR 2023 and the Neuro-Symbolic AI for Agent and Multi-Agent Systems (NeSyMAS) Workshop at AAMAS 2023

  8. arXiv:2301.03236  [pdf, other]

    cs.LG cs.AI math.OC

    Optimistic Meta-Gradients

    Authors: Sebastian Flennerhag, Tom Zahavy, Brendan O'Donoghue, Hado van Hasselt, András György, Satinder Singh

    Abstract: We study the connection between gradient-based meta-learning and convex optimisation. We observe that gradient descent with momentum is a special case of meta-gradients, and building on recent results in optimisation, we prove convergence rates for meta-learning in the single task setting. While a meta-learned update rule can yield faster convergence up to constant factor, it is not sufficient fo…

    Submitted 9 January, 2023; originally announced January 2023.

  9. arXiv:2209.07550  [pdf, other]

    cs.LG

    Human-level Atari 200x faster

    Authors: Steven Kapturowski, Víctor Campos, Ray Jiang, Nemanja Rakićević, Hado van Hasselt, Charles Blundell, Adrià Puigdomènech Badia

    Abstract: The task of building general agents that perform well over a wide range of tasks has been an important goal in reinforcement learning since its inception. The problem has been subject of research of a large body of work, with performance frequently measured by observing scores over the wide range of environments contained in the Atari 57 benchmark. Agent57 was the first agent to surpass the human…

    Submitted 15 September, 2022; originally announced September 2022.

  10. arXiv:2202.09699  [pdf, other]

    cs.LG cs.AI stat.ML

    Selective Credit Assignment

    Authors: Veronica Chelu, Diana Borsa, Doina Precup, Hado van Hasselt

    Abstract: Efficient credit assignment is essential for reinforcement learning algorithms in both prediction and control settings. We describe a unified view on temporal-difference algorithms for selective credit assignment. These selective algorithms apply weightings to quantify the contribution of learning updates. We present insights into applying weightings to value-based learning and planning algorithms…

    Submitted 19 February, 2022; originally announced February 2022.

  11. arXiv:2201.06468  [pdf, other]

    cs.LG cs.AI stat.ML

    Chaining Value Functions for Off-Policy Learning

    Authors: Simon Schmitt, John Shawe-Taylor, Hado van Hasselt

    Abstract: To accumulate knowledge and improve its policy of behaviour, a reinforcement learning agent can learn 'off-policy' about policies that differ from the policy used to generate its experience. This is important to learn counterfactuals, or because the experience was generated out of its own control. However, off-policy learning is non-trivial, and standard reinforcement-learning algorithms can be un…

    Submitted 2 February, 2022; v1 submitted 17 January, 2022; originally announced January 2022.

  12. arXiv:2110.12840  [pdf, other]

    cs.LG cs.AI stat.ML

    Self-Consistent Models and Values

    Authors: Gregory Farquhar, Kate Baumli, Zita Marinho, Angelos Filos, Matteo Hessel, Hado van Hasselt, David Silver

    Abstract: Learned models of the environment provide reinforcement learning (RL) agents with flexible ways of making predictions about the environment. In particular, models enable planning, i.e. using more computation to improve value functions or policies, without requiring additional environment interactions. In this work, we investigate a way of augmenting model-based RL, by additionally encouraging a le…

    Submitted 25 October, 2021; originally announced October 2021.

    Comments: NeurIPS 2021

  13. arXiv:2110.04041  [pdf, other]

    cs.AI

    Pick Your Battles: Interaction Graphs as Population-Level Objectives for Strategic Diversity

    Authors: Marta Garnelo, Wojciech Marian Czarnecki, Siqi Liu, Dhruva Tirumala, Junhyuk Oh, Gauthier Gidel, Hado van Hasselt, David Balduzzi

    Abstract: Strategic diversity is often essential in games: in multi-player games, for example, evaluating a player against a diverse set of strategies will yield a more accurate estimate of its performance. Furthermore, in games with non-transitivities diversity allows a player to cover several winning strategies. However, despite the significance of strategic diversity, training agents that exhibit diverse…

    Submitted 8 October, 2021; originally announced October 2021.

  14. arXiv:2109.10781  [pdf, other]

    cs.LG cs.AI cs.NE stat.ML

    Introducing Symmetries to Black Box Meta Reinforcement Learning

    Authors: Louis Kirsch, Sebastian Flennerhag, Hado van Hasselt, Abram Friesen, Junhyuk Oh, Yutian Chen

    Abstract: Meta reinforcement learning (RL) attempts to discover new RL algorithms automatically from environment interaction. In so-called black-box approaches, the policy and the learning algorithm are jointly represented by a single neural network. These methods are very flexible, but they tend to underperform in terms of generalisation to new, unseen environments. In this paper, we explore the role of sy…

    Submitted 5 June, 2022; v1 submitted 22 September, 2021; originally announced September 2021.

    Comments: AAAI 2022

  15. arXiv:2109.04504  [pdf, other]

    cs.LG cs.AI stat.ML

    Bootstrapped Meta-Learning

    Authors: Sebastian Flennerhag, Yannick Schroecker, Tom Zahavy, Hado van Hasselt, David Silver, Satinder Singh

    Abstract: Meta-learning empowers artificial intelligence to increase its efficiency by learning how to learn. Unlocking this potential involves overcoming a challenging meta-optimisation problem. We propose an algorithm that tackles this problem by letting the meta-learner teach itself. The algorithm first bootstraps a target from the meta-learner, then optimises the meta-learner by minimising the distance…

    Submitted 16 March, 2022; v1 submitted 9 September, 2021; originally announced September 2021.

    Comments: Published at ICLR 2022. 37 pages, 19 figures, 9 tables

  16. arXiv:2107.05405  [pdf, other]

    cs.LG stat.ML

    Learning Expected Emphatic Traces for Deep RL

    Authors: Ray Jiang, Shangtong Zhang, Veronica Chelu, Adam White, Hado van Hasselt

    Abstract: Off-policy sampling and experience replay are key for improving sample efficiency and scaling model-free temporal difference learning methods. When combined with function approximation, such as neural networks, this combination is known as the deadly triad and is potentially unstable. Recently, it has been shown that stability and good performance at scale can be achieved by combining emphatic wei…

    Submitted 12 July, 2021; originally announced July 2021.

  17. arXiv:2106.11779  [pdf, other]

    cs.LG stat.ML

    Emphatic Algorithms for Deep Reinforcement Learning

    Authors: Ray Jiang, Tom Zahavy, Zhongwen Xu, Adam White, Matteo Hessel, Charles Blundell, Hado van Hasselt

    Abstract: Off-policy learning allows us to learn about possible policies of behavior from experience generated by a different behavior policy. Temporal difference (TD) learning algorithms can become unstable when combined with function approximation and off-policy sampling - this is known as the "deadly triad". Emphatic temporal difference (ETD(λ)) algorithm ensures convergence in the linear case by app…

    Submitted 21 June, 2021; originally announced June 2021.

    Journal ref: Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021

  18. arXiv:2104.06272  [pdf, other]

    cs.LG

    Podracer architectures for scalable Reinforcement Learning

    Authors: Matteo Hessel, Manuel Kroiss, Aidan Clark, Iurii Kemaev, John Quan, Thomas Keck, Fabio Viola, Hado van Hasselt

    Abstract: Supporting state-of-the-art AI research requires balancing rapid prototyping, ease of use, and quick iteration, with the ability to deploy experiments at a scale traditionally associated with production systems. Deep learning frameworks such as TensorFlow, PyTorch and JAX allow users to transparently make use of accelerators, such as TPUs and GPUs, to offload the more computationally intensive part…

    Submitted 13 April, 2021; originally announced April 2021.

  19. arXiv:2104.06159  [pdf, other]

    cs.LG cs.AI

    Muesli: Combining Improvements in Policy Optimization

    Authors: Matteo Hessel, Ivo Danihelka, Fabio Viola, Arthur Guez, Simon Schmitt, Laurent Sifre, Theophane Weber, David Silver, Hado van Hasselt

    Abstract: We propose a novel policy update that combines regularized policy optimization with model learning as an auxiliary loss. The update (henceforth Muesli) matches MuZero's state-of-the-art performance on Atari. Notably, Muesli does so without using deep search: it acts directly with a policy network and has computation speed comparable to model-free baselines. The Atari results are complemented by ex…

    Submitted 31 March, 2022; v1 submitted 13 April, 2021; originally announced April 2021.

  20. arXiv:2102.12425  [pdf, other]

    cs.LG

    Synthetic Returns for Long-Term Credit Assignment

    Authors: David Raposo, Sam Ritter, Adam Santoro, Greg Wayne, Theophane Weber, Matt Botvinick, Hado van Hasselt, Francis Song

    Abstract: Since the earliest days of reinforcement learning, the workhorse method for assigning credit to actions over time has been temporal-difference (TD) learning, which propagates credit backward timestep-by-timestep. This approach suffers when delays between actions and rewards are long and when intervening unrelated events contribute variance to long-term returns. We propose state-associative (SA) le…

    Submitted 24 February, 2021; originally announced February 2021.

  21. arXiv:2102.06741  [pdf, other]

    cs.LG cs.AI

    Discovery of Options via Meta-Learned Subgoals

    Authors: Vivek Veeriah, Tom Zahavy, Matteo Hessel, Zhongwen Xu, Junhyuk Oh, Iurii Kemaev, Hado van Hasselt, David Silver, Satinder Singh

    Abstract: Temporal abstractions in the form of options have been shown to help reinforcement learning (RL) agents learn faster. However, despite prior work on this topic, the problem of discovering options through interaction with an environment remains a challenge. In this paper, we introduce a novel meta-gradient approach for discovering useful options in multi-task RL environments. Our approach is based…

    Submitted 12 February, 2021; originally announced February 2021.

  22. arXiv:2010.13685  [pdf, other]

    cs.LG cs.AI stat.ML

    Forethought and Hindsight in Credit Assignment

    Authors: Veronica Chelu, Doina Precup, Hado van Hasselt

    Abstract: We address the problem of credit assignment in reinforcement learning and explore fundamental questions regarding the way in which an agent can best use additional computation to propagate new information, by planning with internal models of the world to improve its predictions. Particularly, we work to understand the gains and peculiarities of planning employed as forethought via forward models o…

    Submitted 26 October, 2020; originally announced October 2020.

  23. arXiv:2007.08794  [pdf, other]

    cs.LG cs.AI

    Discovering Reinforcement Learning Algorithms

    Authors: Junhyuk Oh, Matteo Hessel, Wojciech M. Czarnecki, Zhongwen Xu, Hado van Hasselt, Satinder Singh, David Silver

    Abstract: Reinforcement learning (RL) algorithms update an agent's parameters according to one of several possible rules, discovered manually through years of research. Automating the discovery of update rules from data could lead to more efficient algorithms, or algorithms that are better adapted to specific environments. Although there have been prior attempts at addressing this significant scientific cha…

    Submitted 5 January, 2021; v1 submitted 17 July, 2020; originally announced July 2020.

  24. arXiv:2007.08433  [pdf, other]

    cs.LG cs.AI stat.ML

    Meta-Gradient Reinforcement Learning with an Objective Discovered Online

    Authors: Zhongwen Xu, Hado van Hasselt, Matteo Hessel, Junhyuk Oh, Satinder Singh, David Silver

    Abstract: Deep reinforcement learning includes a broad family of algorithms that parameterise an internal representation, such as a value function or policy, by a deep neural network. Each algorithm optimises its parameters with respect to an objective, such as Q-learning or policy gradient, that defines its semantics. In this work, we propose an algorithm based on meta-gradient descent that discovers its o…

    Submitted 16 July, 2020; originally announced July 2020.

  25. arXiv:2007.01839  [pdf, other]

    cs.LG cs.AI stat.ML

    Expected Eligibility Traces

    Authors: Hado van Hasselt, Sephora Madjiheurem, Matteo Hessel, David Silver, André Barreto, Diana Borsa

    Abstract: The question of how to determine which states and actions are responsible for a certain outcome is known as the credit assignment problem and remains a central research question in reinforcement learning and artificial intelligence. Eligibility traces enable efficient credit assignment to the recent sequence of states and actions experienced by the agent, but not to counterfactual sequences that c…

    Submitted 8 February, 2021; v1 submitted 3 July, 2020; originally announced July 2020.

    Comments: AAAI, distinguished paper award

  26. arXiv:2002.12928  [pdf, other]

    stat.ML cs.LG

    A Self-Tuning Actor-Critic Algorithm

    Authors: Tom Zahavy, Zhongwen Xu, Vivek Veeriah, Matteo Hessel, Junhyuk Oh, Hado van Hasselt, David Silver, Satinder Singh

    Abstract: Reinforcement learning algorithms are highly sensitive to the choice of hyperparameters, typically requiring significant manual effort to identify hyperparameters that perform well on a new domain. In this paper, we take a step towards addressing this issue by using metagradients to automatically adapt hyperparameters online by meta-gradient descent (Xu et al., 2018). We apply our algorithm, Self-…

    Submitted 14 April, 2021; v1 submitted 28 February, 2020; originally announced February 2020.

  27. arXiv:1912.05500  [pdf, other]

    cs.AI cs.LG

    What Can Learned Intrinsic Rewards Capture?

    Authors: Zeyu Zheng, Junhyuk Oh, Matteo Hessel, Zhongwen Xu, Manuel Kroiss, Hado van Hasselt, David Silver, Satinder Singh

    Abstract: The objective of a reinforcement learning agent is to behave so as to maximise the sum of a suitable scalar function of state: the reward. These rewards are typically given and immutable. In this paper, we instead consider the proposition that the reward function itself can be a good locus of learned knowledge. To investigate this, we propose a scalable meta-gradient framework for learning useful…

    Submitted 21 August, 2020; v1 submitted 11 December, 2019; originally announced December 2019.

    Comments: ICML 2020. The first two authors contributed equally

  28. arXiv:1912.02503  [pdf, other]

    cs.LG stat.ML

    Hindsight Credit Assignment

    Authors: Anna Harutyunyan, Will Dabney, Thomas Mesnard, Mohammad Azar, Bilal Piot, Nicolas Heess, Hado van Hasselt, Greg Wayne, Satinder Singh, Doina Precup, Remi Munos

    Abstract: We consider the problem of efficient credit assignment in reinforcement learning. In order to efficiently and meaningfully utilize new data, we propose to explicitly assign credit to past decisions based on the likelihood of them having led to the observed outcome. This approach uses new information in hindsight, rather than employing foresight. Somewhat surprisingly, we show that value functions…

    Submitted 5 December, 2019; originally announced December 2019.

    Comments: NeurIPS 2019

  29. arXiv:1910.07479  [pdf, other]

    cs.LG stat.ML

    Conditional Importance Sampling for Off-Policy Learning

    Authors: Mark Rowland, Anna Harutyunyan, Hado van Hasselt, Diana Borsa, Tom Schaul, Rémi Munos, Will Dabney

    Abstract: The principal contribution of this paper is a conceptual framework for off-policy reinforcement learning, based on conditional expectations of importance sampling ratios. This framework yields new perspectives and understanding of existing off-policy algorithms, and reveals a broad space of unexplored algorithms. We theoretically analyse this space, and concretely investigate several algorithms th…

    Submitted 30 July, 2020; v1 submitted 16 October, 2019; originally announced October 2019.

    Comments: AISTATS 2020 camera-ready version

  30. arXiv:1909.04607  [pdf, other]

    cs.AI cs.LG

    Discovery of Useful Questions as Auxiliary Tasks

    Authors: Vivek Veeriah, Matteo Hessel, Zhongwen Xu, Richard Lewis, Janarthanan Rajendran, Junhyuk Oh, Hado van Hasselt, David Silver, Satinder Singh

    Abstract: Arguably, intelligent agents ought to be able to discover their own questions so that in learning answers for them they learn unanticipated useful knowledge and skills; this departs from the focus in much of machine learning on agents learning answers to externally defined questions. We present a novel method for a reinforcement learning (RL) agent to discover questions formulated as general value…

    Submitted 10 September, 2019; originally announced September 2019.

  31. arXiv:1908.03568  [pdf, other]

    cs.LG cs.AI stat.ML

    Behaviour Suite for Reinforcement Learning

    Authors: Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvari, Satinder Singh, Benjamin Van Roy, Richard Sutton, David Silver, Hado Van Hasselt

    Abstract: This paper introduces the Behaviour Suite for Reinforcement Learning, or bsuite for short. bsuite is a collection of carefully-designed experiments that investigate core capabilities of reinforcement learning (RL) agents with two objectives. First, to collect clear, informative and scalable problems that capture key issues in the design of general and efficient learning algorithms. Second, to stud…

    Submitted 14 February, 2020; v1 submitted 9 August, 2019; originally announced August 2019.

  32. arXiv:1907.03687  [pdf, other]

    cs.LG cs.AI stat.ML

    General non-linear Bellman equations

    Authors: Hado van Hasselt, John Quan, Matteo Hessel, Zhongwen Xu, Diana Borsa, Andre Barreto

    Abstract: We consider a general class of non-linear Bellman equations. These open up a design space of algorithms that have interesting properties, which has two potential advantages. First, we can perhaps better model natural phenomena. For instance, hyperbolic discounting has been proposed as a mathematical model that matches human and animal data well, and can therefore be used to explain preference orde…

    Submitted 8 July, 2019; originally announced July 2019.

  33. arXiv:1907.02908  [pdf, other]

    cs.LG cs.AI stat.ML

    On Inductive Biases in Deep Reinforcement Learning

    Authors: Matteo Hessel, Hado van Hasselt, Joseph Modayil, David Silver

    Abstract: Many deep reinforcement learning algorithms contain inductive biases that sculpt the agent's objective and its interface to the environment. These inductive biases can take many forms, including domain knowledge and pretuned hyper-parameters. In general, there is a trade-off between generality and performance when algorithms use such biases. Stronger biases can lead to faster learning, but weaker…

    Submitted 5 July, 2019; originally announced July 2019.

  34. arXiv:1906.05243  [pdf, other]

    cs.LG cs.AI stat.ML

    When to use parametric models in reinforcement learning?

    Authors: Hado van Hasselt, Matteo Hessel, John Aslanides

    Abstract: We examine the question of when and how parametric models are most useful in reinforcement learning. In particular, we look at commonalities and differences between parametric models and experience replay. Replay-based learning algorithms share important traits with model-based approaches, including the ability to plan: to use more computation without additional data to improve predictions and beh…

    Submitted 12 June, 2019; originally announced June 2019.

    Journal ref: NeurIPS 2019

  35. arXiv:1905.03030  [pdf, other]

    cs.LG cs.AI stat.ML

    Meta-learning of Sequential Strategies

    Authors: Pedro A. Ortega, Jane X. Wang, Mark Rowland, Tim Genewein, Zeb Kurth-Nelson, Razvan Pascanu, Nicolas Heess, Joel Veness, Alex Pritzel, Pablo Sprechmann, Siddhant M. Jayakumar, Tom McGrath, Kevin Miller, Mohammad Azar, Ian Osband, Neil Rabinowitz, András György, Silvia Chiappa, Simon Osindero, Yee Whye Teh, Hado van Hasselt, Nando de Freitas, Matthew Botvinick, Shane Legg

    Abstract: In this report we review memory-based meta-learning as a tool for building sample-efficient strategies that learn from past experience to adapt to any task within a target class. Our goal is to equip the reader with the conceptual foundations of this tool for building new, scalable agents that operate on broad domains. To do so, we present basic algorithmic templates for building near-optimal pred…

    Submitted 18 July, 2019; v1 submitted 8 May, 2019; originally announced May 2019.

    Comments: DeepMind Technical Report (15 pages, 6 figures). Version V1.1

  36. arXiv:1812.07626  [pdf, other]

    cs.LG cs.AI stat.ML

    Universal Successor Features Approximators

    Authors: Diana Borsa, André Barreto, John Quan, Daniel Mankowitz, Rémi Munos, Hado van Hasselt, David Silver, Tom Schaul

    Abstract: The ability of a reinforcement learning (RL) agent to learn about many reward functions at the same time has many potential benefits, such as the decomposition of complex tasks into simpler ones, the exchange of information between tasks, and the reuse of skills. We focus on one aspect in particular, namely the ability to generalise to unseen tasks. Parametric generalisation relies on the interpol…

    Submitted 18 December, 2018; originally announced December 2018.

  37. arXiv:1812.02648  [pdf, other]

    cs.AI cs.LG

    Deep Reinforcement Learning and the Deadly Triad

    Authors: Hado van Hasselt, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, Joseph Modayil

    Abstract: We know from reinforcement learning theory that temporal difference learning can fail in certain cases. Sutton and Barto (2018) identify a deadly triad of function approximation, bootstrapping, and off-policy learning. When these three properties are combined, learning can diverge with the value estimates becoming unbounded. However, several algorithms successfully combine these three properties,…

    Submitted 6 December, 2018; originally announced December 2018.

  38. arXiv:1811.07004  [pdf, ps, other]

    cs.AI cs.LG

    The Barbados 2018 List of Open Issues in Continual Learning

    Authors: Tom Schaul, Hado van Hasselt, Joseph Modayil, Martha White, Adam White, Pierre-Luc Bacon, Jean Harb, Shibl Mourad, Marc Bellemare, Doina Precup

    Abstract: We want to make progress toward artificial general intelligence, namely general-purpose agents that autonomously learn how to competently act in complex environments. The purpose of this report is to sketch a research outline, share some of the most important open issues we are facing, and stimulate further discussion in the community. The content is based on some of our discussions during a week-…

    Submitted 16 November, 2018; originally announced November 2018.

    Comments: NIPS Continual Learning Workshop 2018

  39. arXiv:1809.04474  [pdf, other]

    cs.LG stat.ML

    Multi-task Deep Reinforcement Learning with PopArt

    Authors: Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, Hado van Hasselt

    Abstract: The reinforcement learning community has made great strides in designing algorithms capable of exceeding human performance on specific tasks. These algorithms are mostly trained one task at the time, each new task requiring to train a brand new agent instance. This means the learning algorithm is general, but each solution is not; each agent can only solve the one task it was trained on. In this w…

    Submitted 12 September, 2018; originally announced September 2018.

  40. arXiv:1805.11593  [pdf, other]

    cs.LG cs.AI stat.ML

    Observe and Look Further: Achieving Consistent Performance on Atari

    Authors: Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden, Gabriel Barth-Maron, Hado van Hasselt, John Quan, Mel Večerík, Matteo Hessel, Rémi Munos, Olivier Pietquin

    Abstract: Despite significant advances in the field of deep Reinforcement Learning (RL), today's algorithms still fail to learn human-level policies consistently over a set of diverse tasks such as Atari 2600 games. We identify three key challenges that any algorithm needs to master in order to perform well on all games: processing diverse reward distributions, reasoning over long time horizons, and explori…

    Submitted 29 May, 2018; originally announced May 2018.

  41. arXiv:1805.09801  [pdf, other]

    cs.LG cs.AI stat.ML

    Meta-Gradient Reinforcement Learning

    Authors: Zhongwen Xu, Hado van Hasselt, David Silver

    Abstract: The goal of reinforcement learning algorithms is to estimate and/or optimise the value function. However, unlike supervised learning, no teacher or oracle is available to provide the true value function. Instead, the majority of reinforcement learning algorithms estimate and/or optimise a proxy for the value function. This proxy is typically based on a sampled and bootstrapped approximation to the…

    Submitted 24 May, 2018; originally announced May 2018.

  42. arXiv:1803.00933  [pdf, other]

    cs.LG

    Distributed Prioritized Experience Replay

    Authors: Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt, David Silver

    Abstract: We propose a distributed architecture for deep reinforcement learning at scale, that enables agents to learn effectively from orders of magnitude more data than previously possible. The algorithm decouples acting from learning: the actors interact with their own instances of the environment by selecting actions according to a shared neural network, and accumulate the resulting experience in a shar…

    Submitted 2 March, 2018; originally announced March 2018.

    Comments: Accepted to International Conference on Learning Representations 2018

  43. arXiv:1802.08294  [pdf, other]

    cs.LG

    Unicorn: Continual Learning with a Universal, Off-policy Agent

    Authors: Daniel J. Mankowitz, Augustin Žídek, André Barreto, Dan Horgan, Matteo Hessel, John Quan, Junhyuk Oh, Hado van Hasselt, David Silver, Tom Schaul

    Abstract: Some real-world domains are best characterized as a single task, but for others this perspective is limiting. Instead, some tasks continually grow in complexity, in tandem with the agent's competence. In continual learning, also referred to as lifelong learning, there are no explicit task boundaries or curricula. As learning agents have become more powerful, continual learning remains one of the f…

    Submitted 3 July, 2018; v1 submitted 22 February, 2018; originally announced February 2018.

  44. arXiv:1710.02298  [pdf, other]

    cs.AI cs.LG

    Rainbow: Combining Improvements in Deep Reinforcement Learning

    Authors: Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, David Silver

    Abstract: The deep reinforcement learning community has made several independent improvements to the DQN algorithm. However, it is unclear which of these extensions are complementary and can be fruitfully combined. This paper examines six extensions to the DQN algorithm and empirically studies their combination. Our experiments show that the combination provides state-of-the-art performance on the Atari 260…

    Submitted 6 October, 2017; originally announced October 2017.

    Comments: Under review as a conference paper at AAAI 2018

  45. arXiv:1708.04782  [pdf, other]

    cs.LG cs.AI

    StarCraft II: A New Challenge for Reinforcement Learning

    Authors: Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, John Quan, Stephen Gaffney, Stig Petersen, Karen Simonyan, Tom Schaul, Hado van Hasselt, David Silver, Timothy Lillicrap, Kevin Calderone, Paul Keet, Anthony Brunasso, David Lawrence, Anders Ekermo, Jacob Repp, Rodney Tsing

    Abstract: This paper introduces SC2LE (StarCraft II Learning Environment), a reinforcement learning environment based on the StarCraft II game. This domain poses a new grand challenge for reinforcement learning, representing a more difficult class of problems than considered in most prior work. It is a multi-agent problem with multiple players interacting; there is imperfect information due to a partially o…

    Submitted 16 August, 2017; originally announced August 2017.

    Comments: Collaboration between DeepMind & Blizzard. 20 pages, 9 figures, 2 tables

  46. arXiv:1612.08810  [pdf, other]

    cs.LG cs.AI cs.NE

    The Predictron: End-To-End Learning and Planning

    Authors: David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, Thomas Degris

    Abstract: One of the key challenges of artificial intelligence is to learn models that are effective in the context of planning. In this document we introduce the predictron architecture. The predictron consists of a fully abstract model, represented by a Markov reward process, that can be rolled forward multiple "imagined" planning steps. Each forward pass of the predictron accumulates internal rewards and…

    Submitted 20 July, 2017; v1 submitted 28 December, 2016; originally announced December 2016.

    Comments: Camera-ready version, ICML 2017, with supplement

  47. arXiv:1606.05312  [pdf, other]

    cs.AI

    Successor Features for Transfer in Reinforcement Learning

    Authors: André Barreto, Will Dabney, Rémi Munos, Jonathan J. Hunt, Tom Schaul, Hado van Hasselt, David Silver

    Abstract: Transfer in reinforcement learning refers to the notion that generalization should occur not only within a task but also across tasks. We propose a transfer framework for the scenario where the reward function changes between tasks but the environment's dynamics remain the same. Our approach rests on two key ideas: "successor features", a value function representation that decouples the dynamics o…

    Submitted 12 April, 2018; v1 submitted 16 June, 2016; originally announced June 2016.

    Comments: Published at NIPS 2017

  48. arXiv:1602.07714  [pdf, other]

    cs.LG cs.AI cs.NE stat.ML

    Learning values across many orders of magnitude

    Authors: Hado van Hasselt, Arthur Guez, Matteo Hessel, Volodymyr Mnih, David Silver

    Abstract: Most learning algorithms are not invariant to the scale of the function that is being approximated. We propose to adaptively normalize the targets used in learning. This is useful in value-based reinforcement learning, where the magnitude of appropriate value approximations can change over time when we update the policy of behavior. Our main motivation is prior work on learning to play Atari games…

    Submitted 16 August, 2016; v1 submitted 24 February, 2016; originally announced February 2016.

    Comments: Paper accepted for publication at NIPS 2016. This version includes the appendix

  49. arXiv:1512.07679  [pdf, other]

    cs.AI cs.LG cs.NE stat.ML

    Deep Reinforcement Learning in Large Discrete Action Spaces

    Authors: Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, Ben Coppin

    Abstract: Being able to reason in an environment with a large number of discrete actions is essential to bringing reinforcement learning to a larger class of problems. Recommender systems, industrial plants and language models are only some of the many real-world tasks involving large numbers of discrete actions for which current methods are difficult or even often impossible to apply. An ability to general…

    Submitted 4 April, 2016; v1 submitted 23 December, 2015; originally announced December 2015.

  50. arXiv:1511.06581  [pdf, other]

    cs.LG

    Dueling Network Architectures for Deep Reinforcement Learning

    Authors: Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, Nando de Freitas

    Abstract: In recent years there have been many successes of using deep representations in reinforcement learning. Still, many of these applications use conventional architectures, such as convolutional networks, LSTMs, or auto-encoders. In this paper, we present a new neural network architecture for model-free reinforcement learning. Our dueling network represents two separate estimators: one for the state…

    Submitted 5 April, 2016; v1 submitted 20 November, 2015; originally announced November 2015.

    Comments: 15 pages, 5 figures, and 5 tables