Showing 1–50 of 64 results for author: Sutton, R S

  1. arXiv:2406.14951  [pdf, other]

    cs.LG cs.AI

    An Idiosyncrasy of Time-discretization in Reinforcement Learning

    Authors: Kris De Asis, Richard S. Sutton

    Abstract: Many reinforcement learning algorithms are built on an assumption that an agent interacts with an environment over fixed-duration, discrete time steps. However, physical systems are continuous in time, requiring a choice of time-discretization granularity when digitally controlling them. Furthermore, such systems do not wait for decisions to be made before advancing the environment state, necessit…

    Submitted 21 June, 2024; originally announced June 2024.

    Comments: RLC 2024

    ACM Class: I.2.6; I.2.9

  2. arXiv:2405.09999  [pdf, other]

    cs.LG cs.AI

    Reward Centering

    Authors: Abhishek Naik, Yi Wan, Manan Tomar, Richard S. Sutton

    Abstract: We show that discounted methods for solving continuing reinforcement learning problems can perform significantly better if they center their rewards by subtracting out the rewards' empirical average. The improvement is substantial at commonly used discount factors and increases further as the discount factor approaches one. In addition, we show that if a problem's rewards are shifted by a constant…

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: In Proceedings of RLC 2024

  3. arXiv:2312.15091  [pdf, ps, other]

    cs.LG math.OC

    A Note on Stability in Asynchronous Stochastic Approximation without Communication Delays

    Authors: Huizhen Yu, Yi Wan, Richard S. Sutton

    Abstract: In this paper, we study asynchronous stochastic approximation algorithms without communication delays. Our main contribution is a stability proof for these algorithms that extends a method of Borkar and Meyn by accommodating more general noise conditions. We also derive convergence results from this stability result and discuss their application in important average-reward reinforcement learning p…

    Submitted 22 December, 2023; originally announced December 2023.

    Comments: 21 pages

    MSC Class: 62L20 (Primary) 93E35; 90C40 (Secondary)

  4. arXiv:2310.01569  [pdf, other]

    cs.AI cs.LG

    Iterative Option Discovery for Planning, by Planning

    Authors: Kenny Young, Richard S. Sutton

    Abstract: Discovering useful temporal abstractions, in the form of options, is widely thought to be key to applying reinforcement learning and planning to increasingly complex domains. Building on the empirical success of the Expert Iteration approach to policy learning used in AlphaZero, we propose Option Iteration, an analogous approach to option discovery. Rather than learning a single strong policy that…

    Submitted 22 December, 2023; v1 submitted 2 October, 2023; originally announced October 2023.

    Comments: Fixed incorrect arrows on some figures in the appendix

  5. arXiv:2306.15625  [pdf, other]

    cs.LG cs.AI

    Value-aware Importance Weighting for Off-policy Reinforcement Learning

    Authors: Kristopher De Asis, Eric Graves, Richard S. Sutton

    Abstract: Importance sampling is a central idea underlying off-policy prediction in reinforcement learning. It provides a strategy for re-weighting samples from a distribution to obtain unbiased estimates under another distribution. However, importance sampling weights tend to exhibit extreme variance, often leading to stability issues in practice. In this work, we consider a broader class of importance wei…

    Submitted 27 June, 2023; originally announced June 2023.

    Comments: CoLLAs 2023

    ACM Class: I.2

  6. arXiv:2306.13812  [pdf, other]

    cs.LG

    Maintaining Plasticity in Deep Continual Learning

    Authors: Shibhansh Dohare, J. Fernando Hernandez-Garcia, Parash Rahman, A. Rupam Mahmood, Richard S. Sutton

    Abstract: Modern deep-learning systems are specialized to problem settings in which training occurs once and then never again, as opposed to continual-learning settings in which training occurs continually. If deep-learning systems are applied in a continual learning setting, then it is well known that they may fail to remember earlier examples. More fundamental, but less well known, is that they may also l…

    Submitted 9 April, 2024; v1 submitted 23 June, 2023; originally announced June 2023.

  7. arXiv:2209.15141  [pdf, other]

    cs.LG

    On Convergence of Average-Reward Off-Policy Control Algorithms in Weakly Communicating MDPs

    Authors: Yi Wan, Richard S. Sutton

    Abstract: We show two average-reward off-policy control algorithms, Differential Q-learning (Wan, Naik, & Sutton 2021a) and RVI Q-learning (Abounadi Bertsekas & Borkar 2001), converge in weakly communicating MDPs. Weakly communicating MDPs are the most general MDPs that can be solved by a learning algorithm with a single stream of experience. The original convergence proofs of the two algorithms require tha…

    Submitted 5 November, 2022; v1 submitted 29 September, 2022; originally announced September 2022.

  8. arXiv:2208.11173  [pdf, other]

    cs.AI cs.LG

    The Alberta Plan for AI Research

    Authors: Richard S. Sutton, Michael Bowling, Patrick M. Pilarski

    Abstract: Herein we describe our approach to artificial intelligence research, which we call the Alberta Plan. The Alberta Plan is pursued within our research groups in Alberta and by others who are like minded throughout the world. We welcome all who would join us in this pursuit.

    Submitted 21 March, 2023; v1 submitted 23 August, 2022; originally announced August 2022.

  9. arXiv:2207.01613  [pdf, other]

    cs.LG

    Doubly-Asynchronous Value Iteration: Making Value Iteration Asynchronous in Actions

    Authors: Tian Tian, Kenny Young, Richard S. Sutton

    Abstract: Value iteration (VI) is a foundational dynamic programming method, important for learning and planning in optimal control and reinforcement learning. VI proceeds in batches, where the update to the value of each state must be completed before the next batch of updates can begin. Completing a single batch is prohibitively expensive if the state space is large, rendering VI impractical for many appl…

    Submitted 27 November, 2022; v1 submitted 4 July, 2022; originally announced July 2022.

  10. arXiv:2205.12515  [pdf, other]

    cs.LG cs.AI

    Toward Discovering Options that Achieve Faster Planning

    Authors: Yi Wan, Richard S. Sutton

    Abstract: We propose a new objective for option discovery that emphasizes the computational advantage of using options in planning. In a sequential machine, the speed of planning is proportional to the number of elementary operations used to achieve a good policy. For episodic tasks, the number of elementary operations depends on the number of options composed by the policy in an episode and the number of o…

    Submitted 29 September, 2022; v1 submitted 25 May, 2022; originally announced May 2022.

  11. arXiv:2202.13252  [pdf, other]

    cs.AI

    The Quest for a Common Model of the Intelligent Decision Maker

    Authors: Richard S. Sutton

    Abstract: The premise of the Multi-disciplinary Conference on Reinforcement Learning and Decision Making is that multiple disciplines share an interest in goal-directed decision making over time. The idea of this paper is to sharpen and deepen this premise by proposing a perspective on the decision maker that is substantive and widely held across psychology, artificial intelligence, economics, control theor…

    Submitted 5 June, 2022; v1 submitted 26 February, 2022; originally announced February 2022.

    Comments: Will appear as an extended abstract at the fifth Multi-disciplinary Conference on Reinforcement Learning and Decision Making, held in Providence, Rhode Island, June 8-11, 2022

  12. arXiv:2202.09701  [pdf, ps, other]

    cs.LG

    A History of Meta-gradient: Gradient Methods for Meta-learning

    Authors: Richard S. Sutton

    Abstract: The history of meta-learning methods based on gradient descent is reviewed, focusing primarily on methods that adapt step-size (learning rate) meta-parameters.

    Submitted 19 February, 2022; originally announced February 2022.

    Comments: 3 pages of text, 54 references

  13. Reward-Respecting Subtasks for Model-Based Reinforcement Learning

    Authors: Richard S. Sutton, Marlos C. Machado, G. Zacharias Holland, David Szepesvari, Finbarr Timbers, Brian Tanner, Adam White

    Abstract: To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress with state abstraction, but temporal abstraction has rarely been used, despite extensively developed theory based on the options framework. One reason for this is that the space of possible options is i…

    Submitted 16 September, 2023; v1 submitted 7 February, 2022; originally announced February 2022.

    Journal ref: Artificial Intelligence, first published online September 6, 2023

  14. arXiv:2112.15236  [pdf, other]

    cs.LG cs.AI

    Learning Agent State Online with Recurrent Generate-and-Test

    Authors: Amir Samani, Richard S. Sutton

    Abstract: Learning continually and online from a continuous stream of data is challenging, especially for a reinforcement learning agent with sequential data. When the environment only provides observations giving partial information about the state of the environment, the agent must learn the agent state based on the data stream of experience. We refer to the state learned directly from the data stream of…

    Submitted 30 December, 2021; originally announced December 2021.

  15. arXiv:2110.13855  [pdf, other]

    cs.LG

    Average-Reward Learning and Planning with Options

    Authors: Yi Wan, Abhishek Naik, Richard S. Sutton

    Abstract: We extend the options framework for temporal abstraction in reinforcement learning from discounted Markov decision processes (MDPs) to average-reward MDPs. Our contributions include general convergent off-policy inter-option learning algorithms, intra-option algorithms for learning values and models, as well as sample-based planning variants of our learning algorithms. Our algorithms and convergen…

    Submitted 26 October, 2021; originally announced October 2021.

  16. arXiv:2109.05110  [pdf, other]

    cs.LG cs.AI

    An Empirical Comparison of Off-policy Prediction Learning Algorithms in the Four Rooms Environment

    Authors: Sina Ghiassian, Richard S. Sutton

    Abstract: Many off-policy prediction learning algorithms have been proposed in the past decade, but it remains unclear which algorithms learn faster than others. We empirically compare 11 off-policy prediction learning algorithms with linear function approximation on two small tasks: the Rooms task, and the High Variance Rooms task. The tasks are designed such that learning fast in them is challenging. In t…

    Submitted 10 September, 2021; originally announced September 2021.

    Comments: 13 pages

  17. arXiv:2108.06325  [pdf, other]

    cs.LG

    Continual Backprop: Stochastic Gradient Descent with Persistent Randomness

    Authors: Shibhansh Dohare, Richard S. Sutton, A. Rupam Mahmood

    Abstract: The Backprop algorithm for learning in neural networks utilizes two mechanisms: first, stochastic gradient descent and second, initialization with small random weights, where the latter is essential to the effectiveness of the former. We show that in continual learning setups, Backprop performs well initially, but over time its performance degrades. Stochastic gradient descent alone is insufficien…

    Submitted 5 May, 2022; v1 submitted 13 August, 2021; originally announced August 2021.

  18. arXiv:2106.00922  [pdf, other]

    cs.LG cs.AI

    An Empirical Comparison of Off-policy Prediction Learning Algorithms on the Collision Task

    Authors: Sina Ghiassian, Richard S. Sutton

    Abstract: Off-policy prediction -- learning the value function for one policy from data generated while following another policy -- is one of the most challenging subproblems in reinforcement learning. This paper presents empirical results with eleven prominent off-policy learning algorithms that use linear function approximation: five Gradient-TD methods, two Emphatic-TD methods, Off-policy TD($λ$), Vtrace…

    Submitted 11 June, 2021; v1 submitted 1 June, 2021; originally announced June 2021.

  19. arXiv:2104.08543  [pdf, other]

    cs.AI

    Planning with Expectation Models for Control

    Authors: Katya Kudashkina, Yi Wan, Abhishek Naik, Richard S. Sutton

    Abstract: In model-based reinforcement learning (MBRL), Wan et al. (2019) showed conditions under which the environment model could produce the expectation of the next feature vector rather than the full distribution, or a sample thereof, with no loss in planning performance. Such expectation models are of interest when the environment is stochastic and non-stationary, and the model is approximate, such as…

    Submitted 17 April, 2021; originally announced April 2021.

  20. arXiv:2102.07686  [pdf, other]

    cs.LG cs.AI stat.ML

    Does the Adam Optimizer Exacerbate Catastrophic Forgetting?

    Authors: Dylan R. Ashley, Sina Ghiassian, Richard S. Sutton

    Abstract: Catastrophic forgetting remains a severe hindrance to the broad application of artificial neural networks (ANNs); however, it continues to be a poorly understood phenomenon. Despite the extensive amount of work on catastrophic forgetting, we argue that it is still unclear how exactly the phenomenon should be quantified, and, moreover, to what degree all of the choices we make when designing learni…

    Submitted 9 June, 2021; v1 submitted 15 February, 2021; originally announced February 2021.

    Comments: 9 pages in main text + 3 pages of references + 16 pages of appendices, 6 figures in main text + 21 figures in appendices, 6 tables in appendices; source code available at https://github.com/dylanashley/catastrophic-forgetting/tree/arxiv

    ACM Class: I.2.6

  21. arXiv:2101.02808  [pdf, other]

    cs.LG cs.AI

    Average-Reward Off-Policy Policy Evaluation with Function Approximation

    Authors: Shangtong Zhang, Yi Wan, Richard S. Sutton, Shimon Whiteson

    Abstract: We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad (Sutton & Barto, 2018). To address the deadly triad, we propose two novel algorithms, reproducing…

    Submitted 18 October, 2022; v1 submitted 7 January, 2021; originally announced January 2021.

    Comments: ICML 2021

  22. arXiv:2010.15268  [pdf, other]

    cs.LG cs.AI

    Understanding the Pathologies of Approximate Policy Evaluation when Combined with Greedification in Reinforcement Learning

    Authors: Kenny Young, Richard S. Sutton

    Abstract: Despite empirical success, the theory of reinforcement learning (RL) with value function approximation remains fundamentally incomplete. Prior work has identified a variety of pathological behaviours that arise in RL algorithms that combine approximate on-policy evaluation and greedification. One prominent example is policy oscillation, wherein an algorithm may cycle indefinitely between policies,…

    Submitted 28 October, 2020; originally announced October 2020.

  23. arXiv:2008.12095  [pdf, other]

    cs.AI cs.HC cs.LG

    Document-editing Assistants and Model-based Reinforcement Learning as a Path to Conversational AI

    Authors: Katya Kudashkina, Patrick M. Pilarski, Richard S. Sutton

    Abstract: Intelligent assistants that follow commands or answer simple questions, such as Siri and Google search, are among the most economically important applications of AI. Future conversational AI assistants promise even greater capabilities and a better user experience through a deeper understanding of the domain, the user, or the user's purposes. But what domain and what methods are best suited to res…

    Submitted 27 August, 2020; originally announced August 2020.

    Comments: Currently under review

  24. arXiv:2008.11329  [pdf, other]

    cs.LG cs.AI

    Inverse Policy Evaluation for Value-based Sequential Decision-making

    Authors: Alan Chan, Kris de Asis, Richard S. Sutton

    Abstract: Value-based methods for reinforcement learning lack generally applicable ways to derive behavior from a value function. Many approaches involve approximate value iteration (e.g., $Q$-learning), and acting greedily with respect to the estimates with an arbitrary degree of entropy to ensure that the state-space is sufficiently explored. Behavior based on explicit greedification assumes that the valu…

    Submitted 25 August, 2020; originally announced August 2020.

    Comments: Submitted to NeurIPS 2020

  25. arXiv:2006.16318  [pdf, other]

    cs.LG cs.AI

    Learning and Planning in Average-Reward Markov Decision Processes

    Authors: Yi Wan, Abhishek Naik, Richard S. Sutton

    Abstract: We introduce learning and planning algorithms for average-reward MDPs, including 1) the first general proven-convergent off-policy model-free control algorithm without reference states, 2) the first proven-convergent off-policy model-free prediction algorithm, and 3) the first off-policy learning algorithm that converges to the actual value function rather than to the value function plus an offset…

    Submitted 28 June, 2021; v1 submitted 29 June, 2020; originally announced June 2020.

    Comments: In Proceedings of ICML 2021

  26. arXiv:1912.04002  [pdf, other]

    cs.LG stat.ML

    Learning Sparse Representations Incrementally in Deep Reinforcement Learning

    Authors: J. Fernando Hernandez-Garcia, Richard S. Sutton

    Abstract: Sparse representations have been shown to be useful in deep reinforcement learning for mitigating catastrophic interference and improving the performance of agents in terms of cumulative reward. Previous results were based on a two-step process where the representation was learned offline and the action-value function was learned online afterwards. In this paper, we investigate if it is possible to…

    Submitted 9 December, 2019; originally announced December 2019.

  27. arXiv:1910.02140  [pdf, ps, other]

    cs.AI

    Discounted Reinforcement Learning Is Not an Optimization Problem

    Authors: Abhishek Naik, Roshan Shariff, Niko Yasui, Hengshuai Yao, Richard S. Sutton

    Abstract: Discounted reinforcement learning is fundamentally incompatible with function approximation for control in continuing tasks. It is not an optimization problem in its usual formulation, so when using function approximation there is no optimal policy. We substantiate these claims, then go on to address some misconceptions about discounting and its connection to the average reward formulation. We enc…

    Submitted 27 November, 2019; v1 submitted 4 October, 2019; originally announced October 2019.

    Comments: Accepted for presentation at the Optimization Foundations of Reinforcement Learning Workshop at NeurIPS 2019

  28. arXiv:1909.03906  [pdf, other]

    cs.LG cs.AI

    Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning

    Authors: Kristopher De Asis, Alan Chan, Silviu Pitis, Richard S. Sutton, Daniel Graves

    Abstract: We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a $\textit{fixed}$ number of future time steps. To learn the value function for horizon $h$, these algorithms bootstrap from the value function for horizon $h-1$, or some shorter horizon. Because no value function bootstraps from itself…

    Submitted 10 February, 2020; v1 submitted 9 September, 2019; originally announced September 2019.

    Comments: AAAI 2020

    ACM Class: I.2

  29. arXiv:1904.01191  [pdf, other]

    cs.LG cs.AI stat.ML

    Planning with Expectation Models

    Authors: Yi Wan, Zaheer Abbas, Adam White, Martha White, Richard S. Sutton

    Abstract: Distribution and sample models are two popular model choices in model-based reinforcement learning (MBRL). However, learning these models can be intractable, particularly when the state and action spaces are large. Expectation models, on the other hand, are relatively easier to learn due to their compactness and have also been widely used for deterministic environments. For stochastic environments…

    Submitted 29 July, 2020; v1 submitted 1 April, 2019; originally announced April 2019.

  30. arXiv:1903.03252  [pdf, other]

    cs.LG cs.AI stat.ML

    Learning Feature Relevance Through Step Size Adaptation in Temporal-Difference Learning

    Authors: Alex Kearney, Vivek Veeriah, Jaden Travnik, Patrick M. Pilarski, Richard S. Sutton

    Abstract: There is a long history of using meta learning as representation learning, specifically for determining the relevance of inputs. In this paper, we examine an instance of meta-learning in which feature relevance is learned by adapting step size parameters of stochastic gradient descent---building on a variety of prior work in stochastic approximation, machine learning, and artificial neural network…

    Submitted 7 March, 2019; originally announced March 2019.

  31. arXiv:1903.00194  [pdf, other]

    cs.AI cs.LG

    Should All Temporal Difference Learning Use Emphasis?

    Authors: Xiang Gu, Sina Ghiassian, Richard S. Sutton

    Abstract: Emphatic Temporal Difference (ETD) learning has recently been proposed as a convergent off-policy learning method. ETD was proposed mainly to address convergence issues of conventional Temporal Difference (TD) learning under off-policy training but it is different from conventional TD learning even under on-policy training. A simple counterexample provided back in 2017 pointed to a potential class…

    Submitted 1 March, 2019; originally announced March 2019.

  32. arXiv:1901.07510  [pdf, other]

    cs.LG stat.ML

    Understanding Multi-Step Deep Reinforcement Learning: A Systematic Study of the DQN Target

    Authors: J. Fernando Hernandez-Garcia, Richard S. Sutton

    Abstract: Multi-step methods such as Retrace($λ$) and $n$-step $Q$-learning have become a crucial component of modern deep reinforcement learning agents. These methods are often evaluated as a part of bigger architectures and their evaluations rarely include enough samples to draw statistically significant conclusions about their performance. This type of methodology makes it difficult to understand how par…

    Submitted 7 February, 2019; v1 submitted 22 January, 2019; originally announced January 2019.

  33. arXiv:1811.02597  [pdf, other]

    cs.LG cs.AI stat.ML

    Online Off-policy Prediction

    Authors: Sina Ghiassian, Andrew Patterson, Martha White, Richard S. Sutton, Adam White

    Abstract: This paper investigates the problem of online prediction learning, where learning proceeds continuously as the agent interacts with an environment. The predictions made by the agent are contingent on a particular way of behaving, represented as a value function. However, the behavior used to select actions and generate the behavior data might be different from the one used to define the prediction…

    Submitted 6 November, 2018; originally announced November 2018.

    Comments: 68 pages

  34. arXiv:1809.07435  [pdf, other]

    cs.LG cs.AI eess.SP

    Predicting Periodicity with Temporal Difference Learning

    Authors: Kristopher De Asis, Brendan Bennett, Richard S. Sutton

    Abstract: Temporal difference (TD) learning is an important approach in reinforcement learning, as it combines ideas from dynamic programming and Monte Carlo methods in a way that allows for online and incremental model-free learning. A key idea of TD learning is that it is learning predictive knowledge about the environment in the form of value functions, from which it can derive its behavior to address lo…

    Submitted 19 September, 2018; originally announced September 2018.

  35. arXiv:1807.01830  [pdf, other]

    cs.LG cs.AI stat.ML

    Per-decision Multi-step Temporal Difference Learning with Control Variates

    Authors: Kristopher De Asis, Richard S. Sutton

    Abstract: Multi-step temporal difference (TD) learning is an important approach in reinforcement learning, as it unifies one-step TD learning with Monte Carlo methods in a way where intermediate algorithms can outperform either extreme. They address a bias-variance trade off between reliance on current estimates, which could be poor, and incorporating longer sampled reward sequences into the updates. Especi…

    Submitted 4 July, 2018; originally announced July 2018.

    Journal ref: (2018). In Conference on Uncertainty in Artificial Intelligence. http://auai.org/uai2018/proceedings/papers/282.pdf

  36. arXiv:1806.00540  [pdf, other]

    cs.LG cs.AI stat.ML

    Integrating Episodic Memory into a Reinforcement Learning Agent using Reservoir Sampling

    Authors: Kenny J. Young, Richard S. Sutton, Shuo Yang

    Abstract: Episodic memory is a psychology term which refers to the ability to recall specific events from the past. We suggest one advantage of this particular type of memory is the ability to easily assign credit to a specific state when remembered information is found to be useful. Inspired by this idea, and the increasing popularity of external memory mechanisms to handle long-term dependencies in deep l…

    Submitted 1 June, 2018; originally announced June 2018.

  37. arXiv:1805.07476  [pdf, other]

    cs.LG cs.AI stat.ML

    Two geometric input transformation methods for fast online reinforcement learning with neural nets

    Authors: Sina Ghiassian, Huizhen Yu, Banafsheh Rafiee, Richard S. Sutton

    Abstract: We apply neural nets with ReLU gates in online reinforcement learning. Our goal is to train these networks in an incremental manner, without the computationally expensive experience replay. By studying how individual neural nodes behave in online training, we recognize that the global nature of ReLU gates can cause undesirable learning interference in each node's learning behavior. We propose redu…

    Submitted 6 September, 2018; v1 submitted 18 May, 2018; originally announced May 2018.

    Comments: 16 pages

  38. arXiv:1804.03334  [pdf, other]

    cs.LG stat.ML

    TIDBD: Adapting Temporal-difference Step-sizes Through Stochastic Meta-descent

    Authors: Alex Kearney, Vivek Veeriah, Jaden B. Travnik, Richard S. Sutton, Patrick M. Pilarski

    Abstract: In this paper, we introduce a method for adapting the step-sizes of temporal difference (TD) learning. The performance of TD methods often depends on well chosen step-sizes, yet few algorithms have been developed for setting the step-size automatically for TD learning. An important limitation of current methods is that they adapt a single step-size shared by all the weights of the learning system.…

    Submitted 10 April, 2018; originally announced April 2018.

    Comments: Version as submitted to the 31st Conference on Neural Information Processing Systems (NIPS 2017) on May 19, 2017. 9 pages, 5 figures. Extended version in preparation for journal submission

  39. Reactive Reinforcement Learning in Asynchronous Environments

    Authors: Jaden B. Travnik, Kory W. Mathewson, Richard S. Sutton, Patrick M. Pilarski

    Abstract: The relationship between a reinforcement learning (RL) agent and an asynchronous environment is often ignored. Frequently used models of the interaction between an agent and its environment, such as Markov Decision Processes (MDP) or Semi-Markov Decision Processes (SMDP), do not capture the fact that, in an asynchronous environment, the state of the environment may change during computation perfor…

    Submitted 16 February, 2018; originally announced February 2018.

    Comments: 11 pages, 7 figures, currently under journal peer review

  40. arXiv:1801.08287  [pdf, other]

    cs.AI

    Directly Estimating the Variance of the λ-Return Using Temporal-Difference Methods

    Authors: Craig Sherstan, Brendan Bennett, Kenny Young, Dylan R. Ashley, Adam White, Martha White, Richard S. Sutton

    Abstract: This paper investigates estimating the variance of a temporal-difference learning agent's update target. Most reinforcement learning methods use an estimate of the value function, which captures how good it is for the agent to be in a particular state and is mathematically expressed as the expected sum of discounted future rewards (called the return). These values can be straightforwardly estimate…

    Submitted 14 February, 2018; v1 submitted 25 January, 2018; originally announced January 2018.

  41. arXiv:1712.01275  [pdf, other]

    cs.LG cs.AI

    A Deeper Look at Experience Replay

    Authors: Shangtong Zhang, Richard S. Sutton

    Abstract: Experience replay has recently become widely used in various deep reinforcement learning (RL) algorithms; in this paper we rethink its utility. It introduces a new hyper-parameter, the memory buffer size, which needs careful tuning. Unfortunately, the importance of this new hyper-parameter has been underestimated in the community for a long time. In this paper we did a system…

    Submitted 30 April, 2018; v1 submitted 4 December, 2017; originally announced December 2017.

    Comments: NIPS 2017 Deep Reinforcement Learning Symposium

  42. arXiv:1711.03676  [pdf, other]

    cs.AI cs.HC cs.LG

    Communicative Capital for Prosthetic Agents

    Authors: Patrick M. Pilarski, Richard S. Sutton, Kory W. Mathewson, Craig Sherstan, Adam S. R. Parker, Ann L. Edwards

    Abstract: This work presents an overarching perspective on the role that machine intelligence can play in enhancing human abilities, especially those that have been diminished due to injury or illness. As a primary contribution, we develop the hypothesis that assistive devices, and specifically artificial arms and hands, can and should be viewed as agents in order for us to most effectively improve their co…

    Submitted 9 November, 2017; originally announced November 2017.

    Comments: 33 pages, 10 figures; unpublished technical report undergoing peer review

  43. arXiv:1705.04185  [pdf, other]

    cs.AI cs.LG

    A First Empirical Study of Emphatic Temporal Difference Learning

    Authors: Sina Ghiassian, Banafsheh Rafiee, Richard S. Sutton

    Abstract: In this paper we present the first empirical study of the emphatic temporal-difference learning algorithm (ETD), comparing it with conventional temporal-difference learning, in particular, with linear TD(0), on on-policy and off-policy variations of the Mountain Car problem. The initial motivation for developing ETD was that it has good convergence properties under off-policy training (Sutton, Mah…

    Submitted 12 May, 2017; v1 submitted 11 May, 2017; originally announced May 2017.

    Comments: 5 pages, Accepted to NIPS Continual Learning and Deep Networks workshop, 2016

  44. arXiv:1705.03967  [pdf, ps, other]

    cs.LG

    GQ($λ$) Quick Reference and Implementation Guide

    Authors: Adam White, Richard S. Sutton

    Abstract: This document should serve as a quick reference for and guide to the implementation of linear GQ($λ$), a gradient-based off-policy temporal-difference learning algorithm. Explanations of the intuition and theory behind the algorithm are provided elsewhere (e.g., Maei & Sutton 2010, Maei 2011). If you have questions or concerns about the content in this document or the attached java code please email Ada…

    Submitted 10 May, 2017; originally announced May 2017.

  45. Policy Iterations for Reinforcement Learning Problems in Continuous Time and Space -- Fundamental Theory and Methods

    Authors: Jaeyoung Lee, Richard S. Sutton

    Abstract: Policy iteration (PI) is a recursive process of policy evaluation and improvement for solving an optimal decision-making/control problem, or in other words, a reinforcement learning (RL) problem. PI has also served as the foundation for developing RL methods. In this paper, we propose two PI methods, called differential PI (DPI) and integral PI (IPI), and their variants, for a general RL framewor…

    Submitted 31 October, 2020; v1 submitted 9 May, 2017; originally announced May 2017.

    Comments: To appear in Automatica. All the Appendices are provided

    MSC Class: 68T05; 49L20; 93C15; 34H05

    Journal ref: Automatica vol. 126, 109421 (2021)

  46. arXiv:1704.04463  [pdf, other]

    cs.LG math.OC

    On Generalized Bellman Equations and Temporal-Difference Learning

    Authors: Huizhen Yu, A. Rupam Mahmood, Richard S. Sutton

    Abstract: We consider off-policy temporal-difference (TD) learning in discounted Markov decision processes, where the goal is to evaluate a policy in a model-free way by using observations of a state process generated without executing the policy. To curb the high variance issue in off-policy TD learning, we propose a new scheme of setting the $λ$-parameters of TD, based on generalized Bellman equations. Ou…

    Submitted 27 September, 2018; v1 submitted 14 April, 2017; originally announced April 2017.

    Comments: Minor revision; 41 pages; to appear in Journal of Machine Learning Research, 2018

    MSC Class: 90C40; 60J05; 65C05; 68W40

    Journal ref: Journal of Machine Learning Research 19(48):1-49, 2018

  47. arXiv:1703.01327  [pdf, other]

    cs.AI cs.LG

    Multi-step Reinforcement Learning: A Unifying Algorithm

    Authors: Kristopher De Asis, J. Fernando Hernandez-Garcia, G. Zacharias Holland, Richard S. Sutton

    Abstract: Unifying seemingly disparate algorithmic ideas to produce better performing algorithms has been a longstanding goal in reinforcement learning. As a primary example, TD($λ$) elegantly unifies one-step TD prediction with Monte Carlo methods through the use of eligibility traces and the trace-decay parameter $λ$. Currently, there are a multitude of algorithms that can be used to perform TD control, i…

    Submitted 11 June, 2018; v1 submitted 3 March, 2017; originally announced March 2017.

    Comments: Appeared at the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

    Journal ref: (2018). In AAAI Conference on Artificial Intelligence. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16294

  48. arXiv:1702.03006  [pdf, other]

    cs.LG

    Multi-step Off-policy Learning Without Importance Sampling Ratios

    Authors: Ashique Rupam Mahmood, Huizhen Yu, Richard S. Sutton

    Abstract: To estimate the value functions of policies from exploratory data, most model-free off-policy algorithms rely on importance sampling, where the use of importance sampling ratios often leads to estimates with severe variance. It is thus desirable to learn off-policy without using the ratios. However, such an algorithm does not exist for multi-step learning with function approximation. In this paper…

    Submitted 9 February, 2017; originally announced February 2017.

    Comments: 24 pages, 4 figures

  49. arXiv:1612.02879  [pdf, other]

    cs.LG cs.AI stat.ML

    Learning Representations by Stochastic Meta-Gradient Descent in Neural Networks

    Authors: Vivek Veeriah, Shangtong Zhang, Richard S. Sutton

    Abstract: Representations are fundamental to artificial intelligence. The performance of a learning system depends on the type of representation used for representing the data. Typically, these representations are hand-engineered using domain knowledge. More recently, the trend is to learn these representations through stochastic gradient descent in multi-layer neural networks, which is called backprop. Lea…

    Submitted 27 April, 2017; v1 submitted 8 December, 2016; originally announced December 2016.

  50. arXiv:1607.05047  [pdf, other]

    stat.ML cs.LG

    A Batch, Off-Policy, Actor-Critic Algorithm for Optimizing the Average Reward

    Authors: S. A. Murphy, Y. Deng, E. B. Laber, H. R. Maei, R. S. Sutton, K. Witkiewitz

    Abstract: We develop an off-policy actor-critic algorithm for learning an optimal policy from a training set composed of data from multiple individuals. This algorithm is developed with a view towards its use in mobile health.

    Submitted 18 July, 2016; originally announced July 2016.