Showing 1–13 of 13 results for author: Daley, B

  1. arXiv:2406.12284

    cs.LG cs.AI

    Demystifying the Recency Heuristic in Temporal-Difference Learning

    Authors: Brett Daley, Marlos C. Machado, Martha White

    Abstract: The recency heuristic in reinforcement learning is the assumption that stimuli that occurred closer in time to an acquired reward should be more heavily reinforced. The recency heuristic is one of the key assumptions made by TD($λ$), which reinforces recent experiences according to an exponentially decaying weighting. In fact, all other widely used return estimators for TD learning, such as $n$-st…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: RLC 2024. 18 pages, 8 figures, 1 table
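The exponentially decaying weighting used by TD($λ$) can be made concrete with the standard textbook forward-view $λ$-return, which gives weight $(1-λ)λ^{n-1}$ to the $n$-step return and the leftover weight to the Monte Carlo return at episode end. This is a generic illustration, not code from the paper; the function names and the episodic layout (`rewards[k]` received after step `k`, `values[k]` estimating $V(s_k)$, terminal value 0) are assumptions. It checks the weighted-average definition against the equivalent backward recursion.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return G_t^(n) for an episode that terminates after len(rewards) steps.

    Assumed layout: rewards[k] is the reward received after step k and
    values[k] estimates V(s_k); the terminal state has value 0.
    """
    T = len(rewards)
    g = 0.0
    for k in range(n):
        if t + k >= T:
            return g  # episode ended early: plain Monte Carlo return
        g += gamma**k * rewards[t + k]
    if t + n < T:
        g += gamma**n * values[t + n]  # bootstrap from the value estimate
    return g


def lambda_return_forward(rewards, values, t, gamma, lam):
    """Forward view: weight (1 - lam) * lam**(n - 1) on each n-step return."""
    horizon = len(rewards) - t  # every n >= horizon gives the Monte Carlo return
    g = 0.0
    for n in range(1, horizon):
        g += (1 - lam) * lam**(n - 1) * n_step_return(rewards, values, t, n, gamma)
    # The leftover weight lam**(horizon - 1) goes to the Monte Carlo return.
    g += lam**(horizon - 1) * n_step_return(rewards, values, t, horizon, gamma)
    return g


def lambda_return_recursive(rewards, values, gamma, lam):
    """Backward recursion: G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1})."""
    T = len(rewards)
    returns = [0.0] * T
    g = 0.0  # the lambda-return after termination is 0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0
        g = rewards[t] + gamma * ((1 - lam) * next_v + lam * g)
        returns[t] = g
    return returns


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3]
recursive = lambda_return_recursive(rewards, values, gamma=0.9, lam=0.8)
for t in range(len(rewards)):
    forward = lambda_return_forward(rewards, values, t, gamma=0.9, lam=0.8)
    print(t, round(forward, 6), round(recursive[t], 6))
```

Setting `lam=0.0` recovers the one-step TD target and `lam=1.0` the Monte Carlo return, the two extremes between which the exponential weighting interpolates.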

  2. arXiv:2402.03903

    cs.LG

    Averaging $n$-step Returns Reduces Variance in Reinforcement Learning

    Authors: Brett Daley, Martha White, Marlos C. Machado

    Abstract: Multistep returns, such as $n$-step returns and $λ$-returns, are commonly used to improve the sample efficiency of reinforcement learning (RL) methods. The variance of the multistep returns becomes the limiting factor in their length; looking too far into the future increases variance and reverses the benefits of multistep learning. In our work, we demonstrate the ability of compound returns -- we…

    Submitted 5 June, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

    Comments: ICML 2024. 27 pages, 7 figures, 3 tables
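A compound return is a weighted average of $n$-step returns whose non-negative weights sum to 1; the $λ$-return is one such average. The sketch below computes an arbitrary compound return, for example $\tfrac12 G^{(2)} + \tfrac12 G^{(4)}$, assuming a hypothetical episodic layout (`rewards[k]` received after step `k`, `values[k]` estimating $V(s_k)$, terminal value 0). It only illustrates the general definition; it is not the specific estimator analyzed in the paper.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return G_t^(n); rewards[k] follows step k, values[k] estimates V(s_k), terminal value 0."""
    T = len(rewards)
    g = 0.0
    for k in range(n):
        if t + k >= T:
            return g  # episode ended before n steps: no bootstrap
        g += gamma**k * rewards[t + k]
    if t + n < T:
        g += gamma**n * values[t + n]
    return g


def compound_return(rewards, values, t, weights, gamma):
    """Weighted average of n-step returns; `weights` maps n -> weight and must sum to 1."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("compound-return weights must be non-negative")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("compound-return weights must sum to 1")
    return sum(w * n_step_return(rewards, values, t, n, gamma) for n, w in weights.items())


rewards = [1.0, 0.0, 2.0, 1.0, 0.5]
values = [0.0, 0.2, 0.4, 0.6, 0.8]
# Equal-weight average of the 2-step and 4-step returns from t = 0.
print(compound_return(rewards, values, 0, {2: 0.5, 4: 0.5}, gamma=0.9))
```

Requiring the weights to sum to 1 keeps the compound estimate on the same scale as each individual $n$-step return.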

  3. arXiv:2301.11321

    cs.LG

    Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning

    Authors: Brett Daley, Martha White, Christopher Amato, Marlos C. Machado

    Abstract: Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging. Classically, off-policy bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio after each action via eligibility traces. Many off-po…

    Submitted 31 May, 2023; v1 submitted 26 January, 2023; originally announced January 2023.

    Comments: ICML 2023. 8 pages, 2 figures. arXiv admin note: text overlap with arXiv:2112.12281
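The classical per-decision correction described in the abstract can be sketched as tabular off-policy TD($λ$) for state values, where the eligibility trace is rescaled by the instantaneous IS ratio $ρ_t = π(a_t|s_t)/μ(a_t|s_t)$ at every step, so the update a later TD error makes to an earlier state's value is scaled by the product of the ratios in between. This follows the common textbook form $z_t = ρ_t(γλ z_{t-1} + \mathbf{1}[s = s_t])$; it illustrates the per-decision baseline, not the trajectory-aware traces the paper proposes. The episode tuple format and the policy tables `pi` and `mu` are assumptions.

```python
def off_policy_td_lambda(episode, num_states, pi, mu, alpha, gamma, lam):
    """Tabular off-policy TD(lambda) for V with per-decision importance-sampling traces.

    episode: list of (s, a, r, s_next, done) tuples generated by the behavior policy mu.
    pi[s][a], mu[s][a]: target and behavior action probabilities (assumed layout).
    """
    V = [0.0] * num_states
    z = [0.0] * num_states  # eligibility trace, one entry per state
    for s, a, r, s_next, done in episode:
        rho = pi[s][a] / mu[s][a]  # instantaneous IS ratio for this decision
        # Decay the trace, mark the current state, then rescale the whole trace by rho.
        for i in range(num_states):
            z[i] = rho * (gamma * lam * z[i] + (1.0 if i == s else 0.0))
        v_next = 0.0 if done else V[s_next]
        delta = r + gamma * v_next - V[s]  # TD error at this step
        for i in range(num_states):
            V[i] += alpha * delta * z[i]
        if done:
            z = [0.0] * num_states  # start the next episode with a clean trace
    return V


pi = [[0.0, 1.0], [0.0, 1.0]]  # target policy always takes action 1
mu = [[0.5, 0.5], [0.5, 0.5]]  # uniform behavior policy
episode = [(0, 1, 1.0, 1, False), (1, 0, 0.0, 0, True)]
print(off_policy_td_lambda(episode, 2, pi, mu, alpha=0.1, gamma=0.9, lam=0.8))
```

A ratio of 0 (the target policy would never take the behavior action) zeroes the whole trace, which cuts off credit to every earlier state; ratios above 1 inflate it, which is the variance problem the abstract refers to.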

  4. arXiv:2206.01896

    cs.LG

    Adaptive Tree Backup Algorithms for Temporal-Difference Reinforcement Learning

    Authors: Brett Daley, Isaac Chan

    Abstract: Q($σ$) is a recently proposed temporal-difference learning method that interpolates between learning from expected backups and sampled backups. It has been shown that intermediate values for the interpolation parameter $σ\in [0,1]$ perform better in practice, and therefore it is commonly believed that $σ$ functions as a bias-variance trade-off parameter to achieve these improvements. In our work,…

    Submitted 3 June, 2022; originally announced June 2022.

    Comments: RLDM 2022. 4 pages, 1 figure
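The interpolation that Q($σ$) performs can be seen in a one-step target. This minimal sketch assumes the standard formulation in which the bootstrap term is $σ\,Q(s',a') + (1-σ)\sum_a π(a|s')\,Q(s',a)$: $σ = 1$ gives a fully sampled (Sarsa) backup and $σ = 0$ a fully expected (Expected Sarsa) backup. The function name and argument layout are hypothetical, and the multistep and adaptive tree-backup variants discussed in the paper are not shown.

```python
def q_sigma_target(reward, q_next, next_action, pi_next, gamma, sigma, done=False):
    """One-step Q(sigma) target mixing a sampled and an expected backup.

    q_next: action values Q(s', .); next_action: the sampled a';
    pi_next: target-policy probabilities pi(. | s'); sigma in [0, 1].
    """
    if not 0.0 <= sigma <= 1.0:
        raise ValueError("sigma must lie in [0, 1]")
    if done:
        return reward  # no bootstrap from a terminal state
    sampled = q_next[next_action]  # Sarsa-style sampled backup
    expected = sum(p * q for p, q in zip(pi_next, q_next))  # Expected Sarsa backup
    return reward + gamma * (sigma * sampled + (1.0 - sigma) * expected)


q_next = [1.0, 3.0]
pi_next = [0.25, 0.75]
for sigma in (0.0, 0.5, 1.0):
    print(sigma, q_sigma_target(0.5, q_next, 0, pi_next, gamma=0.9, sigma=sigma))
```

Because the target is linear in $σ$, intermediate values simply blend the two backups; the sampled term carries the randomness of $a'$, while the expected term does not.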

  5. arXiv:2112.12281

    cs.LG

    Improving the Efficiency of Off-Policy Reinforcement Learning by Accounting for Past Decisions

    Authors: Brett Daley, Christopher Amato

    Abstract: Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, particularly in the experience replay setting now commonly used with deep neural networks. Classically, off-policy estimation bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio (via eligibility traces) after…

    Submitted 22 December, 2021; originally announced December 2021.

    Comments: 12 pages, 0 figures

  6. arXiv:2112.03421

    cs.LG

    Virtual Replay Cache

    Authors: Brett Daley, Christopher Amato

    Abstract: Return caching is a recent strategy that enables efficient minibatch training with multistep estimators (e.g. the λ-return) for deep reinforcement learning. By precomputing return estimates in sequential batches and then storing the results in an auxiliary data structure for later sampling, the average computation spent per estimate can be greatly reduced. Still, the efficiency of return caching c…

    Submitted 6 December, 2021; originally announced December 2021.

    Comments: 4 pages, 1 figure, 3 tables
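Return caching as summarized above can be sketched in three steps: compute $λ$-returns for a contiguous block of stored transitions with one backward pass, keep them in an auxiliary table, and sample training minibatches from that table until the next refresh. Everything here (`compute_lambda_returns`, `ReturnCache`, the block layout, and bootstrapping from the value of the state reached after the block's last transition) is a hypothetical illustration of the general idea, not the Virtual Replay Cache itself, whose details are elided in the abstract.

```python
import random


def compute_lambda_returns(rewards, next_values, dones, gamma, lam):
    """Lambda-returns for one contiguous block of transitions via a backward pass.

    next_values[t] estimates V of the state reached after transition t; the
    block is truncated after its last transition, which bootstraps fully.
    """
    n = len(rewards)
    returns = [0.0] * n
    g = next_values[-1]  # truncation: the return "after" the block is the value estimate
    for t in reversed(range(n)):
        if dones[t]:
            g = rewards[t]  # terminal transition: nothing to bootstrap from
        else:
            g = rewards[t] + gamma * ((1 - lam) * next_values[t] + lam * g)
        returns[t] = g
    return returns


class ReturnCache:
    """Hypothetical auxiliary table of precomputed (transition index, return) pairs."""

    def __init__(self):
        self.indices = []
        self.returns = []

    def refresh(self, start, rewards, next_values, dones, gamma, lam):
        # Pay for the backward pass once per block instead of once per sampled transition.
        self.returns = compute_lambda_returns(rewards, next_values, dones, gamma, lam)
        self.indices = list(range(start, start + len(self.returns)))

    def sample(self, batch_size, rng):
        picks = rng.sample(range(len(self.returns)), batch_size)
        return [(self.indices[i], self.returns[i]) for i in picks]


cache = ReturnCache()
cache.refresh(start=100, rewards=[1.0, 0.0, 2.0], next_values=[0.5, 0.4, 0.3],
              dones=[False, False, True], gamma=0.9, lam=0.8)
print(cache.sample(2, random.Random(0)))
```

The saving comes from amortization: one $O(n)$ backward pass serves every later draw from the block, whereas computing each sampled return from scratch would cost up to $O(n)$ per sample.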

  7. arXiv:2111.01264

    cs.LG

    Human-Level Control without Server-Grade Hardware

    Authors: Brett Daley, Christopher Amato

    Abstract: Deep Q-Network (DQN) marked a major milestone for reinforcement learning, demonstrating for the first time that human-level control policies could be learned directly from raw visual inputs via reward maximization. Even years after its introduction, DQN remains highly relevant to the research community since many of its innovations have been adopted by successor methods. Nevertheless, despite sign…

    Submitted 1 November, 2021; originally announced November 2021.

    Comments: 13 pages, 3 figures, 5 tables

  8. arXiv:2106.05449

    cs.LG

    Investigating Alternatives to the Root Mean Square for Adaptive Gradient Methods

    Authors: Brett Daley, Christopher Amato

    Abstract: Adam is an adaptive gradient method that has experienced widespread adoption due to its fast and reliable training performance. Recent approaches have not offered significant improvement over Adam, often because they do not innovate upon one of its core features: normalization by the root mean square (RMS) of recent gradients. However, as noted by Kingma and Ba (2015), any number of $L^p$ normaliz…

    Submitted 9 June, 2021; originally announced June 2021.

    Comments: 12 pages, 6 figures, 3 tables
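The $L^p$ generalization mentioned in the abstract can be sketched by replacing Adam's moving average of squared gradients with a moving average of $|g|^p$ and taking the $p$-th root in the denominator; $p = 2$ recovers Adam's RMS normalization. This is a hypothetical pure-Python sketch (`LpAdam` is not a real library class); it carries over Adam's $β_2$ decay and bias correction unchanged as an assumption, and it does not reproduce the specific alternatives evaluated in the paper.

```python
class LpAdam:
    """Hypothetical Adam variant normalizing by an L^p mean of recent gradient magnitudes.

    p = 2 gives Adam's usual root-mean-square denominator. The beta2 decay and the
    bias correction are carried over from Adam unchanged (an assumption of this sketch).
    """

    def __init__(self, num_params, lr=1e-3, beta1=0.9, beta2=0.999, p=2.0, eps=1e-8):
        if p <= 0:
            raise ValueError("p must be positive")
        self.lr, self.beta1, self.beta2, self.p, self.eps = lr, beta1, beta2, p, eps
        self.m = [0.0] * num_params  # EMA of gradients (first moment)
        self.v = [0.0] * num_params  # EMA of |gradient|**p
        self.t = 0

    def step(self, params, grads):
        """Return updated parameters for one optimization step."""
        self.t += 1
        bias1 = 1 - self.beta1**self.t
        bias2 = 1 - self.beta2**self.t
        updated = []
        for i, (x, g) in enumerate(zip(params, grads)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * abs(g) ** self.p
            m_hat = self.m[i] / bias1
            denom = (self.v[i] / bias2) ** (1.0 / self.p)  # L^p "mean" of recent gradients
            updated.append(x - self.lr * m_hat / (denom + self.eps))
        return updated


# Minimize f(x) = x**2 from x = 3 with p = 4 instead of Adam's p = 2.
opt = LpAdam(num_params=1, lr=0.1, p=4.0)
x = [3.0]
for _ in range(200):
    x = opt.step(x, [2 * x[0]])
print(x[0])
```

On the first step the bias-corrected moments give a denominator of $|g|$ for any $p$, so each parameter moves by roughly `lr` against the sign of its gradient; the choice of $p$ only matters once gradients of different magnitudes accumulate in `v`.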

  9. arXiv:2102.11319

    cs.LG cs.AI

    Stratified Experience Replay: Correcting Multiplicity Bias in Off-Policy Reinforcement Learning

    Authors: Brett Daley, Cameron Hickert, Christopher Amato

    Abstract: Deep Reinforcement Learning (RL) methods rely on experience replay to approximate the minibatched supervised learning setting; however, unlike supervised learning where access to lots of training data is crucial to generalization, replay-based deep RL appears to struggle in the presence of extraneous data. Recent works have shown that the performance of Deep Q-Network (DQN) degrades when its repla…

    Submitted 22 February, 2021; originally announced February 2021.

    Comments: AAMAS 2021 Extended Abstract, 3 pages, 3 figures

  10. arXiv:2102.04402

    cs.LG cs.AI

    Contrasting Centralized and Decentralized Critics in Multi-Agent Reinforcement Learning

    Authors: Xueguang Lyu, Yuchen Xiao, Brett Daley, Christopher Amato

    Abstract: Centralized Training for Decentralized Execution, where agents are trained offline using centralized information but execute in a decentralized manner online, has gained popularity in the multi-agent reinforcement learning community. In particular, actor-critic methods with a centralized critic and decentralized actors are a common instance of this idea. However, the implications of using a centra…

    Submitted 2 December, 2021; v1 submitted 8 February, 2021; originally announced February 2021.

    Journal ref: Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS). 2021

  11. arXiv:2010.09170

    cs.RO cs.AI cs.LG

    Belief-Grounded Networks for Accelerated Robot Learning under Partial Observability

    Authors: Hai Nguyen, Brett Daley, Xinchao Song, Christopher Amato, Robert Platt

    Abstract: Many important robotics problems are partially observable in the sense that a single visual or force-feedback measurement is insufficient to reconstruct the state. Standard approaches involve learning a policy over beliefs or observation-action histories. However, both of these have drawbacks; it is expensive to track the belief online, and it is hard to learn policies directly over histories. We…

    Submitted 20 October, 2021; v1 submitted 18 October, 2020; originally announced October 2020.

    Comments: Accepted at Conference on Robot Learning (CoRL) 2020

  12. arXiv:2010.01356

    cs.LG stat.ML

    Expectigrad: Fast Stochastic Optimization with Robust Convergence Properties

    Authors: Brett Daley, Christopher Amato

    Abstract: Many popular adaptive gradient methods such as Adam and RMSProp rely on an exponential moving average (EMA) to normalize their stepsizes. While the EMA makes these methods highly responsive to new gradient information, recent research has shown that it also causes divergence on at least one convex optimization problem. We propose a novel method called Expectigrad, which adjusts stepsizes according…

    Submitted 12 October, 2021; v1 submitted 3 October, 2020; originally announced October 2020.

    Comments: Preprint. 18 pages, 4 figures, 3 tables

  13. arXiv:1810.09967

    cs.LG stat.ML

    Reconciling $λ$-Returns with Experience Replay

    Authors: Brett Daley, Christopher Amato

    Abstract: Modern deep reinforcement learning methods have departed from the incremental learning required for eligibility traces, rendering the implementation of the $λ$-return difficult in this context. In particular, off-policy methods that utilize experience replay remain problematic because their random sampling of minibatches is not conducive to the efficient calculation of $λ$-returns. Yet replay-base…

    Submitted 13 January, 2020; v1 submitted 23 October, 2018; originally announced October 2018.

    Comments: NeurIPS 2019 (Camera-Ready). Code available: https://github.com/brett-daley/dqn-lambda