-
Pragmatic Feature Preferences: Learning Reward-Relevant Preferences from Human Input
Authors:
Andi Peng,
Yuying Sun,
Tianmin Shu,
David Abel
Abstract:
Humans use social context to specify preferences over behaviors, i.e. their reward functions. Yet, algorithms for inferring reward models from preference data do not take this social learning view into account. Inspired by pragmatic human communication, we study how to extract fine-grained data regarding why an example is preferred that is useful for learning more accurate reward models. We propose to enrich binary preference queries to ask both (1) which features of a given example are preferable in addition to (2) comparisons between examples themselves. We derive an approach for learning from these feature-level preferences, both for cases where users specify which features are reward-relevant, and when users do not. We evaluate our approach on linear bandit settings in both vision- and language-based domains. Results support the efficiency of our approach in quickly converging to accurate rewards with fewer comparisons vs. example-only labels. Finally, we validate the real-world applicability with a behavioral experiment on a mushroom foraging task. Our findings suggest that incorporating pragmatic feature preferences is a promising approach for more efficient user-aligned reward learning.
Submitted 23 May, 2024;
originally announced May 2024.
-
A Definition of Continual Reinforcement Learning
Authors:
David Abel,
André Barreto,
Benjamin Van Roy,
Doina Precup,
Hado van Hasselt,
Satinder Singh
Abstract:
In a standard view of the reinforcement learning problem, an agent's goal is to efficiently identify a policy that maximizes long-term reward. However, this perspective is based on a restricted view of learning as finding a solution, rather than treating learning as endless adaptation. In contrast, continual reinforcement learning refers to the setting in which the best agents never stop learning. Despite the importance of continual reinforcement learning, the community lacks a simple definition of the problem that highlights its commitments and makes its primary concepts precise and clear. To this end, this paper is dedicated to carefully defining the continual reinforcement learning problem. We formalize the notion of agents that "never stop learning" through a new mathematical language for analyzing and cataloging agents. Using this new language, we define a continual learning agent as one that can be understood as carrying out an implicit search process indefinitely, and continual reinforcement learning as the setting in which the best agents are all continual learning agents. We provide two motivating examples, illustrating that traditional views of multi-task reinforcement learning and continual supervised learning are special cases of our definition. Collectively, these definitions and perspectives formalize many intuitive concepts at the heart of learning, and open new research pathways surrounding continual learning agents.
Submitted 1 December, 2023; v1 submitted 20 July, 2023;
originally announced July 2023.
-
On the Convergence of Bounded Agents
Authors:
David Abel,
André Barreto,
Hado van Hasselt,
Benjamin Van Roy,
Doina Precup,
Satinder Singh
Abstract:
When has an agent converged? Standard models of the reinforcement learning problem give rise to a straightforward definition of convergence: An agent converges when its behavior or performance in each environment state stops changing. However, as we shift the focus of our learning problem from the environment's state to the agent's state, the concept of an agent's convergence becomes significantly less clear. In this paper, we propose two complementary accounts of agent convergence in a framing of the reinforcement learning problem that centers around bounded agents. The first view says that a bounded agent has converged when the minimal number of states needed to describe the agent's future behavior cannot decrease. The second view says that a bounded agent has converged just when the agent's performance only changes if the agent's internal state changes. We establish basic properties of these two definitions, show that they accommodate typical views of convergence in standard settings, and prove several facts about their nature and relationship. We take these perspectives, definitions, and analysis to bring clarity to a central idea of the field.
Submitted 20 July, 2023;
originally announced July 2023.
-
Settling the Reward Hypothesis
Authors:
Michael Bowling,
John D. Martin,
David Abel,
Will Dabney
Abstract:
The reward hypothesis posits that, "all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward)." We aim to fully settle this hypothesis. This will not conclude with a simple affirmation or refutation, but rather specify completely the implicit requirements on goals and purposes under which the hypothesis holds.
Submitted 16 September, 2023; v1 submitted 20 December, 2022;
originally announced December 2022.
-
onlineFGO: Online Continuous-Time Factor Graph Optimization with Time-Centric Multi-Sensor Fusion for Robust Localization in Large-Scale Environments
Authors:
Haoming Zhang,
Felix Widmayer,
Lars Lünnemann,
Dirk Abel
Abstract:
Accurate and consistent vehicle localization in urban areas is challenging due to the large-scale and complicated environments. In this paper, we propose onlineFGO, a novel time-centric graph-optimization-based localization method that fuses multiple sensor measurements with the continuous-time trajectory representation for vehicle localization tasks. We generalize the graph construction independent of any spatial sensor measurements by creating the states deterministically on time. As the trajectory representation in continuous-time enables querying states at arbitrary times, incoming sensor measurements can be factorized on the graph without requiring state alignment. We integrate different GNSS observations: pseudorange, deltarange, and time-differenced carrier phase (TDCP) to ensure global reference and fuse the relative motion from a LiDAR-odometry to improve the localization consistency while GNSS observations are not available. Experiments on general performance, effects of different factors, and hyper-parameter settings are conducted in a real-world measurement campaign in Aachen city that contains different urban scenarios. Our results show an average 2D error of 0.99m and consistent state estimation in urban scenarios.
Submitted 1 September, 2023; v1 submitted 10 November, 2022;
originally announced November 2022.
-
Merlin HugeCTR: GPU-accelerated Recommender System Training and Inference
Authors:
Joey Wang,
Yingcan Wei,
Minseok Lee,
Matthias Langer,
Fan Yu,
Jie Liu,
Alex Liu,
Daniel Abel,
Gems Guo,
Jianbing Dong,
Jerry Shi,
Kunlun Li
Abstract:
In this talk, we introduce Merlin HugeCTR. Merlin HugeCTR is an open-source, GPU-accelerated integration framework for click-through rate estimation. It optimizes both training and inference, whilst enabling model training at scale with model-parallel embeddings and data-parallel neural networks. In particular, Merlin HugeCTR combines a high-performance GPU embedding cache with a hierarchical storage architecture to realize low-latency retrieval of embeddings for online model inference tasks. In the MLPerf v1.0 DLRM model training benchmark, Merlin HugeCTR achieves a speedup of up to 24.6x on a single DGX A100 (8x A100) over PyTorch on 4x4-socket CPU nodes (4x4x28 cores). Merlin HugeCTR can also take advantage of multi-node environments to accelerate training even further. Since late 2021, Merlin HugeCTR additionally features a hierarchical parameter server (HPS) and supports deployment via the NVIDIA Triton server framework, to leverage the computational capabilities of GPUs for high-speed recommendation model inference. Using this HPS, Merlin HugeCTR users can achieve a 5x to 62x speedup (depending on batch size) for popular recommendation models over CPU baseline implementations, and dramatically reduce their end-to-end inference latency.
Submitted 17 October, 2022;
originally announced October 2022.
-
Meta-Gradients in Non-Stationary Environments
Authors:
Jelena Luketina,
Sebastian Flennerhag,
Yannick Schroecker,
David Abel,
Tom Zahavy,
Satinder Singh
Abstract:
Meta-gradient methods (Xu et al., 2018; Zahavy et al., 2020) offer a promising solution to the problem of hyperparameter selection and adaptation in non-stationary reinforcement learning problems. However, the properties of meta-gradients in such environments have not been systematically studied. In this work, we bring new clarity to meta-gradients in non-stationary environments. Concretely, we ask: (i) how much information should be given to the learned optimizers, so as to enable faster adaptation and generalization over a lifetime, (ii) what meta-optimizer functions are learned in this process, and (iii) whether meta-gradient methods provide a bigger advantage in highly non-stationary environments. To study the effect of information provided to the meta-optimizer, as in recent works (Flennerhag et al., 2021; Almeida et al., 2021), we replace the tuned meta-parameters of fixed update rules with learned meta-parameter functions of selected context features. The context features carry information about agent performance and changes in the environment and hence can inform learned meta-parameter schedules. We find that adding more contextual information is generally beneficial, leading to faster adaptation of meta-parameter values and increased performance over a lifetime. We support these results with a qualitative analysis of resulting meta-parameter schedules and learned functions of context features. Lastly, we find that without context, meta-gradients do not provide a consistent advantage over the baseline in highly non-stationary environments. Our findings suggest that contextualizing meta-gradients can play a pivotal role in extracting high performance from meta-gradients in non-stationary settings.
Submitted 13 September, 2022;
originally announced September 2022.
-
A Theory of Abstraction in Reinforcement Learning
Authors:
David Abel
Abstract:
Reinforcement learning defines the problem facing agents that learn to make good decisions through action and observation alone. To be effective problem solvers, such agents must efficiently explore vast worlds, assign credit from delayed feedback, and generalize to new experiences, all while making use of limited data, computational resources, and perceptual bandwidth. Abstraction is essential to all of these endeavors. Through abstraction, agents can form concise models of their environment that support the many practices required of a rational, adaptive decision maker. In this dissertation, I present a theory of abstraction in reinforcement learning. I first offer three desiderata for functions that carry out the process of abstraction: they should 1) preserve representation of near-optimal behavior, 2) be learned and constructed efficiently, and 3) lower planning or learning time. I then present a suite of new algorithms and analysis that clarify how agents can learn to abstract according to these desiderata. Collectively, these results provide a partial path toward the discovery and use of abstraction that minimizes the complexity of effective reinforcement learning.
Submitted 1 March, 2022;
originally announced March 2022.
-
On the Expressivity of Markov Reward
Authors:
David Abel,
Will Dabney,
Anna Harutyunyan,
Mark K. Ho,
Michael L. Littman,
Doina Precup,
Satinder Singh
Abstract:
Reward is the driving force for reinforcement-learning agents. This paper is dedicated to understanding the expressivity of reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of "task" that might be desirable: (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. Our main results prove that while reward can express many of these tasks, there exist instances of each task type that no Markov reward function can capture. We then provide a set of polynomial-time algorithms that construct a Markov reward function that allows an agent to optimize tasks of each of these three types, and correctly determine when no such reward function exists. We conclude with an empirical study that corroborates and illustrates our theoretical findings.
Submitted 18 January, 2022; v1 submitted 1 November, 2021;
originally announced November 2021.
-
Bad-Policy Density: A Measure of Reinforcement Learning Hardness
Authors:
David Abel,
Cameron Allen,
Dilip Arumugam,
D. Ellis Hershkowitz,
Michael L. Littman,
Lawson L. S. Wong
Abstract:
Reinforcement learning is hard in general. Yet, in many specific environments, learning is easy. What makes learning easy in one environment, but difficult in another? We address this question by proposing a simple measure of reinforcement-learning hardness called the bad-policy density. This quantity measures the fraction of the deterministic stationary policy space that is below a desired threshold in value. We prove that this simple quantity has many properties one would expect of a measure of learning hardness. Further, we prove it is NP-hard to compute the measure in general, but there are paths to polynomial-time approximation. We conclude by summarizing potential directions and uses for this measure.
Submitted 7 October, 2021;
originally announced October 2021.
-
People construct simplified mental representations to plan
Authors:
Mark K. Ho,
David Abel,
Carlos G. Correa,
Michael L. Littman,
Jonathan D. Cohen,
Thomas L. Griffiths
Abstract:
One of the most striking features of human cognition is the capacity to plan. Two aspects of human planning stand out: its efficiency and flexibility. Efficiency is especially impressive because plans must often be made in complex environments, and yet people successfully plan solutions to myriad everyday problems despite having limited cognitive resources. Standard accounts in psychology, economics, and artificial intelligence have suggested human planning succeeds because people have a complete representation of a task and then use heuristics to plan future actions in that representation. However, this approach generally assumes that task representations are fixed. Here, we propose that task representations can be controlled and that such control provides opportunities to quickly simplify problems and more easily reason about them. We propose a computational account of this simplification process and, in a series of pre-registered behavioral experiments, show that it is subject to online cognitive control and that people optimally balance the complexity of a task representation and its utility for planning and acting. These results demonstrate how strategically perceiving and conceiving problems facilitates the effective use of limited cognitive resources.
Submitted 26 November, 2022; v1 submitted 14 May, 2021;
originally announced May 2021.
-
Revisiting Peng's Q($λ$) for Modern Reinforcement Learning
Authors:
Tadashi Kozuno,
Yunhao Tang,
Mark Rowland,
Rémi Munos,
Steven Kapturowski,
Will Dabney,
Michal Valko,
David Abel
Abstract:
Off-policy multi-step reinforcement learning algorithms consist of conservative and non-conservative algorithms: the former actively cut traces, whereas the latter do not. Recently, Munos et al. (2016) proved the convergence of conservative algorithms to an optimal Q-function. In contrast, non-conservative algorithms are thought to be unsafe and have a limited or no theoretical guarantee. Nonetheless, recent studies have shown that non-conservative algorithms empirically outperform conservative ones. Motivated by the empirical results and the lack of theory, we carry out theoretical analyses of Peng's Q($λ$), a representative example of non-conservative algorithms. We prove that it also converges to an optimal policy provided that the behavior policy slowly tracks a greedy policy in a way similar to conservative policy iteration. Such a result has been conjectured to be true but has not been proven. We also experiment with Peng's Q($λ$) in complex continuous control tasks, confirming that Peng's Q($λ$) often outperforms conservative algorithms despite its simplicity. These results indicate that Peng's Q($λ$), which was thought to be unsafe, is a theoretically-sound and practically effective algorithm.
Submitted 26 February, 2021;
originally announced March 2021.
-
Deep Learning for Climate Model Output Statistics
Authors:
Michael Steininger,
Daniel Abel,
Katrin Ziegler,
Anna Krause,
Heiko Paeth,
Andreas Hotho
Abstract:
Climate models are an important tool for the assessment of prospective climate change effects but they suffer from systematic and representation errors, especially for precipitation. Model output statistics (MOS) reduce these errors by fitting the model output to observational data with machine learning. In this work, we explore the feasibility and potential of deep learning with convolutional neural networks (CNNs) for MOS. We propose the CNN architecture ConvMOS specifically designed for reducing errors in climate model outputs and apply it to the climate model REMO. Our results show a considerable reduction of errors and mostly improved performance compared to three commonly used MOS approaches.
Submitted 9 December, 2020;
originally announced December 2020.
-
What can I do here? A Theory of Affordances in Reinforcement Learning
Authors:
Khimya Khetarpal,
Zafarali Ahmed,
Gheorghe Comanici,
David Abel,
Doina Precup
Abstract:
Reinforcement learning algorithms usually assume that all actions are always available to an agent. However, both people and animals understand the general link between the features of their environment and the actions that are feasible. Gibson (1977) coined the term "affordances" to describe the fact that certain states enable an agent to do certain actions, in the context of embodied agents. In this paper, we develop a theory of affordances for agents who learn and plan in Markov Decision Processes. Affordances play a dual role in this case. On one hand, they allow faster planning, by reducing the number of actions available in any given situation. On the other hand, they facilitate more efficient and precise learning of transition models from data, especially when such models require function approximation. We establish these properties through theoretical results as well as illustrative examples. We also propose an approach to learn affordances and use it to estimate transition models that are simpler and generalize better.
Submitted 26 June, 2020;
originally announced June 2020.
-
The Efficiency of Human Cognition Reflects Planned Information Processing
Authors:
Mark K. Ho,
David Abel,
Jonathan D. Cohen,
Michael L. Littman,
Thomas L. Griffiths
Abstract:
Planning is useful. It lets people take actions that have desirable long-term consequences. But, planning is hard. It requires thinking about consequences, which consumes limited computational and cognitive resources. Thus, people should plan their actions, but they should also be smart about how they deploy resources used for planning their actions. Put another way, people should also "plan their plans". Here, we formulate this aspect of planning as a meta-reasoning problem and formalize it in terms of a recursive Bellman objective that incorporates both task rewards and information-theoretic planning costs. Our account makes quantitative predictions about how people should plan and meta-plan as a function of the overall structure of a task, which we test in two experiments with human participants. We find that people's reaction times reflect a planned use of information processing, consistent with our account. This formulation of planning to plan provides new insight into the function of hierarchical planning, state abstraction, and cognitive control in both humans and machines.
Submitted 13 February, 2020;
originally announced February 2020.
-
Learning State Abstractions for Transfer in Continuous Control
Authors:
Kavosh Asadi,
David Abel,
Michael L. Littman
Abstract:
Can simple algorithms with a good representation solve challenging reinforcement learning problems? In this work, we answer this question in the affirmative, where we take "simple learning algorithm" to be tabular Q-Learning, the "good representations" to be a learned state abstraction, and "challenging problems" to be continuous control tasks. Our main contribution is a learning algorithm that abstracts a continuous state-space into a discrete one. We transfer this learned representation to unseen problems to enable effective learning. We provide theory showing that learned abstractions maintain a bounded value loss, and we report experiments showing that the abstractions empower tabular Q-Learning to learn efficiently in unseen tasks.
Submitted 8 February, 2020;
originally announced February 2020.
-
Lipschitz Lifelong Reinforcement Learning
Authors:
Erwan Lecarpentier,
David Abel,
Kavosh Asadi,
Yuu Jinnai,
Emmanuel Rachelson,
Michael L. Littman
Abstract:
We consider the problem of knowledge transfer when an agent is facing a series of Reinforcement Learning (RL) tasks. We introduce a novel metric between Markov Decision Processes (MDPs) and establish that close MDPs have close optimal value functions. Formally, the optimal value functions are Lipschitz continuous with respect to the tasks space. These theoretical results lead us to a value-transfer method for Lifelong RL, which we use to build a PAC-MDP algorithm with improved convergence rate. Further, we show the method to experience no negative transfer with high probability. We illustrate the benefits of the method in Lifelong RL experiments.
Submitted 22 March, 2021; v1 submitted 15 January, 2020;
originally announced January 2020.
-
Depth Camera Based Particle Filter for Robotic Osteotomy Navigation
Authors:
Tim Übelhör,
Jonas Gesenhues,
Nassim Ayoub,
Ali Modabber,
Dirk Abel
Abstract:
Active surgical robots lack acceptance in clinical practice because they do not offer the flexibility and usability required for versatile use: the systems require a large installation space or a complicated registration step, where the preoperative plan is aligned to the patient and transformed to the base frame of the robot. In this paper, a navigation system for robotic osteotomies is designed, which uses the raw depth images from a camera mounted on the flange of a lightweight robot arm. Consequently, the system does not require any rigid attachment of the robot or fiducials to the bone, and the time-consuming registration step is eliminated. Instead, only a coarse initialization is required, which improves the usability in surgery. The full six-dimensional pose of the iliac crest bone is estimated with a particle filter at a maximum rate of 90 Hz. The presented method is robust against changing lighting conditions, blood or tissue on the bone surface, and partial occlusions caused by the surgeons. Proof of the usability in a clinical environment is successfully provided in a corpse study, where surgeons used an augmented reality osteotomy template, which was aligned to the bone via the particle filter's pose estimates, for the resection of transplants from the iliac crest.
Submitted 24 October, 2019;
originally announced October 2019.
-
Discovering Options for Exploration by Minimizing Cover Time
Authors:
Yuu Jinnai,
Jee Won Park,
David Abel,
George Konidaris
Abstract:
One of the main challenges in reinforcement learning is solving tasks with sparse reward. We show that the difficulty of discovering a distant rewarding state in an MDP is bounded by the expected cover time of a random walk over the graph induced by the MDP's transition dynamics. We therefore propose to accelerate exploration by constructing options that minimize cover time. The proposed algorithm finds an option which provably diminishes the expected number of steps to visit every state in the state space by a uniform random walk. We show empirically that the proposed algorithm improves the learning time in several domains with sparse rewards.
Submitted 16 March, 2019; v1 submitted 1 March, 2019;
originally announced March 2019.
-
Mitigating Planner Overfitting in Model-Based Reinforcement Learning
Authors:
Dilip Arumugam,
David Abel,
Kavosh Asadi,
Nakul Gopalan,
Christopher Grimm,
Jun Ki Lee,
Lucas Lehnert,
Michael L. Littman
Abstract:
An agent with an inaccurate model of its environment faces a difficult choice: it can ignore the errors in its model and act in the real world in whatever way it determines is optimal with respect to its model. Alternatively, it can take a more conservative stance and eschew its model in favor of optimizing its behavior solely via real-world interaction. This latter approach can be exceedingly slow to learn from experience, while the former can lead to "planner overfitting" - aspects of the agent's behavior are optimized to exploit errors in its model. This paper explores an intermediate position in which the planner seeks to avoid overfitting through a kind of regularization of the plans it considers. We present three different approaches that demonstrably mitigate planner overfitting in reinforcement-learning environments.
Submitted 19 March, 2020; v1 submitted 3 December, 2018;
originally announced December 2018.
-
Finding Options that Minimize Planning Time
Authors:
Yuu Jinnai,
David Abel,
D Ellis Hershkowitz,
Michael Littman,
George Konidaris
Abstract:
We formalize the problem of selecting the optimal set of options for planning as that of computing the smallest set of options so that planning converges in less than a given maximum of value-iteration passes. We first show that the problem is NP-hard, even if the task is constrained to be deterministic---the first such complexity result for option discovery. We then present the first polynomial-time boundedly suboptimal approximation algorithm for this setting, and empirically evaluate it against both the optimal options and a representative collection of heuristic approaches in simple grid-based domains including the classic four-rooms problem.
Submitted 16 March, 2019; v1 submitted 16 October, 2018;
originally announced October 2018.
-
Modeling Latent Attention Within Neural Networks
Authors:
Christopher Grimm,
Dilip Arumugam,
Siddharth Karamcheti,
David Abel,
Lawson L. S. Wong,
Michael L. Littman
Abstract:
Deep neural networks are able to solve tasks across a variety of domains and modalities of data. Despite many empirical successes, we lack the ability to clearly understand and interpret the learned internal mechanisms that contribute to such effective behaviors or, more critically, failure modes. In this work, we present a general method for visualizing an arbitrary neural network's inner mechanisms and their power and limitations. Our dataset-centric method produces visualizations of how a trained network attends to components of its inputs. The computed "attention masks" support improved interpretability by highlighting which input attributes are critical in determining output. We demonstrate the effectiveness of our framework on a variety of deep neural network architectures in domains from computer vision, natural language processing, and reinforcement learning. The primary contribution of our approach is an interpretable visualization of attention that provides unique insights into the network's underlying decision-making process irrespective of the data modality.
Submitted 30 December, 2017; v1 submitted 1 June, 2017;
originally announced June 2017.
-
Near Optimal Behavior via Approximate State Abstraction
Authors:
David Abel,
D. Ellis Hershkowitz,
Michael L. Littman
Abstract:
The combinatorial explosion that plagues planning and reinforcement learning (RL) algorithms can be moderated using state abstraction. Prohibitively large task representations can be condensed such that essential information is preserved, and consequently, solutions are tractably computable. However, exact abstractions, which treat only fully-identical situations as equivalent, fail to present opportunities for abstraction in environments where no two situations are exactly alike. In this work, we investigate approximate state abstractions, which treat nearly-identical situations as equivalent. We present theoretical guarantees of the quality of behaviors derived from four types of approximate abstractions. Additionally, we empirically demonstrate that approximate abstractions lead to reduction in task complexity and bounded loss of optimality of behavior in a variety of environments.
Submitted 15 January, 2017;
originally announced January 2017.
-
Agent-Agnostic Human-in-the-Loop Reinforcement Learning
Authors:
David Abel,
John Salvatier,
Andreas Stuhlmüller,
Owain Evans
Abstract:
Providing Reinforcement Learning agents with expert advice can dramatically improve various aspects of learning. Prior work has developed teaching protocols that enable agents to learn efficiently in complex environments; many of these methods tailor the teacher's guidance to agents with a particular representation or underlying learning scheme, offering effective but specialized teaching procedures. In this work, we explore protocol programs, an agent-agnostic schema for Human-in-the-Loop Reinforcement Learning. Our goal is to incorporate the beneficial properties of a human teacher into Reinforcement Learning without making strong assumptions about the inner workings of the agent. We show how to represent existing approaches such as action pruning, reward shaping, and training in simulation as special cases of our schema and conduct preliminary experiments on simple domains.
Submitted 15 January, 2017;
originally announced January 2017.
-
Exploratory Gradient Boosting for Reinforcement Learning in Complex Domains
Authors:
David Abel,
Alekh Agarwal,
Fernando Diaz,
Akshay Krishnamurthy,
Robert E. Schapire
Abstract:
High-dimensional observations and complex real-world dynamics present major challenges in reinforcement learning for both function approximation and exploration. We address both of these challenges with two complementary techniques: First, we develop a gradient-boosting style, non-parametric function approximator for learning on $Q$-function residuals. And second, we propose an exploration strategy inspired by the principles of state abstraction and information acquisition under uncertainty. We demonstrate the empirical effectiveness of these techniques, first, as a preliminary check, on two standard tasks (Blackjack and $n$-Chain), and then on two much larger and more realistic tasks with high-dimensional observation spaces. Specifically, we introduce two benchmarks built within the game Minecraft where the observations are pixel arrays of the agent's visual field. A combination of our two algorithmic techniques performs competitively on the standard reinforcement-learning tasks while consistently and substantially outperforming baselines on the two tasks with high-dimensional observation spaces. The new function approximator, exploration strategy, and evaluation benchmarks are each of independent interest in the pursuit of reinforcement-learning methods that scale to real-world domains.
Submitted 13 March, 2016;
originally announced March 2016.
-
Development of an Android Application for an Electronic Medical Record System in an Outpatient Environment for Healthcare in Fiji
Authors:
Daryl Abel,
Bulou Gavidi,
Nicholas Rollings,
Rohitash Chandra
Abstract:
The outpatients department in a developing country is typically understaffed and inadequately equipped to handle the large numbers of patients filing through on an average day. The use of electronic medical record (EMR) systems can resolve some of the longstanding medical inefficiencies common in developing countries. This paper presents the design and implementation of a proposed outpatient management system that enables efficient management of a patient's medical details. We present a system to create appointments with medical practitioners by integrating a proposed Android-based mobile application with a selected open source EMR system. The application allows both the patient and the medical practitioners to manage appointments and make use of the electronic messaging facility to send reminders in real time when the appointed time is approaching. A mobile application prototype is developed and the road map for implementation is also discussed.
Submitted 2 March, 2015;
originally announced March 2015.