-
Explore-Go: Leveraging Exploration for Generalisation in Deep Reinforcement Learning
Authors:
Max Weltevrede,
Felix Kaubek,
Matthijs T. J. Spaan,
Wendelin Böhmer
Abstract:
One of the remaining challenges in reinforcement learning is to develop agents that can generalise to novel scenarios they might encounter once deployed. This challenge is often framed in a multi-task setting where agents train on a fixed set of tasks and have to generalise to new tasks. Recent work has shown that in this setting increased exploration during training can be leveraged to increase the generalisation performance of the agent. This makes sense when the states encountered during testing can actually be explored during training. In this paper, we provide intuition why exploration can also benefit generalisation to states that cannot be explicitly encountered during training. Additionally, we propose a novel method Explore-Go that exploits this intuition by increasing the number of states on which the agent trains. Explore-Go effectively increases the starting state distribution of the agent and as a result can be used in conjunction with most existing on-policy or off-policy reinforcement learning algorithms. We show empirically that our method can increase generalisation performance in an illustrative environment and on the Procgen benchmark.
Submitted 12 June, 2024;
originally announced June 2024.
-
Value Improved Actor Critic Algorithms
Authors:
Yaniv Oren,
Moritz A. Zanger,
Pascal R. van der Vaart,
Matthijs T. J. Spaan,
Wendelin Bohmer
Abstract:
Many modern reinforcement learning algorithms build on the actor-critic (AC) framework: iterative improvement of a policy (the actor) using policy improvement operators and iterative approximation of the policy's value (the critic). In contrast, the popular value-based algorithm family employs improvement operators in the value update, to iteratively improve the value function directly. In this work, we propose a general extension to the AC framework that employs two separate improvement operators: one applied to the policy in the spirit of policy-based algorithms and one applied to the value in the spirit of value-based algorithms, which we dub Value-Improved AC (VI-AC). We design two practical VI-AC algorithms based on the popular online off-policy AC algorithms TD3 and DDPG. We evaluate VI-TD3 and VI-DDPG on the MuJoCo benchmark and find that both improve upon or match the performance of their respective baselines in all environments tested.
Submitted 3 June, 2024;
originally announced June 2024.
-
A Penalty-Based Guardrail Algorithm for Non-Decreasing Optimization with Inequality Constraints
Authors:
Ksenija Stepanovic,
Wendelin Böhmer,
Mathijs de Weerdt
Abstract:
Traditional mathematical programming solvers require long computational times to solve constrained minimization problems of complex and large-scale physical systems. Therefore, these problems are often transformed into unconstrained ones, and solved with computationally efficient optimization approaches based on first-order information, such as the gradient descent method. However, for unconstrained problems, balancing the minimization of the objective function with the reduction of constraint violations is challenging. We consider the class of time-dependent minimization problems with increasing (possibly) nonlinear and non-convex objective function and non-decreasing (possibly) nonlinear and non-convex inequality constraints. To efficiently solve them, we propose a penalty-based guardrail algorithm (PGA). This algorithm adapts a standard penalty-based method by dynamically updating the right-hand side of the constraints with a guardrail variable which adds a margin to prevent violations. We evaluate PGA on two novel application domains: a simplified model of a district heating system and an optimization model derived from learned deep neural networks. Our method significantly outperforms mathematical programming solvers and the standard penalty-based method, and achieves better performance and faster convergence than a state-of-the-art algorithm (IPDD) within a specified time limit.
Submitted 3 May, 2024;
originally announced May 2024.
-
To the Max: Reinventing Reward in Reinforcement Learning
Authors:
Grigorii Veviurko,
Wendelin Böhmer,
Mathijs de Weerdt
Abstract:
In reinforcement learning (RL), different rewards can define the same optimal policy but result in drastically different learning performance. For some, the agent gets stuck with a suboptimal behavior, and for others, it solves the task efficiently. Choosing a good reward function is hence an extremely important yet challenging problem. In this paper, we explore an alternative approach to using rewards for learning. We introduce max-reward RL, where an agent optimizes the maximum rather than the cumulative reward. Unlike earlier works, our approach works for deterministic and stochastic environments and can be easily combined with state-of-the-art RL algorithms. In the experiments, we study the performance of max-reward RL algorithms in two goal-reaching environments from Gymnasium-Robotics and demonstrate its benefits over standard RL. The code is publicly available.
Submitted 2 February, 2024;
originally announced February 2024.
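The difference between the cumulative objective and the max-reward objective described in the abstract above can be illustrated on a single reward sequence. This is a minimal sketch, not the authors' code: the function names are hypothetical, and the discounted form `max_t gamma^t * r_t` is an assumption made for illustration, since the abstract does not state the exact definition.

```python
def cumulative_return(rewards, gamma=0.99):
    """Standard discounted return: sum over t of gamma^t * r_t."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def max_reward_return(rewards, gamma=0.99):
    """Max-reward objective (assumed form): max over t of gamma^t * r_t."""
    return max((gamma ** t) * r for t, r in enumerate(rewards))


if __name__ == "__main__":
    # Two trajectories that collect the same total reward in different ways.
    spread_out = [1.0, 1.0, 1.0]   # small reward at every step
    single_peak = [0.0, 0.0, 3.0]  # one large reward at the end
    for name, traj in [("spread_out", spread_out), ("single_peak", single_peak)]:
        print(name,
              cumulative_return(traj, gamma=1.0),
              max_reward_return(traj, gamma=1.0))
```

With `gamma=1.0` both trajectories have the same cumulative return, while the max-reward objective prefers the trajectory that reaches the single large reward, which is the behaviour one would want in a goal-reaching task.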
-
Lights out: training RL agents robust to temporary blindness
Authors:
N. Ordonez,
M. Tromp,
P. M. Julbe,
W. Böhmer
Abstract:
Agents trained with DQN rely on an observation at each timestep to decide what action to take next. However, in real-world applications observations can change or be missing entirely. Examples of this could be a light bulb breaking down, or the wallpaper in a certain room changing. While these situations change the actual observation, the underlying optimal policy does not change. Because of this, we want our agent to continue taking actions until it receives a (recognized) observation again. To achieve this, we introduce a combination of a neural network architecture that uses hidden representations of the observations and a novel n-step loss function. Our implementation is able to withstand location-based blindness stretches longer than the ones it was trained on, and therefore shows robustness to temporary blindness. For access to our implementation, please email Nathan, Marije, or Pau.
Submitted 5 December, 2023;
originally announced December 2023.
-
Multi-Robot Local Motion Planning Using Dynamic Optimization Fabrics
Authors:
Saray Bakker,
Luzia Knoedler,
Max Spahn,
Wendelin Böhmer,
Javier Alonso-Mora
Abstract:
In this paper, we address the problem of real-time motion planning for multiple robotic manipulators that operate in close proximity. We build upon the concept of dynamic fabrics and extend them to multi-robot systems, referred to as Multi-Robot Dynamic Fabrics (MRDF). This geometric method enables a very high planning frequency for high-dimensional systems at the expense of being reactive and prone to deadlocks. To detect and resolve deadlocks, we propose Rollout Fabrics where MRDF are forward simulated in a decentralized manner. We validate the methods in simulated close-proximity pick-and-place scenarios with multiple manipulators, showing high success rates and real-time performance.
Submitted 19 October, 2023;
originally announced October 2023.
-
You Shall Pass: Dealing with the Zero-Gradient Problem in Predict and Optimize for Convex Optimization
Authors:
Grigorii Veviurko,
Wendelin Böhmer,
Mathijs de Weerdt
Abstract:
Predict and optimize is an increasingly popular decision-making paradigm that employs machine learning to predict unknown parameters of optimization problems. Instead of minimizing the prediction error of the parameters, it trains predictive models using task performance as a loss function. The key challenge to train such models is the computation of the Jacobian of the solution of the optimization problem with respect to its parameters. For linear problems, this Jacobian is known to be zero or undefined; hence, approximations are usually employed. For non-linear convex problems, however, it is common to use the exact Jacobian. This paper demonstrates that the zero-gradient problem appears in the non-linear case as well -- the Jacobian can have a sizeable null space, thereby causing the training process to get stuck in suboptimal points. Through formal proofs, this paper shows that smoothing the feasible set resolves this problem. Combining this insight with known techniques from the literature, such as quadratic programming approximation and projection distance regularization, a novel method to approximate the Jacobian is derived. In simulation experiments, the proposed method increases the performance in the non-linear case and at least matches the existing state-of-the-art methods for linear problems.
Submitted 2 February, 2024; v1 submitted 30 July, 2023;
originally announced July 2023.
-
Diverse Projection Ensembles for Distributional Reinforcement Learning
Authors:
Moritz A. Zanger,
Wendelin Böhmer,
Matthijs T. J. Spaan
Abstract:
In contrast to classical reinforcement learning, distributional reinforcement learning algorithms aim to learn the distribution of returns rather than their expected value. Since the nature of the return distribution is generally unknown a priori or arbitrarily complex, a common approach finds approximations within a set of representable, parametric distributions. Typically, this involves a projection of the unconstrained distribution onto the set of simplified distributions. We argue that this projection step entails a strong inductive bias when coupled with neural networks and gradient descent, thereby profoundly impacting the generalization behavior of learned models. In order to facilitate reliable uncertainty estimation through diversity, this work studies the combination of several different projections and representations in a distributional ensemble. We establish theoretical properties of such projection ensembles and derive an algorithm that uses ensemble disagreement, measured by the average $1$-Wasserstein distance, as a bonus for deep exploration. We evaluate our algorithm on the behavior suite benchmark and find that diverse projection ensembles lead to significant performance improvements over existing methods on a wide variety of tasks with the most pronounced gains in directed exploration problems.
Submitted 12 June, 2023;
originally announced June 2023.
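The ensemble-disagreement bonus mentioned in the abstract above can be sketched for the simple case where each ensemble member represents its return distribution by the same number of equally weighted samples. In that case the 1-Wasserstein distance reduces to the mean absolute difference between sorted samples. The function names are hypothetical, and averaging over all member pairs is an assumption; the abstract does not specify how the average is taken.

```python
import itertools


def wasserstein1(xs, ys):
    """1-Wasserstein distance between two equal-size, equally weighted sample sets."""
    if len(xs) != len(ys):
        raise ValueError("sample sets must have the same size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def ensemble_disagreement(members):
    """Average pairwise 1-Wasserstein distance across ensemble members (assumed form)."""
    pairs = list(itertools.combinations(members, 2))
    return sum(wasserstein1(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Three members predicting return samples for the same state-action pair.
    members = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
    print(ensemble_disagreement(members))
```

A larger value indicates that the members disagree more about the return distribution, which is the quantity an exploration bonus of this kind would reward.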
-
The Role of Diverse Replay for Generalisation in Reinforcement Learning
Authors:
Max Weltevrede,
Matthijs T. J. Spaan,
Wendelin Böhmer
Abstract:
In reinforcement learning (RL), key components of many algorithms are the exploration strategy and replay buffer. These strategies regulate what environment data is collected and trained on and have been extensively studied in the RL literature. In this paper, we investigate the impact of these components in the context of generalisation in multi-task RL. We investigate the hypothesis that collecting and training on more diverse data from the training environments will improve zero-shot generalisation to new tasks. We motivate mathematically and show empirically that generalisation to tasks that are "reachable" during training is improved by increasing the diversity of transitions in the replay buffer. Furthermore, we show empirically that this same strategy also shows improvement for generalisation to similar but "unreachable" tasks, which could be due to improved generalisation of the learned latent representations.
Submitted 31 August, 2023; v1 submitted 9 June, 2023;
originally announced June 2023.
-
Active Classification of Moving Targets with Learned Control Policies
Authors:
Álvaro Serra-Gómez,
Eduardo Montijano,
Wendelin Böhmer,
Javier Alonso-Mora
Abstract:
In this paper, we consider the problem where a drone has to collect semantic information to classify multiple moving targets. In particular, we address the challenge of computing control inputs that move the drone to informative viewpoints, position and orientation, when the information is extracted using a "black-box" classifier, e.g., a deep learning neural network. These algorithms typically lack analytical relationships between the viewpoints and their associated outputs, preventing their use in information-gathering schemes. To fill this gap, we propose a novel attention-based architecture, trained via Reinforcement Learning (RL), that outputs the next viewpoint for the drone favoring the acquisition of evidence from as many unclassified targets as possible while reasoning about their movement, orientation, and occlusions. Then, we use a low-level MPC controller to move the drone to the desired viewpoint taking into account its actual dynamics. We show that our approach not only outperforms a variety of baselines but also generalizes to scenarios unseen during training. Additionally, we show that the network scales to large numbers of targets and generalizes well to different movement dynamics of the targets.
Submitted 27 September, 2023; v1 submitted 6 December, 2022;
originally announced December 2022.
-
E-MCTS: Deep Exploration in Model-Based Reinforcement Learning by Planning with Epistemic Uncertainty
Authors:
Yaniv Oren,
Matthijs T. J. Spaan,
Wendelin Böhmer
Abstract:
One of the most well-studied and highly performing planning approaches used in Model-Based Reinforcement Learning (MBRL) is Monte-Carlo Tree Search (MCTS). Key challenges of MCTS-based MBRL methods remain dedicated deep exploration and reliability in the face of the unknown, and both challenges can be alleviated through principled epistemic uncertainty estimation in the predictions of MCTS. We present two main contributions: First, we develop methodology to propagate epistemic uncertainty in MCTS, enabling agents to estimate the epistemic uncertainty in their predictions. Second, we utilize the propagated uncertainty for a novel deep exploration algorithm by explicitly planning to explore. We incorporate our approach into variations of MCTS-based MBRL approaches with learned and provided dynamics models, and empirically show deep exploration through successful epistemic uncertainty estimation achieved by our approach. We compare to a non-planning-based deep-exploration baseline, and demonstrate that planning with epistemic MCTS significantly outperforms non-planning based exploration in the investigated deep exploration benchmark.
Submitted 30 August, 2023; v1 submitted 21 October, 2022;
originally announced October 2022.
-
UneVEn: Universal Value Exploration for Multi-Agent Reinforcement Learning
Authors:
Tarun Gupta,
Anuj Mahajan,
Bei Peng,
Wendelin Böhmer,
Shimon Whiteson
Abstract:
VDN and QMIX are two popular value-based algorithms for cooperative MARL that learn a centralized action value function as a monotonic mixing of per-agent utilities. While this enables easy decentralization of the learned policy, the restricted joint action value function can prevent them from solving tasks that require significant coordination between agents at a given timestep. We show that this problem can be overcome by improving the joint exploration of all agents during training. Specifically, we propose a novel MARL approach called Universal Value Exploration (UneVEn) that learns a set of related tasks simultaneously with a linear decomposition of universal successor features. With the policies of already solved related tasks, the joint exploration process of all agents can be improved to help them achieve better coordination. Empirical results on a set of exploration games, challenging cooperative predator-prey tasks requiring significant coordination among agents, and StarCraft II micromanagement benchmarks show that UneVEn can solve tasks where other state-of-the-art MARL methods fail.
Submitted 10 June, 2021; v1 submitted 6 October, 2020;
originally announced October 2020.
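The "easy decentralization" that the abstract above attributes to monotonic mixing can be seen in the simplest case, VDN, where the joint action value is the sum of per-agent utilities. The sketch below is illustrative only and is not taken from the paper; the function names and the toy utility tables are hypothetical. It checks that each agent greedily maximising its own utility recovers the joint greedy action.

```python
import itertools


def vdn_joint_q(utilities, joint_action):
    """VDN-style joint value: sum of per-agent utilities for the chosen actions."""
    return sum(u[a] for u, a in zip(utilities, joint_action))


def decentralised_greedy(utilities):
    """Each agent independently picks the action with its highest utility."""
    return tuple(max(range(len(u)), key=u.__getitem__) for u in utilities)


def centralised_greedy(utilities):
    """Brute-force search over all joint actions for the highest joint value."""
    action_ranges = [range(len(u)) for u in utilities]
    return max(itertools.product(*action_ranges),
               key=lambda ja: vdn_joint_q(utilities, ja))


if __name__ == "__main__":
    # Toy utilities: agent 0 has two actions, agent 1 has three.
    utilities = [[0.1, 0.9], [0.5, 0.2, 0.3]]
    print(decentralised_greedy(utilities), centralised_greedy(utilities))
```

The same property is what restricts the representable joint value functions: a payoff that is high only when both agents deviate together cannot be written as a sum of independent utilities, which is the coordination limitation the abstract refers to.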
-
My Body is a Cage: the Role of Morphology in Graph-Based Incompatible Control
Authors:
Vitaly Kurin,
Maximilian Igl,
Tim Rocktäschel,
Wendelin Boehmer,
Shimon Whiteson
Abstract:
Multitask Reinforcement Learning is a promising way to obtain models with better performance, generalisation, data efficiency, and robustness. Most existing work is limited to compatible settings, where the state and action space dimensions are the same across tasks. Graph Neural Networks (GNN) are one way to address incompatible environments, because they can process graphs of arbitrary size. They also allow practitioners to inject biases encoded in the structure of the input graph. Existing work in graph-based continuous control uses the physical morphology of the agent to construct the input graph, i.e., encoding limb features as node labels and using edges to connect the nodes if their corresponding limbs are physically connected. In this work, we present a series of ablations on existing methods that show that morphological information encoded in the graph does not improve their performance. Motivated by the hypothesis that any benefits GNNs extract from the graph structure are outweighed by difficulties they create for message passing, we also propose Amorpheus, a transformer-based approach. Further results show that, while Amorpheus ignores the morphological information that GNNs encode, it nonetheless substantially outperforms GNN-based methods that use the morphological information to define the message-passing scheme.
Submitted 14 April, 2021; v1 submitted 5 October, 2020;
originally announced October 2020.
-
Transient Non-Stationarity and Generalisation in Deep Reinforcement Learning
Authors:
Maximilian Igl,
Gregory Farquhar,
Jelena Luketina,
Wendelin Boehmer,
Shimon Whiteson
Abstract:
Non-stationarity can arise in Reinforcement Learning (RL) even in stationary environments. For example, most RL algorithms collect new data throughout training, using a non-stationary behaviour policy. Due to the transience of this non-stationarity, it is often not explicitly addressed in deep RL and a single neural network is continually updated. However, we find evidence that neural networks exhibit a memory effect where these transient non-stationarities can permanently impact the latent representation and adversely affect generalisation performance. Consequently, to improve generalisation of deep RL agents, we propose Iterated Relearning (ITER). ITER augments standard RL training by repeated knowledge transfer of the current policy into a freshly initialised network, which thereby experiences less non-stationarity during training. Experimentally, we show that ITER improves performance on the challenging generalisation benchmarks ProcGen and Multiroom.
Submitted 22 September, 2021; v1 submitted 10 June, 2020;
originally announced June 2020.
-
Randomized Entity-wise Factorization for Multi-Agent Reinforcement Learning
Authors:
Shariq Iqbal,
Christian A. Schroeder de Witt,
Bei Peng,
Wendelin Böhmer,
Shimon Whiteson,
Fei Sha
Abstract:
Multi-agent settings in the real world often involve tasks with varying types and quantities of agents and non-agent entities; however, common patterns of behavior often emerge among these agents/entities. Our method aims to leverage these commonalities by asking the question: "What is the expected utility of each agent when only considering a randomly selected sub-group of its observed entities?" By posing this counterfactual question, we can recognize state-action trajectories within sub-groups of entities that we may have encountered in another task and use what we learned in that task to inform our prediction in the current one. We then reconstruct a prediction of the full returns as a combination of factors considering these disjoint groups of entities and train this "randomly factorized" value function as an auxiliary objective for value-based multi-agent reinforcement learning. By doing so, our model can recognize and leverage similarities across tasks to improve learning efficiency in a multi-task setting. Our approach, Randomized Entity-wise Factorization for Imagined Learning (REFIL), outperforms all strong baselines by a significant margin in challenging multi-task StarCraft micromanagement settings.
Submitted 11 June, 2021; v1 submitted 7 June, 2020;
originally announced June 2020.
-
Privileged Information Dropout in Reinforcement Learning
Authors:
Pierre-Alexandre Kamienny,
Kai Arulkumaran,
Feryal Behbahani,
Wendelin Boehmer,
Shimon Whiteson
Abstract:
Using privileged information during training can improve the sample efficiency and performance of machine learning systems. This paradigm has been applied to reinforcement learning (RL), primarily in the form of distillation or auxiliary tasks, and less commonly in the form of augmenting the inputs of agents. In this work, we investigate Privileged Information Dropout (PI-Dropout) for achieving the latter, which can be applied equally to value-based and policy-based RL algorithms. Within a simple partially-observed environment, we demonstrate that PI-Dropout outperforms alternatives for leveraging privileged information, including distillation and auxiliary tasks, and can successfully utilise different types of privileged information. Finally, we analyse its effect on the learned representations.
Submitted 19 May, 2020;
originally announced May 2020.
-
FACMAC: Factored Multi-Agent Centralised Policy Gradients
Authors:
Bei Peng,
Tabish Rashid,
Christian A. Schroeder de Witt,
Pierre-Alexandre Kamienny,
Philip H. S. Torr,
Wendelin Böhmer,
Shimon Whiteson
Abstract:
We propose FACtored Multi-Agent Centralised policy gradients (FACMAC), a new method for cooperative multi-agent reinforcement learning in both discrete and continuous action spaces. Like MADDPG, a popular multi-agent actor-critic method, our approach uses deep deterministic policy gradients to learn policies. However, FACMAC learns a centralised but factored critic, which combines per-agent utilities into the joint action-value function via a non-linear monotonic function, as in QMIX, a popular multi-agent Q-learning algorithm. However, unlike QMIX, there are no inherent constraints on factoring the critic. We thus also employ a nonmonotonic factorisation and empirically demonstrate that its increased representational capacity allows it to solve some tasks that cannot be solved with monolithic, or monotonically factored critics. In addition, FACMAC uses a centralised policy gradient estimator that optimises over the entire joint action space, rather than optimising over each agent's action space separately as in MADDPG. This allows for more coordinated policy changes and fully reaps the benefits of a centralised critic. We evaluate FACMAC on variants of the multi-agent particle environments, a novel multi-agent MuJoCo benchmark, and a challenging set of StarCraft II micromanagement tasks. Empirical results demonstrate FACMAC's superior performance over MADDPG and other baselines on all three domains.
Submitted 7 May, 2021; v1 submitted 14 March, 2020;
originally announced March 2020.
-
Optimistic Exploration even with a Pessimistic Initialisation
Authors:
Tabish Rashid,
Bei Peng,
Wendelin Böhmer,
Shimon Whiteson
Abstract:
Optimistic initialisation is an effective strategy for efficient exploration in reinforcement learning (RL). In the tabular case, all provably efficient model-free algorithms rely on it. However, model-free deep RL algorithms do not use optimistic initialisation despite taking inspiration from these provably efficient tabular algorithms. In particular, in scenarios with only positive rewards, Q-values are initialised at their lowest possible values due to commonly used network initialisation schemes, a pessimistic initialisation. Merely initialising the network to output optimistic Q-values is not enough, since we cannot ensure that they remain optimistic for novel state-action pairs, which is crucial for exploration. We propose a simple count-based augmentation to pessimistically initialised Q-values that separates the source of optimism from the neural network. We show that this scheme is provably efficient in the tabular setting and extend it to the deep RL setting. Our algorithm, Optimistic Pessimistically Initialised Q-Learning (OPIQ), augments the Q-value estimates of a DQN-based agent with count-derived bonuses to ensure optimism during both action selection and bootstrapping. We show that OPIQ outperforms non-optimistic DQN variants that utilise a pseudocount-based intrinsic motivation in hard exploration tasks, and that it predicts optimistic estimates for novel state-action pairs.
Submitted 26 February, 2020;
originally announced February 2020.
-
Multi-agent Hierarchical Reinforcement Learning with Dynamic Termination
Authors:
Dongge Han,
Wendelin Boehmer,
Michael Wooldridge,
Alex Rogers
Abstract:
In a multi-agent system, an agent's optimal policy will typically depend on the policies chosen by others. Therefore, a key issue in multi-agent systems research is that of predicting the behaviours of others, and responding promptly to changes in such behaviours. One obvious possibility is for each agent to broadcast their current intention, for example, the currently executed option in a hierarchical reinforcement learning framework. However, this approach results in inflexibility of agents if options have an extended duration and are dynamic. While adjusting the executed option at each step improves flexibility from a single-agent perspective, frequent changes in options can induce inconsistency between an agent's actual behaviour and its broadcast intention. In order to balance flexibility and predictability, we propose a dynamic termination Bellman equation that allows the agents to flexibly terminate their options. We evaluate our model empirically on a set of multi-agent pursuit and taxi tasks, and show that our agents learn to adapt flexibly across scenarios that require different termination behaviours.
Submitted 21 October, 2019;
originally announced October 2019.
-
Deep Coordination Graphs
Authors:
Wendelin Böhmer,
Vitaly Kurin,
Shimon Whiteson
Abstract:
This paper introduces the deep coordination graph (DCG) for collaborative multi-agent reinforcement learning. DCG strikes a flexible trade-off between representational capacity and generalization by factoring the joint value function of all agents according to a coordination graph into payoffs between pairs of agents. The value can be maximized by local message passing along the graph, which allows training of the value function end-to-end with Q-learning. Payoff functions are approximated with deep neural networks that employ parameter sharing and low-rank approximations to significantly improve sample efficiency. We show that DCG can solve predator-prey tasks that highlight the relative overgeneralization pathology, as well as challenging StarCraft II micromanagement tasks.
Submitted 23 June, 2020; v1 submitted 27 September, 2019;
originally announced October 2019.
-
Exploration with Unreliable Intrinsic Reward in Multi-Agent Reinforcement Learning
Authors:
Wendelin Böhmer,
Tabish Rashid,
Shimon Whiteson
Abstract:
This paper investigates the use of intrinsic reward to guide exploration in multi-agent reinforcement learning. We discuss the challenges in applying intrinsic reward to multiple collaborative agents and demonstrate how unreliable reward can prevent decentralized agents from learning the optimal policy. We address this problem with a novel framework, Independent Centrally-assisted Q-learning (ICQL), in which decentralized agents share control and an experience replay buffer with a centralized agent. Only the centralized agent is intrinsically rewarded, but the decentralized agents still benefit from improved exploration, without the distraction of unreliable incentives.
Submitted 5 June, 2019;
originally announced June 2019.
-
Deep Residual Reinforcement Learning
Authors:
Shangtong Zhang,
Wendelin Boehmer,
Shimon Whiteson
Abstract:
We revisit residual algorithms in both model-free and model-based reinforcement learning settings. We propose the bidirectional target network technique to stabilize residual algorithms, yielding a residual version of DDPG that significantly outperforms vanilla DDPG in the DeepMind Control Suite benchmark. Moreover, we find the residual algorithm an effective approach to the distribution mismatch problem in model-based planning. Compared with the existing TD($k$) method, our residual-based method makes weaker assumptions about the model and yields a greater performance boost.
Submitted 23 January, 2020; v1 submitted 3 May, 2019;
originally announced May 2019.
-
Multitask Soft Option Learning
Authors:
Maximilian Igl,
Andrew Gambardella,
Jinke He,
Nantas Nardelli,
N. Siddharth,
Wendelin Böhmer,
Shimon Whiteson
Abstract:
We present Multitask Soft Option Learning (MSOL), a hierarchical multitask framework based on Planning as Inference. MSOL extends the concept of options, using separate variational posteriors for each task, regularized by a shared prior. This "soft" version of options avoids several instabilities during training in a multitask setting, and provides a natural way to learn both intra-option policies and their terminations. Furthermore, it allows fine-tuning of options for new tasks without forgetting their learned policies, leading to faster training without reducing the expressiveness of the hierarchical policy. We demonstrate empirically that MSOL significantly outperforms both hierarchical and flat transfer-learning baselines.
Submitted 21 June, 2020; v1 submitted 1 April, 2019;
originally announced April 2019.
-
Generalized Off-Policy Actor-Critic
Authors:
Shangtong Zhang,
Wendelin Boehmer,
Shimon Whiteson
Abstract:
We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting. Compared to the commonly used excursion objective, which can be misleading about the performance of the target policy when deployed, our new objective better predicts such performance. We prove the Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient of the counterfactual objective and use an emphatic approach to get an unbiased sample from this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over existing algorithms in Mujoco robot simulation tasks, the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.
Submitted 28 October, 2019; v1 submitted 27 March, 2019;
originally announced March 2019.
-
Multi-Agent Common Knowledge Reinforcement Learning
Authors:
Christian A. Schroeder de Witt,
Jakob N. Foerster,
Gregory Farquhar,
Philip H. S. Torr,
Wendelin Boehmer,
Shimon Whiteson
Abstract:
Cooperative multi-agent reinforcement learning often requires decentralised policies, which severely limit the agents' ability to coordinate their behaviour. In this paper, we show that common knowledge between agents allows for complex decentralised coordination. Common knowledge arises naturally in a large number of decentralised cooperative multi-agent tasks, for example, when agents can reconstruct parts of each others' observations. Since agents can independently agree on their common knowledge, they can execute complex coordinated policies that condition on this knowledge in a fully decentralised fashion. We propose multi-agent common knowledge reinforcement learning (MACKRL), a novel stochastic actor-critic algorithm that learns a hierarchical policy tree. Higher levels in the hierarchy coordinate groups of agents by conditioning on their common knowledge, or delegate to lower levels with smaller subgroups but potentially richer common knowledge. The entire policy tree can be executed in a fully decentralised fashion. As the lowest policy tree level consists of independent policies for each agent, MACKRL reduces to independently learnt decentralised policies as a special case. We demonstrate that our method can exploit common knowledge for superior performance on complex decentralised coordination tasks, including a stochastic matrix game and challenging problems in StarCraft II unit micromanagement.
Submitted 11 January, 2020; v1 submitted 27 October, 2018;
originally announced October 2018.
-
Non-Deterministic Policy Improvement Stabilizes Approximated Reinforcement Learning
Authors:
Wendelin Böhmer,
Rong Guo,
Klaus Obermayer
Abstract:
This paper investigates a type of instability that is linked to the greedy policy improvement in approximated reinforcement learning. We show empirically that non-deterministic policy improvement can stabilize methods like LSPI by controlling the improvements' stochasticity. Additionally we show that a suitable representation of the value function also stabilizes the solution to some degree. The presented approach is simple and should also be easily transferable to more sophisticated algorithms like deep reinforcement learning.
Submitted 22 December, 2016;
originally announced December 2016.
-
Theoretical neutron-capture cross sections for r-process nucleosynthesis in the $^{48}$Ca region
Authors:
T. Rauscher,
W. Böhmer,
K. -L. Kratz,
W. Balogh,
H. Oberhummer
Abstract:
We calculate neutron capture cross sections for r-process nucleosynthesis in the $^{48}$Ca region, namely for the isotopes $^{40-44}$S, $^{46-50}$Ar, $^{56-66}$Ti, $^{62-68}$Cr, and $^{72-76}$Fe. While previously only cross sections resulting from the compound nucleus reaction mechanism (Hauser-Feshbach) have been considered, we recalculate not only that contribution to the cross section but also include direct capture on even-even nuclei. The level schemes, which are of utmost importance in the direct capture calculations, are taken from quasi-particle states obtained with a folded-Yukawa potential and Lipkin-Nogami pairing. The most recent deformation values derived from experimental data on $\beta$-decay half-lives are used where available. Due to the consideration of direct capture, the capture rates are enhanced and the "turning points" in the r-process path are shifted to slightly higher mass numbers. We also discuss the sensitivity of the direct capture cross sections to the assumed deformation.
Submitted 17 April, 2015;
originally announced April 2015.
-
Regression with Linear Factored Functions
Authors:
Wendelin Böhmer,
Klaus Obermayer
Abstract:
Many applications that use empirically estimated functions face a curse of dimensionality, because the integrals over most function classes must be approximated by sampling. This paper introduces a novel regression algorithm that learns linear factored functions (LFF). This class of functions has structural properties that allow certain integrals to be solved analytically and point-wise products to be calculated. Applications like belief propagation and reinforcement learning can exploit these properties to break the curse and speed up computation. We derive a regularized greedy optimization scheme that learns factored basis functions during training. The novel regression algorithm performs competitively with Gaussian processes on benchmark tasks, and the learned LFF are very compact, with 4-9 factored basis functions on average.
Submitted 30 March, 2015; v1 submitted 19 December, 2014;
originally announced December 2014.
-
Robot Navigation using Reinforcement Learning and Slow Feature Analysis
Authors:
Wendelin Böhmer
Abstract:
The application of reinforcement learning algorithms to real-life problems always bears the challenge of filtering the environmental state out of raw sensor readings. While most approaches use heuristics, biology suggests that there must exist an unsupervised method to construct such filters automatically. Besides the extraction of environmental states, the filters have to represent them in a fashion that supports modern reinforcement learning algorithms. Many popular algorithms use a linear architecture, so one should aim at filters that have good approximation properties in combination with linear functions. This thesis proposes the unsupervised method slow feature analysis (SFA) for this task. Presented with a random sequence of sensor readings, SFA learns a set of filters. With growing model complexity and training examples, the filters converge to trigonometric polynomial functions. These are known to possess excellent approximation capabilities and should therefore support the reinforcement learning algorithms well. We evaluate this claim on a robot. The task is to learn a navigational control in a simple environment using the least-squares policy iteration (LSPI) algorithm. The only accessible sensor is a head-mounted video camera, but without meaningful filtering, video images are not suited as LSPI input. We show that filters learned by SFA, based on a random-walk video of the robot, allow the learned control to navigate successfully in approximately 80% of the test trials.
Submitted 4 May, 2012;
originally announced May 2012.
-
On the origin of the Ca-Ti-Cr isotopic anomalies in the inclusion EK-1-4-1 of the Allende Meteorite
Authors:
K. -L. Kratz,
W. Boehmer,
C. Freiburghaus,
P. Moeller,
B. Pfeiffer,
T. Rauscher,
F. -K. Thielemann
Abstract:
In the framework of our investigation to explain the nucleosynthesis origin of the correlated Ca-Ti-Cr isotopic anomalies in the Ca-Al-rich "FUN" inclusion EK-1-4-1 of the Allende meteorite, the nuclear-physics basis in the neutron-rich N=28 region has been updated by including recent experimental data on beta-decay properties and microscopic predictions of neutron-capture cross sections. Charged-particle and subsequent r-process calculations within an entropy-based approach were performed using a complete reaction network. It is shown that there exist two astrophysical scenarios within which the observed isotopic anomalies can be reproduced simultaneously: one at low entropies (about 10), which confirms the earlier suggested SN Ia mechanism, and another at high entropies (about 150), which could be compatible with the neutrino-wind scenario of a SN II.
Submitted 11 December, 2000;
originally announced December 2000.
-
Decay of neutron-rich Mn nuclides and deformation of heavy Fe isotopes
Authors:
M. Hannawald,
T. Kautzsch,
A. Woehr,
W. B. Walters,
K. -L. Kratz,
V. N. Fedoseyev,
V. L. Mishin,
W. Boehmer,
B. Pfeiffer,
V. Sebastian,
Y. Jading,
U. Koester,
J. Lettry,
H. L. Ravn,
the ISOLDE Collaboration
Abstract:
The use of chemically selective laser ionization combined with beta-delayed neutron counting at CERN/ISOLDE has permitted identification and half-life measurements for 623-ms Mn-61 up through 14-ms Mn-69. The measured half-lives are found to be significantly longer near N=40 than the values calculated with a QRPA shell model using ground-state deformations from the FRDM and ETFSI models. Gamma-ray singles and coincidence spectroscopy has been performed for Mn-64 and Mn-66 decays to levels of Fe-64 and Fe-66, revealing a significant drop in the energy of the first 2+ state in these nuclides that suggests an unanticipated increase in collectivity near N=40.
Submitted 21 December, 1998;
originally announced December 1998.