Explore-Go: Leveraging Exploration for Generalisation in Deep Reinforcement Learning

Authors: Max Weltevrede, Felix Kaubek, Matthijs T. J. Spaan, Wendelin Böhmer

Abstract: One of the remaining challenges in reinforcement learning is to develop agents that can generalise to novel scenarios they might encounter once deployed. This challenge is often framed in a multi-task setting where agents train on a fixed set of tasks and have to generalise to new tasks. Recent work has shown that in this setting increased exploration during training can be leveraged to increase the generalisation performance of the agent. This makes sense when the states encountered during testing can actually be explored during training. In this paper, we provide intuition why exploration can also benefit generalisation to states that cannot be explicitly encountered during training. Additionally, we propose a novel method Explore-Go that exploits this intuition by increasing the number of states on which the agent trains. Explore-Go effectively increases the starting state distribution of the agent and as a result can be used in conjunction with most existing on-policy or off-policy reinforcement learning algorithms. We show empirically that our method can increase generalisation performance in an illustrative environment and on the Procgen benchmark.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.01423

Value Improved Actor Critic Algorithms

Authors: Yaniv Oren, Moritz A. Zanger, Pascal R. van der Vaart, Matthijs T. J. Spaan, Wendelin Bohmer

Abstract: Many modern reinforcement learning algorithms build on the actor-critic (AC) framework: iterative improvement of a policy (the actor) using policy improvement operators and iterative approximation of the policy's value (the critic). In contrast, the popular value-based algorithm family employs improvement operators in the value update, to iteratively improve the value function directly. In this work, we propose a general extension to the AC framework that employs two separate improvement operators: one applied to the policy in the spirit of policy-based algorithms and one applied to the value in the spirit of value-based algorithms, which we dub Value-Improved AC (VI-AC). We design two practical VI-AC algorithms based on the popular online off-policy AC algorithms TD3 and DDPG. We evaluate VI-TD3 and VI-DDPG in the Mujoco benchmark and find that both improve upon or match the performance of their respective baselines in all environments tested.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2405.01984

A Penalty-Based Guardrail Algorithm for Non-Decreasing Optimization with Inequality Constraints

Authors: Ksenija Stepanovic, Wendelin Böhmer, Mathijs de Weerdt

Abstract: Traditional mathematical programming solvers require long computational times to solve constrained minimization problems of complex and large-scale physical systems. Therefore, these problems are often transformed into unconstrained ones, and solved with computationally efficient optimization approaches based on first-order information, such as the gradient descent method. However, for unconstrained problems, balancing the minimization of the objective function with the reduction of constraint violations is challenging. We consider the class of time-dependent minimization problems with increasing (possibly) nonlinear and non-convex objective function and non-decreasing (possibly) nonlinear and non-convex inequality constraints. To efficiently solve them, we propose a penalty-based guardrail algorithm (PGA). This algorithm adapts a standard penalty-based method by dynamically updating the right-hand side of the constraints with a guardrail variable which adds a margin to prevent violations. We evaluate PGA on two novel application domains: a simplified model of a district heating system and an optimization model derived from learned deep neural networks. Our method significantly outperforms mathematical programming solvers and the standard penalty-based method, and achieves better performance and faster convergence than a state-of-the-art algorithm (IPDD) within a specified time limit.

Submitted 3 May, 2024; originally announced May 2024.

arXiv:2402.01361

To the Max: Reinventing Reward in Reinforcement Learning

Authors: Grigorii Veviurko, Wendelin Böhmer, Mathijs de Weerdt

Abstract: In reinforcement learning (RL), different rewards can define the same optimal policy but result in drastically different learning performance. For some, the agent gets stuck with a suboptimal behavior, and for others, it solves the task efficiently. Choosing a good reward function is hence an extremely important yet challenging problem. In this paper, we explore an alternative approach to using rewards for learning. We introduce max-reward RL, where an agent optimizes the maximum rather than the cumulative reward. Unlike earlier works, our approach works for deterministic and stochastic environments and can be easily combined with state-of-the-art RL algorithms. In the experiments, we study the performance of max-reward RL algorithms in two goal-reaching environments from Gymnasium-Robotics and demonstrate its benefits over standard RL. The code is publicly available.

Submitted 2 February, 2024; originally announced February 2024.
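
Illustrative note: the abstract above contrasts optimizing the maximum reward with optimizing the cumulative reward. The sketch below only compares the two objectives on a single fixed reward sequence; it is not the authors' algorithm. The function names and the undiscounted form of the maximum are assumptions made for illustration, since the abstract does not state how discounting is handled.

```python
from typing import Sequence


def cumulative_return(rewards: Sequence[float], gamma: float = 1.0) -> float:
    """Standard (optionally discounted) sum of rewards along one trajectory."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


def max_reward_return(rewards: Sequence[float]) -> float:
    """Assumed undiscounted max-reward objective: the largest single reward."""
    return max(rewards)


# Hypothetical trajectory: small shaping rewards and one large goal reward.
trajectory = [0.0, 0.2, 1.0, 0.1]
print("cumulative:", cumulative_return(trajectory))
print("maximum:   ", max_reward_return(trajectory))
```

Under the cumulative objective every step contributes to the score, whereas under the maximum objective only the best single step matters, which is the distinction the abstract draws.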

arXiv:2312.02665

Lights out: training RL agents robust to temporary blindness

Authors: N. Ordonez, M. Tromp, P. M. Julbe, W. Böhmer

Abstract: Agents trained with DQN rely on an observation at each timestep to decide what action to take next. However, in real-world applications, observations can change or be missing entirely. Examples of this could be a light bulb breaking down, or the wallpaper in a certain room changing. While these situations change the actual observation, the underlying optimal policy does not change. Because of this, we want our agent to continue taking actions until it receives a (recognized) observation again. To achieve this, we introduce a combination of a neural network architecture that uses hidden representations of the observations and a novel n-step loss function. Our implementation is able to withstand location-based blindness stretches longer than the ones it was trained on, and therefore shows robustness to temporary blindness. For access to our implementation, please email Nathan, Marije, or Pau.

Submitted 5 December, 2023; originally announced December 2023.

arXiv:2310.12816

Multi-Robot Local Motion Planning Using Dynamic Optimization Fabrics

Authors: Saray Bakker, Luzia Knoedler, Max Spahn, Wendelin Böhmer, Javier Alonso-Mora

Abstract: In this paper, we address the problem of real-time motion planning for multiple robotic manipulators that operate in close proximity. We build upon the concept of dynamic fabrics and extend them to multi-robot systems, referred to as Multi-Robot Dynamic Fabrics (MRDF). This geometric method enables a very high planning frequency for high-dimensional systems at the expense of being reactive and prone to deadlocks. To detect and resolve deadlocks, we propose Rollout Fabrics where MRDF are forward simulated in a decentralized manner. We validate the methods in simulated close-proximity pick-and-place scenarios with multiple manipulators, showing high success rates and real-time performance.

Submitted 19 October, 2023; originally announced October 2023.

Comments: 6 pages + 1 page references, 2 tables, 4 figures, preprint version of a paper accepted to the IEEE International Symposium on Multi-Robot & Multi-Agent Systems, Boston, 2023

arXiv:2307.16304

You Shall Pass: Dealing with the Zero-Gradient Problem in Predict and Optimize for Convex Optimization

Authors: Grigorii Veviurko, Wendelin Böhmer, Mathijs de Weerdt

Abstract: Predict and optimize is an increasingly popular decision-making paradigm that employs machine learning to predict unknown parameters of optimization problems. Instead of minimizing the prediction error of the parameters, it trains predictive models using task performance as a loss function. The key challenge in training such models is the computation of the Jacobian of the solution of the optimization problem with respect to its parameters. For linear problems, this Jacobian is known to be zero or undefined; hence, approximations are usually employed. For non-linear convex problems, however, it is common to use the exact Jacobian. This paper demonstrates that the zero-gradient problem appears in the non-linear case as well: the Jacobian can have a sizeable null space, thereby causing the training process to get stuck in suboptimal points. Through formal proofs, this paper shows that smoothing the feasible set resolves this problem. Combining this insight with known techniques from the literature, such as quadratic programming approximation and projection distance regularization, a novel method to approximate the Jacobian is derived. In simulation experiments, the proposed method increases the performance in the non-linear case and at least matches the existing state-of-the-art methods for linear problems.

Submitted 2 February, 2024; v1 submitted 30 July, 2023; originally announced July 2023.

arXiv:2306.07124

Diverse Projection Ensembles for Distributional Reinforcement Learning

Authors: Moritz A. Zanger, Wendelin Böhmer, Matthijs T. J. Spaan

Abstract: In contrast to classical reinforcement learning, distributional reinforcement learning algorithms aim to learn the distribution of returns rather than their expected value. Since the nature of the return distribution is generally unknown a priori or arbitrarily complex, a common approach finds approximations within a set of representable, parametric distributions. Typically, this involves a projection of the unconstrained distribution onto the set of simplified distributions. We argue that this projection step entails a strong inductive bias when coupled with neural networks and gradient descent, thereby profoundly impacting the generalization behavior of learned models. In order to facilitate reliable uncertainty estimation through diversity, this work studies the combination of several different projections and representations in a distributional ensemble. We establish theoretical properties of such projection ensembles and derive an algorithm that uses ensemble disagreement, measured by the average 1-Wasserstein distance, as a bonus for deep exploration. We evaluate our algorithm on the behavior suite benchmark and find that diverse projection ensembles lead to significant performance improvements over existing methods on a wide variety of tasks with the most pronounced gains in directed exploration problems.

Submitted 12 June, 2023; originally announced June 2023.

Comments: 21 pages, 7 figures, submitted to NeurIPS 2023
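
Illustrative note: the abstract above measures ensemble disagreement by an average 1-Wasserstein distance. The sketch below shows one way such a quantity could be computed when each ensemble member is represented by an equally sized list of return samples; in that special case the 1-Wasserstein distance equals the mean absolute difference between sorted samples. Averaging over all member pairs, and the function names, are assumptions made here; the abstract does not say how the paper forms the average.

```python
from itertools import combinations
from typing import List, Sequence


def wasserstein_1(x: Sequence[float], y: Sequence[float]) -> float:
    """1-Wasserstein distance between two equal-size empirical distributions.

    For uniform weights and equal sample counts, the optimal coupling
    matches order statistics, so the distance is the mean absolute
    difference between the sorted samples.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("both sample sets must be non-empty and equal in size")
    return sum(abs(a - b) for a, b in zip(sorted(x), sorted(y))) / len(x)


def ensemble_disagreement(members: List[Sequence[float]]) -> float:
    """Assumed disagreement measure: mean pairwise 1-Wasserstein distance."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 0.0
    return sum(wasserstein_1(a, b) for a, b in pairs) / len(pairs)


# Hypothetical return samples predicted by three ensemble members.
members = [[0.0, 1.0], [1.0, 2.0], [0.0, 1.0]]
exploration_bonus = ensemble_disagreement(members)
print("disagreement bonus:", exploration_bonus)
```

A larger value indicates that the members predict more dissimilar return distributions, which is the kind of signal the abstract describes using as an exploration bonus.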

arXiv:2306.05727

The Role of Diverse Replay for Generalisation in Reinforcement Learning

Authors: Max Weltevrede, Matthijs T. J. Spaan, Wendelin Böhmer

Abstract: In reinforcement learning (RL), key components of many algorithms are the exploration strategy and replay buffer. These strategies regulate what environment data is collected and trained on and have been extensively studied in the RL literature. In this paper, we investigate the impact of these components in the context of generalisation in multi-task RL. We investigate the hypothesis that collecting and training on more diverse data from the training environments will improve zero-shot generalisation to new tasks. We motivate mathematically and show empirically that generalisation to tasks that are "reachable" during training is improved by increasing the diversity of transitions in the replay buffer. Furthermore, we show empirically that this same strategy also shows improvement for generalisation to similar but "unreachable" tasks which could be due to improved generalisation of the learned latent representations.

Submitted 31 August, 2023; v1 submitted 9 June, 2023; originally announced June 2023.

Comments: 15 pages, 8 figures

arXiv:2212.03068

Active Classification of Moving Targets with Learned Control Policies

Authors: Álvaro Serra-Gómez, Eduardo Montijano, Wendelin Böhmer, Javier Alonso-Mora

Abstract: In this paper, we consider the problem where a drone has to collect semantic information to classify multiple moving targets. In particular, we address the challenge of computing control inputs that move the drone to informative viewpoints, position and orientation, when the information is extracted using a "black-box" classifier, e.g., a deep learning neural network. These algorithms typically lack analytical relationships between the viewpoints and their associated outputs, preventing their use in information-gathering schemes. To fill this gap, we propose a novel attention-based architecture, trained via Reinforcement Learning (RL), that outputs the next viewpoint for the drone favoring the acquisition of evidence from as many unclassified targets as possible while reasoning about their movement, orientation, and occlusions. Then, we use a low-level MPC controller to move the drone to the desired viewpoint taking into account its actual dynamics. We show that our approach not only outperforms a variety of baselines but also generalizes to scenarios unseen during training. Additionally, we show that the network scales to large numbers of targets and generalizes well to different movement dynamics of the targets.

Submitted 27 September, 2023; v1 submitted 6 December, 2022; originally announced December 2022.

Comments: 8 pages, 6 figures, Accepted in IEEE RA-L

MSC Class: 68T40

arXiv:2210.13455

E-MCTS: Deep Exploration in Model-Based Reinforcement Learning by Planning with Epistemic Uncertainty

Authors: Yaniv Oren, Matthijs T. J. Spaan, Wendelin Böhmer

Abstract: One of the most well-studied and highly performing planning approaches used in Model-Based Reinforcement Learning (MBRL) is Monte-Carlo Tree Search (MCTS). Key challenges of MCTS-based MBRL methods remain dedicated deep exploration and reliability in the face of the unknown, and both challenges can be alleviated through principled epistemic uncertainty estimation in the predictions of MCTS. We present two main contributions: First, we develop methodology to propagate epistemic uncertainty in MCTS, enabling agents to estimate the epistemic uncertainty in their predictions. Second, we utilize the propagated uncertainty for a novel deep exploration algorithm by explicitly planning to explore. We incorporate our approach into variations of MCTS-based MBRL approaches with learned and provided dynamics models, and empirically show deep exploration through successful epistemic uncertainty estimation achieved by our approach. We compare to a non-planning-based deep-exploration baseline, and demonstrate that planning with epistemic MCTS significantly outperforms non-planning based exploration in the investigated deep exploration benchmark.

Submitted 30 August, 2023; v1 submitted 21 October, 2022; originally announced October 2022.

Comments: Submitted to NeurIPS 2023, accepted to EWRL 2023

arXiv:2010.02974

UneVEn: Universal Value Exploration for Multi-Agent Reinforcement Learning

Authors: Tarun Gupta, Anuj Mahajan, Bei Peng, Wendelin Böhmer, Shimon Whiteson

Abstract: VDN and QMIX are two popular value-based algorithms for cooperative MARL that learn a centralized action value function as a monotonic mixing of per-agent utilities. While this enables easy decentralization of the learned policy, the restricted joint action value function can prevent them from solving tasks that require significant coordination between agents at a given timestep. We show that this problem can be overcome by improving the joint exploration of all agents during training. Specifically, we propose a novel MARL approach called Universal Value Exploration (UneVEn) that learns a set of related tasks simultaneously with a linear decomposition of universal successor features. With the policies of already solved related tasks, the joint exploration process of all agents can be improved to help them achieve better coordination. Empirical results on a set of exploration games, challenging cooperative predator-prey tasks requiring significant coordination among agents, and StarCraft II micromanagement benchmarks show that UneVEn can solve tasks where other state-of-the-art MARL methods fail.

Submitted 10 June, 2021; v1 submitted 6 October, 2020; originally announced October 2020.

Comments: Published at ICML 2021

arXiv:2010.01856

My Body is a Cage: the Role of Morphology in Graph-Based Incompatible Control

Authors: Vitaly Kurin, Maximilian Igl, Tim Rocktäschel, Wendelin Boehmer, Shimon Whiteson

Abstract: Multitask Reinforcement Learning is a promising way to obtain models with better performance, generalisation, data efficiency, and robustness. Most existing work is limited to compatible settings, where the state and action space dimensions are the same across tasks. Graph Neural Networks (GNN) are one way to address incompatible environments, because they can process graphs of arbitrary size. They also allow practitioners to inject biases encoded in the structure of the input graph. Existing work in graph-based continuous control uses the physical morphology of the agent to construct the input graph, i.e., encoding limb features as node labels and using edges to connect the nodes if their corresponding limbs are physically connected. In this work, we present a series of ablations on existing methods that show that morphological information encoded in the graph does not improve their performance. Motivated by the hypothesis that any benefits GNNs extract from the graph structure are outweighed by difficulties they create for message passing, we also propose Amorpheus, a transformer-based approach. Further results show that, while Amorpheus ignores the morphological information that GNNs encode, it nonetheless substantially outperforms GNN-based methods that use the morphological information to define the message-passing scheme.

Submitted 14 April, 2021; v1 submitted 5 October, 2020; originally announced October 2020.

Comments: ICLR 2021 Camera-Ready Version

arXiv:2006.05826

Transient Non-Stationarity and Generalisation in Deep Reinforcement Learning

Authors: Maximilian Igl, Gregory Farquhar, Jelena Luketina, Wendelin Boehmer, Shimon Whiteson

Abstract: Non-stationarity can arise in Reinforcement Learning (RL) even in stationary environments. For example, most RL algorithms collect new data throughout training, using a non-stationary behaviour policy. Due to the transience of this non-stationarity, it is often not explicitly addressed in deep RL and a single neural network is continually updated. However, we find evidence that neural networks exhibit a memory effect where these transient non-stationarities can permanently impact the latent representation and adversely affect generalisation performance. Consequently, to improve generalisation of deep RL agents, we propose Iterated Relearning (ITER). ITER augments standard RL training by repeated knowledge transfer of the current policy into a freshly initialised network, which thereby experiences less non-stationarity during training. Experimentally, we show that ITER improves performance on the challenging generalisation benchmarks ProcGen and Multiroom.

Submitted 22 September, 2021; v1 submitted 10 June, 2020; originally announced June 2020.

arXiv:2006.04222

Randomized Entity-wise Factorization for Multi-Agent Reinforcement Learning

Authors: Shariq Iqbal, Christian A. Schroeder de Witt, Bei Peng, Wendelin Böhmer, Shimon Whiteson, Fei Sha

Abstract: Multi-agent settings in the real world often involve tasks with varying types and quantities of agents and non-agent entities; however, common patterns of behavior often emerge among these agents/entities. Our method aims to leverage these commonalities by asking the question: "What is the expected utility of each agent when only considering a randomly selected sub-group of its observed entities?" By posing this counterfactual question, we can recognize state-action trajectories within sub-groups of entities that we may have encountered in another task and use what we learned in that task to inform our prediction in the current one. We then reconstruct a prediction of the full returns as a combination of factors considering these disjoint groups of entities and train this "randomly factorized" value function as an auxiliary objective for value-based multi-agent reinforcement learning. By doing so, our model can recognize and leverage similarities across tasks to improve learning efficiency in a multi-task setting. Our approach, Randomized Entity-wise Factorization for Imagined Learning (REFIL), outperforms all strong baselines by a significant margin in challenging multi-task StarCraft micromanagement settings.

Submitted 11 June, 2021; v1 submitted 7 June, 2020; originally announced June 2020.

Comments: ICML 2021 Camera Ready

arXiv:2005.09220

Privileged Information Dropout in Reinforcement Learning

Authors: Pierre-Alexandre Kamienny, Kai Arulkumaran, Feryal Behbahani, Wendelin Boehmer, Shimon Whiteson

Abstract: Using privileged information during training can improve the sample efficiency and performance of machine learning systems. This paradigm has been applied to reinforcement learning (RL), primarily in the form of distillation or auxiliary tasks, and less commonly in the form of augmenting the inputs of agents. In this work, we investigate Privileged Information Dropout for achieving the latter, which can be applied equally to value-based and policy-based RL algorithms. Within a simple partially-observed environment, we demonstrate that Privileged Information Dropout outperforms alternatives for leveraging privileged information, including distillation and auxiliary tasks, and can successfully utilise different types of privileged information. Finally, we analyse its effect on the learned representations.

Submitted 19 May, 2020; originally announced May 2020.

arXiv:2003.06709

FACMAC: Factored Multi-Agent Centralised Policy Gradients

Authors: Bei Peng, Tabish Rashid, Christian A. Schroeder de Witt, Pierre-Alexandre Kamienny, Philip H. S. Torr, Wendelin Böhmer, Shimon Whiteson

Abstract: We propose FACtored Multi-Agent Centralised policy gradients (FACMAC), a new method for cooperative multi-agent reinforcement learning in both discrete and continuous action spaces. Like MADDPG, a popular multi-agent actor-critic method, our approach uses deep deterministic policy gradients to learn policies. However, FACMAC learns a centralised but factored critic, which combines per-agent utilities into the joint action-value function via a non-linear monotonic function, as in QMIX, a popular multi-agent Q-learning algorithm. However, unlike QMIX, there are no inherent constraints on factoring the critic. We thus also employ a nonmonotonic factorisation and empirically demonstrate that its increased representational capacity allows it to solve some tasks that cannot be solved with monolithic, or monotonically factored critics. In addition, FACMAC uses a centralised policy gradient estimator that optimises over the entire joint action space, rather than optimising over each agent's action space separately as in MADDPG. This allows for more coordinated policy changes and fully reaps the benefits of a centralised critic. We evaluate FACMAC on variants of the multi-agent particle environments, a novel multi-agent MuJoCo benchmark, and a challenging set of StarCraft II micromanagement tasks. Empirical results demonstrate FACMAC's superior performance over MADDPG and other baselines on all three domains.

Submitted 7 May, 2021; v1 submitted 14 March, 2020; originally announced March 2020.

arXiv:2002.12174

Optimistic Exploration even with a Pessimistic Initialisation

Authors: Tabish Rashid, Bei Peng, Wendelin Böhmer, Shimon Whiteson

Abstract: Optimistic initialisation is an effective strategy for efficient exploration in reinforcement learning (RL). In the tabular case, all provably efficient model-free algorithms rely on it. However, model-free deep RL algorithms do not use optimistic initialisation despite taking inspiration from these provably efficient tabular algorithms. In particular, in scenarios with only positive rewards, Q-values are initialised at their lowest possible values due to commonly used network initialisation schemes, a pessimistic initialisation. Merely initialising the network to output optimistic Q-values is not enough, since we cannot ensure that they remain optimistic for novel state-action pairs, which is crucial for exploration. We propose a simple count-based augmentation to pessimistically initialised Q-values that separates the source of optimism from the neural network. We show that this scheme is provably efficient in the tabular setting and extend it to the deep RL setting. Our algorithm, Optimistic Pessimistically Initialised Q-Learning (OPIQ), augments the Q-value estimates of a DQN-based agent with count-derived bonuses to ensure optimism during both action selection and bootstrapping. We show that OPIQ outperforms non-optimistic DQN variants that utilise a pseudocount-based intrinsic motivation in hard exploration tasks, and that it predicts optimistic estimates for novel state-action pairs.

Submitted 26 February, 2020; originally announced February 2020.

Comments: Published as a conference paper at ICLR 2020

arXiv:1910.09508 [pdf, other]

doi 10.1007/978-3-030-29911-8_7

Multi-agent Hierarchical Reinforcement Learning with Dynamic Termination

Authors: Dongge Han, Wendelin Boehmer, Michael Wooldridge, Alex Rogers

Abstract: In a multi-agent system, an agent's optimal policy will typically depend on the policies chosen by others. Therefore, a key issue in multi-agent systems research is that of predicting the behaviours of others, and responding promptly to changes in such behaviours. One obvious possibility is for each agent to broadcast their current intention, for example, the currently executed option in a hierarchical reinforcement learning framework. However, this approach results in inflexibility of agents if options have an extended duration and are dynamic. While adjusting the executed option at each step improves flexibility from a single-agent perspective, frequent changes in options can induce inconsistency between an agent's actual behaviour and its broadcast intention. In order to balance flexibility and predictability, we propose a dynamic termination Bellman equation that allows the agents to flexibly terminate their options. We evaluate our model empirically on a set of multi-agent pursuit and taxi tasks, and show that our agents learn to adapt flexibly across scenarios that require different termination behaviours.

Submitted 21 October, 2019; originally announced October 2019.

Comments: PRICAI 2019

arXiv:1910.00091 [pdf, other]

Deep Coordination Graphs

Authors: Wendelin Böhmer, Vitaly Kurin, Shimon Whiteson

Abstract: This paper introduces the deep coordination graph (DCG) for collaborative multi-agent reinforcement learning. DCG strikes a flexible trade-off between representational capacity and generalization by factoring the joint value function of all agents according to a coordination graph into payoffs between pairs of agents. The value can be maximized by local message passing along the graph, which allows training of the value function end-to-end with Q-learning. Payoff functions are approximated with deep neural networks that employ parameter sharing and low-rank approximations to significantly improve sample efficiency. We show that DCG can solve predator-prey tasks that highlight the relative overgeneralization pathology, as well as challenging StarCraft II micromanagement tasks.

Submitted 23 June, 2020; v1 submitted 27 September, 2019; originally announced October 2019.

Comments: Accepted at ICML 2020

arXiv:1906.02138 [pdf, other]

Exploration with Unreliable Intrinsic Reward in Multi-Agent Reinforcement Learning

Authors: Wendelin Böhmer, Tabish Rashid, Shimon Whiteson

Abstract: This paper investigates the use of intrinsic reward to guide exploration in multi-agent reinforcement learning. We discuss the challenges in applying intrinsic reward to multiple collaborative agents and demonstrate how unreliable reward can prevent decentralized agents from learning the optimal policy. We address this problem with a novel framework, Independent Centrally-assisted Q-learning (ICQL), in which decentralized agents share control and an experience replay buffer with a centralized agent. Only the centralized agent is intrinsically rewarded, but the decentralized agents still benefit from improved exploration, without the distraction of unreliable incentives.

Submitted 5 June, 2019; originally announced June 2019.

Comments: Accepted to the 2nd Exploration in Reinforcement Learning Workshop at the International Conference on Machine Learning 2019

arXiv:1905.01072 [pdf, other]

Deep Residual Reinforcement Learning

Authors: Shangtong Zhang, Wendelin Boehmer, Shimon Whiteson

Abstract: We revisit residual algorithms in both model-free and model-based reinforcement learning settings. We propose the bidirectional target network technique to stabilize residual algorithms, yielding a residual version of DDPG that significantly outperforms vanilla DDPG in the DeepMind Control Suite benchmark. Moreover, we find the residual algorithm an effective approach to the distribution mismatch problem in model-based planning. Compared with the existing TD($k$) method, our residual-based method makes weaker assumptions about the model and yields a greater performance boost.

Submitted 23 January, 2020; v1 submitted 3 May, 2019; originally announced May 2019.

Comments: AAMAS 2020

arXiv:1904.01033 [pdf, other]

Multitask Soft Option Learning

Authors: Maximilian Igl, Andrew Gambardella, Jinke He, Nantas Nardelli, N. Siddharth, Wendelin Böhmer, Shimon Whiteson

Abstract: We present Multitask Soft Option Learning (MSOL), a hierarchical multitask framework based on Planning as Inference. MSOL extends the concept of options, using separate variational posteriors for each task, regularized by a shared prior. This "soft" version of options avoids several instabilities during training in a multitask setting, and provides a natural way to learn both intra-option policies and their terminations. Furthermore, it allows fine-tuning of options for new tasks without forgetting their learned policies, leading to faster training without reducing the expressiveness of the hierarchical policy. We demonstrate empirically that MSOL significantly outperforms both hierarchical and flat transfer-learning baselines.

Submitted 21 June, 2020; v1 submitted 1 April, 2019; originally announced April 2019.

Comments: Published at UAI 2020

arXiv:1903.11329 [pdf, other]

Generalized Off-Policy Actor-Critic

Authors: Shangtong Zhang, Wendelin Boehmer, Shimon Whiteson

Abstract: We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting. Compared to the commonly used excursion objective, which can be misleading about the performance of the target policy when deployed, our new objective better predicts such performance. We prove the Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient of the counterfactual objective and use an emphatic approach to get an unbiased sample from this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over existing algorithms in Mujoco robot simulation tasks, the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.

Submitted 28 October, 2019; v1 submitted 27 March, 2019; originally announced March 2019.

Comments: NeurIPS 2019

arXiv:1810.11702 [pdf, other]

Multi-Agent Common Knowledge Reinforcement Learning

Authors: Christian A. Schroeder de Witt, Jakob N. Foerster, Gregory Farquhar, Philip H. S. Torr, Wendelin Boehmer, Shimon Whiteson

Abstract: Cooperative multi-agent reinforcement learning often requires decentralised policies, which severely limit the agents' ability to coordinate their behaviour. In this paper, we show that common knowledge between agents allows for complex decentralised coordination. Common knowledge arises naturally in a large number of decentralised cooperative multi-agent tasks, for example, when agents can reconstruct parts of each other's observations. Since agents can independently agree on their common knowledge, they can execute complex coordinated policies that condition on this knowledge in a fully decentralised fashion. We propose multi-agent common knowledge reinforcement learning (MACKRL), a novel stochastic actor-critic algorithm that learns a hierarchical policy tree. Higher levels in the hierarchy coordinate groups of agents by conditioning on their common knowledge, or delegate to lower levels with smaller subgroups but potentially richer common knowledge. The entire policy tree can be executed in a fully decentralised fashion. As the lowest policy tree level consists of independent policies for each agent, MACKRL reduces to independently learnt decentralised policies as a special case. We demonstrate that our method can exploit common knowledge for superior performance on complex decentralised coordination tasks, including a stochastic matrix game and challenging problems in StarCraft II unit micromanagement.

Submitted 11 January, 2020; v1 submitted 27 October, 2018; originally announced October 2018.

Comments: Advances in Neural Information Processing Systems, 9924-9935

arXiv:1612.07548 [pdf, ps, other]

Non-Deterministic Policy Improvement Stabilizes Approximated Reinforcement Learning

Authors: Wendelin Böhmer, Rong Guo, Klaus Obermayer

Abstract: This paper investigates a type of instability that is linked to the greedy policy improvement in approximated reinforcement learning. We show empirically that non-deterministic policy improvement can stabilize methods like LSPI by controlling the improvements' stochasticity. Additionally we show that a suitable representation of the value function also stabilizes the solution to some degree. The presented approach is simple and should also be easily transferable to more sophisticated algorithms like deep reinforcement learning.

Submitted 22 December, 2016; originally announced December 2016.

Comments: This paper has been presented at the 13th European Workshop on Reinforcement Learning (EWRL 2016) on the 3rd and 4th of December 2016 in Barcelona, Spain

arXiv:1504.04456 [pdf, ps, other]

Theoretical neutron-capture cross sections for r-process nucleosynthesis in the $^{48}$Ca region

Authors: T. Rauscher, W. Böhmer, K. -L. Kratz, W. Balogh, H. Oberhummer

Abstract: We calculate neutron capture cross sections for r-process nucleosynthesis in the $^{48}$Ca region, namely for the isotopes $^{40-44}$S, $^{46-50}$Ar, $^{56-66}$Ti, $^{62-68}$Cr, and $^{72-76}$Fe. While previously only cross sections resulting from the compound nucleus reaction mechanism (Hauser-Feshbach) have been considered, we recalculate not only that contribution to the cross section but also include direct capture on even-even nuclei. The level schemes, which are of utmost importance in the direct capture calculations, are taken from quasi-particle states obtained with a folded-Yukawa potential and Lipkin-Nogami pairing. The most recent deformation values derived from experimental data on $β$-decay half-lives are used where available. Due to the consideration of direct capture, the capture rates are enhanced and the "turning points" in the r-process path are shifted to slightly higher mass numbers. We also discuss the sensitivity of the direct capture cross sections to the assumed deformation.

Submitted 17 April, 2015; originally announced April 2015.

Comments: 6 pages; talk at ENAM95, appeared in the proceedings, uploaded here to allow easy access. in Proc. Int. Conf. on Exotic Nuclei and Atomic Masses "ENAM 95", eds. M. de Saint Simon and O. Sorlin (Editions Frontières, Gif-sur-Yvette 1995), p. 683

arXiv:1412.6286 [pdf, ps, other]

Regression with Linear Factored Functions

Authors: Wendelin Böhmer, Klaus Obermayer

Abstract: Many applications that use empirically estimated functions face a curse of dimensionality, because the integrals over most function classes must be approximated by sampling. This paper introduces a novel regression algorithm that learns linear factored functions (LFF). This class of functions has structural properties that allow certain integrals to be solved analytically and point-wise products to be calculated. Applications like belief propagation and reinforcement learning can exploit these properties to break the curse and speed up computation. We derive a regularized greedy optimization scheme that learns factored basis functions during training. The novel regression algorithm performs competitively with Gaussian processes on benchmark tasks, and the learned LFFs are very compact, with 4-9 factored basis functions on average.

Submitted 30 March, 2015; v1 submitted 19 December, 2014; originally announced December 2014.

Comments: Under review as conference paper at ECML/PKDD 2015

arXiv:1205.0986 [pdf, ps, other]

Robot Navigation using Reinforcement Learning and Slow Feature Analysis

Authors: Wendelin Böhmer

Abstract: The application of reinforcement learning algorithms to real-life problems always bears the challenge of filtering the environmental state out of raw sensor readings. While most approaches use heuristics, biology suggests that there must exist an unsupervised method to construct such filters automatically. Besides extracting environmental states, the filters have to represent them in a fashion that supports modern reinforcement learning algorithms. Many popular algorithms use a linear architecture, so one should aim at filters that have good approximation properties in combination with linear functions. This thesis proposes the unsupervised method slow feature analysis (SFA) for this task. Presented with a random sequence of sensor readings, SFA learns a set of filters. With growing model complexity and training examples, the filters converge to trigonometric polynomial functions. These are known to possess excellent approximation capabilities and should therefore support the reinforcement learning algorithms well. We evaluate this claim on a robot. The task is to learn a navigational control in a simple environment using the least-squares policy iteration (LSPI) algorithm. The only accessible sensor is a head-mounted video camera, but without meaningful filtering, video images are not suited as LSPI input. We will show that filters learned by SFA, based on a random walk video of the robot, allow the learned control to navigate successfully in ca. 80% of the test trials.

Submitted 4 May, 2012; originally announced May 2012.

Comments: Diploma Thesis

arXiv:astro-ph/0012217 [pdf, ps, other]

On the origin of the Ca-Ti-Cr isotopic anomalies in the inclusion EK-1-4-1 of the Allende Meteorite

Authors: K. -L. Kratz, W. Boehmer, C. Freiburghaus, P. Moeller, B. Pfeiffer, T. Rauscher, F. -K. Thielemann

Abstract: In the framework of our investigation to explain the nucleosynthesis origin of the correlated Ca-Ti-Cr isotopic anomalies in the Ca-Al-rich "FUN" inclusion EK-1-4-1 of the Allende meteorite, the nuclear-physics basis in the neutron-rich N=28 region has been updated by including recent experimental data on beta-decay properties and microscopic predictions of neutron-capture cross sections. Charged-particle and subsequent r-process calculations within an entropy-based approach were performed using a complete reaction network. It is shown that there exist two astrophysical scenarios within which the observed isotopic anomalies can be reproduced simultaneously; one at low entropies (about 10) which confirms the earlier suggested SN Ia mechanism, and another at high entropies (about 150) which could be compatible with the neutrino-wind scenario of a SN II.

Submitted 11 December, 2000; originally announced December 2000.

Comments: 15 pages, 3 figures. Proc. Torino-Melbourne-Pasadena "Wasserburg" Workshop, U. Torino, 21-22 June 2000. For publication in Mem.S.A.It

arXiv:nucl-ex/9812008 [pdf, ps, other]

doi 10.1103/PhysRevLett.82.1391

Decay of neutron-rich Mn nuclides and deformation of heavy Fe isotopes

Authors: M. Hannawald, T. Kautzsch, A. Woehr, W. B. Walters, K. -L. Kratz, V. N. Fedoseyev, V. L. Mishin, W. Boehmer, B. Pfeiffer, V. Sebastian, Y. Jading, U. Koester, J. Lettry, H. L. Ravn, the ISOLDE Collaboration

Abstract: The use of chemically selective laser ionization combined with beta-delayed neutron counting at CERN/ISOLDE has permitted identification and half-life measurements for 623-ms Mn-61 up through 14-ms Mn-69. The measured half-lives are found to be significantly longer near N=40 than the values calculated with a QRPA shell model using ground-state deformations from the FRDM and ETFSI models. Gamma-ray singles and coincidence spectroscopy has been performed for Mn-64 and Mn-66 decays to levels of Fe-64 and Fe-66, revealing a significant drop in the energy of the first 2+ state in these nuclides that suggests an unanticipated increase in collectivity near N=40.

Submitted 21 December, 1998; originally announced December 1998.

Comments: Latex-file with 4 figures, 5 pages, Phys. Rev. Lett., in print

Journal ref: Phys.Rev.Lett. 82 (1999) 1391-1394

Showing 1–31 of 31 results for author: Boehmer, W