Search | arXiv e-print repository

Laser Learning Environment: A new environment for coordination-critical multi-agent tasks

Authors: Yannick Molinghen, Raphaël Avalos, Mark Van Achter, Ann Nowé, Tom Lenaerts

Abstract: We introduce the Laser Learning Environment (LLE), a collaborative multi-agent reinforcement learning environment in which coordination is central. In LLE, agents depend on each other to make progress (interdependence), must jointly take specific sequences of actions to succeed (perfect coordination), and accomplishing those joint actions does not yield any intermediate reward (zero-incentive dynamics). The challenge of such problems lies in the difficulty of escaping state space bottlenecks caused by interdependence steps, since escaping those bottlenecks is not rewarded. We test multiple state-of-the-art value-based MARL algorithms against LLE and show that they consistently fail at the collaborative task because of their inability to escape state space bottlenecks, even though they successfully achieve perfect coordination. We show that Q-learning extensions such as prioritized experience replay and n-step returns hinder exploration in environments with zero-incentive dynamics, and find that intrinsic curiosity with random network distillation is not sufficient to escape those bottlenecks. We demonstrate the need for novel methods to solve this problem and the relevance of LLE as a cooperative MARL benchmark.

Submitted 4 April, 2024; originally announced April 2024.

Comments: Pre-print, 21 pages

arXiv:2403.08829 [pdf, other]

Mitigating Biases in Collective Decision-Making: Enhancing Performance in the Face of Fake News

Authors: Axel Abels, Elias Fernandez Domingos, Ann Nowé, Tom Lenaerts

Abstract: Individual and social biases undermine the effectiveness of human advisers by inducing judgment errors which can disadvantage protected groups. In this paper, we study the influence these biases can have in the pervasive problem of fake news by evaluating human participants' capacity to identify false headlines. By focusing on headlines involving sensitive characteristics, we gather a comprehensive dataset to explore how human responses are shaped by their biases. Our analysis reveals recurring individual biases and their permeation into collective decisions. We show that demographic factors, headline categories, and the manner in which information is presented significantly influence errors in human judgment. We then use our collected data as a benchmark problem on which we evaluate the efficacy of adaptive aggregation algorithms. In addition to their improved accuracy, our results highlight the interactions between the emergence of collective intelligence and the mitigation of participant biases.

Submitted 11 March, 2024; originally announced March 2024.

arXiv:2402.13785 [pdf, other]

Synthesis of Hierarchical Controllers Based on Deep Reinforcement Learning Policies

Authors: Florent Delgrange, Guy Avni, Anna Lukina, Christian Schilling, Ann Nowé, Guillermo A. Pérez

Abstract: We propose a novel approach to the problem of controller design for environments modeled as Markov decision processes (MDPs). Specifically, we consider a hierarchical MDP: a graph with each vertex populated by an MDP called a "room". We first apply deep reinforcement learning (DRL) to obtain low-level policies for each room, scaling to large rooms of unknown structure. We then apply reactive synthesis to obtain a high-level planner that chooses which low-level policy to execute in each room. The central challenge in synthesizing the planner is the need for modeling rooms. We address this challenge by developing a DRL procedure to train concise "latent" policies together with PAC guarantees on their performance. Unlike previous approaches, ours circumvents a model distillation step. Our approach combats sparse rewards in DRL and enables reusability of low-level policies. We demonstrate feasibility in a case study involving agent navigation amid moving obstacles.

Submitted 21 February, 2024; originally announced February 2024.

Comments: 19 pages main text, 17 pages Appendix (excluding references)

arXiv:2402.07182 [pdf, other]

Divide and Conquer: Provably Unveiling the Pareto Front with Multi-Objective Reinforcement Learning

Authors: Willem Röpke, Mathieu Reymond, Patrick Mannion, Diederik M. Roijers, Ann Nowé, Roxana Rădulescu

Abstract: A significant challenge in multi-objective reinforcement learning is obtaining a Pareto front of policies that attain optimal performance under different preferences. We introduce Iterated Pareto Referent Optimisation (IPRO), a principled algorithm that decomposes the task of finding the Pareto front into a sequence of single-objective problems for which various solution methods exist. This enables us to establish convergence guarantees while providing an upper bound on the distance to undiscovered Pareto optimal solutions at each step. Empirical evaluations demonstrate that IPRO matches or outperforms methods that require additional domain knowledge. By leveraging problem-specific single-objective solvers, our approach also holds promise for applications beyond multi-objective reinforcement learning, such as in pathfinding and optimisation.

Submitted 11 February, 2024; originally announced February 2024.
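
Illustrative note on the entry above: the abstract refers to a Pareto front of policies. The sketch below only shows what "Pareto front" means for a finite set of return vectors; it is not IPRO, and the function names and sample data are hypothetical. It assumes every objective is to be maximised.

```python
def dominates(a, b):
    """True if vector `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that no other point dominates (order preserved)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical two-objective returns of five candidate policies.
returns = [(1.0, 5.0), (2.0, 4.0), (2.0, 3.0), (3.0, 1.0), (0.5, 0.5)]
front = pareto_front(returns)
print(front)
```

This brute-force filter compares every pair of points, which is adequate for small sets; it says nothing about how candidate policies are found in the first place, which is the problem the paper addresses.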

arXiv:2306.10134 [pdf, other]

Dynamic Size Message Scheduling for Multi-Agent Communication under Limited Bandwidth

Authors: Qingshuang Sun, Denis Steckelmacher, Yuan Yao, Ann Nowé, Raphaël Avalos

Abstract: Communication plays a vital role in multi-agent systems, fostering collaboration and coordination. However, in real-world scenarios where communication is bandwidth-limited, existing multi-agent reinforcement learning (MARL) algorithms often provide agents with a binary choice: either transmitting a fixed number of bytes or no information at all. This limitation hinders the ability to effectively utilize the available bandwidth. To overcome this challenge, we present the Dynamic Size Message Scheduling (DSMS) method, which introduces a finer-grained approach to scheduling by considering the actual size of the information to be exchanged. Our contribution lies in adaptively adjusting message sizes using Fourier transform-based compression techniques, enabling agents to tailor their messages to match the allocated bandwidth while striking a balance between information loss and transmission efficiency. Receiving agents can reliably decompress the messages using the inverse Fourier transform. Experimental results demonstrate that DSMS significantly improves performance in multi-agent cooperative tasks by optimizing the utilization of bandwidth and effectively balancing information value.

Submitted 16 June, 2023; originally announced June 2023.
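
Illustrative note on the entry above: the abstract describes shrinking a message with a Fourier transform and restoring it with the inverse transform. The sketch below shows that general idea under simple assumptions: a real-valued message, a naive DFT, and a budget of at most half the message length. It is not the DSMS method itself, and all names here are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform (O(n^2)), fine for short messages."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def compress(message, budget):
    """Keep only the `budget` lowest-frequency coefficients of a real message."""
    return dft(message)[:budget]


def decompress(kept, length):
    """Rebuild a real signal of `length` samples from the kept coefficients."""
    spectrum = [0j] * length
    for k, c in enumerate(kept):
        spectrum[k] = c
        if 0 < k < length - k:
            # A real signal has a conjugate-symmetric spectrum.
            spectrum[length - k] = c.conjugate()
    return [
        (sum(c * cmath.exp(2j * math.pi * k * t / length)
             for k, c in enumerate(spectrum)) / length).real
        for t in range(length)
    ]


def max_error(a, b):
    """Largest absolute difference between two equally long sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical message: two sinusoids at frequencies 2 and 5 over 32 samples.
n = 32
message = [
    math.sin(2 * math.pi * 2 * t / n) + 0.5 * math.cos(2 * math.pi * 5 * t / n)
    for t in range(n)
]
lossless = decompress(compress(message, budget=8), n)  # keeps both frequencies
lossy = decompress(compress(message, budget=3), n)     # drops frequency 5
print(max_error(lossless, message), max_error(lossy, message))
```

A larger budget keeps more frequencies and loses less information, which is the trade-off between message size and fidelity that the abstract mentions.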

arXiv:2305.05560 [pdf, other]

Distributional Multi-Objective Decision Making

Authors: Willem Röpke, Conor F. Hayes, Patrick Mannion, Enda Howley, Ann Nowé, Diederik M. Roijers

Abstract: For effective decision support in scenarios with conflicting objectives, sets of potentially optimal solutions can be presented to the decision maker. We explore both what policies these sets should contain and how such sets can be computed efficiently. With this in mind, we take a distributional approach and introduce a novel dominance criterion relating return distributions of policies directly. Based on this criterion, we present the distributional undominated set and show that it contains optimal policies otherwise ignored by the Pareto front. In addition, we propose the convex distributional undominated set and prove that it comprises all policies that maximise expected utility for multivariate risk-averse decision makers. We propose a novel algorithm to learn the distributional undominated set and further contribute pruning operators to reduce the set to the convex distributional undominated set. Through experiments, we demonstrate the feasibility and effectiveness of these methods, making this a valuable new approach for decision support in real-world problems.

Submitted 18 July, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

Comments: Accepted at IJCAI 2023

arXiv:2305.01063 [pdf, other]

doi 10.5555/3618408.3618413

Expertise Trees Resolve Knowledge Limitations in Collective Decision-Making

Authors: Axel Abels, Tom Lenaerts, Vito Trianni, Ann Nowé

Abstract: Experts advising decision-makers are likely to display expertise which varies as a function of the problem instance. In practice, this may lead to sub-optimal or discriminatory decisions against minority cases. In this work we model such changes in depth and breadth of knowledge as a partitioning of the problem space into regions of differing expertise. We provide here new algorithms that explicitly consider and adapt to the relationship between problem instances and experts' knowledge. We first propose and highlight the drawbacks of a naive approach based on nearest neighbor queries. To address these drawbacks we then introduce a novel algorithm - expertise trees - that constructs decision trees enabling the learner to select appropriate models. We provide theoretical insights and empirically validate the improved performance of our novel approach on a range of problems for which existing methods proved to be inadequate.

Submitted 4 May, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

Comments: Proceedings of the 40th International Conference on Machine Learning (2023)

arXiv:2304.08897 [pdf, other]

doi 10.1016/j.segan.2023.101202

An adaptive safety layer with hard constraints for safe reinforcement learning in multi-energy management systems

Authors: Glenn Ceusters, Muhammad Andy Putratama, Rüdiger Franke, Ann Nowé, Maarten Messagie

Abstract: Safe reinforcement learning (RL) with hard constraint guarantees is a promising optimal control direction for multi-energy management systems. It only requires the environment-specific constraint functions themselves a priori and not a complete model. The project-specific upfront and ongoing engineering efforts are therefore still reduced, better representations of the underlying system dynamics can still be learnt, and modelling bias is kept to a minimum. However, even the constraint functions alone are not always trivial to accurately provide in advance, leading to potentially unsafe behaviour. In this paper, we present two novel advancements: (I) combining the OptLayer and SafeFallback method, named OptLayerPolicy, to increase the initial utility while keeping a high sample efficiency and the possibility to formulate equality constraints. (II) introducing self-improving hard constraints, to increase the accuracy of the constraint functions as more and new data becomes available so that better policies can be learnt. Both advancements keep the constraint formulation decoupled from the RL formulation, so new (presumably better) RL algorithms can act as drop-in replacements. We have shown that, in a simulated multi-energy system case study, the initial utility is increased to 92.4% (OptLayerPolicy) compared to 86.1% (OptLayer) and that the policy after training is increased to 104.9% (GreyOptLayerPolicy) compared to 103.4% (OptLayer) - all relative to a vanilla RL benchmark. Although introducing surrogate functions into the optimisation problem requires special attention, we conclude that the newly presented GreyOptLayerPolicy method is the most advantageous.

Submitted 6 November, 2023; v1 submitted 18 April, 2023; originally announced April 2023.

Comments: post-print

arXiv:2303.12558 [pdf, other]

Wasserstein Auto-encoded MDPs: Formal Verification of Efficiently Distilled RL Policies with Many-sided Guarantees

Authors: Florent Delgrange, Ann Nowé, Guillermo A. Pérez

Abstract: Although deep reinforcement learning (DRL) has many success stories, the large-scale deployment of policies learned through these advanced techniques in safety-critical scenarios is hindered by their lack of formal guarantees. Variational Markov Decision Processes (VAE-MDPs) are discrete latent space models that provide a reliable framework for distilling formally verifiable controllers from any RL policy. While the related guarantees address relevant practical aspects such as the satisfaction of performance and safety properties, the VAE approach suffers from several learning flaws (posterior collapse, slow learning speed, poor dynamics estimates), primarily due to the absence of abstraction and representation guarantees to support latent optimization. We introduce the Wasserstein auto-encoded MDP (WAE-MDP), a latent space model that fixes those issues by minimizing a penalized form of the optimal transport between the behaviors of the agent executing the original policy and the distilled policy, for which the formal guarantees apply. Our approach yields bisimulation guarantees while learning the distilled policy, allowing concrete optimization of the abstraction and representation model quality. Our experiments show that, besides distilling policies up to 10 times faster, the latent model quality is indeed better in general. Moreover, we present experiments from a simple time-to-failure verification algorithm on the latent space. The fact that our approach enables such simple verification techniques highlights its applicability.

Submitted 21 April, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

Comments: ICLR 2023, 10 pages main text, 14 pages appendix (excluding references)

arXiv:2303.03284 [pdf, other]

The Wasserstein Believer: Learning Belief Updates for Partially Observable Environments through Reliable Latent Space Models

Authors: Raphael Avalos, Florent Delgrange, Ann Nowé, Guillermo A. Pérez, Diederik M. Roijers

Abstract: Partially Observable Markov Decision Processes (POMDPs) are used to model environments where the full state cannot be perceived by an agent. As such the agent needs to reason taking into account the past observations and actions. However, simply remembering the full history is generally intractable due to the exponential growth in the history space. Maintaining a probability distribution that models the belief over what the true state is can be used as a sufficient statistic of the history, but its computation requires access to the model of the environment and is often intractable. While SOTA algorithms use Recurrent Neural Networks to compress the observation-action history aiming to learn a sufficient statistic, they lack guarantees of success and can lead to sub-optimal policies. To overcome this, we propose the Wasserstein Belief Updater, an RL algorithm that learns a latent model of the POMDP and an approximation of the belief update. Our approach comes with theoretical guarantees on the quality of our approximation ensuring that our outputted beliefs allow for learning the optimal value function.

Submitted 26 October, 2023; v1 submitted 6 March, 2023; originally announced March 2023.

arXiv:2301.12822 [pdf, other]

Evaluating COVID-19 vaccine allocation policies using Bayesian $m$-top exploration

Authors: Alexandra Cimpean, Timothy Verstraeten, Lander Willem, Niel Hens, Ann Nowé, Pieter Libin

Abstract: Individual-based epidemiological models support the study of fine-grained preventive measures, such as tailored vaccine allocation policies, in silico. As individual-based models are computationally intensive, it is pivotal to identify optimal strategies within a reasonable computational budget. Moreover, due to the high societal impact associated with the implementation of preventive strategies, uncertainty regarding decisions should be communicated to policy makers, which is naturally embedded in a Bayesian approach. We present a novel technique for evaluating vaccine allocation strategies using a multi-armed bandit framework in combination with a Bayesian anytime $m$-top exploration algorithm. $m$-top exploration allows the algorithm to learn $m$ policies for which it expects the highest utility, enabling experts to inspect this small set of alternative strategies, along with their quantified uncertainty. The anytime component provides policy advisors with flexibility regarding the computation time and the desired confidence, which is important as it is difficult to make this trade-off beforehand. We consider the Belgian COVID-19 epidemic using the individual-based model STRIDE, where we learn a set of vaccination policies that minimize the number of infections and hospitalisations. Through experiments we show that our method can efficiently identify the $m$-top policies, which is validated in a scenario where the ground truth is available. Finally, we explore how vaccination policies can best be organised under different contact reduction schemes. Through these experiments, we show that the top policies follow a clear trend regarding the prioritised age groups and assigned vaccine type, which provides insights for future vaccination campaigns.

Submitted 30 January, 2023; originally announced January 2023.

arXiv:2301.12820 [pdf, other]

Transferring Multiple Policies to Hotstart Reinforcement Learning in an Air Compressor Management Problem

Authors: Hélène Plisnier, Denis Steckelmacher, Jeroen Willems, Bruno Depraetere, Ann Nowé

Abstract: Many instances of similar or almost-identical industrial machines or tools are often deployed at once, or in quick succession. For instance, a particular model of air compressor may be installed at hundreds of customers. Because these tools perform distinct but highly similar tasks, it is interesting to be able to quickly produce a high-quality controller for machine $N+1$ given the controllers already produced for machines $1..N$. This is even more important when the controllers are learned through Reinforcement Learning, as training takes time, energy and other resources. In this paper, we apply Policy Intersection, a Policy Shaping method, to help a Reinforcement Learning agent learn to solve a new variant of a compressors control problem faster, by transferring knowledge from several previously learned controllers. We show that our approach outperforms loading an old controller, and significantly improves performance in the long run.

Submitted 30 January, 2023; originally announced January 2023.

Comments: Preliminary version, experimental details still to be made more precise

arXiv:2301.07784 [pdf, other]

doi 10.5555/3545946.3598872

Sample-Efficient Multi-Objective Learning via Generalized Policy Improvement Prioritization

Authors: Lucas N. Alegre, Ana L. C. Bazzan, Diederik M. Roijers, Ann Nowé, Bruno C. da Silva

Abstract: Multi-objective reinforcement learning (MORL) algorithms tackle sequential decision problems where agents may have different preferences over (possibly conflicting) reward functions. Such algorithms often learn a set of policies (each optimized for a particular agent preference) that can later be used to solve problems with novel preferences. We introduce a novel algorithm that uses Generalized Policy Improvement (GPI) to define principled, formally-derived prioritization schemes that improve sample-efficient learning. They implement active-learning strategies by which the agent can (i) identify the most promising preferences/objectives to train on at each moment, to more rapidly solve a given MORL problem; and (ii) identify which previous experiences are most relevant when learning a policy for a particular agent preference, via a novel Dyna-style MORL method. We prove our algorithm is guaranteed to always converge to an optimal solution in a finite number of steps, or an $ε$-optimal solution (for a bounded $ε$) if the agent is limited and can only identify possibly sub-optimal policies. We also prove that our method monotonically improves the quality of its partial solutions while learning. Finally, we introduce a bound that characterizes the maximum utility loss (with respect to the optimal solution) incurred by the partial solutions computed by our method throughout learning. We empirically show that our method outperforms state-of-the-art MORL algorithms in challenging multi-objective tasks, both with discrete and continuous state and action spaces.

Submitted 23 March, 2023; v1 submitted 18 January, 2023; originally announced January 2023.

Comments: Accepted to AAMAS 2023

arXiv:2301.05755 [pdf, other]

Bridging the Gap Between Single and Multi Objective Games

Authors: Willem Röpke, Carla Groenland, Roxana Rădulescu, Ann Nowé, Diederik M. Roijers

Abstract: A classic model to study strategic decision making in multi-agent systems is the normal-form game. This model can be generalised to allow for an infinite number of pure strategies leading to continuous games. Multi-objective normal-form games are another generalisation that model settings where players receive separate payoffs in more than one objective. We bridge the gap between the two models by providing a theoretical guarantee that a game from one setting can always be transformed to a game in the other. We extend the theoretical results to include guaranteed equivalence of Nash equilibria. The mapping makes it possible to apply algorithms from one field to the other. We demonstrate this by introducing a fictitious play algorithm for multi-objective games and subsequently applying it to two well-known continuous games. We believe the equivalence relation will lend itself to new insights by translating the theoretical guarantees from one formalism to another. Moreover, it may lead to new computational approaches for continuous games when a problem is more naturally solved in the succinct format of multi-objective games.

Submitted 1 March, 2023; v1 submitted 13 January, 2023; originally announced January 2023.

Comments: Accepted to AAMAS 2023

arXiv:2207.03830 [pdf, other]

Safe reinforcement learning for multi-energy management systems with known constraint functions

Authors: Glenn Ceusters, Luis Ramirez Camargo, Rüdiger Franke, Ann Nowé, Maarten Messagie

Abstract: Reinforcement learning (RL) is a promising optimal control technique for multi-energy management systems. It does not require a model a priori, which reduces the upfront and ongoing project-specific engineering effort, and it is capable of learning better representations of the underlying system dynamics. However, vanilla RL does not provide constraint satisfaction guarantees, resulting in various potentially unsafe interactions within its safety-critical environment. In this paper, we present two novel safe RL methods, namely SafeFallback and GiveSafe, where the safety constraint formulation is decoupled from the RL formulation. These provide hard-constraint, rather than soft- and chance-constraint, satisfaction guarantees both during training of a (near) optimal policy (which involves exploratory and exploitative, i.e. greedy, steps) and during deployment of any policy (e.g. random agents or offline trained RL agents). This is achieved without the need to solve a mathematical program, resulting in lower computational power requirements and a more flexible constraint function formulation (no derivative information is required). In a simulated multi-energy systems case study we have shown that both methods start with a significantly higher utility (i.e. useful policy) compared to a vanilla RL benchmark and Optlayer benchmark (94.6% and 82.8% compared to 35.5% and 77.8%) and that the proposed SafeFallback method can even outperform the vanilla RL benchmark (102.9% to 100%). We conclude that both methods are viable safety constraint handling techniques applicable beyond RL, as demonstrated with random policies while still providing hard-constraint guarantees.

Submitted 1 September, 2022; v1 submitted 8 July, 2022; originally announced July 2022.

Comments: 26 pages, 14 figures

arXiv:2204.05036 [pdf, other]

Pareto Conditioned Networks

Authors: Mathieu Reymond, Eugenio Bargiacchi, Ann Nowé

Abstract: In multi-objective optimization, learning all the policies that reach Pareto-efficient solutions is an expensive process. The set of optimal policies can grow exponentially with the number of objectives, and recovering all solutions requires an exhaustive exploration of the entire state space. We propose Pareto Conditioned Networks (PCN), a method that uses a single neural network to encompass all non-dominated policies. PCN associates every past transition with its episode's return. It trains the network such that, when conditioned on this same return, it should reenact said transition. In doing so we transform the optimization problem into a classification problem. We recover a concrete policy by conditioning the network on the desired Pareto-efficient solution. Our method is stable as it learns in a supervised fashion, thus avoiding moving target issues. Moreover, by using a single network, PCN scales efficiently with the number of objectives. Finally, it makes minimal assumptions on the shape of the Pareto front, which makes it suitable to a wider range of problems than previous state-of-the-art multi-objective reinforcement learning algorithms.

Submitted 11 April, 2022; originally announced April 2022.

Comments: Accepted at the International Conference on Autonomous Agents and Multiagent Systems (AAMAS) 2022

arXiv:2204.05027 [pdf, ps, other]

Exploring the Pareto front of multi-objective COVID-19 mitigation policies using reinforcement learning

Authors: Mathieu Reymond, Conor F. Hayes, Lander Willem, Roxana Rădulescu, Steven Abrams, Diederik M. Roijers, Enda Howley, Patrick Mannion, Niel Hens, Ann Nowé, Pieter Libin

Abstract: Infectious disease outbreaks can have a disruptive impact on public health and societal processes. As decision making in the context of epidemic mitigation is hard, reinforcement learning provides a methodology to automatically learn prevention strategies in combination with complex epidemic models. Current research focuses on optimizing policies w.r.t. a single objective, such as the pathogen's attack rate. However, as the mitigation of epidemics involves distinct, and possibly conflicting criteria (i.a., prevalence, mortality, morbidity, cost), a multi-objective approach is warranted to learn balanced policies. To lift this decision-making process to real-world epidemic models, we apply deep multi-objective reinforcement learning and build upon a state-of-the-art algorithm, Pareto Conditioned Networks (PCN), to learn a set of solutions that approximates the Pareto front of the decision problem. We consider the first wave of the Belgian COVID-19 epidemic, which was mitigated by a lockdown, and study different deconfinement strategies, aiming to minimize both COVID-19 cases (i.e., infections and hospitalizations) and the societal burden that is induced by the applied mitigation measures. We contribute a multi-objective Markov decision process that encapsulates the stochastic compartment model that was used to inform policy makers during the COVID-19 epidemic. As these social mitigation measures are implemented in a continuous action space that modulates the contact matrix of the age-structured epidemic model, we extend PCN to this setting. We evaluate the solution returned by PCN, and observe that it correctly learns to reduce the social burden whenever the hospitalization rates are sufficiently low. In this work, we thus show that multi-objective reinforcement learning is attainable in complex epidemiological models and provides essential insights to balance complex mitigation policies.

Submitted 11 April, 2022; originally announced April 2022.

arXiv:2112.12458 [pdf, other]

Local Advantage Networks for Cooperative Multi-Agent Reinforcement Learning

Authors: Raphaël Avalos, Mathieu Reymond, Ann Nowé, Diederik M. Roijers

Abstract: Many recent successful off-policy multi-agent reinforcement learning (MARL) algorithms for cooperative partially observable environments focus on finding factorized value functions, leading to convoluted network structures. Building on the structure of independent Q-learners, our LAN algorithm takes a radically different approach, leveraging a dueling architecture to learn for each agent decentralized best-response policies via individual advantage functions. The learning is stabilized by a centralized critic whose primary objective is to reduce the moving target problem of the individual advantages. The critic, whose network's size is independent of the number of agents, is cast aside after learning. Evaluation on the StarCraft II multi-agent challenge benchmark shows that LAN reaches state-of-the-art performance and is highly scalable with respect to the number of agents, opening up a promising alternative direction for MARL research.

Submitted 26 October, 2023; v1 submitted 23 December, 2021; originally announced December 2021.

Comments: https://openreview.net/forum?id=adpKzWQunW

Journal ref: Transactions on Machine Learning Research - October 2023

arXiv:2112.09655 [pdf, other]

Distillation of RL Policies with Formal Guarantees via Variational Abstraction of Markov Decision Processes (Technical Report)

Authors: Florent Delgrange, Ann Nowé, Guillermo A. Pérez

Abstract: We consider the challenge of policy simplification and verification in the context of policies learned through reinforcement learning (RL) in continuous environments. In well-behaved settings, RL algorithms have convergence guarantees in the limit. While these guarantees are valuable, they are insufficient for safety-critical applications. Furthermore, they are lost when applying advanced techniques such as deep-RL. To recover guarantees when applying advanced RL algorithms to more complex environments with (i) reachability, (ii) safety-constrained reachability, or (iii) discounted-reward objectives, we build upon the DeepMDP framework introduced by Gelada et al. to derive new bisimulation bounds between the unknown environment and a learned discrete latent model of it. Our bisimulation bounds enable the application of formal methods for Markov decision processes. Finally, we show how one can use a policy obtained via state-of-the-art RL to efficiently train a variational autoencoder that yields a discrete latent model with provably approximately correct bisimulation guarantees. Additionally, we obtain a distilled version of the policy for the latent model.

Submitted 14 June, 2022; v1 submitted 17 December, 2021; originally announced December 2021.

Comments: AAAI 2022, technical report including supplementary material (10 pages main text, 14 pages appendix)

arXiv:2112.06500 [pdf, other]

On Nash Equilibria in Normal-Form Games With Vectorial Payoffs

Authors: Willem Röpke, Diederik M. Roijers, Ann Nowé, Roxana Rădulescu

Abstract: We provide an in-depth study of Nash equilibria in multi-objective normal form games (MONFGs), i.e., normal form games with vectorial payoffs. Taking a utility-based approach, we assume that each player's utility can be modelled with a utility function that maps a vector to a scalar utility. In the case of a mixed strategy, it is meaningful to apply such a scalarisation both before calculating the expectation of the payoff vector as well as after. This distinction leads to two optimisation criteria. With the first criterion, players aim to optimise the expected value of their utility function applied to the payoff vectors obtained in the game. With the second criterion, players aim to optimise the utility of expected payoff vectors given a joint strategy. Under this latter criterion, it was shown that Nash equilibria need not exist. Our first contribution is to provide a sufficient condition under which Nash equilibria are guaranteed to exist. Secondly, we show that when Nash equilibria do exist under both criteria, no equilibrium needs to be shared between the two criteria, and even the number of equilibria can differ. Thirdly, we contribute a study of pure strategy Nash equilibria under both criteria. We show that when assuming quasiconvex utility functions for players, the sets of pure strategy Nash equilibria under both optimisation criteria are equivalent. This result is further extended to games in which players adhere to different optimisation criteria. Finally, given these theoretical results, we construct an algorithm to compute all pure strategy Nash equilibria in MONFGs where players have a quasiconvex utility function.

Submitted 16 July, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

arXiv:2111.09191 [pdf, other]

Preference Communication in Multi-Objective Normal-Form Games

Authors: Willem Röpke, Diederik M. Roijers, Ann Nowé, Roxana Rădulescu

Abstract: We consider preference communication in two-player multi-objective normal-form games. In such games, the payoffs resulting from joint actions are vector-valued. Taking a utility-based approach, we assume there exists a utility function for each player which maps vectors to scalar utilities and consider agents that aim to maximise the utility of expected payoff vectors. As agents typically do not know their opponent's utility function or strategy, they must learn policies to interact with each other. Inspired by Stackelberg games, we introduce four novel preference communication protocols to aid agents in arriving at adequate solutions. Each protocol describes a specific approach for one agent to communicate preferences over their actions and how another agent responds. Additionally, to study when communication emerges, we introduce a communication protocol where agents must learn when to communicate. These protocols are subsequently evaluated on a set of five benchmark games against baseline agents that do not communicate. We find that preference communication can alter the learning process and lead to the emergence of cyclic policies which had not been previously observed in this setting. We further observe that the resulting policies can heavily depend on the characteristics of the game that is played. Lastly, we find that communication naturally emerges in both cooperative and self-interested settings.

Submitted 10 June, 2022; v1 submitted 17 November, 2021; originally announced November 2021.

arXiv:2106.13539 [pdf, other]

Dealing with Expert Bias in Collective Decision-Making

Authors: Axel Abels, Tom Lenaerts, Vito Trianni, Ann Nowé

Abstract: Many real-world problems can be formulated as decision-making problems wherein one must repeatedly make an appropriate choice from a set of alternatives. Multiple expert judgements, whether human or artificial, can help in taking correct decisions, especially when exploration of alternative solutions is costly. As expert opinions might deviate, the problem of finding the right alternative can be approached as a collective decision making problem (CDM) via aggregation of independent judgements. Current state-of-the-art approaches focus on efficiently finding the optimal expert, and thus perform poorly if all experts are not qualified or if they are overly biased, thereby potentially derailing the decision-making process. In this paper, we propose a new algorithmic approach based on contextual multi-armed bandit problems (CMAB) to identify and counteract such biased expertise. We explore homogeneous, heterogeneous and polarised expert groups and show that this approach is able to effectively exploit the collective expertise, outperforming state-of-the-art methods, especially when the quality of the provided expertise degrades. Our novel CMAB-inspired approach achieves a higher final performance and does so while converging more rapidly than previous adaptive algorithms.

Submitted 29 August, 2022; v1 submitted 25 June, 2021; originally announced June 2021.

arXiv:2106.06009 [pdf, other]

doi 10.1007/978-3-030-73959-1_15

Synthesising Reinforcement Learning Policies through Set-Valued Inductive Rule Learning

Authors: Youri Coppens, Denis Steckelmacher, Catholijn M. Jonker, Ann Nowé

Abstract: Today's advanced Reinforcement Learning algorithms produce black-box policies, that are often difficult to interpret and trust for a person. We introduce a policy distilling algorithm, building on the CN2 rule mining algorithm, that distills the policy into a rule-based decision system. At the core of our approach is the fact that an RL process does not just learn a policy, a mapping from states to actions, but also produces extra meta-information, such as action values indicating the quality of alternative actions. This meta-information can indicate whether more than one action is near-optimal for a certain state. We extend CN2 to make it able to leverage knowledge about equally-good actions to distill the policy into fewer rules, increasing its interpretability by a person. Then, to ensure that the rules explain a valid, non-degenerate policy, we introduce a refinement algorithm that fine-tunes the rules to obtain good performance when executed in the environment. We demonstrate the applicability of our algorithm on the Mario AI benchmark, a complex task that requires modern reinforcement learning algorithms including neural networks. The explanations we produce capture the learned policy in only a few rules, that allow a person to understand what the black-box agent learned. Source code: https://gitlab.ai.vub.ac.be/yocoppen/svcn2

Submitted 10 June, 2021; originally announced June 2021.

Comments: 17 pages, 4 figures. The final authenticated publication is available online at https://doi.org/10.1007/978-3-030-73959-1_15

Journal ref: Trustworthy AI - Integrating Learning, Optimization and Reasoning (2021), Lecture Notes in Computer Science, vol. 12641, pp. 163-179

arXiv:2104.09785 [pdf, other]

Model-predictive control and reinforcement learning in multi-energy system case studies

Authors: Glenn Ceusters, Román Cantú Rodríguez, Alberte Bouso García, Rüdiger Franke, Geert Deconinck, Lieve Helsen, Ann Nowé, Maarten Messagie, Luis Ramirez Camargo

Abstract: Model-predictive-control (MPC) offers an optimal control technique to establish and ensure that the total operation cost of multi-energy systems remains at a minimum while fulfilling all system constraints. However, this method presumes an adequate model of the underlying system dynamics, which is prone to modelling errors and is not necessarily adaptive. This has an associated initial and ongoing project-specific engineering cost. In this paper, we present an on- and off-policy multi-objective reinforcement learning (RL) approach, that does not assume a model a priori, benchmarking this against a linear MPC (LMPC - to reflect current practice, though non-linear MPC performs better) - both derived from the general optimal control problem, highlighting their differences and similarities. In a simple multi-energy system (MES) configuration case study, we show that a twin delayed deep deterministic policy gradient (TD3) RL agent offers potential to match and outperform the perfect foresight LMPC benchmark (101.5%), while the realistic LMPC, i.e. with imperfect predictions, only achieves 98%. In a more complex MES configuration, the RL agent's performance is generally lower (94.6%), yet still better than the realistic LMPC (88.9%). In both case studies, the RL agents outperformed the realistic LMPC after a training period of 2 years using quarterly interactions with the environment. We conclude that reinforcement learning is a viable optimal control technique for multi-energy systems given adequate constraint handling and pre-training, to avoid unsafe interactions and long training periods, as is proposed in fundamental future work.

Submitted 9 September, 2021; v1 submitted 20 April, 2021; originally announced April 2021.

Comments: 43 pages, 29 figures

arXiv:2103.09568 [pdf, other]

doi 10.1007/s10458-022-09552-y

A Practical Guide to Multi-Objective Reinforcement Learning and Planning

Authors: Conor F. Hayes, Roxana Rădulescu, Eugenio Bargiacchi, Johan Källström, Matthew Macfarlane, Mathieu Reymond, Timothy Verstraeten, Luisa M. Zintgraf, Richard Dazeley, Fredrik Heintz, Enda Howley, Athirai A. Irissappane, Patrick Mannion, Ann Nowé, Gabriel Ramos, Marcello Restelli, Peter Vamplew, Diederik M. Roijers

Abstract: Real-world decision-making tasks are generally complex, requiring trade-offs between multiple, often conflicting, objectives. Despite this, the majority of research in reinforcement learning and decision-theoretic planning either assumes only a single objective, or that multiple objectives can be adequately handled via a simple linear combination. Such approaches may oversimplify the underlying problem and hence produce suboptimal results. This paper serves as a guide to the application of multi-objective methods to difficult problems, and is aimed at researchers who are already familiar with single-objective reinforcement learning and planning methods who wish to adopt a multi-objective perspective on their research, as well as practitioners who encounter multi-objective decision problems in practice. It identifies the factors that may influence the nature of the desired solution, and illustrates by example how these influence the design of multi-objective decision-making systems for complex problems.

Submitted 17 March, 2021; originally announced March 2021.

Journal ref: Auton Agent Multi-Agent Syst 36, 26 (2022)

arXiv:2011.07290 [pdf, other]

Opponent Learning Awareness and Modelling in Multi-Objective Normal Form Games

Authors: Roxana Rădulescu, Timothy Verstraeten, Yijie Zhang, Patrick Mannion, Diederik M. Roijers, Ann Nowé

Abstract: Many real-world multi-agent interactions consider multiple distinct criteria, i.e. the payoffs are multi-objective in nature. However, the same multi-objective payoff vector may lead to different utilities for each participant. Therefore, it is essential for an agent to learn about the behaviour of other agents in the system. In this work, we present the first study of the effects of such opponent modelling on multi-objective multi-agent interactions with non-linear utilities. Specifically, we consider two-player multi-objective normal form games with non-linear utility functions under the scalarised expected returns optimisation criterion. We contribute novel actor-critic and policy gradient formulations to allow reinforcement learning of mixed strategies in this setting, along with extensions that incorporate opponent policy reconstruction and learning with opponent learning awareness (i.e., learning while considering the impact of one's policy when anticipating the opponent's learning step). Empirical results in five different MONFGs demonstrate that opponent learning awareness and modelling can drastically alter the learning dynamics in this setting. When equilibria are present, opponent modelling can confer significant benefits on agents that implement it. When there are no Nash equilibria, opponent learning awareness and modelling allow agents to still converge to meaningful solutions that approximate equilibria.

Submitted 14 November, 2020; originally announced November 2020.

Comments: Under review since 14 November 2020

arXiv:2003.13676 [pdf, other]

Deep reinforcement learning for large-scale epidemic control

Authors: Pieter Libin, Arno Moonens, Timothy Verstraeten, Fabian Perez-Sanjines, Niel Hens, Philippe Lemey, Ann Nowé

Abstract: Epidemics of infectious diseases are an important threat to public health and global economies. Yet, the development of prevention strategies remains a challenging process, as epidemics are non-linear and complex processes. For this reason, we investigate a deep reinforcement learning approach to automatically learn prevention strategies in the context of pandemic influenza. Firstly, we construct a new epidemiological meta-population model, with 379 patches (one for each administrative district in Great Britain), that adequately captures the infection process of pandemic influenza. Our model balances complexity and computational efficiency such that the use of reinforcement learning techniques becomes attainable. Secondly, we set up a ground truth such that we can evaluate the performance of the 'Proximal Policy Optimization' algorithm to learn in a single district of this epidemiological model. Finally, we consider a large-scale problem, by conducting an experiment where we aim to learn a joint policy to control the districts in a community of 11 tightly coupled districts, for which no ground truth can be established. This experiment shows that deep reinforcement learning can be used to learn mitigation policies in complex epidemiological models with a large state space. Moreover, through this experiment, we demonstrate that there can be an advantage to consider collaboration between districts when designing prevention strategies.

Submitted 30 March, 2020; originally announced March 2020.

arXiv:2001.09502 [pdf, other]

An interpretable semi-supervised classifier using two different strategies for amended self-labeling

Authors: Isel Grau, Dipankar Sengupta, Maria M. Garcia Lorenzo, Ann Nowe

Abstract: In the context of some machine learning applications, obtaining data instances is a relatively easy process but labeling them could become quite expensive or tedious. Such scenarios lead to datasets with few labeled instances and a larger number of unlabeled ones. Semi-supervised classification techniques combine labeled and unlabeled data during the learning phase in order to increase the classifier's generalization capability. Regrettably, most successful semi-supervised classifiers do not allow explaining their outcome, thus behaving like black boxes. However, there is an increasing number of problem domains in which experts demand a clear understanding of the decision process. In this paper, we report on an extended experimental study presenting an interpretable self-labeling grey-box classifier that uses a black box to estimate the missing class labels and a white box to explain the final predictions. Two different approaches for amending the self-labeling process are explored: a first one based on the confidence of the black box and the latter one based on measures from Rough Set Theory. The results of the extended experimental study support the interpretability by means of transparency and simplicity of our classifier, while attaining superior prediction rates when compared with state-of-the-art self-labeling classifiers reported in the literature.

Submitted 20 July, 2020; v1 submitted 26 January, 2020; originally announced January 2020.

Comments: Accepted at Special Session on Advances on Explainable Artificial Intelligence, IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2020), IEEE World Congress on Computational Intelligence (WCCI 2020)

arXiv:2001.08177 [pdf, other]

doi 10.1017/S0269888920000351

A utility-based analysis of equilibria in multi-objective normal form games

Authors: Roxana Rădulescu, Patrick Mannion, Yijie Zhang, Diederik M. Roijers, Ann Nowé

Abstract: In multi-objective multi-agent systems (MOMAS), agents explicitly consider the possible tradeoffs between conflicting objective functions. We argue that compromises between competing objectives in MOMAS should be analysed on the basis of the utility that these compromises have for the users of a system, where an agent's utility function maps their payoff vectors to scalar utility values. This utility-based approach naturally leads to two different optimisation criteria for agents in a MOMAS: expected scalarised returns (ESR) and scalarised expected returns (SER). In this article, we explore the differences between these two criteria using the framework of multi-objective normal form games (MONFGs). We demonstrate that the choice of optimisation criterion (ESR or SER) can radically alter the set of equilibria in a MONFG when non-linear utility functions are used.

Submitted 17 January, 2020; originally announced January 2020.

Comments: Under review since 16 January 2020

arXiv:2001.07527 [pdf, other]

Model-based Multi-Agent Reinforcement Learning with Cooperative Prioritized Sweeping

Authors: Eugenio Bargiacchi, Timothy Verstraeten, Diederik M. Roijers, Ann Nowé

Abstract: We present a new model-based reinforcement learning algorithm, Cooperative Prioritized Sweeping, for efficient learning in multi-agent Markov decision processes. The algorithm allows for sample-efficient learning on large problems by exploiting a factorization to approximate the value function. Our approach only requires knowledge about the structure of the problem in the form of a dynamic decision network. Using this information, our method learns a model of the environment and performs temporal difference updates which affect multiple joint states and actions at once. Batch updates are additionally performed which efficiently back-propagate knowledge throughout the factored Q-function. Our method outperforms the state-of-the-art sparse cooperative Q-learning algorithm, both on the well-known SysAdmin benchmark and randomized environments.

Submitted 15 January, 2020; originally announced January 2020.

arXiv:1911.10121 [pdf, other]

Fleet Control using Coregionalized Gaussian Process Policy Iteration

Authors: Timothy Verstraeten, Pieter JK Libin, Ann Nowé

Abstract: In many settings, such as wind farms, multiple machines are instantiated to perform the same task, which is called a fleet. The recent advances with respect to the Internet of Things allow control devices and/or machines to connect through cloud-based architectures in order to share information about their status and environment. Such an infrastructure allows seamless data sharing between fleet members, which could greatly improve the sample-efficiency of reinforcement learning techniques. However, in practice, these machines, while almost identical in design, have small discrepancies due to production errors or degradation, preventing control algorithms from simply aggregating and employing all fleet data. We propose a novel reinforcement learning method that learns to transfer knowledge between similar fleet members and creates member-specific dynamics models for control. Our algorithm uses Gaussian processes to establish cross-member covariances. This is significantly different from standard transfer learning methods, as the focus is not on sharing information over tasks, but rather over system specifications. We demonstrate our approach on two benchmarks and a realistic wind farm setting. Our method significantly outperforms two baseline approaches, namely individual learning and joint learning where all samples are aggregated, in terms of the median and variance of the results.

Submitted 22 November, 2019; originally announced November 2019.

arXiv:1911.10120 [pdf, other]

doi 10.1038/s41598-020-62939-3

Multi-Agent Thompson Sampling for Bandit Applications with Sparse Neighbourhood Structures

Authors: Timothy Verstraeten, Eugenio Bargiacchi, Pieter JK Libin, Jan Helsen, Diederik M Roijers, Ann Nowé

Abstract: Multi-agent coordination is prevalent in many real-world applications. However, such coordination is challenging due to its combinatorial nature. An important observation in this regard is that agents in the real world often only directly affect a limited set of neighbouring agents. Leveraging such loose couplings among agents is key to making coordination in multi-agent systems feasible. In this work, we focus on learning to coordinate. Specifically, we consider the multi-agent multi-armed bandit framework, in which fully cooperative loosely-coupled agents must learn to coordinate their decisions to optimize a common objective. We propose multi-agent Thompson sampling (MATS), a new Bayesian exploration-exploitation algorithm that leverages loose couplings. We provide a regret bound that is sublinear in time and low-order polynomial in the highest number of actions of a single agent for sparse coordination graphs. Additionally, we empirically show that MATS outperforms the state-of-the-art algorithm, MAUCE, on two synthetic benchmarks, and a novel benchmark with Poisson distributions. An example of a loosely-coupled multi-agent system is a wind farm. Coordination within the wind farm is necessary to maximize power production. As upstream wind turbines only affect nearby downstream turbines, we can use MATS to efficiently learn the optimal control mechanism for the farm. To demonstrate the benefits of our method toward applications we apply MATS to a realistic wind farm control task. In this task, wind turbines must coordinate their alignments with respect to the incoming wind vector in order to optimize power production. Our results show that MATS improves significantly upon state-of-the-art coordination methods in terms of performance, demonstrating the value of using MATS in practical applications with sparse neighbourhood structures.

Submitted 7 February, 2020; v1 submitted 22 November, 2019; originally announced November 2019.

Journal ref: Sci Rep 10, 6728 (2020)

arXiv:1910.04824 [pdf, ps, other]

Towards a phylogenetic measure to quantify HIV incidence

Authors: Pieter Libin, Nassim Versbraegen, Ana B. Abecasis, Perpetua Gomes, Tom Lenaerts, Ann Nowé

Abstract: One of the cornerstones in combating the HIV pandemic is being able to assess the current state and evolution of local HIV epidemics. This remains a complex problem, as many HIV infected individuals remain unaware of their infection status, leading to parts of HIV epidemics being undiagnosed and under-reported. To that end, we firstly present a method to learn epidemiological parameters from phylogenetic trees, using approximate Bayesian computation (ABC). The epidemiological parameters learned as a result of applying ABC are subsequently used in epidemiological models that aim to simulate a specific epidemic. Secondly, we continue by describing the development of a tree statistic, rooted in coalescent theory, which we use to relate epidemiological parameters to a phylogenetic tree, by using the simulated epidemics. We show that the presented tree statistic enables differentiation of epidemiological parameters, while only relying on phylogenetic trees, thus enabling the construction of new methods to ascertain the epidemiological state of an HIV epidemic. By using genetic data to infer epidemic sizes, we expect to enhance understanding of the portions of the infected population in which diagnosis rates are low.

Submitted 23 October, 2019; v1 submitted 10 October, 2019; originally announced October 2019.

Comments: Accepted at BNAIC 2019 (Benelux AI conference)

arXiv:1909.13726 [pdf, other]

doi 10.1109/ICTAI.2018.00054

IPC-Net: 3D point-cloud segmentation using deep inter-point convolutional layers

Authors: Felipe Gomez Marulanda, Pieter Libin, Timothy Verstraeten, Ann Nowé

Abstract: Over the last decade, the demand for better segmentation and classification algorithms in 3D spaces has significantly grown due to the popularity of new 3D sensor technologies and advancements in the field of robotics. Point-clouds are one of the most popular representations to store a digital description of 3D shapes. However, point-clouds are stored in irregular and unordered structures, which limits the direct use of segmentation algorithms such as Convolutional Neural Networks. The objective of our work is twofold: First, we aim to provide a full analysis of the PointNet architecture to illustrate which features are being extracted from the point-clouds. Second, to propose a new network architecture called IPC-Net to improve the state-of-the-art point cloud architectures. We show that IPC-Net extracts a larger set of unique features allowing the model to produce more accurate segmentations compared to the PointNet architecture. In general, our approach outperforms PointNet on every family of 3D geometries on which the models were tested. A high generalisation improvement was observed on every 3D shape, especially on the rockets dataset. Our experiments demonstrate that our main contribution, inter-point activation on the network's layers, is essential to accurately segment 3D point-clouds.

Submitted 30 September, 2019; originally announced September 2019.

Journal ref: 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI)

arXiv:1909.02964 [pdf, other]

doi 10.1007/s10458-019-09433-x

Multi-Objective Multi-Agent Decision Making: A Utility-based Analysis and Survey

Authors: Roxana Rădulescu, Patrick Mannion, Diederik M. Roijers, Ann Nowé

Abstract: The majority of multi-agent system (MAS) implementations aim to optimise agents' policies with respect to a single objective, despite the fact that many real-world problem domains are inherently multi-objective in nature. Multi-objective multi-agent systems (MOMAS) explicitly consider the possible trade-offs between conflicting objective functions. We argue that, in MOMAS, such compromises should be analysed on the basis of the utility that these compromises have for the users of a system. As is standard in multi-objective optimisation, we model the user utility using utility functions that map value or return vectors to scalar values. This approach naturally leads to two different optimisation criteria: expected scalarised returns (ESR) and scalarised expected returns (SER). We develop a new taxonomy which classifies multi-objective multi-agent decision making settings, on the basis of the reward structures, and which and how utility functions are applied. This allows us to offer a structured view of the field, to clearly delineate the current state-of-the-art in multi-objective multi-agent decision making approaches and to identify promising directions for future research. Starting from the execution phase, in which the selected policies are applied and the utility for the users is attained, we analyse which solution concepts apply to the different settings in our taxonomy. Furthermore, we define and discuss these solution concepts under both ESR and SER optimisation criteria. We conclude with a summary of our main findings and a discussion of many promising future research directions in multi-objective multi-agent systems.

Submitted 6 September, 2019; originally announced September 2019.

Comments: Under review since 15 May 2019

arXiv:1907.07958 [pdf, other]

Transfer Learning Across Simulated Robots With Different Sensors

Authors: Hélène Plisnier, Denis Steckelmacher, Diederik Roijers, Ann Nowé

Abstract: For a robot to learn a good policy, it often requires expensive equipment (such as sophisticated sensors) and a prepared training environment conducive to learning. However, it is seldom possible to perfectly equip robots for economic reasons, nor to guarantee ideal learning conditions, when deployed in real-life environments. A solution would be to prepare the robot in the lab environment, when all necessary material is available to learn a good policy. After training in the lab, the robot should be able to get by without the expensive equipment that used to be available to it, and yet still be guaranteed to perform well on the field. The transition between the lab (source) and the real-world environment (target) is related to transfer learning, where the state-space between the source and target tasks differ. We tackle a simulated task with continuous states and discrete actions presenting this challenge, using Bootstrapped Dual Policy Iteration, a model-free actor-critic reinforcement learning algorithm, and Policy Shaping. Specifically, we train a BDPI agent, embodied by a virtual robot performing a task in the V-Rep simulator, sensing its environment through several proximity sensors. The resulting policy is then used by a second agent learning the same task in the same environment, but with camera images as input. The goal is to obtain a policy able to perform the task relying on merely camera images.

Submitted 18 July, 2019; originally announced July 2019.

arXiv:1903.11518 [pdf, other]

doi 10.1016/j.rser.2019.03.019

Fleetwide data-enabled reliability improvement of wind turbines

Authors: Timothy Verstraeten, Ann Nowe, Jonathan Keller, Yi Guo, Shuangwen Sheng, Jan Helsen

Abstract: Wind farms are an indispensable driver toward renewable and nonpolluting energy resources. However, as ideal sites are limited, placement in remote and challenging locations results in higher logistics costs and lower average wind speeds. Therefore, it is critical to increase the reliability of the turbines to reduce maintenance costs. Robust implementation requires a thorough understanding of the loads subject to the turbine's control. Yet, such dynamically changing multidimensional loads are uncommon with other machinery, and generally underresearched. Therefore, a multitiered approach is proposed to investigate the load spectrum occurring in wind farms. Our approach relies on both fundamental research using controllable test rigs, as well as analyses of real-world loading conditions in high-frequency supervisory control and data acquisition data. A method is introduced to detect operational zones in wind farm data and link them with load distributions. Additionally, while focused research further investigates the load spectrum, a method is proposed that continuously optimizes the farm's control protocols without the need to fully understand the loads that occur. A case of gearbox failure is investigated based on a vast body of past experiments and suspect loads are identified. Starting from this evidence on the cause and effects of dynamic loads, the potential of our methods is shown by analyzing real-world farm loading conditions on a steady-state case of wake and developing a preventive row-based control protocol for a case of cascading emergency brakes induced by a storm.

Submitted 3 April, 2019; v1 submitted 27 March, 2019; originally announced March 2019.

Comments: 24 pages, 8 figures

Journal ref: Renew Sustain Energy Rev 109 (2019) 428-437

arXiv:1903.04193 [pdf, other]

Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics

Authors: Denis Steckelmacher, Hélène Plisnier, Diederik M. Roijers, Ann Nowé

Abstract: Value-based reinforcement-learning algorithms provide state-of-the-art results in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are limited by their need for an on-policy critic. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free reinforcement-learning algorithm for continuous states and discrete actions, with an actor and several off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we compare to Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable, and unusually robust to its hyper-parameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, and ACKTR, on discrete, continuous and pixel-based tasks. Source code: https://github.com/vub-ai-lab/bdpi.

Submitted 12 June, 2019; v1 submitted 11 March, 2019; originally announced March 2019.

Comments: Accepted at the European Conference on Machine Learning 2019 (ECML)

arXiv:1902.02556 [pdf, other]

The Actor-Advisor: Policy Gradient With Off-Policy Advice

Authors: Hélène Plisnier, Denis Steckelmacher, Diederik M. Roijers, Ann Nowé

Abstract: Actor-critic algorithms learn an explicit policy (actor), and an accompanying value function (critic). The actor performs actions in the environment, while the critic evaluates the actor's current policy. However, despite their stability and promising convergence properties, current actor-critic algorithms do not outperform critic-only ones in practice. We believe that the fact that the critic learns Q^pi, instead of the optimal Q-function Q*, prevents state-of-the-art robust and sample-efficient off-policy learning algorithms from being used. In this paper, we propose an elegant solution, the Actor-Advisor architecture, in which a Policy Gradient actor learns from unbiased Monte-Carlo returns, while being shaped (or advised) by the Softmax policy arising from an off-policy critic. The critic can be learned independently from the actor, using any state-of-the-art algorithm. Being advised by a high-quality critic, the actor quickly and robustly learns the task, while its use of the Monte-Carlo return helps overcome any bias the critic may have. In addition to a new Actor-Critic formulation, the Actor-Advisor, a method that allows an external advisory policy to shape a Policy Gradient actor, can be applied to many other domains. By varying the source of advice, we demonstrate the wide applicability of the Actor-Advisor to three other important subfields of RL: safe RL with backup policies, efficient leverage of domain knowledge, and transfer learning in RL. Our experimental results demonstrate the benefits of the Actor-Advisor compared to state-of-the-art actor-critic methods, illustrate its applicability to the three other application scenarios listed above, and show that many important challenges of RL can now be solved using a single elegant solution.

Submitted 7 February, 2019; originally announced February 2019.

arXiv:1811.11042 [pdf, other]

Bayesian inference of set-point viral load transmission models

Authors: Pieter Libin, Laurens Hernalsteen, Kristof Theys, Perpetua Gomes, Ana Abecasis, Ann Nowe

Abstract: When modelling HIV epidemics, it is important to incorporate set-point viral load and its heritability. As set-point viral load distributions can differ significantly amongst epidemics, it is imperative to account for the observed local variation. This can be done by using a heritability model and fitting it to a local set-point viral load distribution. However, as the fitting procedure needs to take into account the actual transmission dynamics (i.e., social network, sexual behaviour), a complex model is required. Furthermore, in order to use the estimates in subsequent modelling analyses to inform prevention policies, it is important to assess parameter robustness. In order to fit set-point viral load models without the need to capture explicitly the transmission dynamics, we present a new protocol. Firstly, we approximate the transmission network from a phylogeny that was inferred from sequences collected in the local epidemic. Secondly, as this transmission network only comprises a single instance of the transmission network space, and our aim is to assess parameter robustness, we infer the transmission network distribution. Thirdly, we fit the parameters of the selected set-point viral load model on multiple samples from the transmission network distribution using approximate Bayesian inference. Our new protocol enables researchers to fit set-point viral load models in their local context, and diagnose the model parameter's uncertainty. Such parameter estimates are essential to enable subsequent modelling analyses, and thus crucial to improve prevention policies.

Submitted 8 November, 2018; originally announced November 2018.

Comments: Accepted at BNAIC 2018 (Benelux AI conference)

arXiv:1809.07803 [pdf, other]

Dynamic Weights in Multi-Objective Deep Reinforcement Learning

Authors: Axel Abels, Diederik M. Roijers, Tom Lenaerts, Ann Nowé, Denis Steckelmacher

Abstract: Many real-world decision problems are characterized by multiple conflicting objectives which must be balanced based on their relative importance. In the dynamic weights setting the relative importance changes over time and specialized algorithms that deal with such change, such as a tabular Reinforcement Learning (RL) algorithm by Natarajan and Tadepalli (2005), are required. However, this earlier work is not feasible for RL settings that necessitate the use of function approximators. We generalize across weight changes and high-dimensional inputs by proposing a multi-objective Q-network whose outputs are conditioned on the relative importance of objectives and we introduce Diverse Experience Replay (DER) to counter the inherent non-stationarity of the Dynamic Weights setting. We perform an extensive experimental evaluation and compare our methods to adapted algorithms from Deep Multi-Task/Multi-Objective Reinforcement Learning and show that our proposed network in combination with DER dominates these adapted algorithms across weight change scenarios and problem domains.

Submitted 13 May, 2019; v1 submitted 20 September, 2018; originally announced September 2018.

ACM Class: I.2.6

arXiv:1808.04096 [pdf, other]

Directed Policy Gradient for Safe Reinforcement Learning with Human Advice

Authors: Hélène Plisnier, Denis Steckelmacher, Tim Brys, Diederik M. Roijers, Ann Nowé

Abstract: Many currently deployed Reinforcement Learning agents work in an environment shared with humans, be they co-workers, users or clients. It is desirable that these agents adjust to people's preferences, learn faster thanks to their help, and act safely around them. We argue that most current approaches that learn from human feedback are unsafe: rewarding or punishing the agent a-posteriori cannot immediately prevent it from wrong-doing. In this paper, we extend Policy Gradient to make it robust to external directives, that would otherwise break the fundamentally on-policy nature of Policy Gradient. Our technique, Directed Policy Gradient (DPG), allows a teacher or backup policy to override the agent before it acts undesirably, while allowing the agent to leverage human advice or directives to learn faster. Our experiments demonstrate that DPG makes the agent learn much faster than reward-based approaches, while requiring an order of magnitude less advice.

Submitted 13 August, 2018; originally announced August 2018.

Comments: Accepted at the European Workshop on Reinforcement Learning 2018 (EWRL14)

arXiv:1802.07606 [pdf, other]

Ordered Preference Elicitation Strategies for Supporting Multi-Objective Decision Making

Authors: Luisa M Zintgraf, Diederik M Roijers, Sjoerd Linders, Catholijn M Jonker, Ann Nowé

Abstract: In multi-objective decision planning and learning, much attention is paid to producing optimal solution sets that contain an optimal policy for every possible user preference profile. We argue that the step that follows, i.e., determining which policy to execute by maximising the user's intrinsic utility function over this (possibly infinite) set, is under-studied. This paper aims to fill this gap. We build on previous work on Gaussian processes and pairwise comparisons for preference modelling, extend it to the multi-objective decision support scenario, and propose new ordered preference elicitation strategies based on ranking and clustering. Our main contribution is an in-depth evaluation of these strategies using computer and human-based experiments. We show that our proposed elicitation strategies outperform the currently used pairwise methods, and found that users prefer ranking most. Our experiments further show that utilising monotonicity information in GPs by using a linear prior mean at the start and virtual comparisons to the nadir and ideal points, increases performance. We demonstrate our decision support framework in a real-world study on traffic regulation, conducted with the city of Amsterdam.

Submitted 21 February, 2018; originally announced February 2018.

Comments: AAMAS 2018, Source code at https://github.com/lmzintgraf/gp_pref_elicit

arXiv:1711.06299 [pdf, ps, other]

Bayesian Best-Arm Identification for Selecting Influenza Mitigation Strategies

Authors: Pieter Libin, Timothy Verstraeten, Diederik M. Roijers, Jelena Grujic, Kristof Theys, Philippe Lemey, Ann Nowé

Abstract: Pandemic influenza has the epidemic potential to kill millions of people. While various preventive measures exist (i.a., vaccination and school closures), deciding on strategies that lead to their most effective and efficient use remains challenging. To this end, individual-based epidemiological models are essential to assist decision makers in determining the best strategy to curb epidemic spread. However, individual-based models are computationally intensive and it is therefore pivotal to identify the optimal strategy using a minimal amount of model evaluations. Additionally, as epidemiological modeling experiments need to be planned, a computational budget needs to be specified a priori. Consequently, we present a new sampling technique to optimize the evaluation of preventive strategies using fixed budget best-arm identification algorithms. We use epidemiological modeling theory to derive knowledge about the reward distribution which we exploit using Bayesian best-arm identification algorithms (i.e., Top-two Thompson sampling and BayesGap). We evaluate these algorithms in a realistic experimental setting and demonstrate that it is possible to identify the optimal strategy using only a limited number of model evaluations, i.e., 2-to-3 times faster compared to the uniform sampling method, the predominant technique used for epidemiological decision making in the literature. Finally, we contribute and evaluate a statistic for Top-two Thompson sampling to inform the decision makers about the confidence of an arm recommendation.

Submitted 15 June, 2018; v1 submitted 16 November, 2017; originally announced November 2017.

arXiv:1711.03817 [pdf, other]

Learning with Options that Terminate Off-Policy

Authors: Anna Harutyunyan, Peter Vrancx, Pierre-Luc Bacon, Doina Precup, Ann Nowe

Abstract: A temporally abstract action, or an option, is specified by a policy and a termination condition: the policy guides option behavior, and the termination condition roughly determines its length. Generally, learning with longer options (like learning with multi-step returns) is known to be more efficient. However, if the option set for the task is not ideal, and cannot express the primitive optimal policy exactly, shorter options offer more flexibility and can yield a better solution. Thus, the termination condition puts learning efficiency at odds with solution quality. We propose to resolve this dilemma by decoupling the behavior and target terminations, just like it is done with policies in off-policy learning. To this end, we give a new algorithm, Q(β), that learns the solution with respect to any termination condition, regardless of how the options actually terminate. We derive Q(β) by casting learning with options into a common framework with well-studied multi-step off-policy learning. We validate our algorithm empirically, and show that it holds up to its motivating claims.

Submitted 2 December, 2017; v1 submitted 10 November, 2017; originally announced November 2017.

Comments: AAAI 2018

arXiv:1708.06551 [pdf, other]

Reinforcement Learning in POMDPs with Memoryless Options and Option-Observation Initiation Sets

Authors: Denis Steckelmacher, Diederik M. Roijers, Anna Harutyunyan, Peter Vrancx, Hélène Plisnier, Ann Nowé

Abstract: Many real-world reinforcement learning problems have a hierarchical nature, and often exhibit some degree of partial observability. While hierarchy and partial observability are usually tackled separately (for instance by combining recurrent neural networks and options), we show that addressing both problems simultaneously is simpler and more efficient in many cases. More specifically, we make the initiation set of options conditional on the previously-executed option, and show that options with such Option-Observation Initiation Sets (OOIs) are at least as expressive as Finite State Controllers (FSCs), a state-of-the-art approach for learning in POMDPs. OOIs are easy to design based on an intuitive description of the task, lead to explainable policies and keep the top-level and option policies memoryless. Our experiments show that OOIs allow agents to learn optimal policies in challenging POMDPs, while being much more sample-efficient than a recurrent neural network over options.

Submitted 12 September, 2017; v1 submitted 22 August, 2017; originally announced August 2017.

arXiv:1702.08736 [pdf, other]

Analysing Congestion Problems in Multi-agent Reinforcement Learning

Authors: Roxana Rădulescu, Peter Vrancx, Ann Nowé

Abstract: Congestion problems are omnipresent in today's complex networks and represent a challenge in many research domains. In the context of Multi-agent Reinforcement Learning (MARL), approaches like difference rewards and resource abstraction have shown promising results in tackling such problems. Resource abstraction was shown to be an ideal candidate for solving large-scale resource allocation problems in a fully decentralized manner. However, its performance and applicability strongly depends on some, until now, undocumented assumptions. Two of the main congestion benchmark problems considered in the literature are: the Beach Problem Domain and the Traffic Lane Domain. In both settings the highest system utility is achieved when overcrowding one resource and keeping the rest at optimum capacity. We analyse how abstract grouping can promote this behaviour and how feasible it is to apply this approach in a real-world domain (i.e., what assumptions need to be satisfied and what knowledge is necessary). We introduce a new test problem, the Road Network Domain (RND), where the resources are no longer independent, but rather part of a network (e.g., road network), thus choosing one path will also impact the load on other paths having common road segments. We demonstrate the application of state-of-the-art MARL methods for this new congestion model and analyse their performance. RND allows us to highlight an important limitation of resource abstraction and show that the difference rewards approach manages to better capture and inform the agents about the dynamics of the environment.

Submitted 30 March, 2017; v1 submitted 28 February, 2017; originally announced February 2017.

Comments: Adaptive Learning Agents (ALA) Workshop at AAMAS 2017

MSC Class: 68T05 ACM Class: I.2.11

arXiv:1512.05247 [pdf, ps, other]

doi 10.1017/S147106841600003X

Solving stable matching problems using answer set programming

Authors: Sofie De Clercq, Steven Schockaert, Martine De Cock, Ann Nowé

Abstract: Since the introduction of the stable marriage problem (SMP) by Gale and Shapley (1962), several variants and extensions have been investigated. While this variety is useful to widen the application potential, each variant requires a new algorithm for finding the stable matchings. To address this issue, we propose an encoding of the SMP using answer set programming (ASP), which can straightforwardly be adapted and extended to suit the needs of specific applications. The use of ASP also means that we can take advantage of highly efficient off-the-shelf solvers. To illustrate the flexibility of our approach, we show how our ASP encoding naturally allows us to select optimal stable matchings, i.e. matchings that are optimal according to some user-specified criterion. To the best of our knowledge, our encoding offers the first exact implementation to find sex-equal, minimum regret, egalitarian or maximum cardinality stable matchings for SMP instances in which individuals may designate unacceptable partners and ties between preferences are allowed. This paper is under consideration in Theory and Practice of Logic Programming (TPLP).

Submitted 16 December, 2015; originally announced December 2015.

Comments: Under consideration in Theory and Practice of Logic Programming (TPLP). arXiv admin note: substantial text overlap with arXiv:1302.7251

Journal ref: Theory and Practice of Logic Programming 16 (2016) 247-268

arXiv:1502.03248 [pdf, other]

Off-Policy Reward Shaping with Ensembles

Authors: Anna Harutyunyan, Tim Brys, Peter Vrancx, Ann Nowe

Abstract: Potential-based reward shaping (PBRS) is an effective and popular technique to speed up reinforcement learning by leveraging domain knowledge. While PBRS is proven to always preserve optimal policies, its effect on learning speed is determined by the quality of its potential function, which, in turn, depends on both the underlying heuristic and the scale. Knowing which heuristic will prove effective requires testing the options beforehand, and determining the appropriate scale requires tuning, both of which introduce additional sample complexity. We formulate a PBRS framework that improves learning speed, but does not incur extra sample complexity. For this, we propose to simultaneously learn an ensemble of policies, shaped w.r.t. many heuristics and on a range of scales. The target policy is then obtained by voting. The ensemble needs to be able to efficiently and reliably learn off-policy: requirements fulfilled by the recent Horde architecture, which we take as our basis. We demonstrate empirically that (1) our ensemble policy outperforms both the base policy, and its single-heuristic components, and (2) an ensemble over a general range of scales performs at least as well as one with optimally tuned components.

Submitted 23 March, 2015; v1 submitted 11 February, 2015; originally announced February 2015.

Comments: To be presented at ALA-15. Short version to appear at AAMAS-15

arXiv:1405.5358 [pdf, other]

Off-Policy Shaping Ensembles in Reinforcement Learning

Authors: Anna Harutyunyan, Tim Brys, Peter Vrancx, Ann Nowe

Abstract: Recent advances in gradient temporal-difference methods make it possible to learn multiple value functions off-policy in parallel without sacrificing convergence guarantees or computational efficiency. This opens up new possibilities for sound ensemble techniques in reinforcement learning. In this work we propose learning an ensemble of policies related through potential-based shaping rewards. The ensemble induces a combination policy by using a voting mechanism on its components. Learning happens in real time, and we empirically show the combination policy to outperform the individual policies of the ensemble.

Submitted 21 May, 2014; originally announced May 2014.

Comments: Full version of the paper to appear in Proc. ECAI 2014

Showing 1–50 of 51 results for author: Nowé, A