-
A Regret Minimization Approach to Multi-Agent Control
Authors:
Udaya Ghai,
Udari Madhushani,
Naomi Leonard,
Elad Hazan
Abstract:
We study the problem of multi-agent control of a dynamical system with known dynamics and adversarial disturbances. Our study focuses on optimal control without centralized precomputed policies, but rather with adaptive control policies for the different agents that are only equipped with a stabilizing controller. We give a reduction from any (standard) regret minimizing control method to a distributed algorithm. The reduction guarantees that the resulting distributed algorithm has low regret relative to the optimal precomputed joint policy. Our methodology involves generalizing online convex optimization to a multi-agent setting and applying recent tools from nonstochastic control derived for a single agent. We empirically evaluate our method on a model of an overactuated aircraft. We show that the distributed method is robust to failure and to adversarial perturbations in the dynamics.
Submitted 25 February, 2022; v1 submitted 28 January, 2022;
originally announced January 2022.
-
Provably Efficient Multi-Agent Reinforcement Learning with Fully Decentralized Communication
Authors:
Justin Lidard,
Udari Madhushani,
Naomi Ehrich Leonard
Abstract:
A challenge in reinforcement learning (RL) is minimizing the cost of sampling associated with exploration. Distributed exploration reduces sampling complexity in multi-agent RL (MARL). We investigate the benefits to performance in MARL when exploration is fully decentralized. Specifically, we consider a class of online, episodic, tabular $Q$-learning problems under time-varying reward and transition dynamics, in which agents can communicate in a decentralized manner. We show that group performance, as measured by the bound on regret, can be significantly improved through communication when each agent uses a decentralized message-passing protocol, even when limited to sending information up to its $\gamma$-hop neighbors. We prove regret and sample complexity bounds that depend on the number of agents, communication network structure and $\gamma$. We show that incorporating more agents and more information sharing into the group learning scheme speeds up convergence to the optimal policy. Numerical simulations illustrate our results and validate our theoretical claims.
Submitted 2 May, 2022; v1 submitted 14 October, 2021;
originally announced October 2021.
-
Distributed Bandits: Probabilistic Communication on $d$-regular Graphs
Authors:
Udari Madhushani,
Naomi Ehrich Leonard
Abstract:
We study the decentralized multi-agent multi-armed bandit problem for agents that communicate with probability over a network defined by a $d$-regular graph. Every edge in the graph has probabilistic weight $p$ to account for the ($1\!-\!p$) probability of a communication link failure. At each time step, each agent chooses an arm and receives a numerical reward associated with the chosen arm. After each choice, each agent observes the last obtained reward of each of its neighbors with probability $p$. We propose a new Upper Confidence Bound (UCB) based algorithm and analyze how agent-based strategies contribute to minimizing group regret in this probabilistic communication setting. We provide theoretical guarantees that our algorithm outperforms state-of-the-art algorithms. We illustrate our results and validate the theoretical claims using numerical simulations.
Submitted 8 October, 2021; v1 submitted 15 November, 2020;
originally announced November 2020.
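The communication model in the abstract above (each agent pulls an arm, then sees each neighbor's latest reward with probability $p$) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the algorithm proposed in the paper: it uses the textbook UCB1 index, a ring as the $d$-regular graph ($d = 2$), Bernoulli rewards, and independent link draws per observing agent. The function names and all parameter values are hypothetical.

```python
import math
import random


def ring_neighbors(n_agents):
    """Neighbors on a ring, i.e. a 2-regular graph (assumed topology, d = 2)."""
    return {i: [(i - 1) % n_agents, (i + 1) % n_agents] for i in range(n_agents)}


def run_ucb_with_link_failures(means, n_agents=6, horizon=2000, p=0.5, seed=0):
    """Simulate agents running plain UCB1 with probabilistic neighbor observations.

    Assumptions (not from the paper): Bernoulli arms with the given means,
    the standard UCB1 exploration bonus, and an independent success draw with
    probability p each time an agent tries to observe a neighbor.
    Returns the cumulative group regret and the per-agent sample counts.
    """
    rng = random.Random(seed)
    n_arms = len(means)
    neighbors = ring_neighbors(n_agents)
    counts = [[0] * n_arms for _ in range(n_agents)]   # samples seen per arm
    sums = [[0.0] * n_arms for _ in range(n_agents)]   # reward totals per arm
    best = max(means)
    group_regret = 0.0

    for t in range(1, horizon + 1):
        choices = []
        for i in range(n_agents):
            untried = [a for a in range(n_arms) if counts[i][a] == 0]
            if untried:
                arm = untried[0]
            else:
                # UCB1 index: empirical mean plus exploration bonus.
                arm = max(
                    range(n_arms),
                    key=lambda a: sums[i][a] / counts[i][a]
                    + math.sqrt(2.0 * math.log(t) / counts[i][a]),
                )
            reward = 1.0 if rng.random() < means[arm] else 0.0
            choices.append((arm, reward))
            group_regret += best - means[arm]

        for i in range(n_agents):
            # Each agent always records its own pull.
            arm, reward = choices[i]
            counts[i][arm] += 1
            sums[i][arm] += reward
            # A neighbor's latest pull is observed only if the link works.
            for j in neighbors[i]:
                if rng.random() < p:
                    n_arm, n_reward = choices[j]
                    counts[i][n_arm] += 1
                    sums[i][n_arm] += n_reward

    return group_regret, counts


if __name__ == "__main__":
    for p in (0.0, 0.5, 1.0):
        regret, _ = run_ucb_with_link_failures([0.3, 0.5, 0.7], p=p)
        print(f"p={p:.1f}  group regret={regret:.1f}")
```

Running the script for several values of `p` gives a rough feel for how additional neighbor observations change the group regret in this toy setting; it does not reproduce the paper's guarantees.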
-
On Using Hamiltonian Monte Carlo Sampling for Reinforcement Learning Problems in High-dimension
Authors:
Udari Madhushani,
Biswadip Dey,
Naomi Ehrich Leonard,
Amit Chakraborty
Abstract:
Value function based reinforcement learning (RL) algorithms, for example, $Q$-learning, learn optimal policies from datasets of actions, rewards, and state transitions. However, when the underlying state transition dynamics are stochastic and evolve on a high-dimensional space, generating independent and identically distributed (IID) data samples for creating these datasets poses a significant challenge due to the intractability of the associated normalizing integral. In these scenarios, Hamiltonian Monte Carlo (HMC) sampling offers a computationally tractable way to generate data for training RL algorithms. In this paper, we introduce a framework, called \textit{Hamiltonian $Q$-Learning}, that demonstrates, both theoretically and empirically, that $Q$ values can be learned from a dataset generated by HMC samples of actions, rewards, and state transitions. Furthermore, to exploit the underlying low-rank structure of the $Q$ function, Hamiltonian $Q$-Learning uses a matrix completion algorithm for reconstructing the updated $Q$ function from $Q$ value updates over a much smaller subset of state-action pairs. Thus, by providing an efficient way to apply $Q$-learning in stochastic, high-dimensional settings, the proposed approach broadens the scope of RL algorithms for real-world applications.
Submitted 28 March, 2022; v1 submitted 11 November, 2020;
originally announced November 2020.
-
Heterogeneous Explore-Exploit Strategies on Multi-Star Networks
Authors:
Udari Madhushani,
Naomi Leonard
Abstract:
We investigate the benefits of heterogeneity in multi-agent explore-exploit decision making where the goal of the agents is to maximize cumulative group reward. To do so, we study a class of distributed stochastic bandit problems in which agents communicate over a multi-star network and make sequential choices among options in the same uncertain environment. Typically, in multi-agent bandit problems, agents use homogeneous decision-making strategies. However, group performance can be improved by incorporating heterogeneity into the choices agents make, especially when the network graph is irregular, i.e., when agents have different numbers of neighbors. We design and analyze new heterogeneous explore-exploit strategies, using the multi-star as the model irregular network graph. The key idea is to enable center agents to do more exploring than they would do using the homogeneous strategy, as a means of providing more useful data to the peripheral agents. In the case where all agents broadcast their reward values and choices to their neighbors with the same probability, we provide theoretical guarantees that group performance improves under the proposed heterogeneous strategies as compared to under homogeneous strategies. We use numerical simulations to illustrate our results and to validate our theoretical bounds.
Submitted 1 December, 2020; v1 submitted 2 September, 2020;
originally announced September 2020.
-
Distributed Learning: Sequential Decision Making in Resource-Constrained Environments
Authors:
Udari Madhushani,
Naomi Ehrich Leonard
Abstract:
We study cost-effective communication strategies that can be used to improve the performance of distributed learning systems in resource-constrained environments. For distributed learning in sequential decision making, we propose a new cost-effective partial communication protocol. We illustrate that with this protocol the group obtains the same order of performance that it obtains with full communication. Moreover, we prove that under the proposed partial communication protocol the communication cost is $O(\log T)$, where $T$ is the time horizon of the decision-making process. This improves significantly on protocols with full communication, which incur a communication cost that is $O(T)$. We validate our theoretical results using numerical simulations.
Submitted 13 April, 2020;
originally announced April 2020.
-
A Dynamic Observation Strategy for Multi-agent Multi-armed Bandit Problem
Authors:
Udari Madhushani,
Naomi Ehrich Leonard
Abstract:
We define and analyze a multi-agent multi-armed bandit problem in which decision-making agents can observe the choices and rewards of their neighbors under a linear observation cost. Neighbors are defined by a network graph that encodes the inherent observation constraints of the system. We define a cost associated with observations such that at every instance an agent makes an observation it receives a constant observation regret. We design a sampling algorithm and an observation protocol for each agent to maximize its own expected cumulative reward through minimizing expected cumulative sampling regret and expected cumulative observation regret. For our proposed protocol, we prove that total cumulative regret is logarithmically bounded. We verify the accuracy of analytical bounds using numerical simulations.
Submitted 7 April, 2020;
originally announced April 2020.
-
Heterogeneous Stochastic Interactions for Multiple Agents in a Multi-armed Bandit Problem
Authors:
Udari Madhushani,
Naomi Ehrich Leonard
Abstract:
We define and analyze a multi-agent multi-armed bandit problem in which decision-making agents can observe the choices and rewards of their neighbors. Neighbors are defined by a network graph with heterogeneous and stochastic interconnections. These interactions are determined by the sociability of each agent, which corresponds to the probability that the agent observes its neighbors. We design an algorithm for each agent to maximize its own expected cumulative reward and prove performance bounds that depend on the sociability of the agents and the network structure. We use the bounds to predict the rank ordering of agents according to their performance and verify the accuracy analytically and computationally.
Submitted 21 May, 2019;
originally announced May 2019.
-
A Geometric PID Control Framework for Mechanical Systems
Authors:
D. H. S. Maithripala,
T. W. U. Madhushani,
J. M. Berg
Abstract:
These lectures demonstrate the development of a PID control framework for mechanical systems. Based on the observation that mechanical systems are essentially double integrator systems, we generalize the linear PID controller to mechanical systems that have a non-Euclidean configuration space. Specifically, we start by presenting the development of the geometric PID controller for fully actuated mechanical systems and then extend it to a class of underactuated interconnected mechanical systems of practical significance by introducing the notion of feedback regularization. We show that feedback regularization is the mechanical system equivalent to partial feedback linearization. We apply these results for trajectory tracking for several systems of interest in the field of robotics. First, we demonstrate the robust almost-global stability properties of the geometric PID controller developed for fully actuated mechanical systems using simulations and experiments on a multi-rotor-aerial-vehicle. The extension to the class of underactuated interconnected systems allows one to ensure the semi-almost-global locally exponential tracking of the geometric center of a spherical robot on an inclined plane of unknown angle of inclination. The results are demonstrated using simulations for a hoop rolling on an inclined plane and then for a sphere rolling on an inclined plane. The final extension that we present here is that of geometric PID control for holonomically or non-holonomically constrained mechanical systems on Lie groups. The results are demonstrated by ensuring the robust almost global locally exponential tracking of a nontrivial spherical pendulum.
Submitted 14 October, 2016;
originally announced October 2016.
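The starting point named in the abstract above, a linear PID controller acting on a double integrator, can be sketched in the ordinary Euclidean case. This is not the geometric controller from the lectures: it is the scalar textbook loop $\ddot{x} = u + d$ with a constant disturbance $d$, integrated with forward Euler. The gains, the disturbance value, the step size, and the function name are all illustrative assumptions.

```python
def simulate_double_integrator(kp, kd, ki, disturbance=0.5, x0=1.0,
                               dt=0.01, n_steps=3000):
    """Regulate x'' = u + d to x = 0 with u = -kp*x - kd*x' - ki*integral(x).

    Assumed setup (illustrative only): scalar state, constant disturbance,
    forward-Euler integration. Returns the final position.
    """
    x, v, integral = x0, 0.0, 0.0
    for _ in range(n_steps):
        error = x                      # the target position is 0
        integral += error * dt         # accumulated error for the I term
        u = -kp * error - kd * v - ki * integral
        acceleration = u + disturbance
        x += v * dt
        v += acceleration * dt
    return x


if __name__ == "__main__":
    # PD only: the constant disturbance leaves an offset of d / kp.
    print("PD  final position:", simulate_double_integrator(4.0, 4.0, 0.0))
    # PID: the integral term absorbs the constant disturbance.
    print("PID final position:", simulate_double_integrator(4.0, 4.0, 1.0))
```

With these gains the closed-loop characteristic polynomial is $s^3 + 4s^2 + 4s + 1$, which is stable, so the PID run settles near zero while the PD run settles near $d / k_p = 0.125$. That contrast is the Euclidean version of the robustness to constant disturbances that integral action provides.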
-
Feedback Regularization and Geometric PID Control for Trajectory Tracking of Coupled Mechanical Systems: Hoop Robots on an Inclined Plane
Authors:
T. W. U. Madhushani,
D. H. S. Maithripala,
J. M. Berg
Abstract:
This paper applies geometric PID control for asymptotic tracking of a desired trajectory by a hoop robot in the presence of disturbances and uncertainties. The hoop robot, consisting of a circular body rolling without slip along a one-dimensional surface, is a planar analog of a spherical robot. A variety of coupled mechanical systems may be used to actuate the hoop robot. This paper specifically considers two different actuators, one a simple pendulum and the other an internal cart. The geometric PID controller requires the plant to be a mechanical system, and the hoop robot does not satisfy this condition. Therefore, a geometric inner loop is presented that gives the hoop robot the required structure. This procedure is here referred to as feedback regularization. Feedback regularization--in contrast to feedback linearization--is coordinate independent, and hence reflects the fundamental system structure. Note also that the resulting mechanical system is nonlinear and underactuated. Subsequently, the geometric PID outer loop guarantees almost-semiglobal tracking with locally exponential convergence, and the integral action of the PID guarantees robustness to constant disturbances and parameter uncertainties, including constant inclination of the rolling surface. The complete tracking controller is the composition of the two coordinate-independent loops, and therefore is also coordinate independent.
Submitted 26 February, 2017; v1 submitted 29 September, 2016;
originally announced September 2016.
-
Semi-globally Exponential Trajectory Tracking for a Class of Spherical Robots
Authors:
T. W. U. Madhushani,
D. H. S. Maithripala,
J. V. Wijayakulasooriya,
J. M. Berg
Abstract:
A spherical robot consists of an externally spherical rigid body rolling on a two-dimensional surface, actuated by an auxiliary mechanism. For a class of actuation mechanisms, we derive a controller for the geometric center of the sphere to asymptotically track any sufficiently smooth reference trajectory, with robustness to bounded, constant uncertainties in the inertial properties of the sphere and actuation mechanism, and to constant disturbance forces including, for example, from constant inclination of the rolling surface. The sphere and actuator are modeled as distinct systems, coupled by reaction forces. It is assumed that the actuator can provide three independent control torques, and that the actuator center of mass remains at a constant distance from the geometric center of the sphere. We show that a necessary and sufficient condition for such a controller to exist is that for any constant disturbance torque acting on the sphere there is a constant input such that the sphere and the actuator mechanism have a stable relative equilibrium. A geometric PID controller guarantees robust, semi-global, locally exponential stability for the position tracking error of the geometric center of the sphere, while ensuring that actuator velocities are bounded.
Submitted 1 March, 2017; v1 submitted 4 August, 2016;
originally announced August 2016.