-
Linear Convergence of Independent Natural Policy Gradient in Games with Entropy Regularization
Authors:
Youbang Sun,
Tao Liu,
P. R. Kumar,
Shahin Shahrampour
Abstract:
This work focuses on the entropy-regularized independent natural policy gradient (NPG) algorithm in multi-agent reinforcement learning. In this work, agents are assumed to have access to an oracle with exact policy evaluation and seek to maximize their respective independent rewards. Each individual's reward is assumed to depend on the actions of all the agents in the multi-agent system, leading to a game between agents. We assume all agents make decisions under a policy with bounded rationality, which is enforced by the introduction of entropy regularization. In practice, a smaller regularization implies the agents are more rational and behave closer to Nash policies. On the other hand, agents with larger regularization act more randomly, which ensures more exploration. We show that, under sufficient entropy regularization, the dynamics of this system converge at a linear rate to the quantal response equilibrium (QRE). Although regularization assumptions prevent the QRE from approximating a Nash equilibrium, our findings apply to a wide range of games, including cooperative, potential, and two-player matrix games. We also provide extensive empirical results on multiple games (including Markov games) as a verification of our theoretical analysis.
Submitted 4 May, 2024;
originally announced May 2024.
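As a rough illustration of the dynamics described in the entry above, the following sketch runs an entropy-regularized independent NPG update on a two-player matrix game and measures how far the result is from a quantal response equilibrium (QRE). The multiplicative-weights form of the update (written here on logits), the payoff matrices `A` and `B`, and the values of `tau` and `eta` are assumptions chosen for illustration; they are not taken from the paper.

```python
import math

def softmax(z):
    # Numerically stable softmax over a list of logits.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def expected_rewards(payoff, opp_policy):
    # payoff[a][b]: reward for own action a against opponent action b.
    return [sum(r * q for r, q in zip(row, opp_policy)) for row in payoff]

def independent_npg(A, B, tau, eta, iters):
    # Each player updates its own logits using only its own expected rewards:
    #   z <- (1 - eta * tau) * z + eta * r
    # which matches pi_new proportional to pi^(1 - eta*tau) * exp(eta * r).
    z1 = [0.0] * len(A)
    z2 = [0.0] * len(B)
    for _ in range(iters):
        p1, p2 = softmax(z1), softmax(z2)
        r1 = expected_rewards(A, p2)
        r2 = expected_rewards(B, p1)
        z1 = [(1 - eta * tau) * z + eta * r for z, r in zip(z1, r1)]
        z2 = [(1 - eta * tau) * z + eta * r for z, r in zip(z2, r2)]
    return softmax(z1), softmax(z2)

def qre_residual(A, B, p1, p2, tau):
    # At a QRE each policy equals softmax(expected reward / tau).
    q1 = softmax([r / tau for r in expected_rewards(A, p2)])
    q2 = softmax([r / tau for r in expected_rewards(B, p1)])
    return max(abs(x - y) for x, y in zip(p1 + p2, q1 + q2))

# Hypothetical 2x2 game; B is indexed by player 2's own action first.
A = [[1.0, 0.0], [0.0, 0.5]]
B = [[0.2, 0.9], [0.7, 0.1]]
p1, p2 = independent_npg(A, B, tau=2.0, eta=0.1, iters=2000)
residual = qre_residual(A, B, p1, p2, tau=2.0)
```

With rewards in [0, 1] and a fairly large `tau`, the logit map in this sketch is a contraction, so `residual` shrinks geometrically with the iteration count. This is consistent with the linear rate stated in the abstract, though the sketch is not a reproduction of the paper's analysis.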
-
Provably Fast Convergence of Independent Natural Policy Gradient for Markov Potential Games
Authors:
Youbang Sun,
Tao Liu,
Ruida Zhou,
P. R. Kumar,
Shahin Shahrampour
Abstract:
This work studies an independent natural policy gradient (NPG) algorithm for the multi-agent reinforcement learning problem in Markov potential games. It is shown that, under mild technical assumptions and the introduction of the \textit{suboptimality gap}, the independent NPG method with an oracle providing exact policy evaluation asymptotically reaches an $ε$-Nash Equilibrium (NE) within $\mathcal{O}(1/ε)$ iterations. This improves upon the previous best result of $\mathcal{O}(1/ε^2)$ iterations and is of the same order, $\mathcal{O}(1/ε)$, that is achievable for the single-agent case. Empirical results for a synthetic potential game and a congestion game are presented to verify the theoretical bounds.
Submitted 27 October, 2023; v1 submitted 15 October, 2023;
originally announced October 2023.
-
Natural Actor-Critic for Robust Reinforcement Learning with Function Approximation
Authors:
Ruida Zhou,
Tao Liu,
Min Cheng,
Dileep Kalathil,
P. R. Kumar,
Chao Tian
Abstract:
We study robust reinforcement learning (RL) with the goal of determining a well-performing policy that is robust against model mismatch between the training simulator and the testing environment. Previous policy-based robust RL algorithms mainly focus on the tabular setting under uncertainty sets that facilitate robust policy evaluation, but are no longer tractable when the number of states scales up. To this end, we propose two novel uncertainty set formulations, one based on double sampling and the other on an integral probability metric. Both make large-scale robust RL tractable even when one only has access to a simulator. We propose a robust natural actor-critic (RNAC) approach that incorporates the new uncertainty sets and employs function approximation. We provide finite-time convergence guarantees for the proposed RNAC algorithm to the optimal robust policy within the function approximation error. Finally, we demonstrate the robust performance of the policy learned by our proposed RNAC approach in multiple MuJoCo environments and a real-world TurtleBot navigation task.
Submitted 10 December, 2023; v1 submitted 17 July, 2023;
originally announced July 2023.
-
Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective Reinforcement Learning
Authors:
Ruida Zhou,
Tao Liu,
Dileep Kalathil,
P. R. Kumar,
Chao Tian
Abstract:
We study policy optimization for Markov decision processes (MDPs) with multiple reward value functions, which are to be jointly optimized according to given criteria such as proportional fairness (smooth concave scalarization), hard constraints (constrained MDP), and max-min trade-off. We propose an Anchor-changing Regularized Natural Policy Gradient (ARNPG) framework, which can systematically incorporate ideas from well-performing first-order methods into the design of policy optimization algorithms for multi-objective MDP problems. Theoretically, the designed algorithms based on the ARNPG framework achieve $\tilde{O}(1/T)$ global convergence with exact gradients. Empirically, the ARNPG-guided algorithms also demonstrate superior performance compared to some existing policy gradient-based approaches in both exact gradients and sample-based scenarios.
Submitted 18 October, 2022; v1 submitted 10 June, 2022;
originally announced June 2022.
-
Augmented RBMLE-UCB Approach for Adaptive Control of Linear Quadratic Systems
Authors:
Akshay Mete,
Rahul Singh,
P. R. Kumar
Abstract:
We consider the problem of controlling an unknown stochastic linear system with quadratic costs, called the adaptive LQ control problem. We re-examine an approach called "Reward Biased Maximum Likelihood Estimate" (RBMLE) that was proposed more than forty years ago, and which predates the "Upper Confidence Bound" (UCB) method as well as the definition of "regret" for bandit problems. It simply added a term favoring parameters with larger rewards to the criterion for parameter estimation. We show how the RBMLE and UCB methods can be reconciled, and thereby propose an Augmented RBMLE-UCB algorithm that combines the penalty of the RBMLE method with the constraints of the UCB method, uniting the two approaches to optimism in the face of uncertainty. We establish that theoretically, this method retains $\tilde{\mathcal{O}}(\sqrt{T})$ regret, the best-known so far. We further compare the empirical performance of the proposed Augmented RBMLE-UCB and the standard RBMLE (without the augmentation) with UCB, Thompson Sampling, Input Perturbation, Randomized Certainty Equivalence and StabL on many real-world examples, including flight control of a Boeing 747 and an unmanned aerial vehicle. We perform extensive simulation studies showing that the Augmented RBMLE consistently outperforms UCB, Thompson Sampling and StabL by a huge margin, while it is marginally better than Input Perturbation and moderately better than Randomized Certainty Equivalence.
Submitted 24 March, 2023; v1 submitted 25 January, 2022;
originally announced January 2022.
-
Policy Optimization for Constrained MDPs with Provable Fast Global Convergence
Authors:
Tao Liu,
Ruida Zhou,
Dileep Kalathil,
P. R. Kumar,
Chao Tian
Abstract:
We address the problem of finding the optimal policy of a constrained Markov decision process (CMDP) using a gradient descent-based algorithm. Previous results have shown that a primal-dual approach can achieve an $\mathcal{O}(1/\sqrt{T})$ global convergence rate for both the optimality gap and the constraint violation. We propose a new algorithm, called policy mirror descent-primal dual (PMD-PD), that can provably achieve a faster $\mathcal{O}(\log(T)/T)$ convergence rate for both the optimality gap and the constraint violation. For the primal (policy) update, the PMD-PD algorithm utilizes a modified value function and performs natural policy gradient steps, each of which is equivalent to a mirror descent step with appropriate regularization. For the dual update, the PMD-PD algorithm uses modified Lagrange multipliers to ensure a faster convergence rate. We also present two extensions of this approach to the settings with zero constraint violation and sample-based estimation. Experimental results demonstrate the faster convergence rate and the better performance of the PMD-PD algorithm compared with existing policy gradient-based algorithms.
Submitted 3 February, 2022; v1 submitted 31 October, 2021;
originally announced November 2021.
-
Learning in Networked Control Systems
Authors:
Rahul Singh,
P. R. Kumar
Abstract:
We design an adaptive controller (learning rule) for a networked control system (NCS) in which data packets containing control information are transmitted across a lossy wireless channel. We propose Upper Confidence Bounds for Networked Control Systems (UCB-NCS), a learning rule that maintains confidence intervals for the estimates of the plant parameters $(A_{(\star)},B_{(\star)})$ and the channel reliability $p_{(\star)}$, and utilizes the principle of optimism in the face of uncertainty while making control decisions. We provide non-asymptotic performance guarantees for UCB-NCS by analyzing its "regret", i.e., the performance gap from the scenario in which $(A_{(\star)},B_{(\star)},p_{(\star)})$ are known to the controller. We show that with high probability the regret can be upper-bounded as $\tilde{O}\left(C\sqrt{T}\right)$ (here $\tilde{O}$ hides logarithmic factors), where $T$ is the operating time horizon of the system, and $C$ is a problem-dependent constant.
Submitted 21 March, 2020;
originally announced March 2020.
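To make the idea of a confidence interval for the channel reliability in the entry above concrete, here is a minimal sketch that treats packet delivery as a Bernoulli event and builds a Hoeffding-style interval around the empirical delivery rate. The function name `reliability_interval`, the use of Hoeffding's inequality, and the choice of the upper end as the "optimistic" value are assumptions for illustration; the construction used by UCB-NCS in the paper may differ.

```python
import math

def reliability_interval(successes, attempts, delta):
    """Hoeffding confidence interval for a Bernoulli success probability.

    With probability at least 1 - delta, the true delivery probability
    lies in the returned (low, high) interval.
    """
    if attempts == 0:
        # No data yet: every value in [0, 1] is still plausible.
        return 0.0, 1.0
    p_hat = successes / attempts
    radius = math.sqrt(math.log(2.0 / delta) / (2.0 * attempts))
    return max(0.0, p_hat - radius), min(1.0, p_hat + radius)

# Hypothetical counts: 80 of 100 packets delivered, 95% confidence.
low, high = reliability_interval(successes=80, attempts=100, delta=0.05)

# An optimistic learner would plan with the most favorable plausible value.
optimistic_p = high
```

The interval narrows at a rate of order $1/\sqrt{n}$ in the number of transmission attempts $n$, which is the usual source of $\sqrt{T}$-type regret bounds in optimism-based methods.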
-
Optimal Control of Thermostatic Loads for Planning Aggregate Consumption: Characterization of Solution and Explicit Strategies
Authors:
Fernando A. C. C. Fontes,
Abhishek Halder,
Jorge Becerril,
P. R. Kumar
Abstract:
We consider the problem of planning the aggregate energy consumption for a set of thermostatically controlled loads for demand response, accounting for the price forecast trajectory and thermal comfort constraints. We address this as a continuous-time optimal control problem, and analytically characterize the structure of its solution in the general case. In the special case when the price forecast is monotone and the loads have equal dynamics, we show that it is possible to determine the solution in an explicit form. Taking this fact into account, we handle the non-monotone price case by considering several subproblems, each corresponding to a time subinterval where the price function is monotone, and then allocating to each subinterval a fraction of the total energy budget. This way, for each time subinterval, the problem reduces to a simple convex optimization problem with a scalar decision variable, for which a descent direction is also known. The price forecasts for the day-ahead energy market typically have no more than four monotone segments, so the resulting optimization problem can be solved efficiently with modest computational resources.
Submitted 8 May, 2019; v1 submitted 3 March, 2019;
originally announced March 2019.
-
Optimal Decentralized Dynamic Policies for Video Streaming over Wireless Channels
Authors:
Rahul Singh,
P. R. Kumar
Abstract:
The problem addressed is that of optimally controlling, in a decentralized fashion, the download of mobile video, which is expected to comprise 75% of total mobile data traffic by 2020. The server can dynamically choose which packets to download to clients, from among several packets which encode their videos at different resolutions, as well as the power levels of their transmissions. This allows it to control packet delivery probabilities, and thereby, for example, avert imminent video outages at clients. It must however respect the access point's constraints on bandwidth and average transmission power. The goal is to maximize video "Quality of Experience" (QoE), which depends on several factors such as (i) outage duration when the video playback buffer is empty, (ii) number of outage periods, (iii) how many frames downloaded are of lower resolution, (iv) temporal variations in resolution, etc.
It is shown that there exists an optimal decentralized solution where the access point (AP) announces the price of energy, and each client distributedly and dynamically maximizes its own QoE subject to the cost of energy. A distributed iterative algorithm to solve for the optimal decentralized policy is also presented. Further, for the client-level QoE optimization, the optimal choice of video resolution and power level of packet transmissions has a simple monotonicity and threshold structure vis-a-vis the video playback buffer level. When the number of orthogonal channels is less than the number of clients, there is an index policy for prioritizing packet transmissions. When the AP has to simply choose which clients' packets to transmit, the index policy is asymptotically optimal as the number of channels is scaled up with clients.
Submitted 20 February, 2019;
originally announced February 2019.
-
Optimal Power Consumption for Demand Response of Thermostatically Controlled Loads
Authors:
Abhishek Halder,
Xinbo Geng,
Fernando A. C. C. Fontes,
P. R. Kumar,
Le Xie
Abstract:
We consider the problem of determining the optimal aggregate power consumption of a population of thermostatically controlled loads. This is motivated by the problem of synthesizing the demand response for a load serving entity (LSE) serving a population of such customers. We show how the LSE can opportunistically design the aggregate reference consumption to minimize its energy procurement cost, given day-ahead price, load and ambient temperature forecasts, while respecting each individual load's comfort range constraints. The resulting synthesis problem is shown to be amenable to optimal control techniques, but computationally difficult otherwise. Numerical simulations elucidate how the LSE can use the optimal aggregate power consumption trajectory thus computed, for the purpose of demand response.
Submitted 25 June, 2018; v1 submitted 23 September, 2016;
originally announced September 2016.
-
Belief Space Planning Simplified: Trajectory-Optimized LQG (T-LQG) (Extended Report)
Authors:
Mohammadhussein Rafieisakhaei,
Suman Chakravorty,
P. R. Kumar
Abstract:
Planning under motion and observation uncertainties requires solution of a stochastic control problem in the space of feedback policies. In this paper, we reduce the general (n^2+n)-dimensional belief space planning problem to an (n)-dimensional problem by obtaining a Linear Quadratic Gaussian (LQG) design with the best nominal performance. Then, by taking the underlying trajectory of the LQG controller as the decision variable, we pose a coupled design of trajectory, estimator, and controller through a Non-Linear Program (NLP) that can be solved by a general NLP solver. We prove that under a first-order approximation and a careful usage of the separation principle, our approximations are valid. We give an analysis of the existing major belief space planning methods and show that our algorithm has the lowest computational burden. Finally, we extend our solution to contain general state and control constraints. Our simulation results support our design.
Submitted 11 August, 2016; v1 submitted 9 August, 2016;
originally announced August 2016.
-
Architecture and Algorithms for Privacy Preserving Thermal Inertial Load Management by A Load Serving Entity
Authors:
Abhishek Halder,
Xinbo Geng,
P. R. Kumar,
Le Xie
Abstract:
Motivated by the growing importance of demand response in modern power system operations, we propose an architecture and supporting algorithms for privacy preserving thermal inertial load management as a service provided by the load serving entity (LSE). We focus on an LSE managing a population of its customers' air conditioners, and propose a contractual model where the LSE guarantees quality of service to each customer in terms of keeping their indoor temperature trajectories within respective bands around the desired individual comfort temperatures. We show how the LSE can price the contracts differentiated by the flexibility embodied by the width of the specified bands. We address architectural questions of (i) how the LSE can strategize its energy procurement based on price and ambient temperature forecasts, (ii) how an LSE can close the real time control loop at the aggregate level while providing individual comfort guarantees to loads, without ever measuring the states of an air conditioner for privacy reasons. Control algorithms to enable our proposed architecture are given, and their efficacy is demonstrated on real data.
Submitted 29 November, 2016; v1 submitted 30 June, 2016;
originally announced June 2016.
-
Dynamic Watermarking: Active Defense of Networked Cyber-Physical Systems
Authors:
Bharadwaj Satchidanandan,
P. R. Kumar
Abstract:
The coming decades may see the large scale deployment of networked cyber-physical systems to address global needs in areas such as energy, water, healthcare, and transportation. However, as recent events have shown, such systems are vulnerable to cyber attacks. Being safety critical, their disruption or misbehavior can cause economic losses or injuries and loss of life. It is therefore important to secure such networked cyber-physical systems against attacks. In the absence of credible security guarantees, there will be resistance to the proliferation of cyber-physical systems, which are much needed to meet global needs in critical infrastructures and services.
This paper addresses the problem of secure control of networked cyber-physical systems. This problem is different from the problem of securing the communication network, since cyber-physical systems at their very essence need sensors and actuators that interface with the physical plant, and malicious agents may tamper with sensors or actuators, as recent attacks have shown.
We consider physical plants that are being controlled by multiple actuators and sensors communicating over a network, where some sensors could be "malicious," meaning that they may not report the measurements that they observe. We address a general technique by which the actuators can detect the actions of malicious sensors in the system, and disable closed-loop control based on their information. This technique, called "watermarking," employs the technique of actuators injecting private excitation into the system which will reveal malicious tampering with signals. We show how such an active defense can be used to secure networked systems of sensors and actuators.
Submitted 27 June, 2016;
originally announced June 2016.
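The watermarking idea in the entry above, where an actuator injects a private excitation and checks whether the reported measurements carry its imprint, can be sketched on a scalar plant. Everything specific here is an illustrative assumption rather than the paper's construction: the plant `x[t+1] = a*x[t] + b*e[t] + w[t]`, the attacker that reports a fictitious trajectory generated without the watermark, the correlation statistic, and the parameter values.

```python
import random

def watermark_statistic(reported, watermark, a, b):
    """Average of e[t] * (z[t+1] - a*z[t] - b*e[t]) over the run.

    For honest reports the residual is just process noise, which is
    independent of the private watermark e[t], so the average is near 0.
    """
    n = len(watermark)
    total = 0.0
    for t in range(n):
        residual = reported[t + 1] - a * reported[t] - b * watermark[t]
        total += watermark[t] * residual
    return total / n

def simulate(a, b, steps, seed, malicious):
    """Return (reported measurements, private watermark sequence)."""
    rng = random.Random(seed)
    watermark = [rng.gauss(0.0, 1.0) for _ in range(steps)]
    x = 0.0     # true plant state, driven by the watermark
    fake = 0.0  # attacker's fictitious state, which never sees the watermark
    reported = [0.0]
    for t in range(steps):
        x = a * x + b * watermark[t] + rng.gauss(0.0, 1.0)
        fake = a * fake + rng.gauss(0.0, 1.0)
        reported.append(fake if malicious else x)
    return reported, watermark

a, b = 0.5, 1.0
z_honest, e_honest = simulate(a, b, steps=20000, seed=1, malicious=False)
z_attack, e_attack = simulate(a, b, steps=20000, seed=1, malicious=True)
honest_stat = watermark_statistic(z_honest, e_honest, a, b)
attack_stat = watermark_statistic(z_attack, e_attack, a, b)
```

In this sketch the honest statistic stays close to zero, while the attacker's reports, which lack the watermark's contribution, push the statistic toward `-b` times the watermark variance. A detector could flag a sensor when the statistic leaves a small band around zero; the width of that band is a design choice not specified here.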
-
Throughput Optimal Decentralized Scheduling of Multi-Hop Networks with End-to-End Deadline Constraints: Unreliable Links
Authors:
Rahul Singh,
P. R. Kumar
Abstract:
We consider unreliable multi-hop networks serving multiple flows in which packets not delivered to their destination nodes by their deadlines are dropped. We address the design of policies for routing and scheduling packets that optimize any specified weighted average of the throughputs of the flows. We provide a new approach which directly yields an optimal distributed scheduling policy that attains any desired maximal timely-throughput vector under average-power constraints on the nodes. It pursues a novel intrinsically stochastic decomposition of the Lagrangian of the constrained network-wide MDP rather than of the fluid model. All decisions regarding a packet's transmission scheduling, transmit power level, and routing, are completely distributed, based solely on the age of the packet, not requiring any knowledge of network state or queue lengths at any of the nodes. Global coordination is achieved through a tractably computable "price" for transmission energy. This price is different from that used to derive the backpressure policy where price corresponds to queue lengths. A quantifiably near-optimal policy is provided if nodes have peak-power constraints.
Submitted 5 June, 2016;
originally announced June 2016.
-
Decentralized Control via Dynamic Stochastic Prices: The Independent System Operator Problem
Authors:
Rahul Singh,
P. R. Kumar,
Le Xie
Abstract:
A smart grid connects wind or solar or storage farms, fossil fuel plants, industrial or commercial loads, or load serving entities, modeled as stochastic dynamical systems. In each time period, they consume or supply electrical energy, with the constraint that total generation equals consumption. Each agent's utility is either the benefit accrued from consumption, or the negative of generation cost. The Independent System Operator has to maximize their sum, the social welfare, without agents revealing their dynamic models or utilities. It has to announce prices after interacting with agents via bids.
If agents observe and know the laws of uncertainties affecting other agents, then there is an iterative price and bid interaction that leads to the maximum social welfare attainable if agents pooled their information.
In the important case where agents are LQG systems not even knowing of the existence of other agents, the bid and price iteration is dramatically simple, exchanging time vectors of future prices and consumptions or generations at each time step. State dependent bidding is not needed. This solution of the decentralized stochastic control problem may be of economic importance in power systems, and of broader interest in general equilibrium theory of economics for stochastic dynamic agents.
Submitted 29 June, 2016; v1 submitted 28 May, 2016;
originally announced May 2016.