-
A constrained optimization perspective on actor critic algorithms and application to network routing
Authors:
Prashanth L. A.,
H. L. Prasad,
Shalabh Bhatnagar,
Prakash Chandra
Abstract:
We propose a novel actor-critic algorithm with guaranteed convergence to an optimal policy for a discounted reward Markov decision process. The actor incorporates a descent direction that is motivated by the solution of a certain non-linear optimization problem. We also discuss an extension to incorporate function approximation and demonstrate the practicality of our algorithms on a network routing application.
Submitted 28 July, 2015;
originally announced July 2015.
-
A Study of Gradient Descent Schemes for General-Sum Stochastic Games
Authors:
H. L. Prasad,
Shalabh Bhatnagar
Abstract:
Zero-sum stochastic games are easy to solve as they can be cast as simple Markov decision processes. This is, however, not the case with general-sum stochastic games. A fairly general optimization problem formulation is available for general-sum stochastic games by Filar and Vrieze [2004]. However, the optimization problem there has a non-linear objective and non-linear constraints with special structure. Since the gradients of both the objective and the constraints of this optimization problem are well defined, gradient-based schemes seem to be a natural choice. We discuss a gradient scheme tuned for two-player stochastic games. We show in simulations that this scheme indeed converges to a Nash equilibrium for a simple terrain exploration problem modelled as a general-sum stochastic game. However, it turns out that only global minima of the optimization problem correspond to Nash equilibria of the underlying general-sum stochastic game, while gradient schemes only guarantee convergence to local minima. We then provide important necessary conditions for gradient schemes to converge to Nash equilibria in general-sum stochastic games.
Submitted 30 June, 2015;
originally announced July 2015.
-
Actor-Critic Algorithms for Learning Nash Equilibria in N-player General-Sum Games
Authors:
H. L. Prasad,
L. A. Prashanth,
Shalabh Bhatnagar
Abstract:
We consider the problem of finding stationary Nash equilibria (NE) in a finite discounted general-sum stochastic game. We first generalize a non-linear optimization problem from Filar and Vrieze [2004] to an $N$-player setting and break down this problem into simpler sub-problems that ensure there is no Bellman error for a given state and an agent. We then provide a characterization of solution points of these sub-problems that correspond to Nash equilibria of the underlying game and, for this purpose, we derive a set of necessary and sufficient SG-SP (Stochastic Game - Sub-Problem) conditions. Using these conditions, we develop two actor-critic algorithms: OFF-SGSP (model-based) and ON-SGSP (model-free). Both algorithms use a critic that estimates the value function for a fixed policy and an actor that performs descent in the policy space using a descent direction that avoids local minima. We establish that both algorithms converge, in self-play, to the equilibria of a certain ordinary differential equation (ODE), whose stable limit points coincide with stationary NE of the underlying general-sum stochastic game. On a single-state non-generic game (see Hart and Mas-Colell [2005]) as well as on a synthetic two-player game setup with $810,000$ states, we establish that ON-SGSP consistently outperforms the NashQ [Hu and Wellman, 2003] and FFQ [Littman, 2001] algorithms.
Submitted 2 July, 2015; v1 submitted 8 January, 2014;
originally announced January 2014.
-
Simultaneous Perturbation Methods for Adaptive Labor Staffing in Service Systems
Authors:
L. A. Prashanth,
H. L. Prasad,
Nirmit Desai,
Shalabh Bhatnagar,
Gargi Dasgupta
Abstract:
Service systems are labor intensive due to the large variation in the tasks required to address service requests from multiple customers. Aligning the staffing levels to the forecasted workloads adaptively in such systems is nontrivial because of a large number of parameters and operational variations leading to a huge search space. A challenging problem here is to optimize the staffing while maintaining the system in steady state and compliant with aggregate service level agreement (SLA) constraints. Further, because these parameters change on a weekly basis, the optimization should not take longer than a few hours. We formulate this problem as a constrained Markov cost process parameterized by the (discrete) staffing levels. We propose novel simultaneous perturbation stochastic approximation (SPSA) based SASOC (Staff Allocation using Stochastic Optimization with Constraints) algorithms for solving the above problem. The algorithms include both first-order and second-order methods and incorporate SPSA-based gradient estimates in the primal, with dual ascent for the Lagrange multipliers. All the algorithms that we propose are online, incremental and easy to implement. Further, they involve a certain generalized smooth projection operator, which is essential to project the continuous-valued worker parameter tuned by SASOC algorithms onto the discrete set. We validated our algorithms on five real-life service systems and compared them with a state-of-the-art optimization toolkit, OptQuest. Being 25 times faster than OptQuest, our algorithms are particularly suitable for adaptive labor staffing. Also, we observe that our algorithms guarantee convergence and find better solutions than OptQuest in many cases.
Submitted 28 December, 2013;
originally announced December 2013.