A Concentration Bound for TD(0) with Function Approximation

Authors: Siddharth Chandak, Vivek S. Borkar

Abstract: We derive a concentration bound of the type `for all $n \geq n_0$ for some $n_0$' for TD(0) with linear function approximation. We work with online TD learning with samples from a single sample path of the underlying Markov chain. This makes our analysis significantly different from offline TD learning or TD learning with access to independent samples from the stationary distribution of the Markov chain. We treat TD(0) as a contractive stochastic approximation algorithm, with both martingale and Markov noises. Markov noise is handled using the Poisson equation and the lack of almost sure guarantees on boundedness of iterates is handled using the concept of relaxed concentration inequalities.

Submitted 16 December, 2023; originally announced December 2023.

Comments: Submitted to Stochastic Systems

arXiv:2311.14421 [pdf, other]

Approximation of Convex Envelope Using Reinforcement Learning

Authors: Vivek S. Borkar, Adit Akarsh

Abstract: Oberman gave a stochastic control formulation of the problem of estimating the convex envelope of a non-convex function. Based on this, we develop a reinforcement learning scheme to approximate the convex envelope, using a variant of Q-learning for controlled optimal stopping. It shows very promising results on a standard library of test problems.

Submitted 24 November, 2023; originally announced November 2023.

arXiv:2311.12613 [pdf, other]

Decentralised Q-Learning for Multi-Agent Markov Decision Processes with a Satisfiability Criterion

Authors: Keshav P. Keval, Vivek S. Borkar

Abstract: In this paper, we propose a reinforcement learning algorithm to solve a multi-agent Markov decision process (MMDP). The goal, inspired by Blackwell's Approachability Theorem, is to lower the time average cost of each agent to below a pre-specified agent-specific bound. For the MMDP, we assume the state dynamics to be controlled by the joint actions of agents, but the per-stage costs to only depend on the individual agent's actions. We combine the Q-learning algorithm for a weighted combination of the costs of each agent, obtained by a gossip algorithm with the Metropolis-Hastings or Multiplicative Weights formalisms to modulate the averaging matrix of the gossip. We use multiple timescales in our algorithm and prove that under mild conditions, it approximately achieves the desired bounds for each of the agents. We also demonstrate the empirical performance of this algorithm in the more general setting of MMDPs having jointly controlled per-stage costs.

Submitted 21 November, 2023; originally announced November 2023.

arXiv:2304.03729 [pdf, other]

Full Gradient Deep Reinforcement Learning for Average-Reward Criterion

Authors: Tejas Pagare, Vivek Borkar, Konstantin Avrachenkov

Abstract: We extend the provably convergent Full Gradient DQN algorithm for discounted reward Markov decision processes from Avrachenkov et al. (2021) to average reward problems. We experimentally compare widely used RVI Q-Learning with recently proposed Differential Q-Learning in the neural function approximation setting with Full Gradient DQN and DQN. We also extend this to learn Whittle indices for Markovian restless multi-armed bandits. We observe a better convergence rate of the proposed Full Gradient variant across different tasks.

Submitted 7 April, 2023; originally announced April 2023.

Comments: 13 pages, 4 figures; Accepted by 5th Annual Learning for Dynamics & Control Conference (L4DC) 2023

MSC Class: 93-06

arXiv:2211.01595 [pdf, other]

Reinforcement Learning in Non-Markovian Environments

Authors: Siddharth Chandak, Pratik Shah, Vivek S Borkar, Parth Dodhia

Abstract: Motivated by the novel paradigm developed by Van Roy and coauthors for reinforcement learning in arbitrary non-Markovian environments, we propose a related formulation and explicitly pin down the error caused by non-Markovianity of observations when the Q-learning algorithm is applied on this formulation. Based on this observation, we propose that the criterion for agent design should be to seek good approximations for certain conditional laws. Inspired by classical stochastic control, we show that our problem reduces to that of recursive computation of approximate sufficient statistics. This leads to an autoencoder-based scheme for agent design which is then numerically tested on partially observed reinforcement learning environments.

Submitted 13 February, 2024; v1 submitted 3 November, 2022; originally announced November 2022.

Comments: 19 pages, accepted for publication at Systems and Control Letters

arXiv:2111.02644 [pdf, ps, other]

A Concentration Bound for LSPE($λ$)

Authors: Siddharth Chandak, Vivek S. Borkar, Harsh Dolhare

Abstract: The popular LSPE($λ$) algorithm for policy evaluation is revisited to derive a concentration bound that gives high probability performance guarantees from some time on.

Submitted 30 November, 2022; v1 submitted 4 November, 2021; originally announced November 2021.

Comments: 17 pages, accepted for publication in Systems and Control Letters

arXiv:2107.09153 [pdf, other]

User Association in Dense mmWave Networks as Restless Bandits

Authors: S. K. Singh, V. S. Borkar, G. S. Kasbekar

Abstract: We study the problem of user association, i.e., determining which base station (BS) a user should associate with, in a dense millimeter wave (mmWave) network. In our system model, in each time slot, a user arrives with some probability in a region with a relatively small geographical area served by a dense mmWave network. Our goal is to devise an association policy under which, in each time slot in which a user arrives, it is assigned to exactly one BS so as to minimize the weighted average amount of time that users spend in the system. The above problem is a restless multi-armed bandit problem and is provably hard to solve. We prove that the problem is Whittle indexable, and based on this result, propose an association policy under which an arriving user is associated with the BS having the smallest Whittle index. Using simulations, we show that our proposed policy outperforms several user association policies proposed in prior work.

Submitted 28 April, 2022; v1 submitted 16 July, 2021; originally announced July 2021.

Comments: 11 pages, 7 figures

arXiv:2106.14308 [pdf, other]

Concentration of Contractive Stochastic Approximation and Reinforcement Learning

Authors: Siddharth Chandak, Vivek S. Borkar, Parth Dodhia

Abstract: Using a martingale concentration inequality, concentration bounds `from time $n_0$ on' are derived for stochastic approximation algorithms with contractive maps and both martingale difference and Markov noises. These are applied to reinforcement learning algorithms, in particular to asynchronous Q-learning and TD(0).

Submitted 11 June, 2022; v1 submitted 27 June, 2021; originally announced June 2021.

Comments: 20 pages, Accepted for publication in Stochastic Systems

arXiv:2104.05311 [pdf, other]

doi 10.1016/j.sysconle.2021.105009

Prospect-theoretic Q-learning

Authors: Vivek S. Borkar, Siddharth Chandak

Abstract: We consider a prospect theoretic version of the classical Q-learning algorithm for discounted reward Markov decision processes, wherein the controller perceives a distorted and noisy future reward, modeled by a nonlinearity that accentuates gains and underrepresents losses relative to a reference point. We analyze the asymptotic behavior of the scheme by analyzing its limiting differential equation and using the theory of monotone dynamical systems to infer its asymptotic behavior. Specifically, we show convergence to equilibria, and establish some qualitative facts about the equilibria themselves.

Submitted 1 September, 2021; v1 submitted 12 April, 2021; originally announced April 2021.

Comments: Published in Systems and Control Letters. 16 pages, 8 figures

arXiv:2010.06445 [pdf, ps, other]

Revisiting SIR in the age of COVID-19: Explicit Solutions and Control Problems

Authors: Vivek S. Borkar, D. Manjunath

Abstract: The non-population conserving SIR (SIR-NC) model to describe the spread of infections in a community is proposed and studied. Unlike the standard SIR model, SIR-NC does not assume population conservation. Although similar in form to the standard SIR, SIR-NC admits a closed form solution while allowing us to model mortality, and also provides different, and arguably a more realistic, interpretation of the model parameters. Numerical comparisons of this SIR-NC model with the standard, population conserving, SIR model are provided. Extensions to include imported infections, interacting communities, and models that include births and deaths are presented and analyzed. Several numerical examples are also presented to illustrate these models. Two control problems for the SIR-NC epidemic model are presented. First we consider the continuous time model predictive control in which the cost function variables correspond to the levels of lockdown, the level of testing and quarantine, and the number of infections. We also include a switching cost for moving between lockdown levels. A discrete time version that is more amenable to computation is then presented along with numerical illustrations. We then consider a multi-objective and multi-community control where we can define multiple cost functions on the different communities and obtain the minimum cost control to keep the value function corresponding to these control objectives below a prescribed threshold.

Submitted 4 November, 2020; v1 submitted 13 October, 2020; originally announced October 2020.

Comments: SIAM Journal of Control and Optimization 2020

arXiv:1910.04402 [pdf, ps, other]

Scheduling in Wireless Networks with Spatial Reuse of Spectrum as Restless Bandits

Authors: Vivek S. Borkar, Shantanu Choudhary, Vaibhav Kumar Gupta, Gaurav S. Kasbekar

Abstract: We study the problem of scheduling packet transmissions with the aim of minimizing the energy consumption and data transmission delay of users in a wireless network in which spatial reuse of spectrum is employed. We approach this problem using the theory of Whittle index for cost minimizing restless bandits, which has been used to effectively solve problems in a variety of applications. We design two Whittle index based policies: the first by treating the graph representing the network as a clique, and the second based on interference constraints derived from the original graph. We evaluate the performance of these two policies via extensive simulations, in terms of average cost and packets dropped, and show that they outperform the well-known Slotted ALOHA and maximum weight scheduling algorithms.

Submitted 8 June, 2020; v1 submitted 10 October, 2019; originally announced October 2019.

Comments: Revision

arXiv:1902.01048 [pdf, ps, other]

Average cost optimal control under weak ergodicity hypotheses: Relative value iterations

Authors: Ari Arapostathis, Vivek S. Borkar

Abstract: We study Markov decision processes with Polish state and action spaces. The action space is state dependent and is not necessarily compact. We first establish the existence of an optimal ergodic occupation measure using only a near-monotone hypothesis on the running cost. Then we study the well-posedness of Bellman equation, or what is commonly known as the average cost optimality equation, under the additional hypothesis of the existence of a small set. We deviate from the usual approach which is based on the vanishing discount method and instead map the problem to an equivalent one for a controlled split chain. We employ a stochastic representation of the Poisson equation to derive the Bellman equation. Next, under suitable assumptions, we establish convergence results for the 'relative value iteration' algorithm which computes the solution of the Bellman equation recursively. In addition, we present some results concerning the stability and asymptotic optimality of the associated rolling horizon policies.

Submitted 14 August, 2023; v1 submitted 4 February, 2019; originally announced February 2019.

Comments: 32 pages

MSC Class: 90C40 (93E20)

arXiv:1802.07759 [pdf, ps, other]

doi 10.1007/s00498-019-00249-4

Non-asymptotic Error Bounds For Constant Stepsize Stochastic Approximation For Tracking Mobile Agents

Authors: Bhumesh Kumar, Vivek Borkar, Akhil Shetty

Abstract: This work revisits the constant stepsize stochastic approximation algorithm for tracking a slowly moving target and obtains a bound for the tracking error that is valid for the entire time axis, using the Alekseev non-linear variation of constants formula. It is the first non-asymptotic bound for the entire time axis in the sense that it is not based on the vanishing stepsize limit and associated limit theorems unlike prior works, and captures clearly the dependence on problem parameters and the dimension.

Submitted 1 March, 2019; v1 submitted 21 February, 2018; originally announced February 2018.

Comments: Expanded and revised

Journal ref: Mathematics of Control, Signals, and Systems (2019)

arXiv:1710.11471 [pdf, other]

Distributed Server Allocation for Content Delivery Networks

Authors: Sarath Pattathil, Vivek S. Borkar, Gaurav S. Kasbekar

Abstract: We propose a dynamic formulation of file-sharing networks in terms of an average cost Markov decision process with constraints. By analyzing a Whittle-like relaxation thereof, we propose an index policy in the spirit of Whittle and compare it by simulations with other natural heuristics.

Submitted 9 February, 2019; v1 submitted 28 October, 2017; originally announced October 2017.

Comments: 22 pages, 10 figures

arXiv:1709.03248 [pdf, ps, other]

Vector Field Guidance for Convoy Monitoring Using Elliptical Orbits

Authors: Aseem V. Borkar, Vivek S. Borkar, Arpita Sinha

Abstract: We propose a novel vector field based guidance scheme for tracking and surveillance of a convoy, moving along a possibly nonlinear trajectory on the ground, by an aerial agent. The scheme first computes a time varying ellipse that encompasses all the targets in the convoy using a simple regression based algorithm. It then ensures convergence of the agent to a trajectory that repeatedly traverses this moving ellipse. The scheme is analyzed using perturbation theory of nonlinear differential equations and supporting simulations are provided. Some related implementation issues are discussed and advantages of the scheme are highlighted.

Submitted 13 September, 2017; v1 submitted 11 September, 2017; originally announced September 2017.

arXiv:1708.08246 [pdf, other]

Distributed Stochastic Approximation with Local Projections

Authors: Suhail M. Shah, Vivek S. Borkar

Abstract: We propose a distributed version of a stochastic approximation scheme constrained to remain in the intersection of a finite family of convex sets. The projection to the intersection of these sets is also computed in a distributed manner and a `nonlinear gossip' mechanism is employed to blend the projection iterations with the stochastic approximation using multiple time scales.

Submitted 28 August, 2017; originally announced August 2017.

Comments: 28 pages, 3 figures, submitted to SiOpt

arXiv:1707.02440 [pdf, other]

Whittle Indexability in Egalitarian Processor Sharing Systems

Authors: Vivek S. Borkar, Sarath Pattathil

Abstract: The egalitarian processor sharing model is viewed as a restless bandit and its Whittle indexability is established. A numerical scheme for computing the Whittle indices is provided, along with supporting numerical experiments.

Submitted 13 July, 2017; v1 submitted 8 July, 2017; originally announced July 2017.

Comments: 27 pages, 6 figures

arXiv:1706.09778 [pdf, other]

Opportunistic Scheduling as Restless Bandits

Authors: Vivek S. Borkar, Gaurav S. Kasbekar, Sarath Pattathil, Priyesh Y. Shetty

Abstract: In this paper we consider energy efficient scheduling in a multiuser setting where each user has a finite sized queue and there is a cost associated with holding packets (jobs) in each queue (modeling the delay constraints). The packets of each user need to be sent over a common channel. The channel qualities seen by the users are time-varying and differ across users; also, the cost incurred, i.e., energy consumed, in packet transmission is a function of the channel quality. We pose the problem as an average cost Markov Decision Problem, and prove that this problem is Whittle Indexable. Based on this result, we propose an algorithm in which the Whittle index of each user is computed and the user who has the lowest value is selected for transmission. We evaluate the performance of this algorithm via simulations and show that it achieves a lower average cost than the Maximum Weight Scheduling and Weighted Fair Scheduling strategies.

Submitted 17 October, 2017; v1 submitted 29 June, 2017; originally announced June 2017.

Comments: 10 pages, 7 figures

arXiv:1503.08558 [pdf, other]

Whittle Index Policy for Crawling Ephemeral Content

Authors: Konstantin Avrachenkov, Vivek Borkar

Abstract: We consider the task of scheduling a crawler to retrieve content from several sites with ephemeral content. A user typically loses interest in ephemeral content, like news or posts at social network groups, after several days or hours. Thus, development of a timely crawling policy for such ephemeral information sources is very important. We first formulate this problem as an optimal control problem with average reward. The reward can be measured in the number of clicks or relevant search requests. The problem in its initial formulation suffers from the curse of dimensionality and quickly becomes intractable even with a moderate number of information sources. Fortunately, this problem admits a Whittle index, which leads to problem decomposition and to a very simple and efficient crawling policy. We derive the Whittle index and provide its theoretical justification.

Submitted 30 March, 2015; originally announced March 2015.

arXiv:1411.0728 [pdf, ps, other]

Approachability in Stackelberg Stochastic Games with Vector Costs

Authors: Dileep Kalathil, Vivek Borkar, Rahul Jain

Abstract: The notion of approachability was introduced by Blackwell [1] in the context of vector-valued repeated games. The famous Blackwell's approachability theorem prescribes a strategy for approachability, i.e., for `steering' the average cost of a given agent towards a given target set, irrespective of the strategies of the other agents. In this paper, motivated by the multi-objective optimization/decision making problems in dynamically changing environments, we address the approachability problem in Stackelberg stochastic games with vector valued cost functions. We make two main contributions. Firstly, we give a simple and computationally tractable strategy for approachability for Stackelberg stochastic games along the lines of Blackwell's. Secondly, we give a reinforcement learning algorithm for learning the approachable strategy when the transition kernel is unknown. We also recover as a by-product Blackwell's necessary and sufficient condition for approachability for convex sets in this set up and thus a complete characterization. We also give sufficient conditions for non-convex sets.

Submitted 20 June, 2016; v1 submitted 3 November, 2014; originally announced November 2014.

Comments: 18 Pages, Submitted to Dynamic Games and Applications

arXiv:1404.6635 [pdf, other]

Greedy Block Coordinate Descent (GBCD) Method for High Dimensional Quadratic Programs

Authors: Gugan Thoppe, Vivek S. Borkar, Dinesh Garg

Abstract: High dimensional unconstrained quadratic programs (UQPs) involving massive datasets are now common in application areas such as web, social networks, etc. Unless computational resources that match up to these datasets are available, solving such problems using classical UQP methods is very difficult. This paper discusses alternatives. We first define high dimensional compliant (HDC) methods for UQPs---methods that can solve high dimensional UQPs by adapting to available computational resources. We then show that the class of block Kaczmarz and block coordinate descent (BCD) are the only existing methods that can be made HDC. As a possible answer to the question of the `best' amongst BCD methods for UQP, we propose a novel greedy BCD (GBCD) method with serial, parallel and distributed variants. Convergence rates and numerical tests confirm that the GBCD is indeed an effective method to solve high dimensional UQPs. In fact, it sometimes beats even the conjugate gradient.

Submitted 12 July, 2014; v1 submitted 26 April, 2014; originally announced April 2014.

Comments: 29 pages, 3 figures, New references added

arXiv:1309.7841 [pdf, ps, other]

doi 10.1109/JSTSP.2014.2320229

Asynchronous Gossip for Averaging and Spectral Ranking

Authors: Vivek S. Borkar, Rahul Makhijani, Rajesh Sundaresan

Abstract: We consider two variants of the classical gossip algorithm. The first variant is a version of asynchronous stochastic approximation. We highlight a fundamental difficulty associated with the classical asynchronous gossip scheme, viz., that it may not converge to a desired average, and suggest an alternative scheme based on reinforcement learning that has guaranteed convergence to the desired average. We then discuss a potential application to a wireless network setting with simultaneous link activation constraints. The second variant is a gossip algorithm for distributed computation of the Perron-Frobenius eigenvector of a nonnegative matrix. While the first variant draws upon a reinforcement learning algorithm for an average cost controlled Markov decision problem, the second variant draws upon a reinforcement learning algorithm for risk-sensitive control. We then discuss potential applications of the second variant to ranking schemes, reputation networks, and principal component analysis.

Submitted 6 January, 2014; v1 submitted 30 September, 2013; originally announced September 2013.

Comments: 14 pages, 7 figures. Minor revision

arXiv:1303.0618 [pdf, ps, other]

doi 10.1137/130912918

Convergence of The Relative Value Iteration for the Ergodic Control Problem of Nondegenerate Diffusions under Near-Monotone Costs

Authors: Ari Arapostathis, Vivek S. Borkar, K. Suresh Kumar

Abstract: We study the relative value iteration for the ergodic control problem under a near-monotone running cost structure for a nondegenerate diffusion controlled through its drift. This algorithm takes the form of a quasilinear parabolic Cauchy initial value problem in $\mathbb{R}^{d}$. We show that this Cauchy problem stabilizes, or in other words, that the solution of the quasilinear parabolic equation converges for every bounded initial condition in $\mathcal{C}^{2}(\mathbb{R}^{d})$ to the solution of the Hamilton--Jacobi--Bellman (HJB) equation associated with the ergodic control problem.

Submitted 2 April, 2013; v1 submitted 4 March, 2013; originally announced March 2013.

MSC Class: 93E15; 93E20

Journal ref: SIAM Journal of Control and Optimization 52 (2014), no. 1, pp. 1-31

arXiv:1210.8188 [pdf, ps, other]

doi 10.1007/978-3-319-02690-9_1

Relative Value Iteration for Stochastic Differential Games

Authors: Ari Arapostathis, Vivek S. Borkar, K. Suresh Kumar

Abstract: We study zero-sum stochastic differential games with player dynamics governed by a nondegenerate controlled diffusion process. Under the assumption of uniform stability, we establish the existence of a solution to the Isaacs equation for the ergodic game and characterize the optimal stationary strategies. The data is not assumed to be bounded, nor do we assume geometric ergodicity. Thus our results extend previous work in the literature. We also study a relative value iteration scheme that takes the form of a parabolic Isaacs equation. Under the hypothesis of geometric ergodicity we show that the relative value iteration converges to the elliptic Isaacs equation as time goes to infinity. We use these results to establish convergence of the relative value iteration for risk-sensitive control problems under an asymptotic flatness assumption.

Submitted 2 April, 2013; v1 submitted 30 October, 2012; originally announced October 2012.

MSC Class: 93E15; 93E20 (Primary) 60J25; 60J60; 90C40 (Secondary)

Journal ref: Advances in dynamic games, 3--27, Ann. Internat. Soc. Dynam. Games, 13, Birkhäuser/Springer, Cham, 2013

arXiv:1107.4142 [pdf, ps, other]

doi 10.1214/12-SSY64

Asymptotics of the Invariant Measure in Mean Field Models with Jumps

Authors: Vivek S. Borkar, Rajesh Sundaresan

Abstract: We consider the asymptotics of the invariant measure for the process of the empirical spatial distribution of $N$ coupled Markov chains in the limit of a large number of chains. Each chain reflects the stochastic evolution of one particle. The chains are coupled through the dependence of the transition rates on this spatial distribution of particles in the various states. Our model is a caricature for medium access interactions in wireless local area networks. It is also applicable to the study of spread of epidemics in a network. The limiting process satisfies a deterministic ordinary differential equation called the McKean-Vlasov equation. When this differential equation has a unique globally asymptotically stable equilibrium, the spatial distribution asymptotically concentrates on this equilibrium. More generally, its limit points are supported on a subset of the $ω$-limit sets of the McKean-Vlasov equation. Using a control-theoretic approach, we examine the question of large deviations of the invariant measure from this limit.

Submitted 23 January, 2013; v1 submitted 20 July, 2011; originally announced July 2011.

Comments: 58 pages, reorganised to get quickly to the main results on invariant measure; Stochastic Systems, volume 2, 2012

Showing 1–25 of 25 results for author: Borkar, V