Search | arXiv e-print repository

A Cubic-regularized Policy Newton Algorithm for Reinforcement Learning

Authors: Mizhaan Prajit Maniyar, Akash Mondal, Prashanth L. A., Shalabh Bhatnagar

Abstract: We consider the problem of control in the setting of reinforcement learning (RL), where model information is not available. Policy gradient algorithms are a popular solution approach for this problem and are usually shown to converge to a stationary point of the value function. In this paper, we propose two policy Newton algorithms that incorporate cubic regularization. Both algorithms employ the likelihood ratio method to form estimates of the gradient and Hessian of the value function using sample trajectories. The first algorithm requires an exact solution of the cubic regularized problem in each iteration, while the second algorithm employs an efficient gradient descent-based approximation to the cubic regularized problem. We establish convergence of our proposed algorithms to a second-order stationary point (SOSP) of the value function, which results in the avoidance of traps in the form of saddle points. In particular, the sample complexity of our algorithms to find an $ε$-SOSP is $O(ε^{-3.5})$, which is an improvement over the state-of-the-art sample complexity of $O(ε^{-4.5})$.

Submitted 21 April, 2023; originally announced April 2023.
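The abstract above mentions a gradient descent-based approximation to the cubic regularized problem. As a rough illustration only, the sketch below runs plain gradient descent on the generic cubic model $m(s) = g^\top s + \tfrac{1}{2} s^\top H s + \tfrac{\rho}{6}\|s\|^3$ in its minimization form. The function name `cubic_subproblem_gd`, the step size, and the iteration count are assumptions made for this example; it is not the paper's algorithm, which works with sampled gradient and Hessian estimates of the value function.

```python
import math


def cubic_subproblem_gd(g, H, rho, step=0.05, iters=2000):
    """Approximately minimize m(s) = g.s + 0.5 s.H.s + (rho/6)*||s||^3.

    g is a list of floats, H a list of rows, rho >= 0 the cubic penalty.
    Hypothetical helper for illustration; hyperparameters are arbitrary.
    """
    n = len(g)
    s = [0.0] * n
    for _ in range(iters):
        norm_s = math.sqrt(sum(v * v for v in s))
        # Matrix-vector product H s.
        Hs = [sum(H[i][j] * s[j] for j in range(n)) for i in range(n)]
        # Gradient of the cubic model: g + H s + (rho/2) * ||s|| * s.
        grad = [g[i] + Hs[i] + 0.5 * rho * norm_s * s[i] for i in range(n)]
        s = [s[i] - step * grad[i] for i in range(n)]
    return s
```

With `rho = 0` and a positive definite `H`, the routine reduces to gradient descent on a quadratic and approaches the Newton step; the cubic term keeps the step bounded when `H` is indefinite or zero.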

arXiv:2301.06535 [pdf, other]

doi 10.1016/j.mlwa.2024.100535

Case-Base Neural Networks: survival analysis with time-varying, higher-order interactions

Authors: Jesse Islam, Maxime Turgeon, Robert Sladek, Sahir Bhatnagar

Abstract: In the context of survival analysis, data-driven neural network-based methods have been developed to model complex covariate effects. While these methods may provide better predictive performance than regression-based approaches, not all can model time-varying interactions and complex baseline hazards. To address this, we propose Case-Base Neural Networks (CBNNs) as a new approach that combines the case-base sampling framework with flexible neural network architectures. Using a novel sampling scheme and data augmentation to naturally account for censoring, we construct a feed-forward neural network that includes time as an input. CBNNs predict the probability of an event occurring at a given moment to estimate the full hazard function. We compare the performance of CBNNs to regression and neural network-based survival methods in a simulation and three case studies using two time-dependent metrics. First, we examine performance on a simulation involving a complex baseline hazard and time-varying interactions to assess all methods, with CBNN outperforming competitors. Then, we apply all methods to three real data applications, with CBNNs outperforming the competing models in two studies and showing similar performance in the third. Our results highlight the benefit of combining case-base sampling with deep learning to provide a simple and flexible framework for data-driven modeling of single event survival outcomes that estimates time-varying effects and a complex baseline hazard by design. An R package is available at https://github.com/Jesse-Islam/cbnn.

Submitted 9 January, 2024; v1 submitted 16 January, 2023; originally announced January 2023.

arXiv:2212.10477 [pdf, ps, other]

Generalized Simultaneous Perturbation-based Gradient Search with Reduced Estimator Bias

Authors: Soumen Pachal, Shalabh Bhatnagar, L. A. Prashanth

Abstract: We present in this paper a family of generalized simultaneous perturbation-based gradient search (GSPGS) estimators that use noisy function measurements. The number of function measurements required by each estimator is guided by the desired level of accuracy. We first present in detail unbalanced generalized simultaneous perturbation stochastic approximation (GSPSA) estimators and later present the balanced versions (B-GSPSA) of these. We extend this idea further and present the generalized smoothed functional (GSF) and generalized random directions stochastic approximation (GRDSA) estimators, respectively, as well as their balanced variants. We show that estimators within any specified class that require more function measurements result in lower estimator bias. We present a detailed analysis of both the asymptotic and non-asymptotic convergence of the resulting stochastic approximation schemes. We further present a series of experimental results with the various GSPGS estimators on the Rastrigin and quadratic function objectives. Our experiments are seen to validate our theoretical findings.

Submitted 12 November, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

Comments: The material in this paper was presented in part at the Conference on Information Sciences and Systems (CISS) in March 2023
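For orientation, the classical two-measurement SPSA gradient estimate, which the generalized estimators in the entry above build on, can be sketched as follows. This is the standard textbook form with Rademacher (±1) perturbations, not the GSPSA or B-GSPSA estimators of the paper; the function name `spsa_gradient` and its parameters are assumptions made for this example.

```python
import random


def spsa_gradient(f, x, c=0.1, delta=None, rng=random):
    """Two-measurement SPSA estimate of the gradient of f at x.

    f maps a list of floats to a float (possibly noisy); c is the
    perturbation size. delta may be supplied for reproducibility,
    otherwise each entry is drawn uniformly from {-1, +1}.
    """
    if delta is None:
        delta = [rng.choice((-1.0, 1.0)) for _ in x]
    x_plus = [xi + c * di for xi, di in zip(x, delta)]
    x_minus = [xi - c * di for xi, di in zip(x, delta)]
    # A single difference of two measurements is shared by every coordinate.
    diff = f(x_plus) - f(x_minus)
    return [diff / (2.0 * c * di) for di in delta]
```

Only two function evaluations are needed regardless of the dimension of `x`, which is the property that makes simultaneous perturbation attractive when measurements are expensive.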

arXiv:2206.12267 [pdf, other]

Efficient Penalized Generalized Linear Mixed Models for Variable Selection and Genetic Risk Prediction in High-Dimensional Data

Authors: Julien St-Pierre, Karim Oualkacha, Sahir Rai Bhatnagar

Abstract: Sparse regularized regression methods are now widely used in genome-wide association studies (GWAS) to address the multiple testing burden that limits discovery of potentially important predictors. Linear mixed models (LMMs) have become an attractive alternative to principal components (PC) adjustment to account for population structure and relatedness in high-dimensional penalized models. However, their use in binary trait GWAS relies on the invalid assumption that the residual variance does not depend on the estimated regression coefficients. Moreover, LMMs use a single spectral decomposition of the covariance matrix of the responses, which is no longer possible in generalized linear mixed models (GLMMs). We introduce a new method called pglmm, a penalized GLMM that simultaneously selects genetic markers and estimates their effects, accounting for between-individual correlations and the binary nature of the trait. We develop a computationally efficient algorithm based on PQL estimation that scales regularized mixed models to high-dimensional binary trait GWAS (~300,000 SNPs). We show through simulations that penalized LMM and logistic regression with PC adjustment fail to correctly select important predictors and/or that prediction accuracy decreases for a binary response when the dimensionality of the relatedness matrix is high compared to pglmm. Further, we demonstrate through the analysis of two polygenic binary traits in the UK Biobank data that our method can achieve higher predictive performance, while also selecting fewer predictors than a sparse regularized logistic lasso with PC adjustment. Our method is available as a Julia package PenalizedGLMM.jl.

Submitted 24 June, 2022; originally announced June 2022.

Comments: 26 pages, 5 figures

arXiv:2205.13609 [pdf, ps, other]

Variable Selection for Individualized Treatment Rules with Discrete Outcomes

Authors: Zeyu Bian, Erica EM Moodie, Susan M Shortreed, Sylvie D Lambert, Sahir Bhatnagar

Abstract: An individualized treatment rule (ITR) is a decision rule that aims to improve individual patients' health outcomes by recommending optimal treatments according to patients' specific information. In observational studies, collected data may contain many variables that are irrelevant for making treatment decisions. Including all available variables in the statistical model for the ITR could yield a loss of efficiency and an unnecessarily complicated treatment rule, which is difficult for physicians to interpret or implement. Thus, a data-driven approach to select important tailoring variables with the aim of improving the estimated decision rules is crucial. While there is a growing body of literature on selecting variables in ITRs with continuous outcomes, relatively few methods exist for discrete outcomes, which pose additional computational challenges even in the absence of variable selection. In this paper, we propose a variable selection method for ITRs with discrete outcomes. We show theoretically and empirically that our approach has the double robustness property, and that it compares favorably with other competing approaches. We illustrate the proposed method on data from a study of an adaptive web-based stress management tool to identify which variables are relevant for tailoring treatment.

Submitted 29 September, 2023; v1 submitted 26 May, 2022; originally announced May 2022.

arXiv:2101.07359 [pdf, other]

Variable Selection in Regression-based Estimation of Dynamic Treatment Regimes

Authors: Zeyu Bian, Erica EM Moodie, Susan M Shortreed, Sahir Bhatnagar

Abstract: Dynamic treatment regimes (DTRs) consist of a sequence of decision rules, one per stage of intervention, that find effective treatments for individual patients according to patient information history. DTRs can be estimated from models which include the interaction between treatment and a small number of covariates which are often chosen a priori. However, with increasingly large and complex data being collected, it is difficult to know which prognostic factors might be relevant in the treatment rule. Therefore, a more data-driven approach of selecting these covariates might improve the estimated decision rules and simplify models to make them easier to interpret. We propose a variable selection method for DTR estimation using penalized dynamic weighted least squares. Our method has the strong heredity property, that is, an interaction term can be included in the model only if the corresponding main terms have also been selected. Through simulations, we show our method has both the double robustness property and the oracle property, and the newly proposed methods compare favorably with other variable selection approaches.

Submitted 3 December, 2021; v1 submitted 18 January, 2021; originally announced January 2021.

arXiv:2009.10629 [pdf, ps, other]

doi 10.1007/s11222-023-10371-8

Accelerated Gradient Methods for Sparse Statistical Learning with Nonconvex Penalties

Authors: Kai Yang, Masoud Asgharian, Sahir Bhatnagar

Abstract: Nesterov's accelerated gradient (AG) is a popular technique to optimize objective functions comprising two components: a convex loss and a penalty function. While AG methods perform well for convex penalties, such as the LASSO, convergence issues may arise when they are applied to nonconvex penalties, such as SCAD. A recent proposal generalizes Nesterov's AG method to the nonconvex setting. The proposed algorithm requires specification of several hyperparameters for its practical application. Aside from some general conditions, there is no explicit rule for selecting the hyperparameters, and it is unclear how different selections affect convergence of the algorithm. In this article, we propose a hyperparameter setting based on the complexity upper bound to accelerate convergence, and consider the application of this nonconvex AG algorithm to high-dimensional linear and logistic sparse learning problems. We further establish the rate of convergence and present a simple and useful bound to characterize our proposed optimal damping sequence. Simulation studies show that convergence can be made, on average, considerably faster than that of the conventional proximal gradient algorithm. Our experiments also show that the proposed method generally outperforms the current state-of-the-art methods in terms of signal recovery.

Submitted 28 November, 2022; v1 submitted 22 September, 2020; originally announced September 2020.

Comments: 42 pages, 13 figures

Journal ref: Stat Comput 34, 59 (2024)

arXiv:2009.10264 [pdf, other]

casebase: An Alternative Framework For Survival Analysis and Comparison of Event Rates

Authors: Sahir Rai Bhatnagar, Maxime Turgeon, Jesse Islam, James A. Hanley, Olli Saarela

Abstract: In epidemiological studies of time-to-event data, a quantity of interest to the clinician and the patient is the risk of an event given a covariate profile. However, methods relying on time matching or risk-set sampling (including Cox regression) eliminate the baseline hazard from the likelihood expression or the estimating function. The baseline hazard then needs to be estimated separately using a non-parametric approach. This leads to step-wise estimates of the cumulative incidence that are difficult to interpret. Using case-base sampling, Hanley & Miettinen (2009) explained how the parametric hazard functions can be estimated using logistic regression. Their approach naturally leads to estimates of the cumulative incidence that are smooth-in-time. In this paper, we present the casebase R package, a comprehensive and flexible toolkit for parametric survival analysis. We describe how the case-base framework can also be used in more complex settings: competing risks, time-varying exposure, and variable selection. Our package also includes an extensive array of visualization tools to complement the analysis of time-to-event data. We illustrate all these features through four different case studies. *SRB and MT contributed equally to this work.

Submitted 21 September, 2020; originally announced September 2020.

Comments: 31 pages, 10 figures

arXiv:2008.13066 [pdf, other]

Computer Model Calibration with Time Series Data using Deep Learning and Quantile Regression

Authors: Saumya Bhatnagar, Won Chang, Seonjin Kim, Jiali Wang

Abstract: Computer models play a key role in many scientific and engineering problems. One major source of uncertainty in computer model experiments is input parameter uncertainty. Computer model calibration is a formal statistical procedure to infer input parameters by combining information from model runs and observational data. The existing standard calibration framework suffers from inferential issues when the model output and observational data are high-dimensional dependent data, such as large time series, due to the difficulty in building an emulator and the non-identifiability between effects from input parameters and data-model discrepancy. To overcome these challenges, we propose a new calibration framework based on a deep neural network (DNN) with long short-term memory layers that directly emulates the inverse relationship between the model output and input parameters. Adopting the 'learning with noise' idea, we train our DNN model to filter out the effects of data-model discrepancy on input parameter inference. We also formulate a new way to construct interval predictions for DNN using quantile regression to quantify the uncertainty in input parameter estimates. Through a simulation study and a real data application with the WRF-hydro model, we show that our approach can yield accurate point estimates and well calibrated interval estimates for input parameters.

Submitted 8 September, 2020; v1 submitted 29 August, 2020; originally announced August 2020.

arXiv:1911.05697 [pdf, other]

A Convergent Off-Policy Temporal Difference Algorithm

Authors: Raghuram Bharadwaj Diddigi, Chandramouli Kamanchi, Shalabh Bhatnagar

Abstract: Learning the value function of a given policy (target policy) from the data samples obtained from a different policy (behavior policy) is an important problem in Reinforcement Learning (RL). This problem is studied under the setting of off-policy prediction. Temporal Difference (TD) learning algorithms are a popular class of algorithms for solving the prediction problem. TD algorithms with linear function approximation are shown to be convergent when the samples are generated from the target policy (known as on-policy prediction). However, it has been well established in the literature that off-policy TD algorithms under linear function approximation diverge. In this work, we propose a convergent on-line off-policy TD algorithm under linear function approximation. The main idea is to penalize the updates of the algorithm in such a way as to ensure convergence of the iterates. We provide a convergence analysis of our algorithm. Through numerical evaluations, we further demonstrate the effectiveness of our algorithm.

Submitted 13 November, 2019; originally announced November 2019.
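To make the setting of the entry above concrete, the sketch below shows one step of the standard importance-weighted off-policy TD(0) update with linear function approximation, i.e. the baseline scheme that the abstract describes as prone to divergence. It does not include the penalty that the paper adds to obtain convergence. The function name `off_policy_td0_update` and the default step size and discount are assumptions made for this example.

```python
def off_policy_td0_update(theta, phi_s, phi_next, reward, rho,
                          alpha=0.1, gamma=0.9):
    """One semi-gradient off-policy TD(0) step with linear features.

    theta: current weight vector; phi_s, phi_next: feature vectors of the
    current and next state; rho: importance ratio pi(a|s) / b(a|s) between
    the target policy pi and the behavior policy b.
    """
    v_s = sum(t * p for t, p in zip(theta, phi_s))
    v_next = sum(t * p for t, p in zip(theta, phi_next))
    # Temporal-difference error under the current linear value estimate.
    td_error = reward + gamma * v_next - v_s
    return [t + alpha * rho * td_error * p for t, p in zip(theta, phi_s)]
```

Setting `rho = 1` for every transition recovers the on-policy update, which is the case the abstract notes is known to converge.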

arXiv:1906.06659 [pdf, ps, other]

doi 10.1109/TAC.2022.3159453

A Generalized Minimax Q-learning Algorithm for Two-Player Zero-Sum Stochastic Games

Authors: Raghuram Bharadwaj Diddigi, Chandramouli Kamanchi, Shalabh Bhatnagar

Abstract: We consider the problem of two-player zero-sum games. This problem is formulated as a min-max Markov game in the literature. The solution of this game, which is the min-max payoff, starting from a given state is called the min-max value of the state. In this work, we compute the solution of the two-player zero-sum game utilizing the technique of successive relaxation that has been successfully applied in the literature to compute a faster value iteration algorithm in the context of Markov Decision Processes. We extend the concept of successive relaxation to the setting of two-player zero-sum games. We show that, under a special structure on the game, this technique facilitates faster computation of the min-max value of the states. We then derive a generalized minimax Q-learning algorithm that computes the optimal policy when the model information is not known. Finally, we prove the convergence of the proposed generalized minimax Q-learning algorithm utilizing stochastic approximation techniques, under an assumption on the boundedness of iterates. Through experiments, we demonstrate the effectiveness of our proposed algorithm.

Submitted 18 March, 2022; v1 submitted 16 June, 2019; originally announced June 2019.

arXiv:1905.03970 [pdf, other]

doi 10.1007/s10489-020-01758-5

Reinforcement Learning in Non-Stationary Environments

Authors: Sindhu Padakandla, Prabuchandran K. J, Shalabh Bhatnagar

Abstract: Reinforcement learning (RL) methods learn optimal decisions in the presence of a stationary environment. However, the stationary assumption on the environment is very restrictive. In many real-world problems, such as traffic signal control and robotic applications, one often encounters situations with non-stationary environments, and in these scenarios RL methods yield sub-optimal decisions. In this paper, we thus consider the problem of developing RL methods that obtain optimal decisions in a non-stationary environment. The goal of this problem is to maximize the long-term discounted reward achieved when the underlying model of the environment changes over time. To achieve this, we first adapt a change point algorithm to detect change in the statistics of the environment and then develop an RL algorithm that maximizes the long-run reward accrued. We illustrate that our change point method detects change in the model of the environment effectively and thus facilitates the RL algorithm in maximizing the long-run reward. We further validate the effectiveness of the proposed solution on non-stationary random Markov decision processes, a sensor energy management problem and a traffic signal control problem.

Submitted 19 May, 2020; v1 submitted 10 May, 2019; originally announced May 2019.

Journal ref: Applied Intelligence 2020

arXiv:1905.03927 [pdf, other]

doi 10.1109/TAC.2021.3112851

Generalized Second Order Value Iteration in Markov Decision Processes

Authors: Chandramouli Kamanchi, Raghuram Bharadwaj Diddigi, Shalabh Bhatnagar

Abstract: Value iteration is a fixed point iteration technique utilized to obtain the optimal value function and policy in a discounted reward Markov Decision Process (MDP). Here, a contraction operator is constructed and applied repeatedly to arrive at the optimal solution. Value iteration is a first order method and therefore it may take a large number of iterations to converge to the optimal solution. Successive relaxation is a popular technique that can be applied to solve a fixed point equation. It has been shown in the literature that, under a special structure of the MDP, successive over-relaxation technique computes the optimal value function faster than standard value iteration. In this work, we propose a second order value iteration procedure that is obtained by applying the Newton-Raphson method to the successive relaxation value iteration scheme. We prove the global convergence of our algorithm to the optimal solution asymptotically and show the second order convergence. Through experiments, we demonstrate the effectiveness of our proposed approach.

Submitted 17 September, 2021; v1 submitted 10 May, 2019; originally announced May 2019.

Comments: Accepted for publication at IEEE Transactions on Automatic Control
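The standard (first order) value iteration that the entry above takes as its starting point can be sketched as below: the Bellman optimality operator is applied repeatedly until successive iterates agree to within a tolerance. This is the textbook scheme, not the second order or successive relaxation variants proposed in the paper; the data layout for `P` and `R` and the function name `value_iteration` are assumptions made for this example.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-10, max_iters=10000):
    """Standard value iteration for a finite discounted-reward MDP.

    P[s][a] is a list of (probability, next_state) pairs and R[s][a] is
    the expected one-step reward for taking action a in state s.
    Returns the (approximately) optimal value function as a list.
    """
    n = len(P)
    V = [0.0] * n
    for _ in range(max_iters):
        # Apply the Bellman optimality operator to the current iterate.
        V_new = [
            max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s]))
            )
            for s in range(n)
        ]
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V
```

Because the operator is a contraction with modulus `gamma`, the error shrinks by roughly that factor per sweep, which is why the method can need many iterations when `gamma` is close to 1.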

arXiv:1903.03812 [pdf, other]

doi 10.1109/LCSYS.2019.2921158

Successive Over Relaxation Q-Learning

Authors: Chandramouli Kamanchi, Raghuram Bharadwaj Diddigi, Shalabh Bhatnagar

Abstract: In a discounted reward Markov Decision Process (MDP), the objective is to find the optimal value function, i.e., the value function corresponding to an optimal policy. This problem reduces to solving a functional equation known as the Bellman equation, and a fixed point iteration scheme known as value iteration is utilized to obtain the solution. In the literature, a successive over-relaxation based value iteration scheme is proposed to speed up the computation of the optimal value function. The speed-up is achieved by constructing a modified Bellman equation that ensures faster convergence to the optimal value function. However, in many practical applications, the model information is not known and we resort to Reinforcement Learning (RL) algorithms to obtain the optimal policy and value function. One such popular algorithm is Q-learning. In this paper, we propose Successive Over-Relaxation (SOR) Q-learning. We first derive a modified fixed point iteration for SOR Q-values and utilize stochastic approximation to derive a learning algorithm to compute the optimal value function and an optimal policy. We then prove the almost sure convergence of the SOR Q-learning to SOR Q-values. Finally, through numerical experiments, we show that SOR Q-learning is faster compared to the standard Q-learning algorithm.

Submitted 13 June, 2019; v1 submitted 9 March, 2019; originally announced March 2019.

Journal ref: IEEE Control Systems Letters 2019

arXiv:1902.03806 [pdf, ps, other]

doi 10.1109/LCSYS.2019.2916467

An Online Sample Based Method for Mode Estimation using ODE Analysis of Stochastic Approximation Algorithms

Authors: Chandramouli Kamanchi, Raghuram Bharadwaj Diddigi, Prabuchandran K. J., Shalabh Bhatnagar

Abstract: One of the popular measures of central tendency that provides better representation and interesting insights of the data compared to other measures like the mean and median is the mode. If the analytical form of the density function is known, the mode is an argument of the maximum value of the density function and one can apply optimization techniques to find it. In many practical applications, the analytical form of the density is not known and only samples from the distribution are available. Most of the techniques proposed in the literature for estimating the mode from samples assume that all the samples are available beforehand. Moreover, some of the techniques employ computationally expensive operations like sorting. In this work, we provide a computationally effective, on-line iterative algorithm that estimates the mode of a unimodal smooth density given only the samples generated from the density. Asymptotic convergence of the proposed algorithm using an ordinary differential equation (ODE) based analysis is provided. We also prove the stability of estimates by utilizing the concept of regularization. Experimental results further demonstrate the effectiveness of the proposed algorithm.

Submitted 3 June, 2019; v1 submitted 11 February, 2019; originally announced February 2019.

Journal ref: IEEE Control Systems Letters 2019

arXiv:1806.06720 [pdf, other]

An Online Prediction Algorithm for Reinforcement Learning with Linear Function Approximation using Cross Entropy Method

Authors: Ajin George Joseph, Shalabh Bhatnagar

Abstract: In this paper, we provide two new stable online algorithms for the problem of prediction in reinforcement learning, i.e., estimating the value function of a model-free Markov reward process using the linear function approximation architecture and with memory and computation costs scaling quadratically in the size of the feature set. The algorithms employ the multi-timescale stochastic approximation variant of the very popular cross entropy (CE) optimization method, which is a model based search method to find the global optimum of a real-valued function. A proof of convergence of the algorithms using the ODE method is provided. We supplement our theoretical results with experimental comparisons. The algorithms achieve good performance fairly consistently on many RL benchmark problems with regard to computational efficiency, accuracy and stability.

Submitted 15 June, 2018; originally announced June 2018.

Comments: arXiv admin note: substantial text overlap with arXiv:1609.09449

arXiv:1802.07935 [pdf, other]

Asynchronous stochastic approximations with asymptotically biased errors and deep multi-agent learning

Authors: Arunselvan Ramaswamy, Shalabh Bhatnagar, Daniel E. Quevedo

Abstract: Asynchronous stochastic approximations (SAs) are an important class of model-free algorithms, tools and techniques that are popular in multi-agent and distributed control scenarios. To counter Bellman's curse of dimensionality, such algorithms are coupled with function approximations. Although the learning/control problem becomes more tractable, function approximations affect stability and convergence. In this paper, we present verifiable sufficient conditions for stability and convergence of asynchronous SAs with biased approximation errors. The theory developed herein is used to analyze Policy Gradient methods and noisy Value Iteration schemes. Specifically, we analyze the asynchronous approximate counterparts of the policy gradient (A2PG) and value iteration (A2VI) schemes. It is shown that the stability of these algorithms is unaffected by biased approximation errors, provided they are asymptotically bounded. With respect to convergence (of A2VI and A2PG), a relationship between the limiting set and the approximation errors is established. Finally, experimental results are presented that support the theory.

Submitted 2 May, 2019; v1 submitted 22 February, 2018; originally announced February 2018.

MSC Class: 62L20; 93E35; 49L20; 68T05

arXiv:1709.04673 [pdf, ps, other]

Analyzing Approximate Value Iteration Algorithms

Authors: Arunselvan Ramaswamy, Shalabh Bhatnagar

Abstract: In this paper, we consider the stochastic iterative counterpart of the value iteration scheme wherein only noisy and possibly biased approximations of the Bellman operator are available. We call this counterpart the approximate value iteration (AVI) scheme. Neural networks are often used as function approximators, in order to counter Bellman's curse of dimensionality. In this paper, they are used to approximate the Bellman operator. Since neural networks are typically trained using sample data, errors and biases may be introduced. The design of AVI accounts for implementations with biased approximations of the Bellman operator and sampling errors. We present verifiable sufficient conditions under which AVI is stable (almost surely bounded) and converges to a fixed point of the approximate Bellman operator. To ensure the stability of AVI, we present three different yet related sets of sufficient conditions that are based on the existence of an appropriate Lyapunov function. These Lyapunov function based conditions are easily verifiable and new to the literature. The verifiability is enhanced by the fact that a recipe for the construction of the necessary Lyapunov function is also provided. We also show that the stability analysis of AVI can be readily extended to the general case of set-valued stochastic approximations. Finally, we show that AVI can also be used in more general circumstances, i.e., for finding fixed points of contractive set-valued maps.

Submitted 30 May, 2021; v1 submitted 14 September, 2017; originally announced September 2017.

MSC Class: 62L20; 93E35; 37B25; 34A60; 90C39; 37C25

arXiv:1604.00151 [pdf, other]

Analysis of gradient descent methods with non-diminishing, bounded errors

Authors: Arunselvan Ramaswamy, Shalabh Bhatnagar

Abstract: The main aim of this paper is to provide an analysis of gradient descent (GD) algorithms with gradient errors that do not necessarily vanish asymptotically. In particular, sufficient conditions are presented for both stability (almost sure boundedness of the iterates) and convergence of GD with bounded, (possibly) non-diminishing gradient errors. In addition to ensuring stability, such an algorithm is shown to converge to a small neighborhood of the minimum set, which depends on the gradient errors. It is worth noting that the main result of this paper can be used to show that GD with asymptotically vanishing errors indeed converges to the minimum set. The results presented herein are not only more general when compared to previous results, but our analysis of GD with errors is new to the literature to the best of our knowledge. Our work extends the contributions of Mangasarian & Solodov, Bertsekas & Tsitsiklis and Tadic & Doucet. Using our framework, a simple yet effective implementation of GD using simultaneous perturbation stochastic approximations (SPSA), with constant sensitivity parameters, is presented. Another important improvement over many previous results is that there are no `additional' restrictions imposed on the step-sizes. In machine learning applications where step-sizes are related to learning rates, our assumptions, unlike those of other papers, do not affect these learning rates. Finally, we present experimental results to validate our theory.

Submitted 18 September, 2017; v1 submitted 1 April, 2016; originally announced April 2016.

Comments: arXiv admin note: text overlap with arXiv:1502.01953, IEEE Transactions on Automatic Control, 2017

MSC Class: 93E15; 93E35

arXiv:1504.06043 [pdf, ps, other]

Stability of Stochastic Approximations with `Controlled Markov' Noise and Temporal Difference Learning

Authors: Arunselvan Ramaswamy, Shalabh Bhatnagar

Abstract: We are interested in understanding stability (almost sure boundedness) of stochastic approximation algorithms (SAs) driven by a `controlled Markov' process. Analyzing this class of algorithms is important, since many reinforcement learning (RL) algorithms can be cast as SAs driven by a `controlled Markov' process. In this paper, we present easily verifiable sufficient conditions for stability and convergence of SAs driven by a `controlled Markov' process. Many RL applications involve continuous state spaces. While our analysis readily ensures stability for such continuous state applications, traditional analyses do not. As compared to the literature, our analysis presents a two-fold generalization: (a) the Markov process may evolve in a continuous state space and (b) the process need not be ergodic under any given stationary policy. Temporal difference learning (TD) is an important policy evaluation method in reinforcement learning. The theory developed herein is used to analyze generalized $TD(0)$, an important variant of TD. Our theory is also used to analyze a TD formulation of supervised learning for forecasting problems.

Submitted 17 May, 2018; v1 submitted 23 April, 2015; originally announced April 2015.

Comments: 18 pages

MSC Class: 62L20; 93E03; 93E35; 34A60

arXiv:1503.09105 [pdf, ps, other]

Two Timescale Stochastic Approximation with Controlled Markov noise and Off-policy temporal difference learning

Authors: Prasenjit Karmakar, Shalabh Bhatnagar

Abstract: We present for the first time an asymptotic convergence analysis of two time-scale stochastic approximation driven by `controlled' Markov noise. In particular, both the faster and slower recursions have non-additive controlled Markov noise components in addition to martingale difference noise. We analyze the asymptotic behavior of our framework by relating it to limiting differential inclusions in both time-scales that are defined in terms of the ergodic occupation measures associated with the controlled Markov processes. Finally, we present a solution to the off-policy convergence problem for temporal difference learning with linear function approximation, using our results.

Submitted 25 February, 2017; v1 submitted 31 March, 2015; originally announced March 2015.

Comments: 23 pages (relaxed some important assumptions from the previous version), accepted in Mathematics of Operations Research in Feb, 2017

arXiv:1502.01956 [pdf, ps, other]

doi 10.1080/17442508.2016.1215450

Stochastic recursive inclusion in two timescales with an application to the Lagrangian dual problem

Authors: Arunselvan Ramaswamy, Shalabh Bhatnagar

Abstract: In this paper we present a framework to analyze the asymptotic behavior of two timescale stochastic approximation algorithms including those with set-valued mean fields. This paper builds on the works of Borkar and Perkins & Leslie. The framework presented herein is more general as compared to the synchronous two timescale framework of Perkins & Leslie, however the assumptions involved are easily verifiable. As an application, we use this framework to analyze the two timescale stochastic approximation algorithm corresponding to the Lagrangian dual problem in optimization theory.

Submitted 9 October, 2015; v1 submitted 6 February, 2015; originally announced February 2015.

MSC Class: 62L20; 93E03; 93E35; 34A60

Journal ref: Stochastics 2016

arXiv:1502.01953 [pdf, ps, other]

A Generalization of the Borkar-Meyn Theorem for Stochastic Recursive Inclusions

Authors: Arunselvan Ramaswamy, Shalabh Bhatnagar

Abstract: In this paper the stability theorem of Borkar and Meyn is extended to include the case when the mean field is a differential inclusion. Two different sets of sufficient conditions are presented that guarantee the stability and convergence of stochastic recursive inclusions. Our work builds on the works of Benaim, Hofbauer and Sorin as well as Borkar and Meyn. As a corollary to one of the main theorems, a natural generalization of the Borkar and Meyn Theorem follows. In addition, the original theorem of Borkar and Meyn is shown to hold under slightly relaxed assumptions. Finally, as an application to one of the main theorems we discuss a solution to the approximate drift problem.

Submitted 27 September, 2016; v1 submitted 6 February, 2015; originally announced February 2015.

MSC Class: 62L20; 93E03; 93E35; 34A60

arXiv:1401.2086 [pdf, ps, other]

Actor-Critic Algorithms for Learning Nash Equilibria in N-player General-Sum Games

Authors: H. L Prasad, L. A. Prashanth, Shalabh Bhatnagar

Abstract: We consider the problem of finding stationary Nash equilibria (NE) in a finite discounted general-sum stochastic game. We first generalize a non-linear optimization problem from Filar and Vrieze [2004] to an $N$-player setting and break down this problem into simpler sub-problems that ensure there is no Bellman error for a given state and an agent. We then provide a characterization of solution points of these sub-problems that correspond to Nash equilibria of the underlying game and for this purpose, we derive a set of necessary and sufficient SG-SP (Stochastic Game - Sub-Problem) conditions. Using these conditions, we develop two actor-critic algorithms: OFF-SGSP (model-based) and ON-SGSP (model-free). Both algorithms use a critic that estimates the value function for a fixed policy and an actor that performs descent in the policy space using a descent direction that avoids local minima. We establish that both algorithms converge, in self-play, to the equilibria of a certain ordinary differential equation (ODE), whose stable limit points coincide with stationary NE of the underlying general-sum stochastic game. On a single state non-generic game (see Hart and Mas-Colell [2005]) as well as on a synthetic two-player game setup with $810,000$ states, we establish that ON-SGSP consistently outperforms the NashQ [Hu and Wellman, 2003] and FFQ [Littman, 2001] algorithms.

Submitted 2 July, 2015; v1 submitted 8 January, 2014; originally announced January 2014.

arXiv:1206.4832 [pdf, other]

doi 10.1145/2628434

Smoothed Functional Algorithms for Stochastic Optimization using q-Gaussian Distributions

Authors: Debarghya Ghoshdastidar, Ambedkar Dukkipati, Shalabh Bhatnagar

Abstract: Smoothed functional (SF) schemes for gradient estimation are known to be efficient in stochastic optimization algorithms, especially when the objective is to improve the performance of a stochastic system. However, the performance of these methods depends on several parameters, such as the choice of a suitable smoothing kernel. Different kernels have been studied in the literature, which include Gaussian, Cauchy and uniform distributions among others. This paper studies a new class of kernels based on the q-Gaussian distribution, which has gained popularity in statistical physics over the last decade. Though the importance of this family of distributions is attributed to its ability to generalize the Gaussian distribution, we observe that this class encompasses almost all existing smoothing kernels. This motivates us to study SF schemes for gradient estimation using the q-Gaussian distribution. Using the derived gradient estimates, we propose two-timescale algorithms for optimization of a stochastic objective function in a constrained setting with a projected gradient search approach. We prove the convergence of our algorithms to the set of stationary points of an associated ODE. We also demonstrate their performance numerically through simulations on a queuing model.

Submitted 3 July, 2014; v1 submitted 21 June, 2012; originally announced June 2012.

ACM Class: G.1.6; I.6.8

Showing 1–25 of 25 results for author: Bhatnagar, S