A Policy Gradient Primal-Dual Algorithm for Constrained MDPs with Uniform PAC Guarantees
Authors: Toshinori Kitamura, Tadashi Kozuno, Masahiro Kato, Yuki Ichihara, Soichiro Nishimori, Akiyoshi Sannai, Sho Sonoda, Wataru Kumagai, Yutaka Matsuo
Abstract:
We study a primal-dual (PD) reinforcement learning (RL) algorithm for online constrained Markov decision processes (CMDPs). Although PD-RL algorithms are widely used in practice, the existing theoretical literature on them for this problem provides only sublinear regret guarantees and does not ensure convergence to optimal policies. In this paper, we introduce a novel policy gradient PD algorithm with uniform probably approximate correctness (Uniform-PAC) guarantees, which simultaneously ensure convergence to optimal policies, sublinear regret, and polynomial sample complexity for any target accuracy. Notably, this is the first Uniform-PAC algorithm for the online CMDP problem. In addition to the theoretical guarantees, we empirically demonstrate in a simple CMDP that our algorithm converges to optimal policies, while baseline algorithms exhibit oscillatory performance and constraint violation.
Submitted 1 July, 2024; v1 submitted 31 January, 2024; originally announced January 2024.