
On the Convergence of Discounted Policy Gradient Methods

Abstract: Many popular policy gradient methods for reinforcement learning follow a biased approximation of the policy gradient known as the discounted approximation. While it has been shown that the discounted approximation of the policy gradient is not the gradient of any objective function, little else is known about its convergence behavior or properties. In this paper, we show that if the discounted approximation is followed such that the discount factor is increased slowly at a rate related to a decreasing learning rate, the resulting method recovers the standard guarantees of gradient ascent on the undiscounted objective.

Submitted 9 January, 2023; v1 submitted 28 December, 2022; originally announced December 2022.

Comments: 10 pages

arXiv:2001.01577 [pdf, other]

Learning Reusable Options for Multi-Task Reinforcement Learning

Authors: Francisco M. Garcia, Chris Nota, Philip S. Thomas

Abstract: Reinforcement learning (RL) has become an increasingly active area of research in recent years. Although there are many algorithms that allow an agent to solve tasks efficiently, they often ignore the possibility that prior experience related to the task at hand might be available. For many practical applications, it might be unfeasible for an agent to learn how to solve a task from scratch, given that it is generally a computationally expensive process; however, prior experience could be leveraged to make these problems tractable in practice. In this paper, we propose a framework for exploiting existing experience by learning reusable options. We show that after an agent learns policies for solving a small number of problems, we are able to use the trajectories generated from those policies to learn reusable options that allow an agent to quickly learn how to solve novel and related problems.

Submitted 6 January, 2020; originally announced January 2020.

Comments: 15 pages, 7 figures, pre-print

arXiv:1906.07073 [pdf, other]

Is the Policy Gradient a Gradient?

Authors: Chris Nota, Philip S. Thomas

Abstract: The policy gradient theorem describes the gradient of the expected discounted return with respect to an agent's policy parameters. However, most policy gradient methods drop the discount factor from the state distribution and therefore do not optimize the discounted objective. What do they optimize instead? This has been an open question for several years, and this lack of theoretical clarity has led to an abundance of misstatements in the literature. We answer this question by proving that the update direction approximated by most methods is not the gradient of any function. Further, we argue that algorithms that follow this direction are not guaranteed to converge to a "reasonable" fixed point by constructing a counterexample wherein the fixed point is globally pessimal with respect to both the discounted and undiscounted objectives. We motivate this work by surveying the literature and showing that there remains a widespread misunderstanding regarding discounted policy gradient methods, with errors present even in highly-cited papers published at top conferences.

Submitted 27 February, 2020; v1 submitted 17 June, 2019; originally announced June 2019.

Comments: 8 pages, 3 figures

arXiv:1906.03063 [pdf, ps, other]

Classical Policy Gradient: Preserving Bellman's Principle of Optimality

Authors: Philip S. Thomas, Scott M. Jordan, Yash Chandak, Chris Nota, James Kostas

Abstract: We propose a new objective function for finite-horizon episodic Markov decision processes that better captures Bellman's principle of optimality, and provide an expression for the gradient of the objective.

Submitted 6 June, 2019; originally announced June 2019.

Comments: 1 page, 0 figures

arXiv:1906.01770 [pdf, other]

Lifelong Learning with a Changing Action Set

Authors: Yash Chandak, Georgios Theocharous, Chris Nota, Philip S. Thomas

Abstract: In many real-world sequential decision making problems, the number of available actions (decisions) can vary over time. While problems like catastrophic forgetting, changing transition dynamics, changing reward functions, etc. have been well-studied in the lifelong learning literature, the setting where the action set changes remains unaddressed. In this paper, we present an algorithm that autonomously adapts to an action set whose size changes over time. To tackle this open problem, we break it into two problems that can be solved iteratively: inferring the underlying, unknown, structure in the space of actions and optimizing a policy that leverages this structure. We demonstrate the efficiency of this approach on large-scale real-world lifelong learning problems.

Submitted 10 May, 2020; v1 submitted 4 June, 2019; originally announced June 2019.

Comments: Thirty-fourth Conference on Artificial Intelligence (AAAI 2020) [Outstanding Student Paper Honorable Mention]

arXiv:1902.05650 [pdf, other]

Asynchronous Coagent Networks

Authors: James E. Kostas, Chris Nota, Philip S. Thomas

Abstract: Coagent policy gradient algorithms (CPGAs) are reinforcement learning algorithms for training a class of stochastic neural networks called coagent networks. In this work, we prove that CPGAs converge to locally optimal policies. Additionally, we extend prior theory to encompass asynchronous and recurrent coagent networks. These extensions facilitate the straightforward design and analysis of hierarchical reinforcement learning algorithms like the option-critic, and eliminate the need for complex derivations of customized learning rules for these algorithms.

Submitted 10 August, 2020; v1 submitted 14 February, 2019; originally announced February 2019.

Comments: Updated version

Showing 1–6 of 6 results for author: Nota, C