1 INTRODUCTION

Sampling-based Safe Reinforcement Learning for
Nonlinear Dynamical Systems

Wesley A. Suttle Vipul K. Sharma Krishna C. Kosaraju

U.S. Army Research Laboratory Purdue University Clemson University

S. Sivaranjani Ji Liu Vijay Gupta Brian M. Sadler

Purdue University Stony Brook University Purdue University U.S. Army Research Laboratory

Abstract

We develop provably safe and convergent reinforcement learning (RL) algorithms for control of nonlinear dynamical systems, bridging the gap between the hard safety guarantees of control theory and the convergence guarantees of RL theory. Recent advances at the intersection of control and RL follow a two-stage, safety filter approach to enforcing hard safety constraints: model-free RL is used to learn a potentially unsafe controller, whose actions are projected onto safe sets prescribed, for example, by a control barrier function. Though safe, such approaches lose any convergence guarantees enjoyed by the underlying RL methods. In this paper, we develop a single-stage, sampling-based approach to hard constraint satisfaction that learns RL controllers enjoying classical convergence guarantees while satisfying hard safety constraints throughout training and deployment. We validate the efficacy of our approach in simulation, including safe control of a quadcopter in a challenging obstacle avoidance problem, and demonstrate that it outperforms existing benchmarks.

1 INTRODUCTION

Learning-based methods for safe control of physical systems have been gaining increasing attention (Brunke et al., 2022). RL is especially powerful for the control of systems where performance feedback in the form of a scalar reward is available, but the dynamics are unknown (Sutton and Barto, 2018). In such settings, RL methods can learn a controller maximizing reward through direct interaction with the environment. However, due to physical realities such as the need to guarantee safety, practical application of RL to control of physical systems requires constraints on the control policies throughout training (Garcıa and Fernández, 2015). While directly constraining the action space to a static, narrowly defined set of “safe” actions is frequently employed in practice, this can lead to learning highly suboptimal policies and more nuanced methods are therefore required. Furthermore, in most physical systems it is non-trivial to directly translate complex safety constraints on the states into allowable actions.

A variety of RL approaches to the problem of safe learning for control have been proposed in the literature (see Brunke et al. (2022) for a comprehensive survey), including RL methods for safety-focused problems formulated as constrained Markov decision processes (CMDPs) (Altman, 2021), methods for learning to achieve safety through stability (Berkenkamp et al., 2017), and projection-based – also known as “safety filter” – RL methods for maintaining hard safety constraints, typically achieved through the use of control barrier functions (CBFs) (Cheng et al., 2019). Though CMDP-based methods enjoy convergence guarantees, they encourage safety without guaranteeing it, and cannot provide guarantees for hard safety constraints commonly required in physical systems. Likewise, methods like Berkenkamp et al. (2017) do better by offering high-probability safety assurances, but stop short of guaranteeing safety. In systems where safety is critical, methods like Cheng et al. (2019) that provably guarantee hard constraint satisfaction are necessary. However, the interaction between imposition of hard constraints and optimality of the resulting control policies is a subtle issue in RL. While projection-based safety-filter approaches (Wabersich et al., 2023) provably guarantee safety, the projection procedure undermines any convergence guarantees enjoyed by the underlying RL methods.

In this work, we develop a class of model-free policy gradient methods that maintain safety or other stability properties by sampling directly from the set of state-dependent safe actions. The key to our approach is that we consider truncated versions of commonly used stochastic policies, allowing us to sample directly from the safe action set at each state. This allows us to recover convergence guarantees by extending existing results for policy gradient methods to truncated policies. Our approach is applicable to a wide class of safety constraints including control barrier functions (CBFs), that enforce forward invariance of a set characterized by nonlinearly coupled states and actions (Ames et al., 2019, 2016), and reachability-type constraints (Wabersich et al., 2023). In addition to our theoretical results, we experimentally validate the practical utility of sampling-based safety-preservation methods by considering a special case: Beta policies with state-dependent, control barrier function (CBF)-constrained action sets. This novel approach extends the Beta policies in Chou et al. (2017) to the state-dependent action constraint setting. Finally, we train the resulting CBF-constrained Beta policies using PPO to solve a safety-constrained inverted pendulum problem as well as a quadcopter navigation and obstacle avoidance problem, and compare the latter to a safety filter-based benchmark.¹¹1Our implementation is publicly available at https://github.com/sharma1256/cbf-constrained_ppo. These case studies illustrate that our method simultaneously guarantees safety throughout training and guarantees optimality, even where existing benchmarks fail.

1.1 Related Work

Safety and stability have seen a great deal of interest in recent years at the intersection of the RL and control communities (see Brunke et al. (2022); Garcıa and Fernández (2015) for overviews). We are interested in safety definitions that impose hard constraints on the states and control actions (rather than, e.g., those used in robust RL (Wiesemann et al., 2013; Aswani et al., 2013) or RL for CMDPs (Achiam et al., 2017; Paternain et al., 2019; Ma et al., 2021; Bai et al., 2022)). Model-based methods for guaranteeing stability using RL controllers in systems with known or learnable dynamics have been developed in Berkenkamp et al. (2017); Fazel et al. (2018); Zhang et al. (2021). Recently, techniques leveraging control barrier functions to maintain safety (Cheng et al., 2019) and dissipativity (Kosaraju et al., 2021) have been developed.

Our work lies in the model-free RL setting. The two dominant approaches in model-free RL are value function and policy gradient-based methods (Sutton and Barto, 2018). We focus on the latter in this paper. Since their origins early in the development of RL (Sutton et al., 2000; Borkar, 2005; Bhatnagar et al., 2009), policy gradient methods have become the model-free algorithms of choice for complex problems with continuous, high-dimensional state and action spaces (Lillicrap et al., 2015; Schulman et al., 2017; Haarnoja et al., 2018). Recent works have improved our understanding of gradient estimation procedures, global optimality properties, and convergence rates of these algorithms (Agarwal et al., 2020; Zhang et al., 2020; Suttle et al., 2023). Popular approaches for safety in model-free RL include using bounds resulting from Gaussian process models (Schreiter et al., 2015; Rasmussen, 2003; Sui et al., 2015), reward-sha**, constrained policy optimization (Achiam et al., 2017; Wachi et al., 2018), and teacher advice (Abbeel and Ng, 2004). Our work is most closely related to those approaches that use a hard safe set specification and constraints on control inputs, e.g., control barrier functions (Cheng et al., 2019; Fisac et al., 2018; Li et al., 2018; Kosaraju et al., 2021). In particular, our key contribution is a model-free safe RL algorithm with convergence guarantees and provable safety guarantees under hard constraints like CBFs, even during training.

2 PROBLEM SETTING

Consider a discounted MDP $(\mathcal{X},\mathcal{U},\mathcal{P},r,\gamma)$ , where $\mathcal{X}\subseteq\mathbb{R}^{m}$ is the state space, $\mathcal{U}\subseteq\mathbb{R}^{n}$ is the action space, $\mathcal{P}(\cdot|x,u)$ is the transition probability function given action $u$ is taken in state $x$ , $r:\mathcal{X}\times\mathcal{U}\rightarrow\mathbb{R}$ is the reward function, and $\gamma\in[0,1]$ is the discount factor. The MDP, which can be used to model a wide array of discrete-time systems, proceeds as follows: at time $k$ , the system is in state $x_{k}$ ; a control input $u_{k}$ is applied to the system; a reward $r(x_{k},u_{k})$ is received; the system transitions into state $x_{k+1}$ according to the distribution $P(\cdot|x_{k},u_{k})$ . The goal in this problem formulation is to maximize the expected discounted reward, which we define in (3) below. Note that deterministic dynamics can be recovered by imposing that, for each $x\in\mathcal{X},u\in\mathcal{U}$ , there exists $x^{\prime}\in\mathcal{X}$ such that $\mathcal{P}(x^{\prime}|x,u)=1$ . This is useful for modeling discretizations of continuous-time control problems, for example. We assume throughout this paper that the dynamics are deterministic in this way, which is a common setting in safe control problems. Let $\mathcal{T}:\mathcal{X}\times\mathcal{U}\rightarrow\mathcal{X}$ represent the dynamics of the MDP, i.e., given state $x$ and control input $u$ , $\mathcal{T}(x,u)$ denotes the state the system transitions into when input $u$ is applied while in state $x$ .

Letting $\Delta(\mathcal{U})$ denote the set of all probability distributions over the set $\mathcal{U}$ , a stochastic policy $\pi:\mathcal{X}\rightarrow\Delta(\mathcal{U})$ is a function map** states to probability distributions over the action space $\mathcal{U}$ . In other words, given a state $x$ , an agent using policy $\pi$ will choose a control action $u\in\mathcal{U}$ by sampling $u\sim\pi(\cdot|s)$ . For our purposes it will be useful to consider policies $\pi_{\theta}$ parameterized by $\theta\in\Theta\subseteq\mathbb{R}^{k}$ , for some $k\ll|\mathcal{X}|\cdot|\mathcal{U}|=m\cdot n$ , where $\Theta$ is a compact set of permissible parameters.

Let $\mathcal{S}\subset\mathcal{X}$ denote some “safe” or stable set within which we wish to keep the system. Furthermore, let $\mathbb{P}(S)$ denote the powerset of a set $S$ , and consider a set-valued function $C:\mathcal{X}\rightarrow\mathbb{P}(\mathcal{U})$ given by

C(x)=\{u\in\mathcal{U}\ |\ \mathcal{T}(x,u)\in\mathcal{S}\}.

(1)

Intuitively, $C(x)$ is the set of all control inputs which when applied at state $x$ keep the system within the safe set at the next time step. We assume throughout that, for a given $x\in\mathcal{X}$ , $C(x)$ is known. Since our primary focus is resolving the open problem of simultaneously guaranteeing convergence and hard safety constraint satisfaction, we leave the issue of learning or approximating $C(x)$ while maintaining these guarantees to future work. The general formulation can be used to accommodate a variety of notions of safety, including forward invariance, stability, and dissipativity enforced by, for example, CBFs and exponential CBFs (ECBFs), and control Lyapunov functions (CLFs) (see the supplementary material for an overview and Ames et al. (2019) for a comprehensive survey). As we will demonstrate in the case studies below, the use of our method in conjunction with (E)CBFs is particularly natural to provide guarantees in problems with hard safety constraints. To ensure that we can sample from $C(x)$ and integrals over $C(x)$ are well-defined, we make the following assumption. Let $\mu$ denote the Lebesgue measure.

Assumption 1.

There exist $m,M>0$ such that $m\leq\mu(C(x))\leq M$ , for all $x\in\mathcal{X}$ . Furthermore, $\cup_{x\in\mathcal{X}}C(x)$ is compact.

Given a policy $\pi_{\theta}$ , consider the distribution $\pi_{\theta}^{C}(\cdot|x)$ obtained by truncating $\pi_{\theta}(\cdot|x)$ to the set $C(x)$ . More precisely:

\pi^{C}_{\theta}(u|x)=\begin{cases}\frac{\pi_{\theta}(u|x)}{\pi_{\theta}(C(x)|% x)}&u\in C(x)\\ 0&u\notin C(x),\end{cases}

(2)

where $\pi_{\theta}(C(x)|x)=\int_{C(x)}\pi_{\theta}(u|x)du$ . As long as we can check membership in $C(x)$ for any given $u\in\mathcal{U}$ , and assuming that the volume of $C(x)$ is strictly positive, for all $x\in\mathcal{X}$ , we can generate from this distribution by using rejection sampling, i.e. repeatedly sampling $u\sim\pi_{\theta}(\cdot|x)$ until we obtain $u\in C(x)$ . Note that, depending on the structure of parametrized policies $\pi^{C}_{\theta}$ , if $C(x)$ has a particularly nice form, such as an interval or hyperrectangle, there may be more efficient methods than rejection sampling for sampling from the truncated distribution $\pi^{C}_{\theta}(\cdot|x)$ directly. We exploit this fact when leveraging Beta policies in the experimental results of Section 4 below.

With this setup in mind, and given a fixed start state $x_{0}$ , we propose a policy gradient-based algorithm maximizing the objective function

J(\theta)=\mathbb{E}_{\pi_{\theta}^{C}}\left[\sum_{k=0}^{\infty}\gamma^{k}r(x_% {k},u_{k})\ \Big{|}\ x_{0}\right],

(3)

the expected discounted reward under policy $\pi^{C}_{\theta}$ . Before proceeding with describing and analyzing the algorithm, we first need to identify conditions that ensure that, for each policy parameter $\theta$ , taking expectations with respect to $\pi_{\theta}^{C}$ is well-defined and thus meaningful. In order for (3) to be well-defined, we need to know that, for each policy parameter $\theta$ , the occupancy measure of the Markov chain induced by $\pi_{\theta}^{C}$ on $\mathcal{S}$ is irreducible and satisfies certain ergodicity conditions. Once these are proven, we will be justified in performing gradient ascent on the objective function (3).

3 THEORETICAL RESULTS

In this section we develop the theory underlying our sampling-based method for RL with hard safety constraints. Our key contributions include proving that (3) is well-defined (§3.1), obtaining gradient expressions for it from which we can sample (§3.2), and develo** and establishing the convergence of a policy gradient algorithm for optimizing (3) (§3.3 and §3.4). All proofs are deferred to the supplementary material. It is important to note that, though we assumed the deterministic dynamics common to safe control in §2, all our theoretical results go through in the stochastic dynamics case under standard ergodicity assumptions. Our key theoretical contribution in what follows is to show that, even in the deterministic dynamics case, we can ensure that the objective is well-defined (§3.1) and obtain convergence (§3.4).

3.1 Discounted Return is Well-defined

First, we show that the objective (3) is well-defined when using truncated policies, even in continuous spaces systems with deterministic dynamics. Our key contribution in this setting is to ensure that, given reasonable conditions on the policies under consideration, important ergodicity properties of their induced Markov chains hold. This fact, established in Proposition 1 and Corollary 1, is nontrivial and its proof relies on a careful analysis of the propagation of probability mass through the transition dynamics and an interesting application of the Lebesgue-Radon-Nikodym Theorem (Folland, 1999, §3.2). As in the previous section, let $\mathcal{S}\subset\mathbb{R}^{m}$ denote the “safe” set within which the system remains so long as all control inputs are selected from $C(x)$ . Though we leave open the possibility that $\mathcal{S}$ satisfies a more specific stability conditions rather than a generic notion of “safety”, we will typically use the term “safe” for ease of presentation. Let $\mu$ denote Lebesgue measure. We make the following definition:

Definition 1.

The Markov chain $\{x_{k}\}_{k\in\mathbb{N}}$ induced by $\pi^{C}_{\theta}$ on $\mathcal{X}$ is $\mu$ -irreducible on $\mathcal{S}$ if, for any $\mu$ -measurable $\mathcal{B}\subset\mathcal{S}$ , if $\mu(\mathcal{B})>0$ , then $\sum_{k\in\mathbb{N}}P(x_{k}\in\mathcal{B}\ |\ x_{0}=x)>0$ , for all $x\in\mathcal{S}$ .

This means that, for a Markov chain to be ( $\mu$ -)irreducible on the safety set, all safe subsets with positive volume must be reachable from any initial safe state with positive probability. Notice that $\{x_{k}\}_{k\in\mathbb{N}}$ is in fact a Markov chain on the safe set $\mathcal{S}$ since, by the definition of $C(x)$ , only those control inputs kee** the system within $\mathcal{S}$ are allowed. In the sequel, we will prove that, for each $\theta$ , under suitable conditions the Markov chain induced by $\pi_{\theta}^{C}$ on $\mathcal{S}$ is irreducible and the objective (3) is thus well-defined, which is a prerequisite for develo** policy gradient methods based on it. See (Konda, 2002, §2.3) for details on irreducibility in this setting.

Given an element $x\in\mathcal{S}$ and dynamics $\mathcal{T}$ , let $R(x)\subset\mathcal{S}$ consisting of all elements reachable in one step from $x$ under $\mathcal{T}$ . Furthermore, for $\mathcal{A}\subset\mathcal{S}$ , define $R(\mathcal{A})=\cup_{x\in\mathcal{A}}R(x)$ . Also, given $\varepsilon>0$ and $x\in\mathbb{R}^{m}$ , let $B_{\varepsilon}(x)$ denote the open ball of radius $\varepsilon$ centered at $x$ . Finally, for $\mathcal{A}\subset\mathcal{S}$ , define $\mathcal{T}^{-1}_{x}(\mathcal{A}):=\{u\in\mathcal{U}\ |\ T(x,u)\in\mathcal{A}\}.$ Intuitively, $\mathcal{T}^{-1}_{x}(\mathcal{A})$ is the set of all control inputs that, when taken in state $x$ , drive the system into $\mathcal{A}$ . The following assumptions are needed in what follows.

Assumption 2.

For any $x\in\mathcal{S}$ and any $\mu$ -measurable set $\mathcal{A}\subset R(x)$ , $\mu(\mathcal{A})>0$ if and only if $\mu(\mathcal{T}^{-1}_{x}(\mathcal{A}))>0$ .

Assumption 2 ensures the system dynamics map positive volume subsets of control inputs to positive volume subsets of the state space and vice versa, which is important for our application of the Lebesgue-Radon-Nikodym Theorem in Proposition 1. It is satisfied by systems where control inputs have a measurable effect on each entry in the next state vector and thus encompasses a wide array of potentially nonlinear systems.

Assumption 3.

For any $\theta\in\Theta$ , where $\Theta$ is the set of permissible policy parameters, for any element in the safe set $x\in\mathcal{S}$ , and for any set $\mathcal{A}\subset C(x)$ satisfying $\mu(\mathcal{A})>0$ , the policy $\pi^{C}_{\theta}(\cdot|x)$ assigns positive probability to $\mathcal{A}$ , i.e. $\int_{\mathcal{A}}\pi^{C}_{\theta}(a|x)da>0$ .

Assumption 3, which is standard in the RL literature, ensures that any set of allowable control inputs that has strictly positive volume will be sampled from with strictly positive probability.

Assumption 4.

For each $x\in\mathcal{S}$ , $\mu(R(x))>0$ , and, given $\mathcal{B}\subset\mathcal{S}$ , there exists $n\in\mathbb{N}$ such that $\mathcal{B}$ is reachable in $n$ steps from $x$ .

The conditions imposed in Assumption 4 guarantee that, for any state $x\in\mathbb{X}$ : (i) the set of states reachable from $x$ in one step has strictly positive volume; (ii) any subset of the safe set $\mathcal{S}$ is reachable in at most $n$ steps from $x$ . These conditions are closely related to the familiar notion of controllability of control theory. Under these conditions, we have the following proposition and its immediate corollary.

Proposition 1.

Under Assumptions 2, 3, 4, for given $\theta$ and any subset $\mathcal{B}\subset\mathcal{S}$ satisfying $\mu(\mathcal{B})>0$ , the Markov chain induced by $\pi^{C}_{\theta}$ on $\mathcal{S}$ enters $\mathcal{B}$ with strictly positive probability.

Corollary 1.

$\{x_{n}\}$ is $\mu$ -irreducible on $\mathcal{S}$ .

Now that we are assured that the objective function (3) is well-defined, we are justified in attempting to perform gradient ascent on it. In order to accomplish this, however, we need access to gradient estimates. This is the subject of the next section.

3.2 Policy Gradients

Despite the presence of $C$ in $\pi_{\theta}^{C}$ , under mild assumptions on the underlying policy $\pi_{\theta}$ , we can apply the classic policy gradient theorem of Sutton et al. (2000) to (3) to obtain a gradient expression from which we can sample. Let $d_{\theta}^{C}(\cdot):=(1-\gamma)\sum_{k=0}^{\infty}\gamma^{t}P(x_{k}\in\cdot% \ |\ \pi_{\theta}^{C})$ denote the discounted state occupancy measure of the Markov chain induced by policy $\pi_{\theta}^{C}$ on $\mathcal{S}$ . Furthermore, let $Q^{\pi_{\theta}^{C}}(x,u)=\mathbb{E}_{\pi_{\theta}^{C}}\left[\sum_{k=0}^{% \infty}\gamma^{k}r(x_{k},u_{k})\ |\ x_{0}=x,u_{0}=u\right].$ We make the following assumption:

Assumption 5.

$\pi_{\theta}(u|x)>0$ and $\pi_{\theta}(u|x)$ is differentiable in $\theta$ , for all $x\in\mathcal{X},u\in\mathcal{U}$ .

Recall from (2) that $\pi^{C}_{\theta}(\cdot|x)$ is simply the probability density function $\pi_{\theta}(\cdot|x)$ truncated to the set $C(x)$ . Note, since the value of $C(x)$ at a given $x$ is independent of $\theta$ , we can take the derivative inside the integral sign in the latter expression to obtain $\nabla\pi_{\theta}(C(x)|x)=\int_{C(x)}\nabla\pi_{\theta}(u|x)du$ , so $\pi^{C}_{\theta}(C(x)|x)$ is differentiable. Given these facts, combined with Assumption 5, the above expression for $\pi^{C}_{\theta}$ implies that, for any $x\in\mathcal{S}$ , the policy $\pi^{C}_{\theta}(u|x)$ is differentiable with respect to $\theta$ , for any $u\in C(x)$ . In short, $\pi^{C}_{\theta}$ satisfies its own version of Assumption 5, which we formalize in the following:

Lemma 1.

$\pi^{C}_{\theta}(u|x)>0$ and $\pi^{C}_{\theta}(u|x)$ is differentiable in $\theta$ , for all $x\in\mathcal{S}$ and $u\in C(x)$ .

The policy gradient theorem (Konda, 2002) implies

\nabla J(\theta)=\frac{1}{1-\gamma}\mathbb{E}_{\pi^{C}_{\theta}}\left[Q^{\pi_{% \theta}^{C}}(x,u)\nabla\log\pi_{\theta}^{C}(u|x)\right].

(4)

In order to carry out gradient updates based on this expression, we first need to be able to estimate $\nabla_{\theta}\log\pi^{C}_{\theta}(u|x)=\nabla_{\theta}\pi^{C}_{\theta}(u|x)/% \pi^{C}_{\theta}(u|x)$ , for arbitrary $u,x$ . We will discuss how to estimate $Q^{\pi^{C}_{\theta}}(x,u)$ in an unbiased manner in the following section. Since we already have access to $\pi^{C}_{\theta}(u|x)$ , we can focus on estimating $\nabla_{\theta}\pi^{C}_{\theta}(u|x)$ . Based on (2), the gradient of $\pi^{C}_{\theta}(u|x)$ with respect to $\theta$ is

$\displaystyle\nabla$	$\displaystyle\pi^{C}_{\theta}(u\|x)=\nabla\left[\frac{\pi_{\theta}(u\|x)}{\pi_{% \theta}(C(x)\|x)}\right]$	(5)
	$\displaystyle=\frac{\nabla\pi_{\theta}(u\|x)}{\pi_{\theta}(C(x)\|x)}-\frac{\pi_{% \theta}(u\|x)}{[\pi_{\theta}(C(x)\|x)]^{2}}\nabla\pi_{\theta}(C(x)\|x)$	(6)
	$\displaystyle=\frac{1}{\pi_{\theta}(C(x)\|x)}\left[\nabla\pi_{\theta}(u\|x)-\pi_% {\theta}(u\|x)\nabla\log\pi_{\theta}(C(x)\|x)\right].$	(7)

To estimate $\pi_{\theta}(C(x)|x)$ , we need to be able to estimate $\int_{C(x)}\pi_{\theta}(u|x)du$ . Given access to $\pi_{\theta}(\cdot|x)$ and $C(x)$ , we can use numerical integration or Monte Carlo techniques to approximate this integral. The standard Monte Carlo approach is to uniformly sample $M$ elements $u_{i}\sim U(C(x))$ from $C(x)$ , then estimate

\widehat{\pi_{\theta}}(C(x)|x)=\mu(C(x))\frac{1}{M}\sum_{i=1}^{M}\pi_{\theta}(% u_{i}|x),

(8)

where $\mu(C(x))$ is the volume of $C(x)$ . This estimate is based on the fact that

$\displaystyle\pi_{\theta}$	$\displaystyle(C(x)\|x)=\int_{C(x)}\pi_{\theta}(u\|x)du$	(9)
	$\displaystyle=\mu(C(x))\int_{C(x)}\frac{\pi_{\theta}(u\|x)}{\mu(C(x))}du$	(10)
	$\displaystyle=\mu(C(x))E_{u\sim U(C(x))}[\pi_{\theta}(u\|x)]$	(11)
	$\displaystyle=\mu(C(x))\lim_{M\rightarrow\infty}\frac{1}{M}\sum_{i=1}^{M}\pi_{% \theta}(u_{i}\|x),\normalsize$	(12)

where the last equality holds by the law of large numbers. Since $C(x)$ is fixed given $x$ , gradient estimates $\widehat{\nabla\log\pi_{\theta}}(C(x)|x)$ and ultimately $\widehat{\nabla\log\pi^{C}_{\theta}}(u|x)$ can also be obtained by estimating the integral $\int_{C(x)}\nabla\pi_{\theta}(u|x)du$ . In the Monte Carlo situation, this can be obtained from (8) by differentiating each term with respect to $\theta$ .

3.3 Algorithm

In this section, we present a hard safety-constrained random-horizon policy gradient (Safe-RPG) algorithm. Our algorithm is based on the random-horizon policy gradient (RPG) scheme developed in Zhang et al. (2020), which uses a random rollout horizon and recent advances in non-convex optimization to obtain unbiased policy gradient estimates and ensure finite-time convergence to approximately locally optimal policies. As discussed in the following section, our convergence results ensure asymptotic convergence of Algorithm 2 to a stationary point of (3), but can likely be strengthened to prove finite-time convergence to approximately locally optimal policies. The main algorithm is presented in Algorithm 2, which depends on the action-value function estimation subroutine in Algorithm 1.

3.4 Convergence

In this section we show asymptotic convergence of Algorithm 2 to the set of stationary points of (3). The key challenge in this result revolves around the need to establish that the policies we consider satisfy important differentiability and continuity properties, which necessitates a careful analysis of the Lipschitz properties of the score functions of our truncated policies in the proof of Lemma 2. To proceed, we need the following assumption on the reward function $r$ and underlying, untruncated policy class $\{\pi_{\theta}\}_{\theta\in\Theta}$ .

Assumption 6.

The reward function $r$ and parameterized policy class $\{\pi_{\theta}\}_{\theta\in\Theta}$ satisfy the following:

1.

The absolute value of the reward $r$ is uniformly bounded, i.e., there exists $U_{r}$ such that $0\leq\sup_{(x,u)\in\mathcal{X}\times\mathcal{U}}|r(x,u)|\leq U_{r}$ .
2.
For all $x\in\mathcal{X},u\in\mathcal{U}$ , $\nabla\log\pi_{\theta}(u|x)$ exists, and there exist $L_{\Theta}\geq 0$ and $B_{\Theta}\geq 0$ such that, for all $x\in\mathcal{X},u\in\mathcal{U}$ ,
1. (a)
  
  $\left\|\nabla\log\pi_{\theta}(u|x)-\nabla\log\pi_{\theta^{\prime}}(u|x)\right% \|{\leq}L_{\Theta}\left\|\theta-\theta^{\prime}\right\|$ , for all $\theta,\theta^{\prime}\in\Theta$
2. (b)
  
  $\left\|\nabla\log\pi_{\theta}(u|x)\right\|\leq B_{\Theta}$ , for all $\theta\in\Theta$ .

Data:

x,u,\theta

Result: Unbiased estimate of

Q^{\pi^{C}_{\theta}}(x,u)

1 Initialization: Sample

T\sim\text{Geom}(1-\gamma^{1/2})

and initialize

\hat{Q}\leftarrow 0,x_{0}\leftarrow x,u_{0}\leftarrow u

2 for $t=0,\ldots,T-1$ do

\hat{Q}\leftarrow\hat{Q}+\gamma^{t/2}r(x_{t},u_{t})

x_{t+1}\sim\mathcal{P}(\cdot|x_{t},u_{t})

u_{t+1}\sim\pi^{C}_{\theta_{k}}(\cdot|x_{t+1})

6 end for

\hat{Q}\leftarrow\hat{Q}+\gamma^{T/2}r(x_{T},u_{T})

return $\hat{Q}$

Algorithm 1 EstQ: Unbiasedly Estimating

Q

Data:

x_{0},\theta_{0}

, Monte Carlo sample size

M

Result: Locally optimal policy

1 Initialization: Set

k\leftarrow 0

2 repeat

3 Sample

T_{k+1}\sim\text{Geom}(1-\gamma)

u_{0}\sim\pi^{C}_{\theta_{k}}(\cdot|x_{0})

4 for $t=0,\ldots,T_{k+1}-1$ do

x_{t+1}\sim\mathcal{P}(\cdot|x_{t},u_{t})

u_{t+1}\sim\pi^{C}_{\theta_{k}}(\cdot|x_{t+1})

7 end for

\hat{Q}^{\pi^{C}_{\theta_{k}}}(x_{T_{k+1}},u_{T_{k+1}})=\texttt{EstQ}(x_{T_{k+% 1}},u_{T_{k+1}},\theta_{k})

9 Uniformly sample

\{u_{l}\}_{l=1,\ldots,M}

from

C(x_{T_{k+1}})

, then use them to compute

\widehat{\nabla\log\pi^{C}_{\theta_{k}}}(u_{T_{k+1}}|x_{T_{k+1}})

\theta_{k+1}\leftarrow\theta_{k}+\frac{\alpha_{k}}{1-\gamma}\hat{Q}^{\pi^{C}_{% \theta_{k}}}(x_{T_{k+1}},u_{T_{k+1}})\widehat{\nabla\log\pi^{C}_{\theta_{k}}}(% u_{T_{k+1}}|x_{T_{k+1}})

k\leftarrow k+1

13until convergence;

Algorithm 2 Safe-RPG: Hard safety-constrained Random-horizon Policy Gradient

Assumptions 5 and 6 were used to prove asymptotic convergence of the RPG algorithm with untruncated policies to stationary points in (Zhang et al., 2020, Theorem 4.4). For an analogous result to apply to the truncated policies we consider, it must be shown that the Lipschitz and differentiability conditions in part 2 of Assumption 6 hold for the constrained policies $\{\pi^{C}_{\theta}\}_{\theta\in\Theta}$ . It turns out that, under the same conditions on the untruncated policy $\{\pi_{\theta}\}_{\theta\in\Theta}$ , these properties are automatically satisfied for $\{\pi^{C}_{\theta}\}_{\theta\in\Theta}$ .

Lemma 2.

Under Assumptions 1, 5, and 6, $\nabla\log\pi^{C}_{\theta}(u|x)$ exists, for all $x\in\mathcal{X},u\in\mathcal{U}$ . Furthermore, there exist constants $L^{C}_{\Theta}\geq 0$ and $B^{C}_{\Theta}\geq 0$ such that, for all $x\in\mathcal{X},u\in\mathcal{U}$ ,
(i) $\left\|\nabla\log\pi^{C}_{\theta}(u|x)-\nabla\log\pi^{C}_{\theta^{\prime}}(u|x% )\right\|\leq L^{C}_{\Theta}\left\|\theta-\theta^{\prime}\right\|$ , for all $\theta,\theta^{\prime}\in\Theta$ , and
(ii) $\left\|\nabla\log\pi^{C}_{\theta}(u|x)\right\|\leq B^{C}_{\Theta}$ , for all $\theta\in\Theta$ .

With Lemma 2 in hand, we have the following result.

Theorem 1.

Let Assumptions 3, 4, 5, and 6 hold. Let $\{\theta_{k}\}_{k\in\mathbb{N}}$ be the sequence generated by Algorithm 2 with stepsize sequence $\{\alpha_{k}\}_{k\in\mathbb{N}}$ satisfying $\sum_{k=0}^{\infty}\alpha_{k}=\infty$ and $\sum_{k=0}^{\infty}\alpha_{k}^{2}<\infty$ . Then $\lim_{k}\theta_{k}\in\Theta^{*}$ , where $\Theta^{*}$ is the set of stationary points of (3).

Remark 1.

By arguments analogous to those in the proof of (Zhang et al., 2019, Thm. 3), it can also be shown that, under the same assumptions as in Theorem 1 and appropriate stepsize selection, Algorithm 2 achieves $\varepsilon$ -approximate first-order stationarity with a finite-time sample complexity of $\mathcal{O}(\varepsilon^{-2})$ .

Given Lemma 2, the proof of the theorem follows directly from that of (Zhang et al., 2020, Theorem 4.4). With suitable modifications to Algorithm 2 incorporating periodically increasing stepsizes, these results can likely be strengthened to obtain finite-time convergence to an $\varepsilon$ -locally optimal policy using the machinery developed in Zhang et al. (2020). We leave this to future work.

4 EXPERIMENTAL RESULTS

We now experimentally demonstrate the effectiveness of our sampling-based safe RL approach. Specifically, we evaluate the use of CBF-constrained Beta policies combined with the popular Proximal Policy Optimization (PPO) (Schulman et al., 2017) algorithm on safety-constrained inverted pendulum and quadcopter navigation environments. The use of Beta policies with variable action space constraints allows us to directly sample from a CBF-constrained action space at each timestep. In addition to providing a practical example of how truncated policies can be used to ensure safety, this method extends the work Chou et al. (2017) on the use of Beta policies for deep RL from constant to state-dependent action space constraints.

Refer to caption — Figure 1: Safety and convergence of CBF-Constrained Beta policies: Agent was trained with PPO on the quadrotor navigation problem with an obstacle. Safety was maintained and goal was eventually reached.

4.1 CBF-Constrained Beta Policies

When actions must be restricted to lie within fixed, predetermined bounds due to physical or numerical constraints, the common practice of simply clip** policies with infinite support (e.g., Gaussian policies) can cause bias and performance issues. Chou et al. (2017) propose and leverage finite-support Beta distribution-based policies to overcome these issues. We extend this approach to obtain policies that sample directly from the safe control actions prescribed by the CBF at a given state.

In order to describe these CBF-constrained Beta policies, let us first recall the probability density function (p.d.f.) of a one-dimensional Beta distribution:

f(u;\alpha,\beta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}u^{% \alpha-1}(1-u)^{\beta-1},

(13)

where $u\in[0,1]$ , $\alpha,\beta>0$ , and $\Gamma(z)=\int_{0}^{\infty}u^{z-1}e^{-u}du$ is the Gamma function defined for $z\in\mathbb{C}$ with $\text{Re}(z)>0$ . A Beta policy sampling from the fixed interval $[0,1]$ is given by $f(u;\alpha_{\theta}(x),\beta_{\theta}(x))$ , where $\alpha_{\theta},\beta_{\theta}:\mathcal{X}\rightarrow\mathbb{R}^{+}$ are parameterized functions (e.g., neural networks) map** states to the parameters $\alpha,\beta$ of the Beta distribution. When the action space $\mathcal{U}\subset\mathbb{R}^{n}$ is of dimension $n\geq 2$ and the CBF constraint set (or an inner approximation of it), $C(x)$ , can be expressed as a hyperrectangle with lower and upper bounds $a(x),b(x)\in\mathbb{R}^{n}$ , respectively, we maintain independent Beta distributions, $f^{i}(\cdot;\alpha^{i}_{\theta}(x),\beta^{i}_{\theta}(x))$ , over each dimension $i$ of the unit box $[0,1]^{n}$ , and samples from these distributions are shifted and rescaled to lie within the bounds given by $a(x),b(x)$ . Specifically, our CBF-constrained Beta policies, denoted $\pi_{\theta}$ , sample $u\sim\pi_{\theta}(\cdot|x)$ from $C(x)$ by first sampling $\hat{u}^{i}\sim f^{i}(\cdot;\alpha^{i}_{\theta}(x),\beta^{i}_{\theta}(x))$ , then performing the simple transformation $u=a(x)+\text{diag}(\hat{u}^{1},\ldots,\hat{u}^{n})(b(x)-a(x))$ , where $\text{diag}(\hat{u}^{1},\ldots,\hat{u}^{n})$ denotes the diagonal matrix with elements $\hat{u}^{1},\ldots,\hat{u}^{n}$ along the diagonal.

4.2 Implementation

We now describe the implementation details of our Beta policies. For a given state $x$ , the parameter vectors $\alpha(x),\beta(x)$ are outputted by a two-layer, fully connected neural network. Control inputs at state $x$ were obtained by first creating an independent PyTorch (Paszke et al., 2019) Beta distribution object with parameters $\alpha^{i}_{\theta}(x),\beta^{i}_{\theta}(x)$ , for each dimension $i\in\{1,\ldots,n\}$ of the action space, then sampling $u=[u^{1}\ldots u^{n}]^{T}$ from these distributions, and finally scaling and translating to lie within the current CBF set $C(x)$ . Similarly, the Gaussian policies we used for comparison used distribution parameters outputted by a two-layer, fully connected neural network. Control inputs were then selected from the corresponding distribution by sampling, then following the standard practice (Chou et al., 2017) of clip** to a fixed set of permissible controls. The PPO implementation used in the experiments was adapted with minor modifications from Stable Baselines 3 (Raffin et al., 2021).

4.3 Case study 1 : Quadcopter Navigation

Experiment Setup. For this experiment, we consider the problem of learning to safely navigate a quadcopter around an obstacle to a goal location. In this section, we present an overview of the dynamical model that we use for this quadcopter, which was previously considered in Xu and Sreenath (2018), and describe our derivation of a hyperrectangular inner approximation of the safe control set, satisfying the CBF condition, that is amenable to sampling using our Beta policies. We finally briefly describe the reward function. See the supplementary material for a detailed exposition of the environment and sampling procedure.

We denote quadcopter and obstacle position by $r=(r_{x},r_{y},r_{z})$ and $r_{obs}=(r_{o_{x}},r_{o_{y}},r_{o_{z}})$ , respectively, and the quadcopter’s relative position with respect to the obstacle as $\Delta r=r-r_{obs}$ . The quadrotor dynamics are then given by

\displaystyle\dot{x}=Ax+Bu,x=\begin{bmatrix}r\\ \dot{r}\end{bmatrix},A=\begin{bmatrix}\mathbf{0}_{3\times 3}&I_{3}\\ \mathbf{0}_{3\times 3}&\mathbf{0}_{3\times 3}\end{bmatrix},B=\begin{bmatrix}% \mathbf{0}_{3\times 3}\\ \mathbf{1}_{3\times 3}\end{bmatrix},

with input $u$ consisting the desired accelerations in each $x,y$ , and $z$ dimensions. For the obstacle avoidance problem, we characterize the safe set as $\mathcal{S}=\{r:h(r)\geq 0\}$ , where

h(r)=\left({\Delta r_{x}}/{a}\right)^{4}+\left({\Delta r_{y}}/{b}\right)^{4}+% \left({\Delta r_{z}}/{c}\right)^{4}-r_{s},

(14)

$a,b,c>0$ parameterize the obstacle’s shape, which is assumed to be elliptical, and $r_{s}$ represents the desired safety margin. Given the quadcopter dynamics and (14), the first time derivative $\dot{h}(r)$ does not explicitly contain the control input $u$ . We therefore use the standard ECBF formulation Ames et al. (2019) to develop our safety condition using $\ddot{h}(r)$ , which explicitly contains $u$ . This ECBF condition is expressed as $\ddot{h}+K\cdot[h\quad\dot{h}]^{T}\geq 0,$ where $K=[K_{1}\quad K_{2}]^{T},K_{1}=6,K_{2}=8$ , are application-specific design parameters, and can be rewritten as $A_{r}u\leq b_{r},$ where $A_{r}$ is a matrix and $b_{r}$ is a vector, both depending on $r$ . Then, the state-dependent safe control set is given by $C(r)=\{u\ :\ A_{r}u\leq b_{r}\}$ (see the supplementary for details). The dynamics (and consequently the ECBF conditions) are discretized with time step $dt=0.1$ .

We consider navigation in the $x,y$ dimension as in Xu and Sreenath (2018), resulting in a two-dimensional action space. We take the actuator constraint to be defined by the hyperrectangle $H:=\{u_{min},u_{max}\}$ , where $u_{min}:=(u_{min}^{x},u_{min}^{y})\in\mathbb{R}^{2}$ and $u_{max}:=(u_{max}^{x},u_{max}^{y})\in\mathbb{R}^{2}$ are the minimum and maximum input values. In order to sample from the safe control set $C(r)$ at a given $r$ with our Beta policies, we need a hyperrectangular inner approximation. We obtain this inner approximation by formulating and solving a convex optimization problem yielding the highest volume hyperrectangle, $H_{c}(r)$ , contained within $C(r)$ .

Finally, we designed a reward providing an $\ell_{2}$ penalty based on agent distance from the goal, as well as a sizeable bonus for reaching the goal and a significant penalty for approaching the edge of the map. See the supplementary for details.

Results. The experiments we conducted illustrate that safety-filter based approaches like those considered in Cheng et al. (2019) fail on simple cases of our safety-constrained quadcopter problem (see Figure 2), while our CBF-constrained Beta policy succeeds (Figure 2). For illustration purposes, Figures 2 and 2 present trajectories generated over the course of training. The corresponding learning curves are included in the supplementary material. As illustrated in Figure 2, the safety-filter approach is effective at ensuring safety and also learns to successfully reach the goal when the obstacle is nonexistent or distant. However, it ultimately fails to reach the goal when the obstacle lies directly between the start and goal positions. We hypothesize that this is due to the fact that the projection-based approach attempts to learn an optimal policy for the unconstrained navigation problem, while projection causes it to deviate from its learned policy to maintain safety. Furthermore, the resulting safety-filtered policy cannot recover from these projections without an additional control layer (such as a derivative or PID controller) due to the repeated perturbation from the projection procedure. Our method, on the other hand, learns to successfully solve the problem as shown in Figure 2 while maintaining safety throughout training, since we directly learn policies for the CBF-constrained problem.

4.4 Case study 2: Inverted pendulum

Experiment Setup. For the second set of experiments, we considered a safety-constrained inverted pendulum environment building on the baseline Gym implementation (Brockman et al., 2016). The goal in this environment is to swing an inverted pendulum upright while maintaining it within a fixed safe set. We tested PPO with the two different policies on this environment for two different safe sets: $\mathcal{S}_{0.5}=\{\theta\ |\ -0.5\leq\theta\leq 0.5\}$ and $\mathcal{S}_{1.0}=\{\theta\ |\ -1.0\leq\theta\leq 1.0\}$ . Due to space limitations, we include the experiments with $\mathcal{S}_{0.5}$ in Figure 3 and the experiments with $\mathcal{S}_{1.0}$ with the supplementary material. As a baseline, we compare the proposed method to PPO with unconstrained Gaussian policies. This comparison highlights the effectiveness of the proposed method in guaranteeing safety as well as accelerating learning.

Results.

Our experiments are summarized in Figure 3. There are two main points to be drawn from these results. First, the top panel shows that incorporating prior knowledge about properties such as safety can encourage learning and accelerate convergence by forcing the Beta policy agent to concentrate on higher-value subsets of the state space. The Gaussian agent, on the other hand, is unable to benefit from this prior knowledge and convergence suffers as it spends a greater portion of its time exploring lower-value regions of the state space. Second, the bottom panel illustrates that Beta policies are highly effective at maintaining safety throughout training, while Gaussian policies without safety constraints naturally fail to remain inside the safe set. This is expected, but illustrates the need to use constraint-aware policies such as Beta policies when prior knowledge is available.

5 CONCLUSION

We have developed a sampling-based approach to learning policies ensuring hard constraint satisfaction in RL. Unlike existing, projection-based methods that ensure safety but lack convergence guarantees, our scheme provably does both. In addition to our theoretical contributions, we have also presented a practical solution method that leverages CBF-constrained Beta policies to ensure safety, and experimentally demonstrated its effectiveness on safe quadcopter navigation and inverted pendulum environments. Interesting directions for future work including extensions to the case where the constraint set must be estimated and application of our CBF-constrained Beta policies to real-world robotics problems.

Acknowledgements

The authors would like to thank the anonymous reviewers for their helpful comments and Mostafa Mohamed Fa Abdelnaby of Purdue University for pointing out Remark 1. The work of W. A. Suttle was supported by a Distinguished Postdoctoral Fellowship with the U.S. Army Research Laboratory. V. K. Sharma was partially funded by a grant from the Purdue Engineering Initiative on Autonomous and Connected Systems. The work of J. Liu was supported in part by the National Science Foundation (NSF) under grant 2230101, by the Air Force Office of Scientific Research (AFOSR) under award number FA9550-23-1-0175, and by U.S. Air Force Task Order FA8650-23-F-2603. The work of K. C. Kosaraju and V. Gupta was partially supported by Army Research Office grants W911NF2310111, W911NF2310266, and W911NF-23-1-0316, AFOSR grant F.10052139.02.005, Office of Naval Research grants F.10052139.02.009 and F.10052139.02.012, and NSF grant 2300355.

References

Abbeel and Ng (2004) Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine learning, 2004.
Achiam et al. (2017) Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pages 22–31. PMLR, 2017.
Agarwal et al. (2020) Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in markov decision processes. In Conference on Learning Theory, pages 64–66. PMLR, 2020.
Agrawal and Sreenath (2017) Ayush Agrawal and Koushil Sreenath. Discrete control barrier functions for safety-critical control of discrete systems with application to bipedal robot navigation. In Robotics: Science and Systems, volume 13, pages 1–10, 2017.
Altman (2021) Eitan Altman. Constrained Markov Decision Processes. Routledge, 2021.
Ames et al. (2016) Aaron D Ames, Xiangru Xu, Jessy W Grizzle, and Paulo Tabuada. Control barrier function based quadratic programs for safety critical systems. IEEE Transactions on Automatic Control, 62(8):3861–3876, 2016.
Ames et al. (2019) Aaron D Ames, Samuel Coogan, Magnus Egerstedt, Gennaro Notomista, Koushil Sreenath, and Paulo Tabuada. Control barrier functions: Theory and applications. In 2019 18th European Control Conference, pages 3420–3431, 2019.
Aswani et al. (2013) Anil Aswani, Humberto Gonzalez, S Shankar Sastry, and Claire Tomlin. Provably safe and robust learning-based model predictive control. Automatica, 49(5):1216–1226, 2013.
Bai et al. (2022) Qinbo Bai, Amrit Singh Bedi, Mridul Agarwal, Alec Koppel, and Vaneet Aggarwal. Achieving zero constraint violation for constrained reinforcement learning via primal-dual approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3682–3689, 2022.
Berkenkamp et al. (2017) Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pages 908–918, 2017.
Bhatnagar et al. (2009) Shalabh Bhatnagar, Richard Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor-critic algorithms. Automatica, 45(11):2471–2482, 2009.
Borkar (2005) Vivek S Borkar. An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 54(3):207–213, 2005.
Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
Brunke et al. (2022) Lukas Brunke, Melissa Greeff, Adam W Hall, Zhaocong Yuan, Siqi Zhou, Jacopo Panerati, and Angela P Schoellig. Safe learning in robotics: From learning-based control to safe reinforcement learning. Annual Review of Control, Robotics, and Autonomous Systems, 5:411–444, 2022.
Cheng et al. (2019) Richard Cheng, Gábor Orosz, Richard M Murray, and Joel W Burdick. End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3387–3395, 2019.
Chou et al. (2017) Po-Wei Chou, Daniel Maturana, and Sebastian Scherer. Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution. In International Conference on Machine Learning, pages 834–843. PMLR, 2017.
Diamond and Boyd (2016) Steven Diamond and Stephen Boyd. CVXPY: A python-embedded modeling language for convex optimization. The Journal of Machine Learning Research, 17(1):2909–2913, 2016.
Fazel et al. (2018) Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1467–1476. PMLR, 2018.
Fisac et al. (2018) Jaime F Fisac, Anayo K Akametalu, Melanie N Zeilinger, Shahab Kaynama, Jeremy Gillula, and Claire J Tomlin. A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control, 64(7):2737–2752, 2018.
Folland (1999) Gerald B Folland. Real Analysis: Modern Techniques and Their Applications, volume 40. John Wiley & Sons, 1999.
Garcıa and Fernández (2015) Javier Garcıa and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.
Ho et al. (2020) Cherie* Ho, Katherine* Shih, Jaskaran Grover, Changliu Liu, and Sebastian Scherer. “Provably safe” in the wild: testing control barrier functions on a vision-based quadrotor in an outdoor environment. In RSS 2020 Workshop in Robust Autonomy, 2020. URL https://openreview.net/pdf?id=CrBJIgBr2BK.
Konda (2002) V. Konda. Actor-Critic Algorithms. PhD thesis, Massachusetts Institute of Technology, 2002.
Kosaraju et al. (2021) Krishna Chaitanya Kosaraju, Seetharaman Sivaranjani, Wesley Suttle, Vijay Gupta, and Ji Liu. Reinforcement learning based distributed control of dissipative networked systems. IEEE Transactions on Control of Network Systems, 9(2):856–866, 2021.
Li et al. (2018) Zhaojian Li, Uroš Kalabić, and Tianshu Chu. Safe reinforcement learning: Learning with supervision using a constraint-admissible set. In 2018 Annual American Control Conference, pages 6390–6395, 2018.
Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
Ma et al. (2021) Haitong Ma, Jianyu Chen, Shengbo Eben, Ziyu Lin, Yang Guan, Yangang Ren, and Sifa Zheng. Model-based constrained reinforcement learning using generalized control barrier function. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4552–4559, 2021.
Mellinger et al. (2012) Daniel Mellinger, Nathan Michael, and Vijay Kumar. Trajectory generation and control for precise aggressive maneuvers with quadrotors. The International Journal of Robotics Research, 31(5):664–674, 2012.
Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
Paternain et al. (2019) Santiago Paternain, Luiz Chamon, Miguel Calvo-Fullana, and Alejandro Ribeiro. Constrained reinforcement learning has zero duality gap. Advances in Neural Information Processing Systems, 32, 2019.
Raffin et al. (2021) Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-baselines3: Reliable reinforcement learning implementations. The Journal of Machine Learning Research, 22(1):12348–12355, 2021.
Rasmussen (2003) Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2003.
Schreiter et al. (2015) Jens Schreiter, Duy Nguyen-Tuong, Mona Eberts, Bastian Bischoff, Heiner Markert, and Marc Toussaint. Safe exploration for active learning with gaussian processes. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 133–149. Springer, 2015.
Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Sui et al. (2015) Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization with gaussian processes. In International Conference on Machine Learning, pages 997–1005. PMLR, 2015.
Suttle et al. (2023) Wesley A Suttle, Amrit Bedi, Bhrij Patel, Brian M Sadler, Alec Koppel, and Dinesh Manocha. Beyond exponentially fast mixing in average-reward reinforcement learning via multi-level Monte Carlo actor-critic. In International Conference on Machine Learning, pages 33240–33267. PMLR, 2023.
Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
Virtanen et al. (2020) Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. Scipy 1.0: fundamental algorithms for scientific computing in python. Nature Methods, 17(3):261–272, 2020.
Wabersich et al. (2023) Kim P Wabersich, Andrew J Taylor, Jason J Choi, Koushil Sreenath, Claire J Tomlin, Aaron D Ames, and Melanie N Zeilinger. Data-driven safety filters: Hamilton-Jacobi reachability, control barrier functions, and predictive methods for uncertain systems. IEEE Control Systems Magazine, 43(5):137–177, 2023.
Wachi et al. (2018) Akifumi Wachi, Yanan Sui, Yisong Yue, and Masahiro Ono. Safe exploration and optimization of constrained mdps using gaussian processes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
Wiesemann et al. (2013) Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013.
Xu and Sreenath (2018) Bin Xu and Koushil Sreenath. Safe teleoperation of dynamic uavs through control barrier functions. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7848–7855, 2018.
Zhang et al. (2019) Kaiqing Zhang, Alec Koppel, Hao Zhu, and Tamer Başar. Convergence and iteration complexity of policy gradient method for infinite-horizon reinforcement learning. In 2019 58th Conference on Decision and Control, pages 7415–7422, 2019.
Zhang et al. (2020) Kaiqing Zhang, Alec Koppel, Hao Zhu, and Tamer Başar. Global convergence of policy gradient methods to (almost) locally optimal policies. SIAM Journal on Control and Optimization, 58(6):3586–3612, 2020.
Zhang et al. (2021) Kaiqing Zhang, Bin Hu, and Tamer Başar. Policy optimization for $\mathcal{H}_{2}$ linear control with $\mathcal{H}_{\infty}$ robustness guarantee: Implicit regularization and global convergence. SIAM Journal on Control and Optimization, 59(6):4081–4109, 2021.

Supplementary Material

Appendix A Proofs

Proof of Proposition 1. We construct a sequence of $\varepsilon$ -balls, each reachable from the previous element of the sequence, that leads from $x_{0}$ to $\mathcal{B}$ , then show that the head of the Markov chain lies inside this sequence with positive probability. Fix $\varepsilon>0$ and let $\{y_{0},y_{1},\ldots,y_{N}\}\subset\mathcal{S}$ be such that $B_{\varepsilon}(y_{k+1})\subset R(B_{\varepsilon}(y_{k}))$ , for $k=1,\ldots,N-1$ , and $B_{\varepsilon}(y_{N})\cap\mathcal{B}\neq\emptyset$ (see Figure 1 for an illustration). For a given $\theta$ , let $\{x_{n}\}$ be the Markov chain induced on $\mathcal{S}$ by $\pi^{C}_{\theta}$ such that $x_{0}=y_{0}$ . We show that the trajectory $(x_{0},x_{1},\ldots,x_{N})$ is contained within the set $\{y_{0}\}\times B_{\varepsilon}(y_{1})\times\ldots\times B_{\varepsilon}(y_{N})$ with strictly positive probability, which will imply that $\{x_{n}\}$ enters $\mathcal{B}$ with strictly positive probability. For each $k=1,\ldots,N$ , consider the probability measure $\nu_{k}$ defined as

\nu_{k}(S)=P(x\in S\ |\ x_{k-1})=\int_{\mathcal{T}^{-1}_{x_{k-1}}(S)}\pi^{C}_{% \theta}(a|x_{k-1})\ da,

for any $\mu$ -measurable subset $S$ of $\mathcal{S}$ . Note that $\nu_{k}$ is absolutely continuous with respect to $\mu$ , written $\nu_{k}\ll\mu$ , since $\mu(S)>0$ if and only if $\mathcal{T}^{-1}_{x_{k-1}}(S)>0$ , by Assumption 2. The Lebesgue-Radon-Nikodym Theorem implies that there exists a $\mu$ -integrable function $f_{k}:\mathcal{S}\rightarrow\mathbb{R}$ , called the Radon-Nikodym derivative of $\nu_{k}$ , such that $\nu_{k}(S)=\int_{S}f_{k}(x)dx$ (see Folland (1999) for details). To make the link between $f_{k}$ and $\nu_{k}$ perfectly clear, let us write

	$\displaystyle f_{k}(x)$	$\displaystyle=\int_{\mathcal{T}^{-1}_{x_{k-1}}(x)}\pi^{C}_{\theta}(a\|x_{k-1})% \ da,$
	$\displaystyle\nu_{k}(S)$	$\displaystyle=\int_{S}\int_{\mathcal{T}^{-1}_{x_{k-1}}(x)}\pi^{C}_{\theta}(a\|x% _{k-1})\ da\ dx.$

By Assumptions 2 and 3, we also have $\mu\ll\nu_{k}$ . Since both $\nu_{k}\ll\mu$ and $\mu\ll\nu_{k}$ , the two measures are said to be equivalent, meaning that they agree on which sets have measure zero. Since $\mu$ and $\nu_{k}$ are equivalent, a standard result from real analysis allows us to take the Radon-Nikodym derivative $f_{k}$ to be strictly positive $\mu$ -almost everywhere. As a first consequence, notice that

$\displaystyle P$	$\displaystyle\Big{(}(x_{0},x_{1},x_{2})\in\{y_{0}\}\times B_{\varepsilon}(y_{1% })\times B_{\varepsilon}(y_{2})\ \|\ x_{0}=y_{0}\Big{)}$
	$\displaystyle=P\Big{(}(x_{1},x_{2})\in B_{\varepsilon}(y_{1})\times B_{% \varepsilon}(y_{2})\ \|\ x_{0}=y_{0}\Big{)}\cdot P(x_{0}=y_{0})$
	$\displaystyle=P\Big{(}(x_{1},x_{2})\in B_{\varepsilon}(y_{1})\times B_{% \varepsilon}(y_{2})\ \|\ x_{0}=y_{0}\Big{)}$
	$\displaystyle=\int_{B_{\varepsilon}(y_{1})}\int_{T^{-1}_{x_{1}}(B_{\varepsilon% }(y_{2}))}\pi^{C}_{\theta}(a_{1}\|x_{1})f_{1}(x_{1})\ da_{1}\ dx_{1}$	(15)
	$\displaystyle=\int_{B_{\varepsilon}(y_{1})}\int_{T^{-1}_{x_{1}}(B_{\varepsilon% }(y_{2})}\pi^{C}_{\theta}(a_{1}\|x_{1})\left[\int_{T^{-1}_{x_{0}}(x_{1})}\pi^{C% }_{\theta}(a_{0}\|x_{0})\ da_{0}\right]\ da_{1}\ dx_{1}.$

Given Assumption 3, equation (15) is strictly positive, since $f_{1}$ is strictly positive almost everywhere and the integrals are taken over sets of positive volume.

Building on equation (15), we have

$\displaystyle P\Big{(}(x_{1},$	$\displaystyle\ldots,x_{N-1},x_{N})\in$	(16)
	$\displaystyle\hskip 8.53581ptB_{\varepsilon}(y_{1})\times\ldots\times B_{% \varepsilon}(y_{N-1})\times(B_{\varepsilon}(y_{N})\cap\mathcal{B})\ \|\ x_{0}=y% _{0}\Big{)}$	(17)
$\displaystyle=$	$\displaystyle\int_{B_{\varepsilon}(y_{1})}\int_{T^{-1}_{x_{1}}(B_{\varepsilon}% (y_{2}))}\pi^{C}_{\theta}(a_{1}\|x_{1})\cdot\int_{B_{\varepsilon}(y_{2})}\int_{% T^{-1}_{x_{2}}(B_{\varepsilon}(y_{3}))}\pi^{C}_{\theta}(a_{2}\|x_{2})\cdot\ldots$
	$\displaystyle\hskip 8.53581pt\ldots\cdot\int_{B_{\varepsilon}(y_{N-1})}\int_{T% ^{-1}_{x_{N-1}}(B_{\varepsilon}(y_{N})\cap\mathcal{B})}\pi^{C}_{\theta}(a_{N-1% }\|x_{N-1})\ f_{N-1}(x_{N-1})\ da_{N-1}\ dx_{N-1}\cdot\ldots$
	$\displaystyle\hskip 34.1433pt\ldots\cdot f_{2}(x_{2})\ da_{2}\ dx_{2}\cdot f_{% 1}(x_{1})\ da_{1}\ dx_{1}.$

Note in the innermost integral that $\mu(B_{\varepsilon}(y_{N})\cap\mathcal{B})>0$ , since both sets are open and their intersection is non-empty by hypothesis. Finally, given Assumption 3, we have that (17) is strictly positive, since all integrals are taken over sets of positive volume and $f_{i}$ is strictly positive almost everywhere, for each $i\in\{1,\ldots N-1\}$ . $\square$

Proof of Lemma 2. Recalling the definition of $\pi^{C}_{\theta}(u|x)$ in (2),

	$\displaystyle\nabla\log\pi^{C}_{\theta}(u\|x)$	$\displaystyle=\nabla\log\pi_{\theta}(u\|x)-\nabla\log\int_{C(x)}\pi_{\theta}(w\|% x)dw$
		$\displaystyle=\nabla\log\pi_{\theta}(u\|x)-\frac{\int_{C(x)}\nabla\pi_{\theta}(% w\|x)dw}{\int_{C(x)}\pi_{\theta}(w\|x)dw}.$		(18)

To see that, for all $x\in\mathcal{X},u\in\mathcal{U}$ , $\nabla\log\pi^{C}_{\theta}(u|x)$ exists, for all $\theta\in\Theta$ , we simply need to verify that $(\int_{C(x)}\pi_{\theta}(w|x)dw)^{-1}$ is always finite. But this follows immediately from Assumption 3 and the fact that $\mu(C(x))\geq m>0$ by Assumption 1.

We next prove part 1) of the Lemma. The claim holds for the first term in (18) by part 2a) of Assumption 6, so we just need to show that it holds for the second term. To do this, we prove that, for a given $x\in\mathcal{X}$ , this term is Lipschitz in $\theta$ , then argue that the largest minimal Lipschitz constant over all $x\in\mathcal{X}$ is finite. We know by part 2b) of Assumption 6 that, for all $x\in\mathcal{X},u\in\mathcal{U}$ , $\left\|\nabla\pi_{\theta}(u|x)\right\|\leq\nabla\log\pi_{\theta}(u|x)\leq B_{\Theta}$ , for all $\theta\in\Theta$ . This means that, for all $x\in\mathcal{X}$ ,

	$\displaystyle\Big{\|}\int_{C(x)}$	$\displaystyle\pi_{\theta}(w\|x)dw-\int_{C(x)}\pi_{\theta^{\prime}}(w\|x)dw\Big{\|}$
		$\displaystyle=\Big{\|}\int_{C(x)}\left(\pi_{\theta}(w\|x)-\pi_{\theta^{\prime}}(% w\|x)\right)dw\Big{\|}$
		$\displaystyle\leq\int_{C(x)}\|\pi_{\theta}(w\|x)-\pi_{\theta^{\prime}}(w\|x)\|dw$
		$\displaystyle\leq\int_{C(x)}B_{\Theta}\left\\|\theta-\theta^{\prime}\right\\|dw=% B_{\Theta}\mu(C(x))\left\\|\theta-\theta^{\prime}\right\\|$
		$\displaystyle\leq B_{\Theta}M\left\\|\theta-\theta^{\prime}\right\\|,$

for all $\theta,\theta^{\prime}\in\Theta$ . So $\int_{C(x)}\pi_{\theta}(w|x)dw$ is Lipschitz in $\theta$ , for each $x\in\mathcal{X}$ , and the largest Lipschitz constant over $\mathcal{X}$ is finite. In addition, $\int_{C(x)}\pi_{\theta}(w|x)dw$ is clearly uniformly bounded.

Notice that $\inf_{x\in\mathcal{X}}\mu(C(x))\geq m>0$ , by Assumption 1. Thus, by Assumption 3, $\inf_{x\in\mathcal{X}}\int_{C(x)}\pi_{\theta}(w|x)dw>0$ , for all $\theta\in\Theta$ . Since $\Theta$ is compact, this means $\inf_{\theta\in\Theta}\inf_{x\in\mathcal{X}}\int_{C(x)}\pi_{\theta}(w|x)dw>0$ . This implies that $(\int_{C(x)}\pi_{\theta}(w|x)dw)^{-1}$ is uniformly bounded. Since $\int_{C(x)}\pi_{\theta}(w|x)dw$ is Lipschitz, $(\int_{C(x)}\pi_{\theta}(w|x)dw)^{-1}$ is therefore Lipschitz and bounded in $\theta\in\Theta$ , for all $x\in\mathcal{X}$ . We also know that, for each $x\in\mathcal{X}$ , $\int_{C(x)}\nabla\pi_{\theta}(w|x)dw$ is Lipschitz and bounded in $\theta\in\Theta$ , by Assumption 6, part 2a). Fix $x\in\mathcal{X}$ . Since the product of Lipschitz, bounded functions is Lipschitz and bounded, the function $\int_{C(x)}\nabla\pi_{\theta}(w|x)dw/\int_{C(x)}\pi_{\theta}(w|x)dw$ is Lipschitz and bounded in $\theta$ . Since this function is uniformly bounded over $x\in\mathcal{X},\theta\in\Theta$ , there therefore exists $L>0$ such that, for all $x\in\mathcal{X}$ ,

\left\|\frac{\int_{C(x)}\nabla\pi_{\theta}(w|x)dw}{\int_{C(x)}\pi_{\theta}(w|x% )dw}-\frac{\int_{C(x)}\nabla\pi_{\theta^{\prime}}(w|x)dw}{\int_{C(x)}\pi_{% \theta^{\prime}}(w|x)dw}\right\|\leq L\left\|\theta-\theta^{\prime}\right\|,

for all $\theta,\theta^{\prime}\in\Theta$ . Combined with part 2a) of Assumption 6, this implies that, for all $x\in\mathcal{X},u\in\mathcal{U}$ , $\left\|\nabla\log\pi^{C}_{\theta}(u|x)-\nabla\log\pi^{C}_{\theta^{\prime}}(u|x% )\right\|\leq(L_{\Theta}+L)\left\|\theta-\theta^{\prime}\right\|$ , for all $\theta,\theta^{\prime}\in\Theta$ . This completes the proof of part 1).

Part 2) follows from the fact that $\int_{C(x)}\nabla\pi_{\theta}(w|x)dw/\int_{C(x)}\pi_{\theta}(w|x)dw$ is uniformly bounded and that, for all $x\in\mathcal{X},u\in\mathcal{U}$ , $\left\|\nabla\log\pi_{\theta}(u|x)\right\|\leq B_{\Theta}$ , for all $\theta\in\Theta$ , by part 2) of Assumption 6.

Appendix B Background: Barrier Functions

In this section, we provide an overview of barrier functions for convenience. The theory of barrier functions revolve around controlled set invariance for dynamical systems. Safety can be represented through a set, say $\mathcal{S}$ , defined as a level set of this barrier function, say $h$ . Then, we write the condition on this barrier function to guarantee the forward invariance of this safety set under the given dynamics.

B.1 Control Barrier Functions (CBF)

Consider the following nonlinear system

\displaystyle\dot{r}=f(r,u),

(19)

where $r\in D\subset\mathbb{R}^{n}$ and $u\in U\subset\mathbb{R}^{m}$ denote the state and control input, and $f$ is a locally Lipschitz function that models the state transition. The following definition and the theorem follows the development in Ames et al. (2019); Agrawal and Sreenath (2017).

Theorem 2.

Consider a function $h:\mathbb{R}^{n}\rightarrow\mathbb{R}$ that is continuously differentiable. Define a closed set $\mathcal{S}$ as the super-level set of this function as follows:

\displaystyle\mathcal{S}\triangleq\left\{r\in\mathbb{R}^{n}\;|\;h(r)\geq 0% \right\}.

(20)

The function $h$ is a control barrier function, for (19) and with state $s$ , if there exists an extended $\kappa_{\infty}$ function $\alpha$ such that for all $r\in\mathcal{S}$ , $t\in\mathbb{R}_{+}$ ,

\dot{h}\geq-\alpha(h).

(21)

Further, if we define the safe control set as

\mathcal{C}(r)\triangleq\left\{u\in\mathbb{R}^{m}|\dot{h}(r,u)\geq-\alpha(h(r)% )\right\}.

(22)

then any input $u\in\mathcal{C}(r)$ will render the set $\mathcal{S}$ forward invariant.

When designing safe controller with control values $u$ sampled from this safe control set, we need the time-derivative of $h$ i.e. $\dot{h}$ to explicitly contain $u$ . However, the above forward variance condition is restricted to barrier function with relative degree $d_{r}=1$ . At this point, we also note that for our quadcopter experiment, our CBF has a relative degree $d_{r}=2$ , since only the second time-derivative $\ddot{h}$ explicitly contains the control input $u$ . Therefore, for barrier functions with relative degree more than $1$ , which are often referred to as Exponential Control Barrier Functions (ECBF), we need a seperate discussion on forward invariance conditions.

B.2 Exponential Control Barrier Functions

We now discuss exponential CBFs for control affine nonlinear dynamical system. Consider the following control affine nonlinear dynamical system:

\dot{r}=f(r)+g(r)u,

(23)

with $f$ and $g$ locally lipshitz, $r\in D\subset\mathbb{R}^{n}$ and $u\in U\subset\mathbb{R}^{m}$ . We suppose that the Lipschitz constant for $f$ and $g$ are $L_{f}$ and $L_{g}$ respectively, and the vector containing the first $d_{r}-1$ time derivatives of $h(r)$ including $h(r)$ is given as: $\eta_{b}(r)=\begin{bmatrix}h(r)\\ \dot{h}(r)\\ \ddot{h}(r)\\ \vdots\\ h^{d_{r}-1}(r)\end{bmatrix}$ . Further suppose that the matrices $F,G,$ and $C$ are defined as follows: $F=\begin{bmatrix}0&1&0&\cdots&0\\ 0&0&1&\cdots&0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&\cdots&1\\ 0&0&0&\cdots&0\end{bmatrix}$ , $G=\begin{bmatrix}0\\ 0\\ 0\\ \vdots\\ 1\end{bmatrix}$ , and, $C=\begin{bmatrix}1&0&0&\cdots&0\end{bmatrix}$ .

Theorem 3.

Consider a function $h:\mathbb{R}^{n}\rightarrow\mathbb{R}$ that is continuously differentiable. Define a closed set $\mathcal{S}$ as the super-level set of this function as follows:

\displaystyle\mathcal{S}\triangleq\left\{r\in\mathbb{R}^{n}\;|\;h(r)\geq 0% \right\}.

(24)

Then the function $h$ is an ECBF, with relative degree $d_{r}$ , for system in (23), if there exist a row vector $K_{\alpha}\in\mathbb{R}^{r}$ such that

\sup_{u\in U}[L_{f}^{d_{r}}h(r)+L_{g}L_{f}^{d_{r}-1}h(r)u]\geq-K_{\alpha}\eta_% {b}(r)

(25)

$\forall r\in Int(\mathcal{S})$ , implies $h(r(t))\geq Ce^{(F-GK_{\alpha})t}\eta_{b}(r)r(t_{0})\geq 0$ , whenever $h(r(t_{0}))\geq 0$ . Further, if we define the safe control set $C(r)$ as

\mathcal{C}(r)\triangleq\{u\in U|[L_{f}^{d_{r}}h(r)+L_{g}L_{f}^{d_{r}-1}h(r)u]% \geq-K_{\alpha}\eta_{b}(r)\},

(26)

then any input $u\in\mathcal{C}(r)$ will render the set $\mathcal{S}$ forward invariant.

Appendix C Experiments: Additional Details

In this section, we provide additional details regarding the experiments presented in §4.

C.1 Inverted Pendulum Experiments

The safety-constrained inverted pendulum environment that we considered in §4.4 was obtained by modifying the standard implementation from Brockman et al. (2016) to include CBF-based safety constraints. In this section we describe the dynamical model and CBF used to obtain these constraints, then present implementation details and an additional experiment.

C.1.1 Dynamical Model

Consider the model of a simple inverted pendulum

\displaystyle\begin{bmatrix}\theta_{k+1}\\ \dot{\theta}_{k+1}\end{bmatrix}

\displaystyle=\begin{bmatrix}\theta_{k}+\delta t\dot{\theta}_{k}+\delta t^{2}% \left(\dfrac{3g}{2l}\sin{\theta_{k}}+\dfrac{3}{ml^{2}}u_{k}\right)\\ \dot{\theta}_{k}+\delta t\left(\dfrac{3g}{2l}\sin{\theta_{k}}+\dfrac{3}{ml^{2}% }u_{k}\right)\end{bmatrix},

(31)

where $\theta_{k},~{}\dot{\theta}_{k}$ denote the states (angle and angular velocity), $u_{k}$ denote the input (torque), $m$ and $l$ denotes the mass and the length of the pendulum, respectively, $g$ denotes the acceleration due to gravity and $\delta t>0$ denotes the discretization time. Denote the safe operating region by

\displaystyle\mathcal{S}

\displaystyle=\left\{\theta\in\mathbb{R}|h(\theta):=\begin{bmatrix}\theta+1\\ 1-\theta\end{bmatrix}\geq 0\right\}.

(34)

C.1.2 Control Barrier Function

The following corollary is a direct consequence of Theorem 2, presented in §B.

Corollary 2.

Let

\displaystyle U(\theta_{k},\dot{\theta}_{k})=\left\{u_{k}\in{\mathbb{R}}|\left% (\delta t\dot{\theta}_{k}+c(\theta_{k},u_{k})\right)\begin{bmatrix}1\\ -1\end{bmatrix}+\eta\begin{bmatrix}\theta_{k}+1\\ 1-\theta_{k}\end{bmatrix}\geq 0\right\}

(39)

where $c(\theta_{k},u_{k}):=\delta t^{2}\left(\dfrac{3g}{2l}\sin{\theta_{k}}+\dfrac{3% }{ml^{2}}u_{k}\right)$ , and $0<\eta<1$ . Consider system (31) with $u_{k}\in U(\theta_{k},\dot{\theta}_{k})$ . Let $(\theta_{0},\dot{\theta}_{0})\in\mathcal{S}\times\mathbb{R}$ and assume $U(\theta_{0},\dot{\theta}_{0})$ is non-empty. The set $\mathcal{S}$ is forward invariant.

The safe set (39) is used to provide the state-dependent constraints to the Beta policies learned in our experiments.

C.1.3 Implementation Details

We next describe the implementation details of our experiments. As mentioned above, the environment was adapted from the implementation of (Brockman et al., 2016), with modifications to compute the CBF safe set (39). The reward function and other details are as in Brockman et al. (2016). The Beta and Gaussian policies used the corresponding distributions from the PyTorch library (Paszke et al., 2019). As described in §4.2, for a given state $x$ , the parameters $\alpha(x),\beta(x)$ of the Beta distribution were outputted by a two-layer, fully connected neural network. Control inputs were obtained by sampling from this distribution, then translating and rescaling to lie within the current CBF set $C(x)=[a(x),b(x)]$ . The Gaussian policy parameters were outputted by a two-layer, fully connected neural network. Control inputs were subsequently selected from the corresponding distribution by sampling, then, following standard practice (Chou et al., 2017), were clipped to a set of permissible controls, which was chosen to be $[-15.0,15.0]$ . The hyperparameters used are presented in Figure 4.

policy learning rate	0.0003
value learning rate	0.0003
entropy coefficient	0.0
clip range	0.2
weight decay	0.0
layer size	64
batch size	64
buffer size	300
number of epochs	10
rollout length	300
discount factor	0.99

(a) Gaussian hyperparameters.

policy learning rate	0.01
value learning rate	0.01
entropy coefficient	0.0
clip range	0.2
weight decay	0.0
layer size	64
batch size	64
buffer size	300
number of epochs	10
rollout length	300
discount factor	0.99

(b) Beta hyperparameters.

Figure 4: PPO hyperparameters for the inverted pendulum experiments.

C.1.4 Additional Results

Figure 5 presents an experiment providing additional support to the discussion presented in §4.4.

C.2 Quadcopter Experiments

In this section, we provide additional details regarding the environment and experiments presented in §4.3.

C.2.1 Dynamical Model

We summarize the dynamical model of the quadcopter derived in Xu and Sreenath (2018). We consider the body frame, say $\mathbb{F}_{b}$ , and world frame, say $\mathbb{F}_{w}$ , and discuss the transformation between these two frames using the rotation matrix $\mathbb{R}_{wb}$ defined as

\mathbb{R}_{wb}:=\begin{bmatrix}\cos{\psi}\cos\theta-\sin{\phi}\sin{\psi}\sin{% \theta}&-\cos{\phi}\sin{\psi}&\cos{\psi}\sin{\theta}+\cos{\theta}\sin{\phi}% \sin{\psi}\\ \cos{\theta}\sin{\psi}+\cos{\psi}\sin{\phi}\sin{\theta}&\cos{\phi}\cos{\psi}&% \sin{\psi}\sin{\theta}-\cos{\psi}\cos{\theta}\sin{\phi}\\ -\cos{\phi}\sin{\theta}&\sin{\phi}&\cos{\phi}\cos{\theta}\end{bmatrix},

(40)

where $\phi$ , $\theta$ , and $\psi$ denote the Z-X-Y Euler angles corresponding to the roll, pitch, and yaw of the quadcopter. Suppose that the 3-dimensional position coordinates of the quadcopter along the x-,y-, and z-axis with respect to its body frame $\mathbb{F}_{b}$ of and the world frame of reference $\mathbb{F}_{w}$ be given by $x_{b}:=(x_{b},y_{b},z_{b})$ and $r:=(r_{x},r_{y},r_{z})$ respectively, then $r=\mathcal{R}_{wb}x_{b}$ .

Then, the quadcopter dynamics is given by $\dot{x}=Ax+Bu$ , where the control input $u$ comprises of the desired acceleration of the quadcopter. The dynamics of this controller under small angle assumptions on the Euler angles, that is $\sin\hat{e}\approx\hat{e},\cos\hat{e}\approx 1,\hat{e}\in\{\phi,\theta,\psi\}$ ) evolves as Mellinger et al. (2012):

\displaystyle u=\begin{bmatrix}\ddot{r}_{x}^{des}\\ \ddot{r}_{y}^{des}\\ \ddot{r}_{z}^{des}\end{bmatrix}=\begin{bmatrix}g(\theta^{des}\cos{\psi^{des}}+% \phi^{des}\sin{\psi^{des}}),\\ g(\theta^{des}\sin{\psi^{des}}-\phi^{des}\cos{\psi^{des}})\\ \frac{\sum^{4}_{i=1}F_{i}^{des}}{m}-g\end{bmatrix},

(47)

where $m,g$ are respectively the mass of the quadcopter and gravitational constant, and $\ddot{r}_{i}^{des},i\in\{x,y,z\}$ is the desired acceleration component of the quadcopter in the x-,y-, and z-direction respectively, computed using the desired specifications on the Euler angles $\phi^{des},\theta^{des}$ , and $\psi^{des}$ , and $F_{i}^{des},i\in\{1,2,3,4\}$ is the desired thrust on the $i$ -th rotor of the quadcopter. Lastly, the dynamical parameters for the quadcopter are setup as given in Ho et al. (2020).

C.2.2 Exponential Control Barrier Function

Recall, that the objective of our controller to enable the quadcopter to learn how to reach a target position $r_{goal}$ , while avoiding an obstacle with position $r_{obs}$ . For this obstacle avoidance, we now discuss our choice of CBF for the quadcopter experiment, as defined in (14), and reason why this is an exponential control barrier function (ECBF). We first derive expressions for $\dot{h}$ and $\ddot{h}$ using dynamical equations as follows:

\dot{h}(r)=4(\left({\Delta r_{x}}/{a}\right)^{3}\dot{r}_{x}+\left({\Delta r_{y% }}/{b}\right)^{3}\dot{r}_{y}+\left({\Delta r_{z}}/{c}\right)^{3}\dot{r}_{z}),

(48)

and

\ddot{h}(r)=12(\left({\Delta r_{x}}/{a}\right)^{2}\dot{r}_{x}+\left({\Delta r_% {y}}/{b}\right)^{2}\dot{r}_{y}+\left({\Delta r_{z}}/{c}\right)^{2}\dot{r}_{z})% +4(\left({\Delta r_{x}}/{a}\right)^{3}\ddot{r}_{x}+\left({\Delta r_{y}}/{b}% \right)^{3}\ddot{r}_{y}+\left({\Delta r_{z}}/{c}\right)^{3}\ddot{r}_{z}).

(49)

These equations can be re-written in vector form as follows:

\dot{h}(r)=\begin{bmatrix}4({\Delta r_{x}^{3}}/{a}^{4})&4({\Delta r_{y}^{3}}/{% b}^{4})&4({\Delta r_{z}^{3}}/{c}^{4})\end{bmatrix}\dot{r}

(50)

and

\ddot{h}(r)=\dot{r}^{T}\begin{bmatrix}12({\Delta r_{x}}^{2}/{a}^{4})&0&0\\ 0&12({\Delta r_{y}^{2}}/{b}^{4})&0\\ 0&0&12({\Delta r_{z}^{2}}/{c}^{4})\end{bmatrix}\dot{r}+\begin{bmatrix}4({% \Delta r_{x}}^{3}/{a}^{4})&4({\Delta r_{y}^{3}}/{b}^{4})&4({\Delta r_{z}^{3}}/% {c}^{4})\end{bmatrix}\ddot{r}.

(51)

Since $u=\ddot{r}$ from quadcopter dynamics, we can re-write the $\ddot{h}(r)$ as follows:

\ddot{h}(r)=\dot{r}^{T}D_{r}\dot{r}-A_{r}u,

(52)

where $D_{r}:=\begin{bmatrix}12({\Delta r_{x}}^{2}/{a}^{4})&0&0\\ 0&12({\Delta r_{y}^{2}}/{b}^{4})&0\\ 0&0&12({\Delta r_{z}^{2}}/{c}^{4})\end{bmatrix}$ , and $A_{r}:=-\begin{bmatrix}4({\Delta r_{x}}^{3}/{a}^{4})&4({\Delta r_{y}^{3}}/{b}^% {4})&4({\Delta r_{z}^{3}}/{c}^{4})\end{bmatrix}$ .

Therefore, we note that $\ddot{h}(r)$ or the $2^{nd}$ time-derivative of $h(r)$ which explicitly depends on the control input $u$ and therefore our choice of CBF $h(r)$ is an exponential CBF with a relative degree Ames et al. (2019) of $2$ .

Correspondingly, we use the following forward invariance condition for the set $\mathcal{S}=\{r:h(r)\geq 0\}$ as given in Xu and Sreenath (2018):

\ddot{h}+K\cdot[h\quad\dot{h}]^{T}\geq 0,

(53)

with $K=[K_{1}\quad K_{2}]^{T}$ .

The above equation can be re-arranged as follows:

-\ddot{h}\leq K_{1}h+K_{2}\dot{h},

and, using (50) and (52), we can re-write this equation as

A_{r}u\leq b_{r},

(54)

where $b_{r}=\dot{r}^{T}D_{r}\dot{r}+K_{1}h-K_{2}A_{r}\dot{r}$ . Thus, we can write the safe control set as $\mathcal{C}(r)=\{u\in\mathbb{R}^{3}:A_{r}u\leq b_{r}\}$ . For our quadcopter experiments, we consider navigation $x-y$ dimensions, therefore set the $z$ -dimension position and velocity to be $0$ . Thus the control input only comprises the desired acceleration for $x$ and $y$ axes and therefore our action space becomes two-dimensional, and we only consider the $x$ and $y$ components in the above CBF calculations.

C.2.3 Maximal Inner Hyperrectangle Computation

We now describe the construction of the maximal inner hyperrectangle contained in the set $\mathcal{C}(r)$ under actuator constraints $H$ . These are the sets that our Beta policies will sample from. We use the following optimization problem, with decision variables $u=(u^{x},u^{y})$ , to get the maximal inner hyper-rectangle inside the safe set:

\mathcal{P}_{A}:\begin{aligned} \underset{u}{\mathrm{max}}\ \ &\mathcal{A}(u)% \\ \text{s.t.}\ \ &A_{r}u\leq b_{r},\\ &u\in H,\end{aligned}

(55)

where $\mathcal{A}(u)$ is the area of a hyperrectangle inside $C(r)\cap H$ and the decision variables $u^{x},u^{y}$ are points on the line $A_{r}u\leq b_{r}$ . One of the corner points of this hyperrectangle is formed by $u^{x},u^{y}$ and the rest of corner points lie on the boundary hyperrectangle formed by $H$ . Suppose that $(u^{x}_{*},u^{y}_{*})$ are solutions to $\mathcal{P}_{A}$ , then the definition of Area $\mathcal{A}$ depends on how the line $A_{r}u\leq b_{r}$ intersects with $H$ , and therefore, leads to the following four possibilities:

•

$\mathcal{A}=(u^{x}-u_{min}^{x})*(u^{y}-u_{min}^{x})$ and $H_{c}=\{u_{min},(u^{x}_{*},u^{y}_{*})\}$
•

$\mathcal{A}=(u^{x}-u_{min}^{x})*(u_{max}^{y}-u^{y})$ and $H_{c}=\{(u_{min}^{x},u^{y}_{*}),(u^{x}_{*},u_{max}^{y})\}$
•

$\mathcal{A}=(u_{max}^{x}-u^{x})*(u_{max}^{x}-u^{y})$ and $H_{c}=\{(u^{x}_{*},u^{y}_{*}),u_{max}\}$
•

$\mathcal{A}=(u_{max}^{x}-u^{x})*(u^{y}-u_{min}^{y})$ and $H_{c}=\{(u^{x}_{*},u_{min}^{y}),(u_{max}^{x},u^{y}_{*})\}$ .

$\mathcal{P}_{A}$ is, in general, a non-convex program. However, through change-of-variables, we can transform this problem into a tractable problem through the following transformation. We perform a change of variables, with new variables denoted by $(\bar{u}^{x},\bar{u}^{y})$ defined by

•

$\bar{u}^{x}=u^{x}-u_{min}^{x}$ and $\bar{u}^{y}=u^{y}-u_{min}^{y}$
•

$\bar{u}^{x}=u^{x}-u_{min}^{x}$ and $\bar{u}^{y}=u_{max}^{y}-u^{y}$
•

$\bar{u}^{x}=u_{max}^{x}-u^{x}$ and $\bar{u}^{y}=u_{max}^{y}-u^{y}$
•

$\bar{u}^{x}=u_{max}^{x}-u^{x}$ and $\bar{u}^{y}=u^{y}-u_{min}^{y}$ .

The corresponding objective function is given by $\mathcal{\bar{A}}=\bar{u}^{x}\bar{u}^{y}$ . So long as the entries corresponding to $A_{r}$ and $b_{r}$ from (55) in the transformed problem are nonnegative, the resulting problem is a geometric program, which can be further transformed to a convex problem by standard methods and efficiently solved. In our quadcopter experiments, when $A_{r},b_{r}\geq 0$ , we solved the transformed geometric program using CVXPY (Diamond and Boyd, 2016), and used non-linear solvers from SCIPY (Virtanen et al., 2020) otherwise. We observed in our experiments that the transformation resulted in geometric programs in all but a handful of cases.

C.2.4 Reward

We now discuss reward sha** used in our quadcopter experiments.

Suppose that the $r_{min}:=[r_{min}^{x}\quad r_{min}^{y}]^{T}$ and $r_{max}:=[r_{max}^{x}\quad r_{max}^{y}]^{T}$ are environment boundaries, with $x,y$ -axis boundary repectively defined by $[r_{min}^{x},r_{max}^{x}]$ and $[r_{min}^{y},r_{max}^{y}]$ , that we employ for guiding exploration for both the quadcopter experiments. Then the reward used in our environment is defined by

R(r)=\begin{cases}50&\text{ if }||r-r_{goal}||_{2}<\epsilon,\\ -||r-r_{goal}||_{2}&\text{ if }r_{max}>r>r_{min}\text{ and }||r-r_{goal}||_{2}% \geq\epsilon,\\ -||r-r_{goal}||_{2}-400&\text{ if }r\geq r_{max}\text{ or }r\leq r_{min},\end{cases}

where $\epsilon=0.25$ is the boundary around $r_{goal}$ for which we give a constant positive reward of $50$ to the agent, and the inequalities in the reward definition are element-wise. Moreover, when the agent is inside the boundary but outside the $\epsilon$ -neighborhood of the goal, then the reward is negative of the distance between the agent and the goal. Lastly, we penalize the agent if it goes outside the boundary defined by $r_{min}$ and $r_{max}$ to encourage exploration in the region around the goal.

C.2.5 Hyperparameters

The hyperparameters used in the experiments are presented in Figure 6.

policy learning rate	0.0004
value learning rate	0.0004
entropy coefficient	0.00000001
clip range	0.2
weight decay	0.0
layer size	256
batch size	256
buffer size	320
number of epochs	10
rollout length	320
discount factor	0.90

(a) Gaussian hyperparameters.

policy learning rate	0.0006
value learning rate	0.0006
entropy coefficient	0.0
clip range	0.2
weight decay	0.0
layer size	256
batch size	256
buffer size	180
number of epochs	10
rollout length	180
discount factor	0.90

(b) Beta hyperparameters.

Figure 6: PPO hyperparameters for the quadcopter experiments.

C.2.6 Learning Curves

Learning curves for the experiments illustrated in Figures 2 and 1(c) are presented in Figures 6(a) and 6(b).

Appendix D Computing Resources

We ran our experiments on both a personal laptop and an HPC cluster. The laptop was configured with a 6-core i7-8750H, 2.20GHz CPU, an NVIDIA GeForce RTX 2070 GPU, and 32GB RAM . The HPC server node was configured with a 32-core Intel Xeon CPU , an 80GB Nvidia Tesla GPU, and 512 GB RAM .

$\displaystyle P$	$\displaystyle\Big{(}(x_{0},x_{1},x_{2})\in\{y_{0}\}\times B_{\varepsilon}(y_{1% })\times B_{\varepsilon}(y_{2})\ \|\ x_{0}=y_{0}\Big{)}$
	$\displaystyle=P\Big{(}(x_{1},x_{2})\in B_{\varepsilon}(y_{1})\times B_{% \varepsilon}(y_{2})\ \|\ x_{0}=y_{0}\Big{)}\cdot P(x_{0}=y_{0})$
	$\displaystyle=P\Big{(}(x_{1},x_{2})\in B_{\varepsilon}(y_{1})\times B_{% \varepsilon}(y_{2})\ \|\ x_{0}=y_{0}\Big{)}$
	$\displaystyle=\int_{B_{\varepsilon}(y_{1})}\int_{T^{-1}_{x_{1}}(B_{\varepsilon% }(y_{2}))}\pi^{C}_{\theta}(a_{1}\|x_{1})f_{1}(x_{1})\ da_{1}\ dx_{1}$	(15)
	$\displaystyle=\int_{B_{\varepsilon}(y_{1})}\int_{T^{-1}_{x_{1}}(B_{\varepsilon% }(y_{2})}\pi^{C}_{\theta}(a_{1}\|x_{1})\left[\int_{T^{-1}_{x_{0}}(x_{1})}\pi^{C% }_{\theta}(a_{0}\|x_{0})\ da_{0}\right]\ da_{1}\ dx_{1}.$

	$\displaystyle\nabla\log\pi^{C}_{\theta}(u\|x)$	$\displaystyle=\nabla\log\pi_{\theta}(u\|x)-\nabla\log\int_{C(x)}\pi_{\theta}(w\|% x)dw$
		$\displaystyle=\nabla\log\pi_{\theta}(u\|x)-\frac{\int_{C(x)}\nabla\pi_{\theta}(% w\|x)dw}{\int_{C(x)}\pi_{\theta}(w\|x)dw}.$		(18)

	$\displaystyle\Big{\|}\int_{C(x)}$	$\displaystyle\pi_{\theta}(w\|x)dw-\int_{C(x)}\pi_{\theta^{\prime}}(w\|x)dw\Big{\|}$
		$\displaystyle=\Big{\|}\int_{C(x)}\left(\pi_{\theta}(w\|x)-\pi_{\theta^{\prime}}(% w\|x)\right)dw\Big{\|}$
		$\displaystyle\leq\int_{C(x)}\|\pi_{\theta}(w\|x)-\pi_{\theta^{\prime}}(w\|x)\|dw$
		$\displaystyle\leq\int_{C(x)}B_{\Theta}\left\\|\theta-\theta^{\prime}\right\\|dw=% B_{\Theta}\mu(C(x))\left\\|\theta-\theta^{\prime}\right\\|$
		$\displaystyle\leq B_{\Theta}M\left\\|\theta-\theta^{\prime}\right\\|,$