License: arXiv.org perpetual non-exclusive license
arXiv:2403.04007v1 [cs.LG] 06 Mar 2024

 

Sampling-based Safe Reinforcement Learning for
Nonlinear Dynamical Systems


 


Wesley A. Suttle                        Vipul K. Sharma                        Krishna C. Kosaraju

U.S. Army Research Laboratory                        Purdue University                        Clemson University

S. Sivaranjani                        Ji Liu                        Vijay Gupta                        Brian M. Sadler

Purdue University                        Stony Brook University                        Purdue University                        U.S. Army Research Laboratory

Abstract

We develop provably safe and convergent reinforcement learning (RL) algorithms for control of nonlinear dynamical systems, bridging the gap between the hard safety guarantees of control theory and the convergence guarantees of RL theory. Recent advances at the intersection of control and RL follow a two-stage, safety filter approach to enforcing hard safety constraints: model-free RL is used to learn a potentially unsafe controller, whose actions are projected onto safe sets prescribed, for example, by a control barrier function. Though safe, such approaches lose any convergence guarantees enjoyed by the underlying RL methods. In this paper, we develop a single-stage, sampling-based approach to hard constraint satisfaction that learns RL controllers enjoying classical convergence guarantees while satisfying hard safety constraints throughout training and deployment. We validate the efficacy of our approach in simulation, including safe control of a quadcopter in a challenging obstacle avoidance problem, and demonstrate that it outperforms existing benchmarks.

1 INTRODUCTION

Learning-based methods for safe control of physical systems have been gaining increasing attention (Brunke et al., 2022). RL is especially powerful for the control of systems where performance feedback in the form of a scalar reward is available, but the dynamics are unknown (Sutton and Barto, 2018). In such settings, RL methods can learn a controller maximizing reward through direct interaction with the environment. However, due to physical realities such as the need to guarantee safety, practical application of RL to control of physical systems requires constraints on the control policies throughout training (García and Fernández, 2015). While directly constraining the action space to a static, narrowly defined set of “safe” actions is frequently employed in practice, this can lead to learning highly suboptimal policies and more nuanced methods are therefore required. Furthermore, in most physical systems it is non-trivial to directly translate complex safety constraints on the states into allowable actions.

A variety of RL approaches to the problem of safe learning for control have been proposed in the literature (see Brunke et al. (2022) for a comprehensive survey), including RL methods for safety-focused problems formulated as constrained Markov decision processes (CMDPs) (Altman, 2021), methods for learning to achieve safety through stability (Berkenkamp et al., 2017), and projection-based – also known as “safety filter” – RL methods for maintaining hard safety constraints, typically achieved through the use of control barrier functions (CBFs) (Cheng et al., 2019). Though CMDP-based methods enjoy convergence guarantees, they encourage safety without guaranteeing it, and cannot provide guarantees for hard safety constraints commonly required in physical systems. Likewise, methods like Berkenkamp et al. (2017) do better by offering high-probability safety assurances, but stop short of guaranteeing safety. In systems where safety is critical, methods like Cheng et al. (2019) that provably guarantee hard constraint satisfaction are necessary. However, the interaction between imposition of hard constraints and optimality of the resulting control policies is a subtle issue in RL. While projection-based safety-filter approaches (Wabersich et al., 2023) provably guarantee safety, the projection procedure undermines any convergence guarantees enjoyed by the underlying RL methods.

In this work, we develop a class of model-free policy gradient methods that maintain safety or other stability properties by sampling directly from the set of state-dependent safe actions. The key to our approach is that we consider truncated versions of commonly used stochastic policies, allowing us to sample directly from the safe action set at each state. This allows us to recover convergence guarantees by extending existing results for policy gradient methods to truncated policies. Our approach is applicable to a wide class of safety constraints, including control barrier functions (CBFs), which enforce forward invariance of a set characterized by nonlinearly coupled states and actions (Ames et al., 2019, 2016), and reachability-type constraints (Wabersich et al., 2023). In addition to our theoretical results, we experimentally validate the practical utility of sampling-based safety-preservation methods by considering a special case: Beta policies with state-dependent, CBF-constrained action sets. This novel approach extends the Beta policies in Chou et al. (2017) to the state-dependent action constraint setting. Finally, we train the resulting CBF-constrained Beta policies using PPO to solve a safety-constrained inverted pendulum problem as well as a quadcopter navigation and obstacle avoidance problem, and compare the latter to a safety filter-based benchmark (our implementation is publicly available at https://github.com/sharma1256/cbf-constrained_ppo). These case studies illustrate that our method simultaneously guarantees safety throughout training and guarantees optimality, even where existing benchmarks fail.

1.1 Related Work

Safety and stability have seen a great deal of interest in recent years at the intersection of the RL and control communities (see Brunke et al. (2022); García and Fernández (2015) for overviews). We are interested in safety definitions that impose hard constraints on the states and control actions (rather than, e.g., those used in robust RL (Wiesemann et al., 2013; Aswani et al., 2013) or RL for CMDPs (Achiam et al., 2017; Paternain et al., 2019; Ma et al., 2021; Bai et al., 2022)). Model-based methods for guaranteeing stability using RL controllers in systems with known or learnable dynamics have been developed in Berkenkamp et al. (2017); Fazel et al. (2018); Zhang et al. (2021). Recently, techniques leveraging control barrier functions to maintain safety (Cheng et al., 2019) and dissipativity (Kosaraju et al., 2021) have been developed.

Our work lies in the model-free RL setting. The two dominant approaches in model-free RL are value function and policy gradient-based methods (Sutton and Barto, 2018). We focus on the latter in this paper. Since their origins early in the development of RL (Sutton et al., 2000; Borkar, 2005; Bhatnagar et al., 2009), policy gradient methods have become the model-free algorithms of choice for complex problems with continuous, high-dimensional state and action spaces (Lillicrap et al., 2015; Schulman et al., 2017; Haarnoja et al., 2018). Recent works have improved our understanding of gradient estimation procedures, global optimality properties, and convergence rates of these algorithms (Agarwal et al., 2020; Zhang et al., 2020; Suttle et al., 2023). Popular approaches for safety in model-free RL include using bounds resulting from Gaussian process models (Schreiter et al., 2015; Rasmussen, 2003; Sui et al., 2015), reward shaping, constrained policy optimization (Achiam et al., 2017; Wachi et al., 2018), and teacher advice (Abbeel and Ng, 2004). Our work is most closely related to those approaches that use a hard safe set specification and constraints on control inputs, e.g., control barrier functions (Cheng et al., 2019; Fisac et al., 2018; Li et al., 2018; Kosaraju et al., 2021). In particular, our key contribution is a model-free safe RL algorithm with convergence guarantees and provable safety guarantees under hard constraints like CBFs, even during training.

2 PROBLEM SETTING

Consider a discounted MDP $(\mathcal{X},\mathcal{U},\mathcal{P},r,\gamma)$, where $\mathcal{X}\subseteq\mathbb{R}^{m}$ is the state space, $\mathcal{U}\subseteq\mathbb{R}^{n}$ is the action space, $\mathcal{P}(\cdot|x,u)$ is the transition probability function given action $u$ is taken in state $x$, $r:\mathcal{X}\times\mathcal{U}\rightarrow\mathbb{R}$ is the reward function, and $\gamma\in[0,1]$ is the discount factor. The MDP, which can be used to model a wide array of discrete-time systems, proceeds as follows: at time $k$, the system is in state $x_{k}$; a control input $u_{k}$ is applied to the system; a reward $r(x_{k},u_{k})$ is received; the system transitions into state $x_{k+1}$ according to the distribution $\mathcal{P}(\cdot|x_{k},u_{k})$. The goal in this problem formulation is to maximize the expected discounted reward, which we define in (3) below.
Note that deterministic dynamics can be recovered by imposing that, for each $x\in\mathcal{X},u\in\mathcal{U}$, there exists $x^{\prime}\in\mathcal{X}$ such that $\mathcal{P}(x^{\prime}|x,u)=1$. This is useful for modeling discretizations of continuous-time control problems, for example. We assume throughout this paper that the dynamics are deterministic in this way, which is a common setting in safe control problems. Let $\mathcal{T}:\mathcal{X}\times\mathcal{U}\rightarrow\mathcal{X}$ represent the dynamics of the MDP, i.e., given state $x$ and control input $u$, $\mathcal{T}(x,u)$ denotes the state the system transitions into when input $u$ is applied while in state $x$.

Letting $\Delta(\mathcal{U})$ denote the set of all probability distributions over the set $\mathcal{U}$, a stochastic policy $\pi:\mathcal{X}\rightarrow\Delta(\mathcal{U})$ is a function mapping states to probability distributions over the action space $\mathcal{U}$. In other words, given a state $x$, an agent using policy $\pi$ will choose a control action $u\in\mathcal{U}$ by sampling $u\sim\pi(\cdot|x)$. For our purposes it will be useful to consider policies $\pi_{\theta}$ parameterized by $\theta\in\Theta\subseteq\mathbb{R}^{k}$, for some $k\ll|\mathcal{X}|\cdot|\mathcal{U}|=m\cdot n$, where $\Theta$ is a compact set of permissible parameters.

Let $\mathcal{S}\subset\mathcal{X}$ denote some “safe” or stable set within which we wish to keep the system. Furthermore, let $\mathbb{P}(S)$ denote the powerset of a set $S$, and consider a set-valued function $C:\mathcal{X}\rightarrow\mathbb{P}(\mathcal{U})$ given by

$$C(x)=\{u\in\mathcal{U}\ |\ \mathcal{T}(x,u)\in\mathcal{S}\}. \tag{1}$$

Intuitively, $C(x)$ is the set of all control inputs which, when applied at state $x$, keep the system within the safe set at the next time step. We assume throughout that, for a given $x\in\mathcal{X}$, $C(x)$ is known. Since our primary focus is resolving the open problem of simultaneously guaranteeing convergence and hard safety constraint satisfaction, we leave the issue of learning or approximating $C(x)$ while maintaining these guarantees to future work. The general formulation can be used to accommodate a variety of notions of safety, including forward invariance, stability, and dissipativity enforced by, for example, CBFs and exponential CBFs (ECBFs), and control Lyapunov functions (CLFs) (see the supplementary material for an overview and Ames et al. (2019) for a comprehensive survey). As we will demonstrate in the case studies below, the use of our method in conjunction with (E)CBFs is particularly natural to provide guarantees in problems with hard safety constraints. To ensure that we can sample from $C(x)$ and integrals over $C(x)$ are well-defined, we make the following assumption. Let $\mu$ denote the Lebesgue measure.

Assumption 1.

There exist $m,M>0$ such that $m\leq\mu(C(x))\leq M$, for all $x\in\mathcal{X}$. Furthermore, $\cup_{x\in\mathcal{X}}C(x)$ is compact.
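The safe action set in (1) and the volume bounds in Assumption 1 can be made concrete with a small sketch. The scalar system below, with dynamics 0.9x + u and unit intervals for both the action space and the safe set, is a hypothetical toy example and is not taken from the paper; the function names are illustrative as well. The sketch checks membership in C(x) and estimates the Lebesgue measure of C(x) by Monte Carlo.

```python
import random

# Hypothetical toy system (an assumption for illustration, not from the paper).
U_LOW, U_HIGH = -1.0, 1.0   # action space U = [-1, 1]
S_LOW, S_HIGH = -1.0, 1.0   # safe set S = [-1, 1]


def dynamics(x, u):
    """Deterministic transition T(x, u) for the toy system."""
    return 0.9 * x + u


def is_safe_action(x, u):
    """Membership test for C(x) = {u in U : T(x, u) in S}."""
    return U_LOW <= u <= U_HIGH and S_LOW <= dynamics(x, u) <= S_HIGH


def estimate_volume(x, n_samples=20000, seed=0):
    """Monte Carlo estimate of mu(C(x)) using uniform samples from U."""
    rng = random.Random(seed)
    hits = sum(is_safe_action(x, rng.uniform(U_LOW, U_HIGH))
               for _ in range(n_samples))
    return (U_HIGH - U_LOW) * hits / n_samples


# At x = 1, C(x) is the interval [-1, 0.1], whose length is 1.1.
volume_at_boundary = estimate_volume(1.0)
```

For this toy system the length of C(x) lies between 1.1 and 2 for every safe state x, so bounds of the kind required by Assumption 1 hold on the safe set.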

Given a policy $\pi_{\theta}$, consider the distribution $\pi_{\theta}^{C}(\cdot|x)$ obtained by truncating $\pi_{\theta}(\cdot|x)$ to the set $C(x)$. More precisely:

$$\pi^{C}_{\theta}(u|x)=\begin{cases}\dfrac{\pi_{\theta}(u|x)}{\pi_{\theta}(C(x)|x)}&u\in C(x)\\ 0&u\notin C(x),\end{cases} \tag{2}$$

where $\pi_{\theta}(C(x)|x)=\int_{C(x)}\pi_{\theta}(u|x)\,du$. As long as we can check membership in $C(x)$ for any given $u\in\mathcal{U}$, and assuming that the volume of $C(x)$ is strictly positive for all $x\in\mathcal{X}$, we can generate samples from this distribution by rejection sampling, i.e., repeatedly sampling $u\sim\pi_{\theta}(\cdot|x)$ until we obtain $u\in C(x)$. Note that, depending on the structure of the parametrized policies $\pi^{C}_{\theta}$, if $C(x)$ has a particularly nice form, such as an interval or hyperrectangle, there may be more efficient methods than rejection sampling for sampling from the truncated distribution $\pi^{C}_{\theta}(\cdot|x)$ directly. We exploit this fact when leveraging Beta policies in the experimental results of Section 4 below.
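The rejection sampling procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian policy, the fixed interval used as C(x), the helper names, and the `max_tries` cap are all hypothetical choices made for the example.

```python
import random


def sample_truncated(sample_policy, is_safe, x, max_tries=10_000):
    """Draw u from the truncated policy by resampling until u lies in C(x)."""
    for _ in range(max_tries):
        u = sample_policy(x)
        if is_safe(x, u):
            return u
    # The cap is a practical safeguard added for this sketch.
    raise RuntimeError("no safe action found within max_tries")


rng = random.Random(0)


def gaussian_policy(x):
    """Hypothetical untruncated policy: Gaussian with state-dependent mean."""
    return rng.gauss(0.5 * x, 1.0)


def is_safe(x, u):
    """Hypothetical safe action set: the interval [-1, 0.1] for every state."""
    return -1.0 <= u <= 0.1


samples = [sample_truncated(gaussian_policy, is_safe, 1.0) for _ in range(1000)]
```

Every returned action lies in C(x) by construction, so the constraint holds for each sample rather than only in expectation. The expected number of draws per sample is the reciprocal of the policy mass on C(x), which is why a direct sampler is preferable when C(x) is an interval or hyperrectangle.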

With this setup in mind, and given a fixed start state $x_{0}$, we propose a policy gradient-based algorithm maximizing the objective function

$$J(\theta)=\mathbb{E}_{\pi_{\theta}^{C}}\left[\sum_{k=0}^{\infty}\gamma^{k}r(x_{k},u_{k})\ \Big|\ x_{0}\right], \tag{3}$$

the expected discounted reward under policy $\pi^{C}_{\theta}$. Before proceeding with describing and analyzing the algorithm, we first need to identify conditions that ensure that, for each policy parameter $\theta$, taking expectations with respect to $\pi_{\theta}^{C}$ is well-defined and thus meaningful. In order for (3) to be well-defined, we need to know that, for each policy parameter $\theta$, the occupancy measure of the Markov chain induced by $\pi_{\theta}^{C}$ on $\mathcal{S}$ is irreducible and satisfies certain ergodicity conditions. Once these are proven, we will be justified in performing gradient ascent on the objective function (3).
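For intuition, the objective (3) can be approximated by averaging truncated discounted returns over rollouts that use only safe actions. The sketch below is illustrative and assumes a discount factor strictly below one, a finite horizon cutoff, and a hypothetical toy system (dynamics 0.9x + u, reward equal to the negative squared state, and a uniform policy truncated to the interval C(x)); none of these choices come from the paper.

```python
import random


def discounted_return(rewards, gamma):
    """Compute the sum over k of gamma^k * r_k for a finite reward sequence."""
    total, weight = 0.0, 1.0
    for r in rewards:
        total += weight * r
        weight *= gamma
    return total


def estimate_objective(step, reward, sample_safe_action, x0, gamma,
                       horizon=200, n_rollouts=200, seed=0):
    """Monte Carlo estimate of J(theta), truncating each rollout at `horizon`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rollouts):
        x, rewards = x0, []
        for _ in range(horizon):
            u = sample_safe_action(x, rng)   # u is drawn from C(x) only
            rewards.append(reward(x, u))
            x = step(x, u)
        total += discounted_return(rewards, gamma)
    return total / n_rollouts


# Hypothetical toy problem (assumptions for illustration only).
def step(x, u):
    return 0.9 * x + u


def reward(x, u):
    return -(x * x)


def sample_safe_action(x, rng):
    """Uniform policy on U = [-1, 1] truncated to C(x), an interval here."""
    low = max(-1.0, -1.0 - 0.9 * x)
    high = min(1.0, 1.0 - 0.9 * x)
    return rng.uniform(low, high)


J_hat = estimate_objective(step, reward, sample_safe_action, x0=0.5, gamma=0.9)
```

With bounded rewards, the error from cutting the sum at the horizon H is at most gamma^H times the reward bound divided by (1 - gamma), which is negligible for the values used here.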

3 THEORETICAL RESULTS

In this section we develop the theory underlying our sampling-based method for RL with hard safety constraints. Our key contributions include proving that (3) is well-defined (§3.1), obtaining gradient expressions for it from which we can sample (§3.2), and developing and establishing the convergence of a policy gradient algorithm for optimizing (3) (§3.3 and §3.4). All proofs are deferred to the supplementary material. It is important to note that, though we assumed the deterministic dynamics common to safe control in §2, all our theoretical results go through in the stochastic dynamics case under standard ergodicity assumptions. Our key theoretical contribution in what follows is to show that, even in the deterministic dynamics case, we can ensure that the objective is well-defined (§3.1) and obtain convergence (§3.4).

3.1 Discounted Return is Well-defined

First, we show that the objective (3) is well-defined when using truncated policies, even in continuous-space systems with deterministic dynamics. Our key contribution in this setting is to ensure that, given reasonable conditions on the policies under consideration, important ergodicity properties of their induced Markov chains hold. This fact, established in Proposition 1 and Corollary 1, is nontrivial and its proof relies on a careful analysis of the propagation of probability mass through the transition dynamics and an interesting application of the Lebesgue-Radon-Nikodym Theorem (Folland, 1999, §3.2). As in the previous section, let $\mathcal{S}\subset\mathbb{R}^{m}$ denote the “safe” set within which the system remains so long as all control inputs are selected from $C(x)$. Though we leave open the possibility that $\mathcal{S}$ satisfies a more specific stability condition rather than a generic notion of “safety”, we will typically use the term “safe” for ease of presentation. Let $\mu$ denote Lebesgue measure. We make the following definition:

Definition 1.

The Markov chain $\{x_{k}\}_{k\in\mathbb{N}}$ induced by $\pi^{C}_{\theta}$ on $\mathcal{X}$ is $\mu$-irreducible on $\mathcal{S}$ if, for any $\mu$-measurable $\mathcal{B}\subset\mathcal{S}$ with $\mu(\mathcal{B})>0$, we have $\sum_{k\in\mathbb{N}}P(x_{k}\in\mathcal{B}\ |\ x_{0}=x)>0$ for all $x\in\mathcal{S}$.

This means that, for a Markov chain to be ($\mu$-)irreducible on the safety set, all safe subsets with positive volume must be reachable from any initial safe state with positive probability. Notice that $\{x_{k}\}_{k\in\mathbb{N}}$ is in fact a Markov chain on the safe set $\mathcal{S}$ since, by the definition of $C(x)$, only those control inputs keeping the system within $\mathcal{S}$ are allowed. In the sequel, we will prove that, for each $\theta$, under suitable conditions the Markov chain induced by $\pi_{\theta}^{C}$ on $\mathcal{S}$ is irreducible and the objective (3) is thus well-defined, which is a prerequisite for developing policy gradient methods based on it. See (Konda, 2002, §2.3) for details on irreducibility in this setting.

Given an element $x\in\mathcal{S}$ and dynamics $\mathcal{T}$, let $R(x)\subset\mathcal{S}$ denote the set of all elements reachable in one step from $x$ under $\mathcal{T}$. Furthermore, for $\mathcal{A}\subset\mathcal{S}$, define $R(\mathcal{A})=\cup_{x\in\mathcal{A}}R(x)$. Also, given $\varepsilon>0$ and $x\in\mathbb{R}^{m}$, let $B_{\varepsilon}(x)$ denote the open ball of radius $\varepsilon$ centered at $x$. Finally, for $\mathcal{A}\subset\mathcal{S}$, define $\mathcal{T}^{-1}_{x}(\mathcal{A}):=\{u\in\mathcal{U}\ |\ \mathcal{T}(x,u)\in\mathcal{A}\}$. Intuitively, $\mathcal{T}^{-1}_{x}(\mathcal{A})$ is the set of all control inputs that, when taken in state $x$, drive the system into $\mathcal{A}$. The following assumptions are needed in what follows.

Assumption 2.

For any $x\in\mathcal{S}$ and any $\mu$-measurable set $\mathcal{A}\subset R(x)$, $\mu(\mathcal{A})>0$ if and only if $\mu(\mathcal{T}^{-1}_{x}(\mathcal{A}))>0$.

Assumption 2 ensures the system dynamics map positive volume subsets of control inputs to positive volume subsets of the state space and vice versa, which is important for our application of the Lebesgue-Radon-Nikodym Theorem in Proposition 1. It is satisfied by systems where control inputs have a measurable effect on each entry in the next state vector and thus encompasses a wide array of potentially nonlinear systems.
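As an illustration only (this example is not taken from the paper), consider hypothetical dynamics that are affine in the control, with as many inputs as states and an invertible input matrix. The function $f$ and the matrix $B$ are assumptions made for this sketch. In that case the preimage appearing in Assumption 2 is an affine image of $\mathcal{A}$, so the two volumes differ only by a constant factor and one is positive exactly when the other is:

```latex
\begin{gather*}
\mathcal{T}(x,u) = f(x) + Bu, \qquad B \in \mathbb{R}^{m \times m} \text{ invertible}, \quad n = m, \\
\mathcal{T}^{-1}_{x}(\mathcal{A}) = B^{-1}\bigl(\mathcal{A} - f(x)\bigr)
\quad \text{for any } \mu\text{-measurable } \mathcal{A} \subset R(x), \\
\mu\bigl(\mathcal{T}^{-1}_{x}(\mathcal{A})\bigr) = |\det B|^{-1}\,\mu(\mathcal{A}).
\end{gather*}
```

The second line uses the fact that every point of $R(x)$ is produced by some input in $\mathcal{U}$ and that $B$ is injective, so the affine preimage of $\mathcal{A}$ already lies inside $\mathcal{U}$.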

Assumption 3.

For any $\theta\in\Theta$, where $\Theta$ is the set of permissible policy parameters, for any element in the safe set $x\in\mathcal{S}$, and for any set $\mathcal{A}\subset C(x)$ satisfying $\mu(\mathcal{A})>0$, the policy $\pi^{C}_{\theta}(\cdot|x)$ assigns positive probability to $\mathcal{A}$, i.e., $\int_{\mathcal{A}}\pi^{C}_{\theta}(a|x)\,da>0$.

Assumption 3, which is standard in the RL literature, ensures that any set of allowable control inputs that has strictly positive volume will be sampled from with strictly positive probability.

Assumption 4.

For each $x\in\mathcal{S}$, $\mu(R(x))>0$, and, given $\mathcal{B}\subset\mathcal{S}$, there exists $n\in\mathbb{N}$ such that $\mathcal{B}$ is reachable in $n$ steps from $x$.

The conditions imposed in Assumption 4 guarantee that, for any state $x\in\mathcal{S}$: (i) the set of states reachable from $x$ in one step has strictly positive volume; (ii) any subset of the safe set $\mathcal{S}$ is reachable from $x$ in a finite number $n$ of steps. These conditions are closely related to the familiar notion of controllability from control theory. Under these conditions, we have the following proposition and its immediate corollary.

Proposition 1.

Under Assumptions 2, 3, 4, for given $\theta$ and any subset $\mathcal{B}\subset\mathcal{S}$ satisfying $\mu(\mathcal{B})>0$, the Markov chain induced by $\pi^{C}_{\theta}$ on $\mathcal{S}$ enters $\mathcal{B}$ with strictly positive probability.

Corollary 1.

$\{x_{n}\}$ is $\mu$-irreducible on $\mathcal{S}$.

Now that we are assured that the objective function (3) is well-defined, we are justified in attempting to perform gradient ascent on it. In order to accomplish this, however, we need access to gradient estimates. This is the subject of the next section.

3.2 Policy Gradients

Despite the presence of $C$ in $\pi_{\theta}^{C}$, under mild assumptions on the underlying policy $\pi_{\theta}$, we can apply the classic policy gradient theorem of Sutton et al. (2000) to (3) to obtain a gradient expression from which we can sample. Let $d_{\theta}^{C}(\cdot):=(1-\gamma)\sum_{k=0}^{\infty}\gamma^{k}P(x_{k}\in\cdot\ |\ \pi_{\theta}^{C})$ denote the discounted state occupancy measure of the Markov chain induced by policy $\pi_{\theta}^{C}$ on $\mathcal{S}$.
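The discounted occupancy measure defined above weights the state at step k by (1 - gamma) times gamma to the power k, and these weights form a geometric distribution over k. One common way to draw a state from such a measure, sketched here under the assumption that gamma is strictly below one, is to sample a geometric horizon K and return the state reached after K safe steps. The toy dynamics and safe-action sampler are hypothetical and not taken from the paper.

```python
import random


def sample_horizon(gamma, rng):
    """Draw K with P(K = k) = (1 - gamma) * gamma**k for k = 0, 1, 2, ..."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("this sketch requires 0 <= gamma < 1")
    k = 0
    while rng.random() < gamma:   # continue with probability gamma
        k += 1
    return k


def sample_from_occupancy(step, sample_safe_action, x0, gamma, rng):
    """Return x_K for a geometric K, using only actions drawn from C(x)."""
    x = x0
    for _ in range(sample_horizon(gamma, rng)):
        x = step(x, sample_safe_action(x, rng))
    return x


# Hypothetical toy system (assumptions for illustration only).
def step(x, u):
    return 0.9 * x + u


def sample_safe_action(x, rng):
    """Uniform policy truncated to C(x) = {u in [-1, 1] : 0.9x + u in [-1, 1]}."""
    low = max(-1.0, -1.0 - 0.9 * x)
    high = min(1.0, 1.0 - 0.9 * x)
    return rng.uniform(low, high)


rng = random.Random(0)
horizons = [sample_horizon(0.9, rng) for _ in range(20000)]
mean_horizon = sum(horizons) / len(horizons)   # theoretical mean is 0.9 / 0.1
state = sample_from_occupancy(step, sample_safe_action, 0.5, 0.9, rng)
```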
Furthermore, let $Q^{\pi_{\theta}^{C}}(x,u)=\mathbb{E}_{\pi_{\theta}^{C}}\left[\sum_{k=0}^{\infty}\gamma^{k}r(x_{k},u_{k})\ |\ x_{0}=x,u_{0}=u\right]$. We make the following assumption:

Assumption 5.

$\pi_{\theta}(u|x)>0$ and $\pi_{\theta}(u|x)$ is differentiable in $\theta$, for all $x\in\mathcal{X},u\in\mathcal{U}$.

Recall from (2) that $\pi^{C}_{\theta}(\cdot|x)$ is simply the probability density function $\pi_{\theta}(\cdot|x)$ truncated to the set $C(x)$. Since the set $C(x)$ at a given $x$ does not depend on $\theta$, we can differentiate under the integral sign in $\pi_{\theta}(C(x)|x)=\int_{C(x)}\pi_{\theta}(u|x)du$ to obtain $\nabla\pi_{\theta}(C(x)|x)=\int_{C(x)}\nabla\pi_{\theta}(u|x)du$, so $\pi_{\theta}(C(x)|x)$ is differentiable. Given these facts, combined with Assumption 5, the above expression for $\pi^{C}_{\theta}$ implies that, for any $x\in\mathcal{S}$, the policy $\pi^{C}_{\theta}(u|x)$ is differentiable with respect to $\theta$, for any $u\in C(x)$.
In short, $\pi^{C}_{\theta}$ satisfies its own version of Assumption 5, which we formalize in the following:

Lemma 1.

$\pi^{C}_{\theta}(u|x)>0$ and $\pi^{C}_{\theta}(u|x)$ is differentiable in $\theta$, for all $x\in\mathcal{S}$ and $u\in C(x)$.

The policy gradient theorem (Konda, 2002) implies

$$\nabla J(\theta)=\frac{1}{1-\gamma}\mathbb{E}_{\pi^{C}_{\theta}}\left[Q^{\pi_{\theta}^{C}}(x,u)\nabla\log\pi_{\theta}^{C}(u|x)\right]. \tag{4}$$

In order to carry out gradient updates based on this expression, we first need to be able to estimate $\nabla_{\theta}\log\pi^{C}_{\theta}(u|x)=\nabla_{\theta}\pi^{C}_{\theta}(u|x)/\pi^{C}_{\theta}(u|x)$, for arbitrary $u,x$. We will discuss how to estimate $Q^{\pi^{C}_{\theta}}(x,u)$ in an unbiased manner in the following section. Since we already have access to $\pi^{C}_{\theta}(u|x)$, we can focus on estimating $\nabla_{\theta}\pi^{C}_{\theta}(u|x)$.
Based on (2), the gradient of $\pi^{C}_{\theta}(u|x)$ with respect to $\theta$ is

\begin{align}
\nabla\pi^{C}_{\theta}(u|x) &= \nabla\left[\frac{\pi_{\theta}(u|x)}{\pi_{\theta}(C(x)|x)}\right] \tag{5}\\
&= \frac{\nabla\pi_{\theta}(u|x)}{\pi_{\theta}(C(x)|x)}-\frac{\pi_{\theta}(u|x)}{[\pi_{\theta}(C(x)|x)]^{2}}\nabla\pi_{\theta}(C(x)|x) \tag{6}\\
&= \frac{1}{\pi_{\theta}(C(x)|x)}\left[\nabla\pi_{\theta}(u|x)-\pi_{\theta}(u|x)\nabla\log\pi_{\theta}(C(x)|x)\right]. \tag{7}
\end{align}

To estimate $\pi_{\theta}(C(x)|x)$, we need to be able to estimate $\int_{C(x)}\pi_{\theta}(u|x)du$. Given access to $\pi_{\theta}(\cdot|x)$ and $C(x)$, we can use numerical integration or Monte Carlo techniques to approximate this integral. The standard Monte Carlo approach is to uniformly sample $M$ elements $u_{i}\sim U(C(x))$ from $C(x)$, then estimate

$$\widehat{\pi_{\theta}}(C(x)|x)=\mu(C(x))\frac{1}{M}\sum_{i=1}^{M}\pi_{\theta}(u_{i}|x), \tag{8}$$

where $\mu(C(x))$ is the volume of $C(x)$. This estimate is based on the fact that

\begin{align}
\pi_{\theta}(C(x)|x) &= \int_{C(x)}\pi_{\theta}(u|x)du \tag{9}\\
&= \mu(C(x))\int_{C(x)}\frac{\pi_{\theta}(u|x)}{\mu(C(x))}du \tag{10}\\
&= \mu(C(x))E_{u\sim U(C(x))}[\pi_{\theta}(u|x)] \tag{11}\\
&= \mu(C(x))\lim_{M\rightarrow\infty}\frac{1}{M}\sum_{i=1}^{M}\pi_{\theta}(u_{i}|x), \tag{12}
\end{align}

where the last equality holds by the law of large numbers. Since $C(x)$ is fixed given $x$, gradient estimates $\widehat{\nabla\log\pi_{\theta}}(C(x)|x)$ and ultimately $\widehat{\nabla\log\pi^{C}_{\theta}}(u|x)$ can also be obtained by estimating the integral $\int_{C(x)}\nabla\pi_{\theta}(u|x)du$. In the Monte Carlo situation, this can be obtained from (8) by differentiating each term with respect to $\theta$.
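As a concrete illustration of (7) and (8), the following Python sketch estimates $\pi_{\theta}(C(x)|x)$, its gradient, and the score of the truncated policy in a one-dimensional example. Dividing (7) by $\pi^{C}_{\theta}(u|x)=\pi_{\theta}(u|x)/\pi_{\theta}(C(x)|x)$ gives $\nabla\log\pi^{C}_{\theta}(u|x)=\nabla\log\pi_{\theta}(u|x)-\nabla\log\pi_{\theta}(C(x)|x)$, which is the quantity returned by `truncated_score`. The Gaussian policy with mean $\theta$ and fixed standard deviation, the interval $C=[a,b]$, and all function names are assumptions made purely for illustration.

```python
import math

import numpy as np

SIGMA = 0.5  # assumed fixed standard deviation of the illustrative Gaussian policy


def gaussian_pdf(u, theta):
    """Density of the assumed policy pi_theta(u) = N(u; theta, SIGMA^2)."""
    return np.exp(-0.5 * ((u - theta) / SIGMA) ** 2) / (SIGMA * math.sqrt(2.0 * math.pi))


def mc_mass_and_grad(theta, a, b, num_samples, rng):
    """Monte Carlo estimates of pi_theta(C) and its theta-gradient for C = [a, b]."""
    u = rng.uniform(a, b, size=num_samples)    # u_i ~ U(C), as in (8)
    dens = gaussian_pdf(u, theta)
    grad_dens = dens * (u - theta) / SIGMA**2  # d/dtheta of the Gaussian density
    volume = b - a                             # mu(C) for an interval
    return volume * dens.mean(), volume * grad_dens.mean()


def truncated_score(u, theta, a, b, num_samples, rng):
    """Estimate of grad log pi_theta^C(u) = grad log pi_theta(u) - grad log pi_theta(C)."""
    mass, grad_mass = mc_mass_and_grad(theta, a, b, num_samples, rng)
    return (u - theta) / SIGMA**2 - grad_mass / mass


def exact_mass_and_grad(theta, a, b):
    """Closed-form reference values, available only because this example is Gaussian."""
    def cdf(z):
        return 0.5 * (1.0 + math.erf((z - theta) / (SIGMA * math.sqrt(2.0))))
    return cdf(b) - cdf(a), float(gaussian_pdf(a, theta) - gaussian_pdf(b, theta))
```

Comparing `mc_mass_and_grad` against `exact_mass_and_grad` for increasing `num_samples` gives a quick check of how large $M$ needs to be for a given constraint set in this toy setting.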

3.3 Algorithm

In this section, we present a hard safety-constrained random-horizon policy gradient (Safe-RPG) algorithm. Our algorithm is based on the random-horizon policy gradient (RPG) scheme developed in Zhang et al. (2020), which uses a random rollout horizon and recent advances in non-convex optimization to obtain unbiased policy gradient estimates and ensure finite-time convergence to approximately locally optimal policies. As discussed in the following section, our convergence results ensure asymptotic convergence of Algorithm 2 to a stationary point of (3), but can likely be strengthened to prove finite-time convergence to approximately locally optimal policies. The main algorithm is presented in Algorithm 2, which depends on the action-value function estimation subroutine in Algorithm 1.

3.4 Convergence

In this section we show asymptotic convergence of Algorithm 2 to the set of stationary points of (3). The key challenge in this result revolves around the need to establish that the policies we consider satisfy important differentiability and continuity properties, which necessitates a careful analysis of the Lipschitz properties of the score functions of our truncated policies in the proof of Lemma 2. To proceed, we need the following assumption on the reward function $r$ and underlying, untruncated policy class $\{\pi_{\theta}\}_{\theta\in\Theta}$.

Assumption 6.

The reward function $r$ and parameterized policy class $\{\pi_{\theta}\}_{\theta\in\Theta}$ satisfy the following:

  1. The absolute value of the reward $r$ is uniformly bounded, i.e., there exists $U_{r}$ such that $0\leq\sup_{(x,u)\in\mathcal{X}\times\mathcal{U}}|r(x,u)|\leq U_{r}$.

  2. For all $x\in\mathcal{X},u\in\mathcal{U}$, $\nabla\log\pi_{\theta}(u|x)$ exists, and there exist $L_{\Theta}\geq 0$ and $B_{\Theta}\geq 0$ such that, for all $x\in\mathcal{X},u\in\mathcal{U}$,

    (a) $\left\|\nabla\log\pi_{\theta}(u|x)-\nabla\log\pi_{\theta^{\prime}}(u|x)\right\|\leq L_{\Theta}\left\|\theta-\theta^{\prime}\right\|$, for all $\theta,\theta^{\prime}\in\Theta$

    (b) $\left\|\nabla\log\pi_{\theta}(u|x)\right\|\leq B_{\Theta}$, for all $\theta\in\Theta$.

Data: $x,u,\theta$.
Result: Unbiased estimate of $Q^{\pi^{C}_{\theta}}(x,u)$.
Initialization: Sample $T\sim\text{Geom}(1-\gamma^{1/2})$ and initialize $\hat{Q}\leftarrow 0$, $x_{0}\leftarrow x$, $u_{0}\leftarrow u$.
for $t=0,\ldots,T-1$ do
      $\hat{Q}\leftarrow\hat{Q}+\gamma^{t/2}r(x_{t},u_{t})$
      $x_{t+1}\sim\mathcal{P}(\cdot|x_{t},u_{t})$
      $u_{t+1}\sim\pi^{C}_{\theta}(\cdot|x_{t+1})$
end for
$\hat{Q}\leftarrow\hat{Q}+\gamma^{T/2}r(x_{T},u_{T})$
return $\hat{Q}$
Algorithm 1 EstQ: Unbiasedly Estimating $Q$
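
Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `step`, `policy`, and `reward` are hypothetical stand-ins for $\mathcal{P}$, $\pi^{C}_{\theta}$, and $r$, and the horizon is assumed to have support $\{0,1,2,\ldots\}$, i.e., $P(T=t)=(1-\gamma^{1/2})\gamma^{t/2}$. Under that convention $P(T\geq t)=\gamma^{t/2}$, so the expected weight on $r(x_{t},u_{t})$ is $\gamma^{t/2}\cdot\gamma^{t/2}=\gamma^{t}$, which is what makes the estimate unbiased.

```python
import numpy as np


def est_q(x, u, gamma, step, policy, reward, rng):
    """Random-horizon estimate of Q(x, u), following the structure of Algorithm 1.

    step(x, u, rng) -> next state, policy(x, rng) -> action, and
    reward(x, u) -> float are hypothetical callables supplied by the user.
    """
    # NumPy's geometric sampler has support {1, 2, ...}; subtract 1 so that
    # P(T = t) = (1 - gamma^{1/2}) * gamma^{t/2} for t = 0, 1, 2, ...
    horizon = rng.geometric(1.0 - gamma**0.5) - 1
    q_hat = 0.0
    for t in range(horizon):
        q_hat += gamma ** (t / 2) * reward(x, u)
        x = step(x, u, rng)
        u = policy(x, rng)
    # Final term at t = T, as in the last update of Algorithm 1.
    return q_hat + gamma ** (horizon / 2) * reward(x, u)
```

With a constant reward of 1, averaging many calls should approach $1/(1-\gamma)$, which gives a simple sanity check of the horizon convention.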
Data: $x_{0},\theta_{0}$, Monte Carlo sample size $M$.
Result: Locally optimal policy
Initialization: Set $k\leftarrow 0$.
repeat
      Sample $T_{k+1}\sim\text{Geom}(1-\gamma)$, $u_{0}\sim\pi^{C}_{\theta_{k}}(\cdot|x_{0})$.
      for $t=0,\ldots,T_{k+1}-1$ do
            $x_{t+1}\sim\mathcal{P}(\cdot|x_{t},u_{t})$
            $u_{t+1}\sim\pi^{C}_{\theta_{k}}(\cdot|x_{t+1})$
      end for
      $\hat{Q}^{\pi^{C}_{\theta_{k}}}(x_{T_{k+1}},u_{T_{k+1}})=\texttt{EstQ}(x_{T_{k+1}},u_{T_{k+1}},\theta_{k})$
      Uniformly sample $\{u_{l}\}_{l=1,\ldots,M}$ from $C(x_{T_{k+1}})$, then use them to compute $\widehat{\nabla\log\pi^{C}_{\theta_{k}}}(u_{T_{k+1}}|x_{T_{k+1}})$
      $\theta_{k+1}\leftarrow\theta_{k}+\frac{\alpha_{k}}{1-\gamma}\hat{Q}^{\pi^{C}_{\theta_{k}}}(x_{T_{k+1}},u_{T_{k+1}})\widehat{\nabla\log\pi^{C}_{\theta_{k}}}(u_{T_{k+1}}|x_{T_{k+1}})$
      $k\leftarrow k+1$
until convergence;
Algorithm 2 Safe-RPG: Hard safety-constrained Random-horizon Policy Gradient
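
To make the structure of Algorithm 2 concrete, the sketch below runs it on a toy scalar system. Everything specific to the example is an assumption and does not come from the paper: the dynamics $x_{t+1}=x_{t}+0.1u_{t}$, the interval-valued safe set `safe_interval` standing in for $C(x)$, the Gaussian base policy with mean $\theta x$, inverse-CDF sampling of the truncated policy, the stepsize $0.01/(k+1)$, and the projection of $\theta$ onto $[-2,2]$ (which also keeps the inverse-CDF sampler numerically well behaved, so `theta` should start in that range). Both geometric horizons are assumed to have support $\{0,1,2,\ldots\}$.

```python
from statistics import NormalDist

import numpy as np

GAMMA, SIGMA, DT = 0.9, 0.5, 0.1  # discount, policy std, step size (all assumed)


def safe_interval(x):
    """Hypothetical C(x): inputs in [-1, 1] that keep |x + DT * u| <= 1."""
    return max(-1.0, (-1.0 - x) / DT), min(1.0, (1.0 - x) / DT)


def sample_safe_action(theta, x, rng):
    """Draw u from N(theta * x, SIGMA^2) truncated to C(x), via the inverse CDF."""
    a, b = safe_interval(x)
    dist = NormalDist(theta * x, SIGMA)
    p = dist.cdf(a) + rng.uniform() * (dist.cdf(b) - dist.cdf(a))
    return min(max(dist.inv_cdf(p), a), b)  # clamp guards against round-off


def step(x, u):
    """Toy deterministic dynamics standing in for P(.|x, u)."""
    return x + DT * u


def reward(x, u):
    """Quadratic state cost; bounded on the safe set |x| <= 1."""
    return -x * x


def est_q(theta, x, u, rng, visited):
    """Algorithm 1 on the toy system, recording every visited state."""
    horizon = rng.geometric(1.0 - GAMMA**0.5) - 1
    q_hat = 0.0
    for t in range(horizon):
        q_hat += GAMMA ** (t / 2) * reward(x, u)
        x = step(x, u)
        visited.append(x)
        u = sample_safe_action(theta, x, rng)
    return q_hat + GAMMA ** (horizon / 2) * reward(x, u)


def score_estimate(theta, x, u, num_samples, rng):
    """Monte Carlo estimate of grad_theta log pi_theta^C(u|x) from uniform samples of C(x)."""
    a, b = safe_interval(x)
    us = rng.uniform(a, b, size=num_samples)
    weights = np.exp(-0.5 * ((us - theta * x) / SIGMA) ** 2)  # unnormalized density
    grad_log_mass = np.sum(weights * (us - theta * x) * x) / (SIGMA**2 * np.sum(weights))
    return (u - theta * x) * x / SIGMA**2 - grad_log_mass


def safe_rpg(theta, x0, num_iters, num_samples, seed=0):
    """Run the Safe-RPG loop; returns the final theta and all visited states."""
    rng = np.random.default_rng(seed)
    visited = [x0]
    for k in range(num_iters):
        alpha = 0.01 / (k + 1)  # sum of alpha_k diverges, sum of alpha_k^2 converges
        horizon = rng.geometric(1.0 - GAMMA) - 1
        x, u = x0, sample_safe_action(theta, x0, rng)
        for _ in range(horizon):
            x = step(x, u)
            visited.append(x)
            u = sample_safe_action(theta, x, rng)
        q_hat = est_q(theta, x, u, rng, visited)
        theta += alpha / (1.0 - GAMMA) * q_hat * score_estimate(theta, x, u, num_samples, rng)
        theta = float(np.clip(theta, -2.0, 2.0))  # assumed projection onto a compact set
    return theta, visited
```

Because every action is drawn from the truncated policy, each state appended to `visited` stays in $[-1,1]$ up to floating-point round-off, during both the outer rollout and the `est_q` rollout. The sketch makes no claim about how quickly $\theta$ converges on this toy problem.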

Assumptions 5 and 6 were used to prove asymptotic convergence of the RPG algorithm with untruncated policies to stationary points in (Zhang et al., 2020, Theorem 4.4). For an analogous result to apply to the truncated policies we consider, it must be shown that the Lipschitz and differentiability conditions in part 2 of Assumption 6 hold for the constrained policies $\{\pi^{C}_{\theta}\}_{\theta\in\Theta}$. It turns out that, under the same conditions on the untruncated policy $\{\pi_{\theta}\}_{\theta\in\Theta}$, these properties are automatically satisfied for $\{\pi^{C}_{\theta}\}_{\theta\in\Theta}$.

Lemma 2.

Under Assumptions 1, 5, and 6, $\nabla\log\pi^{C}_{\theta}(u|x)$ exists, for all $x\in\mathcal{X},u\in\mathcal{U}$. Furthermore, there exist constants $L^{C}_{\Theta}\geq 0$ and $B^{C}_{\Theta}\geq 0$ such that, for all $x\in\mathcal{X},u\in\mathcal{U}$,
(i) $\left\|\nabla\log\pi^{C}_{\theta}(u|x)-\nabla\log\pi^{C}_{\theta^{\prime}}(u|x)\right\|\leq L^{C}_{\Theta}\left\|\theta-\theta^{\prime}\right\|$, for all $\theta,\theta^{\prime}\in\Theta$, and
(ii) $\left\|\nabla\log\pi^{C}_{\theta}(u|x)\right\|\leq B^{C}_{\Theta}$, for all $\theta\in\Theta$.

With Lemma 2 in hand, we have the following result.

Theorem 1.

Let Assumptions 3, 4, 5, and 6 hold. Let $\{\theta_{k}\}_{k\in\mathbb{N}}$ be the sequence generated by Algorithm 2 with stepsize sequence $\{\alpha_{k}\}_{k\in\mathbb{N}}$ satisfying $\sum_{k=0}^{\infty}\alpha_{k}=\infty$ and $\sum_{k=0}^{\infty}\alpha_{k}^{2}<\infty$. Then $\lim_{k\rightarrow\infty}\theta_{k}\in\Theta^{*}$, where $\Theta^{*}$ is the set of stationary points of (3).

Remark 1.

By arguments analogous to those in the proof of (Zhang et al., 2019, Thm. 3), it can also be shown that, under the same assumptions as in Theorem 1 and appropriate stepsize selection, Algorithm 2 achieves $\varepsilon$-approximate first-order stationarity with a finite-time sample complexity of $\mathcal{O}(\varepsilon^{-2})$.

Given Lemma 2, the proof of the theorem follows directly from that of (Zhang et al., 2020, Theorem 4.4). With suitable modifications to Algorithm 2 incorporating periodically increasing stepsizes, these results can likely be strengthened to obtain finite-time convergence to an $\varepsilon$-locally optimal policy using the machinery developed in Zhang et al. (2020). We leave this to future work.

4 EXPERIMENTAL RESULTS

We now experimentally demonstrate the effectiveness of our sampling-based safe RL approach. Specifically, we evaluate the use of CBF-constrained Beta policies combined with the popular Proximal Policy Optimization (PPO) (Schulman et al., 2017) algorithm on safety-constrained inverted pendulum and quadcopter navigation environments. The use of Beta policies with variable action space constraints allows us to directly sample from a CBF-constrained action space at each timestep. In addition to providing a practical example of how truncated policies can be used to ensure safety, this method extends the work of Chou et al. (2017) on the use of Beta policies for deep RL from constant to state-dependent action space constraints.

Figure 1: Safety and convergence of CBF-Constrained Beta policies: Agent was trained with PPO on the quadrotor navigation problem with an obstacle. Safety was maintained and goal was eventually reached. Panels: (a) Initial exploration. (b) Discovering goal direction. (c) Reaching the goal.
Figure 2: Failure of benchmark safety-filtered Gaussian policies: Agents were trained with PPO using three different obstacle configurations. When the obstacle is distant or non-existent, the method succeeds. When the obstacle is in the way, the resulting policy is suboptimal. In all cases, safety is maintained. Panels: (a) No obstacle. (b) Distant obstacle. (c) Interfering obstacle.

4.1 CBF-Constrained Beta Policies

When actions must be restricted to lie within fixed, predetermined bounds due to physical or numerical constraints, the common practice of simply clipping policies with infinite support (e.g., Gaussian policies) can cause bias and performance issues. Chou et al. (2017) propose and leverage finite-support Beta distribution-based policies to overcome these issues. We extend this approach to obtain policies that sample directly from the safe control actions prescribed by the CBF at a given state.

In order to describe these CBF-constrained Beta policies, let us first recall the probability density function (p.d.f.) of a one-dimensional Beta distribution:

$$f(u;\alpha,\beta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}u^{\alpha-1}(1-u)^{\beta-1}, \tag{13}$$

where $u\in[0,1]$, $\alpha,\beta>0$, and $\Gamma(z)=\int_{0}^{\infty}u^{z-1}e^{-u}\,du$ is the Gamma function defined for $z\in\mathbb{C}$ with $\text{Re}(z)>0$. A Beta policy sampling from the fixed interval $[0,1]$ is given by $f(u;\alpha_{\theta}(x),\beta_{\theta}(x))$, where $\alpha_{\theta},\beta_{\theta}:\mathcal{X}\rightarrow\mathbb{R}^{+}$ are parameterized functions (e.g., neural networks) mapping states to the parameters $\alpha,\beta$ of the Beta distribution.
When the action space $\mathcal{U}\subset\mathbb{R}^{n}$ is of dimension $n\geq 2$ and the CBF constraint set (or an inner approximation of it), $C(x)$, can be expressed as a hyperrectangle with lower and upper bounds $a(x),b(x)\in\mathbb{R}^{n}$, respectively, we maintain independent Beta distributions, $f^{i}(\cdot;\alpha^{i}_{\theta}(x),\beta^{i}_{\theta}(x))$, over each dimension $i$ of the unit box $[0,1]^{n}$, and samples from these distributions are shifted and rescaled to lie within the bounds given by $a(x),b(x)$.
Specifically, our CBF-constrained Beta policies, denoted $\pi_{\theta}$, sample $u\sim\pi_{\theta}(\cdot|x)$ from $C(x)$ by first sampling $\hat{u}^{i}\sim f^{i}(\cdot;\alpha^{i}_{\theta}(x),\beta^{i}_{\theta}(x))$, then performing the simple transformation $u=a(x)+\text{diag}(\hat{u}^{1},\ldots,\hat{u}^{n})(b(x)-a(x))$, where $\text{diag}(\hat{u}^{1},\ldots,\hat{u}^{n})$ denotes the diagonal matrix with elements $\hat{u}^{1},\ldots,\hat{u}^{n}$ along the diagonal.
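The sampling procedure just described can be sketched in a few lines. The following is a minimal illustration using only the Python standard library, not the implementation used in our experiments: the function names, the fixed parameter values standing in for $\alpha_{\theta}(x),\beta_{\theta}(x)$, and the bounds standing in for $a(x),b(x)$ are all hypothetical. The sketch also returns the log-density of the rescaled action, which follows from (13) by the usual change of variables for an affine map (each coordinate contributes a factor $1/(b^{i}(x)-a^{i}(x))$); a policy gradient method such as PPO needs this quantity.

```python
import math
import random


def beta_log_pdf(u_hat, alpha, beta):
    """Log of the Beta density (13), evaluated at u_hat in (0, 1)."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return log_norm + (alpha - 1.0) * math.log(u_hat) + (beta - 1.0) * math.log(1.0 - u_hat)


def sample_cbf_beta(alphas, betas, lower, upper, rng):
    """Sample u inside the box [lower, upper] and return (u, log pi(u | x)).

    alphas, betas: per-dimension Beta parameters (stand-ins for network outputs).
    lower, upper:  per-dimension bounds a(x), b(x) of the hyperrectangle C(x).
    """
    u, log_prob = [], 0.0
    for al, be, lo, hi in zip(alphas, betas, lower, upper):
        u_hat = rng.betavariate(al, be)      # u_hat^i ~ Beta on [0, 1]
        u.append(lo + u_hat * (hi - lo))     # u^i = a^i + u_hat^i (b^i - a^i)
        # Affine change of variables: the density is divided by (b^i - a^i).
        log_prob += beta_log_pdf(u_hat, al, be) - math.log(hi - lo)
    return u, log_prob


rng = random.Random(0)
# Hypothetical state-dependent bounds and Beta parameters for a 2-D action space.
a_x, b_x = [-0.5, 0.1], [0.3, 0.4]
u, log_prob = sample_cbf_beta([2.0, 3.0], [2.0, 1.5], a_x, b_x, rng)
print(u, log_prob)
```

Because every sample lies in $[a(x),b(x)]$ by construction, no projection or clipping step is needed after sampling.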

4.2 Implementation

We now describe the implementation details of our Beta policies. For a given state $x$, the parameter vectors $\alpha(x),\beta(x)$ are output by a two-layer, fully connected neural network. Control inputs at state $x$ are obtained by first creating an independent PyTorch (Paszke et al., 2019) Beta distribution object with parameters $\alpha^{i}_{\theta}(x),\beta^{i}_{\theta}(x)$, for each dimension $i\in\{1,\ldots,n\}$ of the action space, then sampling $u=[u^{1}\ldots u^{n}]^{T}$ from these distributions, and finally scaling and translating the sample to lie within the current CBF set $C(x)$. Similarly, the Gaussian policies we use for comparison take their distribution parameters from a two-layer, fully connected neural network. Control inputs are then selected by sampling from the corresponding distribution and following the standard practice (Chou et al., 2017) of clipping to a fixed set of permissible controls. The PPO implementation used in the experiments was adapted with minor modifications from Stable Baselines 3 (Raffin et al., 2021).

4.3 Case study 1: Quadcopter Navigation

Experiment Setup. For this experiment, we consider the problem of learning to safely navigate a quadcopter around an obstacle to a goal location. In this section, we present an overview of the dynamical model that we use for this quadcopter, which was previously considered in Xu and Sreenath (2018), and describe our derivation of a hyperrectangular inner approximation of the safe control set, satisfying the CBF condition, that is amenable to sampling using our Beta policies. Finally, we briefly describe the reward function. See the supplementary material for a detailed exposition of the environment and sampling procedure.

We denote quadcopter and obstacle position by $r=(r_{x},r_{y},r_{z})$ and $r_{obs}=(r_{o_{x}},r_{o_{y}},r_{o_{z}})$, respectively, and the quadcopter’s relative position with respect to the obstacle as $\Delta r=r-r_{obs}$. The quadrotor dynamics are then given by

$$\dot{x}=Ax+Bu,\quad x=\begin{bmatrix}r\\ \dot{r}\end{bmatrix},\quad A=\begin{bmatrix}\mathbf{0}_{3\times 3}&I_{3}\\ \mathbf{0}_{3\times 3}&\mathbf{0}_{3\times 3}\end{bmatrix},\quad B=\begin{bmatrix}\mathbf{0}_{3\times 3}\\ \mathbf{1}_{3\times 3}\end{bmatrix},$$

with input $u$ consisting of the desired accelerations in each of the $x$, $y$, and $z$ dimensions. For the obstacle avoidance problem, we characterize the safe set as $\mathcal{S}=\{r:h(r)\geq 0\}$, where

$$h(r)=\left(\Delta r_{x}/a\right)^{4}+\left(\Delta r_{y}/b\right)^{4}+\left(\Delta r_{z}/c\right)^{4}-r_{s}, \tag{14}$$

$a,b,c>0$ parameterize the obstacle’s shape, which is assumed to be elliptical, and $r_{s}$ represents the desired safety margin. Given the quadcopter dynamics and (14), the first time derivative $\dot{h}(r)$ does not explicitly contain the control input $u$. We therefore use the standard ECBF formulation (Ames et al., 2019) to develop our safety condition using $\ddot{h}(r)$, which explicitly contains $u$. This ECBF condition is expressed as $\ddot{h}+K\cdot[h\quad\dot{h}]^{T}\geq 0$, where $K=[K_{1}\quad K_{2}]^{T}$, $K_{1}=6$, $K_{2}=8$, are application-specific design parameters, and can be rewritten as $A_{r}u\leq b_{r}$, where $A_{r}$ is a matrix and $b_{r}$ is a vector, both depending on $r$.
Then, the state-dependent safe control set is given by $C(r)=\{u\ :\ A_{r}u\leq b_{r}\}$ (see the supplementary for details). The dynamics (and consequently the ECBF conditions) are discretized with time step $dt=0.1$.
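To make the form $A_{r}u\leq b_{r}$ concrete, the following sketch derives the constraint for a single obstacle under several simplifying assumptions that are ours rather than the supplementary material's: the input enters as $\ddot{r}=u$ (a double integrator), the obstacle is static, and the condition is evaluated in continuous time. Differentiating (14) twice gives $\ddot{h}=\sum_{i}12\,\Delta r_{i}^{2}\dot{r}_{i}^{2}/s_{i}^{4}+\sum_{i}(4\,\Delta r_{i}^{3}/s_{i}^{4})\,u_{i}$, where $s_{i}$ denotes the shape parameter on axis $i$, so the ECBF condition is affine in $u$. The function names are hypothetical, and the forward-Euler step is one possible discretization with $dt=0.1$, not necessarily the one used in our experiments.

```python
K1, K2 = 6.0, 8.0  # ECBF gains quoted in the text


def ecbf_halfspace(r, v, r_obs, shape, r_s):
    """Return (A_r, b_r) such that the ECBF condition reads A_r . u <= b_r.

    Assumes r_ddot = u and a static obstacle (illustrative assumptions).
    r, v:   position and velocity, one entry per axis
    shape:  per-axis obstacle shape parameters
    """
    h = -r_s
    h_dot = 0.0
    drift = 0.0   # part of h_ddot that does not involve u
    grad = []     # coefficient of u in h_ddot, i.e. dh/dr
    for ri, vi, oi, si in zip(r, v, r_obs, shape):
        d = ri - oi
        h += (d / si) ** 4
        g = 4.0 * d ** 3 / si ** 4
        h_dot += g * vi
        drift += 12.0 * d ** 2 * vi ** 2 / si ** 4
        grad.append(g)
    # h_ddot + K1*h + K2*h_dot >= 0  <=>  -grad . u <= drift + K1*h + K2*h_dot
    A_r = [-g for g in grad]
    b_r = drift + K1 * h + K2 * h_dot
    return A_r, b_r


def euler_step(r, v, u, dt=0.1):
    """One forward-Euler step of r_dot = v, v_dot = u."""
    r_next = [ri + dt * vi for ri, vi in zip(r, v)]
    v_next = [vi + dt * ui for vi, ui in zip(v, u)]
    return r_next, v_next


# Hypothetical state: quadcopter at (2, 0) flying toward an obstacle at the origin.
A_r, b_r = ecbf_halfspace(r=[2.0, 0.0], v=[-1.0, 0.0],
                          r_obs=[0.0, 0.0], shape=[1.0, 1.0], r_s=1.0)
u_safe = [4.0, 0.0]  # decelerating input chosen by hand for illustration
is_safe = sum(a * u for a, u in zip(A_r, u_safe)) <= b_r
r_next, v_next = euler_step([2.0, 0.0], [-1.0, 0.0], u_safe)
print(A_r, b_r, is_safe, r_next, v_next)
```

With a single obstacle, $A_{r}$ reduces to one row; additional constraints would contribute further rows.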

We consider navigation in the $x,y$ dimensions as in Xu and Sreenath (2018), resulting in a two-dimensional action space. We take the actuator constraint to be defined by the hyperrectangle $H:=\{u_{min},u_{max}\}$, where $u_{min}:=(u_{min}^{x},u_{min}^{y})\in\mathbb{R}^{2}$ and $u_{max}:=(u_{max}^{x},u_{max}^{y})\in\mathbb{R}^{2}$ are the minimum and maximum input values. In order to sample from the safe control set $C(r)$ at a given $r$ with our Beta policies, we need a hyperrectangular inner approximation.
We obtain this inner approximation by formulating and solving a convex optimization problem yielding the largest-volume hyperrectangle, $H_{c}(r)$, contained within $C(r)$.
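The idea of a hyperrectangular inner approximation can be illustrated without a convex solver. The sketch below is not the optimization problem solved in our experiments: it swaps in a brute-force grid search over candidate boxes inside the actuator bounds, combined with an exact containment test, and is only practical in low dimensions. The containment test relies on the fact that a box $[l,u]$ lies in $\{v: Av\leq b\}$ exactly when, for every row, the worst-case corner (taking $u_{j}$ where $A_{ij}>0$ and $l_{j}$ otherwise) satisfies the inequality. All function names and numerical values are hypothetical.

```python
import itertools


def box_inside_polytope(lower, upper, A, b, tol=1e-9):
    """Exact test that the box [lower, upper] lies in {u : A u <= b}."""
    for row, bi in zip(A, b):
        # Worst-case corner for this row: largest possible value of row . u.
        worst = sum(max(c * lo, c * hi) for c, lo, hi in zip(row, lower, upper))
        if worst > bi + tol:
            return False
    return True


def largest_box_grid(A, b, u_min, u_max, steps=20):
    """Grid search for the largest-volume box inside {A u <= b} and [u_min, u_max]."""
    grids = [[lo + k * (hi - lo) / steps for k in range(steps + 1)]
             for lo, hi in zip(u_min, u_max)]
    # Candidate (lower, upper) pairs on each axis, with lower < upper.
    per_axis = [[(g[i], g[j]) for i in range(len(g)) for j in range(i + 1, len(g))]
                for g in grids]
    best, best_vol = None, 0.0
    for combo in itertools.product(*per_axis):
        vol = 1.0
        for lo, hi in combo:
            vol *= hi - lo
        if vol > best_vol:
            lower = [p[0] for p in combo]
            upper = [p[1] for p in combo]
            if box_inside_polytope(lower, upper, A, b):
                best, best_vol = (lower, upper), vol
    return best, best_vol


# Hypothetical safe set u_x + u_y <= 1 with actuator bounds [-1, 1]^2.
A_r, b_r = [[1.0, 1.0]], [1.0]
(box_lo, box_hi), volume = largest_box_grid(A_r, b_r, [-1.0, -1.0], [1.0, 1.0])
print(box_lo, box_hi, volume)
```

The same containment constraints are linear in $(l,u)$, and the logarithm of the box volume is concave in them, which is what makes a convex formulation of this problem possible.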

Finally, we designed a reward providing an $\ell_{2}$ penalty based on agent distance from the goal, as well as a sizeable bonus for reaching the goal and a significant penalty for approaching the edge of the map. See the supplementary for details.

Results. The experiments we conducted illustrate that safety-filter based approaches like those considered in Cheng et al. (2019) fail on simple cases of our safety-constrained quadcopter problem (see Figure 2), while our CBF-constrained Beta policy succeeds (Figure 1). For illustration purposes, Figures 1 and 2 present trajectories generated over the course of training. The corresponding learning curves are included in the supplementary material. As illustrated in Figure 2, the safety-filter approach is effective at ensuring safety and also learns to successfully reach the goal when the obstacle is nonexistent or distant. However, it ultimately fails to reach the goal when the obstacle lies directly between the start and goal positions. We hypothesize that this is because the projection-based approach attempts to learn an optimal policy for the unconstrained navigation problem, while projection causes it to deviate from its learned policy to maintain safety. Furthermore, the resulting safety-filtered policy cannot recover from these projections without an additional control layer (such as a derivative or PID controller) due to the repeated perturbation from the projection procedure. Our method, on the other hand, learns to successfully solve the problem as shown in Figure 1 while maintaining safety throughout training, since we directly learn policies for the CBF-constrained problem.
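The projection step at the heart of the safety-filter baseline can be sketched as follows. This is a simplified illustration rather than the benchmark implementation: it assumes a single affine constraint $a^{T}u\leq b$ and ignores actuator bounds, in which case the minimum-norm correction has a closed form and no quadratic program is needed. The function name and numerical values are hypothetical.

```python
def project_onto_halfspace(u, a, b):
    """Euclidean projection of u onto {v : a . v <= b} (single constraint)."""
    violation = sum(ai * ui for ai, ui in zip(a, u)) - b
    if violation <= 0.0:
        return list(u)  # already safe: the learned action is left unchanged
    scale = violation / sum(ai * ai for ai in a)
    # Move u along -a by just enough to reach the constraint boundary.
    return [ui - scale * ai for ui, ai in zip(u, a)]


# Hypothetical constraint u_x <= 0.5 and two candidate actions from a learned policy.
a, b = [1.0, 0.0], 0.5
print(project_onto_halfspace([2.0, 1.0], a, b))  # unsafe action is moved to the boundary
print(project_onto_halfspace([0.0, 1.0], a, b))  # safe action passes through unchanged
```

The executed action differs from the sampled one whenever the constraint is active, which is the mismatch between the learned and deployed policies discussed above.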

4.4 Case study 2: Inverted pendulum

Experiment Setup. For the second set of experiments, we considered a safety-constrained inverted pendulum environment building on the baseline Gym implementation (Brockman et al., 2016). The goal in this environment is to swing an inverted pendulum upright while maintaining it within a fixed safe set. We tested PPO with the two different policies on this environment for two different safe sets: $\mathcal{S}_{0.5}=\{\theta\ |\ -0.5\leq\theta\leq 0.5\}$ and $\mathcal{S}_{1.0}=\{\theta\ |\ -1.0\leq\theta\leq 1.0\}$. Due to space limitations, we include the experiments with $\mathcal{S}_{0.5}$ in Figure 3 and the experiments with $\mathcal{S}_{1.0}$ in the supplementary material. As a baseline, we compare the proposed method to PPO with unconstrained Gaussian policies. This comparison highlights the effectiveness of the proposed method in guaranteeing safety as well as accelerating learning.

Results.

Figure 3: CBF-constrained Beta vs. unconstrained Gaussian on inverted pendulum environment with safe set $\mathcal{S}_{0.5}=\{\theta\ |\ -0.5\leq\theta\leq 0.5\}$. “Safety Rate” denotes percentage of time spent in safe set. Curves present mean and 95% confidence intervals over 5 replications.

Our experiments are summarized in Figure 3. There are two main points to be drawn from these results. First, the top panel shows that incorporating prior knowledge about properties such as safety can encourage learning and accelerate convergence by forcing the Beta policy agent to concentrate on higher-value subsets of the state space. The Gaussian agent, on the other hand, is unable to benefit from this prior knowledge and convergence suffers as it spends a greater portion of its time exploring lower-value regions of the state space. Second, the bottom panel illustrates that Beta policies are highly effective at maintaining safety throughout training, while Gaussian policies without safety constraints naturally fail to remain inside the safe set. This is expected, but illustrates the need to use constraint-aware policies such as Beta policies when prior knowledge is available.
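For concreteness, the safety rate metric can be computed from a logged trajectory of pendulum angles as in the short sketch below. The function name and the example trajectory are hypothetical and serve only to illustrate the definition as the percentage of timesteps spent in the safe set.

```python
def safety_rate(thetas, bound):
    """Percentage of timesteps with -bound <= theta <= bound."""
    if not thetas:
        raise ValueError("empty trajectory")
    safe = sum(1 for th in thetas if -bound <= th <= bound)
    return 100.0 * safe / len(thetas)


# Hypothetical angle trajectory evaluated against the safe set S_0.5.
trajectory = [0.1, -0.4, 0.6, 0.5]
print(safety_rate(trajectory, bound=0.5))
```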

5 CONCLUSION

We have developed a sampling-based approach to learning policies ensuring hard constraint satisfaction in RL. Unlike existing, projection-based methods that ensure safety but lack convergence guarantees, our scheme provably does both. In addition to our theoretical contributions, we have also presented a practical solution method that leverages CBF-constrained Beta policies to ensure safety, and experimentally demonstrated its effectiveness on safe quadcopter navigation and inverted pendulum environments. Interesting directions for future work include extensions to the case where the constraint set must be estimated and application of our CBF-constrained Beta policies to real-world robotics problems.

Acknowledgements

The authors would like to thank the anonymous reviewers for their helpful comments and Mostafa Mohamed Fa Abdelnaby of Purdue University for pointing out Remark 1. The work of W. A. Suttle was supported by a Distinguished Postdoctoral Fellowship with the U.S. Army Research Laboratory. V. K. Sharma was partially funded by a grant from the Purdue Engineering Initiative on Autonomous and Connected Systems. The work of J. Liu was supported in part by the National Science Foundation (NSF) under grant 2230101, by the Air Force Office of Scientific Research (AFOSR) under award number FA9550-23-1-0175, and by U.S. Air Force Task Order FA8650-23-F-2603. The work of K. C. Kosaraju and V. Gupta was partially supported by Army Research Office grants W911NF2310111, W911NF2310266, and W911NF-23-1-0316, AFOSR grant F.10052139.02.005, Office of Naval Research grants F.10052139.02.009 and F.10052139.02.012, and NSF grant 2300355.

References

  • Abbeel and Ng (2004) Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine learning, 2004.
  • Achiam et al. (2017) Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pages 22–31. PMLR, 2017.
  • Agarwal et al. (2020) Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in markov decision processes. In Conference on Learning Theory, pages 64–66. PMLR, 2020.
  • Agrawal and Sreenath (2017) Ayush Agrawal and Koushil Sreenath. Discrete control barrier functions for safety-critical control of discrete systems with application to bipedal robot navigation. In Robotics: Science and Systems, volume 13, pages 1–10, 2017.
  • Altman (2021) Eitan Altman. Constrained Markov Decision Processes. Routledge, 2021.
  • Ames et al. (2016) Aaron D Ames, Xiangru Xu, Jessy W Grizzle, and Paulo Tabuada. Control barrier function based quadratic programs for safety critical systems. IEEE Transactions on Automatic Control, 62(8):3861–3876, 2016.
  • Ames et al. (2019) Aaron D Ames, Samuel Coogan, Magnus Egerstedt, Gennaro Notomista, Koushil Sreenath, and Paulo Tabuada. Control barrier functions: Theory and applications. In 2019 18th European Control Conference, pages 3420–3431, 2019.
  • Aswani et al. (2013) Anil Aswani, Humberto Gonzalez, S Shankar Sastry, and Claire Tomlin. Provably safe and robust learning-based model predictive control. Automatica, 49(5):1216–1226, 2013.
  • Bai et al. (2022) Qinbo Bai, Amrit Singh Bedi, Mridul Agarwal, Alec Koppel, and Vaneet Aggarwal. Achieving zero constraint violation for constrained reinforcement learning via primal-dual approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3682–3689, 2022.
  • Berkenkamp et al. (2017) Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pages 908–918, 2017.
  • Bhatnagar et al. (2009) Shalabh Bhatnagar, Richard Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor-critic algorithms. Automatica, 45(11):2471–2482, 2009.
  • Borkar (2005) Vivek S Borkar. An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 54(3):207–213, 2005.
  • Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  • Brunke et al. (2022) Lukas Brunke, Melissa Greeff, Adam W Hall, Zhaocong Yuan, Siqi Zhou, Jacopo Panerati, and Angela P Schoellig. Safe learning in robotics: From learning-based control to safe reinforcement learning. Annual Review of Control, Robotics, and Autonomous Systems, 5:411–444, 2022.
  • Cheng et al. (2019) Richard Cheng, Gábor Orosz, Richard M Murray, and Joel W Burdick. End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3387–3395, 2019.
  • Chou et al. (2017) Po-Wei Chou, Daniel Maturana, and Sebastian Scherer. Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution. In International Conference on Machine Learning, pages 834–843. PMLR, 2017.
  • Diamond and Boyd (2016) Steven Diamond and Stephen Boyd. CVXPY: A python-embedded modeling language for convex optimization. The Journal of Machine Learning Research, 17(1):2909–2913, 2016.
  • Fazel et al. (2018) Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1467–1476. PMLR, 2018.
  • Fisac et al. (2018) Jaime F Fisac, Anayo K Akametalu, Melanie N Zeilinger, Shahab Kaynama, Jeremy Gillula, and Claire J Tomlin. A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control, 64(7):2737–2752, 2018.
  • Folland (1999) Gerald B Folland. Real Analysis: Modern Techniques and Their Applications, volume 40. John Wiley & Sons, 1999.
  • Garcıa and Fernández (2015) Javier Garcıa and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
  • Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.
  • Ho et al. (2020) Cherie* Ho, Katherine* Shih, Jaskaran Grover, Changliu Liu, and Sebastian Scherer. “Provably safe” in the wild: testing control barrier functions on a vision-based quadrotor in an outdoor environment. In RSS 2020 Workshop in Robust Autonomy, 2020. URL https://openreview.net/pdf?id=CrBJIgBr2BK.
  • Konda (2002) V. Konda. Actor-Critic Algorithms. PhD thesis, Massachusetts Institute of Technology, 2002.
  • Kosaraju et al. (2021) Krishna Chaitanya Kosaraju, Seetharaman Sivaranjani, Wesley Suttle, Vijay Gupta, and Ji Liu. Reinforcement learning based distributed control of dissipative networked systems. IEEE Transactions on Control of Network Systems, 9(2):856–866, 2021.
  • Li et al. (2018) Zhaojian Li, Uroš Kalabić, and Tianshu Chu. Safe reinforcement learning: Learning with supervision using a constraint-admissible set. In 2018 Annual American Control Conference, pages 6390–6395, 2018.
  • Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Ma et al. (2021) Haitong Ma, Jianyu Chen, Shengbo Eben, Ziyu Lin, Yang Guan, Yangang Ren, and Sifa Zheng. Model-based constrained reinforcement learning using generalized control barrier function. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4552–4559, 2021.
  • Mellinger et al. (2012) Daniel Mellinger, Nathan Michael, and Vijay Kumar. Trajectory generation and control for precise aggressive maneuvers with quadrotors. The International Journal of Robotics Research, 31(5):664–674, 2012.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
  • Paternain et al. (2019) Santiago Paternain, Luiz Chamon, Miguel Calvo-Fullana, and Alejandro Ribeiro. Constrained reinforcement learning has zero duality gap. Advances in Neural Information Processing Systems, 32, 2019.
  • Raffin et al. (2021) Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-baselines3: Reliable reinforcement learning implementations. The Journal of Machine Learning Research, 22(1):12348–12355, 2021.
  • Rasmussen (2003) Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2003.
  • Schreiter et al. (2015) Jens Schreiter, Duy Nguyen-Tuong, Mona Eberts, Bastian Bischoff, Heiner Markert, and Marc Toussaint. Safe exploration for active learning with gaussian processes. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 133–149. Springer, 2015.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Sui et al. (2015) Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization with gaussian processes. In International Conference on Machine Learning, pages 997–1005. PMLR, 2015.
  • Suttle et al. (2023) Wesley A Suttle, Amrit Bedi, Bhrij Patel, Brian M Sadler, Alec Koppel, and Dinesh Manocha. Beyond exponentially fast mixing in average-reward reinforcement learning via multi-level Monte Carlo actor-critic. In International Conference on Machine Learning, pages 33240–33267. PMLR, 2023.
  • Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
  • Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
  • Virtanen et al. (2020) Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. Scipy 1.0: fundamental algorithms for scientific computing in python. Nature Methods, 17(3):261–272, 2020.
  • Wabersich et al. (2023) Kim P Wabersich, Andrew J Taylor, Jason J Choi, Koushil Sreenath, Claire J Tomlin, Aaron D Ames, and Melanie N Zeilinger. Data-driven safety filters: Hamilton-Jacobi reachability, control barrier functions, and predictive methods for uncertain systems. IEEE Control Systems Magazine, 43(5):137–177, 2023.
  • Wachi et al. (2018) Akifumi Wachi, Yanan Sui, Yisong Yue, and Masahiro Ono. Safe exploration and optimization of constrained mdps using gaussian processes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • Wiesemann et al. (2013) Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013.
  • Xu and Sreenath (2018) Bin Xu and Koushil Sreenath. Safe teleoperation of dynamic uavs through control barrier functions. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7848–7855, 2018.
  • Zhang et al. (2019) Kaiqing Zhang, Alec Koppel, Hao Zhu, and Tamer Başar. Convergence and iteration complexity of policy gradient method for infinite-horizon reinforcement learning. In 2019 58th Conference on Decision and Control, pages 7415–7422, 2019.
  • Zhang et al. (2020) Kaiqing Zhang, Alec Koppel, Hao Zhu, and Tamer Başar. Global convergence of policy gradient methods to (almost) locally optimal policies. SIAM Journal on Control and Optimization, 58(6):3586–3612, 2020.
  • Zhang et al. (2021) Kaiqing Zhang, Bin Hu, and Tamer Başar. Policy optimization for $\mathcal{H}_{2}$ linear control with $\mathcal{H}_{\infty}$ robustness guarantee: Implicit regularization and global convergence. SIAM Journal on Control and Optimization, 59(6):4081–4109, 2021.

 

Supplementary Material


 


Appendix A Proofs

Proof of Proposition 1. We construct a sequence of $\varepsilon$-balls, each reachable from the previous element of the sequence, that leads from $x_0$ to $\mathcal{B}$, then show that the head of the Markov chain lies inside this sequence with positive probability. Fix $\varepsilon>0$ and let $\{y_0,y_1,\ldots,y_N\}\subset\mathcal{S}$ be such that $B_\varepsilon(y_{k+1})\subset R(B_\varepsilon(y_k))$, for $k=1,\ldots,N-1$, and $B_\varepsilon(y_N)\cap\mathcal{B}\neq\emptyset$ (see Figure 1 for an illustration). For a given $\theta$, let $\{x_n\}$ be the Markov chain induced on $\mathcal{S}$ by $\pi^C_\theta$ such that $x_0=y_0$. We show that the trajectory $(x_0,x_1,\ldots,x_N)$ is contained within the set $\{y_0\}\times B_\varepsilon(y_1)\times\ldots\times B_\varepsilon(y_N)$ with strictly positive probability, which will imply that $\{x_n\}$ enters $\mathcal{B}$ with strictly positive probability. For each $k=1,\ldots,N$, consider the probability measure $\nu_k$ defined as

\[
\nu_k(S)=P(x\in S\ |\ x_{k-1})=\int_{\mathcal{T}^{-1}_{x_{k-1}}(S)}\pi^C_\theta(a|x_{k-1})\ da,
\]

for any $\mu$-measurable subset $S$ of $\mathcal{S}$. Note that $\nu_k$ is absolutely continuous with respect to $\mu$, written $\nu_k\ll\mu$, since $\mu(S)>0$ if and only if $\mathcal{T}^{-1}_{x_{k-1}}(S)>0$, by Assumption 2. The Lebesgue-Radon-Nikodym Theorem implies that there exists a $\mu$-integrable function $f_k:\mathcal{S}\rightarrow\mathbb{R}$, called the Radon-Nikodym derivative of $\nu_k$, such that $\nu_k(S)=\int_S f_k(x)\,dx$ (see Folland (1999) for details). To make the link between $f_k$ and $\nu_k$ perfectly clear, let us write

\begin{align*}
f_k(x)&=\int_{\mathcal{T}^{-1}_{x_{k-1}}(x)}\pi^C_\theta(a|x_{k-1})\ da,\\
\nu_k(S)&=\int_S\int_{\mathcal{T}^{-1}_{x_{k-1}}(x)}\pi^C_\theta(a|x_{k-1})\ da\ dx.
\end{align*}

By Assumptions 2 and 3, we also have $\mu\ll\nu_k$. Since both $\nu_k\ll\mu$ and $\mu\ll\nu_k$, the two measures are said to be equivalent, meaning that they agree on which sets have measure zero. Since $\mu$ and $\nu_k$ are equivalent, a standard result from real analysis allows us to take the Radon-Nikodym derivative $f_k$ to be strictly positive $\mu$-almost everywhere. As a first consequence, notice that

\begin{align*}
&P\Big((x_0,x_1,x_2)\in\{y_0\}\times B_\varepsilon(y_1)\times B_\varepsilon(y_2)\ |\ x_0=y_0\Big)\\
&=P\Big((x_1,x_2)\in B_\varepsilon(y_1)\times B_\varepsilon(y_2)\ |\ x_0=y_0\Big)\cdot P(x_0=y_0)\\
&=P\Big((x_1,x_2)\in B_\varepsilon(y_1)\times B_\varepsilon(y_2)\ |\ x_0=y_0\Big)\\
&=\int_{B_\varepsilon(y_1)}\int_{\mathcal{T}^{-1}_{x_1}(B_\varepsilon(y_2))}\pi^C_\theta(a_1|x_1)f_1(x_1)\ da_1\ dx_1 \tag{15}\\
&=\int_{B_\varepsilon(y_1)}\int_{\mathcal{T}^{-1}_{x_1}(B_\varepsilon(y_2))}\pi^C_\theta(a_1|x_1)\left[\int_{\mathcal{T}^{-1}_{x_0}(x_1)}\pi^C_\theta(a_0|x_0)\ da_0\right]\ da_1\ dx_1.
\end{align*}

Given Assumption 3, equation (15) is strictly positive, since $f_1$ is strictly positive almost everywhere and the integrals are taken over sets of positive volume.

Building on equation (15), we have

\begin{align*}
&P\Big((x_1,\ldots,x_{N-1},x_N)\in \tag{16}\\
&\qquad B_\varepsilon(y_1)\times\ldots\times B_\varepsilon(y_{N-1})\times(B_\varepsilon(y_N)\cap\mathcal{B})\ |\ x_0=y_0\Big) \tag{17}\\
&=\int_{B_\varepsilon(y_1)}\int_{\mathcal{T}^{-1}_{x_1}(B_\varepsilon(y_2))}\pi^C_\theta(a_1|x_1)\cdot\int_{B_\varepsilon(y_2)}\int_{\mathcal{T}^{-1}_{x_2}(B_\varepsilon(y_3))}\pi^C_\theta(a_2|x_2)\cdot\ldots\\
&\qquad\ldots\cdot\int_{B_\varepsilon(y_{N-1})}\int_{\mathcal{T}^{-1}_{x_{N-1}}(B_\varepsilon(y_N)\cap\mathcal{B})}\pi^C_\theta(a_{N-1}|x_{N-1})\ f_{N-1}(x_{N-1})\ da_{N-1}\ dx_{N-1}\cdot\ldots\\
&\qquad\ldots\cdot f_2(x_2)\ da_2\ dx_2\cdot f_1(x_1)\ da_1\ dx_1.
\end{align*}

Note in the innermost integral that $\mu(B_\varepsilon(y_N)\cap\mathcal{B})>0$, since both sets are open and their intersection is non-empty by hypothesis. Finally, given Assumption 3, we have that (17) is strictly positive, since all integrals are taken over sets of positive volume and $f_i$ is strictly positive almost everywhere, for each $i\in\{1,\ldots,N-1\}$. $\square$
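The argument above can be illustrated numerically. The following is a minimal sketch on a hypothetical one-dimensional system $x_{k+1}=x_k+a_k$, which is not taken from the paper: the safe action set is assumed to be $C(x)=[-1,1]$ for every $x$, the base policy is a standard Gaussian, and the truncated policy is sampled by simple rejection. The centers $y_k=0.5k$, the radius $\varepsilon=0.3$, and the target interval standing in for $\mathcal{B}$ are all illustrative choices. The function estimates the probability that the chain passes through every $\varepsilon$-ball and ends in $B_\varepsilon(y_N)\cap\mathcal{B}$.

```python
import random


def sample_truncated_action(rng, mean, std, low, high):
    """Draw from N(mean, std^2) restricted to [low, high] by rejection."""
    while True:
        action = rng.gauss(mean, std)
        if low <= action <= high:
            return action


def estimate_hit_probability(num_rollouts=20000, seed=0):
    """Monte Carlo estimate of the probability that the chain stays inside
    the sequence of eps-balls and lands in the target set (toy example)."""
    rng = random.Random(seed)
    centers = [0.5 * k for k in range(5)]  # y_0, ..., y_4 (illustrative)
    eps = 0.3                              # ball radius (illustrative)
    target = (1.9, 2.5)                    # open interval standing in for B
    hits = 0
    for _ in range(num_rollouts):
        x = centers[0]                     # x_0 = y_0
        inside = True
        for k in range(1, len(centers)):
            # Toy dynamics x_{k+1} = x_k + a_k with a_k from the truncated policy.
            x = x + sample_truncated_action(rng, 0.0, 1.0, -1.0, 1.0)
            if abs(x - centers[k]) >= eps:  # left the ball B_eps(y_k)
                inside = False
                break
        if inside and target[0] < x < target[1]:
            hits += 1
    return hits / num_rollouts


if __name__ == "__main__":
    print(estimate_hit_probability())
```

A strictly positive estimate is consistent with the proposition but does not prove it; the sketch only shows what the event in (17) looks like on a concrete chain.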

Proof of Lemma 2. Recalling the definition of $\pi^C_\theta(u|x)$ in (2),

\begin{align*}
\nabla\log\pi^C_\theta(u|x)&=\nabla\log\pi_\theta(u|x)-\nabla\log\int_{C(x)}\pi_\theta(w|x)\,dw\\
&=\nabla\log\pi_\theta(u|x)-\frac{\int_{C(x)}\nabla\pi_\theta(w|x)\,dw}{\int_{C(x)}\pi_\theta(w|x)\,dw}. \tag{18}
\end{align*}

To see that, for all $x\in\mathcal{X},u\in\mathcal{U}$, $\nabla\log\pi^C_\theta(u|x)$ exists, for all $\theta\in\Theta$, we simply need to verify that $(\int_{C(x)}\pi_\theta(w|x)\,dw)^{-1}$ is always finite. But this follows immediately from Assumption 3 and the fact that $\mu(C(x))\geq m>0$ by Assumption 1.
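As a sanity check on (18), the decomposition can be evaluated in closed form for a simple case. The sketch below assumes a scalar Gaussian policy with mean $\theta$ and fixed standard deviation, and a safe set $C(x)=[a,b]$ that does not depend on $x$; these choices, and all function names, are illustrative assumptions rather than the paper's setup. It compares the score obtained from (18) with a central finite difference of the truncated log-density.

```python
import math


def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def log_truncated_density(u, theta, sigma, a, b):
    """log pi^C_theta(u) for a Gaussian N(theta, sigma^2) truncated to [a, b]."""
    normalizer = std_normal_cdf((b - theta) / sigma) - std_normal_cdf((a - theta) / sigma)
    log_pi = -0.5 * ((u - theta) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))
    return log_pi - math.log(normalizer)


def truncated_score(u, theta, sigma, a, b):
    """Score from (18): grad log pi minus (integral of grad pi) / (integral of pi)."""
    alpha, beta = (a - theta) / sigma, (b - theta) / sigma
    normalizer = std_normal_cdf(beta) - std_normal_cdf(alpha)
    # Derivative of the normalizer with respect to theta, which equals the
    # integral of grad pi over [a, b] for this Gaussian family.
    grad_normalizer = (std_normal_pdf(alpha) - std_normal_pdf(beta)) / sigma
    return (u - theta) / sigma ** 2 - grad_normalizer / normalizer


def finite_difference_score(u, theta, sigma, a, b, step=1e-5):
    """Central difference of the truncated log-density with respect to theta."""
    upper = log_truncated_density(u, theta + step, sigma, a, b)
    lower = log_truncated_density(u, theta - step, sigma, a, b)
    return (upper - lower) / (2.0 * step)


if __name__ == "__main__":
    args = (0.3, 0.1, 0.5, -1.0, 1.0)  # u, theta, sigma, a, b (illustrative)
    print(truncated_score(*args), finite_difference_score(*args))
```

The two printed values should agree to several decimal places, since the normalizer is bounded away from zero on this example and the second term in (18) is therefore well defined.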

We next prove part 1) of the Lemma. The claim holds for the first term in (18) by part 2a) of Assumption 6, so we just need to show that it holds for the second term. To do this, we prove that, for a given $x\in\mathcal{X}$, this term is Lipschitz in $\theta$, then argue that the largest minimal Lipschitz constant over all $x\in\mathcal{X}$ is finite. We know by part 2b) of Assumption 6 that, for all $x\in\mathcal{X},u\in\mathcal{U}$, $\left\|\nabla\pi_\theta(u|x)\right\|\leq\left\|\nabla\log\pi_\theta(u|x)\right\|\leq B_\Theta$, for all $\theta\in\Theta$. This means that, for all $x\in\mathcal{X}$,

\begin{align*}
&\Big|\int_{C(x)}\pi_\theta(w|x)\,dw-\int_{C(x)}\pi_{\theta'}(w|x)\,dw\Big|\\
&=\Big|\int_{C(x)}\left(\pi_\theta(w|x)-\pi_{\theta'}(w|x)\right)dw\Big|\\
&\leq\int_{C(x)}|\pi_\theta(w|x)-\pi_{\theta'}(w|x)|\,dw\\
&\leq\int_{C(x)}B_\Theta\left\|\theta-\theta'\right\|dw=B_\Theta\,\mu(C(x))\left\|\theta-\theta'\right\|\\
&\leq B_\Theta M\left\|\theta-\theta'\right\|,
\end{align*}

for all $\theta,\theta'\in\Theta$. So $\int_{C(x)}\pi_\theta(w|x)\,dw$ is Lipschitz in $\theta$, for each $x\in\mathcal{X}$, and the largest Lipschitz constant over $\mathcal{X}$ is finite. In addition, $\int_{C(x)}\pi_\theta(w|x)\,dw$ is clearly uniformly bounded.
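The Lipschitz bound $B_\Theta M\|\theta-\theta'\|$ on the normalizing integral can be checked numerically in a simple case. The sketch below again assumes a scalar Gaussian policy with mean $\theta$ and a fixed interval $C(x)=[a,b]$, so that $M=b-a$ and the gradient bound on the density is $\sup_w|\partial_\theta\pi_\theta(w)|=\varphi(1)/\sigma^2$, where $\varphi$ is the standard normal density. These modeling choices and the function names are assumptions made for illustration only.

```python
import math


def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normalizer(theta, sigma, a, b):
    """Integral of the N(theta, sigma^2) density over the interval [a, b]."""
    return std_normal_cdf((b - theta) / sigma) - std_normal_cdf((a - theta) / sigma)


def density_gradient_bound(sigma):
    """sup over w of |d/dtheta N(w; theta, sigma^2)|, attained at |w - theta| = sigma."""
    return math.exp(-0.5) / (math.sqrt(2.0 * math.pi) * sigma ** 2)


def lipschitz_bound(sigma, a, b):
    """Gradient bound times the volume of the safe set, mirroring B_Theta * M."""
    return density_gradient_bound(sigma) * (b - a)


def max_lipschitz_ratio(sigma, a, b, thetas):
    """Largest observed |Z(t1) - Z(t2)| / |t1 - t2| over all pairs in thetas."""
    worst = 0.0
    for i, t1 in enumerate(thetas):
        for t2 in thetas[i + 1:]:
            gap = abs(normalizer(t1, sigma, a, b) - normalizer(t2, sigma, a, b))
            worst = max(worst, gap / abs(t1 - t2))
    return worst


if __name__ == "__main__":
    grid = [-2.0 + 0.05 * i for i in range(81)]  # illustrative parameter grid
    print(max_lipschitz_ratio(0.5, -1.0, 1.0, grid), lipschitz_bound(0.5, -1.0, 1.0))
```

On this example the observed ratio stays below the bound, as the chain of inequalities predicts; the bound is not tight because the triangle-inequality step discards cancellation between positive and negative parts of the gradient.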

Notice that $\inf_{x\in\mathcal{X}}\mu(C(x))\geq m>0$, by Assumption 1. Thus, by Assumption 3, $\inf_{x\in\mathcal{X}}\int_{C(x)}\pi_\theta(w|x)\,dw>0$, for all $\theta\in\Theta$. Since $\Theta$ is compact, this means $\inf_{\theta\in\Theta}\inf_{x\in\mathcal{X}}\int_{C(x)}\pi_\theta(w|x)\,dw>0$. This implies that $(\int_{C(x)}\pi_\theta(w|x)\,dw)^{-1}$ is uniformly bounded. Since $\int_{C(x)}\pi_\theta(w|x)\,dw$ is Lipschitz, $(\int_{C(x)}\pi_\theta(w|x)\,dw)^{-1}$ is therefore Lipschitz and bounded in $\theta\in\Theta$, for all $x\in\mathcal{X}$. We also know that, for each $x\in\mathcal{X}$, $\int_{C(x)}\nabla\pi_\theta(w|x)\,dw$ is Lipschitz and bounded in $\theta\in\Theta$, by Assumption 6, part 2a). Fix $x\in\mathcal{X}$. Since the product of Lipschitz, bounded functions is Lipschitz and bounded, the function $\int_{C(x)}\nabla\pi_\theta(w|x)\,dw/\int_{C(x)}\pi_\theta(w|x)\,dw$ is Lipschitz and bounded in $\theta$. Since this function is uniformly bounded over $x\in\mathcal{X},\theta\in\Theta$, there exists $L>0$ such that, for all $x\in\mathcal{X}$,

\[
\left\|\frac{\int_{C(x)}\nabla\pi_\theta(w|x)\,dw}{\int_{C(x)}\pi_\theta(w|x)\,dw}-\frac{\int_{C(x)}\nabla\pi_{\theta'}(w|x)\,dw}{\int_{C(x)}\pi_{\theta'}(w|x)\,dw}\right\|\leq L\left\|\theta-\theta'\right\|,
\]

for all $\theta,\theta'\in\Theta$. Combined with part 2a) of Assumption 6, this implies that, for all $x\in\mathcal{X},u\in\mathcal{U}$, $\left\|\nabla\log\pi^C_\theta(u|x)-\nabla\log\pi^C_{\theta'}(u|x)\right\|\leq(L_\Theta+L)\left\|\theta-\theta'\right\|$, for all $\theta,\theta'\in\Theta$. This completes the proof of part 1).
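The two elementary facts used in this part of the proof, that the reciprocal of a Lipschitz function bounded away from zero is Lipschitz and that a product of bounded Lipschitz functions is Lipschitz, can be written out explicitly. In the sketch below, $Z(\theta)$ abbreviates $\int_{C(x)}\pi_\theta(w|x)\,dw$ and $g(\theta)$ abbreviates $\int_{C(x)}\nabla\pi_\theta(w|x)\,dw$; the constants $c$, $L_Z$, $L_g$, and $B_g$ are hypothetical names introduced here, standing for a lower bound on $Z$, the Lipschitz constants of $Z$ and $g$, and a bound on $\|g\|$, all assumed uniform in $x$.

```latex
\begin{align*}
\left|\frac{1}{Z(\theta)}-\frac{1}{Z(\theta')}\right|
  &=\frac{|Z(\theta')-Z(\theta)|}{Z(\theta)\,Z(\theta')}
  \leq\frac{L_Z}{c^{2}}\,\|\theta-\theta'\|,\\
\left\|\frac{g(\theta)}{Z(\theta)}-\frac{g(\theta')}{Z(\theta')}\right\|
  &\leq\frac{1}{Z(\theta)}\,\|g(\theta)-g(\theta')\|
    +\|g(\theta')\|\left|\frac{1}{Z(\theta)}-\frac{1}{Z(\theta')}\right|\\
  &\leq\left(\frac{L_g}{c}+\frac{B_g L_Z}{c^{2}}\right)\|\theta-\theta'\|.
\end{align*}
```

Under these assumptions, one admissible choice of the constant in the proof is $L=L_g/c+B_gL_Z/c^{2}$.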

Part 2) follows from the fact that $\int_{C(x)}\nabla\pi_{\theta}(w|x)\,dw/\int_{C(x)}\pi_{\theta}(w|x)\,dw$ is uniformly bounded and that, for all $x\in\mathcal{X},u\in\mathcal{U}$, $\left\|\nabla\log\pi_{\theta}(u|x)\right\|\leq B_{\Theta}$, for all $\theta\in\Theta$, by part 2) of Assumption 6.

Appendix B Background: Barrier Functions

In this section, we provide an overview of barrier functions for convenience. The theory of barrier functions revolves around controlled set invariance for dynamical systems. Safety can be represented through a set $\mathcal{S}$, defined as a super-level set of a barrier function $h$. We then state the condition on this barrier function that guarantees forward invariance of the safe set under the given dynamics.

B.1 Control Barrier Functions (CBF)

Consider the following nonlinear system

$$\dot{r}=f(r,u), \tag{19}$$

where $r\in D\subset\mathbb{R}^{n}$ and $u\in U\subset\mathbb{R}^{m}$ denote the state and control input, and $f$ is a locally Lipschitz function that models the state transition. The following definition and theorem follow the development in Ames et al. (2019); Agrawal and Sreenath (2017).

Theorem 2.

Consider a function $h:\mathbb{R}^{n}\rightarrow\mathbb{R}$ that is continuously differentiable. Define a closed set $\mathcal{S}$ as the super-level set of this function as follows:

$$\mathcal{S}\triangleq\left\{r\in\mathbb{R}^{n}\;|\;h(r)\geq 0\right\}. \tag{20}$$

The function $h$ is a control barrier function for (19) with state $r$ if there exists an extended class $\kappa_{\infty}$ function $\alpha$ such that, for all $r\in\mathcal{S}$ and $t\in\mathbb{R}_{+}$,

$$\dot{h}\geq-\alpha(h). \tag{21}$$

Further, if we define the safe control set as

$$\mathcal{C}(r)\triangleq\left\{u\in\mathbb{R}^{m}\;|\;\dot{h}(r,u)\geq-\alpha(h(r))\right\}, \tag{22}$$

then any input $u\in\mathcal{C}(r)$ will render the set $\mathcal{S}$ forward invariant.
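As an illustration of Theorem 2, the following Python sketch samples controls from the safe control set of a toy system. Everything in it is an assumption made for illustration and not part of the experiments: the scalar single integrator $\dot{r}=u$, the barrier function $h(r)=1-r^{2}$, the linear choice $\alpha(h)=\gamma h$, the actuator limit `U_MAX`, and the forward-Euler step `DT`. For this system $\dot{h}=-2ru$, so (22) reduces to an interval in $u$.

```python
import random

GAMMA = 1.0   # slope of the assumed linear function alpha(h) = GAMMA * h
U_MAX = 2.0   # hypothetical actuator limit
DT = 0.01     # forward-Euler step size (assumed)


def h(r):
    """Barrier function of the toy example; the safe set is [-1, 1]."""
    return 1.0 - r * r


def safe_control_interval(r):
    """Controls u with h_dot = -2*r*u >= -GAMMA*h(r), intersected with [-U_MAX, U_MAX]."""
    lo, hi = -U_MAX, U_MAX
    if abs(r) > 1e-12:  # at r = 0 the condition holds for every u
        bound = GAMMA * h(r) / (2.0 * r)
        if r > 0:
            hi = min(hi, bound)
        else:
            lo = max(lo, bound)
    return lo, hi


def rollout(r0=0.0, steps=5000, seed=0):
    """Sample every control uniformly from the safe interval and integrate."""
    rng = random.Random(seed)
    r, traj = r0, [r0]
    for _ in range(steps):
        lo, hi = safe_control_interval(r)
        r = r + DT * rng.uniform(lo, hi)
        traj.append(r)
    return traj
```

Because every sampled control lies in the safe interval, the trajectory returned by `rollout()` should remain inside $[-1,1]$ for these particular parameter choices, even though the controls are otherwise random.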

When designing a safe controller with control values $u$ sampled from this safe control set, we need the time derivative of $h$, i.e., $\dot{h}$, to explicitly contain $u$. The above forward invariance condition is therefore restricted to barrier functions with relative degree $d_{r}=1$. We note, however, that in our quadcopter experiment the CBF has relative degree $d_{r}=2$, since only the second time derivative $\ddot{h}$ explicitly contains the control input $u$. Therefore, for barrier functions with relative degree greater than $1$, which are often referred to as Exponential Control Barrier Functions (ECBF), we need a separate discussion of forward invariance conditions.

B.2 Exponential Control Barrier Functions

We now discuss exponential CBFs for control-affine nonlinear dynamical systems. Consider the following control-affine nonlinear dynamical system:

$$\dot{r}=f(r)+g(r)u, \tag{23}$$

with $f$ and $g$ locally Lipschitz, $r\in D\subset\mathbb{R}^{n}$ and $u\in U\subset\mathbb{R}^{m}$. We suppose that the Lipschitz constants of $f$ and $g$ are $L_{f}$ and $L_{g}$, respectively, and that the vector containing $h(r)$ and its first $d_{r}-1$ time derivatives is given by

$$\eta_{b}(r)=\begin{bmatrix}h(r)\\ \dot{h}(r)\\ \ddot{h}(r)\\ \vdots\\ h^{(d_{r}-1)}(r)\end{bmatrix}.$$

Further suppose that the matrices $F$, $G$, and $C$ are defined as follows:

$$F=\begin{bmatrix}0&1&0&\cdots&0\\ 0&0&1&\cdots&0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&\cdots&1\\ 0&0&0&\cdots&0\end{bmatrix},\quad G=\begin{bmatrix}0\\ 0\\ 0\\ \vdots\\ 1\end{bmatrix},\quad C=\begin{bmatrix}1&0&0&\cdots&0\end{bmatrix}.$$

Theorem 3.

Consider a function $h:\mathbb{R}^{n}\rightarrow\mathbb{R}$ that is continuously differentiable. Define a closed set $\mathcal{S}$ as the super-level set of this function as follows:

$$\mathcal{S}\triangleq\left\{r\in\mathbb{R}^{n}\;|\;h(r)\geq 0\right\}. \tag{24}$$

Then the function $h$ is an ECBF with relative degree $d_{r}$ for the system in (23) if there exists a row vector $K_{\alpha}\in\mathbb{R}^{d_{r}}$ such that

$$\sup_{u\in U}\left[L_{f}^{d_{r}}h(r)+L_{g}L_{f}^{d_{r}-1}h(r)u\right]\geq-K_{\alpha}\eta_{b}(r) \tag{25}$$

$\forall r\in\mathrm{Int}(\mathcal{S})$ implies $h(r(t))\geq Ce^{(F-GK_{\alpha})t}\eta_{b}(r(t_{0}))\geq 0$ whenever $h(r(t_{0}))\geq 0$. Further, if we define the safe control set $\mathcal{C}(r)$ as

$$\mathcal{C}(r)\triangleq\left\{u\in U\;|\;\left[L_{f}^{d_{r}}h(r)+L_{g}L_{f}^{d_{r}-1}h(r)u\right]\geq-K_{\alpha}\eta_{b}(r)\right\}, \tag{26}$$

then any input $u\in\mathcal{C}(r)$ will render the set $\mathcal{S}$ forward invariant.
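To make the relative-degree-two case concrete, the following Python sketch applies condition (26) to a toy double integrator. All specifics are assumptions chosen for illustration: the system $\dot{p}=v$, $\dot{v}=u$ with state $r=(p,v)$, the barrier function $h(r)=p$, the gains in `K_ALPHA` (which place the eigenvalues of $F-GK_{\alpha}$ at $-1$ and $-2$), the nominal control, and the forward-Euler step. Here $L_{f}^{2}h=0$ and $L_{g}L_{f}h=1$, so (26) reads $u\geq-K_{\alpha}\eta_{b}(r)$ with $\eta_{b}(r)=[p,\ v]^{T}$.

```python
K_ALPHA = (2.0, 3.0)  # hypothetical gains; F - G*K_alpha has eigenvalues -1 and -2
DT = 0.01             # forward-Euler step size (assumed)


def ecbf_lower_bound(p, v):
    """Smallest u satisfying (26) for the toy system: u >= -K_alpha . [p, v]."""
    return -(K_ALPHA[0] * p + K_ALPHA[1] * v)


def simulate(p0=1.0, v0=0.0, u_nominal=-5.0, steps=1000):
    """Apply a nominal control that pushes toward p < 0, restricted to the safe set."""
    p, v, hs = p0, v0, [p0]
    for _ in range(steps):
        # Keep the nominal control only if it lies in the safe control set.
        u = max(u_nominal, ecbf_lower_bound(p, v))
        p, v = p + DT * v, v + DT * u
        hs.append(p)  # h(r) = p
    return hs
```

With these choices the nominal control would drive $p$ negative on its own, whereas the restricted control makes $h$ decay toward zero without crossing it, which matches the exponential lower bound in Theorem 3.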

Appendix C Experiments: Additional Details

In this section, we provide additional details regarding the experiments presented in §4.

C.1 Inverted Pendulum Experiments

The safety-constrained inverted pendulum environment that we considered in §4.4 was obtained by modifying the standard implementation from Brockman et al. (2016) to include CBF-based safety constraints. In this section we describe the dynamical model and CBF used to obtain these constraints, then present implementation details and an additional experiment.

C.1.1 Dynamical Model

Consider the model of a simple inverted pendulum

$$\begin{bmatrix}\theta_{k+1}\\ \dot{\theta}_{k+1}\end{bmatrix}=\begin{bmatrix}\theta_{k}+\delta t\,\dot{\theta}_{k}+\delta t^{2}\left(\dfrac{3g}{2l}\sin{\theta_{k}}+\dfrac{3}{ml^{2}}u_{k}\right)\\ \dot{\theta}_{k}+\delta t\left(\dfrac{3g}{2l}\sin{\theta_{k}}+\dfrac{3}{ml^{2}}u_{k}\right)\end{bmatrix}, \tag{31}$$

where $\theta_{k}$ and $\dot{\theta}_{k}$ denote the states (angle and angular velocity), $u_{k}$ denotes the input (torque), $m$ and $l$ denote the mass and the length of the pendulum, respectively, $g$ denotes the acceleration due to gravity, and $\delta t>0$ denotes the discretization time. Denote the safe operating region by

$$\mathcal{S}=\left\{\theta\in\mathbb{R}\;\middle|\;h(\theta):=\begin{bmatrix}\theta+1\\ 1-\theta\end{bmatrix}\geq 0\right\}. \tag{34}$$

C.1.2 Control Barrier Function

The following corollary is a direct consequence of Theorem 2, presented in §B.

Corollary 2.

Let

$$U(\theta_{k},\dot{\theta}_{k})=\left\{u_{k}\in\mathbb{R}\;\middle|\;\left(\delta t\,\dot{\theta}_{k}+c(\theta_{k},u_{k})\right)\begin{bmatrix}1\\ -1\end{bmatrix}+\eta\begin{bmatrix}\theta_{k}+1\\ 1-\theta_{k}\end{bmatrix}\geq 0\right\} \tag{39}$$

where $c(\theta_{k},u_{k}):=\delta t^{2}\left(\dfrac{3g}{2l}\sin{\theta_{k}}+\dfrac{3}{ml^{2}}u_{k}\right)$, and $0<\eta<1$. Consider system (31) with $u_{k}\in U(\theta_{k},\dot{\theta}_{k})$. Let $(\theta_{0},\dot{\theta}_{0})\in\mathcal{S}\times\mathbb{R}$ and assume $U(\theta_{0},\dot{\theta}_{0})$ is non-empty. The set $\mathcal{S}$ is forward invariant.

The safe set (39) is used to provide the state-dependent constraints to the Beta policies learned in our experiments.
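Since $c(\theta_{k},u_{k})$ is affine in $u_{k}$, the two rows of (39) are equivalent to $-\eta(\theta_{k}+1)\leq\delta t\,\dot{\theta}_{k}+c(\theta_{k},u_{k})\leq\eta(1-\theta_{k})$, so the safe set is an interval of torques. The Python sketch below computes that interval and steps the model (31). The numerical values of $g$, $m$, $l$, $\delta t$, and $\eta$ are placeholders chosen for illustration, and the function names are hypothetical; none of them are taken from the experiments.

```python
import math

# Placeholder parameters (assumed for illustration only).
G, M, L_PEND, DT, ETA = 10.0, 1.0, 1.0, 0.05, 0.5


def safe_torque_interval(theta, theta_dot):
    """Return (a, b) such that every u in [a, b] satisfies both rows of (39)."""
    gravity = 3.0 * G / (2.0 * L_PEND) * math.sin(theta)
    gain = 3.0 / (M * L_PEND ** 2)  # positive, so inequalities keep their direction
    # (39) is equivalent to c_min <= c(theta, u) <= c_max with:
    c_min = -ETA * (theta + 1.0) - DT * theta_dot
    c_max = ETA * (1.0 - theta) - DT * theta_dot
    a = (c_min / DT ** 2 - gravity) / gain
    b = (c_max / DT ** 2 - gravity) / gain
    return a, b


def step(theta, theta_dot, u):
    """One step of the discretized pendulum model (31)."""
    acc = 3.0 * G / (2.0 * L_PEND) * math.sin(theta) + 3.0 / (M * L_PEND ** 2) * u
    return theta + DT * theta_dot + DT ** 2 * acc, theta_dot + DT * acc
```

For any state with $\theta_{k}\in[-1,1]$, choosing $u_{k}$ in the returned interval gives $\theta_{k+1}+1\geq(1-\eta)(\theta_{k}+1)$ and $1-\theta_{k+1}\geq(1-\eta)(1-\theta_{k})$, which is the invariance property stated in Corollary 2. The sketch ignores any additional torque limits.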

C.1.3 Implementation Details

We next describe the implementation details of our experiments. As mentioned above, the environment was adapted from the implementation of Brockman et al. (2016), with modifications to compute the CBF safe set (39). The reward function and other details are as in Brockman et al. (2016). The Beta and Gaussian policies used the corresponding distributions from the PyTorch library (Paszke et al., 2019). As described in §4.2, for a given state $x$, the parameters $\alpha(x),\beta(x)$ of the Beta distribution were output by a two-layer, fully connected neural network. Control inputs were obtained by sampling from this distribution, then translating and rescaling the sample to lie within the current CBF set $C(x)=[a(x),b(x)]$. The Gaussian policy parameters were likewise output by a two-layer, fully connected neural network. Control inputs were then sampled from the corresponding distribution and, following standard practice (Chou et al., 2017), clipped to a set of permissible controls, which was chosen to be $[-15.0,15.0]$. The hyperparameters used are presented in Figure 4.
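The sampling-and-rescaling step for the Beta policy can be sketched as follows. This is a minimal illustration using only the Python standard library, not the PyTorch implementation used in the experiments: `alpha` and `beta` are fixed numbers standing in for the network outputs $\alpha(x),\beta(x)$, and `a`, `b` stand in for the endpoints of $C(x)$. The function names are hypothetical. The log-density includes the change-of-variables term $-\log(b-a)$ that arises from rescaling a sample on $[0,1]$ to $[a,b]$.

```python
import math
import random


def sample_constrained_action(alpha, beta, a, b, rng=random):
    """Draw z ~ Beta(alpha, beta) on [0, 1], then translate and rescale to [a, b]."""
    z = rng.betavariate(alpha, beta)
    return a + (b - a) * z


def log_density(u, alpha, beta, a, b):
    """Log-density of the rescaled Beta distribution at u, for a < u < b."""
    z = (u - a) / (b - a)
    # log of the Beta function B(alpha, beta)
    log_norm = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    log_beta = (alpha - 1.0) * math.log(z) + (beta - 1.0) * math.log(1.0 - z) - log_norm
    # Change of variables u = a + (b - a) * z contributes -log(b - a).
    return log_beta - math.log(b - a)
```

Because the support of the rescaled distribution is exactly $[a,b]$, every sampled control lies in the constraint set by construction, with no clipping or projection step.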

| Hyperparameter | (a) Gaussian | (b) Beta |
|---|---|---|
| policy learning rate | 0.0003 | 0.01 |
| value learning rate | 0.0003 | 0.01 |
| entropy coefficient | 0.0 | 0.0 |
| clip range | 0.2 | 0.2 |
| weight decay | 0.0 | 0.0 |
| layer size | 64 | 64 |
| batch size | 64 | 64 |
| buffer size | 300 | 300 |
| number of epochs | 10 | 10 |
| rollout length | 300 | 300 |
| discount factor | 0.99 | 0.99 |

Figure 4: PPO hyperparameters for the inverted pendulum experiments.

C.1.4 Additional Results

Figure 5: Comparison of safety-constrained Beta policy and unconstrained Gaussian policy on the inverted pendulum environment with constraint set $\mathcal{S}_{1.0}=\{\theta\ |\ -1.0\leq\theta\leq 1.0\}$. The top figure presents learning curves, while the bottom figure presents the “safety rate”, i.e., the percentage of time spent in $\mathcal{S}_{1.0}$ over the course of the episode. The curves represent means and 95% confidence intervals over five independent replications.

Figure 5 presents an experiment providing additional support to the discussion presented in §4.4.

C.2 Quadcopter Experiments

In this section, we provide additional details regarding the environment and experiments presented in §4.3.

C.2.1 Dynamical Model

We summarize the dynamical model of the quadcopter derived in Xu and Sreenath (2018). We consider the body frame, say $\mathbb{F}_{b}$, and world frame, say $\mathbb{F}_{w}$, and discuss the transformation between these two frames using the rotation matrix $\mathbb{R}_{wb}$ defined as

$$\mathbb{R}_{wb}:=\begin{bmatrix}\cos{\psi}\cos{\theta}-\sin{\phi}\sin{\psi}\sin{\theta}&-\cos{\phi}\sin{\psi}&\cos{\psi}\sin{\theta}+\cos{\theta}\sin{\phi}\sin{\psi}\\ \cos{\theta}\sin{\psi}+\cos{\psi}\sin{\phi}\sin{\theta}&\cos{\phi}\cos{\psi}&\sin{\psi}\sin{\theta}-\cos{\psi}\cos{\theta}\sin{\phi}\\ -\cos{\phi}\sin{\theta}&\sin{\phi}&\cos{\phi}\cos{\theta}\end{bmatrix}, \tag{40}$$

where $\phi$, $\theta$, and $\psi$ denote the Z-X-Y Euler angles corresponding to the roll, pitch, and yaw of the quadcopter. Suppose that the 3-dimensional position coordinates of the quadcopter along the x-, y-, and z-axes with respect to its body frame $\mathbb{F}_{b}$ and the world frame of reference $\mathbb{F}_{w}$ are given by $x_{b}:=(x_{b},y_{b},z_{b})$ and $r:=(r_{x},r_{y},r_{z})$, respectively. Then $r=\mathbb{R}_{wb}x_{b}$.
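As a quick sanity check of (40), the following Python sketch builds the rotation matrix entry by entry and applies it to a body-frame vector. The function names `rotation_wb` and `body_to_world` are hypothetical and introduced only for this illustration; the entries are copied from (40).

```python
import math


def rotation_wb(phi, theta, psi):
    """Rotation matrix (40) for Z-X-Y Euler angles (roll phi, pitch theta, yaw psi)."""
    cphi, sphi = math.cos(phi), math.sin(phi)
    cth, sth = math.cos(theta), math.sin(theta)
    cpsi, spsi = math.cos(psi), math.sin(psi)
    return [
        [cpsi * cth - sphi * spsi * sth, -cphi * spsi, cpsi * sth + cth * sphi * spsi],
        [cth * spsi + cpsi * sphi * sth, cphi * cpsi, spsi * sth - cpsi * cth * sphi],
        [-cphi * sth, sphi, cphi * cth],
    ]


def body_to_world(R, x_b):
    """Compute r = R x_b for a 3x3 matrix R and a length-3 vector x_b."""
    return [sum(R[i][j] * x_b[j] for j in range(3)) for i in range(3)]
```

A useful check when transcribing such a matrix is that its rows are orthonormal for arbitrary angles and that it reduces to the identity when all three angles are zero.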

Then, the quadcopter dynamics are given by $\dot{x}=Ax+Bu$, where the control input $u$ comprises the desired acceleration of the quadcopter. Under small-angle assumptions on the Euler angles (that is, $\sin\hat{e}\approx\hat{e}$, $\cos\hat{e}\approx 1$, $\hat{e}\in\{\phi,\theta,\psi\}$), the dynamics of this controller evolve as (Mellinger et al., 2012):

$$u=\begin{bmatrix}\ddot{r}_{x}^{des}\\ \ddot{r}_{y}^{des}\\ \ddot{r}_{z}^{des}\end{bmatrix}=\begin{bmatrix}g(\theta^{des}\cos{\psi^{des}}+\phi^{des}\sin{\psi^{des}})\\ g(\theta^{des}\sin{\psi^{des}}-\phi^{des}\cos{\psi^{des}})\\ \frac{\sum^{4}_{i=1}F_{i}^{des}}{m}-g\end{bmatrix}, \tag{47}$$

where $m$ and $g$ are the mass of the quadcopter and the gravitational constant, respectively; $\ddot{r}_{i}^{des}$, $i\in\{x,y,z\}$, is the desired acceleration component of the quadcopter in the x-, y-, and z-directions, computed using the desired specifications on the Euler angles $\phi^{des}$, $\theta^{des}$, and $\psi^{des}$; and $F_{i}^{des}$, $i\in\{1,2,3,4\}$, is the desired thrust on the $i$-th rotor of the quadcopter. Lastly, the dynamical parameters for the quadcopter are set up as given in Ho et al. (2020).
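The map in (47) from desired Euler angles and rotor thrusts to the desired acceleration can be written directly as a small function. The Python sketch below is an illustration only: the function name `desired_acceleration`, the argument names, and the default value $g=9.81$ are assumptions and are not taken from the experiments.

```python
import math


def desired_acceleration(phi_des, theta_des, psi_des, thrusts_des, m, g=9.81):
    """Evaluate (47): desired accelerations under the small-angle approximation.

    thrusts_des is a sequence with the four desired rotor thrusts F_i^des.
    """
    ax = g * (theta_des * math.cos(psi_des) + phi_des * math.sin(psi_des))
    ay = g * (theta_des * math.sin(psi_des) - phi_des * math.cos(psi_des))
    az = sum(thrusts_des) / m - g
    return ax, ay, az
```

For example, with all desired angles equal to zero and a total thrust of $mg$, the three components vanish, which corresponds to hover.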

C.2.2 Exponential Control Barrier Function

Recall that the objective of our controller is to enable the quadcopter to learn how to reach a target position $r_{goal}$ while avoiding an obstacle at position $r_{obs}$. For this obstacle avoidance task, we now discuss our choice of CBF for the quadcopter experiment, as defined in (14), and explain why it is an exponential control barrier function (ECBF). We first derive expressions for $\dot{h}$ and $\ddot{h}$ using the dynamical equations as follows:

$$\dot{h}(r)=4\left(\frac{\Delta r_{x}^{3}}{a^{4}}\dot{r}_{x}+\frac{\Delta r_{y}^{3}}{b^{4}}\dot{r}_{y}+\frac{\Delta r_{z}^{3}}{c^{4}}\dot{r}_{z}\right), \tag{48}$$

and

$$\ddot{h}(r)=12\left(\frac{\Delta r_{x}^{2}}{a^{4}}\dot{r}_{x}^{2}+\frac{\Delta r_{y}^{2}}{b^{4}}\dot{r}_{y}^{2}+\frac{\Delta r_{z}^{2}}{c^{4}}\dot{r}_{z}^{2}\right)+4\left(\frac{\Delta r_{x}^{3}}{a^{4}}\ddot{r}_{x}+\frac{\Delta r_{y}^{3}}{b^{4}}\ddot{r}_{y}+\frac{\Delta r_{z}^{3}}{c^{4}}\ddot{r}_{z}\right). \tag{49}$$

These equations can be re-written in vector form as follows:

$$\dot{h}(r) = \begin{bmatrix} 4\Delta r_x^3/a^4 & 4\Delta r_y^3/b^4 & 4\Delta r_z^3/c^4 \end{bmatrix} \dot{r} \tag{50}$$

and

$$\ddot{h}(r) = \dot{r}^T \begin{bmatrix} 12\Delta r_x^2/a^4 & 0 & 0 \\ 0 & 12\Delta r_y^2/b^4 & 0 \\ 0 & 0 & 12\Delta r_z^2/c^4 \end{bmatrix} \dot{r} + \begin{bmatrix} 4\Delta r_x^3/a^4 & 4\Delta r_y^3/b^4 & 4\Delta r_z^3/c^4 \end{bmatrix} \ddot{r}. \tag{51}$$

Since $u = \ddot{r}$ from the quadcopter dynamics, we can rewrite $\ddot{h}(r)$ as follows:

$$\ddot{h}(r) = \dot{r}^T D_r \dot{r} - A_r u, \tag{52}$$

where $D_r := \begin{bmatrix} 12\Delta r_x^2/a^4 & 0 & 0 \\ 0 & 12\Delta r_y^2/b^4 & 0 \\ 0 & 0 & 12\Delta r_z^2/c^4 \end{bmatrix}$ and $A_r := -\begin{bmatrix} 4\Delta r_x^3/a^4 & 4\Delta r_y^3/b^4 & 4\Delta r_z^3/c^4 \end{bmatrix}$.

Therefore, $\ddot{h}(r)$, the second time derivative of $h(r)$, depends explicitly on the control input $u$, so our choice of CBF $h(r)$ is an exponential CBF with relative degree $2$ (Ames et al., 2019).

Correspondingly, we use the following forward invariance condition for the set $\mathcal{S} = \{r : h(r) \geq 0\}$, as given in Xu and Sreenath (2018):

$$\ddot{h} + K \cdot [h \quad \dot{h}]^T \geq 0, \tag{53}$$

with $K = [K_1 \quad K_2]^T$.

The above equation can be re-arranged as follows:

$$-\ddot{h} \leq K_1 h + K_2 \dot{h},$$

and, using (50) and (52), we can re-write this equation as

$$A_r u \leq b_r, \tag{54}$$

where $b_r = \dot{r}^T D_r \dot{r} + K_1 h - K_2 A_r \dot{r}$. Thus, we can write the safe control set as $\mathcal{C}(r) = \{u \in \mathbb{R}^3 : A_r u \leq b_r\}$. For our quadcopter experiments, we consider navigation in the $x$-$y$ plane and therefore set the $z$-dimension position and velocity to $0$. The control input thus comprises only the desired accelerations along the $x$ and $y$ axes, so our action space is two-dimensional, and we consider only the $x$ and $y$ components in the above CBF calculations.
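The construction of $A_r$ and $b_r$ in (50)-(54) can be illustrated numerically. The following is a minimal sketch, not the authors' implementation. It assumes a barrier function of the form $h(r) = \sum_i (\Delta r_i / a_i)^4 - 1$, which matches the gradient in (50) up to the additive constant (the constant is an assumption). The names `cbf_halfspace`, `is_safe`, `center`, and `semi_axes` are hypothetical.

```python
import numpy as np


def cbf_halfspace(r, r_dot, center, semi_axes, K1, K2):
    """Return (A_r, b_r) so that the safe control set is {u : A_r @ u <= b_r}.

    ASSUMPTION: h(r) = sum((delta / semi_axes)**4) - 1 with delta = r - center.
    This form is consistent with the gradient in (50); the "- 1" is assumed.
    """
    delta = r - center                                  # Delta r
    h = np.sum((delta / semi_axes) ** 4) - 1.0          # assumed barrier value
    grad = 4.0 * delta ** 3 / semi_axes ** 4            # row vector in (50)
    D_r = np.diag(12.0 * delta ** 2 / semi_axes ** 4)   # diagonal matrix in (51)
    A_r = -grad                                         # definition following (52)
    # b_r as defined after (54); note that h_dot = -A_r @ r_dot.
    b_r = r_dot @ D_r @ r_dot + K1 * h - K2 * (A_r @ r_dot)
    return A_r, b_r


def is_safe(u, A_r, b_r):
    """Check the linear safety constraint (54) for a candidate control u."""
    return float(A_r @ u) <= b_r


# Planar example: the agent at (2, 0) moves toward an obstacle centered at the origin.
A_r, b_r = cbf_halfspace(
    r=np.array([2.0, 0.0]),
    r_dot=np.array([-1.0, 0.0]),
    center=np.array([0.0, 0.0]),
    semi_axes=np.array([1.0, 1.0]),
    K1=1.0,
    K2=2.0,
)
print(A_r, b_r)
print(is_safe(np.array([0.0, 0.0]), A_r, b_r))  # zero acceleration
print(is_safe(np.array([1.0, 0.0]), A_r, b_r))  # accelerating away from the obstacle
```

In this example the constraint reduces to a lower bound on the $x$-acceleration, so coasting toward the obstacle is rejected while accelerating away from it is admitted. The gains `K1` and `K2` are illustrative values only.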

C.2.3 Maximal Inner Hyperrectangle Computation

We now describe the construction of the maximal inner hyperrectangle contained in the set $\mathcal{C}(r)$ under actuator constraints $H$. These are the sets from which our Beta policies sample. We use the following optimization problem, with decision variables $u = (u^x, u^y)$, to obtain the maximal inner hyperrectangle inside the safe set:

$$\mathcal{P}_A: \begin{aligned} \max_{u}\ \ & \mathcal{A}(u) \\ \text{s.t.}\ \ & A_r u \leq b_r, \\ & u \in H, \end{aligned} \tag{55}$$

where $\mathcal{A}(u)$ is the area of a hyperrectangle inside $\mathcal{C}(r) \cap H$ and the decision variables $(u^x, u^y)$ form a point on the boundary line of the constraint $A_r u \leq b_r$. One of the corner points of this hyperrectangle is formed by $(u^x, u^y)$ and the remaining corner points lie on the boundary of the hyperrectangle formed by $H$. Suppose that $(u^x_*, u^y_*)$ solves $\mathcal{P}_A$. The definition of the area $\mathcal{A}$ depends on how the line bounding $A_r u \leq b_r$ intersects $H$, which leads to the following four possibilities, where $H_c$ lists two opposite corners of the resulting hyperrectangle:

  • $\mathcal{A} = (u^x - u_{min}^x) \cdot (u^y - u_{min}^y)$ and $H_c = \{u_{min}, (u^x_*, u^y_*)\}$

  • $\mathcal{A} = (u^x - u_{min}^x) \cdot (u_{max}^y - u^y)$ and $H_c = \{(u_{min}^x, u^y_*), (u^x_*, u_{max}^y)\}$

  • $\mathcal{A} = (u_{max}^x - u^x) \cdot (u_{max}^y - u^y)$ and $H_c = \{(u^x_*, u^y_*), u_{max}\}$

  • $\mathcal{A} = (u_{max}^x - u^x) \cdot (u^y - u_{min}^y)$ and $H_c = \{(u^x_*, u_{min}^y), (u_{max}^x, u^y_*)\}$.

$\mathcal{P}_A$ is, in general, a non-convex program. However, a change of variables transforms it into a tractable problem. The new variables, denoted by $(\bar{u}^x, \bar{u}^y)$, are defined for the four cases above, respectively, by

  • $\bar{u}^x = u^x - u_{min}^x$ and $\bar{u}^y = u^y - u_{min}^y$

  • $\bar{u}^x = u^x - u_{min}^x$ and $\bar{u}^y = u_{max}^y - u^y$

  • $\bar{u}^x = u_{max}^x - u^x$ and $\bar{u}^y = u_{max}^y - u^y$

  • $\bar{u}^x = u_{max}^x - u^x$ and $\bar{u}^y = u^y - u_{min}^y$.

The corresponding objective function is given by $\bar{\mathcal{A}} = \bar{u}^x \bar{u}^y$. So long as the entries corresponding to $A_r$ and $b_r$ from (55) in the transformed problem are nonnegative, the resulting problem is a geometric program, which can be transformed into a convex problem by standard methods and solved efficiently. In our quadcopter experiments, when $A_r, b_r \geq 0$, we solved the transformed geometric program using CVXPY (Diamond and Boyd, 2016), and used nonlinear solvers from SciPy (Virtanen et al., 2020) otherwise. We observed in our experiments that the transformation resulted in geometric programs in all but a handful of cases.
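For intuition, the transformed problem in the first case of the list above can be solved in closed form in two dimensions. The sketch below is an illustrative alternative to the CVXPY and SciPy solvers reported above, not the authors' code. It assumes strictly positive transformed coefficients, and the names `max_rectangle_2d` and `sample_action` and the Beta parameters are hypothetical. With $\bar{u} = u - u_{min}$, the constraint $A_r u \leq b_r$ becomes $A_r \bar{u} \leq b_r - A_r u_{min}$ and the actuator box becomes $0 \leq \bar{u} \leq u_{max} - u_{min}$.

```python
import numpy as np


def max_rectangle_2d(a, b, w):
    """Maximize x*y subject to a[0]*x + a[1]*y <= b, 0 <= x <= w[0], 0 <= y <= w[1].

    ASSUMPTION: a[0] > 0, a[1] > 0, b > 0 and w > 0 (the geometric-program case).
    """
    ax, ay = a
    wx, wy = w
    if ax * wx + ay * wy <= b:
        # The far corner of the actuator box is feasible, so the whole box is safe.
        return np.array([wx, wy])
    # Otherwise an optimum lies on the line ax*x + ay*y = b inside the box.
    x_lo = max(0.0, (b - ay * wy) / ax)   # ensures y <= wy
    x_hi = min(wx, b / ax)                # ensures y >= 0 and x <= wx
    # x*(b - ax*x)/ay is concave in x with unconstrained maximizer b/(2*ax).
    x = float(np.clip(b / (2.0 * ax), x_lo, x_hi))
    y = (b - ax * x) / ay
    return np.array([x, y])


def sample_action(u_bar_star, u_min, rng, alpha=(2.0, 2.0), beta=(2.0, 2.0)):
    """Draw a Beta sample on [0, 1]^2 and scale it to the rectangle with
    opposite corners u_min and u_min + u_bar_star (first case of the list)."""
    z = rng.beta(alpha, beta)
    return u_min + z * u_bar_star


# Hypothetical numbers: safety constraint u_x + 2*u_y <= 1, actuator box [-1, 1]^2.
A_r = np.array([1.0, 2.0])
b_r = 1.0
u_min = np.array([-1.0, -1.0])
u_max = np.array([1.0, 1.0])

u_bar_star = max_rectangle_2d(A_r, b_r - A_r @ u_min, u_max - u_min)
u = sample_action(u_bar_star, u_min, np.random.default_rng(0))
print(u_bar_star, u, A_r @ u <= b_r)
```

Because every sample lies inside a rectangle contained in $\mathcal{C}(r) \cap H$, it satisfies the safety constraint by construction, with no projection step. In a learned policy the Beta parameters would be state-dependent network outputs rather than the fixed values used here.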

C.2.4 Reward

We now discuss the reward shaping used in our quadcopter experiments.

Suppose that $r_{min} := [r_{min}^x \quad r_{min}^y]^T$ and $r_{max} := [r_{max}^x \quad r_{max}^y]^T$ are the environment boundaries, with the $x$- and $y$-axis boundaries respectively defined by $[r_{min}^x, r_{max}^x]$ and $[r_{min}^y, r_{max}^y]$, which we employ to guide exploration in both quadcopter experiments. Then the reward used in our environment is defined by

$$R(r) = \begin{cases} 50 & \text{if } \|r - r_{goal}\|_2 < \epsilon, \\ -\|r - r_{goal}\|_2 & \text{if } r_{max} > r > r_{min} \text{ and } \|r - r_{goal}\|_2 \geq \epsilon, \\ -\|r - r_{goal}\|_2 - 400 & \text{if } r \geq r_{max} \text{ or } r \leq r_{min}, \end{cases}$$

where $\epsilon = 0.25$ is the radius of the neighborhood around $r_{goal}$ within which we give the agent a constant positive reward of $50$, and the inequalities in the reward definition are element-wise. When the agent is inside the boundary but outside the $\epsilon$-neighborhood of the goal, the reward is the negative of the distance between the agent and the goal. Lastly, we penalize the agent if it goes outside the boundary defined by $r_{min}$ and $r_{max}$ to encourage exploration in the region around the goal.
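The piecewise reward above translates directly into code. The sketch below is one possible reading, not the authors' implementation: it assumes that the agent counts as outside the boundary when any coordinate reaches or exceeds a bound, and that the goal check takes priority. The function name `reward` and the example bounds are hypothetical.

```python
import numpy as np


def reward(r, r_goal, r_min, r_max, eps=0.25):
    """Shaped reward for a planar position r, following the three cases above."""
    dist = float(np.linalg.norm(r - r_goal))
    if dist < eps:
        return 50.0                       # inside the eps-neighborhood of the goal
    if np.all(r > r_min) and np.all(r < r_max):
        return -dist                      # strictly inside the boundary
    # ASSUMPTION: any coordinate at or beyond a bound triggers the penalty.
    return -dist - 400.0


r_goal = np.array([0.0, 0.0])
r_min = np.array([-10.0, -10.0])
r_max = np.array([10.0, 10.0])
print(reward(np.array([0.1, 0.0]), r_goal, r_min, r_max))
print(reward(np.array([3.0, 4.0]), r_goal, r_min, r_max))
print(reward(np.array([11.0, 0.0]), r_goal, r_min, r_max))
```

The large constant penalty dominates the distance term, so leaving the boundary is always worse than any position inside it for environments of this size.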

C.2.5 Hyperparameters

The hyperparameters used in the experiments are presented in Figure 6.

Hyperparameter          (a) Gaussian    (b) Beta
policy learning rate    0.0004          0.0006
value learning rate     0.0004          0.0006
entropy coefficient     0.00000001      0.0
clip range              0.2             0.2
weight decay            0.0             0.0
layer size              256             256
batch size              256             256
buffer size             320             180
number of epochs        10              10
rollout length          320             180
discount factor         0.90            0.90

Figure 6: PPO hyperparameters for the quadcopter experiments: (a) Gaussian hyperparameters, (b) Beta hyperparameters.

C.2.6 Learning Curves

Learning curves for the experiments illustrated in Figures 2 and 1(c) are presented in Figures 7(a) and 7(b).

(a) Beta policy learning curve.
(b) Projected Gaussian learning curve.
Figure 7: Learning curves corresponding to the experiments pictured in Figures 2 and 1(c). Curves show means and 95% confidence intervals over 6 independent replications. Our CBF-constrained Beta policies clearly learn to improve reward and eventually find the goal, while the projection-based approach fails.

Appendix D Computing Resources

We ran our experiments on both a personal laptop and an HPC cluster. The laptop was configured with a 6-core i7-8750H 2.20 GHz CPU, an NVIDIA GeForce RTX 2070 GPU, and 32 GB RAM. The HPC server node was configured with a 32-core Intel Xeon CPU, an 80 GB NVIDIA Tesla GPU, and 512 GB RAM.