Sampling-based Safe Reinforcement Learning for
Nonlinear Dynamical Systems
Wesley A. Suttle (U.S. Army Research Laboratory), Vipul K. Sharma (Purdue University), Krishna C. Kosaraju (Clemson University),
S. Sivaranjani (Purdue University), Ji Liu (Stony Brook University), Vijay Gupta (Purdue University), Brian M. Sadler (U.S. Army Research Laboratory)
Abstract
We develop provably safe and convergent reinforcement learning (RL) algorithms for control of nonlinear dynamical systems, bridging the gap between the hard safety guarantees of control theory and the convergence guarantees of RL theory. Recent advances at the intersection of control and RL follow a two-stage, safety filter approach to enforcing hard safety constraints: model-free RL is used to learn a potentially unsafe controller, whose actions are projected onto safe sets prescribed, for example, by a control barrier function. Though safe, such approaches lose any convergence guarantees enjoyed by the underlying RL methods. In this paper, we develop a single-stage, sampling-based approach to hard constraint satisfaction that learns RL controllers enjoying classical convergence guarantees while satisfying hard safety constraints throughout training and deployment. We validate the efficacy of our approach in simulation, including safe control of a quadcopter in a challenging obstacle avoidance problem, and demonstrate that it outperforms existing benchmarks.
1 INTRODUCTION
Learning-based methods for safe control of physical systems have been gaining increasing attention (Brunke et al., 2022). RL is especially powerful for the control of systems where performance feedback in the form of a scalar reward is available, but the dynamics are unknown (Sutton and Barto, 2018). In such settings, RL methods can learn a controller maximizing reward through direct interaction with the environment. However, due to physical realities such as the need to guarantee safety, practical application of RL to control of physical systems requires constraints on the control policies throughout training (García and Fernández, 2015). While directly constraining the action space to a static, narrowly defined set of “safe” actions is frequently employed in practice, this can lead to learning highly suboptimal policies, and more nuanced methods are therefore required. Furthermore, in most physical systems it is non-trivial to directly translate complex safety constraints on the states into allowable actions.
A variety of RL approaches to the problem of safe learning for control have been proposed in the literature (see Brunke et al. (2022) for a comprehensive survey), including RL methods for safety-focused problems formulated as constrained Markov decision processes (CMDPs) (Altman, 2021), methods for learning to achieve safety through stability (Berkenkamp et al., 2017), and projection-based – also known as “safety filter” – RL methods for maintaining hard safety constraints, typically achieved through the use of control barrier functions (CBFs) (Cheng et al., 2019). Though CMDP-based methods enjoy convergence guarantees, they encourage safety without guaranteeing it, and cannot provide guarantees for hard safety constraints commonly required in physical systems. Likewise, methods like Berkenkamp et al. (2017) do better by offering high-probability safety assurances, but stop short of guaranteeing safety. In systems where safety is critical, methods like Cheng et al. (2019) that provably guarantee hard constraint satisfaction are necessary. However, the interaction between imposition of hard constraints and optimality of the resulting control policies is a subtle issue in RL. While projection-based safety-filter approaches (Wabersich et al., 2023) provably guarantee safety, the projection procedure undermines any convergence guarantees enjoyed by the underlying RL methods.
In this work, we develop a class of model-free policy gradient methods that maintain safety or other stability properties by sampling directly from the set of state-dependent safe actions. The key to our approach is that we consider truncated versions of commonly used stochastic policies, allowing us to sample directly from the safe action set at each state. This allows us to recover convergence guarantees by extending existing results for policy gradient methods to truncated policies. Our approach is applicable to a wide class of safety constraints, including control barrier functions (CBFs), which enforce forward invariance of a set characterized by nonlinearly coupled states and actions (Ames et al., 2019, 2016), and reachability-type constraints (Wabersich et al., 2023). In addition to our theoretical results, we experimentally validate the practical utility of sampling-based safety-preservation methods by considering a special case: Beta policies with state-dependent, CBF-constrained action sets. This novel approach extends the Beta policies in Chou et al. (2017) to the state-dependent action constraint setting. Finally, we train the resulting CBF-constrained Beta policies using PPO to solve a safety-constrained inverted pendulum problem as well as a quadcopter navigation and obstacle avoidance problem, and compare the latter to a safety filter-based benchmark. Our implementation is publicly available at https://github.com/sharma1256/cbf-constrained_ppo. These case studies illustrate that our method simultaneously guarantees safety throughout training and guarantees optimality, even where existing benchmarks fail.
1.1 Related Work
Safety and stability have seen a great deal of interest in recent years at the intersection of the RL and control communities (see Brunke et al. (2022); García and Fernández (2015) for overviews). We are interested in safety definitions that impose hard constraints on the states and control actions (rather than, e.g., those used in robust RL (Wiesemann et al., 2013; Aswani et al., 2013) or RL for CMDPs (Achiam et al., 2017; Paternain et al., 2019; Ma et al., 2021; Bai et al., 2022)). Model-based methods for guaranteeing stability using RL controllers in systems with known or learnable dynamics have been developed in Berkenkamp et al. (2017); Fazel et al. (2018); Zhang et al. (2021). Recently, techniques leveraging control barrier functions to maintain safety (Cheng et al., 2019) and dissipativity (Kosaraju et al., 2021) have been developed.
Our work lies in the model-free RL setting. The two dominant approaches in model-free RL are value function and policy gradient-based methods (Sutton and Barto, 2018). We focus on the latter in this paper. Since their origins early in the development of RL (Sutton et al., 2000; Borkar, 2005; Bhatnagar et al., 2009), policy gradient methods have become the model-free algorithms of choice for complex problems with continuous, high-dimensional state and action spaces (Lillicrap et al., 2015; Schulman et al., 2017; Haarnoja et al., 2018). Recent works have improved our understanding of gradient estimation procedures, global optimality properties, and convergence rates of these algorithms (Agarwal et al., 2020; Zhang et al., 2020; Suttle et al., 2023). Popular approaches for safety in model-free RL include using bounds resulting from Gaussian process models (Schreiter et al., 2015; Rasmussen, 2003; Sui et al., 2015), reward shaping, constrained policy optimization (Achiam et al., 2017; Wachi et al., 2018), and teacher advice (Abbeel and Ng, 2004). Our work is most closely related to those approaches that use a hard safe set specification and constraints on control inputs, e.g., control barrier functions (Cheng et al., 2019; Fisac et al., 2018; Li et al., 2018; Kosaraju et al., 2021). In particular, our key contribution is a model-free safe RL algorithm with convergence guarantees and provable safety guarantees under hard constraints like CBFs, even during training.
2 PROBLEM SETTING
Consider a discounted MDP , where is the state space, is the action space, is the transition probability function given action is taken in state , is the reward function, and is the discount factor. The MDP, which can be used to model a wide array of discrete-time systems, proceeds as follows: at time , the system is in state ; a control input is applied to the system; a reward is received; the system transitions into state according to the distribution . The goal in this problem formulation is to maximize the expected discounted reward, which we define in (3) below. Note that deterministic dynamics can be recovered by imposing that, for each , there exists such that . This is useful for modeling discretizations of continuous-time control problems, for example. We assume throughout this paper that the dynamics are deterministic in this way, which is a common setting in safe control problems. Let represent the dynamics of the MDP, i.e., given state and control input , denotes the state the system transitions into when input is applied while in state .
Letting denote the set of all probability distributions over the set , a stochastic policy is a function mapping states to probability distributions over the action space . In other words, given a state , an agent using policy will choose a control action by sampling . For our purposes it will be useful to consider policies parameterized by , for some , where is a compact set of permissible parameters.
Let denote some “safe” or stable set within which we wish to keep the system. Furthermore, let denote the powerset of a set , and consider a set-valued function given by
(1) |
Intuitively, is the set of all control inputs which when applied at state keep the system within the safe set at the next time step. We assume throughout that, for a given , is known. Since our primary focus is resolving the open problem of simultaneously guaranteeing convergence and hard safety constraint satisfaction, we leave the issue of learning or approximating while maintaining these guarantees to future work. The general formulation can be used to accommodate a variety of notions of safety, including forward invariance, stability, and dissipativity enforced by, for example, CBFs and exponential CBFs (ECBFs), and control Lyapunov functions (CLFs) (see the supplementary material for an overview and Ames et al. (2019) for a comprehensive survey). As we will demonstrate in the case studies below, the use of our method in conjunction with (E)CBFs is particularly natural to provide guarantees in problems with hard safety constraints. To ensure that we can sample from and integrals over are well-defined, we make the following assumption. Let denote the Lebesgue measure.
Assumption 1.
There exist such that , for all . Furthermore, is compact.
Given a policy , consider the distribution obtained by truncating to the set . More precisely:
(2) |
where . As long as we can check membership in for any given , and assuming that the volume of is strictly positive, for all , we can generate from this distribution by using rejection sampling, i.e. repeatedly sampling until we obtain . Note that, depending on the structure of parametrized policies , if has a particularly nice form, such as an interval or hyperrectangle, there may be more efficient methods than rejection sampling for sampling from the truncated distribution directly. We exploit this fact when leveraging Beta policies in the experimental results of Section 4 below.
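The rejection-sampling procedure described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the one-dimensional Gaussian policy, and the interval-shaped safe set are assumptions chosen for the example, and the only requirement carried over from the text is that membership in the safe action set can be checked.

```python
import random


def sample_truncated(sample_policy, is_safe, max_tries=10_000):
    """Rejection-sample an action from a policy truncated to a safe set.

    sample_policy: callable returning one draw from the untruncated policy.
    is_safe: callable returning True when an action lies in the safe set.
    max_tries: illustrative guard against safe sets with negligible mass.
    """
    for _ in range(max_tries):
        action = sample_policy()
        if is_safe(action):
            return action
    raise RuntimeError("no safe action found within max_tries draws")


# Hypothetical 1-D example: standard Gaussian policy, safe set [-0.5, 1.0].
rng = random.Random(0)
lo, hi = -0.5, 1.0


def gaussian_policy():
    return rng.gauss(0.0, 1.0)


def in_safe_interval(action):
    return lo <= action <= hi


actions = [sample_truncated(gaussian_policy, in_safe_interval) for _ in range(1000)]
```

Every accepted draw lies in the safe set by construction, and the expected number of draws per action is the reciprocal of the probability mass the untruncated policy assigns to that set, which is why a direct sampler is preferable when the set has a simple shape.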
With this setup in mind, and given a fixed start state , we propose a policy gradient-based algorithm maximizing the objective function
(3) |
the expected discounted reward under policy . Before proceeding with describing and analyzing the algorithm, we first need to identify conditions that ensure that, for each policy parameter , taking expectations with respect to is well-defined and thus meaningful. In order for (3) to be well-defined, we need to know that, for each policy parameter , the occupancy measure of the Markov chain induced by on is irreducible and satisfies certain ergodicity conditions. Once these are proven, we will be justified in performing gradient ascent on the objective function (3).
3 THEORETICAL RESULTS
In this section we develop the theory underlying our sampling-based method for RL with hard safety constraints. Our key contributions include proving that (3) is well-defined (§3.1), obtaining gradient expressions for it from which we can sample (§3.2), and developing and establishing the convergence of a policy gradient algorithm for optimizing (3) (§3.3 and §3.4). All proofs are deferred to the supplementary material. It is important to note that, though we assumed the deterministic dynamics common to safe control in §2, all our theoretical results go through in the stochastic dynamics case under standard ergodicity assumptions. Our key theoretical contribution in what follows is to show that, even in the deterministic dynamics case, we can ensure that the objective is well-defined (§3.1) and obtain convergence (§3.4).
3.1 Discounted Return is Well-defined
First, we show that the objective (3) is well-defined when using truncated policies, even in continuous-space systems with deterministic dynamics. Our key contribution in this setting is to ensure that, given reasonable conditions on the policies under consideration, important ergodicity properties of their induced Markov chains hold. This fact, established in Proposition 1 and Corollary 1, is nontrivial and its proof relies on a careful analysis of the propagation of probability mass through the transition dynamics and an interesting application of the Lebesgue-Radon-Nikodym Theorem (Folland, 1999, §3.2). As in the previous section, let denote the “safe” set within which the system remains so long as all control inputs are selected from . Though we leave open the possibility that satisfies a more specific stability condition rather than a generic notion of “safety”, we will typically use the term “safe” for ease of presentation. Let denote Lebesgue measure. We make the following definition:
Definition 1.
The Markov chain induced by on is -irreducible on if, for any -measurable , if , then , for all .
This means that, for a Markov chain to be (-)irreducible on the safety set, all safe subsets with positive volume must be reachable from any initial safe state with positive probability. Notice that is in fact a Markov chain on the safe set since, by the definition of , only those control inputs keeping the system within are allowed. In the sequel, we will prove that, for each , under suitable conditions the Markov chain induced by on is irreducible and the objective (3) is thus well-defined, which is a prerequisite for developing policy gradient methods based on it. See (Konda, 2002, §2.3) for details on irreducibility in this setting.
Given an element and dynamics , let consisting of all elements reachable in one step from under . Furthermore, for , define . Also, given and , let denote the open ball of radius centered at . Finally, for , define Intuitively, is the set of all control inputs that, when taken in state , drive the system into . The following assumptions are needed in what follows.
Assumption 2.
For any and any -measurable set , if and only if .
Assumption 2 ensures the system dynamics map positive volume subsets of control inputs to positive volume subsets of the state space and vice versa, which is important for our application of the Lebesgue-Radon-Nikodym Theorem in Proposition 1. It is satisfied by systems where control inputs have a measurable effect on each entry in the next state vector and thus encompasses a wide array of potentially nonlinear systems.
Assumption 3.
For any , where is the set of permissible policy parameters, for any element in the safe set , and for any set satisfying , the policy assigns positive probability to , i.e. .
Assumption 3, which is standard in the RL literature, ensures that any set of allowable control inputs that has strictly positive volume will be sampled from with strictly positive probability.
Assumption 4.
For each , , and, given , there exists such that is reachable in steps from .
The conditions imposed in Assumption 4 guarantee that, for any state : (i) the set of states reachable from in one step has strictly positive volume; (ii) any subset of the safe set is reachable in at most steps from . These conditions are closely related to the familiar notion of controllability of control theory. Under these conditions, we have the following proposition and its immediate corollary.
Proposition 1.
Corollary 1.
is -irreducible on .
Now that we are assured that the objective function (3) is well-defined, we are justified in attempting to perform gradient ascent on it. In order to accomplish this, however, we need access to gradient estimates. This is the subject of the next section.
3.2 Policy Gradients
Despite the presence of in , under mild assumptions on the underlying policy , we can apply the classic policy gradient theorem of Sutton et al. (2000) to (3) to obtain a gradient expression from which we can sample. Let denote the discounted state occupancy measure of the Markov chain induced by policy on . Furthermore, let We make the following assumption:
Assumption 5.
and is differentiable in , for all .
Recall from (2) that is simply the probability density function truncated to the set . Note, since the value of at a given is independent of , we can take the derivative inside the integral sign in the latter expression to obtain , so is differentiable. Given these facts, combined with Assumption 5, the above expression for implies that, for any , the policy is differentiable with respect to , for any . In short, satisfies its own version of Assumption 5, which we formalize in the following:
Lemma 1.
and is differentiable in , for all and .
The policy gradient theorem (Konda, 2002) implies
(4) |
In order to carry out gradient updates based on this expression, we first need to be able to estimate , for arbitrary . We will discuss how to estimate in an unbiased manner in the following section. Since we already have access to , we can focus on estimating . Based on (2), the gradient of with respect to is
(5) | ||||
(6) | ||||
(7) |
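Because the symbols in the expressions above did not survive extraction, the following is a sketch of the quotient-rule computation that the surrounding text describes, written in assumed notation: $\pi_\theta$ for the untruncated density, $\pi_\theta^{C}$ for its truncation, $C(s)$ for the safe action set at state $s$, and $Z_\theta(s)$ for the normalizing mass. These names are illustrative and may differ from the original equations (5)-(7).

```latex
% Assumed notation; valid for actions a in C(s).
\pi_\theta^{C}(a \mid s) = \frac{\pi_\theta(a \mid s)}{Z_\theta(s)},
\qquad
Z_\theta(s) = \int_{C(s)} \pi_\theta(a' \mid s)\, da'

% Quotient rule applied to the truncated density.
\nabla_\theta \pi_\theta^{C}(a \mid s)
  = \frac{\nabla_\theta \pi_\theta(a \mid s)}{Z_\theta(s)}
  - \frac{\pi_\theta(a \mid s)\, \nabla_\theta Z_\theta(s)}{Z_\theta(s)^{2}},
\qquad
\nabla_\theta Z_\theta(s) = \int_{C(s)} \nabla_\theta \pi_\theta(a' \mid s)\, da'

% Equivalent score-function form.
\nabla_\theta \log \pi_\theta^{C}(a \mid s)
  = \nabla_\theta \log \pi_\theta(a \mid s) - \nabla_\theta \log Z_\theta(s)
```

In this form the only quantities that are not directly available from the untruncated policy are the normalizing mass and its gradient, which is why the estimation of that integral is the remaining task.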
To estimate , we need to be able to estimate . Given access to and , we can use numerical integration or Monte Carlo techniques to approximate this integral. The standard Monte Carlo approach is to uniformly sample elements from , then estimate
(8) |
where is the volume of . This estimate is based on the fact that
(9) | ||||
(10) | ||||
(11) | ||||
(12) |
where the last equality holds by the law of large numbers. Since is fixed given , gradient estimates and ultimately can also be obtained by estimating the integral . In the Monte Carlo situation, this can be obtained from (8) by differentiating each term with respect to .
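The Monte Carlo estimate described above, together with the gradient obtained by differentiating each term, can be illustrated on a case with a known answer. The sketch below assumes a one-dimensional Gaussian policy whose only parameter is its mean and an interval-shaped safe set; both choices, and all function names, are assumptions made for the example rather than details from the paper.

```python
import math
import random


def gaussian_pdf(a, mean, std):
    """Density of a 1-D Gaussian policy evaluated at action a."""
    z = (a - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def gaussian_pdf_grad_mean(a, mean, std):
    """Derivative of the Gaussian density with respect to its mean."""
    return gaussian_pdf(a, mean, std) * (a - mean) / (std * std)


def mc_mass_and_grad(mean, std, lo, hi, n, rng):
    """Estimate the policy mass on [lo, hi] and its derivative w.r.t. mean.

    Actions are drawn uniformly from the interval; each sample average is
    scaled by the interval length, which plays the role of the set volume.
    """
    volume = hi - lo
    mass_sum = 0.0
    grad_sum = 0.0
    for _ in range(n):
        a = rng.uniform(lo, hi)
        mass_sum += gaussian_pdf(a, mean, std)
        grad_sum += gaussian_pdf_grad_mean(a, mean, std)
    return volume * mass_sum / n, volume * grad_sum / n


rng = random.Random(0)
mean, std, lo, hi = 0.3, 1.0, -1.0, 1.0
mass_est, grad_est = mc_mass_and_grad(mean, std, lo, hi, 100_000, rng)

# Closed-form references for this Gaussian example.
mass_exact = 0.5 * (math.erf((hi - mean) / (std * math.sqrt(2.0)))
                    - math.erf((lo - mean) / (std * math.sqrt(2.0))))
grad_exact = gaussian_pdf(lo, mean, std) - gaussian_pdf(hi, mean, std)
```

Uniform sampling is the simplest choice but its variance grows when the density is sharply peaked inside the safe set; in one dimension a closed-form cumulative distribution function, where available, avoids the estimate entirely.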
3.3 Algorithm
In this section, we present a hard safety-constrained random-horizon policy gradient (Safe-RPG) algorithm. Our algorithm is based on the random-horizon policy gradient (RPG) scheme developed in Zhang et al. (2020), which uses a random rollout horizon and recent advances in non-convex optimization to obtain unbiased policy gradient estimates and ensure finite-time convergence to approximately locally optimal policies. As discussed in the following section, our convergence results ensure asymptotic convergence of Algorithm 2 to a stationary point of (3), but can likely be strengthened to prove finite-time convergence to approximately locally optimal policies. The main algorithm is presented in Algorithm 2, which depends on the action-value function estimation subroutine in Algorithm 1.
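Since Algorithms 1 and 2 are not reproduced in this text, the following sketch only illustrates the general idea behind random-horizon estimation, not the exact subroutine of Zhang et al. (2020) or of this paper. It relies on a standard identity: if the horizon T satisfies P(T >= t) = gamma^t, then the undiscounted sum of rewards up to T has expectation equal to the discounted return. The function names and the constant-reward example are assumptions.

```python
import random


def sample_horizon(gamma, rng):
    """Draw a horizon T with P(T >= t) = gamma**t for t = 0, 1, 2, ..."""
    t = 0
    while rng.random() < gamma:
        t += 1
    return t


def random_horizon_return(reward_at, gamma, rng):
    """Unbiased single-rollout estimate of sum_t gamma**t * reward_at(t).

    reward_at: callable giving the reward observed at step t of a rollout;
    in practice it would be produced by stepping the environment under the
    truncated policy, which is abstracted away here.
    """
    horizon = sample_horizon(gamma, rng)
    return sum(reward_at(t) for t in range(horizon + 1))


rng = random.Random(0)
gamma = 0.9
n_rollouts = 20_000
estimate = sum(
    random_horizon_return(lambda t: 1.0, gamma, rng) for _ in range(n_rollouts)
) / n_rollouts
exact = 1.0 / (1.0 - gamma)  # discounted return of a constant unit reward
```

The appeal of this construction is that every rollout terminates after finitely many steps while the estimate remains unbiased, in contrast to truncating an infinite discounted sum at a fixed horizon.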
3.4 Convergence
In this section we show asymptotic convergence of Algorithm 2 to the set of stationary points of (3). The key challenge in this result revolves around the need to establish that the policies we consider satisfy important differentiability and continuity properties, which necessitates a careful analysis of the Lipschitz properties of the score functions of our truncated policies in the proof of Lemma 2. To proceed, we need the following assumption on the reward function and underlying, untruncated policy class .
Assumption 6.
The reward function and parameterized policy class satisfy the following:
1. The absolute value of the reward is uniformly bounded, i.e., there exists such that .
2. For all , exists, and there exist and such that, for all ,
   (a) , for all
   (b) , for all .
Assumptions 5 and 6 were used to prove asymptotic convergence of the RPG algorithm with untruncated policies to stationary points in (Zhang et al., 2020, Theorem 4.4). For an analogous result to apply to the truncated policies we consider, it must be shown that the Lipschitz and differentiability conditions in part 2 of Assumption 6 hold for the constrained policies . It turns out that, under the same conditions on the untruncated policy , these properties are automatically satisfied for .
Lemma 2.
With Lemma 2 in hand, we have the following result.
Theorem 1.
Remark 1.
Given Lemma 2, the proof of the theorem follows directly from that of (Zhang et al., 2020, Theorem 4.4). With suitable modifications to Algorithm 2 incorporating periodically increasing stepsizes, these results can likely be strengthened to obtain finite-time convergence to an -locally optimal policy using the machinery developed in Zhang et al. (2020). We leave this to future work.
4 EXPERIMENTAL RESULTS
We now experimentally demonstrate the effectiveness of our sampling-based safe RL approach. Specifically, we evaluate the use of CBF-constrained Beta policies combined with the popular Proximal Policy Optimization (PPO) (Schulman et al., 2017) algorithm on safety-constrained inverted pendulum and quadcopter navigation environments. The use of Beta policies with variable action space constraints allows us to directly sample from a CBF-constrained action space at each timestep. In addition to providing a practical example of how truncated policies can be used to ensure safety, this method extends the work Chou et al. (2017) on the use of Beta policies for deep RL from constant to state-dependent action space constraints.
![Refer to caption](x1.png)
![Refer to caption](x2.png)
![Refer to caption](x3.png)
![Refer to caption](x4.png)
![Refer to caption](x5.png)
![Refer to caption](x6.png)
4.1 CBF-Constrained Beta Policies
When actions must be restricted to lie within fixed, predetermined bounds due to physical or numerical constraints, the common practice of simply clipping policies with infinite support (e.g., Gaussian policies) can cause bias and performance issues. Chou et al. (2017) propose and leverage finite-support Beta distribution-based policies to overcome these issues. We extend this approach to obtain policies that sample directly from the safe control actions prescribed by the CBF at a given state.
In order to describe these CBF-constrained Beta policies, let us first recall the probability density function (p.d.f.) of a one-dimensional Beta distribution:
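(13) $f(x; \alpha, \beta) = \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}$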
where , , and is the Gamma function defined for with . A Beta policy sampling from the fixed interval is given by , where are parameterized functions (e.g., neural networks) mapping states to the parameters of the Beta distribution. When the action space is of dimension and the CBF constraint set (or an inner approximation of it), , can be expressed as a hyperrectangle with lower and upper bounds , respectively, we maintain independent Beta distributions, , over each dimension of the unit box , and samples from these distributions are shifted and rescaled to lie within the bounds given by . Specifically, our CBF-constrained Beta policies, denoted , sample from by first sampling , then performing the simple transformation , where denotes the diagonal matrix with elements along the diagonal.
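The shift-and-rescale step described above can be sketched as follows. This is an illustrative standard-library version, not the PyTorch implementation used in the experiments: the function name, the fixed Beta parameters (which a neural network would produce in practice), and the example bounds are all assumptions.

```python
import random


def sample_cbf_beta(alphas, betas, lower, upper, rng):
    """Sample an action from independent Beta distributions on a hyperrectangle.

    alphas, betas: per-dimension Beta parameters (all strictly positive).
    lower, upper: per-dimension bounds of the safe hyperrectangle at the
        current state.
    Each coordinate is drawn on the unit interval, then shifted and scaled,
    which matches multiplying by the diagonal matrix of side lengths and
    adding the lower-bound vector.
    """
    action = []
    for alpha, beta, lo, hi in zip(alphas, betas, lower, upper):
        z = rng.betavariate(alpha, beta)  # z lies in [0, 1]
        action.append(lo + (hi - lo) * z)
    return action


# Hypothetical two-dimensional example with state-dependent bounds held fixed.
rng = random.Random(0)
lower, upper = [-1.0, 0.2], [0.5, 0.9]
samples = [
    sample_cbf_beta([2.0, 3.0], [2.0, 1.5], lower, upper, rng)
    for _ in range(1000)
]
```

Because the transformation is affine in each coordinate, the density of the rescaled action differs from the Beta density only by the constant factor given by the reciprocal of the box volume, so log-probabilities needed by PPO remain cheap to evaluate.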
4.2 Implementation
We now describe the implementation details of our Beta policies. For a given state , the parameter vectors are output by a two-layer, fully connected neural network. Control inputs at state were obtained by first creating an independent PyTorch (Paszke et al., 2019) Beta distribution object with parameters , for each dimension of the action space, then sampling from these distributions, and finally scaling and translating to lie within the current CBF set . Similarly, the Gaussian policies we used for comparison used distribution parameters output by a two-layer, fully connected neural network. Control inputs were then selected from the corresponding distribution by sampling, then following the standard practice (Chou et al., 2017) of clipping to a fixed set of permissible controls. The PPO implementation used in the experiments was adapted with minor modifications from Stable Baselines 3 (Raffin et al., 2021).
4.3 Case study 1: Quadcopter navigation
Experiment Setup. For this experiment, we consider the problem of learning to safely navigate a quadcopter around an obstacle to a goal location. In this section, we present an overview of the dynamical model that we use for this quadcopter, which was previously considered in Xu and Sreenath (2018), and describe our derivation of a hyperrectangular inner approximation of the safe control set, satisfying the CBF condition, that is amenable to sampling using our Beta policies. We finally briefly describe the reward function. See the supplementary material for a detailed exposition of the environment and sampling procedure.
We denote quadcopter and obstacle position by and , respectively, and the quadcopter’s relative position with respect to the obstacle as . The quadrotor dynamics are then given by
with input consisting of the desired accelerations in each , and dimensions. For the obstacle avoidance problem, we characterize the safe set as , where
(14) |
parameterize the obstacle’s shape, which is assumed to be elliptical, and represents the desired safety margin. Given the quadcopter dynamics and (14), the first time derivative does not explicitly contain the control input . We therefore use the standard ECBF formulation (Ames et al., 2019) to develop our safety condition using , which explicitly contains . This ECBF condition is expressed as where , are application-specific design parameters, and can be rewritten as where is a matrix and is a vector, both depending on . Then, the state-dependent safe control set is given by (see the supplementary material for details). The dynamics (and consequently the ECBF conditions) are discretized with time step .
We consider navigation in the dimension as in Xu and Sreenath (2018), resulting in a two-dimensional action space. We take the actuator constraint to be defined by the hyperrectangle , where and are the minimum and maximum input values. In order to sample from the safe control set at a given with our Beta policies, we need a hyperrectangular inner approximation. We obtain this inner approximation by formulating and solving a convex optimization problem yielding the largest-volume hyperrectangle, , contained within .
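The optimization problem itself is given in the supplementary material and is not reproduced here. As a complementary sketch, the code below checks whether a candidate hyperrectangle is a valid inner approximation, under the assumption that the safe control set can be written as a polytope of the form {u : A u <= b}. The function name and the example numbers are hypothetical. The check uses the fact that a linear function attains its maximum over a box at a vertex, and that this vertex can be selected coordinate by coordinate.

```python
def box_inside_polytope(A, b, lower, upper, tol=1e-9):
    """Return True if the box [lower, upper] lies inside {u : A u <= b}.

    A: list of constraint rows, each a list of coefficients.
    b: list of right-hand sides, one per row.
    For each row, the largest value of the linear form over the box is the
    sum over coordinates of max(a_j * lower_j, a_j * upper_j); the box is
    contained in the polytope exactly when this never exceeds b_i.
    """
    for row, b_i in zip(A, b):
        worst_case = sum(
            max(a * lo, a * hi) for a, lo, hi in zip(row, lower, upper)
        )
        if worst_case > b_i + tol:
            return False
    return True


# Hypothetical 2-D example: a single half-plane u1 + u2 <= 1.
A = [[1.0, 1.0]]
b = [1.0]
small_box_ok = box_inside_polytope(A, b, [0.0, 0.0], [0.5, 0.5])
large_box_ok = box_inside_polytope(A, b, [0.0, 0.0], [1.0, 1.0])
```

The same per-row expression is what makes the largest-volume problem convex: each constraint is linear in the box bounds once the sign of each coefficient is known, and the logarithm of the volume is concave in those bounds.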
Finally, we designed a reward providing an penalty based on agent distance from the goal, as well as a sizeable bonus for reaching the goal and a significant penalty for approaching the edge of the map. See the supplementary for details.
Results. The experiments we conducted illustrate that safety-filter based approaches like those considered in Cheng et al. (2019) fail on simple cases of our safety-constrained quadcopter problem (see Figure 2), while our CBF-constrained Beta policy succeeds (Figure 2). For illustration purposes, Figures 2 and 2 present trajectories generated over the course of training. The corresponding learning curves are included in the supplementary material. As illustrated in Figure 2, the safety-filter approach is effective at ensuring safety and also learns to successfully reach the goal when the obstacle is nonexistent or distant. However, it ultimately fails to reach the goal when the obstacle lies directly between the start and goal positions. We hypothesize that this is due to the fact that the projection-based approach attempts to learn an optimal policy for the unconstrained navigation problem, while projection causes it to deviate from its learned policy to maintain safety. Furthermore, the resulting safety-filtered policy cannot recover from these projections without an additional control layer (such as a derivative or PID controller) due to the repeated perturbation from the projection procedure. Our method, on the other hand, learns to successfully solve the problem as shown in Figure 2 while maintaining safety throughout training, since we directly learn policies for the CBF-constrained problem.
4.4 Case study 2: Inverted pendulum
Experiment Setup. For the second set of experiments, we considered a safety-constrained inverted pendulum environment building on the baseline Gym implementation (Brockman et al., 2016). The goal in this environment is to swing an inverted pendulum upright while maintaining it within a fixed safe set. We tested PPO with the two different policies on this environment for two different safe sets: and . Due to space limitations, we include the experiments with in Figure 3 and the experiments with in the supplementary material. As a baseline, we compare the proposed method to PPO with unconstrained Gaussian policies. This comparison highlights the effectiveness of the proposed method in guaranteeing safety as well as accelerating learning.
Results.
![Refer to caption](extracted/5453604/AISTATS2024/figures/comparison_-0.5_0.5.png)
Our experiments are summarized in Figure 3. There are two main points to be drawn from these results. First, the top panel shows that incorporating prior knowledge about properties such as safety can encourage learning and accelerate convergence by forcing the Beta policy agent to concentrate on higher-value subsets of the state space. The Gaussian agent, on the other hand, is unable to benefit from this prior knowledge and convergence suffers as it spends a greater portion of its time exploring lower-value regions of the state space. Second, the bottom panel illustrates that Beta policies are highly effective at maintaining safety throughout training, while Gaussian policies without safety constraints naturally fail to remain inside the safe set. This is expected, but illustrates the need to use constraint-aware policies such as Beta policies when prior knowledge is available.
5 CONCLUSION
We have developed a sampling-based approach to learning policies ensuring hard constraint satisfaction in RL. Unlike existing, projection-based methods that ensure safety but lack convergence guarantees, our scheme provably does both. In addition to our theoretical contributions, we have also presented a practical solution method that leverages CBF-constrained Beta policies to ensure safety, and experimentally demonstrated its effectiveness on safe quadcopter navigation and inverted pendulum environments. Interesting directions for future work include extensions to the case where the constraint set must be estimated and application of our CBF-constrained Beta policies to real-world robotics problems.
Acknowledgements
The authors would like to thank the anonymous reviewers for their helpful comments and Mostafa Mohamed Fa Abdelnaby of Purdue University for pointing out Remark 1. The work of W. A. Suttle was supported by a Distinguished Postdoctoral Fellowship with the U.S. Army Research Laboratory. V. K. Sharma was partially funded by a grant from the Purdue Engineering Initiative on Autonomous and Connected Systems. The work of J. Liu was supported in part by the National Science Foundation (NSF) under grant 2230101, by the Air Force Office of Scientific Research (AFOSR) under award number FA9550-23-1-0175, and by U.S. Air Force Task Order FA8650-23-F-2603. The work of K. C. Kosaraju and V. Gupta was partially supported by Army Research Office grants W911NF2310111, W911NF2310266, and W911NF-23-1-0316, AFOSR grant F.10052139.02.005, Office of Naval Research grants F.10052139.02.009 and F.10052139.02.012, and NSF grant 2300355.
References
- Abbeel and Ng (2004) Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine learning, 2004.
- Achiam et al. (2017) Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pages 22–31. PMLR, 2017.
- Agarwal et al. (2020) Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in Markov decision processes. In Conference on Learning Theory, pages 64–66. PMLR, 2020.
- Agrawal and Sreenath (2017) Ayush Agrawal and Koushil Sreenath. Discrete control barrier functions for safety-critical control of discrete systems with application to bipedal robot navigation. In Robotics: Science and Systems, volume 13, pages 1–10, 2017.
- Altman (2021) Eitan Altman. Constrained Markov Decision Processes. Routledge, 2021.
- Ames et al. (2016) Aaron D Ames, Xiangru Xu, Jessy W Grizzle, and Paulo Tabuada. Control barrier function based quadratic programs for safety critical systems. IEEE Transactions on Automatic Control, 62(8):3861–3876, 2016.
- Ames et al. (2019) Aaron D Ames, Samuel Coogan, Magnus Egerstedt, Gennaro Notomista, Koushil Sreenath, and Paulo Tabuada. Control barrier functions: Theory and applications. In 2019 18th European Control Conference, pages 3420–3431, 2019.
- Aswani et al. (2013) Anil Aswani, Humberto Gonzalez, S Shankar Sastry, and Claire Tomlin. Provably safe and robust learning-based model predictive control. Automatica, 49(5):1216–1226, 2013.
- Bai et al. (2022) Qinbo Bai, Amrit Singh Bedi, Mridul Agarwal, Alec Koppel, and Vaneet Aggarwal. Achieving zero constraint violation for constrained reinforcement learning via primal-dual approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3682–3689, 2022.
- Berkenkamp et al. (2017) Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pages 908–918, 2017.
- Bhatnagar et al. (2009) Shalabh Bhatnagar, Richard Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor-critic algorithms. Automatica, 45(11):2471–2482, 2009.
- Borkar (2005) Vivek S Borkar. An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 54(3):207–213, 2005.
- Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
- Brunke et al. (2022) Lukas Brunke, Melissa Greeff, Adam W Hall, Zhaocong Yuan, Siqi Zhou, Jacopo Panerati, and Angela P Schoellig. Safe learning in robotics: From learning-based control to safe reinforcement learning. Annual Review of Control, Robotics, and Autonomous Systems, 5:411–444, 2022.
- Cheng et al. (2019) Richard Cheng, Gábor Orosz, Richard M Murray, and Joel W Burdick. End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3387–3395, 2019.
- Chou et al. (2017) Po-Wei Chou, Daniel Maturana, and Sebastian Scherer. Improving stochastic policy gradients in continuous control with deep reinforcement learning using the beta distribution. In International Conference on Machine Learning, pages 834–843. PMLR, 2017.
- Diamond and Boyd (2016) Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. The Journal of Machine Learning Research, 17(1):2909–2913, 2016.
- Fazel et al. (2018) Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1467–1476. PMLR, 2018.
- Fisac et al. (2018) Jaime F Fisac, Anayo K Akametalu, Melanie N Zeilinger, Shahab Kaynama, Jeremy Gillula, and Claire J Tomlin. A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control, 64(7):2737–2752, 2018.
- Folland (1999) Gerald B Folland. Real Analysis: Modern Techniques and Their Applications, volume 40. John Wiley & Sons, 1999.
- García and Fernández (2015) Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
- Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.
- Ho et al. (2020) Cherie Ho, Katherine Shih, Jaskaran Grover, Changliu Liu, and Sebastian Scherer. “Provably safe” in the wild: testing control barrier functions on a vision-based quadrotor in an outdoor environment. In RSS 2020 Workshop in Robust Autonomy, 2020. URL https://openreview.net/pdf?id=CrBJIgBr2BK.
- Konda (2002) V. Konda. Actor-Critic Algorithms. PhD thesis, Massachusetts Institute of Technology, 2002.
- Kosaraju et al. (2021) Krishna Chaitanya Kosaraju, Seetharaman Sivaranjani, Wesley Suttle, Vijay Gupta, and Ji Liu. Reinforcement learning based distributed control of dissipative networked systems. IEEE Transactions on Control of Network Systems, 9(2):856–866, 2021.
- Li et al. (2018) Zhaojian Li, Uroš Kalabić, and Tianshu Chu. Safe reinforcement learning: Learning with supervision using a constraint-admissible set. In 2018 Annual American Control Conference, pages 6390–6395, 2018.
- Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Ma et al. (2021) Haitong Ma, Jianyu Chen, Shengbo Eben, Ziyu Lin, Yang Guan, Yangang Ren, and Sifa Zheng. Model-based constrained reinforcement learning using generalized control barrier function. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4552–4559, 2021.
- Mellinger et al. (2012) Daniel Mellinger, Nathan Michael, and Vijay Kumar. Trajectory generation and control for precise aggressive maneuvers with quadrotors. The International Journal of Robotics Research, 31(5):664–674, 2012.
- Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
- Paternain et al. (2019) Santiago Paternain, Luiz Chamon, Miguel Calvo-Fullana, and Alejandro Ribeiro. Constrained reinforcement learning has zero duality gap. Advances in Neural Information Processing Systems, 32, 2019.
- Raffin et al. (2021) Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-baselines3: Reliable reinforcement learning implementations. The Journal of Machine Learning Research, 22(1):12348–12355, 2021.
- Rasmussen (2003) Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer, 2003.
- Schreiter et al. (2015) Jens Schreiter, Duy Nguyen-Tuong, Mona Eberts, Bastian Bischoff, Heiner Markert, and Marc Toussaint. Safe exploration for active learning with Gaussian processes. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 133–149. Springer, 2015.
- Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Sui et al. (2015) Yanan Sui, Alkis Gotovos, Joel Burdick, and Andreas Krause. Safe exploration for optimization with Gaussian processes. In International Conference on Machine Learning, pages 997–1005. PMLR, 2015.
- Suttle et al. (2023) Wesley A Suttle, Amrit Bedi, Bhrij Patel, Brian M Sadler, Alec Koppel, and Dinesh Manocha. Beyond exponentially fast mixing in average-reward reinforcement learning via multi-level Monte Carlo actor-critic. In International Conference on Machine Learning, pages 33240–33267. PMLR, 2023.
- Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
- Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.
- Virtanen et al. (2020) Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods, 17(3):261–272, 2020.
- Wabersich et al. (2023) Kim P Wabersich, Andrew J Taylor, Jason J Choi, Koushil Sreenath, Claire J Tomlin, Aaron D Ames, and Melanie N Zeilinger. Data-driven safety filters: Hamilton-Jacobi reachability, control barrier functions, and predictive methods for uncertain systems. IEEE Control Systems Magazine, 43(5):137–177, 2023.
- Wachi et al. (2018) Akifumi Wachi, Yanan Sui, Yisong Yue, and Masahiro Ono. Safe exploration and optimization of constrained MDPs using Gaussian processes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
- Wiesemann et al. (2013) Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013.
- Xu and Sreenath (2018) Bin Xu and Koushil Sreenath. Safe teleoperation of dynamic UAVs through control barrier functions. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7848–7855, 2018.
- Zhang et al. (2019) Kaiqing Zhang, Alec Koppel, Hao Zhu, and Tamer Başar. Convergence and iteration complexity of policy gradient method for infinite-horizon reinforcement learning. In 2019 58th Conference on Decision and Control, pages 7415–7422, 2019.
- Zhang et al. (2020) Kaiqing Zhang, Alec Koppel, Hao Zhu, and Tamer Başar. Global convergence of policy gradient methods to (almost) locally optimal policies. SIAM Journal on Control and Optimization, 58(6):3586–3612, 2020.
- Zhang et al. (2021) Kaiqing Zhang, Bin Hu, and Tamer Başar. Policy optimization for linear control with robustness guarantee: Implicit regularization and global convergence. SIAM Journal on Control and Optimization, 59(6):4081–4109, 2021.
Supplementary Material
Appendix A Proofs
Proof of Proposition 1. We construct a sequence of -balls, each reachable from the previous element of the sequence, that leads from to , then show that the head of the Markov chain lies inside this sequence with positive probability. Fix and let be such that , for , and (see Figure 1 for an illustration). For a given , let be the Markov chain induced on by such that . We show that the trajectory is contained within the set with strictly positive probability, which will imply that enters with strictly positive probability. For each , consider the probability measure defined as
for any -measurable subset of . Note that is absolutely continuous with respect to , written , since if and only if , by Assumption 2. The Lebesgue-Radon-Nikodym Theorem implies that there exists a -integrable function , called the Radon-Nikodym derivative of , such that (see Folland (1999) for details). To make the link between and perfectly clear, let us write
By Assumptions 2 and 3, we also have . Since both and , the two measures are said to be equivalent, meaning that they agree on which sets have measure zero. Since and are equivalent, a standard result from real analysis allows us to take the Radon-Nikodym derivative to be strictly positive -almost everywhere. As a first consequence, notice that
(15) | ||||
Given Assumption 3, equation (15) is strictly positive, since is strictly positive almost everywhere and the integrals are taken over sets of positive volume.
Building on equation (15), we have
(16) | ||||
(17) | ||||
Note in the innermost integral that , since both sets are open and their intersection is non-empty by hypothesis. Finally, given Assumption 3, we have that (17) is strictly positive, since all integrals are taken over sets of positive volume and is strictly positive almost everywhere, for each .
To see that, for all , exists, for all , we simply need to verify that is always finite. But this follows immediately from Assumption 3 and the fact that by Assumption 1.
We next prove part 1) of the Lemma. The claim holds for the first term in (18) by part 2a) of Assumption 6, so we just need to show that it holds for the second term. To do this, we prove that, for a given , this term is Lipschitz in , then argue that the largest minimal Lipschitz constant over all is finite. We know by part 2b) of Assumption 6 that, for all , , for all . This means that, for all ,
for all . So is Lipschitz in , for each , and the largest Lipschitz constant over is finite. In addition, is clearly uniformly bounded.
Notice that , by Assumption 1. Thus, by Assumption 3, , for all . Since is compact, this means . This implies that is uniformly bounded. Since is Lipschitz, is therefore Lipschitz and bounded in , for all . We also know that, for each , is Lipschitz and bounded in , by Assumption 6, part 2a). Fix . Since the product of Lipschitz, bounded functions is Lipschitz and bounded, the function is Lipschitz and bounded in . Since this function is uniformly bounded over , there therefore exists such that, for all ,
for all . Combined with part 2a) of Assumption 6, this implies that, for all , , for all . This completes the proof of part 1).
Part 2) follows from the fact that is uniformly bounded and that, for all , , for all , by part 2) of Assumption 6.
Appendix B Background: Barrier Functions
In this section, we provide an overview of barrier functions for convenience. The theory of barrier functions revolves around controlled set invariance for dynamical systems. Safety can be represented through a set, say , defined as a level set of this barrier function, say . We then state the condition on this barrier function that guarantees the forward invariance of this safe set under the given dynamics.
B.1 Control Barrier Functions (CBF)
Consider the following nonlinear system
(19) |
where and denote the state and control input, and is a locally Lipschitz function that models the state transition. The following definition and theorem follow the development in Ames et al. (2019); Agrawal and Sreenath (2017).
Theorem 2.
Consider a function that is continuously differentiable. Define a closed set as the super-level set of this function as follows:
(20) |
The function is a control barrier function, for (19) and with state , if there exists an extended function such that for all , ,
(21) |
Further, if we define the safe control set as
(22) |
then any input will render the set forward invariant.
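For reference, the objects in Theorem 2 are commonly written as follows in the continuous-time CBF literature (Ames et al., 2019). The symbols $h$, $f$, $\alpha$, $\mathcal{C}$, and $K(x)$ are assumed notation chosen to match that literature, with $\alpha$ an extended class $\mathcal{K}_\infty$ function. This is a sketch of the standard statement and may differ in detail from equations (19)-(22).

```latex
% Assumed notation: a standard continuous-time statement, not a
% reproduction of the paper's equations (19)-(22).
\begin{align*}
  \dot{x} &= f(x, u), \\
  \mathcal{C} &= \{ x : h(x) \ge 0 \}, \\
  \sup_{u} \, \nabla h(x)^{\top} f(x, u) &\ge -\alpha(h(x)), \\
  K(x) &= \{ u : \nabla h(x)^{\top} f(x, u) + \alpha(h(x)) \ge 0 \}.
\end{align*}
```

Under this reading, the safe control set $K(x)$ collects exactly those inputs for which the barrier function does not decrease faster than $\alpha(h(x))$ allows.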
When designing a safe controller with control values sampled from this safe control set, we need the time derivative of the barrier function to explicitly contain the control input. The above forward invariance condition is therefore restricted to barrier functions with relative degree 1. At this point, we also note that for our quadcopter experiment, our CBF has relative degree 2, since only the second time derivative explicitly contains the control input. Therefore, for barrier functions with relative degree greater than 1, which are often referred to as Exponential Control Barrier Functions (ECBF), we need a separate discussion of forward invariance conditions.
B.2 Exponential Control Barrier Functions
We now discuss exponential CBFs for control-affine nonlinear dynamical systems. Consider the following control-affine nonlinear dynamical system:
(23) |
with and locally Lipschitz, and . We suppose that the Lipschitz constants for and are and respectively, and the vector containing the first time derivatives of including is given as: . Further suppose that the matrices and are defined as follows: , , and, .
Theorem 3.
Consider a function that is continuously differentiable. Define a closed set as the super-level set of this function as follows:
(24) |
Then the function is an ECBF, with relative degree , for the system in (23), if there exists a row vector such that
(25) |
, implies , whenever . Further, if we define the safe control set as
(26) |
then any input will render the set forward invariant.
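For reference, the ECBF condition is commonly written as follows for a barrier function of relative degree $r$. The symbols $h$, $f$, $g$, $\eta$, $K$, and $K_{\mathrm{safe}}(x)$ are assumed notation using Lie derivatives. This is a sketch of the standard form found in the ECBF literature and may differ in detail from equations (23)-(26).

```latex
% Assumed notation: a standard ECBF statement for relative degree r,
% not a reproduction of the paper's equations (23)-(26).
\begin{align*}
  \dot{x} &= f(x) + g(x)\,u, \\
  \eta(x) &= \begin{bmatrix} h(x) & L_f h(x) & \cdots & L_f^{r-1} h(x) \end{bmatrix}^{\top}, \\
  \sup_{u} \left[ L_f^{r} h(x) + L_g L_f^{r-1} h(x)\, u \right] &\ge -K\,\eta(x), \\
  K_{\mathrm{safe}}(x) &= \{ u : L_f^{r} h(x) + L_g L_f^{r-1} h(x)\, u + K\,\eta(x) \ge 0 \}.
\end{align*}
```

In this form, the row vector $K$ plays the role of a pole-placement gain on the chain of derivatives of $h$, and the constraint on $u$ is affine, so the safe control set is a half-space for each fixed state.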
Appendix C Experiments: Additional Details
In this section, we provide additional details regarding the experiments presented in §4.
C.1 Inverted Pendulum Experiments
The safety-constrained inverted pendulum environment that we considered in §4.4 was obtained by modifying the standard implementation from Brockman et al. (2016) to include CBF-based safety constraints. In this section we describe the dynamical model and CBF used to obtain these constraints, then present implementation details and an additional experiment.
C.1.1 Dynamical Model
Consider the model of a simple inverted pendulum
(31) |
where denote the states (angle and angular velocity), denote the input (torque), and denotes the mass and the length of the pendulum, respectively, denotes the acceleration due to gravity and denotes the discretization time. Denote the safe operating region by
(34) |
C.1.2 Control Barrier Function
Corollary 2.
Let
(39) |
where , and . Consider system (31) with . Let and assume is non-empty. The set is forward invariant.
The safe set (39) is used to provide the state-dependent constraints to the Beta policies learned in our experiments.
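To illustrate what a state-dependent safe control set can look like, the following minimal sketch computes an interval of admissible torques for a discretized pendulum. Everything here is an assumption made for illustration: the explicit-Euler update, the velocity bound `omega_max`, the actuator limit `u_max`, and the function name `safe_torque_interval` are hypothetical and are not the CBF of Corollary 2 or the safe set (39).

```python
import math


def safe_torque_interval(theta, omega, omega_max, u_max,
                         m=1.0, l=1.0, g=9.81, dt=0.05):
    """Torques keeping the next angular velocity in [-omega_max, omega_max].

    Assumes a hypothetical explicit-Euler pendulum step:
        omega_next = omega + dt * (g / l * sin(theta) + u / (m * l**2))
    Returns (low, high), or None if no admissible torque exists.
    """
    inertia = m * l ** 2
    drift = g / l * math.sin(theta)
    # Solve -omega_max <= omega_next <= omega_max for the torque u.
    low = inertia * ((-omega_max - omega) / dt - drift)
    high = inertia * ((omega_max - omega) / dt - drift)
    # Intersect with the actuator limits [-u_max, u_max].
    low, high = max(low, -u_max), min(high, u_max)
    return (low, high) if low <= high else None


# Far from the velocity bound, only the actuator limits are active.
print(safe_torque_interval(theta=0.0, omega=0.0, omega_max=1.0, u_max=2.0))
# Near the bound, the upper torque limit tightens.
print(safe_torque_interval(theta=0.0, omega=0.95, omega_max=1.0, u_max=2.0))
```

The point of the sketch is only that the admissible interval shrinks as the state approaches the boundary of the safe region, which is the kind of state-dependent constraint a Beta policy can be rescaled onto.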
C.1.3 Implementation Details
We next describe the implementation details of our experiments. As mentioned above, the environment was adapted from the implementation of Brockman et al. (2016), with modifications to compute the CBF safe set (39). The reward function and other details are as in Brockman et al. (2016). The Beta and Gaussian policies used the corresponding distributions from the PyTorch library (Paszke et al., 2019). As described in §4.2, for a given state , the parameters of the Beta distribution were output by a two-layer, fully connected neural network. Control inputs were obtained by sampling from this distribution, then translating and rescaling to lie within the current CBF set . The Gaussian policy parameters were likewise output by a two-layer, fully connected neural network. Control inputs were then sampled from the corresponding distribution and, following standard practice (Chou et al., 2017), clipped to a set of permissible controls, which was chosen to be . The hyperparameters used are presented in Figure 4.
Hyperparameter | Value |
---|---|
policy learning rate | 0.0003 |
value learning rate | 0.0003 |
entropy coefficient | 0.0 |
clip range | 0.2 |
weight decay | 0.0 |
layer size | 64 |
batch size | 64 |
buffer size | 300 |
number of epochs | 10 |
rollout length | 300 |
discount factor | 0.99 |

Hyperparameter | Value |
---|---|
policy learning rate | 0.01 |
value learning rate | 0.01 |
entropy coefficient | 0.0 |
clip range | 0.2 |
weight decay | 0.0 |
layer size | 64 |
batch size | 64 |
buffer size | 300 |
number of epochs | 10 |
rollout length | 300 |
discount factor | 0.99 |
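The two action-selection schemes described in C.1.3 can be sketched as follows. This is a minimal illustration, not the experimental code: it uses NumPy rather than the PyTorch distributions used in the experiments, the network weights are random placeholders, and the softplus-plus-one parameterization of the Beta parameters, the interval bounds, and all function names are assumptions.

```python
import numpy as np


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)


def beta_params(state, w1, b1, w2, b2):
    """Two-layer network mapping a state to Beta parameters (alpha, beta).

    Adding 1 to the softplus output keeps both parameters at least 1
    (an assumed parameterization, chosen for illustration).
    """
    hidden = np.tanh(w1 @ state + b1)
    alpha, beta = softplus(w2 @ hidden + b2) + 1.0
    return alpha, beta


def sample_beta_action(rng, alpha, beta, low, high):
    """Sample z in [0, 1], then translate and rescale to [low, high]."""
    z = rng.beta(alpha, beta)
    return low + (high - low) * z


def sample_gaussian_action(rng, mean, std, low, high):
    """Sample from a Gaussian, then clip to the permissible controls."""
    return float(np.clip(rng.normal(mean, std), low, high))


rng = np.random.default_rng(0)
state = np.array([0.1, -0.2])                      # placeholder state
w1, b1 = rng.normal(size=(64, 2)), np.zeros(64)    # hidden layer of size 64
w2, b2 = 0.1 * rng.normal(size=(2, 64)), np.zeros(2)

alpha, beta = beta_params(state, w1, b1, w2, b2)
# [low, high] stands in for the current state-dependent safe interval.
u_beta = sample_beta_action(rng, alpha, beta, low=-0.5, high=1.5)
u_gauss = sample_gaussian_action(rng, mean=0.0, std=1.0, low=-2.0, high=2.0)
print(u_beta, u_gauss)
```

The design difference is visible in the two sampling functions: the rescaled Beta sample always lands inside the supplied interval, whereas clipping a Gaussian sample concentrates probability mass on the interval endpoints.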
C.1.4 Additional Results
![Refer to caption](extracted/5453604/AISTATS2024/figures/comparison_-1_1.png)
C.2 Quadcopter Experiments
In this section, we provide additional details regarding the environment and experiments presented in §4.3.
C.2.1 Dynamical Model
We summarize the dynamical model of the quadcopter derived in Xu and Sreenath (2018). We consider the body frame, say , and world frame, say , and discuss the transformation between these two frames using the rotation matrix defined as
(40) |
where , , and denote the Z-X-Y Euler angles corresponding to the roll, pitch, and yaw of the quadcopter. Suppose that the 3-dimensional position coordinates of the quadcopter along the x-, y-, and z-axes with respect to its body frame and the world frame of reference are given by and respectively; then .
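As an illustration of the frame transformation, the sketch below builds a rotation matrix from Z-X-Y Euler angles as the product of elementary rotations about the z-, x-, and y-axes, following the convention of Mellinger et al. (2012). The function name `rotation_zxy` and the reading that the matrix maps body-frame coordinates to world-frame coordinates are assumptions, and the result may differ in layout from equation (40).

```python
import numpy as np


def rotation_zxy(roll, pitch, yaw):
    """Rotation matrix for Z-X-Y Euler angles: R = Rz(yaw) @ Rx(roll) @ Ry(pitch)."""
    c, s = np.cos, np.sin
    rz = np.array([[c(yaw), -s(yaw), 0.0],
                   [s(yaw), c(yaw), 0.0],
                   [0.0, 0.0, 1.0]])
    rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, c(roll), -s(roll)],
                   [0.0, s(roll), c(roll)]])
    ry = np.array([[c(pitch), 0.0, s(pitch)],
                   [0.0, 1.0, 0.0],
                   [-s(pitch), 0.0, c(pitch)]])
    return rz @ rx @ ry


R = rotation_zxy(roll=0.1, pitch=-0.2, yaw=0.3)
p_body = np.array([1.0, 0.0, 0.0])
p_world = R @ p_body          # assumed direction: body frame to world frame
print(np.allclose(R @ R.T, np.eye(3)), p_world)
```

A quick sanity check for any such construction is that the matrix is orthonormal with determinant one, which the final line verifies numerically.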
Then, the quadcopter dynamics are given by , where the control input comprises the desired acceleration of the quadcopter. Under small-angle assumptions on the Euler angles (that is, ), the dynamics of this controller evolve as in Mellinger et al. (2012):
(47) |
where are respectively the mass of the quadcopter and the gravitational constant, and is the desired acceleration component of the quadcopter in the x-, y-, and z-directions respectively, computed using the desired specifications on the Euler angles , and , and is the desired thrust on the -th rotor of the quadcopter. Lastly, the dynamical parameters for the quadcopter are set as given in Ho et al. (2020).
C.2.2 Exponential Control Barrier Function
Recall that the objective of our controller is to enable the quadcopter to learn how to reach a target position , while avoiding an obstacle with position . For this obstacle avoidance task, we now discuss our choice of CBF for the quadcopter experiment, as defined in (14), and explain why it is an exponential control barrier function (ECBF). We first derive expressions for and using the dynamical equations as follows:
(48) |
and
(49) |
These equations can be re-written in vector form as follows:
(50) |
and
(51) |
Since from quadcopter dynamics, we can re-write the as follows:
(52) |
where , and .
Therefore, the second time derivative of the CBF explicitly depends on the control input, and our choice of CBF is an exponential CBF with relative degree 2 (Ames et al., 2019).
Correspondingly, we use the following forward invariance condition for the set as given in Xu and Sreenath (2018):
(53) |
with .
The above equation can be re-arranged as follows:
and, using (50) and (52), we can re-write this equation as
(54) |
where . Thus, we can write the safe control set as . For our quadcopter experiments, we consider navigation dimensions, therefore set the -dimension position and velocity to be . Thus the control input only comprises the desired acceleration for and axes and therefore our action space becomes two-dimensional, and we only consider the and components in the above CBF calculations.
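The way a relative-degree-2 barrier yields a linear constraint on the control input can be sketched as follows. This is a hypothetical example, not the paper's CBF (14) or condition (53): it assumes a squared-distance barrier `h = ||p - p_obs||^2 - radius^2`, double-integrator dynamics in which the control input is the acceleration, and the ECBF condition `h_ddot + k2 * h_dot + k1 * h >= 0` with illustrative gains `k1` and `k2`.

```python
import numpy as np


def ecbf_halfspace(p, v, p_obs, radius, k1, k2):
    """Return (a, c) such that a control u is admissible iff a @ u >= c.

    Assumed barrier: h = ||p - p_obs||^2 - radius^2, with p_ddot = u, so
        h_dot  = 2 (p - p_obs) . v
        h_ddot = 2 v . v + 2 (p - p_obs) . u
    and the condition h_ddot + k2 * h_dot + k1 * h >= 0 is linear in u.
    """
    d = p - p_obs
    h = d @ d - radius ** 2
    h_dot = 2.0 * (d @ v)
    a = 2.0 * d
    c = -(2.0 * (v @ v) + k1 * h + k2 * h_dot)
    return a, c


def is_safe(u, a, c):
    """Check whether the control u lies in the half-space a @ u >= c."""
    return bool(a @ u >= c)


# Vehicle at (2, 0) moving toward an obstacle of radius 1 at the origin.
p, v = np.array([2.0, 0.0]), np.array([-1.0, 0.0])
a, c = ecbf_halfspace(p, v, p_obs=np.zeros(2), radius=1.0, k1=4.0, k2=4.0)
print(a, c)                                   # half-space normal and offset
print(is_safe(np.array([1.0, 0.0]), a, c))    # decelerating away from the obstacle
print(is_safe(np.zeros(2), a, c))             # coasting toward the obstacle
```

Because the constraint is a single half-space in the control input, intersecting it with box-shaped actuator limits gives the kind of polytopic safe control set from which an inner hyperrectangle can then be extracted.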
C.2.3 Maximal Inner Hyperrectangle Computation
We now describe the construction of the maximal inner hyperrectangle contained in the set under actuator constraints . These are the sets that our Beta policies will sample from. We use the following optimization problem, with decision variables , to obtain the maximal inner hyperrectangle inside the safe set:
(55) |
where is the area of a hyperrectangle inside and the decision variables are points on the line . One of the corner points of this hyperrectangle is formed by and the remaining corner points lie on the boundary hyperrectangle formed by . Suppose that are solutions to . Then the definition of Area depends on how the line intersects , which leads to the following four possibilities:
- and
- and
- and
- and .
Problem (55) is, in general, a non-convex program. However, we can transform it into a tractable problem through a change of variables, with new variables denoted by defined by
- and
- and
- and
- and .
The corresponding objective function is given by . So long as the entries corresponding to and from (55) in the transformed problem are nonnegative, the resulting problem is a geometric program, which can be further transformed to a convex problem by standard methods and efficiently solved. In our quadcopter experiments, when , we solved the transformed geometric program using CVXPY (Diamond and Boyd, 2016), and used nonlinear solvers from SciPy (Virtanen et al., 2020) otherwise. We observed in our experiments that the transformation resulted in geometric programs in all but a handful of cases.
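For intuition, one of the four cases can be solved in closed form in two dimensions. The sketch below is not the geometric-programming pipeline used in the experiments: it assumes a single half-plane constraint `a1 * u1 + a2 * u2 <= b` with positive coefficients, a box of actuator limits, and a rectangle anchored at the lower-left corner of the box whose opposite corner lies on the constraint line. The function name and argument names are hypothetical.

```python
def max_inner_rectangle(a1, a2, b, lo1, hi1, lo2, hi2):
    """Largest axis-aligned rectangle anchored at (lo1, lo2) inside
    {a1*u1 + a2*u2 <= b} intersected with [lo1, hi1] x [lo2, hi2].

    Covers only the case a1 > 0, a2 > 0 (one of four sign cases).
    Returns (opposite_corner, area).
    """
    if a1 <= 0 or a2 <= 0:
        raise ValueError("this sketch covers only the case a1 > 0 and a2 > 0")
    width, height = hi1 - lo1, hi2 - lo2
    slack = b - a1 * lo1 - a2 * lo2
    if slack <= 0:
        raise ValueError("the anchor corner (lo1, lo2) must be strictly feasible")
    # If the far corner of the box is feasible, the whole box is safe.
    if a1 * width + a2 * height <= slack:
        return (hi1, hi2), width * height
    # Otherwise the opposite corner lies on the line a1*s + a2*t = slack.
    # The area s * t is concave in s, so clip the unconstrained maximizer.
    s_min = max(0.0, (slack - a2 * height) / a1)
    s_max = min(width, slack / a1)
    s = min(max(slack / (2.0 * a1), s_min), s_max)
    t = (slack - a1 * s) / a2
    return (lo1 + s, lo2 + t), s * t


print(max_inner_rectangle(1, 1, 2, 0, 2, 0, 2))    # corner on the line
print(max_inner_rectangle(1, 1, 10, 0, 2, 0, 2))   # constraint inactive
```

The remaining sign cases follow by anchoring the rectangle at a different corner of the box; in higher dimensions or with several constraints the closed form no longer applies, which is where a solver-based formulation becomes useful.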
C.2.4 Reward
We now discuss the reward shaping used in our quadcopter experiments.
Suppose that the and are environment boundaries, with -axis boundary respectively defined by and , that we employ for guiding exploration for both quadcopter experiments. Then the reward used in our environment is defined by
where is the boundary around for which we give a constant positive reward of to the agent, and the inequalities in the reward definition are element-wise. Moreover, when the agent is inside the boundary but outside the -neighborhood of the goal, the reward is the negative of the distance between the agent and the goal. Lastly, we penalize the agent if it leaves the boundary defined by and , to encourage exploration in the region around the goal.
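The three-branch structure of this reward can be sketched as follows. The numerical values of the goal reward and the out-of-bounds penalty are not recoverable from the text, so they appear as the hypothetical parameters `goal_reward` and `out_penalty`. The use of an element-wise test for the neighborhood of the goal and of the Euclidean distance are likewise assumptions.

```python
import math


def shaped_reward(pos, goal, lower, upper, delta, goal_reward, out_penalty):
    """Piecewise reward: penalty outside the boundary, constant reward
    near the goal, and negative distance to the goal otherwise."""
    # Element-wise check against the environment boundary.
    inside = all(lo <= p <= hi for p, lo, hi in zip(pos, lower, upper))
    if not inside:
        return -out_penalty
    # Element-wise check for the delta-neighborhood of the goal.
    near_goal = all(abs(p - g) <= delta for p, g in zip(pos, goal))
    if near_goal:
        return goal_reward
    return -math.dist(pos, goal)


goal, lower, upper = (0.0, 0.0), (-10.0, -10.0), (10.0, 10.0)
args = dict(goal=goal, lower=lower, upper=upper,
            delta=0.5, goal_reward=100.0, out_penalty=50.0)
print(shaped_reward((0.1, 0.0), **args))   # inside the goal neighborhood
print(shaped_reward((3.0, 4.0), **args))   # inside the boundary, away from the goal
print(shaped_reward((11.0, 0.0), **args))  # outside the boundary
```

The negative-distance branch provides a dense signal that pulls the agent toward the goal, while the other two branches are constant and act only at the goal and at the edge of the explored region.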
C.2.5 Hyperparameters
The hyperparameters used in the experiments are presented in Figure 6.
Hyperparameter | Value |
---|---|
policy learning rate | 0.0004 |
value learning rate | 0.0004 |
entropy coefficient | 0.00000001 |
clip range | 0.2 |
weight decay | 0.0 |
layer size | 256 |
batch size | 256 |
buffer size | 320 |
number of epochs | 10 |
rollout length | 320 |
discount factor | 0.90 |

Hyperparameter | Value |
---|---|
policy learning rate | 0.0006 |
value learning rate | 0.0006 |
entropy coefficient | 0.0 |
clip range | 0.2 |
weight decay | 0.0 |
layer size | 256 |
batch size | 256 |
buffer size | 180 |
number of epochs | 10 |
rollout length | 180 |
discount factor | 0.90 |
C.2.6 Learning Curves
Learning curves for the experiments illustrated in Figures 2 and 1(c) are presented in Figures 6(a) and 6(b).
![Refer to caption](extracted/5453604/AISTATS2024/figures/beta_learning_curve.png)
![Refer to caption](extracted/5453604/AISTATS2024/figures/projection_learning_curve.png)
Appendix D Computing Resources
We ran our experiments on both a personal laptop and an HPC cluster. The laptop was configured with a 6-core i7-8750H 2.20GHz CPU, an NVIDIA GeForce RTX 2070 GPU, and 32GB RAM. The HPC server node was configured with a 32-core Intel Xeon CPU, an 80GB NVIDIA Tesla GPU, and 512GB RAM.