License: CC BY 4.0
arXiv:2312.10385v3 [cs.LG] 13 Mar 2024

Imitate the Good and Avoid the Bad: An Incremental Approach to Safe Reinforcement Learning

Huy Hoang, Tien Mai, Pradeep Varakantham
Abstract

A popular framework for enforcing safe actions in Reinforcement Learning (RL) is Constrained RL, where trajectory based constraints on expected cost (or other cost measures) are employed to enforce safety and more importantly these constraints are enforced while maximizing expected reward. Most recent approaches for solving Constrained RL convert the trajectory based cost constraint into a surrogate problem that can be solved using minor modifications to RL methods. A key drawback with such approaches is an over or underestimation of the cost constraint at each state. Therefore, we provide an approach that does not modify the trajectory based cost constraint and instead imitates “good” trajectories and avoids “bad” trajectories generated from incrementally improving policies. We employ an oracle that utilizes a reward threshold (which is varied with learning) and the overall cost constraint to label trajectories as “good” or “bad”. A key advantage of our approach is that we are able to work from any starting policy or set of trajectories and improve on it. In an exhaustive set of experiments, we demonstrate that our approach is able to outperform top benchmark approaches for solving Constrained RL problems, with respect to expected cost, CVaR cost, or even unknown cost constraints.

1 Introduction

Reinforcement learning (RL) is widely acknowledged as a powerful paradigm, thanks to its exceptional ability to learn and adapt by interacting with the environment. This adaptability has been demonstrated through numerous studies that highlight its practical applications across diverse domains. For example, reinforcement learning has been successfully employed in video games to achieve groundbreaking results (Mnih et al. 2016; Firoiu, Whitney, and Tenenbaum 2017), robot manipulation tasks have been enhanced using this approach (Hoang, Dinh, and Nguyen 2023; Kilinc and Montana 2022), and even the field of healthcare has benefited from its potential (Weng et al. 2017; Raghu et al. 2017). In light of the notable achievements of reinforcement learning, it is crucial to acknowledge the practical limitations that come with this approach when applied to real-world situations. The constraints of limited resources, budgetary restrictions, and safety concerns pose significant challenges in implementing reinforcement learning effectively.

Constrained RL: To address these challenges, Constrained Markov Decision Processes (CMDPs) have been developed as an extension of Markov Decision Processes (MDPs) (Altman 1999). CMDPs have emerged as a valuable framework for decision-making in various domains, as they allow for the optimization of objectives while ensuring the fulfillment of trajectory-based constraints over expected cost and other measures (e.g., CVaR). In order to tackle the challenges posed by these constraints, several Constrained RL algorithms have been proposed (Yang et al. 2022; Zhang, Vuong, and Ross 2020). State-of-the-art constrained RL approaches (Satija, Amortila, and Pineau 2020; Chow et al. 2019; Achiam et al. 2017) convert trajectory-based cost constraints into local cost constraints that can be solved easily while guaranteeing the enforcement of trajectory-based constraints. One potential issue with such local cost constraints in challenging constrained RL problems is the estimation of cost value functions. Due to the difficulty involved in estimating the costs of partial (or full) trajectories, output policies can either be conservative or aggressive with regard to costs.

In this work, we develop a novel principled framework that avoids the use of local cost constraints and, instead, focuses on directly solving the original constrained MDP problem, thereby avoiding cost estimation. Our innovation is rooted in the observation that, within the context of CMDP, from a given set of trajectories, it is easy to identify “good” trajectories that are feasible with respect to the cost constraints and offer high rewards. In contrast, “bad” trajectories would be identified as infeasible with respect to the cost constraints and/or yield low rewards. Subsequently, a policy that assigns high probabilities to good trajectories becomes a strong candidate for effectively addressing the CMDP problem. Hence, our approach to address CMDP involves learning a policy that replicates the actions of the good trajectories while steering clear of the bad ones. We do this by employing imitation learning, a framework designed to imitate an expert’s policy based on their demonstrations.

Imitation Learning: Imitation learning (IL) has been recognized as a compelling approach for making sequential decisions (Ng, Russell et al. 2000; Abbeel and Ng 2004). In IL, a set of expert trajectories is provided, and the aim is to train a policy that replicates the behavior of the expert’s policy. One of the simplest IL methods is Behavioral Cloning (BC), which mimics an expert’s policy by maximizing the likelihood of the expert’s actions under the learned policy. BC is simple to implement but it disregards environmental dynamics, making it unable to perform as well as an expert in unseen states (Ross, Gordon, and Bagnell 2011). To address this issue, Generative Adversarial Imitation Learning (Ho and Ermon 2016) and Adversarial Inverse Reinforcement Learning (Fu, Luo, and Levine 2017) were introduced. These methods use adversarial training to make the agent’s behavior match the expert’s occupancy distribution as estimated by their discriminator. However, the adversarial training often hinders the agent from achieving expert-level performance, especially in continuous settings. ValueDICE (Kostrikov, Nachum, and Tompson 2019) learns a value function based on the KL divergence of the learner and expert occupancy distributions and performs well in offline settings while still incorporating adversarial training. More recent methods like PWIL (Dadashi et al. 2020) and IQ-learn (Garg et al. 2021) use different statistical distances for occupancy distribution and successfully eliminate the need for adversarial training.

It is important to note that imitation within our context differs from the conventional IL approaches from the aforementioned works. Here, our approach not only involves mimicking the behavior of “good” demonstrations but also actively avoiding the bad ones. To the best of our knowledge, this marks the first time the concept of learning to avoid bad demonstrations is introduced within the realm of IL. Additionally, in a standard IL algorithm, the set of expert demonstrations is fixed beforehand. In contrast, in our context, the set of demonstrations is generated by a pre-trained or learning policy, thus allowing it to expand as training progresses. These factors collectively pave the way for the development of a novel IL algorithm that is well-suited to our specific context.

Contrastive Learning: Our framework is also related to Contrastive Learning (CL). CL was first introduced by Bromley et al. (1993) with the Siamese architecture to create a mapping function from the inputs into a target space where two similar samples should be close while samples from two different classes should be far apart. There are several well-known applications of CL in computer vision (Noroozi and Favaro 2016; He et al. 2019; Grill et al. 2020), natural language processing (Clark et al. 2020; Gao, Yao, and Chen 2021), recommendation systems (Zhou et al. 2021; Xie et al. 2021), and reinforcement learning (Fu et al. 2021; Laskin, Srinivas, and Abbeel 2020). Our algorithm shares a similar spirit with CL and also marks the first time the idea of contrastive learning is applied in IL.

Contributions: We make the following contributions:

  • New framework for Constrained RL: We propose a novel training framework for Constrained RL that incrementally improves an agent policy by imitating “good” trajectories and avoiding “bad” trajectories. The sets of “good” and “bad” trajectories are selected based on their accumulated rewards and costs and are updated as the policy is improved.

  • Theoretical insights: We show that our way of imitating the good trajectories and avoiding the bad ones ensures no deterioration in the output policy performance.

  • New Learning algorithm: We develop a non-adversarial imitate and avoid algorithm that is able to imitate “good” trajectories and avoid “bad” trajectories. Due to the non-adversarial nature of the algorithm, it provides higher stability while being scalable.

  • Experimental results: We provide an extensive experimental results section, where we demonstrate that our approach outperforms existing best approaches on all six different environments within the highly challenging Safety-Gym benchmark (Footnote 1: Existing works have typically shown results only on the simplest environment, Safety Point Goal-v0). Furthermore, we also provide results for expected cost, CVaR cost, and unknown cost settings.

2 Background

We present a description of the Constrained MDP problem and some popular IL approaches.

2.1 Constrained Markov Decision Process

The Markov Decision Process (MDP) described in (Altman 1999) can be represented as $\mathcal{M}=\left\langle S,A,r,P,s_{0}\right\rangle$. Here, $S$ denotes the set of states, $s_{0}$ represents the initial state set, $A$ is the set of actions, $r:S\times A\rightarrow\mathbb{R}$ defines the reward function for each state-action pair, and $P:S\times A\rightarrow S$ is the transition function.

By introducing an additional constraint set $\mathcal{C}=\left\langle d,c_{\max}\right\rangle$ to the MDP, we can formulate a Constrained Markov Decision Process (CMDP). The constraint set includes a cost function $d:S\rightarrow\mathbb{R}$ and a maximum allowed accumulated cost $c_{\max}$. The objective of the CMDP is to maximize the return while ensuring that the expected accumulated cost remains below the specified maximum. Mathematically, the objective function and constraint can be expressed as follows:

$$
\begin{aligned}
\max_{\pi}\quad & \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\,\middle|\,s_{0},\pi\right] \\
\text{s.t.}\quad & \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}d(s_{t})\,\middle|\,s_{0},\pi\right]\leq c_{\max}.
\end{aligned}
\tag{CMDP}
$$

where $\pi$ represents a policy, $\gamma$ is the discount factor, and the expectation is taken with respect to the initial state and the policy. From now on, to simplify the notation, we define $R(\tau)$ and $C(\tau)$ to be the return and accumulated cost of a trajectory $\tau$, i.e., $R(\tau)=\sum_{(s_{t},a_{t})\in\tau}\gamma^{t}r(s_{t},a_{t})$ and $C(\tau)=\sum_{s_{t}\in\tau}\gamma^{t}d(s_{t})$.
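The two trajectory quantities can be illustrated with a minimal sketch. It assumes a finite trajectory stored as a list of per-step (reward, cost) pairs; the function name `trajectory_return_and_cost` and the data layout are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# One step of a trajectory: (reward r(s_t, a_t), cost d(s_t)).
Step = Tuple[float, float]


def trajectory_return_and_cost(steps: List[Step], gamma: float) -> Tuple[float, float]:
    """Return (R(tau), C(tau)) as discounted sums over a finite trajectory."""
    total_reward = 0.0
    total_cost = 0.0
    discount = 1.0  # equals gamma ** t, starting at t = 0
    for reward, cost in steps:
        total_reward += discount * reward
        total_cost += discount * cost
        discount *= gamma
    return total_reward, total_cost


# Three-step example with gamma = 0.5.
example_tau = [(1.0, 0.0), (2.0, 1.0), (0.0, 1.0)]
example_R, example_C = trajectory_return_and_cost(example_tau, gamma=0.5)
```

Averaging `example_R` and `example_C` over many sampled trajectories would give Monte Carlo estimates of the two expectations in (CMDP).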

2.2 Imitation Learning

Behavioral Cloning.

In BC, the objective is to maximize the likelihood of the demonstrations.

$$
\max_{\pi}\ \mathbb{E}_{\tau\sim\pi^{E}}\Big[\sum_{(s,a)\in\tau}\ln(\pi(a|s))\Big]
\tag{BC}
$$

BC has a strong theoretical foundation but ignores environmental dynamics and only works with offline learning, requiring a huge number of samples to achieve a desired performance (Ross, Gordon, and Bagnell 2011).

Distribution matching.

A popular and useful approach for IL is based on state-action distribution matching. Specifically, let $\rho^{\pi}(s,a)$ be the occupancy measure of visiting state $s$ and taking action $a$ under policy $\pi$, and let $\rho^{\pi^{E}}$ be the state-action distribution given by the expert policy $\pi^{E}$. The distribution matching approach proposes to learn $\pi$ by minimizing a discrepancy between $\rho^{\pi}$ and $\rho^{\pi^{E}}$, such as the KL divergence:

$$
\min_{\pi}\ \mathrm{KL}\left(\rho^{\pi}\,\|\,\rho^{\pi^{E}}\right)=\min_{\pi}\left\{\mathbb{E}_{(s,a)\sim\rho^{\pi}}\left[\ln\frac{\rho^{\pi}(s,a)}{\rho^{\pi^{E}}(s,a)}\right]\right\}
\tag{1}
$$

Approaches based on distributional matching include some state-of-the-art IL algorithms such as adversarial IL (Ho and Ermon 2016; Fu, Luo, and Levine 2017) or IQ-learn (Garg et al. 2021).

3 Self-Imitation Learning Approach

Before describing our learning approach, we define “Good” and “Bad” trajectories:

Definition 1.

A trajectory $\tau$ is a good trajectory if $R(\tau)\geq R_{G}$ and $C(\tau)\leq c_{\max}$. On the other hand, a trajectory $\tau$ is a bad trajectory if $R(\tau)<R_{B}$ or $C(\tau)>c_{\max}$.

Here, $R_{G}$ and $R_{B}$ represent some predefined thresholds for selecting good and bad trajectories, respectively (Footnote 2: We provide an in-depth analysis of the selection of these hyperparameters and of changing them during learning for certain problems). We denote by $\Omega^{G}$ and $\Omega^{B}$ the sets of good and bad trajectories, respectively.
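Definition 1 amounts to a simple labelling rule, sketched below. The function name `label_trajectory` and its argument names are hypothetical. The sketch assumes the bad threshold does not exceed the good threshold, so that no trajectory satisfies both conditions; trajectories that meet neither condition are left unlabelled.

```python
from typing import Optional


def label_trajectory(
    ret: float, cost: float, r_good: float, r_bad: float, c_max: float
) -> Optional[str]:
    """Label a trajectory from its return R(tau) and accumulated cost C(tau).

    Returns "good", "bad", or None when neither condition of Definition 1 holds.
    Assumes r_bad <= r_good (an assumption of this sketch).
    """
    if ret >= r_good and cost <= c_max:
        return "good"
    if ret < r_bad or cost > c_max:
        return "bad"
    return None
```

For instance, with thresholds `r_good=8`, `r_bad=5` and `c_max=2`, a trajectory with return 6 and cost 1 is feasible but falls between the two reward thresholds, so it joins neither set.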

3.1 Learning from Good and Bad Demonstrations

Our aim is to train an RL agent to imitate the good behavior from a set of good demonstrations (trajectories) and avoid the bad demonstrations. In other words, we try to mimic the good part of the pre-trained policy and avoid the bad part.

Behavior Cloning GB: When using a Behavior Cloning (BC) type approach to achieve the above objective, the aim is to maximize the likelihood of the good set while minimizing the likelihood of the bad one. The training objective can be written as:

$$
\max_{\pi}\left\{\lambda\,\mathbb{E}_{\substack{\tau\sim\pi^{0}\\ \tau\in\Omega^{G}}}\Big[\sum_{(s,a)\in\tau}\ln(\pi(a|s))\Big]-(1-\lambda)\,\mathbb{E}_{\substack{\tau\sim\pi^{0}\\ \tau\in\Omega^{B}}}\Big[\sum_{(s,a)\in\tau}\phi(\ln(\pi(a|s)))\Big]\right\}
\tag{1}
$$

where $\phi(\cdot)$ is a monotone regularizer mapping $(-\infty,0)$ to a finite interval, $\lambda\in[0,1]$ is a parameter capturing the impact of the good and bad sets on the objective function, and $\pi^{0}$ is a starting policy that we want to improve upon. We use $\pi^{0}$ instead of $\pi^{E}$ because the starting policy is not necessarily an expert one. If $\lambda=1$, we learn only from good demonstrations and ignore the bad ones; if $\lambda=0$, we learn only to avoid the bad ones. We incorporate the regularizer $\phi(\cdot)$ to address a critical concern: without it, the maximization could drive $\ln(\pi(a|s))$ in the second term of equation (1) towards negative infinity, leading to an unbounded and numerically unstable objective. Intuitively, to improve the objective in (1), the policy must allocate higher probabilities to trajectories in the good set and lower probabilities to trajectories in the bad set.
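A minimal sketch of how the BC objective with good and bad trajectories could be evaluated for a tabular policy follows. Several choices here are assumptions made for illustration and are not taken from the paper: the regularizer is set to the exponential function (monotone, mapping $(-\infty,0)$ into $(0,1)$), the two expectations are replaced by sample averages over the given trajectory lists, and all policy probabilities are assumed strictly positive. The names `phi`, `mean_over_trajectories` and `bc_gb_objective` are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

Trajectory = List[Tuple[str, str]]      # sequence of (state, action) pairs
Policy = Dict[str, Dict[str, float]]    # policy[s][a] = pi(a | s), assumed > 0


def phi(log_prob: float) -> float:
    """Hypothetical regularizer: exp is monotone and maps (-inf, 0) into (0, 1)."""
    return math.exp(log_prob)


def mean_over_trajectories(
    policy: Policy, trajectories: List[Trajectory], transform: Callable[[float], float]
) -> float:
    """Average over trajectories of the sum of transform(ln pi(a|s)) along each one."""
    if not trajectories:
        return 0.0
    total = 0.0
    for tau in trajectories:
        total += sum(transform(math.log(policy[s][a])) for s, a in tau)
    return total / len(trajectories)


def bc_gb_objective(
    policy: Policy, good: List[Trajectory], bad: List[Trajectory], lam: float
) -> float:
    """Sample-based value of the objective: lam * good term - (1 - lam) * bad term."""
    good_term = mean_over_trajectories(policy, good, lambda x: x)
    bad_term = mean_over_trajectories(policy, bad, phi)
    return lam * good_term - (1.0 - lam) * bad_term
```

With one good trajectory taking action `a` and one bad trajectory taking action `b` in the same state, a policy that shifts probability from `b` to `a` obtains a larger objective value than the uniform policy, which matches the intuition stated above.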

Distribution Matching GB: In distribution matching, learning balances two goals. It minimizes the Kullback-Leibler (KL) divergence between the occupancy measure of the policy under consideration, $\rho^{\pi}$, and that of the good trajectories, $\rho^{G}$. Simultaneously, it maximizes the KL divergence between $\rho^{\pi}$ and the occupancy measure of the bad trajectories, $\rho^{B}$. This dual-divergence approach shapes the policy by aligning it closely with the good trajectories while distancing it from the bad ones. The “good” and “bad” occupancy measures can be computed as $\rho^{G}(s,a)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}p_{t}(s,a|\Omega^{G})$, where $p_{t}(s,a|\Omega^{G})$ is the probability that $(s_{t},a_{t})$ is in the set $\Omega^{G}$ and $(s_{t},a_{t})=(s,a)$. The measure $\rho^{B}(s,a)$ is computed in the same way. Then, the training objective becomes

$$
\min_{\pi}\left\{\lambda\,\mathrm{KL}\left(\rho^{\pi}\,\|\,\rho^{G}\right)-(1-\lambda)\,\mathrm{KL}\left(\rho^{\pi}\,\|\,\rho^{B}\right)\right\}
\tag{DM-GB}
$$

Intuitively, to minimize the objective function in (DM-GB), the occupancy distribution must move towards $\rho^{G}$ and away from $\rho^{B}$. Consequently, $\rho^{\pi}$ will allocate a higher probability to a pair $(s,a)$ that appears more frequently in $\Omega^{G}$ than in $\Omega^{B}$, and vice versa.
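The (DM-GB) objective can be illustrated on small discrete examples. The sketch below rests on assumptions that are not from the paper: occupancy measures are estimated from finite trajectories by discounted visit counts normalized to sum to one, and the KL divergence floors the second argument at a small `eps` so that unseen pairs do not produce a division by zero. All function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str]          # (state, action)
Occupancy = Dict[Pair, float]   # probability mass per (state, action) pair


def empirical_occupancy(trajectories: List[List[Pair]], gamma: float) -> Occupancy:
    """Discounted visit counts of (s, a) pairs, normalized to sum to one."""
    weights: Dict[Pair, float] = defaultdict(float)
    for tau in trajectories:
        for t, pair in enumerate(tau):
            weights[pair] += gamma ** t
    total = sum(weights.values())
    if total <= 0.0:
        raise ValueError("need at least one non-empty trajectory")
    return {pair: w / total for pair, w in weights.items()}


def kl_divergence(p: Occupancy, q: Occupancy, eps: float = 1e-8) -> float:
    """KL(p || q) over the support of p, with q floored at eps (illustrative smoothing)."""
    return sum(
        pv * math.log(pv / max(q.get(pair, 0.0), eps))
        for pair, pv in p.items()
        if pv > 0.0
    )


def dm_gb_objective(rho_pi: Occupancy, rho_g: Occupancy, rho_b: Occupancy, lam: float) -> float:
    """Value to be minimized: lam * KL(rho_pi || rho_G) - (1 - lam) * KL(rho_pi || rho_B)."""
    return lam * kl_divergence(rho_pi, rho_g) - (1.0 - lam) * kl_divergence(rho_pi, rho_b)
```

An occupancy measure concentrated on pairs from the good set gives a lower objective value than one concentrated on pairs from the bad set. Note that the second term can grow without bound as $\rho^{\pi}$ leaves the support of $\rho^{B}$, so in this sketch the `eps` floor is also what keeps the value finite.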

3.2 Theoretical Insights

We investigate the theoretical properties of our concept of learning from good and bad demonstrations. Our aim is to explore whether we can obtain improved policies by learning from good and bad demonstrations. Since BC-GB works directly with trajectories, we employ it to present our theory on why using good and bad trajectories is useful. Distribution Matching GB, on the other hand, works with state-action pairs and is thus much more challenging to analyze theoretically. Our algorithm is nevertheless developed on Distribution Matching GB, and we show extensive empirical results with it in our experiments to demonstrate that it is a more practical algorithm and that it outperforms existing work.

We first note that, in the context of maximum likelihood estimation, $\pi^{E}$ is optimal for (BC). In other words, if we have sufficient samples from the expert policy, it is guaranteed that we can recover the expert policy by solving (BC).

We now look at the BC objective with good and bad trajectories in (1). The following lemma says that a policy that allocates zero probabilities to bad trajectories in $\Omega^{B}$ will be optimal for (1).

Lemma 1.

For any $\lambda>0$, if there exists a policy $\pi^{*}$ such that $P_{\pi^{*}}(\tau)=0$ for all $\tau\in\Omega^{B}$, and $P_{\pi^{*}}(\tau)=\frac{P_{\pi^{0}}(\tau)}{\sum_{\tau^{\prime}\in\Omega^{G}}P_{\pi^{0}}(\tau^{\prime})}$ for all $\tau\in\Omega^{G}$, then $\pi^{*}$ is an optimal policy to (1).

Here, $P_{\pi^{*}}(\tau)$ is the probability of $\tau$ given by $\pi^{*}$, i.e., $P_{\pi^{*}}(\tau)=\prod_{(s_{t},a_{t},s_{t+1})\in\tau}\pi^{*}(a_{t}|s_{t})P(s_{t+1}|a_{t},s_{t})$. There might be no policy that allocates exactly $P_{\pi}(\tau)=0$ for all $\tau\in\Omega^{B}$, due to, for instance, the dynamics of the environment or the structure of $\Omega^{B}$. However, intuitively, a policy that assigns small probabilities to pairs $(s,a)$ that appear more frequently in $\Omega^{B}$ than in $\Omega^{G}$ will move towards $\pi^{*}$ (and thus closer to the optimal policy).
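The trajectory distribution described in Lemma 1 can be sketched at the level of trajectory probabilities. The example assumes a finite set of trajectories identified by hypothetical string ids, with the starting policy's trajectory probabilities given directly; it does not construct the policy itself, which, as noted above, may not exist.

```python
from typing import Dict, Set


def lemma1_target_distribution(p0: Dict[str, float], good_ids: Set[str]) -> Dict[str, float]:
    """Renormalize the starting policy's mass on the good set; zero elsewhere.

    p0 maps a trajectory id to its probability under the starting policy.
    Since the result sums to one over the good set, every other trajectory
    (in particular every bad one) receives probability zero.
    """
    good_mass = sum(p0[tau] for tau in good_ids)
    if good_mass <= 0.0:
        raise ValueError("the good set has zero probability under the starting policy")
    return {tau: (p / good_mass if tau in good_ids else 0.0) for tau, p in p0.items()}


# Hypothetical example: t1 and t2 are good, t3 is bad.
target = lemma1_target_distribution({"t1": 0.5, "t2": 0.25, "t3": 0.25}, {"t1", "t2"})
```

In this example the relative weights of `t1` and `t2` are preserved (2:1), while `t3` is excluded entirely.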

In the proposition below we show that, if we construct a bad set consisting of trajectories having low reward values and violating the cost constraints, then it is guaranteed that the optimal policy mentioned in Lemma 1 will perform better than the initial policy $\pi^{0}$ in terms of both reward and cost constraint satisfaction.

Proposition 1.

For any $\lambda>0$, let $\pi^{0}$ be a pre-trained feasible policy, $\mathbf{R}^{E}=\mathbb{E}_{\tau\sim\pi^{0}}\big[R(\tau)\big]$, and $\Omega^{B}$ be a collection of trajectories of low reward and high-cost values

$$
\Omega^{B}=\left\{\tau\;\middle|\;R(\tau)\leq\mathbf{R}^{E},\;C(\tau)>c_{\max}\right\}
$$

Then the optimal policy mentioned in Lemma 1 is feasible with respect to the cost constraint while offering a better expected reward than the pre-trained policy $\pi^{0}$. Specifically,

𝔼π*[R(τ)]𝔼π0[R(τ)]subscript𝔼superscript𝜋delimited-[]𝑅𝜏subscript𝔼superscript𝜋0delimited-[]𝑅𝜏\displaystyle\mathbb{E}_{\pi^{*}}\left[R(\tau)\right]-\mathbb{E}_{\pi^{0}}% \left[R(\tau)\right]blackboard_E start_POSTSUBSCRIPT italic_π start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT end_POSTSUBSCRIPT [ italic_R ( italic_τ ) ] - blackboard_E start_POSTSUBSCRIPT italic_π start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT [ italic_R ( italic_τ ) ] =τPπ0(τ)(𝑹ER(τ))1Pπ0(ΩB)0absentsubscript𝜏subscript𝑃superscript𝜋0𝜏superscript𝑹𝐸𝑅𝜏1subscript𝑃superscript𝜋0superscriptΩ𝐵0\displaystyle=\frac{\sum_{\tau}{P_{\pi^{0}}(\tau)(\textbf{R}^{E}-R(\tau))}}{1-% P_{\pi^{0}}(\Omega^{B})}\geq 0= divide start_ARG ∑ start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT italic_P start_POSTSUBSCRIPT italic_π start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_τ ) ( R start_POSTSUPERSCRIPT italic_E end_POSTSUPERSCRIPT - italic_R ( italic_τ ) ) end_ARG start_ARG 1 - italic_P start_POSTSUBSCRIPT italic_π start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( roman_Ω start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT ) end_ARG ≥ 0 (2)
𝔼τπ*[C(τ)]subscript𝔼similar-to𝜏superscript𝜋delimited-[]𝐶𝜏\displaystyle\mathbb{E}_{\tau\sim\pi^{*}}\left[C(\tau)\right]blackboard_E start_POSTSUBSCRIPT italic_τ ∼ italic_π start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT end_POSTSUBSCRIPT [ italic_C ( italic_τ ) ] cmaxabsentsubscript𝑐\displaystyle\leq c_{\max}≤ italic_c start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT

where Pπ0(ΩB)=τΩBPπ0(τ)subscript𝑃superscript𝜋0superscriptnormal-Ω𝐵subscript𝜏superscriptnormal-Ω𝐵subscript𝑃superscript𝜋0𝜏P_{\pi^{0}}(\Omega^{B})=\sum_{\tau\in\Omega^{B}}P_{\pi^{0}}(\tau)italic_P start_POSTSUBSCRIPT italic_π start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( roman_Ω start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_τ ∈ roman_Ω start_POSTSUPERSCRIPT italic_B end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_P start_POSTSUBSCRIPT italic_π start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_τ ).

The inequality in (2) suggests that increasing the proportion or total probability of the bad set $\Omega^{B}$ will result in a larger gap, and hence a larger policy improvement. In other words, as more bad trajectories are identified, the quality of $\pi^{*}$ improves.
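The identity in (2) can be checked with a short calculation. The sketch below assumes, as in the worked example of Section 3.3, that the optimal policy of Lemma 1 puts zero probability on $\Omega^{B}$ and rescales the remaining trajectory probabilities of $\pi^{0}$; this form is an assumption here, since Lemma 1 is not restated in this section.

```latex
\begin{align*}
P_{\pi^{*}}(\tau) &=
  \begin{cases}
    0 & \tau\in\Omega^{B},\\[2pt]
    \dfrac{P_{\pi^{0}}(\tau)}{1-P_{\pi^{0}}(\Omega^{B})} & \tau\notin\Omega^{B},
  \end{cases}\\[4pt]
\mathbb{E}_{\pi^{*}}[R(\tau)]
  &= \frac{\sum_{\tau\notin\Omega^{B}}P_{\pi^{0}}(\tau)R(\tau)}{1-P_{\pi^{0}}(\Omega^{B})}
   = \frac{\mathbf{R}^{E}-\sum_{\tau\in\Omega^{B}}P_{\pi^{0}}(\tau)R(\tau)}{1-P_{\pi^{0}}(\Omega^{B})},\\[4pt]
\mathbb{E}_{\pi^{*}}[R(\tau)]-\mathbf{R}^{E}
  &= \frac{\mathbf{R}^{E}P_{\pi^{0}}(\Omega^{B})-\sum_{\tau\in\Omega^{B}}P_{\pi^{0}}(\tau)R(\tau)}{1-P_{\pi^{0}}(\Omega^{B})}
   = \frac{\sum_{\tau\in\Omega^{B}}P_{\pi^{0}}(\tau)\left(\mathbf{R}^{E}-R(\tau)\right)}{1-P_{\pi^{0}}(\Omega^{B})}\;\geq\;0 .
\end{align*}
```

Every term in the final numerator is nonnegative because $R(\tau)\leq\mathbf{R}^{E}$ on $\Omega^{B}$. Under the same assumption, the cost bound follows by the same steps, using $C(\tau)>c_{\max}$ on $\Omega^{B}$ and $\mathbb{E}_{\pi^{0}}[C(\tau)]\leq c_{\max}$.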

The above results hold for $\lambda>0$, also indicating that one might obtain a better policy by eliminating the probabilities of bad trajectories (trajectories with low reward and high-cost values). When $\lambda=0$, BC learns only from the bad trajectories. Interestingly, we can show that learning from bad trajectories alone does not necessarily yield a better policy. We first state the following lemma.

Lemma 2.

If $\lambda=0$, any policy $\pi^{*}$ such that $P_{\pi^{*}}(\tau)=0$ for all $\tau\in\Omega^{B}$ is optimal for (1).

The following proposition tells us that learning with $\lambda=0$ would not offer a policy improvement as in the case of $\lambda>0$.

Proposition 2.

If $\lambda=0$, and the bad set $\Omega^{B}$ is selected in the same manner as in Proposition 1, then the optimal policy $\pi^{*}$ from Lemma 2 is feasible, but it does not necessarily provide a higher expected reward than the policy $\pi^{0}$.

In Propositions 1 and 2, it is assumed that the pre-trained policy $\pi^{0}$ is feasible. It is then relevant to discuss other scenarios where the pre-trained policy is not feasible, or even when the cost function is unknown. We summarize our claims below.

Proposition 3.

The following hold:

  • (i) If we select the bad set as $\Omega^{B}=\left\{\tau \mid R(\tau)\leq\mathbf{R}^{E},~C(\tau)>\mathbf{C}^{E}\right\}$, where $\mathbf{C}^{E}=\mathbb{E}_{\tau\sim\pi^{0}}[C(\tau)]$, then it is guaranteed that $\pi^{*}$ offers a higher (or equal) expected reward and a lower (or equal) expected cost than $\pi^{0}$.

  • (ii) If the pre-trained policy $\pi^{0}$ is not feasible and we select the bad set as $\Omega^{B}=\left\{\tau \mid C(\tau)>c_{\max}\right\}$, then it is guaranteed that $\pi^{*}$ is feasible.

  • (iii) If the cost function is not accessible, but there is an oracle that can tell us which trajectories violate the constraint, then by selecting $\Omega^{B}=\left\{\tau \mid \tau\text{ violates the constraint}\right\}$, $\pi^{*}$ is feasible.
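The three ways of choosing the bad set in Proposition 3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Trajectory` container and the function names are hypothetical, and the expectations $\mathbf{R}^{E}$ and $\mathbf{C}^{E}$ are approximated by sample means over a batch of rollouts from $\pi^{0}$.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical container: accumulated reward and cost of one rollout."""
    ret: float
    cost: float


def bad_set_i(trajs: List[Trajectory]) -> List[Trajectory]:
    """Case (i): reward at most R^E and cost above C^E (both sample means)."""
    r_e = mean(t.ret for t in trajs)
    c_e = mean(t.cost for t in trajs)
    return [t for t in trajs if t.ret <= r_e and t.cost > c_e]


def bad_set_ii(trajs: List[Trajectory], c_max: float) -> List[Trajectory]:
    """Case (ii): every trajectory whose cost exceeds the threshold c_max."""
    return [t for t in trajs if t.cost > c_max]


def bad_set_iii(trajs: List[Trajectory],
                violates: Callable[[Trajectory], bool]) -> List[Trajectory]:
    """Case (iii): cost function unknown, an oracle flags violating trajectories."""
    return [t for t in trajs if violates(t)]


# Small made-up batch, for illustration only.
batch = [Trajectory(3, 4), Trajectory(4, 2), Trajectory(5, 2), Trajectory(3, 1)]
print(bad_set_i(batch))
print(bad_set_ii(batch, c_max=2))
print(bad_set_iii(batch, violates=lambda t: t.cost > 3))
```

In all three cases the rest of the framework is unchanged; only the rule that decides membership in $\Omega^{B}$ differs.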

The above results provide some interesting insights into the framework. They show that if we learn from both good and bad trajectories, the policy will be trained towards one that is better than the pre-trained policy $\pi^{0}$. If we only use bad trajectories, then it is possible that we cannot get a better policy than $\pi^{0}$. This remark will be further validated in our later experiments, as we observe that one cannot learn a good policy by just using bad trajectories. Moreover, according to Proposition 3, our framework can be used to train towards policies with lower cost or even feasible policies by selecting different bad sets $\Omega^{B}$, even when the cost function is not known beforehand.

The theory also tells us that if we are more selective in choosing good and bad trajectories, we will tend to obtain better policies, as long as there are policies that can eliminate the probabilities of the bad ones. However, the selection process can be tricky and may not be easy to achieve in practice. So, it is better to not be too selective (or conservative) in classifying good and bad demonstrations.

3.3 Example

Figure 1: Example

We give a small example to demonstrate how our framework returns a better policy by learning from bad and good trajectories. We consider the small deterministic MDP given in Figure 1. The rewards $r$, costs $c$, and pre-trained policy $\pi^{0}$ are as shown in Table 1. The probabilities are over feasible actions from the state. There are 4 possible trajectories: $\tau_{1}=\{s_{0},s_{1},s_{3},s_{5}\}$, $\tau_{2}=\{s_{0},s_{1},s_{4},s_{5}\}$, $\tau_{3}=\{s_{0},s_{1},s_{2},s_{5}\}$, and $\tau_{4}=\{s_{0},s_{2},s_{5}\}$.

|           | $s_{0}$  | $s_{1}$       | $s_{2}$ | $s_{3}$ | $s_{4}$ | $s_{5}$ |
|-----------|----------|---------------|---------|---------|---------|---------|
| $r$       | 0        | 2             | 3       | 1       | 2       | 0       |
| $c$       | 0        | 1             | 1       | 3       | 1       | 0       |
| $\pi^{0}$ | 1/2, 1/2 | 1/3, 1/3, 1/3 | 1       | 1       | 1       | 1       |

Table 1: Rewards, Costs and Policy

We then see that $\mathbb{E}_{\pi^{0}}[R(\tau)]=\sum_{\tau}P_{\pi^{0}}(\tau)R(\tau)=3.5$ and $\mathbb{E}_{\pi^{0}}[C(\tau)]=\sum_{\tau}P_{\pi^{0}}(\tau)C(\tau)=2$, implying that $\pi^{0}$ is feasible for the CMDP problem.

Under our good-bad scheme, trajectory $\tau_{1}=\{s_{0},s_{1},s_{3},s_{5}\}$ has accumulated reward and cost $R(\tau_{1})=3<\mathbb{E}_{\pi^{0}}[R(\tau)]$ and $C(\tau_{1})=4>c_{\max}$. So, according to the criteria in Theorem 1, $\tau_{1}$ should be considered a bad trajectory (the others are good). The BC objective can be written as $F(\pi)=\lambda\sum_{i\in\{2,3,4\}}P_{\pi^{0}}(\tau_{i})\ln P_{\pi}(\tau_{i})$. The policy $\pi^{*}$ such that $\pi^{*}(a_{4}|s_{1})=0$, $\pi^{*}(a_{6}|s_{1})=\pi^{*}(a_{3}|s_{1})=1/2$, $\pi^{*}(a_{1}|s_{0})=2/5$ and $\pi^{*}(a_{2}|s_{0})=3/5$ satisfies the condition in Lemma 1, i.e.,

$$P_{\pi^{*}}(\tau_{1})=0;~~P_{\pi^{*}}(\tau_{2})=\frac{1}{5}=\frac{P_{\pi^{0}}(\tau_{2})}{5/6};~~P_{\pi^{*}}(\tau_{3})=\frac{1}{5}=\frac{P_{\pi^{0}}(\tau_{3})}{5/6};~~P_{\pi^{*}}(\tau_{4})=\frac{3}{5}=\frac{P_{\pi^{0}}(\tau_{4})}{5/6},$$

with $P_{\pi^{0}}(\tau_{2})+P_{\pi^{0}}(\tau_{3})+P_{\pi^{0}}(\tau_{4})=5/6$; thus $\pi^{*}$ is optimal for $\max_{\pi}\{F(\pi)\}$. On the other hand, $\mathbb{E}_{\pi^{*}}[R(\tau)]=3.6$ and $\mathbb{E}_{\pi^{*}}[C(\tau)]=1.6$. So $\pi^{*}$ offers a better expected reward and a lower cost compared to the pre-trained policy $\pi^{0}$.
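The reward side of this example can be reproduced with exact rational arithmetic. The sketch below is only an illustration under stated assumptions: trajectories are encoded as state sequences, the bad set $\{\tau_{1}\}$ is taken from the text rather than recomputed, and only rewards are evaluated. The last policy (always choosing $\tau_{4}$) is our own addition to illustrate the point of Proposition 2; it does not appear in the paper.

```python
from fractions import Fraction as F

# Per-state rewards from Table 1.
REWARD = {"s0": 0, "s1": 2, "s2": 3, "s3": 1, "s4": 2, "s5": 0}

# The four trajectories listed in the text, as state sequences.
TRAJS = {
    "tau1": ("s0", "s1", "s3", "s5"),
    "tau2": ("s0", "s1", "s4", "s5"),
    "tau3": ("s0", "s1", "s2", "s5"),
    "tau4": ("s0", "s2", "s5"),
}

# Probabilities under pi^0: 1/2 at s0, then 1/3 at s1 where applicable.
P0 = {"tau1": F(1, 2) * F(1, 3), "tau2": F(1, 2) * F(1, 3),
      "tau3": F(1, 2) * F(1, 3), "tau4": F(1, 2)}


def ret(name):
    """Accumulated reward R(tau) of a named trajectory."""
    return sum(REWARD[s] for s in TRAJS[name])


def expected_return(p):
    """E[R(tau)] under a distribution p over trajectories."""
    return sum(p[n] * ret(n) for n in TRAJS)


def renormalize(p, bad):
    """Zero out the bad trajectories and rescale the rest to sum to one."""
    keep = 1 - sum(p[n] for n in bad)
    return {n: (F(0) if n in bad else p[n] / keep) for n in p}


BAD = {"tau1"}                       # bad set stated in the text
R_E = expected_return(P0)            # 7/2
P_STAR = renormalize(P0, BAD)        # tau2: 1/5, tau3: 1/5, tau4: 3/5
gap = expected_return(P_STAR) - R_E  # left-hand side of (2)
rhs = sum(P0[n] * (R_E - ret(n)) for n in BAD) / (1 - sum(P0[n] for n in BAD))

# A policy that also avoids tau1 but earns less than pi^0 (cf. Proposition 2).
ONLY_TAU4 = {"tau1": F(0), "tau2": F(0), "tau3": F(0), "tau4": F(1)}

print(R_E, expected_return(P_STAR), gap, rhs, expected_return(ONLY_TAU4))
```

Both sides of (2) evaluate to 1/10 here, while the policy that always follows $\tau_{4}$ places no mass on the bad set yet has expected reward 3, below the 3.5 of $\pi^{0}$.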

4 Self-Imitation based Safe RL

In this section, we present a practical IL-based algorithm for constrained RL. A BC-based algorithm can be developed using (1). However, this approach (or BC in general) would not be practical and would necessitate a huge number of samples to attain the desired performance. In contrast, Distribution Matching proves to be a more practical alternative. Taking inspiration from the GAIL algorithm, to address (DM-GB), one can construct two discriminators: one for $\mathrm{KL}\left(\rho^{\pi}\|\rho^{G}\right)$ and another for $\mathrm{KL}\left(\rho^{\pi}\|\rho^{B}\right)$. Nonetheless, this approach involves two adversaries and would be highly unstable (as demonstrated in our experiments). To get rid of adversarial training, let us put the occupancy measures of the learning policy and the good demonstrations together, and consider the following mixed state-action distribution $\rho^{G,\pi}=(\rho^{\pi}+\rho^{G})/2$. We then set our aim to maximize the KL divergence between $\rho^{G,\pi}$ and the occupancy measure of the bad trajectories $\rho^{B}$ (thus making $\rho^{G,\pi}$ far away from the “bad” occupancy measure $\rho^{B}$).

$$\max_{\pi}\left\{\mathrm{KL}\left(\rho^{G,\pi}\|\rho^{B}\right)\right\}=\max_{\pi}\mathbb{E}_{(s,a)\sim\rho^{G,\pi}}\left[\ln\frac{\rho^{B}(s,a)}{\rho^{G,\pi}(s,a)}\right] \tag{3}$$

To estimate the distribution ratio $\frac{\rho^{B}(s,a)}{\rho^{G,\pi}(s,a)}$, we propose the following surrogate maximization problem:

$$\max_{K:S\times A\rightarrow(0,1)}\Big\{J(K,\pi):=\mathbb{E}_{\rho^{B}}[\ln(K(s,a))]+\frac{1}{2}\mathbb{E}_{\rho^{\pi}}[\ln(1-K(s,a))]+\frac{1}{2}\mathbb{E}_{\rho^{G}}[\ln(1-K(s,a))]\Big\} \tag{4}$$

Here, (4) is connected to (3) through the following result:

Proposition 4.

The maximization in (4) is achieved at $K^{*}(s,a)$ such that

$$\ln\left(\frac{K^{*}(s,a)}{1-K^{*}(s,a)}\right)=\ln\frac{\rho^{B}(s,a)}{\rho^{G,\pi}(s,a)}$$

This implies that the distribution ratio can be estimated as $\ln\frac{K^{*}(s,a)}{1-K^{*}(s,a)}$. As a result, the policy can be updated by maximizing $\max_{\pi}\mathbb{E}_{(s,a)\sim\rho^{G,\pi}}\left[\ln\frac{K^{*}(s,a)}{1-K^{*}(s,a)}\right]$, which is equivalent to $\max_{\pi}\mathbb{E}_{(s,a)\sim\rho^{\pi}}\left[\ln\frac{K^{*}(s,a)}{1-K^{*}(s,a)}\right]$ as the occupancy measure of the good demonstrations is constant.
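The form of $K^{*}$ follows from a pointwise first-order condition. The short derivation below is a sketch for a finite state-action space and uses only the definitions of $J$ in (4) and of the mixture $\rho^{G,\pi}=(\rho^{\pi}+\rho^{G})/2$.

```latex
\begin{align*}
J(K,\pi) &= \sum_{s,a}\Big[\rho^{B}(s,a)\ln K(s,a)
          + \rho^{G,\pi}(s,a)\ln\big(1-K(s,a)\big)\Big],\\[4pt]
\frac{\partial J}{\partial K(s,a)}
         &= \frac{\rho^{B}(s,a)}{K(s,a)} - \frac{\rho^{G,\pi}(s,a)}{1-K(s,a)} = 0
 \;\;\Longrightarrow\;\;
 K^{*}(s,a) = \frac{\rho^{B}(s,a)}{\rho^{B}(s,a)+\rho^{G,\pi}(s,a)},\\[4pt]
\frac{K^{*}(s,a)}{1-K^{*}(s,a)} &= \frac{\rho^{B}(s,a)}{\rho^{G,\pi}(s,a)} .
\end{align*}
```

Each summand is concave in $K(s,a)$ on $(0,1)$, so the stationary point is the maximizer.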

In practice, $K(s,a)$ need not be fully optimized. Instead, $K$ and $\pi$ can be updated alternately by gradient ascent. It is important to note that we update $K(s,a)$ by maximizing $J(K,\pi)$ and update $\pi$ by maximizing $\mathrm{KL}\left(\rho^{G,\pi}\|\rho^{B}\right)$, so our algorithm is non-adversarial. In other words, $K(s,a)$ operates in a cooperative manner rather than an adversarial one: it collaborates with the policy $\pi$ to estimate the distribution ratio and make the mixed distribution $\rho^{G,\pi}$ far away from the bad one $\rho^{B}$. Here, the non-adversarial nature of our method stems from maximizing the KL divergence, in contrast to the minimization employed in GAIL.
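To make the $K$ update concrete, the following sketch runs plain gradient ascent on a sample-based version of $J$ for a tabular $K=\mathrm{sigmoid}(w)$, with the policy samples held fixed. It is an illustration under assumptions, not the authors' implementation: the function names, the tabular parameterization, the step size and the string labels standing in for state-action pairs are all hypothetical. With enough steps the iterate approaches $\rho^{B}/(\rho^{B}+\rho^{G,\pi})$ computed from empirical frequencies.

```python
import math
from collections import Counter


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def frequencies(samples):
    """Empirical distribution over the (hashable) state-action labels."""
    counts = Counter(samples)
    return {k: c / len(samples) for k, c in counts.items()}


def fit_K(bad, policy, good, lr=2.0, steps=2000):
    """Gradient ascent on J for tabular logits w, with K = sigmoid(w).

    For each label, dJ/dw = f_bad * (1 - K) - mix * K,
    where mix = (f_policy + f_good) / 2 is the mixed distribution.
    """
    f_bad, f_pi, f_good = frequencies(bad), frequencies(policy), frequencies(good)
    keys = set(f_bad) | set(f_pi) | set(f_good)
    w = {k: 0.0 for k in keys}
    for _ in range(steps):
        for k in keys:
            K = sigmoid(w[k])
            mix = 0.5 * f_pi.get(k, 0.0) + 0.5 * f_good.get(k, 0.0)
            w[k] += lr * (f_bad.get(k, 0.0) * (1.0 - K) - mix * K)
    return {k: sigmoid(w[k]) for k in keys}


# Made-up samples over three state-action labels, for illustration only.
K = fit_K(bad=["x", "x", "y", "z"],
          policy=["x", "y", "y", "z"],
          good=["y", "z", "z", "z"])
print({k: round(v, 4) for k, v in sorted(K.items())})
```

In the full method the policy samples change between $K$ updates, so only a few ascent steps would be taken per iteration rather than running to convergence as done here.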

Drawing from the above analyses, we proceed to outline our algorithm. Let $w$ and $\theta$ denote the parameters of $K(s,a)$ and $\pi(s,a)$ respectively. The core concept involves iteratively enhancing $K$ and $\pi$ through alternating gradient ascent updates. While $K_{w}$ can be updated by using the derivatives of $J(K_{w},\pi_{\theta})$, $\pi_{\theta}$ can be updated by a policy gradient method, e.g., PPO (Schulman et al. 2017). During the training process, we generate additional trajectories and update the good and bad sets. The key steps of our method are described in Algorithm 1.

Algorithm 1 Self-imitation Safe Reinforcement Learning
0:  π0superscript𝜋0\pi^{0}italic_π start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT,Kωsubscript𝐾𝜔K_{\omega}italic_K start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT,RGsubscript𝑅𝐺R_{G}italic_R start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT,RBsubscript𝑅𝐵R_{B}italic_R start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ,cmaxsubscript𝑐𝑚𝑎𝑥c_{max}italic_c start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT, learning rates κθ,κwsubscript𝜅𝜃subscript𝜅𝑤\kappa_{\theta},\kappa_{w}italic_κ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT , italic_κ start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT
  ΩGsubscriptΩ𝐺{\Omega_{G}}\leftarrow\emptysetroman_Ω start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT ← ∅ΩBsubscriptΩ𝐵{\Omega_{B}}\leftarrow\emptysetroman_Ω start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ← ∅πθπ0subscript𝜋𝜃superscript𝜋0\pi_{\theta}\leftarrow\pi^{0}italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ← italic_π start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT
  while not converge do
     # Sample new set trajectories
     T={τ0,τ1,,τnπθ}𝑇similar-tosubscript𝜏0subscript𝜏1subscript𝜏𝑛subscript𝜋𝜃T=\{\tau_{0},\tau_{1},...,\tau_{n}\sim\pi_{\theta}\}italic_T = { italic_τ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_τ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_τ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∼ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT }
     # Update the “good” and “bad” sets
     RB𝔼τT[R(τ)]στT[R(τ)]subscript𝑅𝐵subscript𝔼similar-to𝜏𝑇delimited-[]𝑅𝜏subscript𝜎similar-to𝜏𝑇delimited-[]𝑅𝜏R_{B}\leftarrow\mathbb{E}_{\tau\sim T}\left[R(\tau)\right]-\sigma_{\tau\sim T}% [R(\tau)]italic_R start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ← blackboard_E start_POSTSUBSCRIPT italic_τ ∼ italic_T end_POSTSUBSCRIPT [ italic_R ( italic_τ ) ] - italic_σ start_POSTSUBSCRIPT italic_τ ∼ italic_T end_POSTSUBSCRIPT [ italic_R ( italic_τ ) ], #σ𝜎\sigmaitalic_σ is the deviation
     ΩGΩG{τT|R(τ)RGC(τ)cmax}subscriptΩ𝐺subscriptΩ𝐺conditional-set𝜏𝑇𝑅𝜏subscript𝑅𝐺𝐶𝜏subscript𝑐𝑚𝑎𝑥{\Omega_{G}}\leftarrow{\Omega_{G}}\cup\left\{\tau\in T|R(\tau)\geq R_{G}\cap C% (\tau)\leq c_{max}\right\}roman_Ω start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT ← roman_Ω start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT ∪ { italic_τ ∈ italic_T | italic_R ( italic_τ ) ≥ italic_R start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT ∩ italic_C ( italic_τ ) ≤ italic_c start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT }
     ΩBΩB{τT|R(τ)<RBC(τ)>cmax}subscriptΩ𝐵subscriptΩ𝐵conditional-set𝜏𝑇𝑅𝜏expectationsubscript𝑅𝐵𝐶𝜏subscript𝑐𝑚𝑎𝑥{\Omega_{B}}\leftarrow{\Omega_{B}}\cup\left\{\tau\in T|R(\tau)<R_{B}\cup C(% \tau)>c_{max}\right\}roman_Ω start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ← roman_Ω start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ∪ { italic_τ ∈ italic_T | italic_R ( italic_τ ) < italic_R start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ∪ italic_C ( italic_τ ) > italic_c start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT }
     # Update Kwsubscript𝐾𝑤K_{w}italic_K start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT
     $w \leftarrow w + \kappa_w \nabla_w J(K_w, \pi_\theta)$
     where $J(K_w,\pi_\theta)=\mathbb{E}_{\Omega^B}[\ln(K_w(s,a))]+\frac{1}{2}\mathbb{E}_{T}[\ln(1-K_w(s,a))]+\frac{1}{2}\mathbb{E}_{\Omega^G}[\ln(1-K_w(s,a))]$
     # Update $\pi_\theta$
     $\theta = \theta+\kappa_\theta\,\mathbb{E}_{\tau\sim T}\left[\nabla_\theta\ln\pi_\theta(s,a)\,Q^K(s,a)\right]$
     where $Q^K(s,a)=\mathbb{E}_{T}\left[\sum_t\gamma^t\ln\frac{K_w(s_t,a_t)}{1-K_w(s_t,a_t)}\,\Big|\,s_0=s,\ a_0=a\right]$
  end while
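
The two update steps above can be illustrated with a small sketch. The Python snippet below is a minimal illustration, not the authors' implementation: it evaluates the discriminator objective $J(K_w,\pi_\theta)$ from sampled discriminator outputs and forms a single-rollout Monte Carlo estimate of $Q^K$. The function names, the plain-list inputs, and the clipping constant `eps` are assumptions introduced here; gradient computation and the network parameterization are left out.

```python
import math

def _mean_log(values, eps=1e-8):
    # Average of ln(v); v is clipped away from 0 for numerical safety
    # (the clipping constant eps is an assumption of this sketch).
    return sum(math.log(max(v, eps)) for v in values) / len(values)

def discriminator_objective(k_bad, k_policy, k_good):
    """J = E_bad[ln K] + 1/2 E_policy[ln(1 - K)] + 1/2 E_good[ln(1 - K)].

    Each argument is a list of discriminator outputs K_w(s, a) in (0, 1),
    evaluated on samples from Omega^B, the policy rollouts T, and Omega^G.
    """
    return (_mean_log(k_bad)
            + 0.5 * _mean_log([1.0 - k for k in k_policy])
            + 0.5 * _mean_log([1.0 - k for k in k_good]))

def q_estimate(k_along_trajectory, gamma):
    """Single-rollout estimate of Q^K(s_0, a_0): sum_t gamma^t ln(K / (1 - K)).

    Assumes every discriminator output lies strictly between 0 and 1.
    """
    return sum((gamma ** t) * math.log(k / (1.0 - k))
               for t, k in enumerate(k_along_trajectory))
```

An uninformative discriminator (all outputs equal to 0.5) yields a zero log-odds signal, so `q_estimate` returns 0 for any discount factor in that case.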

5 EXPERIMENTS

[Figure 2 panels. Columns: SafetyPointGoal, SafetyCarGoal, SafetyPointButton, SafetyCarButton, SafetyPointPush, SafetyCarPush. Rows: Return, Cost.]
Figure 2: Training curves for 6 different SafetyGym environments. Each line shows the mean over 6 independent seeds, and the shaded region shows the standard error.

We conduct experiments to compare our method against several state-of-the-art Constrained RL algorithms: FOCOPS (Zhang, Vuong, and Ross 2020), CUP (Yang et al. 2022), and CPO (Achiam et al. 2017). (Footnote 3: FOCOPS, CUP, and CPO implementations can be found at https://github.com/PKU-Alignment/omnisafe.) For the sake of completeness, we also include PPO-Lagrangian (Ray, Achiam, and Amodei 2019) and unconstrained PPO. We use PPO-Lagrangian to train our pre-trained policy and name our algorithm SIM, which stands for Self-IMitation based safe RL algorithm. Through the following experiments, we aim to address the following questions: (Q1) Does SIM outperform state-of-the-art constrained RL algorithms? (Q2) Is it necessary to use both good and bad demonstrations in training? (Q3) How does SIM compare to BC-based and GAIL-based algorithms? (Q4) How does SIM perform with different expertise levels of the initial policy $\pi^0$? Can it benefit from a policy that is not well trained?

We set the cost limit to $c_{\max}=18$. We, however, train the initial policy $\pi^0$ with a higher cost limit of $c_{\max}=28$, which allows us to generate more high-reward trajectories and more unsafe trajectories. We test our method on 6 SafetyGym environments (Ji et al. 2023). We also simplify the names of the environments, e.g., SafetyPointGoal1-v0 is renamed SafetyPointGoal.

5.1 SIM vs other Constrained RL methods on SafetyGym

We compare our algorithm with prior safe RL algorithms on six different SafetyGym environments (Ray, Achiam, and Amodei 2019; Ji et al. 2023). The learning curves are shown in Figure 2, where each experiment is repeated over 6 independent seeds. For the sake of comparison, we include unconstrained PPO. The key observations are as follows. In all 6 environments, including the very challenging ones (SafetyPointButton and SafetyCarButton), SIM achieves the best performance: it offers the highest expected rewards while remaining safe. Unconstrained PPO gives the highest rewards but is unsafe by a huge margin. Notably, in the last two push tasks, SIM achieves competitive or even higher rewards compared to unconstrained PPO. Overall, PPO-Lagrangian had the second-best performance.
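
The curves in Figure 2 aggregate per-seed runs into a mean and a standard error. A minimal sketch of that aggregation at a single evaluation point is given below. The sample values are hypothetical, and the use of the sample standard deviation (denominator $n-1$) is an assumption, since the text does not specify which estimator is used.

```python
import math
import statistics

def mean_and_standard_error(values):
    """Return (mean, standard error) of per-seed values at one training step."""
    n = len(values)
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 denominator) is an assumption here.
    std_err = statistics.stdev(values) / math.sqrt(n) if n > 1 else 0.0
    return mean, std_err

# Hypothetical episode returns from 6 seeds at a single evaluation point.
returns_per_seed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mean, std_err = mean_and_standard_error(returns_per_seed)
lower, upper = mean - std_err, mean + std_err  # bounds of the shaded band
```

Applying the same function at every logged training step gives the line and the shaded band for one algorithm in one environment.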

5.2 SIM vs GAIL, and the Importance of “Good” and “Bad” Demonstrations

[Figure 3 panels. Columns: SafetyCarButton, SafetyCarPush. Rows: Return, Cost.]
Figure 3: Comparisons with GAIL-based algorithms and other versions of SIM.

We aim to assess the importance of having both “good” and “bad” demonstrations in our IL-based approach, as well as to demonstrate the advantages of our non-adversarial method. To this end, we compare SIM with three versions of GAIL that use (i) only “good” demonstrations $\Omega^G$, (ii) only “bad” demonstrations $\Omega^B$, and (iii) both “good” and “bad” demonstrations, using two discriminators as described in Section 4. For the sake of comparison, we also include two versions of SIM that use only good demonstrations or only bad demonstrations, obtained by simply removing the “good” (or “bad”) part from (4). Figure 3 shows the comparisons on 2 SafetyGym environments. SIM clearly outperforms both the SIM versions with only good (or bad) demonstrations and the 3 GAIL versions, highlighting the importance of using both good and bad trajectories in training. Furthermore, our non-adversarial algorithm is highly stable and consistently returns high-reward and safe policies.

5.3 SIM vs Behavioral Cloning

As mentioned earlier, one can design an IL-based method for constrained RL based on (1). In this section, we compare our algorithm with a BC-based approach. We implement two BC algorithms: one is based on (BC) with only the “good” demonstration set, and the other is based on (1) with both sets. The comparison results are presented in Figure 4. Since the training curves of the two BC algorithms are not comparable with those of SIM, we only draw horizontal lines representing their expected reward and cost at convergence (their training curves are provided in the appendix).

[Figure 4 panels. Columns: SafetyPointGoal, SafetyCarGoal. Rows: Return, Cost.]
Figure 4: Comparison with BC-based algorithms.

In both environments, BC achieves the highest expected rewards but fails to satisfy the constraint. BC-GB yields either low-reward or unsafe policies. In contrast, SIM consistently achieves high rewards while satisfying the constraint in all the experiments.

5.4 Varying Expertise Level

In this section, we study how the amount of training of the pre-trained policy $\pi^0$ affects the performance of SIM. To this end, we trained the initial policy $\pi^0$ for varying numbers of environment steps: 10 million (1e7), 20 million (2e7), and 30 million (3e7) steps, which we term “entry-level”, “medium-level”, and “expert-level”, respectively. The comparison results are shown in Figure 5. SIM started from an “entry-level” or “medium-level” pre-trained policy achieves lower expected rewards than SIM started from the “expert-level” one. Nevertheless, with either the “entry-level” or “medium-level” $\pi^0$, SIM outperforms the original PPO-Lagrangian baseline, which was the second best among all the baselines.

[Figure 5 panels. Columns: SafetyCarButton, SafetyCarPush. Rows: Return, Cost.]
Figure 5: Comparison results for different expertise levels of the pre-trained policy.

These results demonstrate that, even without a well-trained initial policy, SIM is able to efficiently improve on it and outperform the traditional PPO-Lagrangian method. They also indicate that, for SIM to achieve its best performance, one should start with a well-trained initial policy. As mentioned previously, SIM benefits greatly from learning from good trajectories generated by a well-trained policy.

6 CONCLUSION

We introduced a novel framework to solve Constrained RL without relying on cost estimations or cost penalties, as commonly done in prior work. Our new algorithm, based on the idea of learning to mimic the behavior of good demonstrations and avoid bad demonstrations, is non-adversarial and allows the demonstration sets to evolve during the training process. Extensive experiments on several challenging benchmark tasks demonstrate that our approach achieves superior performance compared to prior constrained RL algorithms. Our IL-based framework opens new directions for addressing safe RL problems without explicitly considering the reward or cost function. Our algorithm relies on sets of good demonstrations generated by a pre-trained policy, so a limitation is that it will not work if it is difficult to generate feasible trajectories due to, for instance, strict constraints. A future direction would be to develop new IL-based algorithms to address such issues.

Acknowledgment

This research/project is supported by the National Research Foundation Singapore and DSO National Laboratories under the AI Singapore Programme (AISG Award No: AISG2-RP-2020-016).

References

  • Abbeel and Ng (2004) Abbeel, P.; and Ng, A. Y. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, 1.
  • Achiam et al. (2017) Achiam, J.; Held, D.; Tamar, A.; and Abbeel, P. 2017. Constrained policy optimization. In International conference on machine learning, 22–31. PMLR.
  • Altman (1999) Altman, E. 1999. Constrained Markov decision processes, volume 7. CRC press.
  • Bromley et al. (1993) Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; and Shah, R. 1993. Signature verification using a “siamese” time delay neural network. Advances in neural information processing systems, 6.
  • Chow et al. (2019) Chow, Y.; Nachum, O.; Faust, A.; Duenez-Guzman, E.; and Ghavamzadeh, M. 2019. Lyapunov-based safe policy optimization for continuous control. arXiv preprint arXiv:1901.10031.
  • Clark et al. (2020) Clark, K.; Luong, M.-T.; Le, Q. V.; and Manning, C. D. 2020. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.
  • Dadashi et al. (2020) Dadashi, R.; Hussenot, L.; Geist, M.; and Pietquin, O. 2020. Primal wasserstein imitation learning. arXiv preprint arXiv:2006.04678.
  • Firoiu, Whitney, and Tenenbaum (2017) Firoiu, V.; Whitney, W. F.; and Tenenbaum, J. B. 2017. Beating the world’s best at Super Smash Bros. with deep reinforcement learning. arXiv preprint arXiv:1702.06230.
  • Fu et al. (2021) Fu, H.; Tang, H.; Hao, J.; Chen, C.; Feng, X.; Li, D.; and Liu, W. 2021. Towards effective context for meta-reinforcement learning: an approach based on contrastive learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 7457–7465.
  • Fu, Luo, and Levine (2017) Fu, J.; Luo, K.; and Levine, S. 2017. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248.
  • Gao, Yao, and Chen (2021) Gao, T.; Yao, X.; and Chen, D. 2021. Simcse: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.
  • Garg et al. (2021) Garg, D.; Chakraborty, S.; Cundy, C.; Song, J.; and Ermon, S. 2021. Iq-learn: Inverse soft-q learning for imitation. Advances in Neural Information Processing Systems, 34: 4028–4039.
  • Grill et al. (2020) Grill, J.-B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; et al. 2020. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33: 21271–21284.
  • He et al. (2019) He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2019. Momentum Contrast for Unsupervised Visual Representation Learning. arXiv preprint arXiv:1911.05722.
  • Ho and Ermon (2016) Ho, J.; and Ermon, S. 2016. Generative adversarial imitation learning. Advances in neural information processing systems, 29.
  • Hoang, Dinh, and Nguyen (2023) Hoang, M.-H.; Dinh, L.; and Nguyen, H. 2023. Learning from Pixels with Expert Observations. arXiv preprint arXiv:2306.13872.
  • Ji et al. (2023) Ji, J.; Zhang, B.; Pan, X.; Zhou, J.; Dai, J.; and Yang, Y. 2023. Safety-Gymnasium. GitHub repository.
  • Kilinc and Montana (2022) Kilinc, O.; and Montana, G. 2022. Reinforcement learning for robotic manipulation using simulated locomotion demonstrations. Machine Learning, 1–22.
  • Kostrikov, Nachum, and Tompson (2019) Kostrikov, I.; Nachum, O.; and Tompson, J. 2019. Imitation learning via off-policy distribution matching. arXiv preprint arXiv:1912.05032.
  • Laskin, Srinivas, and Abbeel (2020) Laskin, M.; Srinivas, A.; and Abbeel, P. 2020. Curl: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, 5639–5650. PMLR.
  • Mnih et al. (2016) Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, 1928–1937. PMLR.
  • Ng, Russell et al. (2000) Ng, A. Y.; Russell, S.; et al. 2000. Algorithms for inverse reinforcement learning. In ICML, volume 1, 2.
  • Noroozi and Favaro (2016) Noroozi, M.; and Favaro, P. 2016. Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision, 69–84. Springer.
  • Raghu et al. (2017) Raghu, A.; Komorowski, M.; Ahmed, I.; Celi, L.; Szolovits, P.; and Ghassemi, M. 2017. Deep reinforcement learning for sepsis treatment. arXiv preprint arXiv:1711.09602.
  • Ray, Achiam, and Amodei (2019) Ray, A.; Achiam, J.; and Amodei, D. 2019. Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708, 7(1): 2.
  • Ross, Gordon, and Bagnell (2011) Ross, S.; Gordon, G.; and Bagnell, D. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, 627–635. JMLR Workshop and Conference Proceedings.
  • Satija, Amortila, and Pineau (2020) Satija, H.; Amortila, P.; and Pineau, J. 2020. Constrained markov decision processes via backward value functions. In International Conference on Machine Learning, 8502–8511. PMLR.
  • Schulman et al. (2017) Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • Weng et al. (2017) Weng, W.-H.; Gao, M.; He, Z.; Yan, S.; and Szolovits, P. 2017. Representation and reinforcement learning for personalized glycemic control in septic patients. arXiv preprint arXiv:1712.00654.
  • Xie et al. (2021) Xie, Z.; Liu, C.; Zhang, Y.; Lu, H.; Wang, D.; and Ding, Y. 2021. Adversarial and contrastive variational autoencoder for sequential recommendation. In Proceedings of the Web Conference 2021, 449–459.
  • Yang et al. (2022) Yang, L.; Ji, J.; Dai, J.; Zhang, L.; Zhou, B.; Li, P.; Yang, Y.; and Pan, G. 2022. Constrained update projection approach to safe policy optimization. Advances in Neural Information Processing Systems, 35: 9111–9124.
  • Yang et al. (2021) Yang, Q.; Simão, T. D.; Tindemans, S. H.; and Spaan, M. T. 2021. WCSAC: Worst-case soft actor critic for safety-constrained reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 10639–10646.
  • Zhang, Vuong, and Ross (2020) Zhang, Y.; Vuong, Q.; and Ross, K. 2020. First order constrained optimization in policy space. Advances in Neural Information Processing Systems, 33: 15338–15349.
  • Zhou et al. (2021) Zhou, C.; Ma, J.; Zhang, J.; Zhou, J.; and Yang, H. 2021. Contrastive learning for debiased candidate generation in large-scale recommender systems. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 3985–3995.

Appendix A Missing Proofs

A.1 Proof of Lemma 1

Lemma 1: For any $\lambda>0$, if there exists a policy $\pi^*$ such that $P_{\pi^*}(\tau)=0$ for all $\tau\in\Omega^B$, and

$$P_{\pi^*}(\tau)=\frac{P_{\pi^0}(\tau)}{\sum_{\tau'\in\Omega^G}P_{\pi^0}(\tau')};~\forall\tau\in\Omega^G$$

then $\pi^*$ is an optimal policy to (1).

Proof.

To simplify the proof, let us first prove the following result:

Lemma 3.

Given $\widehat{p}_0,\widehat{p}_1,\ldots,\widehat{p}_N\in[0,1]$ such that $\sum_{n=1}^{N}\widehat{p}_n\leq 1$, then vector $p^*$ such that $p^*_n=\frac{\widehat{p}_n}{\sum_{n'}\widehat{p}_{n'}}$ is a unique optimal solution to the following optimization problem

$$\max_{p\in[0,1]^N}\left\{f(p)=\sum_n\widehat{p}_n\ln p_n~\Big|~\sum_n p_n\leq 1\right\}\tag{5}$$
Proof.

We first see that the objective function $f(p)$ is strictly concave in $(0,1)^N$, implying that (5) always has a unique optimal solution. We write the Lagrangian of (5) as

$$\mathcal{L}(p,\eta)=\sum_n\widehat{p}_n\ln p_n-\eta\left(\sum_n p_n-1\right)$$

Let $\overline{p}$ be the optimal solution of (5) and $\overline{\eta}$ be its associated Lagrange multiplier. The KKT conditions imply that the following hold

$$\begin{cases}\frac{\partial\mathcal{L}(\overline{p},\eta)}{\partial p_n}=0,~\forall n=1,\ldots,N\\ \eta\left(\sum_n\overline{p}_n-1\right)=0\end{cases}$$

which is equivalent to

$$\begin{cases}\frac{\widehat{p}_1}{\overline{p}_1}=\frac{\widehat{p}_2}{\overline{p}_2}=\ldots=\frac{\widehat{p}_N}{\overline{p}_N}=\eta\\ \eta\left(\sum_n\overline{p}_n-1\right)=0\end{cases}$$

We then see that $\eta>0$, thus $\sum_n\overline{p}_n-1=0$. On the other hand, $\eta=\frac{\sum_n\widehat{p}_n}{\sum_n\overline{p}_n}=\sum_n\widehat{p}_n$, which implies that $\overline{p}_n=\frac{\widehat{p}_n}{\sum_{n'}\widehat{p}_{n'}}$. Thus $p^*=\overline{p}$ is the unique optimal solution to (5), as desired. ∎
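
The claim of Lemma 3 can be sanity-checked numerically. The sketch below is an illustration only: it compares the objective of (5) at the normalized vector $p^*$ against randomly sampled feasible points. The weights `p_hat`, the sampling scheme, and all function names are hypothetical choices made for this check; a finite random search supports the lemma but does not replace the proof.

```python
import math
import random

def f(p_hat, p):
    """Objective of (5): sum_n p_hat[n] * ln(p[n])."""
    return sum(ph * math.log(pn) for ph, pn in zip(p_hat, p) if ph > 0)

def normalized(p_hat):
    """Candidate maximizer p*_n = p_hat[n] / sum(p_hat)."""
    total = sum(p_hat)
    return [ph / total for ph in p_hat]

def random_simplex_point(n, rng):
    """Random strictly positive vector whose entries sum to 1."""
    raw = [rng.random() + 1e-6 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

rng = random.Random(0)
p_hat = [0.1, 0.25, 0.05, 0.3]  # hypothetical weights with sum <= 1
p_star = normalized(p_hat)
best = f(p_hat, p_star)

# Gap between f(p*) and f(q) for 1000 sampled feasible q; each should be >= 0.
gaps = [best - f(p_hat, random_simplex_point(len(p_hat), rng))
        for _ in range(1000)]
```

Every gap is non-negative because $f(p^*)-f(q)$ equals $\sum_n\widehat{p}_n$ times the KL divergence between $p^*$ and $q$.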

We now get back to the main proof. Recall that the objective function of the training with good and bad trajectories, under BC, is

$$F(\pi)=\lambda\sum_{\tau\in\Omega^G}P_{\pi^0}(\tau)\ln P_{\pi}(\tau)-(1-\lambda)\sum_{\tau\in\Omega^B}P_{\pi^0}(\tau)\,\phi(\ln P_{\pi}(\tau))$$

We now assume that the regularizer $\phi(\cdot)$ maps $(-\infty,0]$ to a finite interval $[a,b]$. Since $P_{\pi^*}(\tau)=0$ for all $\tau\in\Omega^B$, we have $P_{\pi}(\tau)\geq P_{\pi^*}(\tau)$ for any $\tau\in\Omega^B$, and since $\phi(\cdot)$ is monotone, $\phi(\ln P_{\pi}(\tau))\geq\phi(\ln P_{\pi^*}(\tau))$. Moreover, from Lemma 3, the first term of $F(\pi)$ can be bounded as

$$\sum_{\tau\in\Omega^G}P_{\pi^0}(\tau)\ln P_{\pi}(\tau)\leq\sum_{\tau\in\Omega^G}P_{\pi^0}(\tau)\ln P_{\pi^*}(\tau)$$

implying $F(\pi)\leq F(\pi^*)$ for any policy $\pi$. So, if $\pi^*$ exists, it will be optimal to (1). ∎

A.2 Proof of Proposition 1

Proposition 1: For any $\lambda>0$, let $\pi^0$ be a pre-trained feasible policy, $\mathbf{R}^E=\mathbb{E}_{\tau\sim\pi^0}\big[R(\tau)\big]$, and $\Omega^B$ be a collection of low-reward and high-cost trajectories

$$\Omega^B=\left\{\tau~\Big|~R(\tau)\leq\mathbf{R}^E,~C(\tau)>c_{\max}\right\}$$

Then the optimal policy mentioned in Lemma 1 is feasible to the cost constraint while offering a better expected reward than the pre-trained policy $\pi^0$, specifically,

$$\begin{aligned}\mathbb{E}_{\pi^*}\left[R(\tau)\right]-\mathbb{E}_{\pi^0}\left[R(\tau)\right]&=\frac{\sum_{\tau\in\Omega^B}P_{\pi^0}(\tau)(\mathbf{R}^E-R(\tau))}{1-P_{\pi^0}(\Omega^B)}\geq 0\\ \mathbb{E}_{\tau\sim\pi^*}\left[C(\tau)\right]&\leq c_{\max}\end{aligned}\tag{6}$$

where $P_{\pi^0}(\Omega^B)=\sum_{\tau\in\Omega^B}P_{\pi^0}(\tau)$.
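
To make the statement concrete, the sketch below checks Proposition 1 on a toy, fully enumerated trajectory distribution. All numbers are hypothetical. It assumes that $\Omega^G$ consists of every trajectory outside $\Omega^B$, so that $\pi^*$ is $\pi^0$ with the bad trajectories removed and the remaining mass renormalized, and it reads the numerator of the reward-gain formula as a sum over $\Omega^B$.

```python
def expectation(probs, values):
    """Expected value of `values` under the discrete distribution `probs`."""
    return sum(p * v for p, v in zip(probs, values))

# Hypothetical trajectories under pi^0: (probability, return R, cost C).
trajectories = [(0.4, 10.0, 10.0), (0.3, 6.0, 15.0),
                (0.2, 2.0, 30.0), (0.1, 12.0, 25.0)]
c_max = 18.0

probs = [p for p, _, _ in trajectories]
rewards = [r for _, r, _ in trajectories]
costs = [c for _, _, c in trajectories]

r_expert = expectation(probs, rewards)        # R^E, expected return of pi^0
assert expectation(probs, costs) <= c_max     # pi^0 is feasible

# Omega^B: trajectories with low reward and high cost.
is_bad = [r <= r_expert and c > c_max for r, c in zip(rewards, costs)]
p_bad = sum(p for p, bad in zip(probs, is_bad) if bad)

# pi^*: zero mass on Omega^B, renormalized mass elsewhere.
star_probs = [0.0 if bad else p / (1.0 - p_bad)
              for p, bad in zip(probs, is_bad)]

reward_gain = expectation(star_probs, rewards) - r_expert
rhs = sum(p * (r_expert - r)
          for p, r, bad in zip(probs, rewards, is_bad) if bad) / (1.0 - p_bad)
cost_star = expectation(star_probs, costs)
```

In this example only the third trajectory falls in $\Omega^B$. Removing it raises the expected return while keeping the expected cost below `c_max`, even though one remaining trajectory individually exceeds the cost limit.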

Proof.

Recall that $P_{\pi^{*}}(\tau) = 0$ for all $\tau\in\Omega^{B}$ and $P_{\pi^{*}}(\tau) = \frac{P_{\pi^{0}}(\tau)}{\sum_{\tau'\in\Omega^{G}} P_{\pi^{0}}(\tau')} = \frac{P_{\pi^{0}}(\tau)}{1 - P_{\pi^{0}}(\Omega^{B})}$ for all $\tau\in\Omega^{G}$. We write the expected reward under $\pi^{*}$ as

\begin{align*}
\mathbb{E}_{\tau\sim\pi^{*}}[R(\tau)] &= \sum_{\tau\in\Omega^{G}} P_{\pi^{*}}(\tau) R(\tau) = \frac{\sum_{\tau\in\Omega^{G}} P_{\pi^{0}}(\tau) R(\tau)}{1 - P_{\pi^{0}}(\Omega^{B})}
\end{align*}

Thus

\begin{align*}
\mathbb{E}_{\tau\sim\pi^{*}}[R(\tau)] - \mathbb{E}_{\tau\sim\pi^{0}}[R(\tau)]
&= \sum_{\tau\in\Omega^{G}} P_{\pi^{*}}(\tau) R(\tau) - \sum_{\tau} P_{\pi^{0}}(\tau) R(\tau) \\
&= \frac{\sum_{\tau\in\Omega^{G}} P_{\pi^{0}}(\tau) R(\tau) - \left(\sum_{\tau} P_{\pi^{0}}(\tau) R(\tau)\right)\left(1 - P_{\pi^{0}}(\Omega^{B})\right)}{1 - P_{\pi^{0}}(\Omega^{B})} \\
&= \frac{\sum_{\tau\in\Omega^{G}} P_{\pi^{0}}(\tau) R(\tau) - \sum_{\tau} P_{\pi^{0}}(\tau) R(\tau) + P_{\pi^{0}}(\Omega^{B})\,\mathbf{R}^{E}}{1 - P_{\pi^{0}}(\Omega^{B})} \\
&= \frac{\sum_{\tau\in\Omega^{B}} P_{\pi^{0}}(\tau)\,(\mathbf{R}^{E} - R(\tau))}{1 - P_{\pi^{0}}(\Omega^{B})} \stackrel{(a)}{\geq} 0
\end{align*}

where the third equality uses $\mathbf{R}^{E} = \sum_{\tau} P_{\pi^{0}}(\tau) R(\tau)$, and $(a)$ is due to the fact that $R(\tau) \leq \mathbf{R}^{E}$ for all $\tau\in\Omega^{B}$. We now consider the expected cost given by $\pi^{*}$. Letting $\mathbf{C}^{E} = \mathbb{E}_{\pi^{0}}[C(\tau)]$, we write

\begin{align*}
\mathbb{E}_{\pi^{*}}[C(\tau)] - \mathbf{C}^{E}
&= \sum_{\tau} P_{\pi^{*}}(\tau) C(\tau) - \mathbf{C}^{E} \\
&= \frac{\sum_{\tau\in\Omega^{G}} P_{\pi^{0}}(\tau) C(\tau) - \sum_{\tau} P_{\pi^{0}}(\tau) C(\tau) + \mathbf{C}^{E} P_{\pi^{0}}(\Omega^{B})}{1 - P_{\pi^{0}}(\Omega^{B})} \\
&= \frac{\mathbf{C}^{E} P_{\pi^{0}}(\Omega^{B}) - \sum_{\tau\in\Omega^{B}} P_{\pi^{0}}(\tau) C(\tau)}{1 - P_{\pi^{0}}(\Omega^{B})} \\
&\stackrel{(b)}{\leq} \frac{\mathbf{C}^{E} P_{\pi^{0}}(\Omega^{B}) - \sum_{\tau\in\Omega^{B}} P_{\pi^{0}}(\tau)\, c_{\max}}{1 - P_{\pi^{0}}(\Omega^{B})} = \frac{(\mathbf{C}^{E} - c_{\max})\, P_{\pi^{0}}(\Omega^{B})}{1 - P_{\pi^{0}}(\Omega^{B})} \stackrel{(c)}{\leq} 0
\end{align*}

where $(b)$ holds because $C(\tau) > c_{\max}$ for all $\tau\in\Omega^{B}$ (according to the way we choose $\Omega^{B}$), and $(c)$ holds because $\mathbf{C}^{E} \leq c_{\max}$ ($\pi^{0}$ is feasible w.r.t. the cost constraint). So, we have $\mathbb{E}_{\pi^{*}}[C(\tau)] \leq \mathbf{C}^{E} \leq c_{\max}$, implying that $\pi^{*}$ is safe, as desired. ∎
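
The two inequalities above can be checked numerically on a small example. The following is a minimal sketch, not the authors' implementation: the four-trajectory distribution, its returns and costs, and the threshold `c_max` are all hypothetical, and the bad set is taken to be $\{\tau \mid R(\tau) \leq \mathbf{R}^{E},\ C(\tau) > c_{\max}\}$, which is the form the proof relies on. Exact rational arithmetic is used so that the closed-form reward gain can be compared with the directly computed one without rounding error.

```python
from fractions import Fraction as F

# Hypothetical toy distribution over four trajectories under pi^0:
# (probability, return R, cost C). All numbers are illustrative only.
trajs = [
    (F(1, 10), F(1), F(6)),
    (F(2, 10), F(2), F(4)),
    (F(3, 10), F(5), F(1)),
    (F(4, 10), F(9), F(2)),
]
c_max = F(3)

R_E = sum(p * r for p, r, _ in trajs)  # expected return of pi^0
C_E = sum(p * c for p, _, c in trajs)  # expected cost of pi^0 (feasible here)

# Bad set as used in the proof: return at most R_E and cost above c_max.
bad = [r <= R_E and c > c_max for _, r, c in trajs]
p_bad = sum(p for (p, _, _), b in zip(trajs, bad) if b)

# pi^*: zero mass on bad trajectories, pi^0 renormalised over the good ones.
p_star = [F(0) if b else p / (1 - p_bad) for (p, _, _), b in zip(trajs, bad)]

R_star = sum(q * r for q, (_, r, _) in zip(p_star, trajs))
C_star = sum(q * c for q, (_, _, c) in zip(p_star, trajs))

# Reward gain computed directly, and via the closed form over the bad set.
gain = R_star - R_E
closed_form = sum(
    p * (R_E - r) for (p, r, _), b in zip(trajs, bad) if b
) / (1 - p_bad)

print(gain, closed_form, C_star, C_E)
```

In this example the direct and closed-form gains coincide and are non-negative, and the expected cost under the renormalised distribution does not exceed that of the initial policy.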

A.3 Proof of Lemma 2

Lemma 2: If $\lambda = 0$, any policy $\pi^{*}$ such that $P_{\pi^{*}}(\tau) = 0$ for all $\tau\in\Omega^{B}$ is optimal for (1).

Proof.

This follows directly: if $\lambda = 0$, then the objective function becomes $F(\pi) = -\sum_{\tau\in\Omega^{B}} P_{\pi^{0}}(\tau)\,\phi(\ln P_{\pi}(\tau))$. Since $\phi(\cdot)$ is monotone, $F(\pi) \geq F(\pi^{*})$ for any policy $\pi$. ∎

A.4 Proof of Proposition 2

Proposition 2: If $\lambda = 0$ and the bad set $\Omega^{B}$ is selected in the same manner as in Theorem 1, then the optimal policy $\pi^{*}$ from Lemma 2 is feasible, but it does not necessarily provide a higher expected reward than the policy $\pi^{0}$.

Proof.
Figure 6: Example

According to Lemma 2, any $\pi^{*}$ such that $P_{\pi^{*}}(\tau) = 0$ for all $\tau\in\Omega^{B}$ is optimal for (1). To prove that $\pi^{*}$ may not offer a higher expected reward than $\pi^{0}$, we use the counter-example shown in Figure 6. There are 5 states and the MDP is deterministic. The rewards and costs are set as $r(s_{0}) = 0$, $r(s_{4}) = 0$, $r(s_{1}) = 2$, $r(s_{2}) = 3$, $r(s_{3}) = 8$, and $d(s_{0}) = 0$, $d(s_{4}) = 0$, $d(s_{1}) = 5$, $d(s_{2}) = d(s_{3}) = 1$. The initial policy is set as $\pi^{0}(a_{1}|s_{0}) = 1/5$, $\pi^{0}(a_{2}|s_{0}) = 1/5$ and $\pi^{0}(a_{3}|s_{0}) = 3/5$. We also choose $c_{\max} = 2$. The expected reward is $\mathbf{R}^{E} = \tfrac{1}{5}\cdot 2 + \tfrac{1}{5}\cdot 3 + \tfrac{3}{5}\cdot 8 = 5.8$ and the expected cost is $\mathbf{C}^{E} = \tfrac{1}{5}\cdot 5 + \tfrac{1}{5}\cdot 1 + \tfrac{3}{5}\cdot 1 = 1.8$. It is then clear that the trajectory $\{s_{0}, s_{1}, s_{4}\}$ should be classified in the bad set. The policy $\pi^{*}$ such that $\pi^{*}(a_{1}|s_{0}) = 0$, $\pi^{*}(a_{2}|s_{0}) = 4/5$ and $\pi^{*}(a_{3}|s_{0}) = 1/5$ is optimal for (1), according to Lemma 2. However, $\mathbb{E}_{\pi^{*}}[R(\tau)] = \tfrac{4}{5}\cdot 3 + \tfrac{1}{5}\cdot 8 = 4 < \mathbf{R}^{E}$, implying that $\pi^{*}$ offers a worse expected reward than the initial policy $\pi^{0}$. This completes the proof. ∎
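
The arithmetic of the counter-example can be reproduced with a few lines of code. The sketch below rests on an assumption about the figure, which is not reproduced here: from $s_{0}$, action $a_{i}$ leads deterministically to $s_{i}$, and every $s_{i}$ then moves to the terminal state $s_{4}$. The rewards, costs, policies and $c_{\max}$ are the ones given in the text; the dictionary layout and helper name are illustrative.

```python
from fractions import Fraction as F

# Assumed transition structure (the figure is not reproduced here):
# s0 --a_i--> s_i --> s4, so each trajectory is identified by its middle state.
reward = {"s1": F(2), "s2": F(3), "s3": F(8)}  # r(s0) = r(s4) = 0
cost = {"s1": F(5), "s2": F(1), "s3": F(1)}    # d(s0) = d(s4) = 0
c_max = F(2)

pi0 = {"s1": F(1, 5), "s2": F(1, 5), "s3": F(3, 5)}      # initial policy at s0
pi_star = {"s1": F(0), "s2": F(4, 5), "s3": F(1, 5)}     # avoids the bad trajectory


def expected(policy, values):
    """Expected trajectory value when the policy picks the middle state."""
    return sum(policy[s] * values[s] for s in policy)


R_E = expected(pi0, reward)
C_E = expected(pi0, cost)

# Bad set: return at most R_E and cost above c_max.
bad = [s for s in pi0 if reward[s] <= R_E and cost[s] > c_max]

R_star = expected(pi_star, reward)
C_star = expected(pi_star, cost)

print("bad:", bad)
print("pi0:    reward", R_E, "cost", C_E)
print("pi_star: reward", R_star, "cost", C_star)
```

Under this reading, only the trajectory through $s_{1}$ is bad, $\pi^{*}$ satisfies the cost constraint, and its expected reward of 4 falls below that of $\pi^{0}$, which is the point of the counter-example.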

A.5 Proof of Proposition 3

Proposition 3: The following hold:

  • (i)

    If we select the bad set as $\Omega^{B} = \left\{\tau \mid R(\tau) \leq \mathbf{R}^{E},\ C(\tau) > \mathbf{C}^{E}\right\}$, then it is guaranteed that $\pi^{*}$ offers a higher (or equal) expected reward and lower (or equal) expected cost than $\pi^{0}$, where $\mathbf{C}^{E} = \mathbb{E}_{\tau\sim\pi^{0}}[C(\tau)]$.

  • (ii)

    If the pre-trained policy $\pi^{0}$ is not feasible and we select the bad set as $\Omega^{B} = \left\{\tau \mid C(\tau) > c_{\max}\right\}$, then it is guaranteed that $\pi^{*}$ is feasible.

  • (iii)

    If the cost function is not accessible, but there is an oracle that can tell us which trajectories violate the constraint, then by selecting $\Omega^{B} = \left\{\tau \mid \tau \text{ violates the constraint}\right\}$, $\pi^{*}$ is feasible.

Proof.

The proof is similar to the proof of Proposition 1. For (i), we also write

\begin{align*}
\mathbb{E}_{\pi^{*}}[R(\tau)] - \mathbf{R}^{E} &= \frac{\sum_{\tau\in\Omega^{B}} P_{\pi^{0}}(\tau)\,(\mathbf{R}^{E} - R(\tau))}{1 - P_{\pi^{0}}(\Omega^{B})} \\
\mathbb{E}_{\pi^{*}}[C(\tau)] - \mathbf{C}^{E} &= \frac{\sum_{\tau\in\Omega^{B}} (\mathbf{C}^{E} - C(\tau))\,P_{\pi^{0}}(\tau)}{1 - P_{\pi^{0}}(\Omega^{B})}
\end{align*}

Then, according to the way we select $\Omega^{B}$ in (i), we have $\mathbb{E}_{\pi^{*}}[R(\tau)] \geq \mathbf{R}^{E}$ and $\mathbb{E}_{\pi^{*}}[C(\tau)] \leq \mathbf{C}^{E}$, implying that $\pi^{*}$ yields a higher (or equal) expected reward and a lower (or equal) expected cost compared to $\pi^{0}$.

For (ii), since $\Omega^{B}$ contains every trajectory with $C(\tau) > c_{\max}$, we have $C(\tau) \leq c_{\max}$ for all $\tau\in\Omega^{G}$. We write the expected cost under $\pi^{*}$ as

\begin{align*}
\mathbb{E}_{\pi^{*}}[C(\tau)] &= \frac{\sum_{\tau\in\Omega^{G}} P_{\pi^{0}}(\tau) C(\tau)}{1 - P_{\pi^{0}}(\Omega^{B})} \leq \frac{\sum_{\tau\in\Omega^{G}} P_{\pi^{0}}(\tau)\, c_{\max}}{1 - P_{\pi^{0}}(\Omega^{B})} = c_{\max}
\end{align*}

So, $\pi^{*}$ is safe.

Claim (iii) is the same as (ii), in the sense that the oracle can correctly select the bad set $\Omega^{B} = \{\tau \mid C(\tau) > c_{\max}\}$. Thus, the policy $\pi^{*}$, if it exists, will be safe. ∎
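
Claims (ii) and (iii) can be illustrated with a short sketch in which the learner never reads the cost values and only queries a violation oracle. Everything below is hypothetical: the three-trajectory distribution, the hidden costs, the threshold and the `oracle` and `avoid_bad` helpers are made up for illustration. The example also makes the "if it exists" caveat concrete, since renormalisation is impossible when every trajectory with positive probability is bad.

```python
from fractions import Fraction as F

# Hypothetical trajectory distribution under an infeasible pi^0.
probs = [F(1, 2), F(1, 4), F(1, 4)]
hidden_costs = [F(10), F(2), F(1)]  # used only by the oracle and for checking
c_max = F(3)


def oracle(i):
    """Hypothetical violation oracle: reports whether trajectory i is unsafe."""
    return hidden_costs[i] > c_max


def avoid_bad(probs, bad):
    """Put zero mass on bad trajectories and renormalise over the rest."""
    p_bad = sum(p for i, p in enumerate(probs) if i in bad)
    if p_bad == 1:
        raise ValueError("no good trajectory has positive probability")
    return [F(0) if i in bad else p / (1 - p_bad) for i, p in enumerate(probs)]


bad = {i for i in range(len(probs)) if oracle(i)}
p_star = avoid_bad(probs, bad)

C_E = sum(p * c for p, c in zip(probs, hidden_costs))       # cost of pi^0
C_star = sum(p * c for p, c in zip(p_star, hidden_costs))   # cost of pi^*

print("pi^0 cost:", C_E, "| pi^* cost:", C_star, "| c_max:", c_max)
```

Here the initial expected cost exceeds the threshold, while the renormalised distribution places mass only on trajectories the oracle accepts, so its expected cost is at most the threshold.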

A.6 Proof of Proposition 4

Proposition 4: The maximization in (4) is achieved at $K^{*}(s,a)$ such that

\[
\ln\left(\frac{K^{*}(s,a)}{1 - K^{*}(s,a)}\right) = \ln\frac{\rho^{B}(s,a)}{\rho^{\pi,G}(s,a)}
\]
Proof.

We first look at the training objective of $K(s,a)$ and write

\begin{align}
J(K,\pi) &= \mathbb{E}_{\rho^{B}}[\ln(K(s,a))] + \frac{1}{2}\mathbb{E}_{\rho^{\pi}}[\ln(1 - K(s,a))] + \frac{1}{2}\mathbb{E}_{\rho^{G}}[\ln(1 - K(s,a))] \notag \\
&= \mathbb{E}_{\rho^{B}}[\ln(K(s,a))] + \sum_{(s,a)} \ln(1 - K(s,a))\,\frac{\rho^{\pi}(s,a) + \rho^{G}(s,a)}{2} \notag \\
&= \mathbb{E}_{\rho^{B}}[\ln(K(s,a))] + \mathbb{E}_{\rho^{\pi,G}}[\ln(1 - K(s,a))] \notag \\
&= \sum_{(s,a)} \Big( \ln(K(s,a))\,\rho^{B}(s,a) + \ln(1 - K(s,a))\,\rho^{\pi,G}(s,a) \Big) \tag{7}
\end{align}

where $\rho^{\pi,G}(s,a) = \frac{\rho^{\pi}(s,a) + \rho^{G}(s,a)}{2}$.

So, to maximize $J(K,\pi)$, each component $\ln(K(s,a))\,\rho^{B}(s,a) + \ln(1 - K(s,a))\,\rho^{\pi,G}(s,a)$ needs to be maximized. To study this maximization problem, we consider the following simple optimization problem $\max_{x\in(0,1)}\{f(x) = \ln(x)\,a + \ln(1-x)\,b\}$, where $a, b \geq 0$. We first see that $f'(x) = \frac{a}{x} - \frac{b}{1-x}$. Thus, if we set $f'(x) = 0$, this equation has a unique solution $x^{*} = \frac{a}{a+b}$. Moreover, $f'(x) \geq 0$ if $x \leq x^{*}$ and $f'(x) \leq 0$ if $x \geq x^{*}$, so $f$ increases up to $x^{*}$ and decreases afterwards. Thus $x^{*}$ is the unique solution to $\max_{x\in(0,1)}\{f(x) = \ln(x)\,a + \ln(1-x)\,b\}$.

We now get back to the maximization

$\max_{K}\left\{\ln(K(s,a))\rho^{B}(s,a)+\ln(1-K(s,a))\rho^{\pi,G}(s,a)\right\}$ (8)

From the above small problem, we know that (8) has a unique optimal solution $K^{*}(s,a)$ such that

$K^{*}(s,a)=\frac{\rho^{B}(s,a)}{\rho^{B}(s,a)+\rho^{\pi,G}(s,a)}$

implying

$\frac{K^{*}(s,a)}{1-K^{*}(s,a)}=\frac{\rho^{B}(s,a)}{\rho^{\pi,G}(s,a)},$

as desired. ∎
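
As a quick numerical sanity check of the closed form above, the following sketch maximizes $f(x)=a\ln(x)+b\ln(1-x)$ on a grid and compares the result with $a/(a+b)$. The values of `rho_bad` and `rho_good` stand in for $\rho^{B}(s,a)$ and $\rho^{\pi,G}(s,a)$ at a single state-action pair; they are illustrative assumptions, not values from the paper.

```python
import math


def f(x: float, a: float, b: float) -> float:
    """Per-(s, a) objective: a * ln(x) + b * ln(1 - x)."""
    return a * math.log(x) + b * math.log(1.0 - x)


def grid_argmax(a: float, b: float, n: int = 100_000) -> float:
    """Brute-force maximizer of f over an open grid in (0, 1)."""
    grid = (i / n for i in range(1, n))
    return max(grid, key=lambda x: f(x, a, b))


# Illustrative occupancy values (assumptions, not from the paper).
rho_bad, rho_good = 0.3, 0.1

k_star = rho_bad / (rho_bad + rho_good)  # closed form: a / (a + b)
k_grid = grid_argmax(rho_bad, rho_good)  # numerical maximizer

print(k_star, k_grid)
# The odds ratio K* / (1 - K*) should match rho_bad / rho_good.
print(k_star / (1.0 - k_star), rho_bad / rho_good)
```

The grid search is only a check; it is not how the classifier $K$ would be trained in practice.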

Appendix B Additional Details

B.1 Method Overview

Figure 7 shows a detailed diagram of our algorithm, SIM.

Figure 7: Overview of SIM

B.2 Additional Settings for the Experiment with Varying Expertise Level

We provide additional details for Section 5.4 (Varying Expertise Level) in the main paper. Table 2 shows the expected rewards and the chosen thresholds $R_G$ (those for selecting the good trajectories) for the three expertise levels of $\pi^{0}$.

Steps   SafetyCarButton                  SafetyCarPush
        Expected return   $R_G$          Expected return   $R_G$
1e7     5.3               7.0            2.9               3.0
2e7     8.92              9.0            5.07              5.0
3e7     14.4              15.0           6.85              8.0
Table 2: Expected rewards and thresholds $R_G$ for different expertise levels of $\pi^{0}$.

B.3 Relaxed Constraints

We provide a more detailed explanation of why relaxing the constraints is beneficial for the training of $\pi^{0}$. In practical scenarios, enforcing strict constraints on trajectory generation may hinder the achievement of good trajectories due to exploration challenges and limitations in obtaining high rewards. Conversely, adopting a more relaxed constraint (i.e., a higher $c_{max}$) could lead to higher returns, but it might also reduce the chances of satisfying the strict constraint. To address this, we initiate the training process with relaxed constraints that allow us to generate a better set of good trajectories (Figure 8 illustrates an advantage of using a relaxed-constraint initial policy). Our experiments demonstrate the advantages of employing relaxed constraints for the algorithm’s final performance.

Figure 8: Although a significant number of trajectories do not satisfy the constraints (red lines), the relaxed-constraint setting is still able to offer a considerable number of good trajectories (green lines).
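
The effect described above can be sketched as a simple filter: trajectories are collected from a policy trained under a relaxed budget, and only those that also meet the strict budget $c_{max}$ and a return threshold $R_G$ are kept as good. The `Trajectory` record, the function name, and all numbers are hypothetical. The selection rule (return at least $R_G$, cost at most $c_{max}$) is assumed here as the expected-cost counterpart of the criterion given in Section C.5.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    ret: float   # accumulated reward R(tau)
    cost: float  # accumulated cost C(tau)


def select_good(trajs: List[Trajectory], r_good: float, c_max: float) -> List[Trajectory]:
    """Keep trajectories with return >= r_good and cost <= c_max (assumed rule)."""
    return [t for t in trajs if t.ret >= r_good and t.cost <= c_max]


# Hypothetical rollouts from a policy trained with a relaxed cost budget.
rollouts = [
    Trajectory(ret=12.0, cost=40.0),  # high return, violates the strict budget
    Trajectory(ret=11.5, cost=20.0),  # high return, satisfies the strict budget
    Trajectory(ret=4.0, cost=10.0),   # safe but low return
    Trajectory(ret=10.2, cost=24.0),  # high return, satisfies the strict budget
]

good = select_good(rollouts, r_good=10.0, c_max=25.0)
print(len(good), "of", len(rollouts), "trajectories are good")
```

Even though some rollouts exceed the strict budget, the remaining high-return ones can still seed the set of good trajectories.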

B.4 Environmental Details

Safety-gym

The Safety-gym benchmark (Ray, Achiam, and Amodei 2019) has emerged as a highly challenging benchmark for Constrained RL. Previous research mostly focused on the easiest environment, SafetyPointGoal, with some providing results for even simpler variations (Yang et al. 2021). In contrast, we conducted comprehensive experiments, exploring all six challenging environments within this benchmark. These environments are illustrated in Figure 9 below.

Figure 9: Six different environments in Safety-Gym (panels: SafetyPointGoal, SafetyPointButton, SafetyPointPush, SafetyCarGoal, SafetyCarButton, SafetyCarPush).

In the first pair of environments, SafetyPointGoal and SafetyCarGoal, the agent’s primary objective is to reach the designated goal position, represented by the green area in the visuals. This must be accomplished with skillful navigation to avoid both hazardous areas (blue regions) and obstacles (cyan blocks). The SafetyPointGoal task features a point agent, which is relatively easier to control, allowing for efficient training. On the other hand, the SafetyCarGoal task poses a greater challenge due to the more demanding control requirements of the car agent.

Moving on to the next set of environments, SafetyPointButton and SafetyCarButton, the agent encounters a fresh set of challenges. In SafetyPointButton, the primary goal is to navigate to the correct button, indicated by the green button, while avoiding incorrect buttons and hazardous areas (blue regions) and maneuvering around moving obstacles (purple blocks). The SafetyCarButton environment shares a similar objective, but with the moving obstacles removed to reduce training difficulty. Despite this adjustment, controlling the car agent remains challenging.

Lastly, in the final pair of environments, SafetyPointPush and SafetyCarPush, the agent’s main task is to push the yellow block to the goal area (green region) while evading hazard areas (blue regions) and a blocking pillar (dark-blue cylinder), which is added to increase the task difficulty. As in the button tasks, the pillar is removed for the car agent to ease the task difficulty.

Mujoco Circle

The Mujoco Circle task, developed by Achiam et al. (2017), involves agents moving along a circle centered at the origin. However, the agent is constrained to remain within a safety region (the green area) that is smaller than the radius of the circle. To further challenge the agent, two walls are introduced that hinder its ability to move freely. Compared to the Safety-Gym environments, these tasks are considered less difficult because there is no randomness in the constraints imposed on the agent. The constraints are well-defined and consistent throughout the task.

Figure 10: Mujoco Circle (panels: SafetyPointCircle, SafetyCarCircle).

To evaluate the performance of different agents under increasing difficulty, two types of agents are tested: Point and Car. Each agent faces the same task but with varying degrees of complexity. The Point agent is presumably the easier to control, while the Car agent poses a higher level of difficulty due to more demanding control requirements. The illustration is in Figure 10.

By conducting experiments with these agents in the Mujoco Circle task, we can gain valuable insights into the agents’ abilities to navigate the circular environment while adhering to the constraints, allowing for a comparative analysis of their performance under increasing difficulty levels.

Mujoco-velocity

Figure 11: Mujoco Velocity (panels: SafetyAntVelocity, SafetyHalfCheetahVelocity).

We also test our algorithm with the Mujoco Velocity domains. MuJoCo is an advanced framework specialized in simulating intricate physical systems that feature multi-joint mechanisms and interactions. A key aspect of MuJoCo’s capabilities involves its integration of velocity constraints. In our experiments, these constraints play a crucial role as we impose specific velocity limits on the agent’s movements. This action allows us to exert significant control over the motion of articulated entities within the simulation, effectively replicating real-world constraints and behaviors. The illustration is in Figure 11.

It’s worth noting that in our experimental setups, the MuJoCo environments that emphasize velocity demonstrate a relatively lower level of challenge due to the absence of external obstacles and random elements. Furthermore, achieving a high performance score doesn’t solely rely on achieving high velocity. As a result, all algorithms tested within this context exhibit impressive learning capabilities.

Appendix C Additional Experiments

In this section, we provide experiments to answer 5 additional questions:

  • (Q5)

    Can SIM provide a high-reward and safe policy using a relaxed-constraint expert?

  • (Q6)

    What happens if the cost function is inaccessible?

  • (Q7)

    Would an unconstrained problem benefit from our approach?

  • (Q8)

    Would our approach work with CVaR constrained problems (Yang et al. 2021)?

  • (Q9)

    Does the number of initial good trajectories impact the final performance?

C.1 Hyper-parameter selection

We conducted all experiments on a total of 4 NVIDIA RTX A5000 GPUs and 96 core CPUs. The detailed hyper-parameters are reported in Table 3.

Hyper Parameter             Safety-gym                                Mujoco-circle                             Mujoco-velocity
Actor Network               [256, 256, 256]                           [256, 256, 256]                           [64, 64]
Critic Network              [256, 256, 256]                           [256, 256, 256]                           [64, 64]
Cost Critic Network         [256, 256, 256]                           [256, 256, 256]                           [64, 64]
Classifier Network          [100, 100, 100]                           [100, 100]                                [100, 100]
Gamma                       0.99                                      0.99                                      0.99
lr actor                    0.0001                                    0.0001                                    0.0003
lr Critic                   0.0001                                    0.0001                                    0.0001
lr Cost Critic              0.0001                                    0.0001                                    0.0001
lr Classifier               0.01                                      0.01                                      0.01
lr Penalty                  0.01                                      0.01                                      0.01
max KL                      0.05                                      0.05                                      0.2
max iteration per update    80                                        80                                        120
buffer size                 50,000                                    50,000                                    20,000
max episode length          1,000                                     500                                       1,000
Classifier batch size       4,096                                     4,096                                     4,096
$R_G$ (fixed)               $\mathbb{E}_{\tau\sim\pi^{E}}[R(\tau)]$   $\mathbb{E}_{\tau\sim\pi^{E}}[R(\tau)]$   $\mathbb{E}_{\tau\sim\pi^{E}}[R(\tau)]$
max $R_B$                   $\max(R_G/2,\,R_G-5.0)$                   $\max(R_G/2,\,R_G-10.0)$                  $\max(R_G/2,\,R_G-1000.0)$
Table 3: Hyper parameters.
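
The last row of Table 3 can be read as a small helper. The sketch below evaluates $\max(R_G/2,\,R_G-m)$ with the margin $m$ taken from the table for each benchmark family; the function and dictionary names are illustrative and not part of the paper.

```python
# Margins m from the "max R_B" row of Table 3.
MARGIN = {
    "Safety-gym": 5.0,
    "Mujoco-circle": 10.0,
    "Mujoco-velocity": 1000.0,
}


def max_bad_threshold(r_good: float, benchmark: str) -> float:
    """Upper limit on R_B: max(R_G / 2, R_G - margin)."""
    return max(r_good / 2.0, r_good - MARGIN[benchmark])


# With a large R_G the subtractive term dominates; with a small R_G the
# halving term dominates.
print(max_bad_threshold(15.0, "Safety-gym"))
print(max_bad_threshold(7.0, "Safety-gym"))
```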

Moreover, to enhance stability, we use a Chi-square function $\phi(x)=x-\frac{1}{a}x^{2}$ to regularize the loss function in (4):

$\max_{K:S\times A\rightarrow(0,1)}\Big\{J(K,\pi):=\mathbb{E}_{\rho^{B}}\big[-K(s,a)-\frac{1}{a}K(s,a)^{2}\big]+\frac{1}{2}\mathbb{E}_{\rho^{\pi}}\big[K(s,a)-\frac{1}{a}K(s,a)^{2}\big]+\frac{1}{2}\mathbb{E}_{\rho^{G}}\big[K(s,a)-\frac{1}{a}K(s,a)^{2}\big]\Big\}$ (9)

C.2 Training Curves of BC and BC-GB

Figure 12: Training Curves of BC and BC-GB (rows: Return, Cost; columns: SafetyPointGoal, SafetyCarGoal).

Figure 12 shows the training curves of the BC and BC-GB approaches. This supplements our experimental results in Section 5.3, where we compare SIM against BC-based approaches.

C.3 Unknown Cost

Figure 13: Results for the unknown-cost scenario (rows: Return, Cost; columns: SafetyPointGoal, SafetyCarPush).

To answer Q6 (what happens if the cost function is inaccessible?), we demonstrate the capability of our method in handling the situation where the cost function is unknown. In this setting, we assume that there is an oracle telling us which trajectories violate the constraint (i.e., the accumulated cost is greater than $c_{max}$). In this scenario, other constrained RL algorithms do not apply, as they all rely on the cost function. On the other hand, our algorithm utilizes the identification of good and bad trajectories. Hence, it can be employed directly with the oracle’s assistance. However, it’s worth noting that in this situation, the oracle only aids in identifying bad trajectories, and access to good trajectories might be more limited than in scenarios with a known cost function. We test our method on two Safety-Gym environments: SafetyPointGoal and SafetyCarPush. Since other constrained RL algorithms cannot be used, we just compare our algorithm with PPO-unconstraint and PPO-Lag, where the latter still works with the cost function. We use PPO-unconstraint to train the initial policy $\pi^{0}$. The results shown in Figure 13 indicate that SIM is able to give safe policies while offering competitive expected rewards.

The ability to work with an unknown-cost setting to improve the safety of unconstrained policies would be valuable in many real-life situations where costs might be difficult or even impossible to obtain. This enhanced adaptability opens up new opportunities for applying RL in real-world settings.
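
A minimal sketch of how labeling could work in this setting is given below. The learner never reads a cost value; it only queries a black-box `violates` callable. Treating unflagged trajectories with return at least $R_G$ as good is an assumption of this sketch, and the trajectory records, names, and numbers are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Trajectory = Dict[str, float]  # hypothetical record with "id" and "ret" fields


def label_with_oracle(
    trajs: List[Trajectory],
    violates: Callable[[Trajectory], bool],
    r_good: float,
) -> Tuple[List[Trajectory], List[Trajectory]]:
    """Split trajectories into (good, bad) using only a violation oracle."""
    good, bad = [], []
    for traj in trajs:
        if violates(traj):
            bad.append(traj)   # the oracle flags a constraint violation
        elif traj["ret"] >= r_good:
            good.append(traj)  # assumed rule: unflagged and high return
    return good, bad


# Stand-in oracle for the demo: it reads a hidden cost the learner never sees.
hidden_cost = {0: 30.0, 1: 12.0, 2: 5.0}
demo = [{"id": 0, "ret": 11.0}, {"id": 1, "ret": 10.5}, {"id": 2, "ret": 3.0}]

good, bad = label_with_oracle(demo, lambda t: hidden_cost[t["id"]] > 25.0, r_good=10.0)
print([t["id"] for t in good], [t["id"] for t in bad])
```

Trajectories that are neither flagged nor high-return stay unlabeled in this sketch.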

C.4 Enhancing Unconstrained Agent

Figure 14: Comparison results for unconstrained tasks (Return; panels: SafetyPointPush, SafetyCarPush).

In this experiment, we want to answer Q7 (would an unconstrained problem benefit from our approach?). We aim to see if SIM can improve the quality of a policy trained by an unconstrained RL algorithm (e.g., PPO-unconstraint). To this end, we choose two Safety-Gym environments, SafetyPointPush and SafetyCarPush, and set the threshold $R_G$ for them to $11.0$ and $8.0$, respectively. The comparison results are shown in Figure 14, which shows that our algorithm was successful in significantly improving PPO under such unconstrained settings. In fact, by removing the constraint, our algorithm was able to focus solely on maximizing the reward without worrying about the costs associated with its actions. As a result, it could explore more freely and achieve better results in challenging scenarios.

Overall, this experiment showcases the efficacy of our approach in enhancing the performance of an unconstrained agent, especially in tasks where achieving high returns is challenging.

C.5 Conditional Value at Risk

Figure 15: Comparison results with CVaR constraints (rows: Return, Cost; columns: SafetyPointGoal, SafetyCarGoal).

So far, we have focused on expected cost constraints. In this section, we expand our experiments to CVaR constraints (Yang et al. 2021) to address Q8 (would our approach work with CVaR constrained problems?). We implemented a SIM version that works with CVaR constraints by using the following criteria to select good trajectories: $R(\tau)\geq R_G$ and $C(\tau)+\alpha^{-1}\phi(\Phi^{-1}(\alpha))\sigma(C)\leq c_{max}$. Here, $\alpha=0.5$ represents the risk level, $\phi$ and $\Phi$ denote the probability density function (PDF) and cumulative distribution function (CDF) of the standard normal distribution, respectively, and $\sigma(C)$ is the standard deviation of the cost of the collected trajectories. We compare our approach with WC-SAC (Yang et al. 2021), a state-of-the-art CVaR constrained RL algorithm. Additionally, we implement a PPO version with CVaR constraints (denoted as PPO-CVaR) for the sake of comparison.
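
The selection rule above can be sketched with Python's standard library, which provides the standard normal PDF and inverse CDF through `statistics.NormalDist`. The function names and the sample returns and costs are hypothetical, and the use of the population standard deviation for $\sigma(C)$ is an assumption, since the text does not say which estimator is used.

```python
from statistics import NormalDist, pstdev
from typing import List, Sequence


def cvar_coefficient(alpha: float) -> float:
    """alpha^{-1} * phi(Phi^{-1}(alpha)) for the standard normal distribution."""
    std_normal = NormalDist()
    return std_normal.pdf(std_normal.inv_cdf(alpha)) / alpha


def select_good_cvar(
    returns: Sequence[float],
    costs: Sequence[float],
    r_good: float,
    c_max: float,
    alpha: float = 0.5,
) -> List[int]:
    """Indices of trajectories passing both the return and CVaR-style cost tests."""
    sigma = pstdev(costs)  # assumption: population std over collected costs
    shift = cvar_coefficient(alpha) * sigma
    return [
        i
        for i, (r, c) in enumerate(zip(returns, costs))
        if r >= r_good and c + shift <= c_max
    ]


# Hypothetical batch of four trajectories.
returns = [12.0, 11.0, 9.0, 13.0]
costs = [10.0, 20.0, 5.0, 30.0]
print(select_good_cvar(returns, costs, r_good=10.0, c_max=25.0))
```

In this toy batch, the second trajectory has cost 20, which is below $c_{max}=25$, yet it is rejected once the spread of the costs is added, which illustrates how the CVaR criterion is stricter than the expected-cost one.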

The comparison results are shown in Figure 15. Interestingly, the original version of WC-SAC struggled to achieve satisfactory results. However, the PPO-CVaR approach performed exceptionally well in both environments, achieving improved performance while still maintaining lower costs than PPO-Lagrangian. Furthermore, our algorithm SIM (CVaR) outperformed all other methods, achieving the same expected cost while offering even higher expected rewards. This indicates the superior performance and effectiveness of our proposed approach compared to the other baseline methods considered. Overall, our experiments demonstrate that incorporating CVaR and SIM significantly enhances the performance of prior algorithms.

C.6 Number of initial expert demonstrations

In this section, we aim to address the impact of the number of initial expert trajectories on our final performance (Q9). To this end, we run the experiments with different numbers of expert trajectories, taken from the set $[5,10,25,50,100,300,500]$. The detailed results are shown in Figure 16. Here, it is easy to observe that having a large number of expert demonstrations can offer better performance. This is possibly because a larger number of expert demonstrations reduces the amount of exploration needed in the environment and lets the algorithm quickly learn the criteria for classifying good and bad trajectories.

Figure 16: Best performance for the 7 different numbers of expert trajectories (rows: Return, Cost; columns: SafetyPointGoal, SafetyCarGoal).

C.7 Low-quality Initial Policy

In practical scenarios, there is no guarantee that an initial policy would be able to generate a sufficient set of good trajectories. In particular, a low-quality initial policy would even struggle with exploring good actions. To showcase such a situation, we run SIM with six different random initial policies and plot their training curves in Figure 17, which clearly shows that SIM was unable to achieve high rewards for Seed #3. This problem raises the question of how to handle a low-quality initial policy.

Figure 17: Results of 6 different seeds of SIM in SafetyPointPush (panels: Return, Cost).

Taking the above into consideration, we show below that the issue can be addressed by using a dynamic threshold $R_G$. Our approach is to dynamically adjust the “good” threshold $R_G$ during each update step to incentivize the policy to perform on par with the highest-return trajectories within the collected trajectory set $T$: $R_G=\mathbb{E}_{\tau\sim T}[R(\tau)]+2\sigma_{\tau\sim T}[R(\tau)]$. The comparison of the original SIM and SIM with dynamic $R_G$ for Seed #3 is shown in Figure 18, which clearly indicates the superiority of the dynamic SIM over the static version (as well as the PPO baseline). Notably, because the dynamic scheme reduces the threshold $R_G$, training can achieve higher rewards when generating good trajectories is challenging. However, when the highest $R_G$ is attainable, it cannot replicate the performance of the fixed $R_G$.
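
The dynamic threshold is a one-line statistic of the returns in the collected set $T$. The sketch below computes it with the standard library; the function name and sample returns are hypothetical, and the population standard deviation is an assumption, since the text does not specify the estimator.

```python
from statistics import fmean, pstdev
from typing import Sequence


def dynamic_good_threshold(returns: Sequence[float]) -> float:
    """R_G = mean(R) + 2 * std(R) over the collected trajectory set T."""
    return fmean(returns) + 2.0 * pstdev(returns)


# Hypothetical returns of the trajectories collected in one update step.
collected_returns = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
r_good = dynamic_good_threshold(collected_returns)
print(r_good)

# Only trajectories at or above the threshold count toward the "good" set.
print([r for r in collected_returns if r >= r_good])
```

Because the threshold tracks the current return distribution, it stays reachable even when the policy is still weak.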

Figure 18: Dynamic experiment in SafetyPointPush (panels: Return, Cost).

C.8 Experiments on Mujoco domains

Mujoco Circle

In this section, we present additional comparisons on Mujoco-Circle environments (Achiam et al. 2017; Ji et al. 2023). Figure 19 shows our comparison results, which clearly show the strength of SIM over other baseline methods.

Figure 19: Results for Mujoco Circle environments (rows: Return, Cost; columns: SafetyPointCircle, SafetyCarCircle).

Mujoco Velocity

We further provide experiments on Mujoco-Velocity environments. The performance curves reported in Figure 20 clearly demonstrate the superiority of SIM over other baseline methods, both in satisfying the constraint during training and in achieving a high return.

Figure 20: Results for Mujoco Velocity environments (rows: Return, Cost; columns: SafetyAntVelocity, SafetyHalfCheetahVelocity).