Gameplay Filters: Safe Robot Walking through Adversarial Imagination

Duy P. Nguyen$^{1\S}$, Kai-Chieh Hsu$^{1\S}$, Wenhao Yu$^{2}$, Jie Tan$^{2}$, Jaime F. Fisac$^{1}$
$^{1}$Department of Electrical and Computer Engineering, Princeton University, United States
{duyn, kaichieh, jfisac}@princeton.edu
$^{2}$Google DeepMind, United States
{magicmelon, jietan}@google.com
Abstract

Ensuring the safe operation of legged robots in uncertain, novel environments is crucial to their widespread adoption. Despite recent advances in safety filters that can keep arbitrary task-driven policies from incurring safety failures, existing solutions for legged robot locomotion still rely on simplified dynamics and may fail when the robot is perturbed away from predefined stable gaits. This paper presents a general approach that leverages offline game-theoretic reinforcement learning to synthesize a highly robust safety filter for high-order nonlinear dynamics. This gameplay filter then maintains runtime safety by continually simulating adversarial futures and precluding task-driven actions that would cause it to lose future games (and thereby violate safety). Validated on a 36-dimensional quadruped robot locomotion task, the gameplay safety filter exhibits inherent robustness to the sim-to-real gap without manual tuning or heuristic designs. Physical experiments demonstrate the effectiveness of the gameplay safety filter under perturbations, such as tugging and unmodeled irregular terrains, while simulation studies shed light on how to trade off computation and conservativeness without compromising safety.

$^{\S}$ Denotes equal contribution.
Figure 1: We deploy our gameplay-based safety filter on a quadruped robot equipped with a safety-agnostic walking (task) policy, and evaluate its effectiveness under strong tugging forces and unmodeled irregular terrain. The gameplay filter continually monitors the robot’s safety by rapidly simulating adversarial futures, pitting its best-effort safety strategy against a learned virtual adversary that aims to exploit uncertainty and sim-to-real error to make it tip over. If hazardous conditions arise, the filter intervenes to preclude task-driven actions that would cause the robot to lose this imaginary safety game at a later time. Interventions result in highly robust, adaptive behaviors such as counterbalancing to fight persistent pulls and springing into a wide stance to break imminent falls.

I Introduction

Increasingly, autonomous robots are being deployed beyond controlled environments and required to operate reliably in uncertain, unforeseen conditions \citep{kumar2021rma,zhuang2023robot,hsuzen2022sim2lab2real,margolis2022rapid,kostrikov2023demonstrating}. This has resulted in a growing need for robot safety frameworks that can scale with system complexity and generalize gracefully to novel environments.

Model-based approaches developed by the robotics and control communities offer a principled treatment of safe decision-making under uncertainty. Unfortunately, computing global safety fallback strategies for high-dimensional, nonlinear robot dynamics remains an open problem. State-of-the-art numerical safety methods only scale to 5–6 state variables \citep{bansal2017hamilton,bui2021realtime}, woefully short of the 12 needed to accurately model the flight of a drone in free space, let alone the 30–50 required for most legged robots. Analytical approaches like Lyapunov controllers and control barrier functions (CBFs) rely on hand-design, structural assumptions, and reduced-order models \citep{nguyen2022robust,molnar2022modelfree}, restricting their use to a local operating envelope, such as a predefined stable walking gait. As a result, legged robots are notorious for falling easily, especially on irregular terrain or when externally perturbed (pushed, tugged, or tripped).

Data-driven approaches grounded in machine learning address the scalability challenge by automatically distilling efficient representations from the robot’s prior experience or, more recently, from web-scale data \citep{deepmind2023rtx}. In practice, however, learned models for robot control, including deep reinforcement learning and imitation learning, are often trained in simulated environments due to hardware constraints and poor sample complexity (requiring millions of training episodes that can much more easily be procured by at-scale simulation). The discrepancy between training and deployment conditions, or sim-to-real gap, can result in deteriorated operational performance and, in extreme cases, catastrophic safety failures (e.g., damaging the robot or hurting nearby people) \citep{hsuzen2022sim2lab2real}. Additionally, end-to-end approaches often require re-training for different task specifications, which presents technical challenges in balancing safety objectives with task-specific goals, especially avoiding situations where a robot may unexpectedly prioritize task performance over safety.

A recent line of work breaks down the safety–performance trade-off through variations of a supervisory control mechanism known as a safety filter, which monitors the autonomous system’s safety at runtime and intervenes when necessary by adjusting the original performance-oriented control to avert catastrophic failures \citep{hsu2023safety,fisac2019AGS,ames2017cbf,wabersich2018linear,bastani2021safe,kumar2023cbfddp}. While some efforts have been made to synthesize safety filters for legged robot locomotion, these typically rely on simplified low-order dynamics to maintain tractability, and they lack a systematic treatment of uncertainty and the reality gap \citep{hsuzen2022sim2lab2real,yang2022safe}.

This paper introduces a novel type of safety filter that brings together the scalability of learning-based representations and the reliability of model-based safety analysis, enabling highly robust and minimally disruptive safety assurance for arbitrary robot task policies. Unlike most general safety filter techniques, the approach scales readily to robot dynamics with tens of state dimensions, which allows us to focus on its use in the dynamic legged locomotion domain. Further, our safety filter can monitor a closed-loop policy and address the associated computational latency, whereas existing safety filters only handle a single control input.

A preliminary offline stage leverages game-theoretic reinforcement learning to synthesize control and disturbance policies, which can be systematically used to construct safety filters for general nonlinear, high-dimensional dynamic systems at runtime. At every control cycle, the online gameplay safety filter assesses safety risks based on an imagined game between the control policy and the adversarial disturbance policy trained during offline gameplay learning. This imagined gameplay aims to simulate the worst-case realization of the uncertainty in the system, arising either from sim-to-real error or from environmental perturbations. If dangerous conditions emerge, the filter steps in to prevent task-driven actions that could lead the robot to lose the subsequent safety-oriented gameplay.

The effectiveness of the proposed gameplay safety filter is validated in a legged robot locomotion task with a 36-dimensional state space and a 12-dimensional control space.\footnote{See \url{https://saferobotics.princeton.edu/research/gameplay-filter} for supplementary material.} Our results demonstrate that the gameplay safety filter is inherently robust to the sim-to-real gap, operating in a “zero-shot” manner without requiring manual design or hyperparameter tuning during deployment. Moreover, the gameplay safety filter achieves a high safety rate without being overly conservative, avoiding frequent interventions in the performance-oriented control policy. Importantly, the gameplay safety filter synthesis remains independent of the performance-oriented policy, making it modular and adaptable to any performance-oriented policy at runtime. Our evaluation includes real-world experiments on different terrains with perturbations (see Figure 1) and a comprehensive simulation study on the relative importance of design choices.

II Related Work

Learning for Locomotion. Conventionally, legged locomotion has been addressed through model-based techniques, including model-predictive control \citep{bledt2018mit} and trajectory optimization \citep{winkler2018gait}. However, recent advancements in deep learning offer the opportunity to learn directly from interactions with environments and feedback in the form of a reward signal, bypassing the need for intricate dynamics modeling and extensive domain knowledge. \citet{kostrikov2023demonstrating} demonstrated the direct training of locomotion policies across various terrains in the real world through reinforcement learning by carefully formulating the problem with consideration for state space, action space, and reward function. Despite the success of reinforcement learning, it relies on trial and error during training. In safety-critical environments, learning from scratch can lead to catastrophic safety failures. An alternative approach involves initially training control policies in simulation and then bridging the simulation-to-real gap through methods such as domain randomization \citep{tobin2017domain}, task-driven adaptation \citep{kumar2021rma,ren2023adaptsim}, and system identification \citep{fabio2019bayessim}.

Safety Filters. While learning-based policies discussed earlier exhibit practical utility, they primarily focus on task-oriented performance metrics. However, ensuring their safe operation in unforeseen, uncertain, and unforgiving environments is of paramount importance. A recent line of work aims at inducing safety awareness and even guarantees for learning-based policies through a safety filter. The runtime operation of every safety filter can be conceptualized as two interrelated functions: monitoring and intervention \citep{hsu2023safety}. The safety filter continually monitors the robot’s planned actions to assess the level of safety risk. Subsequently, the filter may intervene by modulating or entirely overriding the robot’s intended control input to guarantee the preservation of safety. Many safety filters incorporate monitoring and intervention procedures guided by a safety-oriented control strategy, which the filter views as a viable fallback.

One important family of safety filters is built on Hamilton-Jacobi (HJ) reachability analysis, which computes a global safe value function through dynamic programming \citep{mitchell2008flexible,fisac2015reachavoid}. The resulting value function encodes the maximal safe set and optimal safety fallback, and thus, a least-restrictive safety filter can be synthesized by a switch-type intervention \citep{fisac2019AGS}. Although systematic and powerful, HJ methods have poor scalability and are limited to no more than 6 state dimensions \citep{bui2022optimizeddp}.

On the other hand, control barrier functions (CBFs) \citep{ames2017cbf,ames2019control} no longer encode or approximate the maximal safe set. Instead, CBFs, if found, provide a sufficient condition to keep the system safe forever, akin to control Lyapunov functions \citep{sontag1983lyapunov}. Another critical feature of CBFs is their use of optimization-type intervention, which finds the minimal modulation to the task-oriented control that still keeps the system safe, and thus CBFs allow a smooth intervention mechanism. However, finding a CBF for general dynamics is usually not trivial, and a CBF is only valid locally and is not robust to model mismatch.

For high-dimensional dynamics, computing global optimal value functions (HJ) is computationally prohibitive, and finding a valid CBF is often heuristic. Instead of relying on value functions, model predictive safety filters aim to certify the system safety in real time through forward simulation (“rolling out”) or trajectory optimization\footnote{Some recent efforts have been made to synthesize CBFs based on model predictive methods \citep{chen2021backup,kumar2023cbfddp}, but the dynamics considered have no more than 5 state dimensions.} \citep{wabersich2018linear,bastani2021safe,hsunguyen2023isaacs}, which are closely linked to this work. \citet{hsunguyen2023isaacs} consider the forward-reachable set (FRS) of the system trajectories. However, the use of FRS brings two challenges for general high-dimensional dynamics: 1) the FRS needs to be tight so that the safety filter is not overly conservative, and 2) the computation of the FRS needs to be fast enough to satisfy real-time constraints. Instead, \citet{bastani2021safe} assume that the disturbance distribution is known, sample a sufficient number of trajectories from it, and derive a statistical guarantee; nonetheless, the disturbance distribution may be difficult to obtain in practice.

Safety filters have been applied to ensure the safe operation of learning-based locomotion \citep{hsuzen2022sim2lab2real,yang2022safe}. \citet{hsuzen2022sim2lab2real} introduce a safety monitor based on a value function and fine-tune the corresponding safety filter using a two-stage reinforcement learning framework, providing statistical safety guarantees. However, they consider the uncertainty distribution as a whole, while our work focuses on robustly safeguarding against the worst-case realization of uncertainty. \citet{yang2022safe} propose a safety monitor criterion based on a heuristically defined safety-triggered set, checking if rollouts activate the criterion. In contrast, our work determines such a safety-triggered set through gameplay rollouts. Additionally, these methods employ a simplified dynamics model and only consider velocity control instead of torque or joint position control directly.

III Preliminaries

III-A Scalable Safety Analysis via Reinforcement Learning

We consider discrete-time, uncertain robot dynamics

\begin{equation}
x_{k+1} = f(x_k, u_k, d_k), \tag{1}
\end{equation}

where, at each time step $k \in \mathbb{N}$, $x_k \in \mathcal{X} \subseteq \mathbb{R}^{n_x}$ is the state of the system, $u_k \in \mathcal{U} \subseteq \mathbb{R}^{n_u}$ is the control input (typically from a control policy $\pi^u \in \Pi^u \colon \mathcal{X} \to \mathcal{U}$), and $d_k \in \mathcal{D} \subseteq \mathbb{R}^{n_d}$ is the disturbance input, unknown a priori. The disturbance bound defines the operational design domain (ODD), under which we must ensure autonomous systems function safely and effectively. We further assume we are given a specification of the failure set $\mathcal{F} \subset \mathcal{X}$ of all conditions the system state should never reach.
Safety analysis aims to determine the largest possible safe set $\Omega \subset \mathcal{X}$, from which there exists a control policy that can maintain system safety against all admissible uncertainty realizations (encoded by a disturbance policy $\pi^d \in \Pi^d \colon \mathcal{X} \times \mathcal{U} \to \mathcal{D}$)

\begin{equation}
\Omega := \left\{ x \in \mathcal{X} \mid \exists \pi^u \in \Pi^u,\, \forall \pi^d \in \Pi^d,\, \forall k > 0,\, x_k \notin \mathcal{F} \right\}, \tag{2}
\end{equation}

where $x_k = \mathbf{x}_x^{\pi^u,\pi^d}(k)$ and $\mathbf{x}_x^{\pi^u,\pi^d}$ is the system trajectory starting from $x_0 = x$ and following dynamics Eq. 1 with control and disturbance inputs from control policy $\pi^u$ and disturbance policy $\pi^d$, respectively.
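
As a minimal illustration of Eq. 1 and the closed-loop trajectory notation, the following Python sketch rolls out a toy scalar system under a control policy and a disturbance policy. The dynamics, the input bounds, the failure set, and both policies are hypothetical stand-ins chosen for readability; they are not the quadruped model considered in this paper.

```python
def rollout(f, pi_u, pi_d, x0, horizon):
    """Return [x_0, ..., x_horizon] for x_{k+1} = f(x_k, u_k, d_k)."""
    traj = [x0]
    x = x0
    for _ in range(horizon):
        u = pi_u(x)       # control from the state-feedback policy
        d = pi_d(x, u)    # disturbance may react to both state and control
        x = f(x, u, d)
        traj.append(x)
    return traj


# Hypothetical toy system: scalar integrator with |u| <= 1 and |d| <= 0.5.
def f(x, u, d):
    return x + u + d


def pi_u(x):
    return max(-1.0, min(1.0, -x))   # saturated push toward the origin


def pi_d(x, u):
    return 0.5 if x >= 0 else -0.5   # always pushes away from the origin


def in_failure_set(x):
    return abs(x) > 3.0              # assumed failure set F = {|x| > 3}


traj = rollout(f, pi_u, pi_d, x0=2.0, horizon=4)
stays_safe = all(not in_failure_set(x) for x in traj)
print(traj, stays_safe)
```

Checking one fixed policy pair, as done here, only evaluates a single trajectory; membership in the safe set of Eq. 2 additionally requires quantifying over all disturbance policies.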

Hamilton-Jacobi-Isaacs (HJI) reachability analysis leverages level set representations to transform the binary outcome, or game-of-kind as formulated in Eq. 2, into a continuous outcome, or game-of-degree, by a (Lipschitz-continuous) margin function $g \colon \mathcal{X} \to \mathbb{R}$ such that $g(x) < 0 \Leftrightarrow x \in \mathcal{F}$\footnote{An example of a margin function is the signed distance function to the failure set.} \citep{mitchell2008flexible,bansal2017hamilton}:

\begin{equation}
J_k^{\pi^u,\pi^d}(x) := \min_{\tau \in [k, H]} g\left( \mathbf{x}_x^{\pi^u,\pi^d}(\tau) \right), \tag{3}
\end{equation}

where $H$ is the control horizon under consideration. Consistent with the order of the quantifiers in Eq. 2, we compute the lower value of the game $V_k(x) := \max_{\pi^u} \min_{\pi^d} J_k^{\pi^u,\pi^d}(x)$, which gives the disturbance policy the information advantage \citep{isaacs1954differential}. Additionally, this value function is the fixed-point solution of the Isaacs equation, which can be solved by dynamic programming

\begin{align}
V_k(x) &= \max_{u} \min_{d} \min\left\{ g(x),\, V_{k+1}\big( f(x, u, d) \big) \right\}, \tag{4a} \\
V_H(x) &= g(x). \tag{4b}
\end{align}

If a nonempty safe set is present in the context of the differential game, the value function converges and becomes time-independent within this set as $H \to \infty$. Consequently, we can eliminate the dependence on $k$, resulting in $V(x) = \lim_{k \to -\infty} V_k(x)$. We can recover the (maximal) safe set as the superzero level set of the value function, $\Omega^* = \{ x \in \mathcal{X} \mid V(x) \geq 0 \}$, and the optimal policies $\pi^{u*},\, \pi^{d*}$ from the optimizers of Eq. 4.
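
To make the recursion in Eq. 4 concrete, the sketch below runs it by brute force on a small finite state space. The unstable integer dynamics, the input sets, and the margin function are assumptions made purely for illustration; tabular dynamic programming of this kind is exactly what does not scale to high-dimensional robots.

```python
def safety_value_iteration(states, controls, disturbances, f, g, horizon):
    """Backward recursion of Eq. 4 on a finite state space."""
    V = {x: g(x) for x in states}            # terminal condition (4b)
    for _ in range(horizon):
        V = {                                # one backup of (4a)
            x: max(
                min(min(g(x), V[f(x, u, d)]) for d in disturbances)
                for u in controls
            )
            for x in states
        }
    return V


# Hypothetical toy system: unstable integer dynamics, clipped to the grid.
STATES = range(-6, 7)
CONTROLS = (-2, -1, 0, 1, 2)
DISTURBANCES = (-1, 0, 1)


def f(x, u, d):
    return max(-6, min(6, 2 * x + u + d))


def g(x):
    return 4 - abs(x)    # g(x) < 0 exactly when |x| >= 5 (assumed failure set)


V = safety_value_iteration(STATES, CONTROLS, DISTURBANCES, f, g, horizon=20)
safe_set = [x for x in STATES if V[x] >= 0]        # superzero level set of V
not_failed = [x for x in STATES if g(x) >= 0]      # complement of the failure set
print(safe_set)
print(not_failed)
```

In this toy example the safe set is only $\{-1, 0, 1\}$, even though every state with $|x| \leq 4$ lies outside the failure set: from $|x| \geq 2$ the disturbance can eventually force a failure regardless of the control.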

However, in practice, it is difficult to know a sufficient control horizon a priori. Also, the maximal safe set is usually difficult to find for complex, high-dimensional dynamics. Instead, reach–avoid analysis simplifies the safety analysis by checking whether a control policy exists to guide the system into a (robust controlled-invariant) target set $\mathcal{T} \subset \mathcal{X}$ ($\mathcal{T} \cap \mathcal{F} = \emptyset$) within $H$ time steps without first entering the failure set. The reach–avoid set is defined by

\begin{equation}
\begin{aligned}
\mathcal{RA} := \Big\{\, & x \in \mathcal{X} \mid \exists \pi^u \in \Pi^u,\, \forall \pi^d \in \Pi^d, \\
& \exists k \in [0, H],\, x_k \in \mathcal{T} \land \forall \tau \in [0, k],\, x_\tau \notin \mathcal{F} \,\Big\}.
\end{aligned} \tag{5}
\end{equation}

Since $\mathcal{T}$ is a robust controlled-invariant set, there exists a policy $\pi^{\mathcal{T}}$ that can keep the system state in $\mathcal{T}$ forever under all disturbance realizations. After the reach–avoid policy safely guides the system into $\mathcal{T}$, we can switch to $\pi^{\mathcal{T}}$ to keep the system in $\mathcal{T}$ forever. Thus, reach–avoid analysis simplifies the safety control design by requiring only the assurance of control invariance for a small subset of states $\mathcal{T}$, which is sufficient for the reach–avoid set to be a safe set.

An auxiliary game of degree can be similarly formulated by introducing another margin function with respect to the target set, $\ell \colon \mathcal{X} \to \mathbb{R}$, such that $\ell(x) \geq 0 \Leftrightarrow x \in \mathcal{T}$. We consider the reach–avoid outcome \citep{hsu2021safety}

\begin{equation}
J_k^{\pi^u,\pi^d}(x) := \max_{\tau \in [k, H]} \min\left\{ \ell(x_\tau),\, \min_{s \in [k, \tau]} g(x_s) \right\}. \tag{6}
\end{equation}
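
The outcome in Eq. 6 can be evaluated along a single recorded trajectory in one pass by tracking the running minimum of $g$. The following Python sketch does so for two hand-written trajectories; the margin functions and the trajectories are hypothetical examples, not data from the experiments.

```python
def reach_avoid_outcome(traj, ell, g):
    """Evaluate Eq. 6 along a finite trajectory [x_k, ..., x_H]."""
    outcome = float("-inf")
    min_g_so_far = float("inf")
    for x in traj:
        min_g_so_far = min(min_g_so_far, g(x))             # min over s in [k, tau]
        outcome = max(outcome, min(ell(x), min_g_so_far))  # max over tau
    return outcome


# Assumed margins: target set |x| <= 1 and failure set |x| > 4.
def ell(x):
    return 1.0 - abs(x)


def g(x):
    return 4.0 - abs(x)


reaches_safely = reach_avoid_outcome([3.0, 2.0, 1.0, 0.5], ell, g)
fails_first = reach_avoid_outcome([3.0, 5.0, 1.0, 0.0], ell, g)
print(reaches_safely, fails_first)
```

The first trajectory reaches the target without touching the failure set and yields a nonnegative outcome; the second also ends in the target but passes through the failure set first, so its outcome is negative.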

The reach–avoid value function can be solved by the following Isaacs equation

\begin{align}
V_k(x) &= \max_{u} \min_{d} \min\left\{ g(x),\, \max\left\{ \ell(x),\, V_{k+1}\big( f(x, u, d) \big) \right\} \right\}, \tag{7a} \\
V_H(x) &= \min\left\{ \ell(x),\, g(x) \right\}. \tag{7b}
\end{align}

Similarly, the reach–avoid set can be recovered by $\mathcal{RA} = \{ x \in \mathcal{X} \mid V_0(x) \geq 0 \}$.
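
A tabular sketch of the reach–avoid recursion in Eq. 7 is given below for a toy integer system in which the controller has slightly more authority than the disturbance. All sets, dynamics, and margin functions are illustrative assumptions; the target $\{-1, 0, 1\}$ is robust controlled-invariant for this toy system.

```python
def reach_avoid_value_iteration(states, controls, disturbances, f, g, ell, horizon):
    """Backward recursion of Eq. 7 on a finite state space."""
    V = {x: min(ell(x), g(x)) for x in states}       # terminal condition (7b)
    for _ in range(horizon):
        V = {                                        # one backup of (7a)
            x: max(
                min(
                    min(g(x), max(ell(x), V[f(x, u, d)]))
                    for d in disturbances
                )
                for u in controls
            )
            for x in states
        }
    return V


# Hypothetical toy system: integer integrator, clipped to the grid.
STATES = range(-6, 7)
CONTROLS = (-2, -1, 0, 1, 2)
DISTURBANCES = (-1, 0, 1)


def f(x, u, d):
    return max(-6, min(6, x + u + d))


def g(x):
    return 4 - abs(x)      # assumed failure set: |x| >= 5


def ell(x):
    return 1 - abs(x)      # assumed target set: |x| <= 1


def reach_avoid_set(horizon):
    V = reach_avoid_value_iteration(
        STATES, CONTROLS, DISTURBANCES, f, g, ell, horizon
    )
    return [x for x in STATES if V[x] >= 0]


for H in (0, 2, 3):
    print(H, reach_avoid_set(H))
```

With zero backups the set coincides with the target; each additional step of horizon lets the controller gain one unit against the worst-case disturbance, so the set grows until it covers every state outside the failure set.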

The computational complexity and memory requirements of solving Eq. 4 and Eq. 7 grow exponentially with the dimension of the continuous state, which limits their applicability to no more than six state dimensions for general dynamics \citep{bui2022optimizeddp}. In recent work, \citet{hsunguyen2023isaacs} proposed ISAACS, an adversarial reinforcement learning framework for finding approximate solutions to the Isaacs equations, where the state-action value function, or Q-function, $Q_\omega$, the reach–avoid control policy $\pi_\theta$, and the disturbance policy $\pi_\psi$ are neural networks with parameters $\omega, \theta, \psi$, respectively.\footnote{In reinforcement learning literature, $Q_\omega$ is also called the critic, as it evaluates the quality of the action, while $\pi_\theta, \pi_\psi$ are called actors, since they determine which action to take.} Each iteration of ISAACS simulates adversarial safety games to collect state-action sequences, performs gradient updates of the neural networks, and determines the policies to sample from in the next simulated games.

III-B Value-Based and Rollout-Based Safety Filters

This section introduces safety filters $\phi \colon \mathcal{X} \times \Pi^u \to \Pi^u$, with $\phi = (\pi^{\text{\tiny{\faShield*}}}, \Delta^{\text{\tiny{\faShield*}}})$,\footnote{With a slight abuse of notation, we highlight here that the safety filter is composed of a safety fallback policy and a safety monitor.} based on switch-type intervention, which can generally be formulated as \citep{hsu2023safety}

\begin{equation}
\phi(x, \pi^{\text{task}}) = \left\{ \begin{array}{ll} \pi^{\text{task}}, & \Delta^{\text{\tiny{\faShield*}}}(x, \pi^{\text{task}}) \geq 0, \\ \pi^{\text{\tiny{\faShield*}}}, & \text{otherwise,} \end{array} \right. \tag{10}
\end{equation}

where $\pi^{\text{task}} \colon \mathcal{X} \to \mathcal{U}$ is an arbitrary performance-oriented task policy, $\Delta^{\text{\tiny{\faShield*}}} \colon \mathcal{X} \times \Pi^u \to \mathbb{R}$ is a monitor function, and $\pi^{\text{\tiny{\faShield*}}} \colon \mathcal{X} \to \mathcal{U}$ is a safety-aware or even safety-guaranteed fallback policy. $\Delta^{\text{\tiny{\faShield*}}}$ is a safety monitor if a positive output indicates that the input policy can keep the system safe from the input state. Therefore, the safety filter in Eq. 10 maintains the system’s safety, by Theorem 1 in \citet{hsu2023safety}. Our approach introduces a more general and novel safety filter and monitor, treating the task policy as a function rather than as a single proposed control input.
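
The switching rule of Eq. 10 can be sketched in a few lines of Python. The filter below returns a policy, not a single control, in line with treating the task policy as a function. The one-step monitor, the toy dynamics, and both policies are hypothetical placeholders; in particular, this myopic monitor is only meant to exercise the switch and does not by itself certify safety over longer horizons.

```python
def switch_filter(x, task_policy, monitor, fallback_policy):
    """Eq. 10: keep the task policy if the monitor is nonnegative, else fall back."""
    return task_policy if monitor(x, task_policy) >= 0 else fallback_policy


# Hypothetical toy setting: x+ = x + u + d with |d| <= 0.5, failure if |x| > 4.
D_MAX = 0.5


def task_policy(x):
    return 1.0                          # safety-agnostic: always drive right


def fallback_policy(x):
    return max(-1.0, min(1.0, -x))      # saturated push toward the origin


def one_step_monitor(x, policy):
    """Assumed monitor: worst-case failure margin one step ahead under `policy`."""
    x_next_nominal = x + policy(x)
    return 4.0 - (abs(x_next_nominal) + D_MAX)


def filtered_control(x):
    policy = switch_filter(x, task_policy, one_step_monitor, fallback_policy)
    return policy(x)


for x in (0.0, 2.5, 3.0):
    print(x, filtered_control(x))
```

Far from the failure boundary the task control passes through unchanged; at $x = 3$ the worst-case next state would leave the safe region, so the fallback control is applied instead.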

Previous work has used a neural-network-parameterized Q-function as the safety monitor with a threshold $\epsilon > 0$ \citep{hsuzen2022sim2lab2real,thananjeyan2021recovery}

\begin{equation}
\Delta^{\text{\tiny{\faShield*}},\text{critic}}_{\epsilon}(x, \pi^{\text{task}}) := Q_{\omega}(x, \pi^{\text{task}}(x)) - \epsilon.
\tag{11}
\end{equation}

Also, the safety fallback policy can be constructed by $\operatorname*{argmax}_{u \in \mathcal{U}} Q_{\omega}(x, u)$ or directly by $\pi_{\theta}$ when the state is outside of the target set. On the other hand, when the state is in the target set $\mathcal{T}$, $\pi^{\mathcal{T}}$ serves as a fallback to keep the state inside the target set. In other words, the safety fallback policy $\pi^{\text{\tiny{\faShield*}}}$ is defined by a switching rule

\begin{equation}
\pi^{\text{\tiny{\faShield*}}}(x) = \left\{\begin{array}{ll}
\pi^{\mathcal{T}}(x), & x \in \mathcal{T}, \\
\pi_{\theta}(x), & x \notin \mathcal{T}.
\end{array}\right.
\tag{14}
\end{equation}

Although this critic filter has shown practical utility, it does not readily provide safety guarantees. In addition, the value threshold $\epsilon$ needs to be carefully tuned, which may be difficult for safety-critical applications.
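The critic-based monitor of Eq. 11 and the greedy fallback can be illustrated with a short Python sketch. The function `toy_q` is a hypothetical stand-in for a learned critic $Q_{\omega}$, and the argmax over $\mathcal{U}$ is approximated over a finite candidate set, which is an assumption of this sketch rather than something the source prescribes.

```python
from typing import Callable, Iterable

QFunction = Callable[[float, float], float]


def critic_monitor(q: QFunction, x: float, task_policy: Callable[[float], float],
                   epsilon: float) -> float:
    """Eq. 11: critic value of the proposed task control minus the threshold."""
    return q(x, task_policy(x)) - epsilon


def greedy_fallback(q: QFunction, x: float, candidates: Iterable[float]) -> float:
    """Approximate argmax_u Q(x, u) over a finite candidate set (an assumption)."""
    return max(candidates, key=lambda u: q(x, u))


# Hypothetical stand-in for a learned critic: highest when x + u is near zero.
def toy_q(x: float, u: float) -> float:
    return 1.0 - abs(x + u)


def toy_task(x: float) -> float:
    return 0.3


# The same state and task control are accepted or rejected depending on epsilon.
accept_loose = critic_monitor(toy_q, 0.5, toy_task, epsilon=0.1) >= 0.0
accept_strict = critic_monitor(toy_q, 0.5, toy_task, epsilon=0.3) >= 0.0
best_u = greedy_fallback(toy_q, 0.5, [-0.5, 0.0, 0.5])
```

In this toy case the decision flips between `epsilon=0.1` and `epsilon=0.3`, which illustrates why the threshold has to be tuned with care.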

In contrast, the safety monitor can be built on model-predictive rollouts. \citet{bastani2021safe-acc} assume that the dynamical model is perfectly accurate (disturbance-free) and check whether, after executing the performance-oriented control, a rollout of the fallback policy satisfies the reach–avoid criterion

\begin{align}
\Delta^{\text{\tiny{\faShield*}},\text{nom}}_{H,\mathcal{RA}}(x, \pi^{\text{task}}) := {} & \mathbbm{1}\Big\{\, \exists \tau \in \{1,\dots,H\},\ \hat{x}_{\tau} \in \mathcal{T} \,\land \notag\\
& \forall s \in \{1,\dots,\tau\},\ \hat{x}_{s} \notin \mathcal{F} \Big\} - \frac{1}{2} \tag{15}
\end{align}

with $\hat{x}_0 = x$, $\hat{x}_1 = f(x, \pi^{\text{task}}(x), 0)$, and $\hat{x}_{\tau+1} = f(\hat{x}_{\tau}, \pi^{\text{\tiny{\faShield*}}}(\hat{x}_{\tau}), 0)$, $\tau \geq 1$. \citet{hsunguyen2023isaacs} tackle the model mismatch by employing a robust rollout with FRS

\begin{align}
\Delta^{\text{\tiny{\faShield*}},\text{FRS}}_{H,\mathcal{RA}}(x, \pi^{\text{task}}) := {} & \mathbbm{1}\Big\{\, \exists \tau \in \{1,\dots,H\},\ \mathcal{R}_{\tau} \subseteq \mathcal{T} \,\land \notag\\
& \forall s \in \{1,\dots,\tau\},\ \mathcal{R}_{s} \cap \mathcal{F} = \emptyset \Big\} - \frac{1}{2} \tag{16}
\end{align}

with $\mathcal{R}_0 = \{x\}$, $\mathcal{R}_1 = \{f(x, \pi^{\text{task}}(x), d),\ d \in \mathcal{D}\}$, and $\mathcal{R}_{\tau+1} = \{f(\hat{x}, \pi^{\text{\tiny{\faShield*}}}(\hat{x}), d),\ \hat{x} \in \mathcal{R}_{\tau},\ d \in \mathcal{D}\}$, $\tau \geq 1$. In this work, we instead rely on rollouts of the learned reach–avoid control and disturbance actors to assess system safety risk.
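To make the difference between the nominal monitor (Eq. 15) and the FRS-based monitor (Eq. 16) concrete, the following Python sketch evaluates both on a hypothetical scalar system $x' = x + u + d$ with $|d| \leq d_{\max}$. The fallback law, the target set, and the failure set are illustrative assumptions. The reachable sets are exact intervals only because this toy closed loop is affine and monotone; for high-dimensional nonlinear dynamics no such shortcut is available.

```python
def frs_monitor(x: float, u_task: float, horizon: int, d_max: float) -> float:
    """Interval version of Eq. 16 for x' = x + u + d with |d| <= d_max.

    Hypothetical setup: fallback u = -0.5 * x, target set |x| <= 0.1,
    failure set |x| >= 1. Setting d_max = 0 recovers the nominal rollout (Eq. 15).
    """
    # R_1: apply the task control once under every admissible disturbance.
    lo, hi = x + u_task - d_max, x + u_task + d_max
    for _ in range(horizon):
        if hi >= 1.0 or lo <= -1.0:      # R_s intersects the failure set
            return -0.5
        if -0.1 <= lo and hi <= 0.1:     # R_tau lies inside the target set
            return 0.5
        # Closed loop under the fallback is 0.5 * x + d, which is increasing in
        # x and d, so the interval endpoints map to the new endpoints.
        lo, hi = 0.5 * lo - d_max, 0.5 * hi + d_max
    return -0.5


nominal = frs_monitor(0.5, 0.2, horizon=10, d_max=0.0)
robust_small = frs_monitor(0.5, 0.2, horizon=10, d_max=0.04)
robust_large = frs_monitor(0.5, 0.2, horizon=10, d_max=0.08)
```

In this toy setting the nominal check accepts the task control, while the robust check rejects it once the disturbance bound is large enough that the reachable interval can no longer fit inside the target set.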

IV Safe Walking by Adversarial Gameplay

This section introduces a systematic way to construct a safety filter for nonlinear, high-dimensional dynamics. In this paper, we specifically consider the task of quadruped walking, but we stress that the method presented is general across robots and tasks. We start with a careful problem formulation by defining the state space, control space, uncertainty model, and safety specifications. Then, we describe offline gameplay learning, where a disturbance actor is jointly trained with the reach–avoid control actor to generate the worst-case realization of uncertainty and attack the system adversarially. We close the section by systematically constructing an online gameplay filter using the control and disturbance policies synthesized from offline learning.

IV-A State and Action Spaces

The robot’s state and control input are defined by
\begin{align}
x &= \left[\mathbf{p}, \mathbf{\dot{p}}, \bm{\theta}, \bm{\dot{\theta}}, \bm{\theta_{\text{J}}}, \bm{\dot{\theta}_{\text{J}}}\right], \tag{17a}\\
u &= \left[\bm{\delta\theta_{\text{J}}}\right] \tag{17b}
\end{align}
with $\mathbf{p}, \bm{\theta}$ the robot pose, $\bm{\theta_{\text{J}}}$ the angular joint positions, $\bm{\dot{(\cdot)}}$ the rate of these variables, and $\bm{\delta\theta_{\text{J}}}$ the commanded angular increments of the robot's joints. Appendix A details the state and action spaces. Since the robot has four legs with three joints each (12 joints in total), we end up with a 36-D state space and a 12-D control space.

We model the sim-to-real gap via a 6-D adversarial force pushing or pulling the robot with a magnitude of 50 N

\begin{equation}
d = \left[F_x, F_y, F_z, p^F_x, p^F_y, p^F_z\right]
\tag{17c}
\end{equation}

where $F = [F_x, F_y, F_z]$ is the force vector applied at the position $(p^F_x, p^F_y, p^F_z)$ in body coordinates, with $p^F_x, p^F_y \in [-0.1, 0.1]~\text{m}$ and $p^F_z \in [0, 0.05]~\text{m}$.
We further assume the optimal force is bang-bang with $\|F\|_2 = 50~\text{N}$. The red arrows in the imagined gameplay of Figure 2 show examples of learned adversarial disturbances.
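One way to enforce this disturbance parameterization is sketched below in Python: a raw 6-D vector (for instance, an unconstrained actor output) is mapped to a force of fixed 50 N magnitude and an application point clipped to the stated box. The function name, the clipping strategy, and the handling of a zero raw force are assumptions of this sketch, not details taken from the paper.

```python
import math
from typing import List, Sequence

FORCE_MAGNITUDE_N = 50.0
POSITION_BOUNDS_M = [(-0.1, 0.1), (-0.1, 0.1), (0.0, 0.05)]


def to_disturbance(raw: Sequence[float]) -> List[float]:
    """Map a raw 6-D vector to d = [F_x, F_y, F_z, p^F_x, p^F_y, p^F_z]."""
    force, position = raw[:3], raw[3:6]
    norm = math.sqrt(sum(f * f for f in force))
    if norm == 0.0:
        # Direction is undefined; returning zero force is a choice of this sketch.
        scaled = [0.0, 0.0, 0.0]
    else:
        # Bang-bang assumption: keep the direction, fix the magnitude at 50 N.
        scaled = [FORCE_MAGNITUDE_N * f / norm for f in force]
    # Clip the application point to the stated box in body coordinates.
    clipped = [min(max(p, lo), hi) for p, (lo, hi) in zip(position, POSITION_BOUNDS_M)]
    return scaled + clipped


d = to_disturbance([3.0, 0.0, 4.0, 0.5, -0.5, 0.2])
```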

IV-B Safety Specifications

We consider the failure set of states where the defined critical points (of the robot's body) $\mathbf{p_g}$ are very close to the ground. The safety margin function is defined as

\begin{equation*}
g(x) = \min_{i} \left\{ p_g^{i} - \bar{p_g}^{i} \right\},
\end{equation*}
where $\bar{(\cdot)}$ denotes the desired magnitude.

Also, we consider the target set of states where the concerned variables $\mathbf{p_{\ell}}$ are within a small box around the target pose and velocity. The target set is designed so that the robot is known to be robustly stable with a simple stance controller. The target margin function is then defined as

\begin{equation*}
\ell(x) = \min_{i} \left\{ \bar{p_{\ell}}^{i} - |p_{\ell}^{i}| \right\}.
\end{equation*}

Appendix A details the safety specifications.
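The two margin functions can be evaluated with a few lines of Python. The sketch below assumes the common convention that a negative safety margin marks a failure state and a nonnegative target margin marks a state inside the target set; the numerical values are hypothetical and do not come from Appendix A.

```python
from typing import Sequence


def safety_margin(p_g: Sequence[float], p_g_bar: Sequence[float]) -> float:
    """g(x) = min_i (p_g^i - p_g_bar^i)."""
    return min(p - p_bar for p, p_bar in zip(p_g, p_g_bar))


def target_margin(p_l: Sequence[float], p_l_bar: Sequence[float]) -> float:
    """l(x) = min_i (p_l_bar^i - |p_l^i|)."""
    return min(p_bar - abs(p) for p, p_bar in zip(p_l, p_l_bar))


# Hypothetical numbers: heights of two critical points and two target-box variables.
g_value = safety_margin([0.30, 0.12], [0.10, 0.10])
l_inside = target_margin([0.05, -0.02], [0.10, 0.10])
l_outside = target_margin([0.05, -0.20], [0.10, 0.10])
```

Taking the minimum over components means that a single critical point too close to the ground, or a single variable outside its box, determines the sign of the margin.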

IV-C Offline Gameplay Learning

We introduce an offline gameplay learning scheme, which builds upon ISAACS \citep{hsunguyen2023isaacs}. At each iteration, the learning algorithm collects interactions with the environment via simulated adversarial safety games, updates the neural-network-parameterized Q-function and policies, and determines which control and disturbance policies are used for the next iteration's simulated gameplay.

Simulated Adversarial Safety Games. At every time step of the games, we store the transition $(x, u, d, x', \ell', g')$ in the replay buffer $\mathcal{B}$, with $x' = f(x, u, d)$, $\ell' = \ell(x')$, and $g' = g(x')$. The control and disturbance inputs are selected from policies that are either trained concurrently or fixed after pre-training.

Policy and Critic Networks Update. The core of the proposed offline gameplay learning is to find approximate solutions to the Isaacs equation (Eq. 7). We employ the Soft Actor-Critic (SAC) \citep{haarnoja2018sac} framework to update the critic and actor networks with the following loss functions.

We update the critic to reduce the deviation from the Isaacs target\footnote{Deep reinforcement learning typically involves training an auxiliary target critic $Q_{\omega'}$, with parameters $\omega'$ that undergo slow adjustments to align with the critic parameters $\omega$. This process aims to stabilize the regression by maintaining a fixed target within a relatively short timeframe.}
\begin{align}
L(\omega) &:= \operatorname*{\mathbb{E}}_{(x, u, d, x', \ell', g') \sim \mathcal{B}} \left[ \left( Q_{\omega}(x, u, d) - y \right)^{2} \right], \notag\\
y &= \gamma \min\left\{ g', \max\left\{ \ell', Q_{\omega'}(x', u', d') \right\} \right\} \notag\\
&\quad + (1 - \gamma) \min\left\{ \ell', g' \right\} \tag{18a}
\end{align}
with $u' \sim \pi_{\theta}(\cdot \mid x')$, $d' \sim \pi_{\psi}(\cdot \mid x')$. We update the control and disturbance policies following the policy gradient induced by the critic and entropy loss:
\begin{align}
L(\theta) &:= \operatorname*{\mathbb{E}}_{(x, d) \sim \mathcal{B}} \Big[ -Q_{\omega}(x, \tilde{u}, d) + \alpha^{u} \log \pi_{\theta}(\tilde{u} \mid x) \Big], \tag{18b}\\
L(\psi) &:= \operatorname*{\mathbb{E}}_{(x, u) \sim \mathcal{B}} \Big[ Q_{\omega}(x, u, \tilde{d}) + \alpha^{d} \log \pi_{\psi}(\tilde{d} \mid x) \Big], \tag{18c}
\end{align}
where $\tilde{u} \sim \pi_{\theta}(\cdot \mid x)$, $\tilde{d} \sim \pi_{\psi}(\cdot \mid x)$, and $\alpha^{u}, \alpha^{d}$ are hyperparameters encouraging higher entropy in the stochastic policies for more exploration, which decay gradually in magnitude during training.
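The regression target $y$ in Eq. 18a can be computed per transition as in the Python sketch below. Only the target computation is shown; the networks, the replay buffer, and the sampling of $u'$ and $d'$ are omitted, and the numerical inputs are hypothetical.

```python
def isaacs_target(g_next: float, l_next: float, q_next: float, gamma: float) -> float:
    """Regression target y of Eq. 18a for a single transition.

    g_next, l_next: safety and target margins at the next state x'.
    q_next: target-critic value Q_{omega'}(x', u', d').
    """
    reach_avoid = min(g_next, max(l_next, q_next))
    return gamma * reach_avoid + (1.0 - gamma) * min(l_next, g_next)


# Next state is safe, outside the target set, with a positive bootstrapped value.
y_safe = isaacs_target(g_next=0.5, l_next=-0.2, q_next=0.3, gamma=0.9)
# Next state violates safety: the negative margin caps the target.
y_fail = isaacs_target(g_next=-0.1, l_next=-0.2, q_next=0.3, gamma=0.9)
# Next state is inside the target set: the target margin overrides a poor bootstrap.
y_target = isaacs_target(g_next=0.5, l_next=0.4, q_next=-1.0, gamma=0.9)
```

The outer minimum with $g'$ ensures that a safety violation can never be compensated by later success, while the inner maximum with $\ell'$ credits reaching the target set regardless of what follows.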

We can directly train the critic and the control and disturbance actors from scratch through Eq. 18. Alternatively, we can use a three-level training curriculum with two additional pre-training stages (L1 and L2). In L1, we only train the reach–avoid control actor without considering adversarial disturbance inputs, which is a special case of Eq. 18 with $d = 0$. Then, in L2, we fix the control policy trained in L1 and train the disturbance policy instead. Since there is only one policy to optimize in L1 and L2, we can use standard SAC directly. At the beginning of gameplay learning (L3), we then initialize the actor and critic networks with the pre-trained weights, i.e., the control actor (L1), the disturbance actor (L2), and the safety critic (L2).

Furthermore, L2 training can be viewed as finding the best adversary to attack the associated control policy, or simply the best response $\pi_{\psi}^{*}(\pi^{u})$. After ISAACS training, we additionally use L2 training to fine-tune $\pi_{\psi}$ against the frozen $\pi_{\theta}$. We incorporate the resulting $\pi_{\psi}^{*}(\pi_{\theta})$ into our gameplay filter. In addition, we use L2 training to perform a bespoke ultimate stress test (BUST) for safety policies and safety filters under the various design choices in Table III.

Policy Selection. During L3 training, we also maintain a finite leaderboard of control and disturbance actors from past training iterations. Periodically, the leaderboard is updated by performing simulated gameplays between the current control and disturbance policies and the previous leaders. If the capacity of the leaderboard is reached, we remove the control actor checkpoints with the lowest safe rate (and the disturbance actor checkpoints with the highest safe rate). In the next iteration's simulated adversarial safety games, we randomly select control and disturbance actors from the leaderboard to generate action inputs, which prevents the control actor updates from overfitting to a single disturbance actor \citep{vinitsky2020robust}.
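The leaderboard logic can be sketched in Python as follows. The data layout (a list of name and safe-rate pairs), the checkpoint names, and the use of uniform random sampling are assumptions made for illustration; the source states only that checkpoints are pruned by safe rate and selected at random.

```python
import random
from typing import List, Tuple

Entry = Tuple[str, float]  # (checkpoint name, safe rate from evaluation gameplays)


def prune(entries: List[Entry], capacity: int, keep_high_safe_rate: bool) -> List[Entry]:
    """Keep at most `capacity` checkpoints, ranked by safe rate."""
    ranked = sorted(entries, key=lambda e: e[1], reverse=keep_high_safe_rate)
    return ranked[:capacity]


# Hypothetical evaluation results.
control_entries = [("ctrl_a", 0.62), ("ctrl_b", 0.91), ("ctrl_c", 0.78)]
disturbance_entries = [("dstb_a", 0.85), ("dstb_b", 0.40), ("dstb_c", 0.55)]

# Control actors: drop the lowest safe rate. Disturbance actors: drop the highest,
# since a strong adversary is one against which the safe rate is low.
control_board = prune(control_entries, capacity=2, keep_high_safe_rate=True)
disturbance_board = prune(disturbance_entries, capacity=2, keep_high_safe_rate=False)

# Pairing for the next simulated game, drawn at random from the leaderboard.
rng = random.Random(0)
ctrl_pick = rng.choice(control_board)
dstb_pick = rng.choice(disturbance_board)
```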

IV-D Online Gameplay Safety Filter

Figure 2: The operation of the gameplay safety filter. Top: The $L$-step gameplay monitor evaluates $\pi^{\text{task}}$ for $L$ steps. Since the gameplay result is successful at time step $0$, the safety filter executes the task policy from step $L$ to step $2L-1$. On the other hand, the safety filter employs $\pi^{\text{\tiny{\faShield*}}}$ from step $2L$ to step $3L-1$ because the gameplay imagination at time step $L$ fails. Middle: The block diagram of safety filters shows they are modular to any $\pi^{\text{task}}$. Bottom: The safety fallback policy alternates between $\pi^{\mathcal{T}}$ and $\pi_{\theta}$, while the safety filter employs a switch-type intervention between $\pi^{\text{task}}$ and $\pi^{\text{\tiny{\faShield*}}}$ given the safety monitor's result.
This section demonstrates how the reach–avoid control actor $\pi_{\theta}$ and disturbance actor $\pi_{\psi}$ synthesized offline through game-theoretic reinforcement learning can be systematically used at runtime to construct highly effective safety filters for general nonlinear, high-dimensional dynamic systems. A predictive (rollout-based) safety monitor is employed to avoid tuning the value threshold required by a value-based safety monitor, which is difficult to do before deployment. Also, a simple switching intervention scheme in the form of Eq. 10 is used, although optimization-based schemes like CBF–QP are also possible.

State-of-the-art predictive safety monitors face scalability and robustness challenges. For example, the nominal rollout in Section III-B can result in an overly optimistic filter, while the FRS-based robust rollout in Section III-B can be computationally intensive for high-dimensional dynamics. To tackle these challenges, we propose using a novel adversarial gameplay rollout between the fallback and disturbance policies from offline gameplay learning.

Since the safety policy from ISAACS only aims to reach the target set safely, we need $\pi^{\mathcal{T}}$ to keep the state inside the target set once the system reaches it. Therefore, the fallback policy is defined by the switching rule in Eq. 14.

The adversarial gameplay begins by applying a control from the task policy $\pi^{\text{task}}$ and follows the fallback policy $\pi^{\text{\tiny{\faShield*}}}$ afterward, with the whole rollout under attack from the ISAACS disturbance policy $\pi_{\psi}$. This gameplay monitor returns success if the state trajectory safely reaches the target set:

\begin{align}
\Delta^{\text{\tiny{\faShield*}},\text{game}}_{H,\mathcal{RA}}(x, \pi^{\text{task}}) := {} & \mathbbm{1}\Big\{\, \exists \tau \in \{1,\dots,H\},\ \hat{x}_{\tau} \in \mathcal{T} \,\land \notag\\
& \forall s \in \{1,\dots,\tau\},\ \hat{x}_{s} \notin \mathcal{F} \Big\} - \frac{1}{2} \tag{19a}
\end{align}
with $\hat{x}_0 = x$, $\hat{x}_{\tau+1} = f(\hat{x}_{\tau}, \hat{u}_{\tau}, \pi_{\psi}(\hat{x}_{\tau}))$, $\tau \geq 0$, and
\begin{equation}
\hat{u}_{\tau} = \left\{\begin{array}{ll}
\pi^{\text{task}}(\hat{x}_{\tau}), & \tau = 0, \\
\pi^{\text{\tiny{\faShield*}}}(\hat{x}_{\tau}), & \tau \in \{1,\dots,H-1\}.
\end{array}\right.
\tag{19d}
\end{equation}

Finally, the (real-time) gameplay filter $\phi = (\pi^{\text{\tiny{\faShield*}}}, \Delta^{\text{\tiny{\faShield*}}})$ is constructed from the fallback policy $\pi^{\text{\tiny{\faShield*}}}$ in Eq. 14, the gameplay monitor $\Delta^{\text{\tiny{\faShield*}},\text{game}}_{H,\mathcal{RA}}$ in Eq. 19, and the switch-type intervention scheme in Eq. 10. Algorithm 1 illustrates the proposed gameplay filter, and Appendix A summarizes the terminology and symbols of the modules in safety filters.
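As a rough illustration of the monitor in Eq. 19 combined with a switch-type intervention, the Python sketch below rolls out the task control once, then the fallback, against a disturbance policy. It is only a minimal sketch under stated assumptions: the function names, the scalar toy dynamics, and the toy policies are hypothetical and are not the authors' implementation. The margin functions `g` (negative in the failure set) and `ell` (positive in the target set) follow the sign convention of Algorithm 1.

```python
from typing import Callable

Policy = Callable[[float], float]


def gameplay_monitor(x: float, pi_task: Policy, pi_shield: Policy,
                     pi_dstb: Policy, f: Callable[[float, float, float], float],
                     g: Callable[[float], float], ell: Callable[[float], float],
                     horizon: int) -> bool:
    """Imagined game: task control at step 0, fallback afterwards (Eq. 19)."""
    x_hat = x
    for tau in range(horizon):  # controls are indexed 0, ..., H-1
        policy = pi_task if tau == 0 else pi_shield
        x_hat = f(x_hat, policy(x_hat), pi_dstb(x_hat))
        if g(x_hat) < 0:    # imagined state entered the failure set
            return False
        if ell(x_hat) > 0:  # imagined state reached the target set safely
            return True
    return False            # target not reached within the horizon


def gameplay_filter(x, pi_task, pi_shield, pi_dstb, f, g, ell, horizon):
    """Switch-type filter: keep the task control only if the game is won."""
    ok = gameplay_monitor(x, pi_task, pi_shield, pi_dstb, f, g, ell, horizon)
    return pi_task(x) if ok else pi_shield(x)


# Hypothetical 1-D toy system (NOT the quadruped model).
def f(x, u, d): return x + u + d
def g(x): return 1.0 - abs(x)                      # failure when |x| > 1
def ell(x): return 0.1 - abs(x)                    # target when |x| < 0.1
def pi_task(x): return 0.4                         # always pushes right
def pi_shield(x): return -0.8 * x                  # pulls back to the origin
def pi_dstb(x): return 0.05 if x >= 0 else -0.05   # pushes outward

# Near the origin the task control is kept; near the boundary it is overridden.
u_safe = gameplay_filter(0.0, pi_task, pi_shield, pi_dstb, f, g, ell, 10)
u_risky = gameplay_filter(0.7, pi_task, pi_shield, pi_dstb, f, g, ell, 10)
```

In this toy setting, shortening the horizon makes the monitor more conservative, because the fallback has fewer imagined steps in which to reach the target set.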

However, the gameplay rollout may take multiple time steps to compute. To resolve this latency issue, we verify the task policy over a longer execution horizon of $L$ steps rather than within one step, as in Algorithm 1. The longer foresight addresses latency in real deployment and also smooths out undesired oscillations close to the boundary of the reach–avoid set. The gameplay-based safety monitor with latency is formulated as

\begin{align}
\Delta^{\text{\tiny{\faShield*}},\text{game}}_{H,L,\mathcal{RA}}(x, \pi^{\text{task}}) := {}& \mathbbm{1}\Big\{\, \exists \tau \in \{2L,\dots,H\},\ \hat{x}_{\tau} \in \mathcal{T} \,\land \notag\\
& \forall s \in \{L,\dots,\tau\},\ \hat{x}_{s} \notin \mathcal{F} \Big\} - \frac{1}{2} \tag{20a}
\end{align}
with $\hat{x}_0 = x$, $\hat{x}_{\tau+1} = f(\hat{x}_{\tau}, \hat{u}_{\tau}, \pi_{\psi}(\hat{x}_{\tau}))$, $\tau \geq 0$, and
\begin{align}
\hat{u}_{\tau} = \left\{\begin{array}{ll}
\phi^{\text{prev}}(\hat{x}_{\tau}), & \tau \in \{0,\dots,L-1\},\\
\pi^{\text{task}}(\hat{x}_{\tau}), & \tau \in \{L,\dots,2L-1\},\\
\pi^{\text{\tiny{\faShield*}}}(\hat{x}_{\tau}), & \tau \in \{2L,\dots,H-1\},
\end{array}\right. \tag{20e}
\end{align}
where $\phi^{\text{prev}}$ is the verified gameplay filter, which is executed, and thus immutable, while waiting for the simulated adversarial gameplay. In other words, if the (trained) disturbance policy captures the worst-case realization of the uncertainty and sim-to-real gap, executing $\phi^{\text{prev}}$ is guaranteed to be safe. However, the learned disturbance policy may not be optimal due to its neural network parameterization. We therefore still switch to the fallback $\pi^{\text{\tiny{\faShield*}}}$ if there is a safety failure during the first $L$ steps of the imagined gameplay.

Algorithm 2 summarizes the $L$-step gameplay safety filter.

Figure 2 illustrates the operation of the gameplay safety filter with the $L$-step gameplay safety monitor. For example, at monitor cycle $k = L$, we have $\phi^{\text{prev}} = \pi^{\text{task}}$ because the safety monitor check at $k = 0$ succeeded. In contrast, at monitor cycle $k = 2L$, we have $\phi^{\text{prev}} = \pi^{\text{\tiny{\faShield*}}}$ because the safety monitor check at $k = L$ failed.
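The hand-off between monitor cycles can be sketched in Python as follows. This is a minimal illustration that mirrors the control flow of Algorithm 2 (including its early exit on any imagined failure and its $\tau \geq 2L$ success check); the toy dynamics, the toy policies, the helper `next_filter`, and the state assumed at the second cycle are all hypothetical and not taken from the authors' code.

```python
def l_step_monitor(x, phi_prev, pi_task, pi_shield, pi_dstb,
                   f, g, ell, horizon, latency):
    """Imagined game with latency L: phi_prev, then task, then fallback."""
    x_hat = x
    for tau in range(horizon):
        if tau < latency:
            policy = phi_prev    # already committed during the wait
        elif tau < 2 * latency:
            policy = pi_task     # candidate being verified
        else:
            policy = pi_shield   # fallback for the rest of the game
        x_hat = f(x_hat, policy(x_hat), pi_dstb(x_hat))
        if g(x_hat) < 0:         # failure at any imagined step
            return False
        if ell(x_hat) > 0 and tau >= 2 * latency:
            return True          # target reached after the task segment
    return False


def next_filter(ok, pi_task, pi_shield):
    """Policy committed for the next L steps (the next cycle's phi_prev)."""
    return pi_task if ok else pi_shield


# Hypothetical 1-D toy system (NOT the quadruped model).
def f(x, u, d): return x + u + d
def g(x): return 1.0 - abs(x)                      # failure when |x| > 1
def ell(x): return 0.1 - abs(x)                    # target when |x| < 0.1
def pi_task(x): return 0.4
def pi_shield(x): return -0.8 * x
def pi_dstb(x): return 0.05 if x >= 0 else -0.05

H, L = 10, 1
# Cycle k = 0: the fallback is assumed to be the currently committed policy.
ok_0 = l_step_monitor(0.0, pi_shield, pi_task, pi_shield, pi_dstb,
                      f, g, ell, H, L)
phi_prev = next_filter(ok_0, pi_task, pi_shield)
# Cycle k = L: hypothetical state 0.3; phi_prev is whatever cycle 0 verified.
ok_L = l_step_monitor(0.3, phi_prev, pi_task, pi_shield, pi_dstb,
                      f, g, ell, H, L)
phi_next = next_filter(ok_L, pi_task, pi_shield)
```

In this toy run the check at $k = 0$ succeeds, so the task policy is committed at $k = L$; the check at $k = L$ then fails, so the fallback is committed at $k = 2L$, matching the pattern described for Figure 2.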

\begin{algorithm}
\caption{Real-Time Gameplay Safety Filter}
\begin{algorithmic}[1]
\Require $x_k, \pi^{\text{task}}, \pi^{\text{\tiny{\faShield*}}}, \pi_{\psi}, f, L, H$
\State $\hat{x}_0 \leftarrow x_k$
\For{$\tau = 0$ to $H$}
    \If{$\tau = 0$}
        \State $\pi^{u} \leftarrow \pi^{\text{task}}$ \Comment{Evaluate the task policy}
    \Else
        \State $\pi^{u} \leftarrow \pi^{\text{\tiny{\faShield*}}}$ \Comment{Followed by the fallback}
    \EndIf
    \State $\hat{u}_{\tau} \leftarrow \pi^{u}(\hat{x}_{\tau})$
    \State $\hat{d}_{\tau} \leftarrow \pi_{\psi}(\hat{x}_{\tau})$
    \State $\hat{x}_{\tau+1} \leftarrow f(\hat{x}_{\tau}, \hat{u}_{\tau}, \hat{d}_{\tau})$
    \If{$g(\hat{x}_{\tau+1}) < 0$} \Comment{Gameplay violates safety}
        \State $\text{res} \leftarrow 0$
        \State \Return $\text{res}, \pi^{\text{\tiny{\faShield*}}}$
    \EndIf
    \If{$\ell(\hat{x}_{\tau+1}) > 0$} \Comment{Gameplay succeeds}
        \State $\text{res} \leftarrow 1$
        \State \Return $\text{res}, \pi^{\text{task}}$
    \EndIf
\EndFor
\State $\text{res} \leftarrow 0$ \Comment{Gameplay does not reach}
\State \Return $\text{res}, \pi^{\text{\tiny{\faShield*}}}$
\end{algorithmic}
\end{algorithm}
\begin{algorithm}
\caption{$L$-Step Gameplay Safety Filter}
\begin{algorithmic}[1]
\Require $x_k, \pi^{\text{task}}, \phi^{\text{prev}}, \pi^{\text{\tiny{\faShield*}}}, \pi_{\psi}, f, L, H$
\State $\hat{x}_0 \leftarrow x_k$
\For{$\tau = 0$ to $H$}
    \If{$\tau < L$}
        \State $\pi^{u} \leftarrow \phi^{\text{prev}}$ \Comment{Apply the verified filter}
    \ElsIf{$\tau < 2L$}
        \State $\pi^{u} \leftarrow \pi^{\text{task}}$ \Comment{Evaluate the task policy}
    \Else
        \State $\pi^{u} \leftarrow \pi^{\text{\tiny{\faShield*}}}$ \Comment{Followed by the fallback}
    \EndIf
    \State $\hat{u}_{\tau} \leftarrow \pi^{u}(\hat{x}_{\tau})$
    \State $\hat{d}_{\tau} \leftarrow \pi_{\psi}(\hat{x}_{\tau})$
    \State $\hat{x}_{\tau+1} \leftarrow f(\hat{x}_{\tau}, \hat{u}_{\tau}, \hat{d}_{\tau})$
    \If{$g(\hat{x}_{\tau+1}) < 0$} \Comment{Gameplay violates safety}
        \State $\text{res} \leftarrow 0$
        \State \Return $\text{res}, \pi^{\text{\tiny{\faShield*}}}$
    \EndIf
    \If{$\ell(\hat{x}_{\tau+1}) > 0$ and $\tau \geq 2L$} \Comment{Gameplay succeeds}
        \State $\text{res} \leftarrow 1$
        \State \Return $\text{res}, \pi^{\text{task}}$
    \EndIf
\EndFor
\State $\text{res} \leftarrow 0$ \Comment{Gameplay does not reach}
\State \Return $\text{res}, \pi^{\text{\tiny{\faShield*}}}$
\end{algorithmic}
\end{algorithm}

V Experiments

Through extensive simulation study and hardware experiments, we aim to answer the following questions: Can our offline game-theoretic learning and gameplay safety filter

  (1) achieve robust safety for general nonlinear, high-dimensional systems without obstructing task execution?

  (2) enable the robot to operate safely, in a “zero-shot” manner, in various deployment conditions that differ from the training conditions?

  (3) outperform reward-based learning, non-game-theoretic learning, and value-based (critic) safety filters?

Additionally, we analyze the relative importance of our design choices, including (a) gameplay filter with reach–avoid criteria versus avoid-only, (b) three-level training curriculum versus L3 directly, and (c) symmetric exploitation in offline learning versus without.

V-A Experiment Setup

Robot and sensors. We use the Spirit 40 from Ghost Robotics as the robot platform, as shown in Figure 1, and the PyBullet physics engine \citep{coumans2021pybullet} to construct the simulated environment. We use the robot's internal motor encoders to obtain the absolute joint positions $\theta^{i}_{\text{J}}$ and velocities $\omega^{i}_{\text{J}}$, and the built-in IMU for roll $\theta_{x}$ and pitch $\theta_{y}$, body axial rotational rates $\omega_{x}, \omega_{y}, \omega_{z}$, and velocities $v_{x}, v_{y}, v_{z}$. No force sensor or contact sensing capability is enabled, so ground contact can only be inferred implicitly.

Gameplay filter. To implement the gameplay safety filter on a physical robot, we create a gameplay rollout server, a ROS service that takes in the current physical robot state and the proposed control action. The server then runs the gameplay rollout for a fixed horizon and returns the filtered safe action. Using the reach–avoid criterion as the terminal condition of the gameplay rollout, we observe that the elapsed time (from request to response) remains roughly flat as the rollout horizon increases from 10 to 300 steps, yielding an average cycle rate of $35~\text{Hz}$.

Task and perturbations. We construct two different terrains for physical experiments: flat terrain with tugging forces and unmodeled irregular terrain. In both cases, the robot starts from the same initial state and must safely traverse the terrain to reach the goal on the other side.

To emulate adversarial tugging forces on the robot, we mount a rope to the robot on one end and a motion-tracked dynamometer on the other end to monitor the force magnitude and direction. The dynamometer has a rated capacity of $500~\text{N}$ and a resolution of $0.1~\text{N}$, and is tethered to a computer via RS232C. The sampling rate is $1000~\text{Hz}$, which lets us record both constant pulling and force pulses.

The irregular terrain is a $2~\text{m} \times 4~\text{m}$ area with a 15-degree incline along one edge and two memory-foam mounds in the middle. The mounds measure $1.2~\text{m} \times 0.7~\text{m} \times 0.05~\text{m}$ and $1.2~\text{m} \times 0.8~\text{m} \times 0.15~\text{m}$ (length $\times$ width $\times$ height) and are positioned $1.8~\text{m}$ away from each other.

Baselines. To evaluate the effectiveness of the margin-based feedback signal and uncertainty-aware offline learning, we consider four prior reinforcement learning algorithms: (1) standard SAC \citep{haarnoja2018sac} with reward defined as

\begin{equation*}
r(x) = \left\{\begin{array}{ll}
1, & x \in \mathcal{T},\\
-1, & x \in \mathcal{F},\\
0, & \text{otherwise,}
\end{array}\right.
\end{equation*}

(2) non-game-theoretic reach–avoid reinforcement learning (RARL) \citep{hsu2021safety}, (3) RARL with domain randomization (DR), and (4) adversarial SAC with the reward feedback signal. For the critic filter, we conduct a parameter sweep in simulation to find the best value threshold and use the same threshold directly in physical experiments.
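For concreteness, the sparse reward used by the SAC baseline can be written as a short Python function. This is only a sketch: the membership tests `in_target` and `in_failure` and the 1-D toy sets are hypothetical stand-ins for $\mathcal{T}$ and $\mathcal{F}$, which are assumed here to be disjoint.

```python
from typing import Callable


def sparse_reward(x: float,
                  in_target: Callable[[float], bool],
                  in_failure: Callable[[float], bool]) -> float:
    """Return +1 in the target set, -1 in the failure set, 0 elsewhere."""
    if in_target(x):
        return 1.0
    if in_failure(x):
        return -1.0
    return 0.0


# Hypothetical 1-D sets: target |x| < 0.1, failure |x| > 1.
def in_target(x): return abs(x) < 0.1
def in_failure(x): return abs(x) > 1.0

rewards = [sparse_reward(x, in_target, in_failure) for x in (0.05, 0.5, 1.5)]
```

Unlike a margin-based signal, this reward is constant everywhere between the two sets, so it carries no information about how close a state is to either of them.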

Policy. We handcraft a task policy using an inverse kinematics gait planner for forward and sideways walking. We parameterize all policies by neural networks with 3 fully connected layers of 256 neurons; critics have 3 layers of 128 neurons. The gameplay filter uses horizon $H = 300$ and $L = 1$. We use a low-level PD position controller that outputs torques $\tau^{i} = K_{p}(\delta\theta^{i}_{\text{J}}) - K_{d} \cdot \omega^{i}_{\text{J}}$ to the robot motor controller, with $K_{p}, K_{d}$ the proportional and derivative gains.
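The PD law above amounts to one line per joint. The following Python sketch assumes that $\delta\theta^{i}_{\text{J}}$ is the commanded joint position offset (target minus current position), which the text does not state explicitly; the gain values and joint readings are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import List, Sequence


def pd_torques(delta_theta: Sequence[float], omega: Sequence[float],
               kp: float, kd: float) -> List[float]:
    """Per-joint torque: tau_i = kp * delta_theta_i - kd * omega_i."""
    if len(delta_theta) != len(omega):
        raise ValueError("delta_theta and omega must have the same length")
    return [kp * dth - kd * w for dth, w in zip(delta_theta, omega)]


# Hypothetical gains and readings for two joints (not the robot's values).
KP, KD = 40.0, 2.0
torques = pd_torques(delta_theta=[0.25, -0.5], omega=[0.5, 0.0], kp=KP, kd=KD)
```

The derivative term damps the joint velocity, so a joint that is already moving toward its target receives less torque than a stationary joint with the same position offset.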

\begin{table}
\caption{We evaluate physical robots walking on flat terrain with tugging force and unmodeled irregular terrain. $F_{\text{avg}}$ and $F_{\text{max}}$ are the average and maximum force during the test and $T_{\text{goal}}$ is the time to reach the goal. Our gameplay filter has the highest safe rate in both tugging force and irregular terrain tests without overly intervening in task-oriented actions.}
\begin{tabular}{l c cccc cc c cc}
\hline
 & \multicolumn{7}{c}{Tugging Force} & \multicolumn{3}{c}{Bumpy Terrain} \\
Policy & & \multicolumn{4}{c}{Successful Runs} & \multicolumn{2}{c}{Failed Runs} & & \multicolumn{2}{c}{Successful Runs} \\
 & Safe Rate & Filter Freq. & $T_{\text{goal}}$ & $F^{\text{peak}}_{\text{avg}}$ & $F^{\text{peak}}_{\text{max}}$ & $F^{\text{peak}}_{\text{avg}}$ & $F^{\text{peak}}_{\text{min}}$ & Safe Rate & Filter Freq. & $T_{\text{goal}}$ \\
\hline
$\phi^{\text{game}}$ & 7/10 & 0.17 & 26.3 & 67.5 N & 70.5 N & 59.8 N & 52.7 N & 10/10 & 0.19 & 41.2 \\
$\phi^{\text{critic}}$ & 4/10 & 0.10 & 26.8 & 73.7 N & 80.9 N & 53.6 N & 40.0 N & 5/10 & 0.22 & 33.5 \\
$\pi^{\text{task}}$ & 0/10 & n/a & n/a & n/a & n/a & 56.5 N & 41.4 N & 5/10 & n/a & 16.4 \\
\hline
\end{tabular}
\end{table}
\begin{table}
\caption{Maximum force magnitude withstood by the physical robot with various safety policies, task policy $\pi^{\text{task}}$, and fixed-pose policy $\pi^{\mathcal{T}}$ in different tugging directions. Our employed fallback $\pi^{\text{\tiny{\faShield*}}}$ outperforms the task policy and other safety fallback baselines and has comparable robustness to the policy used in the target set.}
\begin{tabular}{l cc cc}
\hline
 & \multicolumn{4}{c}{Maximum Force} \\
Algorithm & \multicolumn{2}{c}{Left} & \multicolumn{2}{c}{Right} \\
 & Low & High & Low & High \\
\hline
$\pi^{\text{\tiny{\faShield*}}}$ & 87.1 N & 61.1 N & 99.3 N & 59.1 N \\
$\pi_{\theta}$ & 100.5 N & 150.3 N & 121.6 N & 121.9 N \\
RARL + DR & 46.4 N & 43 N & 57.2 N & 72.1 N \\
$\pi^{\text{task}}$ & 83.2 N & 96.9 N & 82.8 N & 59 N \\
$\pi^{\mathcal{T}}$ & 151.9 N & 173.7 N & 140.3 N & 142.6 N \\
\hline
\end{tabular}

{\footnotesize
$\dagger$ Safety policies from reward-based reinforcement learning and ISAACS with the avoid-only objective fail immediately, before any force is applied.

$*$ The policy can withstand this magnitude of force. Since the policy can make the quadruped move toward the tugging direction, we cannot add more force in 10 pull attempts.
}
\end{table}

V-B Physical Results

Safe walking on different terrains. We answer Questions (1) and (3) by evaluating physical robots walking on flat terrain with tugging force and on bumpy terrain. We compare our proposed gameplay safety filter with the task policy and the critic safety filter. We record the number of runs in which the quadruped safely reaches the goal. For the successful runs, we also report the frequency of filter intervention and the time to reach the goal. For the flat-terrain experiment with tugging forces, we additionally report the maximum and average (adversarial) force during the walk. Table I shows the results of the experiment.\footnote{One test of the critic safety filter on bumpy terrain failed to reach the goal but remained safe. We do not include this run's filter frequency and elapsed time in the average.} Our proposed gameplay safety filter has the highest safe rate on both flat terrain with tugging force and unmodeled irregular terrain. Even in the failed trials, the gameplay filter withstands a higher tugging force before it violates the safety constraints. Further, the gameplay filter does not unduly intervene in the task-oriented actions, as its filter frequency is similar to that of the critic filter.

Figure 1 shows the quadruped walking by applying the proposed gameplay filter versus the performance-oriented task policy. When there are imminent safety failures after executing candidate performance-oriented controls, e.g., with airborne legs or loss of balance when climbing, the gameplay filter intervenes and stretches the legs of the quadruped to fight against persistent forces and bumpy terrain.

Maximum withstandable force. To answer Question (3), we test the maximum tugging force that can be withstood by the safety policies trained by ISAACS, SAC with reward, reach–avoid reinforcement learning with domain randomization (RARL+DR), and adversarial SAC with reward. We pull the quadruped from different directions, where the tugging angle for “low” is always within $[-0.1,\,0.4]~\text{rad}$ and the tugging angle for “high” is always within $[0.5,\,1.0]~\text{rad}$.

Table II shows that the employed $\pi_{\theta}$ can withstand more than about $150~\text{N}$ from all directions, but the non-game-theoretic counterpart (RARL+DR) is vulnerable to tugging from the left and can only withstand $43~\text{N}$. This observation suggests that DR struggles to capture the worst-case realization of disturbances accurately. This limitation arises from the inherent nature of DR, where the control actor is optimized for average disturbance behavior. As the dimension of the disturbance input increases, the likelihood that a random policy simulates the worst-case disturbance decreases exponentially. This underscores the importance of employing adversarial game-theoretic learning techniques over DR approaches.
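The exponential decay mentioned above can be illustrated with a back-of-the-envelope calculation in Python. The model is a simplifying assumption of ours, not the paper's analysis: each disturbance component is sampled independently and uniformly, and a sample counts as "near worst case" only if every component lands in an interval covering a fixed fraction `frac` of its range. The function names are hypothetical.

```python
def prob_near_worst_case(n_dims: int, frac: float) -> float:
    """Probability that all n_dims independent uniform components
    fall inside an interval covering `frac` of their range."""
    if not 0.0 < frac <= 1.0:
        raise ValueError("frac must be in (0, 1]")
    return frac ** n_dims


def expected_samples(n_dims: int, frac: float) -> float:
    """Expected number of random draws before one lands near the worst case."""
    return 1.0 / prob_near_worst_case(n_dims, frac)


# Probability for a few disturbance dimensions with a 10% window per axis.
table = {n: prob_near_worst_case(n, 0.1) for n in (1, 3, 6)}
```

Under this toy model, each additional disturbance dimension multiplies the expected number of random draws by $1/\texttt{frac}$, whereas a learned adversary searches for the worst case directly.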

Further, we notice that the reward-based reinforcement learning baselines and ISAACS with the avoid-only objective fail almost immediately, before any force is applied, since they overreact and thus flip the robot. We find that reach–avoid policies generalize better since they can bring the robot to a stable stance. We also include tests for the task policy $\pi^{\text{task}}$ and the fixed-pose policy $\pi^{\mathcal{T}}$ (used when the state is in the target set). We observe that the ISAACS control actor is strictly better than $\pi^{\text{task}}$ and is comparable to $\pi^{\mathcal{T}}$.

Figure 3: Sensitivity analysis of gameplay horizon and gameplay criteria. The proposed gameplay safety filter, utilizing the reach–avoid criteria, maintains a $100\%$ safe rate for all gameplay horizons. As the gameplay horizon shortens, the gameplay filter with reach–avoid criteria only experiences a decrease in filter efficiency without compromising safety. In contrast, the filter with avoid-only criteria shows more safety violations than the task policy under a very short horizon. Also, our proposed gameplay filter shows a higher safe rate than the critic safety filter and task policy.

V-C Simulated Results

Bespoke ultimate stress test (BUST). We further answer Questions (1) and (3) by running more exhaustive case studies comparing the following policies: task policy $\pi^{\text{task}}$, ISAACS control actor $\pi_{\theta}$, critic safety filter $\phi^{\text{critic}}$, and proposed gameplay safety filter $\phi^{\text{game}}$. To test their robustness when taken to the limit, we learn, for each of the above control schemes, a specialized adversarial disturbance policy explicitly trained (via L2) to exploit its safety vulnerabilities. We also compare these policies against random perturbations sampled uniformly from the disturbance set ($\pi^{\text{rnd}}$) or from its extreme points ($\pi^{\text{rnd,+}}$; e.g., $F_{x}=50$, $F_{y}=F_{z}=0$).


In L2 training, the disturbance actor must face a time-independent optimal control problem, where the control policy (including the appropriate safety filter) is queried during environment simulation. We note that while the internally simulated gameplay rollout considers a time-varying policy, the executed safety-filtered policy $\phi^{\text{game}}$ remains time-independent. Specifically, $\phi^{\text{game}}$ selects either the task control or the safety fallback control based on the outcome of the gameplay rollout, and the rollout depends solely on the initial state, not on when this state is visited. Therefore, $\phi^{\text{game}}$ can be considered part of the time-invariant environment, meeting the requirements of L2 training.
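The time-independence of $\phi^{\text{game}}$ can be illustrated with a minimal sketch on a toy one-dimensional system. Everything here is an assumption made for illustration: the dynamics, the policies, the sets, and in particular the rollout structure (task action on the first imagined step, fallback afterwards) are hypothetical stand-ins, not the paper's implementation. The point is only that the returned action is a function of the current state alone.

```python
def gameplay_filter(x, task_policy, fallback_policy, adversary,
                    step, in_failure, in_target, horizon):
    """Return the task action if the imagined reach-avoid game is won,
    otherwise the fallback action. Depends only on the state x."""
    u_task = task_policy(x)
    x_hat = x
    for t in range(horizon):
        # Assumed rollout: candidate task action first, then the fallback.
        u = u_task if t == 0 else fallback_policy(x_hat)
        x_hat = step(x_hat, u, adversary(x_hat))
        if in_failure(x_hat):
            return fallback_policy(x)   # imagined game lost: intervene
        if in_target(x_hat):
            return u_task               # imagined game won: allow task action
    return fallback_policy(x)           # target not reached within the horizon


# Hypothetical toy system: scalar state, failure when |x| >= 1.
def step(x, u, d):
    return x + 0.1 * (u + d)

def task_policy(x):
    return 1.0                          # always pushes to the right

def fallback_policy(x):
    return -1.0 if x > 0 else 1.0       # drives the state back toward zero

def adversary(x):
    return 0.5 if x >= 0 else -0.5      # pushes away from zero

def in_failure(x):
    return abs(x) >= 1.0

def in_target(x):
    return abs(x) <= 0.2


def filtered_action(x, horizon):
    return gameplay_filter(x, task_policy, fallback_policy, adversary,
                           step, in_failure, in_target, horizon)
```

In this toy, `filtered_action(0.5, 20)` keeps the task action while `filtered_action(0.5, 5)` falls back, because the shorter imagined game ends before the target set is reached. Calling the function twice with the same state always returns the same action, which is the property L2 training relies on.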

Table III shows the result of the BUST. We first note that $\pi^{\text{task}}$ is vulnerable to all $\pi_{\psi}^{*}$, while the proposed gameplay filter is exploited far more effectively by its associated adversary $\pi_{\psi}^{*}(\phi^{\text{game}})$ than by any other. Further, because $\phi^{\text{game}}$ is very robust, the adversary trained against it, $\pi_{\psi}^{*}(\phi^{\text{game}})$, also attacks the other policies effectively: the third column has the lowest safe rates of all columns.


The last two columns show the safe rate under random disturbances. Except for $\pi^{\text{task}}$, the reach–avoid control actor and both safety filters maintain high safe rates. This observation suggests that our L2 training method establishes a more demanding safety benchmark for policies than DR, even when we change the sampling from uniform within the disturbance set ($\pi^{\text{rnd}}$) to its extreme cases ($\pi^{\text{rnd,+}}$).

TABLE III: We perform a bespoke ultimate stress test in simulated environments by learning a specialized adversarial disturbance policy explicitly trained (via L2) to exploit any existing vulnerabilities in each of these robot controllers $\pi_{\theta},\,\pi^{\text{task}},\,\phi^{\text{game}},\,\phi^{\text{critic}}$. Additionally, we consider random disturbances sampled uniformly in all directions ($\pi^{\text{rnd}}$) or at extreme directions ($\pi^{\text{rnd,+}}$). The proposed gameplay filter $\phi^{\text{game}}$ has a higher safe rate than the task policy $\pi^{\text{task}}$, the critic filter $\phi^{\text{critic}}$, and the learned reach–avoid actor $\pi_{\theta}$.
 
\begin{tabular}{lcccccc}
\hline
 & $\pi_{\psi}^{*}(\pi_{\theta})$ & $\pi_{\psi}^{*}(\pi^{\text{task}})$ & $\pi_{\psi}^{*}(\phi^{\text{game}})$ & $\pi_{\psi}^{*}(\phi^{\text{critic}})$ & $\pi^{\text{rnd}}$ & $\pi^{\text{rnd,+}}$ \\
\hline
$\pi_{\theta}$ & 0.37 & 0.38 & 0.17 & 0.44 & 0.88 & 0.85 \\
$\pi^{\text{task}}$ & 0.0 & 0.0 & 0.0 & 0.0 & 0.03 & 0.03 \\
$\phi^{\text{game}}$ & 0.42 & 0.35 & 0.03 & 0.45 & 0.84 & 0.89 \\
$\phi^{\text{critic}}$ & 0.37 & 0.34 & 0.10 & 0.44 & 0.86 & 0.86 \\
\hline
\end{tabular}
 
Figure 4: We evaluate the impact of the three-level curriculum by retrospectively measuring the effectiveness of the trainee control actor against the fully trained disturbance actor. Contrary to our initial expectation, the two pretraining stages do not present noticeable advantages over direct learning.

Sensitivity analysis: reach–avoid criteria vs. avoid-only. We evaluate the significance of using reach–avoid criteria in the gameplay filter by performing a sensitivity analysis of the horizon in the imagined gameplay. Figure 3 shows that the gameplay filter with reach–avoid criteria maintains a $100\%$ safe rate even when the gameplay horizon is short ($H=10$). However, the gameplay filter with avoid-only criteria, which simplifies Eq. 20 to

\begin{equation}
\Delta^{\text{\tiny{\faShield*}},\text{game}}_{H,L,\mathcal{A}}(x,\pi^{\text{task}}) := \mathbbm{1}\left\{\forall \tau\in\{L,\dots,H\},\ \hat{x}_{\tau}\notin\mathcal{F}\right\}-\frac{1}{2},
\tag{21}
\end{equation}

has more safety violations than the task policy when $H=10$. The difference arises because a shorter imagined gameplay results in more frequent filter intervention under reach–avoid criteria but in overly optimistic monitoring under avoid-only criteria, which ignore failures that lie beyond the horizon. Further, as the gameplay horizon increases, the filter with reach–avoid criteria intervenes less frequently, i.e., if $H\geq H^{\prime}$,

\begin{equation}
\Delta^{\text{\tiny{\faShield*}},\text{game}}_{H,L,\mathcal{RA}}<0 \;\to\; \Delta^{\text{\tiny{\faShield*}},\text{game}}_{H^{\prime},L,\mathcal{RA}}<0.
\tag{22}
\end{equation}

This observation indicates that reach–avoid criteria are preferred in physical deployment as it is difficult to know the sufficient horizon a priori.
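The contrast between the two criteria can be sketched on pre-computed imagined trajectories. The avoid-only function follows Eq. 21 directly. The reach–avoid function is our assumed reading (reach the target set at some step in $\{L,\dots,H\}$ without failing earlier in that window), since Eq. 20 is not restated here. The trajectories and sets are hypothetical toy examples.

```python
def avoid_only_delta(traj, in_failure, L, H):
    """Eq. 21: +0.5 if no imagined state in steps L..H is a failure, else -0.5."""
    safe = all(not in_failure(traj[t]) for t in range(L, H + 1))
    return (1.0 if safe else 0.0) - 0.5


def reach_avoid_delta(traj, in_failure, in_target, L, H):
    """Assumed reach-avoid criterion: +0.5 only if the target set is reached
    within steps L..H before any failure, else -0.5."""
    for t in range(L, H + 1):
        if in_failure(traj[t]):
            return -0.5
        if in_target(traj[t]):
            return 0.5
    return -0.5  # neither failed nor recovered: treated as unsafe


# Hypothetical scalar example: failure when |x| >= 1, target when |x| <= 0.2.
def in_failure(x):
    return abs(x) >= 1.0

def in_target(x):
    return abs(x) <= 0.2

# Slowly drifting toward failure (first failure at step 13).
drift = [0.5 + 0.04 * t for t in range(31)]
# Slowly recovering toward the target set (reached around step 12).
recover = [0.8 - 0.05 * t for t in range(31)]
```

With $H=10$, `avoid_only_delta(drift, in_failure, 0, 10)` is positive although the trajectory fails shortly after the horizon, whereas `reach_avoid_delta(drift, in_failure, in_target, 0, 10)` is negative and would trigger an intervention. On `recover`, the reach–avoid value is negative for $H=10$ and positive for $H=20$, which is consistent with Eq. 22: a longer horizon can only reduce interventions.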

Sensitivity analysis: three-level training curriculum. We test the need for a three-level curriculum using gameplay results against a specialized adversary for the reach–avoid control actor trained with the curriculum, i.e., $\pi_{\psi}^{*}(\pi_{\theta})$. Figure 4 shows the safe rate of gameplay between the model checkpoints stored during training and $\pi_{\psi}^{*}(\pi_{\theta})$. We observe that the two pretraining stages in the curriculum do not significantly improve training performance: learning directly in the L3 stage requires a similar number of gameplay learning steps to reach a decent safety performance.

VI Conclusion

This work presents a game-theoretic learning approach to synthesize safety filters for high-order, nonlinear dynamics. The proposed gameplay safety filter monitors the risk to system safety through imagined games between its best-effort safety fallback policy and a learned virtual adversary that aims to realize the worst-case uncertainty in the system. We validate our approach on a physical quadruped robot under strong tugging forces and unmodeled irregular terrain, where it maintains zero-shot safety. An exhaustive simulation study compares the approach with state-of-the-art safety fallback policies and safety filters and examines the relative importance of design choices.

Appendix A Implementation Details


The state and action space are defined as:

\begin{align*}
x &= \left[p_{x},p_{y},p_{z},v_{x},v_{y},v_{z},\theta_{x},\theta_{y},\theta_{z},\omega_{x},\omega_{y},\omega_{z},\{\theta^{i}_{\text{J}}\},\{\omega^{i}_{\text{J}}\}\right],\\
u &= \left[\{\delta\theta^{i}_{\text{J}}\}\right],
\end{align*}

with $p_{x},p_{y},p_{z}$ the position of the body center, $v_{x},v_{y},v_{z}$ the velocity of the robot in the body frame, $\theta_{x},\theta_{y},\theta_{z}$ the roll, pitch, and yaw of the robot, $\omega_{x},\omega_{y},\omega_{z}$ the body axial rotational rates, and $\theta^{i}_{\text{J}},\omega^{i}_{\text{J}},\delta\theta^{i}_{\text{J}}$ the angle, angular velocity, and commanded angular increment of the robot's $i^{\text{th}}$ joint.\footnote{In this work, we specifically consider walking locomotion, so the policies ignore $p_{x},p_{y},p_{z},\theta_{z}$.}


We define the critical points $\mathbb{p_{c}}$ as the body corners and knees of the robot. The safety margin is defined as:

\begin{equation*}
g(x)=\min\left\{\min_{i}\{z_{\text{corner}}^{i}\}-\bar{z}_{\text{corner},g},\ \min_{i}\{z_{\text{knee}}^{i}\}-\bar{z}_{\text{knee}}\right\},
\end{equation*}

with $z_{\text{corner}}^{i}$ the distance to ground of the $i^{\text{th}}$ body corner and $z_{\text{knee}}^{i}$ the distance to ground of the $i^{\text{th}}$ knee.


The target margin function is defined as

\begin{align*}
\ell(x)=\min\Big\{\,&\bar{\omega}-|\omega_{x}|,\ \bar{\omega}-|\omega_{y}|,\ \bar{\omega}-|\omega_{z}|,\\
&\bar{v}-|v_{x}|,\ \bar{v}-|v_{y}|,\ \bar{v}-|v_{z}|,\\
&\bar{z}_{\text{corner},\ell}-\max_{i}\{z_{\text{corner}}^{i}\},\ \bar{z}_{\text{toe}}-\max_{i}\{z_{\text{toe}}^{i}\}\Big\},
\end{align*}

with $z_{\text{toe}}^{i}$ the distance to ground of the $i^{\text{th}}$ toe. $\bar{(\cdot)}$ denotes the desired magnitude. Table IV shows the thresholds used to define the safety and target margin functions for quadruped walking.
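As a minimal sketch, the two margin functions can be evaluated as follows once the heights of the critical points are available. The default thresholds are the values listed in Table IV. The class and function names are hypothetical, and the heights are assumed to come from a forward-kinematics routine that is not shown here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Thresholds:
    """Threshold values from Table IV."""
    z_corner_g: float = 0.1    # m, corner height bound in the safety margin
    z_knee: float = 0.05       # m, knee height bound in the safety margin
    z_corner_l: float = 0.4    # m, corner height bound in the target margin
    z_toe: float = 0.05        # m, toe height bound in the target margin
    omega: float = 10.0        # deg/s, body angular rate bound
    v: float = 0.2             # m/s, body velocity bound


def safety_margin(z_corner: Sequence[float], z_knee: Sequence[float],
                  th: Thresholds = Thresholds()) -> float:
    """g(x): positive while all corners and knees stay above their thresholds."""
    return min(min(z_corner) - th.z_corner_g, min(z_knee) - th.z_knee)


def target_margin(omega: Sequence[float], v: Sequence[float],
                  z_corner: Sequence[float], z_toe: Sequence[float],
                  th: Thresholds = Thresholds()) -> float:
    """l(x): positive when the body is slow and all corners and toes are low."""
    terms = [th.omega - abs(w) for w in omega]      # angular rates in deg/s
    terms += [th.v - abs(vi) for vi in v]           # linear velocities in m/s
    terms.append(th.z_corner_l - max(z_corner))
    terms.append(th.z_toe - max(z_toe))
    return min(terms)
```

For a hypothetical standing pose with all corners at 0.3 m, knees at 0.15 m, toes on the ground, and zero velocities, both margins are positive. If one corner drops to 0.05 m, `safety_margin` becomes negative.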

TABLE IV: Implementation details of the safety specifications of quadruped walking.
 
\begin{tabular}{lc}
\hline
Notation & Magnitude \\
\hline
$\bar{z}_{\text{corner},g}$ & 0.1 m \\
$\bar{z}_{\text{knee}}$ & 0.05 m \\
$\bar{z}_{\text{corner},\ell}$ & 0.4 m \\
$\bar{z}_{\text{toe}}$ & 0.05 m \\
$\bar{\omega}$ & 10 deg/s \\
$\bar{v}$ & 0.2 m/s \\
\hline
\end{tabular}
 

Table V summarizes the terminology used in safety filter design and highlights the modularity of the proposed gameplay filter.

TABLE V: Terminology and symbols used in safety filter modules.