
Gameplay Filters: Safe Robot Walking through Adversarial Imagination

Duy P. Nguyen1^§, Kai-Chieh Hsu1^§, Wenhao Yu2, Jie Tan2, Jaime F. Fisac1 1Department of Electrical and Computer Engineering, Princeton University, United States
{duyn, kaichieh, jfisac}@princeton.edu 2Google Deepmind, United States
{magicmelon, jietan}@google.com

Abstract

Ensuring the safe operation of legged robots in uncertain, novel environments is crucial to their widespread adoption. Despite recent advances in safety filters that can keep arbitrary task-driven policies from incurring safety failures, existing solutions for legged robot locomotion still rely on simplified dynamics and may fail when the robot is perturbed away from predefined stable gaits. This paper presents a general approach that leverages offline game-theoretic reinforcement learning to synthesize a highly robust safety filter for high-order nonlinear dynamics. This gameplay filter then maintains runtime safety by continually simulating adversarial futures and precluding task-driven actions that would cause it to lose future games (and thereby violate safety). Validated on a 36-dimensional quadruped robot locomotion task, the gameplay safety filter exhibits inherent robustness to the sim-to-real gap without manual tuning or heuristic designs. Physical experiments demonstrate the effectiveness of the gameplay safety filter under perturbations, such as tugging and unmodeled irregular terrains, while simulation studies shed light on how to trade off computation and conservativeness without compromising safety.

^§ Denotes equal contribution.

Figure 1: We deploy our gameplay-based safety filter on a quadruped robot equipped with a safety-agnostic walking (task) policy, and evaluate its effectiveness under strong tugging forces and unmodeled irregular terrain. The gameplay filter continually monitors the robot’s safety by rapidly simulating adversarial futures, pitting its best-effort safety strategy against a learned virtual adversary that aims to exploit uncertainty and sim-to-real error to make it tip over. If hazardous conditions arise, the filter intervenes to preclude task-driven actions that would cause the robot to lose this imaginary safety game at a later time. Interventions result in highly robust, adaptive behaviors such as counterbalancing to fight persistent pulls and springing into a wide stance to break imminent falls.

I Introduction

Increasingly, autonomous robots are being deployed beyond controlled environments and required to operate reliably in uncertain, unforeseen conditions \citep{kumar2021rma,zhuang2023robot,hsuzen2022sim2lab2real,margolis2022rapid,kostrikov2023demonstrating}. This has resulted in a growing need for robot safety frameworks that can scale with system complexity and generalize gracefully to novel environments.

Model-based approaches developed by the robotics and control communities offer a principled treatment of safe decision-making under uncertainty. Unfortunately, computing global safety fallback strategies for high-dimensional, nonlinear robot dynamics remains an open problem. State-of-the-art numerical safety methods only scale to 5–6 state variables [bansal2017hamilton, bui2021realtime], woefully short of the 12 needed to accurately model the flight of a drone in free space, let alone the 30–50 required for most legged robots. Analytical approaches like Lyapunov controllers and CBFs rely on hand-design, structural assumptions, and reduced-order models [nguyen2022robust, molnar2022modelfree], restricting their use to a local operating envelope, such as a predefined stable walking gait. As a result, legged robots are notorious for falling easily, especially on irregular terrain or when externally perturbed (pushed, tugged, or tripped).

Data-driven approaches grounded in machine learning address the scalability challenge by automatically distilling efficient representations from the robot’s prior experience or, more recently, from web-scale data \citep{deepmind2023rtx}. In practice, however, learned models for robot control, including deep reinforcement learning and imitation learning, are often trained in simulated environments due to hardware constraints and poor sample complexity (requiring millions of training episodes that can much more easily be procured by at-scale simulation). The discrepancy between training and deployment conditions, or sim-to-real gap, can result in deteriorated operational performance and, in extreme cases, catastrophic safety failures (e.g., damaging the robot or hurting nearby people) [hsuzen2022sim2lab2real]. Additionally, end-to-end approaches often require re-training for different task specifications, which presents technical challenges in balancing safety objectives with task-specific goals, especially avoiding situations where a robot may unexpectedly prioritize task performance over safety.

A recent line of work breaks down the safety–performance trade-off through variations of a supervisory control mechanism known as a safety filter, which monitors the autonomous system’s safety at runtime and intervenes when necessary by adjusting the original performance-oriented control to avert catastrophic failures \citep{hsu2023safety,fisac2019AGS,ames2017cbf,wabersich2018linear,bastani2021safe,kumar2023cbfddp}. While some efforts have been made to synthesize safety filters for legged robot locomotion, these typically rely on simplified low-order dynamics to maintain tractability, and they lack a systematic treatment of uncertainty and the reality gap \citep{hsuzen2022sim2lab2real,yang2022safe}.

This paper introduces a novel type of safety filter that brings together the scalability of learning-based representations and the reliability of model-based safety analysis, enabling highly robust and minimally disruptive safety assurance for arbitrary robot task policies. Unlike most general safety filter techniques, the approach scales readily to robot dynamics with tens of state dimensions, which allows us to focus on its use in the dynamic legged locomotion domain. Further, our safety filter can monitor a closed-loop policy and address the associated computational latency, while existing safety filters only handle a single control input.

A preliminary offline stage leverages game-theoretic reinforcement learning to synthesize control and disturbance policies, which can be systematically used to construct safety filters for general nonlinear, high-dimensional dynamic systems at runtime. At every control cycle, the online gameplay safety filter assesses safety risks based on an imagined game between the control policy and the adversarial disturbance policy trained in offline gameplay learning. This imagined gameplay aims to simulate the worst-case realization of the uncertainty in the system, whether from sim-to-real error or from perturbations in the environment. If dangerous conditions emerge, the filter steps in to prevent task-driven actions that could lead the robot to lose the subsequent safety-oriented gameplay.

The effectiveness of the proposed gameplay safety filter is validated in a legged robot locomotion task with a 36-dimensional state space and a 12-dimensional control space.\footnote{See https://saferobotics.princeton.edu/research/gameplay-filter for supplementary material.} Our results demonstrate that the gameplay safety filter is inherently robust to the sim-to-real gap, operating in a “zero-shot” manner without requiring manual design or hyperparameter tuning during deployment. Moreover, the gameplay safety filter achieves a high safety rate without being overly conservative, avoiding frequent interventions in the performance-oriented control policy. Importantly, the gameplay safety filter synthesis remains independent of the performance-oriented policy, making it modular and adaptable to any performance-oriented policy at runtime. Our evaluation includes real-world experiments on different terrains with perturbations (see Figure 1) and a comprehensive simulation study on the relative importance of design choices.

II Related Work

Learning for Locomotion. Conventionally, legged locomotion has been addressed through model-based techniques, including model-predictive control \citep{bledt2018mit} and trajectory optimization \citep{winkler2018gait}. However, recent advancements in deep learning offer the opportunity to learn directly from interactions with environments and feedback in the form of a reward signal, bypassing the need for intricate dynamics modeling and extensive domain knowledge. \citet{kostrikov2023demonstrating} demonstrated the direct training of locomotion policies across various terrains in the real world through reinforcement learning by carefully formulating the problem with consideration for state space, action space, and reward function. Despite the success of reinforcement learning, it relies on trial and error during training. In safety-critical environments, learning from scratch can lead to catastrophic safety failures. An alternative approach involves initially training control policies in simulation and then bridging the simulation-to-real gap through methods such as domain randomization \citep{tobin2017domain}, task-driven adaptation \citep{kumar2021rma, ren2023adaptsim}, and system identification \citep{fabio2019bayessim}.

Safety Filters. While learning-based policies discussed earlier exhibit practical utility, they primarily focus on task-oriented performance metrics. However, ensuring their safe operation in unforeseen, uncertain, and unforgiving environments is of paramount importance. A recent line of work aims at inducing safety awareness and even guarantees for learning-based policies through a safety filter. The runtime operation of every safety filter can be conceptualized as two interrelated functions: monitoring and intervention \citep{hsu2023safety}. The safety filter continually monitors the robot’s planned actions to assess the level of safety risk. Subsequently, the filter may intervene by modulating or entirely overriding the robot’s intended control input to guarantee the preservation of safety. Many safety filters incorporate monitoring and intervention procedures guided by a safety-oriented control strategy, which the filter views as a viable fallback.

One important family of safety filters is built on Hamilton-Jacobi (HJ) reachability analysis, which computes a global safe value function through dynamic programming \citep{mitchell2008flexible,fisac2015reachavoid}. The resulting value function encodes the maximal safe set and optimal safety fallback, and thus, a least-restrictive safety filter can be synthesized by a switch-type intervention \citep{fisac2019AGS}. Although systematic and powerful, HJ methods have poor scalability and are limited to no more than 6 state dimensions \citep{bui2022optimizeddp}.

On the other hand, control barrier functions (CBFs) \citep{ames2017cbf,ames2019control} no longer encode or approximate the maximal safe set. Instead, CBFs, if found, provide a sufficient condition to keep the system safe forever, akin to control Lyapunov functions \citep{sontag1983lyapunov}. Another critical feature of CBFs is their use of optimization-type intervention, which finds a minimal modulation of the task-oriented control that still keeps the system safe, and thus CBFs allow a smooth intervention mechanism. However, finding a CBF for general dynamics is usually not trivial, and a CBF is only valid locally and is not robust to model mismatch.

For high-dimensional dynamics, computing global optimal value functions (HJ) is computationally prohibitive, and finding a valid CBF is often heuristic. Instead of relying on value functions, model predictive safety filters aim to certify the system safety in real time by forward simulation (“rolling out”) or trajectory optimization\footnote{Some recent efforts have been made to synthesize CBFs based on model predictive methods \citep{chen2021backup,kumar2023cbfddp}, although the dynamics considered have no more than 5 state dimensions.} \citep{wabersich2018linear,bastani2021safe,hsunguyen2023isaacs}, an approach closely linked to this work. \citet{hsunguyen2023isaacs} consider the forward-reachable set (FRS) of the system trajectories. However, the use of the FRS brings two challenges for general high-dimensional dynamics: 1) the FRS needs to be tight so that the safety filter is not overly conservative, and 2) the computation of the FRS needs to be fast enough to satisfy real-time constraints. Instead, \citet{bastani2021safe} assume the disturbance distribution is known, so that a sufficient number of trajectories can be sampled and a statistical guarantee derived; nonetheless, the disturbance distribution may be difficult to obtain in practice.

Safety filters have been applied to ensure the safe operation of learning-based locomotion \citep{hsuzen2022sim2lab2real,yang2022safe}. \citet{hsuzen2022sim2lab2real} introduce a safety monitor based on a value function and fine-tune the corresponding safety filter using a two-stage reinforcement learning framework, providing statistical safety guarantees. However, they consider the uncertainty distribution as a whole, while our work focuses on robustly safeguarding against the worst-case realization of uncertainty. \citet{yang2022safe} propose a safety monitor criterion based on a heuristically defined safety-triggered set, checking if rollouts activate the criterion. In contrast, our work determines such a safety-triggered set through gameplay rollouts. Additionally, these methods employ a simplified dynamics model and only consider velocity control instead of torque or joint position control directly.

III Preliminaries

III-A Scalable Safety Analysis via Reinforcement Learning

We consider discrete-time, uncertain robot dynamics

\begin{equation}
x_{k+1} = f(x_k, u_k, d_k),
\tag{1}
\end{equation}

where, at each time step ${k}\in\mathbb{N}$, ${{x}_{k}}\in{\mathcal{X}}\subseteq\mathbb{R}^{n_{x}}$ is the state of the system, ${{u}_{k}}\in{\mathcal{U}}\subseteq\mathbb{R}^{n_{u}}$ is the control input (typically from a control policy ${{\pi}^{u}}\in{\Pi^{u}}\colon{\mathcal{X}}\to{\mathcal{U}}$), and ${{d}_{k}}\in{\mathcal{D}}\subseteq\mathbb{R}^{n_{d}}$ is the disturbance input, unknown a priori. The disturbance bound defines the operational design domain (ODD), under which we must ensure autonomous systems function safely and effectively. We further assume we are given a specification of the failure set ${\mathcal{F}}\subset{\mathcal{X}}$ of all conditions the system state should never reach. Safety analysis aims to determine the largest possible safe set ${\Omega}\subset{\mathcal{X}}$, from which there exists a control policy that can maintain system safety against all admissible uncertainty realizations (encoded by a disturbance policy ${{\pi}^{d}}\in{\Pi^{d}}\colon{\mathcal{X}}\times{\mathcal{U}}\to{\mathcal{D}}$)

\begin{equation}
\Omega := \left\{ x \in \mathcal{X} \mid \exists \pi^{u} \in \Pi^{u},\, \forall \pi^{d} \in \Pi^{d},\, \forall k > 0,\, x_{k} \notin \mathcal{F} \right\},
\tag{2}
\end{equation}

where ${{x}_{k}}={\mathbf{{x}}}_{{x}}^{{{\pi}^{u}}\!,{{\pi}^{d}}}({k})$ and ${\mathbf{{x}}}_{{x}}^{{{\pi}^{u}}\!,{{\pi}^{d}}}$ is the system trajectory starting from ${{x}_{0}}={x}$ and following dynamics Eq. 1 with control and disturbance inputs from control policy ${{\pi}^{u}}$ and disturbance policy ${{\pi}^{d}}$ , respectively.

Hamilton-Jacobi-Isaacs (HJI) reachability analysis leverages level set representations to transform the binary outcome, or game-of-kind as formulated in Eq. 2, into a continuous outcome, or game-of-degree, by a (Lipschitz-continuous) margin function ${g}:{\mathcal{X}}\to\mathbb{R}$ such that ${g}({x})<0\Leftrightarrow{x}\in{\mathcal{F}}$\footnote{An example of a margin function is the signed distance function to the failure set.} \citep{mitchell2008flexible,bansal2017hamilton}:

\begin{equation}
J^{\pi^{u},\pi^{d}}_{k}(x) := \min_{\tau \in [k, H]} g\left( \mathbf{x}_{x}^{\pi^{u},\pi^{d}}(\tau) \right),
\tag{3}
\end{equation}

where ${H}$ is the control horizon of interest. Consistent with the order of quantifiers in Eq. 2, we compute the lower value of the game ${V}_{k}({x}):=\max_{{\pi}^{u}}\min_{{\pi}^{d}}{J}_{k}^{{\pi}^{u},{\pi}^{d}}({x})$, which gives the disturbance policy an information advantage \citep{isaacs1954differential}. Additionally, this value function is the fixed-point solution of the Isaacs equation, which can be solved by dynamic programming


\begin{align}
V_{k}(x) &= \max_{u} \min_{d} \min\left\{ g(x),\, V_{k+1}\big( f(x, u, d) \big) \right\}, \tag{4a} \\
V_{H}(x) &= g(x). \tag{4b}
\end{align}

If a nonempty safe set is present in the context of the differential game, the value function converges and becomes time-independent within this set as ${H}\to\infty$. Consequently, we can eliminate the dependence on ${k}$, resulting in ${V}({x})=\lim_{{k}\to-\infty}{V}_{k}({x})$. We can recover the (maximal) safe set as the superzero level set of the value function, ${{\Omega}^{*}}=\{{x}\in{\mathcal{X}}\mid{V}({x})\geq 0\}$, and the optimal policies ${\pi}^{{u}*},\,{\pi}^{{d}*}$ from the optimizers of Eq. 4.
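As a small illustration of the backward recursion in Eq. 4, the following Python sketch runs it on a hypothetical one-dimensional grid world. The grid, the control and disturbance sets, and all function names are assumptions made for this example only; they are unrelated to the quadruped model used later in the paper.

```python
import numpy as np

N = 11                          # grid states 0..10 (hypothetical toy system)
CONTROLS = (-2, -1, 0, 1, 2)    # control authority
DISTURBANCES = (-1, 0, 1)       # bounded disturbance, weaker than the control

def step(x, u, d):
    """Toy dynamics f(x, u, d): move on the grid, saturating at the walls."""
    return int(np.clip(x + u + d, 0, N - 1))

def margin(x):
    """Safety margin g(x): negative exactly at the walls x = 0 and x = 10."""
    return min(x, N - 1 - x) - 0.5

def safety_value(horizon):
    """Backward recursion of Eq. 4, returning V_0 on the whole grid."""
    g = np.array([margin(x) for x in range(N)])
    value = g.copy()                          # terminal condition V_H = g
    for _ in range(horizon):
        nxt = value
        value = np.empty(N)
        for x in range(N):
            # Lower value: the disturbance responds to the chosen control.
            best = max(min(nxt[step(x, u, d)] for d in DISTURBANCES)
                       for u in CONTROLS)
            # g(x) does not depend on (u, d), so it can be pulled outside.
            value[x] = min(g[x], best)
    return value

V0 = safety_value(horizon=20)
safe_set = [x for x in range(N) if V0[x] >= 0]
```

In this toy problem the control outweighs the disturbance, so every interior cell belongs to the superzero level set and only the two wall cells are excluded. The nested loops also make the exponential cost of gridding visible: the table has one entry per grid cell, which is what limits this approach to a handful of state dimensions.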

However, in practice, it is difficult to know a sufficient control horizon a priori. Also, the maximal safe set is usually difficult to find for complex, high-dimensional dynamics. Instead, reach–avoid analysis simplifies the safety analysis by checking whether a control policy exists that guides the system into a (robust controlled-invariant) target set ${\mathcal{T}}\subset{\mathcal{X}}$ (${\mathcal{T}}\cap{\mathcal{F}}=\emptyset$) within ${H}$ time steps without first entering the failure set. The reach–avoid set is defined by

\begin{align}
\mathcal{RA} := \Big\{\, & x \in \mathcal{X} \mid \exists \pi^{u} \in \Pi^{u},\, \forall \pi^{d} \in \Pi^{d}, \notag \\
& \exists k \in [0, H],\, x_{k} \in \mathcal{T} \land \forall \tau \in [0, k],\, x_{\tau} \notin \mathcal{F} \,\Big\}. \tag{5}
\end{align}

Since ${\mathcal{T}}$ is a robust controlled-invariant set, there exists a policy ${{\pi}^{{\mathcal{T}}}}$ that can keep the system state in ${\mathcal{T}}$ forever under all disturbance realizations. After the reach–avoid policy safely guides the system into ${\mathcal{T}}$, we can switch to ${{\pi}^{{\mathcal{T}}}}$ to keep the system in ${\mathcal{T}}$ forever. Thus, reach–avoid analysis simplifies the safety control design by requiring only the assurance of control invariance for a small subset of states ${\mathcal{T}}$, which is sufficient for the reach–avoid set to be a safe set.

An auxiliary game of degree can be similarly formulated by introducing another margin function with respect to the target set, ${\ell}\colon{\mathcal{X}}\to\mathbb{R}$, such that ${\ell}({x})\geq 0\Leftrightarrow{x}\in{\mathcal{T}}$. We consider the reach–avoid outcome \citep{hsu2021safety}

\begin{equation}
J^{\pi^{u},\pi^{d}}_{k}(x) := \max_{\tau \in [k, H]} \min\left\{ \ell\left( x_{\tau} \right),\, \min_{s \in [k, \tau]} g\left( x_{s} \right) \right\}.
\tag{6}
\end{equation}
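To make the two outcomes concrete, the sketch below evaluates the safety outcome of Eq. 3 and the reach–avoid outcome of Eq. 6 on a recorded trajectory, given the sequences of margins $g(x_\tau)$ and $\ell(x_\tau)$ from step $k$ to $H$. The function names and the numbers in the example are illustrative assumptions.

```python
import math

def safety_outcome(g_seq):
    """Eq. 3: the worst safety margin encountered along the trajectory."""
    return min(g_seq)

def reach_avoid_outcome(l_seq, g_seq):
    """Eq. 6: best target margin reached, capped by the worst safety
    margin seen up to (and including) the time the target is reached."""
    best = -math.inf
    worst_g_so_far = math.inf
    for l_tau, g_tau in zip(l_seq, g_seq):
        worst_g_so_far = min(worst_g_so_far, g_tau)
        best = max(best, min(l_tau, worst_g_so_far))
    return best

# Trajectory that reaches the target (l >= 0) at step 2 without failing.
g_ok = [1.0, 0.5, 0.2, 0.4]
l_ok = [-1.0, -0.5, 0.3, 0.1]

# Trajectory that fails (g < 0) at step 1 before reaching the target.
g_bad = [1.0, -0.1, 1.0]
l_bad = [-1.0, -1.0, 1.0]

j_safe = safety_outcome(g_ok)
j_ra_ok = reach_avoid_outcome(l_ok, g_ok)
j_ra_bad = reach_avoid_outcome(l_bad, g_bad)
```

A nonnegative reach–avoid outcome certifies that the trajectory entered the target set without having touched the failure set beforehand; the second trajectory reaches the target only after a failure, so its outcome stays negative.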

The reach–avoid value function can be solved by the following Isaacs equation


\begin{align}
V_{k}(x) &= \max_{u} \min_{d} \min\left\{ g(x),\, \max\left\{ \ell(x),\, V_{k+1}\big( f(x, u, d) \big) \right\} \right\}, \tag{7a} \\
V_{H}(x) &= \min\left\{ \ell(x),\, g(x) \right\}. \tag{7b}
\end{align}

Similarly, the reach–avoid set can be recovered by ${\mathcal{RA}}=\{{x}\in{\mathcal{X}}\mid{V}_{0}({x})\geq 0\}$ .
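The reach–avoid recursion in Eq. 7 can be sketched in the same way on a hypothetical one-dimensional grid. The toy dynamics, the target set $\{4,5,6\}$, and the margin functions below are assumptions chosen so that the target is robustly controlled-invariant under the toy disturbance; none of them come from the paper's robot model.

```python
import numpy as np

N = 11                          # grid states 0..10 (hypothetical toy system)
CONTROLS = (-2, -1, 0, 1, 2)
DISTURBANCES = (-1, 0, 1)

def step(x, u, d):
    """Toy dynamics f(x, u, d) with saturation at the walls."""
    return int(np.clip(x + u + d, 0, N - 1))

def safety_margin(x):
    """g(x) < 0 exactly at the walls x = 0 and x = 10 (failure set)."""
    return min(x, N - 1 - x) - 0.5

def target_margin(x):
    """l(x) >= 0 exactly on the target set {4, 5, 6}."""
    return 1.5 - abs(x - 5)

def reach_avoid_value(horizon):
    """Backward recursion of Eq. 7, returning V_0 on the whole grid."""
    g = np.array([safety_margin(x) for x in range(N)])
    l = np.array([target_margin(x) for x in range(N)])
    value = np.minimum(l, g)                  # terminal condition (7b)
    for _ in range(horizon):
        nxt = value
        value = np.empty(N)
        for x in range(N):
            best = max(min(nxt[step(x, u, d)] for d in DISTURBANCES)
                       for u in CONTROLS)
            # min{g, max{l, .}} is monotone, so it commutes with max/min.
            value[x] = min(g[x], max(l[x], best))
    return value

def reach_avoid_set(horizon):
    value = reach_avoid_value(horizon)
    return [x for x in range(N) if value[x] >= 0]

ra_short = reach_avoid_set(horizon=1)
ra_long = reach_avoid_set(horizon=3)
```

With a one-step horizon only cells within guaranteed reach of the target qualify, while a three-step horizon already recovers every non-failure cell, which illustrates how the reach–avoid set grows with $H$.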

The computational complexity and memory requirements of solving Eq. 4 and Eq. 7 grow exponentially with the dimension of the continuous state, which limits their applicability to no more than six state dimensions for general dynamics \citep{bui2022optimizeddp}. In recent work, \citet{hsunguyen2023isaacs} proposed an adversarial reinforcement learning framework, ISAACS, to find approximate solutions to the Isaacs equations, where the state-action value function, or Q-function, ${{Q}_{{\omega}}}$, the reach–avoid control policy ${{\pi}_{\theta}}$, and the disturbance policy ${{\pi}_{\psi}}$ are parameterized by neural networks with parameters ${\omega},{\theta},{\psi}$, respectively.\footnote{In reinforcement learning literature, ${{Q}_{{\omega}}}$ is also called the critic as it evaluates the quality of the action, while ${{\pi}_{\theta}},{{\pi}_{\psi}}$ are called actors since they determine which action to take.} Each iteration of ISAACS simulates adversarial safety games to collect state-action sequences, performs gradient updates of the neural networks, and determines the policies to sample from in the next simulated games.

III-B Value-Based and Rollout-Based Safety Filters

This section introduces safety filters ${\phi}\colon{\mathcal{X}}\times{\Pi^{u}}\to{\Pi^{u}}$ and ${\phi}=({{\pi}^{\text{\tiny{\faShield*}}}},{\Delta}^{\text{\tiny{\faShield*}}})$\footnote{With a slight abuse of notation, we highlight here that the safety filter is composed of a safety fallback policy and a safety monitor.} based on switch-type intervention, which can generally be formulated as \citep{hsu2023safety}

\begin{equation}
\phi(x, \pi^{\text{task}}) =
\begin{cases}
\pi^{\text{task}}, & \Delta^{\text{\tiny{\faShield*}}}(x, \pi^{\text{task}}) \geq 0, \\
\pi^{\text{\tiny{\faShield*}}}, & \text{otherwise,}
\end{cases}
\tag{10}
\end{equation}

where ${{\pi}^{\text{task}}}\colon{\mathcal{X}}\to{\mathcal{U}}$ is an arbitrary performance-oriented task policy, ${\Delta}^{\text{\tiny{\faShield*}}}\colon{\mathcal{X}}\times{\Pi^{u}}\to\mathbb{R}$ is a safety monitor, and ${{\pi}^{\text{\tiny{\faShield*}}}}\colon{\mathcal{X}}\to{\mathcal{U}}$ is a safety-aware or even safety-guaranteed fallback policy. ${\Delta}^{\text{\tiny{\faShield*}}}$ is a safety monitor if a positive output indicates that the input policy can keep the system safe from the input state. Therefore, the safety filter in Eq. 10 maintains the system’s safety, following Theorem 1 in \citep{hsu2023safety}. Our approach introduces a more general and novel safety filter and monitor, treating the task policy as a function rather than just a single proposed task control.
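The switching rule of Eq. 10 amounts to a few lines of code. The sketch below is a minimal Python rendering with scalar states and controls; the type aliases, the one-step look-ahead monitor, and the two placeholder policies are hypothetical and serve only to exercise the switch (the placeholder monitor is not a valid safety monitor in the sense defined above).

```python
from typing import Callable

Policy = Callable[[float], float]            # state -> control
Monitor = Callable[[float, Policy], float]   # (state, policy) -> margin

def make_switch_filter(monitor: Monitor, fallback: Policy):
    """Build phi(x, pi_task) as in Eq. 10: keep the task policy while the
    monitor output is nonnegative, otherwise hand over to the fallback."""
    def phi(x: float, task_policy: Policy) -> Policy:
        return task_policy if monitor(x, task_policy) >= 0 else fallback
    return phi

# --- Hypothetical placeholders used only to exercise the filter ---------
X_MAX = 3.0                                  # boundary the state must not cross

def task_policy(x: float) -> float:
    return 1.0                               # always pushes to the right

def fallback_policy(x: float) -> float:
    return -1.0                              # always backs away

def one_step_monitor(x: float, policy: Policy) -> float:
    # Margin to the boundary after applying the candidate control once.
    return X_MAX - (x + policy(x))

phi = make_switch_filter(one_step_monitor, fallback_policy)
u_free = phi(0.0, task_policy)(0.0)          # far from the boundary
u_override = phi(2.5, task_policy)(2.5)      # task action would cross it
```

Note that the filter returns a policy rather than a single control, mirroring the signature ${\phi}\colon{\mathcal{X}}\times{\Pi^{u}}\to{\Pi^{u}}$ used in the text.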

Previous work has used a neural-network-parameterized Q-function as the safety monitor, with a threshold $\epsilon>0$ \citep{hsuzen2022sim2lab2real,thananjeyan2021recovery}:

\begin{equation}
\Delta^{\text{\tiny{\faShield*}},\text{critic}}_{\epsilon}(x, \pi^{\text{task}}) := Q_{\omega}(x, \pi^{\text{task}}(x)) - \epsilon.
\tag{11}
\end{equation}

Also, the safety fallback policy can be constructed by $\operatorname*{argmax}_{{u}\in{\mathcal{U}}}{{Q}_{{\omega}}}({x},{u})$ or directly by ${{\pi}_{\theta}}$ when the state is outside of the target set. On the other hand, when the state is in the target set ${\mathcal{T}}$, ${{\pi}^{{\mathcal{T}}}}$ serves as a fallback to keep the state inside the target set. In other words, the safety fallback policy ${{\pi}^{\text{\tiny{\faShield*}}}}$ is defined by a switching rule

\begin{equation}
\pi^{\text{\tiny{\faShield*}}}(x) =
\begin{cases}
\pi^{\mathcal{T}}(x), & x \in \mathcal{T}, \\
\pi_{\theta}(x), & x \notin \mathcal{T}.
\end{cases}
\tag{14}
\end{equation}

Although this critic filter has shown practical utility, it does not readily provide safety guarantees. In addition, the value threshold $\epsilon$ needs to be carefully tuned, which may be difficult for safety-critical applications.
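For concreteness, the critic monitor of Eq. 11 and the switching fallback of Eq. 14 might be wired together as in the following sketch. The closed-form `q_fn` and the simple policies are hypothetical stand-ins for the trained networks ${{Q}_{{\omega}}}$, ${{\pi}_{\theta}}$ and the target-set controller ${{\pi}^{{\mathcal{T}}}}$; the threshold value is arbitrary.

```python
def make_critic_monitor(q_fn, epsilon):
    """Eq. 11: critic value of the proposed task action minus a threshold."""
    def delta(x, task_policy):
        return q_fn(x, task_policy(x)) - epsilon
    return delta

def make_fallback(in_target, pi_target, pi_theta):
    """Eq. 14: use pi_target inside the target set, pi_theta outside it."""
    def fallback(x):
        return pi_target(x) if in_target(x) else pi_theta(x)
    return fallback

# --- Hypothetical stand-ins for the learned components ------------------
def q_fn(x, u):
    return 1.0 - abs(x + u)      # pretend safety critic Q_omega(x, u)

def pi_theta(x):
    return -0.5 * x              # pretend reach-avoid control actor

def pi_target(x):
    return 0.0                   # pretend stance controller inside T

def in_target(x):
    return abs(x) <= 0.1         # pretend target set T

def task_policy(x):
    return 0.1                   # pretend task policy

delta = make_critic_monitor(q_fn, epsilon=0.2)
fallback = make_fallback(in_target, pi_target, pi_theta)
```

The sketch also makes the tuning issue visible: whether a state is flagged depends directly on `epsilon`, and nothing in the construction ties that threshold to the approximation error of the critic.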

In contrast, the safety monitor can be built on model-predictive rollouts. \citet{bastani2021safe-acc} assumes the dynamical model is perfectly accurate (disturbance-free) and checks whether, after executing the performance-oriented control, a rollout of the fallback policy satisfies the reach–avoid criterion

\begin{align}
\Delta^{\text{\tiny{\faShield*}},\text{nom}}_{H,\mathcal{RA}}(x, \pi^{\text{task}}) := \mathbbm{1}\Big\{\, & \exists \tau \in \{1,\dots,H\},\ \hat{x}_{\tau} \in \mathcal{T} \,\land \notag \\
& \forall s \in \{1,\dots,\tau\},\ \hat{x}_{s} \notin \mathcal{F} \Big\} - \frac{1}{2} \tag{15}
\end{align}

with ${\hat{x}}_{0}={x}$, ${\hat{x}}_{1}={f}({x},{{\pi}^{\text{task}}}({x}),0)$, and ${\hat{x}}_{{\tau}+1}={f}({\hat{x}}_{{\tau}},{{\pi}^{\text{\tiny{\faShield*}}}}({\hat{x}}_{{\tau}}),0)$, ${\tau}\geq 1$. \citet{hsunguyen2023isaacs} tackle the model mismatch by employing a robust rollout with the FRS

\begin{align}
\Delta^{\text{\tiny{\faShield*}},\text{FRS}}_{H,\mathcal{RA}}(x, \pi^{\text{task}}) := \mathbbm{1}\Big\{\, & \exists \tau \in \{1,\dots,H\},\ \mathcal{R}_{\tau} \subseteq \mathcal{T} \,\land \notag \\
& \forall s \in \{1,\dots,\tau\},\ \mathcal{R}_{s} \cap \mathcal{F} = \emptyset \Big\} - \frac{1}{2} \tag{16}
\end{align}

with ${\mathcal{R}}_{0}=\{{x}\}$, ${\mathcal{R}}_{1}=\{{f}({x},{{\pi}^{\text{task}}}({x}),{d}),{d}\in{\mathcal{D}}\}$, and ${\mathcal{R}}_{{\tau}+1}=\{{f}(\hat{x},{{\pi}^{\text{\tiny{\faShield*}}}}(\hat{x}),{d}),\hat{x}\in{\mathcal{R}}_{\tau},{d}\in{\mathcal{D}}\}$, ${\tau}\geq 1$. In this work, we instead rely on rollouts of the learned reach–avoid control and disturbance actors to assess system safety risk.
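The two rollout-based monitors in Eq. 15 and Eq. 16 can be contrasted on a hypothetical one-dimensional grid world. In the sketch below the forward-reachable sets are enumerated exactly over a finite disturbance set, which is only feasible for such a toy problem (practical FRS methods rely on over-approximations); the dynamics, sets, and policies are all assumptions made for illustration.

```python
N = 11                               # grid states 0..10 (hypothetical toy)
DISTURBANCES = (-1, 0, 1)            # finite disturbance set D

def f(x, u, d):
    """Toy dynamics with saturation at the walls."""
    return max(0, min(N - 1, x + u + d))

def in_target(x):
    return 4 <= x <= 6               # target set T

def in_failure(x):
    return x in (0, N - 1)           # failure set F (the two walls)

def fallback(x):
    return max(-2, min(2, 5 - x))    # steer toward the grid center

def task_policy(x):
    return 2                         # safety-agnostic: always walk right

def nominal_monitor(x, task, horizon):
    """Eq. 15: disturbance-free rollout, task action first, then fallback."""
    x_hat = f(x, task(x), 0)
    for _ in range(horizon):         # checks x_hat_1, ..., x_hat_H
        if in_failure(x_hat):
            return -0.5
        if in_target(x_hat):
            return 0.5
        x_hat = f(x_hat, fallback(x_hat), 0)
    return -0.5

def frs_monitor(x, task, horizon):
    """Eq. 16: propagate the full reachable set R_1, ..., R_H."""
    reach = {f(x, task(x), d) for d in DISTURBANCES}
    for _ in range(horizon):
        if any(in_failure(s) for s in reach):
            return -0.5
        if all(in_target(s) for s in reach):
            return 0.5
        reach = {f(s, fallback(s), d) for s in reach for d in DISTURBANCES}
    return -0.5

nominal_at_7 = nominal_monitor(7, task_policy, horizon=3)
robust_at_7 = frs_monitor(7, task_policy, horizon=3)
```

From state 7 the nominal rollout accepts the task action because the disturbance-free trajectory recovers, whereas the robust monitor rejects it because some disturbance can push the first step into the wall. This is the kind of model-mismatch failure the robust variants are meant to catch.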

IV Safe Walking by Adversarial Gameplay

This section introduces a systematic way to construct a safety filter for nonlinear, high-dimensional dynamics. In this paper, we specifically consider the task of quadruped walking, but we stress that the method presented is general for different robots and tasks. We start with a careful problem formulation, defining the state space, control space, uncertainty model, and safety specifications. Then, we elucidate offline gameplay learning, where a disturbance actor is jointly trained with the reach–avoid control actor to generate the worst-case realization of uncertainty and attack the system adversarially. We close the section by systematically constructing an online gameplay filter using the control and disturbance policies synthesized from offline learning.

IV-A State and Action Spaces

The robot’s state and control input are defined by

\begin{align}
x &= \left[ \mathbf{p}, \mathbf{\dot{p}}, \bm{\theta}, \bm{\dot{\theta}}, \bm{\theta_{\text{J}}}, \bm{\dot{\theta}_{\text{J}}} \right], \tag{17a} \\
u &= \left[ \bm{\delta\theta_{\text{J}}} \right] \tag{17b}
\end{align}
with $\mathbf{{p}},\bm{{\theta}}$ the robot pose, $\bm{{\theta}_{\text{J}}}$ the angular joint positions, $\bm{\dot{(\cdot)}}$ the rates of these variables, and $\bm{{\delta{\theta}_{\text{J}}}}$ the commanded angular increments of the robot’s joints. Appendix A gives the details of the state and action spaces. Since the quadruped has three joints per leg (12 joints in total), we end up with a 36-D state space and a 12-D control space.

We model the sim-to-real gap via a 6-D adversarial disturbance: a force of magnitude 50 N pushing or pulling the robot at a chosen point of application,

\begin{equation}
d = \left[ F_{x}, F_{y}, F_{z}, p^{F}_{x}, p^{F}_{y}, p^{F}_{z} \right],
\tag{17c}
\end{equation}

where ${F}=[{{F}_{x}},{{F}_{y}},{{F}_{z}}]$ is the force vector applied at the position ${{p}^{F}_{x}},{{p}^{F}_{y}},{{p}^{F}_{z}}$ in body coordinates, with ${{p}^{F}_{x}},{{p}^{F}_{y}}\in[-0.1,0.1]~\text{m}$ and ${{p}^{F}_{z}}\in[0,0.05]~\text{m}$. We further assume the optimal force is bang-bang with $\|F\|_{2}=50~\text{N}$. The red arrows in the imagined gameplay of Figure 2 show examples of learned adversarial disturbance.
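One way to enforce these bounds in code is to map any raw 6-D vector (for instance, an unconstrained network output) onto the admissible disturbance set. The sketch below is a hypothetical helper, not a description of how the disturbance actor's output is actually bounded; the function name, the fallback direction for a zero force, and the use of NumPy are assumptions.

```python
import numpy as np

FORCE_MAGNITUDE = 50.0                          # N, bang-bang force magnitude
POS_LOW = np.array([-0.1, -0.1, 0.0])           # m, lower bounds on p^F
POS_HIGH = np.array([0.1, 0.1, 0.05])           # m, upper bounds on p^F

def to_admissible_disturbance(raw):
    """Map a raw 6-D vector to d = [F_x, F_y, F_z, p^F_x, p^F_y, p^F_z].

    The force keeps its direction but is rescaled to the fixed magnitude;
    the application point is clipped to its box in body coordinates.
    """
    raw = np.asarray(raw, dtype=float)
    force, position = raw[:3], raw[3:]
    norm = np.linalg.norm(force)
    if norm < 1e-9:
        # Direction undefined: arbitrary choice (assumption) of pushing in +x.
        direction = np.array([1.0, 0.0, 0.0])
    else:
        direction = force / norm
    return np.concatenate([FORCE_MAGNITUDE * direction,
                           np.clip(position, POS_LOW, POS_HIGH)])

d = to_admissible_disturbance([3.0, 4.0, 0.0, 0.5, -0.5, 0.2])
```

Here the raw force direction $(3,4,0)/5$ is preserved and scaled to 50 N, while the out-of-range application point is pulled back to the nearest corner of its box.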

IV-B Safety Specifications

We consider the failure set of states where the defined critical points (of the robot’s body) $\mathbf{p_{g}}$ are very close to the ground. The safety margin function is defined as

\begin{equation*}
g(x) = \min_{i} \left\{ p_{g}^{i} - \bar{p_{g}}^{i} \right\},
\end{equation*}

where $\bar{(\cdot)}$ denotes the desired magnitude.


Also, we consider the target set of states where the concerned variables $\mathbf{p_{\ell}}$ are within a small box around the target pose and velocity. The target set is designed so that the robot is known to be robustly stable with a simple stance controller. The target margin function is then defined as

\begin{equation*}
\ell(x) = \min_{i} \left\{ \bar{p_{\ell}}^{i} - |p_{\ell}^{i}| \right\}.
\end{equation*}

Appendix A gives the details of the safety specifications.
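Both margin functions reduce to an elementwise comparison followed by a minimum, as the following sketch shows. The array names and all numerical values (heights, thresholds, box half-widths) are placeholders chosen for illustration; the actual critical points and bounds are those specified in Appendix A.

```python
import numpy as np

def safety_margin(p_g, p_g_bar):
    """g(x): smallest clearance of the critical points above their
    minimum allowed values; negative iff some point is too low."""
    return float(np.min(np.asarray(p_g) - np.asarray(p_g_bar)))

def target_margin(p_l, p_l_bar):
    """l(x): smallest slack of the monitored variables inside the box
    |p_l^i| <= p_l_bar^i; nonnegative iff the state is in the target set."""
    return float(np.min(np.asarray(p_l_bar) - np.abs(np.asarray(p_l))))

# Placeholder numbers: two critical-point heights (m) and two monitored
# variables (e.g., deviations from a nominal stance).
g_val = safety_margin(p_g=[0.30, 0.25], p_g_bar=[0.10, 0.10])
l_val = target_margin(p_l=[0.02, -0.05], p_l_bar=[0.10, 0.10])

# A state with one critical point below its threshold is in the failure set.
g_fail = safety_margin(p_g=[0.30, 0.05], p_g_bar=[0.10, 0.10])
```

Taking the minimum over $i$ means a single violated component is enough to make the margin negative, which matches the sign conventions ${g}({x})<0\Leftrightarrow{x}\in{\mathcal{F}}$ and ${\ell}({x})\geq 0\Leftrightarrow{x}\in{\mathcal{T}}$ introduced earlier.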

IV-C Offline Gameplay Learning

We introduce an offline gameplay learning scheme, which builds upon ISAACS \citep{hsunguyen2023isaacs}. At each iteration, the learning algorithm collects interactions with the environment via simulated adversarial safety games, updates the neural-network-parameterized Q-function and policies, and determines which control and disturbance policies are used for the next iteration’s simulated gameplay.

Simulated Adversarial Safety Games. At every time step of the games, we store the transition $({x},{u},{d},{{x}^{\prime}},{{\ell}^{\prime}},{{g}^{\prime}})$ in the replay buffer ${\mathcal{B}}$, with ${{{x}^{\prime}}={f}({x},{u},{d})}$, ${{\ell}^{\prime}}={\ell}({{x}^{\prime}})$ and ${{g}^{\prime}}={g}({{x}^{\prime}})$. The control and disturbance inputs are selected from policies that are either trained concurrently or fixed after pre-training.

Policy and Critic Networks Update. The core of the proposed offline gameplay learning is to find approximate solutions to the Isaacs equation Eq. 7. We employ the Soft Actor-Critic (SAC) \citep{haarnoja2018sac} framework to update the critic and actor networks with the following loss functions.

We update the critic to reduce the deviation from the Isaacs target\footnote{Deep reinforcement learning typically involves training an auxiliary target critic ${{Q}_{\omega^{\prime}}}$, with parameters ${\omega^{\prime}}$ that undergo slow adjustments to align with the critic parameters ${\omega}$. This process aims to stabilize the regression by maintaining a fixed target within a relatively short timeframe.}

	$\displaystyle L({\omega})$	$\displaystyle:=\operatorname*{{\mathbb{E}}}_{({x},{u},{d},{{x}^{\prime}},{{% \ell}^{\prime}},{{g}^{\prime}})\sim{\mathcal{B}}}\left[\left({{Q}_{{\omega}}}(% {x},{u},{d})-y\right)^{2}\right]\,,$
	$\displaystyle y$	$\displaystyle=\gamma\min\left\{\ {{g}^{\prime}},\max\left\{{{\ell}^{\prime}},{% {Q}_{\omega^{\prime}}}({{x}^{\prime}},{{u}^{\prime}},{{d}^{\prime}})\right\}\right\}$
		$\displaystyle\ \ \ +(1-\gamma)\min\left\{{{\ell}^{\prime}},{{g}^{\prime}}\right\}$	(18a)
with ${{u}^{\prime}}\sim{{\pi}_{\theta}}(\cdot\mid{{x}^{\prime}})$ , ${{d}^{\prime}}\sim{{\pi}_{\psi}}(\cdot\mid{{x}^{\prime}})$ . We update control and disturbance policies following the policy gradient induced by the critic and entropy loss:

	$\displaystyle L({\theta})$	$\displaystyle:=\operatorname*{{\mathbb{E}}}_{({x},{d})\sim{\mathcal{B}}}\Big{[% }-{{Q}_{{\omega}}}({x},{\tilde{{u}}},{d})+{{\alpha}^{u}}\log{{\pi}_{\theta}}({% \tilde{{u}}}\mid{x})\Big{]},$	(18b)
	$\displaystyle L({\psi})$	$\displaystyle:=\operatorname*{{\mathbb{E}}}_{({x},{u})\sim{\mathcal{B}}}\Big{[% }{{Q}_{{\omega}}}({x},{u},{\tilde{{d}}})+{{\alpha}^{d}}\log{{\pi}_{\psi}}({% \tilde{{d}}}\mid{x})\Big{]},$	(18c)
where ${\tilde{{u}}}\sim{{\pi}_{\theta}}(\cdot\mid{x})$ , ${\tilde{{d}}}\sim{{\pi}_{\psi}}(\cdot\mid{x})$ , and ${{\alpha}^{u}},{{\alpha}^{d}}$ are hyperparameters encouraging higher entropy in the stochastic policies for more exploration, which decay gradually in magnitude through the training.
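As a minimal sketch of the critic regression in Eq. 18a, the snippet below computes the Isaacs target for a single transition and the squared error over a small batch. It uses plain Python scalars instead of neural networks; the function names and the batch layout are illustrative assumptions and are not taken from the ISAACS implementation.

```python
def isaacs_target(g_next, l_next, q_next, gamma):
    """Discounted reach-avoid target y of Eq. 18a for one transition.

    g_next: safety margin g(x')
    l_next: target margin l(x')
    q_next: target-critic value Q_{omega'}(x', u', d')
    gamma:  discount factor
    """
    future = min(g_next, max(l_next, q_next))  # outcome if the game continues
    terminal = min(l_next, g_next)             # outcome if the game stopped now
    return gamma * future + (1.0 - gamma) * terminal


def critic_loss(batch, gamma):
    """Mean squared deviation from the target over a batch.

    Each entry is (q, g_next, l_next, q_next), where q stands in for the
    critic output Q_omega(x, u, d). This layout is a hypothetical choice.
    """
    errors = [(q - isaacs_target(g, l, qn, gamma)) ** 2 for q, g, l, qn in batch]
    return sum(errors) / len(errors)
```

Note that a negative safety margin at the next state caps the target regardless of how favorable the critic estimate is, which is what makes the target pessimistic about imminent failures.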

We can directly train the critic and the control and disturbance actors from scratch through Eq. 18. Alternatively, we can use a three-level training curriculum with two additional pre-training stages (L1 and L2). In L1, we train only the reach–avoid control actor without considering adversarial disturbance inputs, which is a special case of Eq. 18 with $d=0$. Then, in L2, we fix the control policy trained in L1 and train the disturbance policy instead. Since there is only one policy to optimize in L1 and L2, we can use standard SAC directly. At the beginning of gameplay learning (L3), we then initialize the actor and critic networks with the pre-trained weights, i.e., the control actor (L1), the disturbance actor (L2), and the safety critic (L2).

Furthermore, L2 training can be viewed as finding the best adversary to attack the associated control policy, or simply the best response ${\pi_{\psi}}^{*}(\pi^{u})$. After ISAACS training, we additionally use L2 training to fine-tune $\pi_{\psi}$ against the frozen $\pi_{\theta}$, and we incorporate the resulting ${\pi_{\psi}}^{*}(\pi_{\theta})$ into our gameplay filter. We also use L2 training to perform a bespoke ultimate stress test (BUST) for safety policies and safety filters under various design choices in Table III.

Policy Selection. During L3 training, we also maintain a finite leaderboard of control and disturbance actors from past training iterations. Periodically, the leaderboard is updated by performing simulated gameplays between the current control and disturbance policies and the previous leaders. If the capacity of the leaderboard is reached, we remove the control actor checkpoints with the lowest safe rate (and the disturbance actor checkpoints with the highest safe rate). For the next iteration’s simulated adversarial safety games, we randomly select control and disturbance actors from the leaderboard to generate action inputs, which prevents the control actor updates from overfitting to a single disturbance actor \citep{vinitsky2020robust}.
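The leaderboard bookkeeping described above can be sketched as follows. This is an illustration under assumptions: checkpoints are represented as hypothetical `(name, safe_rate)` pairs, the safe rate is taken to be the fraction of simulated games won by the controller, and the uniform sampling rule is one possible reading of "randomly select".

```python
import random


def prune_leaderboard(ctrl, dstb, capacity):
    """Trim both leaderboards to at most `capacity` entries.

    ctrl, dstb: lists of (checkpoint_name, safe_rate) pairs.
    Control checkpoints with the lowest safe rate are dropped; disturbance
    checkpoints with the highest safe rate (the weakest attackers) are dropped.
    """
    ctrl_kept = sorted(ctrl, key=lambda e: e[1], reverse=True)[:capacity]
    dstb_kept = sorted(dstb, key=lambda e: e[1])[:capacity]
    return ctrl_kept, dstb_kept


def sample_matchup(ctrl, dstb, rng):
    """Pick one control and one disturbance checkpoint uniformly at random."""
    return rng.choice(ctrl)[0], rng.choice(dstb)[0]
```

Sampling opponents from the whole leaderboard, rather than always using the latest checkpoint, is what exposes the control actor to a variety of attack styles.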

IV-D Online Gameplay Safety Filter

This section demonstrates how the reach–avoid control actor $\pi_{\theta}$ and disturbance actor $\pi_{\psi}$ synthesized offline through game-theoretic reinforcement learning can be systematically used at runtime to construct highly effective safety filters for general nonlinear, high-dimensional dynamic systems. We employ a predictive (rollout-based) safety monitor because a value-based safety monitor requires tuning a value threshold, which is difficult to do before deployment. We also use a simple switching intervention scheme in the form of Eq. 10, although optimization-based schemes such as CBF–QP are also possible.

State-of-the-art predictive safety monitors face scalability and robustness challenges. For example, the nominal rollout in Section III-B can result in an overly optimistic filter, while the FRS-based robust rollout in Section III-B can be computationally intensive for high-dimensional dynamics. To tackle these challenges, we propose a novel adversarial gameplay rollout between the fallback and disturbance policies obtained from offline gameplay learning.

Since the ISAACS control actor $\pi_{\theta}$ only aims to reach the target set safely, the fallback policy $\pi^{\text{\tiny{\faShield*}}}$ in Eq. 14 switches to $\pi^{\mathcal{T}}$ to keep the state inside the target set once the system reaches it.

The adversarial gameplay begins by applying a control from the task policy $\pi^{\text{task}}$ and follows the fallback policy $\pi^{\text{\tiny{\faShield*}}}$ afterward, with the whole rollout under attack from the ISAACS disturbance policy $\pi_{\psi}$. This gameplay monitor returns success if the state trajectory safely reaches the target set:

\begin{align}
\Delta^{\text{\tiny{\faShield*}},\text{game}}_{H,\mathcal{RA}}(x,\pi^{\text{task}}) := {}& \mathbbm{1}\Big\{\exists\tau\in\{1,\dots,H\},\ \hat{x}_{\tau}\in\mathcal{T}\ \land \notag\\
& \forall s\in\{1,\dots,\tau\},\ \hat{x}_{s}\notin\mathcal{F}\Big\}-\frac{1}{2}, \tag{19a}
\end{align}
with $\hat{x}_{0}=x$, $\hat{x}_{\tau+1}=f(\hat{x}_{\tau},\hat{u}_{\tau},\pi_{\psi}(\hat{x}_{\tau}))$, $\tau\geq 0$, and
\begin{equation}
\hat{u}_{\tau}=\begin{cases}
\pi^{\text{task}}(\hat{x}_{\tau}), & \tau=0,\\
\pi^{\text{\tiny{\faShield*}}}(\hat{x}_{\tau}), & \tau\in\{1,\dots,H-1\}.
\end{cases} \tag{19d}
\end{equation}

Finally, the (real-time) gameplay filter $\phi=(\pi^{\text{\tiny{\faShield*}}},\Delta^{\text{\tiny{\faShield*}}})$ is constructed from the fallback policy $\pi^{\text{\tiny{\faShield*}}}$ in Eq. 14, the gameplay monitor $\Delta^{\text{\tiny{\faShield*}},\text{game}}_{H,\mathcal{RA}}$ in Eq. 19, and the switch-type intervention scheme in Eq. 10. Algorithm 1 illustrates the proposed gameplay filter, and Appendix A summarizes the terminology (and the symbols) of the modules in safety filters.

However, the computation of the gameplay rollout may take multiple time steps. To resolve this latency issue, we verify the task policy over a longer execution horizon of $L$ steps instead of the single step used in Algorithm 1. The longer foresight also smooths out undesired oscillations close to the boundary of the reach–avoid set. The gameplay-based safety monitor with latency is formulated as

\begin{align}
\Delta^{\text{\tiny{\faShield*}},\text{game}}_{H,L,\mathcal{RA}}(x,\pi^{\text{task}}) := {}& \mathbbm{1}\Big\{\exists\tau\in\{2L,\dots,H\},\ \hat{x}_{\tau}\in\mathcal{T}\ \land \notag\\
& \forall s\in\{L,\dots,\tau\},\ \hat{x}_{s}\notin\mathcal{F}\Big\}-\frac{1}{2}, \tag{20a}
\end{align}
with $\hat{x}_{0}=x$, $\hat{x}_{\tau+1}=f(\hat{x}_{\tau},\hat{u}_{\tau},\pi_{\psi}(\hat{x}_{\tau}))$, $\tau\geq 0$, and
\begin{equation}
\hat{u}_{\tau}=\begin{cases}
\phi^{\text{prev}}(\hat{x}_{\tau}), & \tau\in\{0,\dots,L-1\},\\
\pi^{\text{task}}(\hat{x}_{\tau}), & \tau\in\{L,\dots,2L-1\},\\
\pi^{\text{\tiny{\faShield*}}}(\hat{x}_{\tau}), & \tau\in\{2L,\dots,H-1\},
\end{cases} \tag{20e}
\end{equation}
where $\phi^{\text{prev}}$ is the verified gameplay filter, which is executed, and thus immutable, while waiting for the simulated adversarial gameplay. In other words, if the (trained) disturbance policy captures the worst-case realization of the uncertainty and sim-to-real gap, executing $\phi^{\text{prev}}$ is guaranteed to be safe. However, we may not have the optimal disturbance policy due to the neural network parameterization. Therefore, we still switch to the fallback $\pi^{\text{\tiny{\faShield*}}}$ if there is a safety failure during the first $L$ steps of the imagined gameplay.

Algorithm 2 summarizes the $L$-step gameplay safety filter.

Figure 2 illustrates the operation of the gameplay safety filter with the $L$-step gameplay safety monitor. For example, at monitor cycle $k=L$, since the safety monitor check succeeds at $k=0$, $\phi^{\text{prev}}=\pi^{\text{task}}$. On the other hand, at monitor cycle $k=2L$, since the safety monitor check fails at $k=L$, $\phi^{\text{prev}}=\pi^{\text{\tiny{\faShield*}}}$.

Algorithm 1 Real-Time Gameplay Safety Filter

Require: $x_{k}$, $\pi^{\text{task}}$, $\pi^{\text{\tiny{\faShield*}}}$, $\pi_{\psi}$, $f$, $L$, $H$
$\hat{x}_{0}\leftarrow x_{k}$
for $\tau=0$ to $H$ do
    if $\tau=0$ then
        $\pi^{u}\leftarrow\pi^{\text{task}}$    $\triangleright$ Evaluate the task policy
    else
        $\pi^{u}\leftarrow\pi^{\text{\tiny{\faShield*}}}$    $\triangleright$ Followed by the fallback
    $\hat{u}_{\tau}\leftarrow\pi^{u}(\hat{x}_{\tau})$
    $\hat{d}_{\tau}\leftarrow\pi_{\psi}(\hat{x}_{\tau})$
    $\hat{x}_{\tau+1}\leftarrow f(\hat{x}_{\tau},\hat{u}_{\tau},\hat{d}_{\tau})$
    if $g(\hat{x}_{\tau+1})<0$ then    $\triangleright$ Gameplay violates safety
        $\text{res}\leftarrow 0$
        return $\text{res}$, $\pi^{\text{\tiny{\faShield*}}}$
    if $\ell(\hat{x}_{\tau+1})>0$ then    $\triangleright$ Gameplay succeeds
        $\text{res}\leftarrow 1$
        return $\text{res}$, $\pi^{\text{task}}$
$\text{res}\leftarrow 0$    $\triangleright$ Gameplay does not reach
return $\text{res}$, $\pi^{\text{\tiny{\faShield*}}}$
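For readers who prefer executable code, the following is a minimal Python sketch of the rollout in Algorithm 1. It is not the authors' implementation: policies, dynamics, and margin functions are passed in as hypothetical callables, and the loop runs over $\tau=0,\dots,H-1$ to match the index range of the controls in Eq. 19, which is an assumption on our part.

```python
def gameplay_monitor(x, task_policy, fallback_policy, dstb_policy,
                     dynamics, safety_margin, target_margin, horizon):
    """Imagined adversarial rollout in the spirit of Algorithm 1.

    Returns (res, policy): res is 1 and the task policy is cleared if the
    rollout reaches the target set without failing; otherwise res is 0 and
    the fallback policy is returned.
    """
    x_hat = x
    for tau in range(horizon):
        # The first imagined step uses the task policy, later steps the fallback.
        policy = task_policy if tau == 0 else fallback_policy
        u_hat = policy(x_hat)
        d_hat = dstb_policy(x_hat)          # learned adversary picks the disturbance
        x_hat = dynamics(x_hat, u_hat, d_hat)
        if safety_margin(x_hat) < 0:        # imagined state enters the failure set
            return 0, fallback_policy
        if target_margin(x_hat) > 0:        # imagined state reaches the target set
            return 1, task_policy
    return 0, fallback_policy               # target not reached within the horizon
```

Because the rollout stops as soon as the target set is reached, a long horizon only costs extra computation in the cases where the fallback struggles to recover.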

Algorithm 2 $L$-Step Gameplay Safety Filter

Require: $x_{k}$, $\pi^{\text{task}}$, $\phi^{\text{prev}}$, $\pi^{\text{\tiny{\faShield*}}}$, $\pi_{\psi}$, $f$, $L$, $H$
$\hat{x}_{0}\leftarrow x_{k}$
for $\tau=0$ to $H$ do
    if $\tau<L$ then
        $\pi^{u}\leftarrow\phi^{\text{prev}}$    $\triangleright$ Apply the verified filter
    else if $\tau<2L$ then
        $\pi^{u}\leftarrow\pi^{\text{task}}$    $\triangleright$ Evaluate the task policy
    else
        $\pi^{u}\leftarrow\pi^{\text{\tiny{\faShield*}}}$    $\triangleright$ Followed by the fallback
    $\hat{u}_{\tau}\leftarrow\pi^{u}(\hat{x}_{\tau})$
    $\hat{d}_{\tau}\leftarrow\pi_{\psi}(\hat{x}_{\tau})$
    $\hat{x}_{\tau+1}\leftarrow f(\hat{x}_{\tau},\hat{u}_{\tau},\hat{d}_{\tau})$
    if $g(\hat{x}_{\tau+1})<0$ then    $\triangleright$ Gameplay violates safety
        $\text{res}\leftarrow 0$
        return $\text{res}$, $\pi^{\text{\tiny{\faShield*}}}$
    if $\ell(\hat{x}_{\tau+1})>0$ and $\tau\geq 2L$ then    $\triangleright$ Gameplay succeeds
        $\text{res}\leftarrow 1$
        return $\text{res}$, $\pi^{\text{task}}$
$\text{res}\leftarrow 0$    $\triangleright$ Gameplay does not reach
return $\text{res}$, $\pi^{\text{\tiny{\faShield*}}}$
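The $L$-step variant in Algorithm 2 differs only in which policy drives each phase of the imagined rollout and in when success may be declared. A hedged Python sketch follows; the argument names (for example `prev_filter` for $\phi^{\text{prev}}$ and `latency` for $L$) are our own, and the loop again runs over $\tau=0,\dots,H-1$ as an assumption consistent with Eq. 20.

```python
def l_step_gameplay_monitor(x, prev_filter, task_policy, fallback_policy,
                            dstb_policy, dynamics, safety_margin,
                            target_margin, horizon, latency):
    """Imagined adversarial rollout in the spirit of Algorithm 2.

    Phase 1 (tau < L):       the previously verified filter keeps running.
    Phase 2 (L <= tau < 2L): the candidate task policy is evaluated.
    Phase 3 (tau >= 2L):     the fallback must bring the state to the target set.
    """
    x_hat = x
    for tau in range(horizon):
        if tau < latency:
            policy = prev_filter
        elif tau < 2 * latency:
            policy = task_policy
        else:
            policy = fallback_policy
        x_hat = dynamics(x_hat, policy(x_hat), dstb_policy(x_hat))
        if safety_margin(x_hat) < 0:          # failure at any imagined step
            return 0, fallback_policy
        if target_margin(x_hat) > 0 and tau >= 2 * latency:
            return 1, task_policy             # reached the target after the task phase
    return 0, fallback_policy
```

In a control loop, the policy returned by one monitor cycle would be passed as `prev_filter` to the next cycle, mirroring the role of $\phi^{\text{prev}}$.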

V Experiments

Through extensive simulation studies and hardware experiments, we aim to answer the following questions: Can our offline game-theoretic learning and gameplay safety filter

(1) achieve robust safety for general nonlinear, high-dimensional systems without obstructing task execution?
(2) enable the robot to operate safely, in a “zero-shot” manner, under deployment conditions that differ from the training conditions?
(3) outperform reward-based learning, non-game-theoretic learning, and value-based (critic) safety filters?

Additionally, we analyze the relative importance of our design choices, including (a) gameplay filter with reach–avoid criteria versus avoid-only, (b) three-level training curriculum versus L3 directly, and (c) symmetric exploitation in offline learning versus without.

V-A Experiment Setup

Robot and sensors. We use the Spirit 40 from Ghost Robotics as the robot platform, as shown in Figure 1, and the PyBullet physics engine \citep{coumans2021pybullet} to construct the simulated environment. We use the robot’s internal motor encoders to obtain the absolute joint positions $\theta^{i}_{\text{J}}$ and velocities $\omega^{i}_{\text{J}}$, and the built-in IMU to obtain roll $\theta_{x}$ and pitch $\theta_{y}$, body axial rotational rates $\omega_{x},\omega_{y},\omega_{z}$, and velocities $v_{x},v_{y},v_{z}$. No force sensor or contact sensing capability is enabled, meaning that ground contact can only be inferred implicitly.

Gameplay filter. To implement a gameplay safety filter on a physical robot, we create a gameplay rollout server, which is a ROS service that takes in the current physical robot state and the proposed control action. The server then runs the gameplay rollout for a fixed horizon and returns the filtered safe action. Using the reach–avoid criterion as the terminal condition of the gameplay rollout, we observe that the elapsed time (from request to response) stays nearly flat as the rollout horizon increases (from 10 to 300 steps), yielding an average cycle rate of $35~\text{Hz}$.

Task and perturbations. We construct two different terrains for physical experiments: flat terrain with tugging forces and unmodeled irregular terrain. The robot’s task is to traverse the terrain safely from the same initial state to reach the goal on the other side.

To emulate adversarial tugging forces on the robot, we mount a rope to the robot on one end and a motion-tracked dynamometer on the other end to monitor the force magnitude and direction. The dynamometer has a rated capacity of $500~\text{N}$ and a resolution of $0.1~\text{N}$, and is tethered to a computer via RS232C. The sampling rate is $1000~\text{Hz}$ to record both constant pulling and force pulses. As the rope is attached to the body of the robot, the range of $p_{z}$ is different from the simulated environment, with the arm length from the robot’s center of mass to the mounting point being $0.05~\text{m}$, resulting in a net moment force of $50~\text{N}$ comparable to $5~\text{N}$ applied at $p_{z}$.

The irregular terrain is a $2~\text{m}\times 4~\text{m}$ area with a 15-degree incline along one edge and two memory-foam mounds in the middle, with length $\times$ width $\times$ height of $1.2~\text{m}\times 0.7~\text{m}\times 0.05~\text{m}$ and $1.2~\text{m}\times 0.8~\text{m}\times 0.15~\text{m}$, positioned $1.8~\text{m}$ away from each other.

Baselines. To evaluate the effectiveness of the margin-based feedback signal and uncertainty-aware offline learning, we consider four prior reinforcement learning algorithms: (1) standard SAC \citep{haarnoja2018sac} with the reward defined as

\begin{equation*}
r(x)=\begin{cases}
1, & x\in\mathcal{T},\\
-1, & x\in\mathcal{F},\\
0, & \text{otherwise,}
\end{cases}
\end{equation*}

(2) non-game-theoretic reach–avoid reinforcement learning (RARL) \citep{hsu2021safety}, (3) RARL with domain randomization (DR), and (4) adversarial SAC with the reward feedback signal. For the critic filter, we conduct a parameter sweep to find the best value threshold in simulation and use the same threshold directly in physical experiments.
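The sparse reward used by the SAC baseline can be written in a few lines. This is only a sketch: set membership is passed in as two hypothetical booleans, and the target case is checked first simply because it is listed first in the definition.

```python
def sparse_reward(in_target, in_failure):
    """Reward of the SAC baseline: +1 in the target set, -1 in the failure set."""
    if in_target:
        return 1.0
    if in_failure:
        return -1.0
    return 0.0
```

Unlike the margin functions $\ell$ and $g$, this signal carries no information about how close the state is to either set, which is the contrast the baseline is meant to expose.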

Policy. We handcraft a task policy using an inverse kinematics gait planner for forward and sideways walking. We parameterize all policies by neural networks with 3 fully connected layers of 256 neurons each, while critics have 3 layers of 128 neurons each. The gameplay filter uses horizon $H=300$ and $L=1$. We use a low-level PD position controller that outputs torques $\tau^{i}=K_{p}\,\delta\theta^{i}_{\text{J}}-K_{d}\,\omega^{i}_{\text{J}}$ to the robot motor controller, with $K_{p},K_{d}$ the proportional and derivative gains.
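The low-level PD law above amounts to one multiply-subtract per joint. A minimal sketch, assuming list inputs with one commanded increment and one measured velocity per joint; the gain values in the test are placeholders and not the gains used on the robot.

```python
def pd_torques(delta_theta, joint_velocity, kp, kd):
    """Per-joint PD torque: tau_i = kp * delta_theta_i - kd * omega_i."""
    if len(delta_theta) != len(joint_velocity):
        raise ValueError("expected one increment and one velocity per joint")
    return [kp * dq - kd * w for dq, w in zip(delta_theta, joint_velocity)]
```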

TABLE I: We evaluate physical robots walking on flat terrain with tugging force and unmodeled irregular terrain. $F_{\text{avg}}$ and $F_{\text{max}}$ are the average and maximum force during the test and $T_{\text{goal}}$ is the time to reach the goal. Our gameplay filter has the highest safe rate in both tugging force and irregular terrain tests without overly intervening in task-oriented actions.

Policy	Tugging Force							Bumpy Terrain		
		Successful Runs				Failed Runs			Successful Runs	
	Safe Rate	Filter Freq.	$T_{\text{goal}}$	$F^{\text{peak}}_{\text{avg}}$	$F^{\text{peak}}_{\text{max}}$	$F^{\text{peak}}_{\text{avg}}$	$F^{\text{peak}}_{\text{min}}$	Safe Rate	Filter Freq.	$T_{\text{goal}}$
$\phi^{\text{game}}$	7/10	0.17	26.3	67.5 N	70.5 N	59.8 N	52.7 N	10/10	0.19	41.2
$\phi^{\text{critic}}$	4/10	0.10	26.8	73.7 N	80.9 N	53.6 N	40.0 N	5/10	0.22	33.5
$\pi^{\text{task}}$	0/10	n/a	n/a	n/a	n/a	56.5 N	41.4 N	5/10	n/a	16.4

TABLE II: Maximum force magnitude withstood by the physical robot with various safety policies$^{\dagger}$, task policy $\pi^{\text{task}}$, and fixed-pose policy $\pi^{\mathcal{T}}$ in different tugging directions. Our employed fallback $\pi^{\text{\tiny{\faShield*}}}$ outperforms the task policy and other safety fallback baselines and has comparable robustness to the policy used in the target set.

Algorithm	Maximum Force
	Left		Right
	Low	High	Low	High
$\pi^{\text{\tiny{\faShield*}}}$	87.1 N$^{*}$	61.1 N$^{*}$	99.3 N$^{*}$	59.1 N$^{*}$
$\pi_{\theta}$	100.5 N$^{*}$	150.3 N$^{*}$	121.6 N$^{*}$	121.9 N$^{*}$
RARL + DR	46.4 N	43 N	57.2 N	72.1 N$^{*}$
$\pi^{\text{task}}$	83.2 N	96.9 N	82.8 N$^{*}$	59 N
$\pi^{\mathcal{T}}$	151.9 N$^{*}$	173.7 N$^{*}$	140.3 N$^{*}$	142.6 N$^{*}$

$\dagger$ Safety policies from reward-based reinforcement learning and ISAACS with the avoid-only objective fail immediately, before any force is applied.
$*$ The policy can withstand this magnitude of force. Since the policy can make the quadruped move toward the tugging direction, we cannot add more force in 10 pull attempts.

V-B Physical Results

Safe walking on different terrains. We answer Questions (1) and (3) by evaluating physical robots walking on flat terrain with tugging forces and on bumpy terrain. We compare our proposed gameplay safety filter with the task policy and the critic safety filter. We record the number of runs in which the quadruped safely reaches the goal. For the successful runs, we also report the frequency of filter intervention and the time to reach the goal. For the flat-terrain experiment with tugging forces, we additionally report the maximum and average (adversarial) force during the walk. Table I shows the results of the experiment.\footnote{One test of the critic safety filter on bumpy terrain fails to reach the goal but remains safe. We do not include this run’s filter frequency and elapsed time in the average.} Our proposed gameplay safety filter has the highest safe rate on both flat terrain with tugging forces and unmodeled irregular terrain. Even in the failed trials, the gameplay filter withstands a higher tugging force before it violates the safety constraints. Further, the gameplay filter does not unduly intervene in the task-oriented actions, as its filter frequency is similar to that of the critic filter.

Figure 1 shows the quadruped walking by applying the proposed gameplay filter versus the performance-oriented task policy. When there are imminent safety failures after executing candidate performance-oriented controls, e.g., with airborne legs or loss of balance when climbing, the gameplay filter intervenes and stretches the legs of the quadruped to fight against persistent forces and bumpy terrain.

Maximum withstandable force. To answer Question (3), we test the maximum tugging force that can be withstood by the safety policies trained by ISAACS, SAC with reward, reach–avoid reinforcement learning with domain randomization (RARL+DR), and adversarial SAC with reward. We pull the quadruped from different directions, where the tugging angle for “low” always lies within $[-0.1,\,0.4]~\text{rad}$ and the tugging angle for “high” always lies within $[0.5,\,1.0]~\text{rad}$.

Table II shows that the employed $\pi_{\theta}$ can withstand at least $100~\text{N}$ in every tested direction (and up to about $150~\text{N}$), but the non-game-theoretic counterpart (RARL+DR) is vulnerable to tugging from the left and can only withstand $43~\text{N}$. This observation suggests that DR struggles to capture the worst-case realization of disturbances accurately. This limitation arises from the inherent nature of DR, where the control actor is optimized for average disturbance behavior. As the dimension of the disturbance input increases, the likelihood of the random policy simulating the worst-case disturbance decreases exponentially. This underscores the importance of employing adversarial game-theoretic learning techniques over DR approaches.
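To make the dimensionality argument concrete, the following back-of-the-envelope calculation is a sketch under assumptions that are ours rather than part of the experiments: the disturbance set is an $n$-dimensional box, the damaging disturbances occupy a fraction $\epsilon$ of the range along each coordinate, and DR samples each coordinate independently and uniformly.

```latex
\begin{align*}
\Pr[\text{one sample is damaging}] &= \epsilon^{n}, \\
\Pr[\text{at least one of } N \text{ samples is damaging}] &= 1-(1-\epsilon^{n})^{N} \;\le\; N\epsilon^{n}.
\end{align*}
```

For instance, with the illustrative values $\epsilon=0.1$ and $n=6$, a single sample is damaging with probability $10^{-6}$, so roughly $10^{6}$ samples are needed before a single damaging one is expected. A learned adversary sidesteps this by searching for the damaging region directly.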

Further, we notice that the reward-based reinforcement learning baselines and ISAACS with the avoid-only objective fail almost immediately, even before any force is applied, since they overreact and thus flip the robot. We find that reach–avoid policies generalize better since they can bring the robot to a stable stance. We also include tests for the task policy $\pi^{\text{task}}$ and the fixed-pose policy $\pi^{\mathcal{T}}$ (used when the state is in the target set). We observe that the ISAACS control actor is strictly better than $\pi^{\text{task}}$ and is comparable to $\pi^{\mathcal{T}}$.

V-C Simulated Results

Bespoke ultimate stress test (BUST). We further answer Questions (1) and (3) by running more exhaustive case studies comparing the following policies: task policy $\pi^{\text{task}}$, ISAACS control actor $\pi_{\theta}$, critic safety filter $\phi^{\text{critic}}$, and the proposed gameplay safety filter $\phi^{\text{game}}$. To test their robustness when taken to the limit, we learn, for each of the above control schemes, a specialized adversarial disturbance policy explicitly trained (via L2) to exploit its safety vulnerabilities. We also compare these policies against random perturbations sampled uniformly from the disturbance set ($\pi^{\text{rnd}}$) or from its extreme points ($\pi^{\text{rnd,+}}$; e.g., $F_{x}=50$, $F_{y}=F_{z}=0$).

In L2 training, the disturbance actor must face a time-independent optimal control problem, where the control policy (including the appropriate safety filter) is queried during environment simulation. We note that while the internally simulated gameplay rollout considers a time-varying policy, the executed safety-filtered policy $\phi^{\text{game}}$ remains time-independent. Specifically, $\phi^{\text{game}}$ selects either the task control or the safety fallback control based on the outcome of the gameplay rollout, and the rollout depends solely on the initial state, not on when this state is visited. Therefore, $\phi^{\text{game}}$ can be considered part of the time-invariant environment, meeting the requirements of L2 training.

Table III shows the results of the BUST. We first note that $\pi^{\text{task}}$ is vulnerable to all ${\pi_{\psi}}^{*}$, while the proposed gameplay filter can only be exploited by its associated ${\pi_{\psi}}^{*}(\phi^{\text{game}})$. Further, because $\phi^{\text{game}}$ is very robust, its specialized adversary ${\pi_{\psi}}^{*}(\phi^{\text{game}})$ also attacks the other policies effectively: the third column has the lowest safe rates of all columns.

The last two columns show the safe rate under random disturbances. Except for $\pi^{\text{task}}$, the reach–avoid control actor and both safety filters maintain high safe rates. This observation suggests that our L2 training method establishes a more demanding safety benchmark for policies than DR, even when we improve the sampling from uniform within the set ($\pi^{\text{rnd}}$) to extreme cases ($\pi^{\text{rnd,+}}$).

TABLE III: We perform a bespoke ultimate stress test in simulated environments by learning a specialized adversarial disturbance policy explicitly trained (via L2) to exploit any existing vulnerabilities in each of these robot controllers $\pi_{\theta},\,\pi^{\text{task}},\,\phi^{\text{game}},\,\phi^{\text{critic}}$. Additionally, we consider random disturbance sampled uniformly in all directions $\pi^{\text{rnd}}$ or at extreme directions $\pi^{\text{rnd,+}}$. The proposed gameplay filter $\phi^{\text{game}}$ has a higher safe rate than the task policy $\pi^{\text{task}}$, critic filter $\phi^{\text{critic}}$, and the learned reach–avoid actor $\pi_{\theta}$.

	${\pi_{\psi}}^{*}\left(\pi_{\theta}\right)$	${\pi_{\psi}}^{*}\left(\pi^{\text{task}}\right)$	${\pi_{\psi}}^{*}\left(\phi^{\text{game}}\right)$	${\pi_{\psi}}^{*}\left(\phi^{\text{critic}}\right)$	$\pi^{\text{rnd}}$	$\pi^{\text{rnd,+}}$
$\pi_{\theta}$	0.37	0.38	0.17	0.44	0.88	0.85
$\pi^{\text{task}}$	0.0	0.0	0.0	0.0	0.03	0.03
$\phi^{\text{game}}$	0.42	0.35	0.03	0.45	0.84	0.89
$\phi^{\text{critic}}$	0.37	0.34	0.10	0.44	0.86	0.86

Sensitivity analysis: reach–avoid criteria vs. avoid-only. We evaluate the significance of using reach–avoid criteria in the gameplay filter by performing a sensitivity analysis of the horizon in the imagined gameplay. Figure 3 shows that the gameplay filter with reach–avoid criteria maintains a $100\%$ safe rate even when the gameplay horizon is short ($H=10$). However, the gameplay filter with avoid-only criteria, which simplifies Eq. 20 to

\begin{equation}
\Delta^{\text{\tiny{\faShield*}},\text{game}}_{H,L,\mathcal{A}}(x,\pi^{\text{task}}):=\mathbbm{1}\big\{\forall\tau\in\{L,\dots,H\},\ \hat{x}_{\tau}\notin\mathcal{F}\big\}-\frac{1}{2}, \tag{21}
\end{equation}

has more safety violations than the task policy when $H=10$. The difference arises because a shorter imagined gameplay results in more frequent filter intervention under reach–avoid criteria but in overly optimistic monitoring under avoid-only criteria (which ignores failures beyond the horizon). Further, as the gameplay horizon increases, the filter frequency under reach–avoid criteria goes down, i.e., if $H\geq H'$,

\begin{equation}
\Delta^{\text{\tiny{\faShield*}},\text{game}}_{H,L,\mathcal{RA}}<0\ \to\ \Delta^{\text{\tiny{\faShield*}},\text{game}}_{H',L,\mathcal{RA}}<0. \tag{22}
\end{equation}

This observation indicates that reach–avoid criteria are preferred in physical deployment, as it is difficult to know a sufficient horizon a priori.
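The contrast between the two criteria can be reproduced on precomputed margin sequences. The sketch below is an illustration, not the paper's evaluation code: it assumes the imagined rollout has already been simulated, that entry `k` of each list holds the margin at the imagined state $\hat{x}_{k}$, and that a state is in the target set when $\ell>0$ and in the failure set when $g<0$, as in Algorithm 1.

```python
def reach_avoid_ok(safety_margins, target_margins, horizon, latency=0):
    """Reach-avoid check patterned on Eq. 20.

    True if some step tau in [2*latency, horizon] lies in the target set and
    no step s in [latency, tau] lies in the failure set.
    """
    for tau in range(2 * latency, horizon + 1):
        if any(safety_margins[s] < 0 for s in range(latency, tau + 1)):
            return False  # a failure precedes any later reach
        if target_margins[tau] > 0:
            return True
    return False


def avoid_only_ok(safety_margins, horizon, latency=0):
    """Avoid-only check patterned on Eq. 21: no failure in [latency, horizon]."""
    return all(safety_margins[t] >= 0 for t in range(latency, horizon + 1))
```

On a trajectory that fails at step 12 and never reaches the target, a horizon of 10 makes the avoid-only check pass (the failure lies beyond the horizon) while the reach–avoid check still rejects the task action, which is the behavior described above.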

Sensitivity analysis: three-level training curriculum. We examine the need for a three-level curriculum using gameplay results against a specialized adversary for the reach–avoid control actor trained with the curriculum, i.e., ${\pi_{\psi}}^{*}(\pi_{\theta})$. Figure 4 shows the safe rate of gameplay between the model checkpoints stored during training and ${\pi_{\psi}}^{*}(\pi_{\theta})$. We observe that the two pre-training stages in the curriculum do not significantly improve training performance: learning directly in the L3 stage requires a similar number of gameplay learning steps to reach a decent safety performance.

VI Conclusion

This work presents a game-theoretic learning approach to synthesize safety filters for high-order, nonlinear dynamics. The proposed gameplay safety filter monitors the risk to system safety through imagined games between its best-effort safety fallback policy and a learned virtual adversary that aims to realize the worst-case uncertainty in the system. We validate our approach on a physical quadruped robot, which maintains zero-shot safety under strong tugging forces and unmodeled irregular terrain. An exhaustive simulation study is performed to compare with state-of-the-art safety fallback policies and safety filters, and to understand the relative importance of design choices.

Appendix A Implementation Details

The state and action spaces are defined as

\begin{align*}
x &= \left[p_{x},p_{y},p_{z},v_{x},v_{y},v_{z},\theta_{x},\theta_{y},\theta_{z},\omega_{x},\omega_{y},\omega_{z},\{\theta^{i}_{\text{J}}\},\{\omega^{i}_{\text{J}}\}\right],\\
u &= \left[\{\delta\theta^{i}_{\text{J}}\}\right],
\end{align*}

with $p_{x},p_{y},p_{z}$ the position of the body center, $v_{x},v_{y},v_{z}$ the velocity of the robot in the body frame, $\theta_{x},\theta_{y},\theta_{z}$ the roll, pitch, and yaw of the robot, $\omega_{x},\omega_{y},\omega_{z}$ the body axial rotational rates, and $\theta^{i}_{\text{J}},\omega^{i}_{\text{J}},\delta\theta^{i}_{\text{J}}$ the angle, angular velocity, and commanded angular increment of the robot’s $i^{\text{th}}$ joint.\footnote{In this work, we specifically consider walking locomotion, so the policies ignore $p_{x},p_{y},p_{z},\theta_{z}$.}

We define the critical points $\mathbb{p_{c}}$ as the body corners and knees of the robot. The safety margin is defined as

\begin{equation*}
g(x)=\min\left\{\min_{i}\{z_{\text{corner}}^{i}\}-\bar{z}_{\text{corner},g},\,\min_{i}\{z_{\text{knee}}^{i}\}-\bar{z}_{\text{knee}}\right\},
\end{equation*}

with $z_{\text{corner}}^{i}$ the distance to the ground of the $i^{\text{th}}$ body corner and $z_{\text{knee}}^{i}$ the distance to the ground of the $i^{\text{th}}$ knee.

The target margin function is defined as

\begin{align*}
\ell(x)=\min\Big\{\, & \bar{\omega}-\|\omega_{x}\|,\,\bar{\omega}-\|\omega_{y}\|,\,\bar{\omega}-\|\omega_{z}\|,\\
& \bar{v}-\|v_{x}\|,\,\bar{v}-\|v_{y}\|,\,\bar{v}-\|v_{z}\|,\\
& \bar{z}_{\text{corner},\ell}-\max_{i}\{z_{\text{corner}}^{i}\},\,\bar{z}_{\text{toe}}-\max_{i}\{z_{\text{toe}}^{i}\}\Big\},
\end{align*}

with $z_{\text{toe}}^{i}$ the distance to the ground of the $i^{\text{th}}$ toe. $\bar{(\cdot)}$ denotes the desired magnitude. Table IV shows the thresholds used to define the safety and target margin functions for quadruped walking.

TABLE IV: Implementation details of the safety specifications of quadruped walking.


Notation	Magnitude
$\bar{z}_{\text{corner},g}$	0.1 m
$\bar{z}_{\text{knee}}$	0.05 m
$\bar{z}_{\text{corner},\ell}$	0.4 m
$\bar{z}_{\text{toe}}$	0.05 m
$\bar{\omega}$	10 deg/s
$\bar{v}$	0.2 m/s
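With the thresholds of Table IV, the two margin functions can be evaluated as in the sketch below. The dictionary keys, argument names, and list-based inputs are hypothetical; heights are assumed to be in meters, angular rates in deg/s, and velocities in m/s, matching the units in the table.

```python
# Thresholds from Table IV (key names are illustrative).
TABLE_IV = {
    "z_corner_g": 0.1,   # m, minimum body-corner height for safety
    "z_knee": 0.05,      # m, minimum knee height for safety
    "z_corner_l": 0.4,   # m, maximum body-corner height for the target set
    "z_toe": 0.05,       # m, maximum toe height for the target set
    "omega": 10.0,       # deg/s, maximum body rotational rate for the target set
    "v": 0.2,            # m/s, maximum body velocity for the target set
}


def safety_margin(z_corner, z_knee, thr=TABLE_IV):
    """g(x): negative when any body corner or knee is too close to the ground."""
    return min(min(z_corner) - thr["z_corner_g"], min(z_knee) - thr["z_knee"])


def target_margin(omega, vel, z_corner, z_toe, thr=TABLE_IV):
    """l(x): positive only if the robot is nearly still, low, and has its toes down."""
    terms = [thr["omega"] - abs(w) for w in omega]
    terms += [thr["v"] - abs(v) for v in vel]
    terms.append(thr["z_corner_l"] - max(z_corner))
    terms.append(thr["z_toe"] - max(z_toe))
    return min(terms)
```

As in the definitions above, the minimum mixes terms with different units, so the magnitude of a margin is only meaningful relative to the term that attains it.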

Table V summarizes the terminology used in safety filter design, which also highlights the modularity of the proposed gameplay filter.

TABLE V: Terminology and symbols used in safety filter modules.