1 Carnegie Mellon University, Pittsburgh, PA, USA; 2 The Pennsylvania State University, State College, PA, USA; 3 Toyota InfoTech Labs, Mountain View, CA, USA

Tolerance of Reinforcement Learning Controllers against Deviations in Cyber Physical Systems

Changjian Zhang 1*, Parv Kapoor 1*, Eunsuk Kang 1, Rômulo Meira-Góes 2, David Garlan 1, Akila Ganlath 3, Shatadal Mishra 3, Nejib Ammar 3

* Both authors contributed equally to this research.
Abstract

Cyber-physical systems (CPS) with reinforcement learning (RL)-based controllers are increasingly being deployed in complex physical environments such as autonomous vehicles, the Internet-of-Things (IoT), and smart cities. An important property of a CPS is tolerance; i.e., its ability to function safely under possible disturbances and uncertainties in the actual operation. In this paper, we introduce a new, expressive notion of tolerance that describes how well a controller is capable of satisfying a desired system requirement, specified using Signal Temporal Logic (STL), under possible deviations in the system. Based on this definition, we propose a novel analysis problem, called the tolerance falsification problem, which involves finding small deviations that result in a violation of the given requirement. We present a novel, two-layer simulation-based analysis framework and a novel search heuristic for finding small tolerance violations. To evaluate our approach, we construct a set of benchmark problems where system parameters can be configured to represent different types of uncertainties and disturbances in the system. Our evaluation shows that our falsification approach and heuristic can effectively find small tolerance violations.

1 Introduction

Modern cyber-physical systems (CPS), such as autonomous vehicles, medical devices, the Internet of Things (IoT), and smart cities, operate in dynamic and uncertain environments. The tolerance of a CPS characterizes its ability to function correctly in the presence of such uncertainties. The mission-critical and safety-critical nature of CPS accentuates the need to provide a high level of tolerance against uncertainties, as a failure to do so could result in severe consequences, from safety hazards to economic losses.

As CPS grow in complexity and scale, reinforcement learning (RL) techniques are gaining popularity for learning CPS controllers. In general, these controllers perceive the state of the CPS and take an action that maximizes the long-term utility. The utility is captured through reward functions designed by engineers. An RL controller is trained via a trial-and-error process where an agent takes actions in a simulator of the CPS and uses the simulated results of the actions to discover an optimal control strategy. Hence, the fidelity of the simulator plays a big role in the effectiveness of a trained controller. Often, there are reality gaps between the actual deployed environment and the simulator due to approximation and under-modeling of physical phenomena, which cause controllers trained in simulation to perform poorly in the real world [1]. This performance degradation can also manifest as unsafe system behaviors in the actual environment.

To make an RL controller tolerant of possible errors due to these reality gaps, existing works often focus on the training stage, such as robust RL [2, 3] and domain randomization [4, 5, 6]. They investigate the problem of training a controller that is capable of maintaining desired system behavior in the presence of possible system deviations (environmental uncertainties, observation or actuation errors, disturbances, and modeling errors). However, these methods are limited in how desired system behaviors are expressed. In RL, the desired behavior is often expressed using a reward function [2, 3]; it is well known that encoding a high-level system requirement using a reward function is a challenging task that requires a significant amount of domain expertise and manual effort via reward shaping [7, 8]. Additionally, certain requirements cannot be directly encoded as rewards, especially those that capture time-varying behavior (e.g., “the vehicle must come to a stop in the next 3 seconds”).

Due to the limitation in reward functions and the data-driven nature of RL, these training-oriented methods in general do not provide formal guarantees about tolerance. Also, there is a lack of focus on post-training analysis for the tolerance of RL controllers, especially in the sense of maintaining a desired, complex system specification. Moreover, a formal definition of tolerance for RL controllers with respect to system behavior (beyond rewards) is also missing.

To fill this gap in post-training tolerance analysis of RL controllers, we propose a new notion of tolerance based on specifications in Signal Temporal Logic (STL) [9]. Our definition assumes a parametric representation of a system, where system parameters capture the dynamics of the system (e.g., acceleration of a nearby vehicle) that are affected by system deviations (e.g., sensor errors). A system is initially assigned a set of nominal parameters that describe its expected dynamics. Then, a change in parameters, denoted by $\delta$, corresponds to a deviation that may occur. Finally, a controller is said to be tolerant of certain deviations with respect to an STL specification if and only if the controller is capable of satisfying the specification even under those deviations.

Based on this tolerance definition, we propose a new type of analysis problem called tolerance falsification. The goal is to find deviations in system parameters that result in a violation of the desired system specification. Specifically, we argue that identifying a violation closer to the nominal system parameters is more valuable, since such a violation is more likely to occur in practice. Intuitively, a system needs to tolerate these deviations before addressing the ones that are further away from the nominal set. The identified violations could be used to retrain the controller for improved tolerance, or to build a run-time monitor to detect when the system deviates into an unsafe region.

In addition, we propose a novel simulation-based framework where the tolerance falsification problem is formulated as a two-layer optimization problem. In the lower layer, for a given system deviation $\delta$ (representing a particular system dynamics), an optimization-based method is used to find a falsifying signal; i.e., a sequence of system states that results in a violation of the given STL specification. In the upper layer, the space of possible deviations is explored to find small deviations that result in a specification violation, repeatedly invoking the lower-layer falsification. The results generated from the lower layer guide the upper-layer search towards small violating deviations. Furthermore, we present a novel heuristic that leverages the differences between the trajectories from the nominal and deviated environments, captured via cosine distances, to improve the effectiveness of the upper-layer search algorithm.

To evaluate the effectiveness of our falsification approach, we have constructed a set of benchmark case studies. In particular, these benchmark systems are configurable with system parameters to generate a range of systems with different behaviors due to the parameters’ impact on how the system evolves. Our evaluation shows that our approach can be used to effectively find small deviations that cause a specification violation in these systems.

This paper makes the following contributions:

  • We present a novel, formal definition of tolerance for RL controllers (Sec. 4), and a new analysis problem named the tolerance falsification problem (Sec. 5).

  • We propose a two-layer optimization-based method and a novel search heuristic for finding small violating deviations (Sec. 6).

  • We present an RL tolerance analysis benchmark and evaluate the effectiveness of our approach through experimental results on it (Sec. 7).

2 Preliminaries

2.0.1 Markov Decision Process

We model the systems under study as discrete-time stochastic systems in the form of Markov Decision Processes (MDPs) [10]. An MDP is a tuple $\mathbf{M} = \langle S, A, T, I, R \rangle$, where $S \subseteq \mathbb{R}^n$ is the set of states, $A \subseteq \mathbb{R}^m$ is the set of actions (e.g., control inputs), $T: S \times A \times S \to [0,1]$ is the transition function where $T(s,a,s')$ represents the probability of moving from state $s$ to $s'$ by action $a$ and $\forall s \in S, a \in A: \sum_{s' \in S} T(s,a,s') = 1$, $I: S \to [0,1]$ is the initial state distribution, and $R: S \to \mathbb{R}$ is the reward function mapping states to real values. As is often the case for real-world systems, we assume that the transition function is unknown.

We consider black-box deterministic control policies for a system. Formally, a policy $\pi: S \to A$ for an MDP maps states to actions. Reinforcement learning (RL) [11] is the process of learning an optimal policy $\pi^*$ that maximizes the cumulative discounted reward for this MDP. Additionally, a trajectory $\sigma$ of an MDP given an initial state $s_0 \sim I$ and a policy $\pi$ is defined accordingly as $\sigma = (s_0 \xrightarrow{a_0} s_1 \ldots s_i \xrightarrow{a_i} s_{i+1} \ldots)$ where $a_i = \pi(s_i)$ and $s_{i+1} \sim T(s_i, a_i)$. Finally, we use $\mathcal{L}(\mathbf{M}||\pi)$ to represent the behavior of the controlled system, i.e., the set of all trajectories of a system $\mathbf{M}$ under the control of $\pi$.
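The trajectory definition above can be illustrated with a minimal rollout sketch. Everything concrete here is an assumption made for illustration: the one-dimensional `step` function, the proportional `policy`, and the noise range are hypothetical stand-ins, and the transition function is only sampled from, never inspected, in keeping with the black-box assumption.

```python
import random

def rollout(step, policy, s0, horizon, rng):
    """Return one finite trajectory [(s_0, a_0), ..., (s_H, None)]."""
    traj, s = [], s0
    for _ in range(horizon):
        a = policy(s)            # a_i = pi(s_i): deterministic policy
        traj.append((s, a))
        s = step(s, a, rng)      # s_{i+1} ~ T(s_i, a_i): sampled, T is a black box
    traj.append((s, None))       # final state has no action attached
    return traj

# Hypothetical 1-D system: the state moves by the action plus bounded noise.
def step(s, a, rng):
    return s + a + rng.uniform(-0.1, 0.1)

# Hypothetical controller that pushes the state toward 0.
def policy(s):
    return -0.5 * s

rng = random.Random(0)
sigma = rollout(step, policy, s0=1.0, horizon=20, rng=rng)
states = [s for s, _ in sigma]   # the state signal later evaluated against STL
```

Repeating `rollout` with different random draws yields different members of the trajectory set of the controlled system; a finite number of rollouts only ever samples that set.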

2.0.2 Signal Temporal Logic

A signal $\mathbf{s}$ is a function $\mathbf{s}: T \to D$ that maps a time domain $T \subseteq \mathbb{R}_{\geq 0}$ to a $k$-dimensional real-valued space $D \subseteq \mathbb{R}^k$, where $\mathbf{s}(t) = (v_1, \ldots, v_k)$ represents the value of the signal at time $t$. Then, an STL formula is defined as:

$$\phi := \mu ~|~ \neg\phi ~|~ \phi \land \psi ~|~ \phi \lor \psi ~|~ \phi~\mathcal{U}_{[a,b]}~\psi$$

where $\mu$ is a predicate of the signal $\mathbf{s}$ at time $t$ in the form of $\mu \equiv \mu(\mathbf{s}(t)) > 0$ and $[a,b]$ is the time interval (or simply $I$). The until operator $\mathcal{U}$ defines that $\phi$ must be true until $\psi$ becomes true within a time interval $[a,b]$. Two other operators can be derived from until: eventually ($\Diamond_{[a,b]}~\phi := \top~\mathcal{U}_{[a,b]}~\phi$) and always ($\Box_{[a,b]}~\phi := \neg\Diamond_{[a,b]}~\neg\phi$).

The satisfaction of an STL formula can be measured in a quantitative way as a real-valued function $\rho(\phi, \mathbf{s}, t)$ (also known as the STL robustness value), which represents the difference between the actual signal value and the expected one [9]. For example, given a formula $\phi \equiv \mathbf{s}(t) - 3 > 0$, if $\mathbf{s}(t) = 5$ at time $t$, then the satisfaction of $\phi$ can be evaluated by $\rho(\phi, \mathbf{s}, t) = \mathbf{s}(t) - 3 = 2$. The definition of $\rho$ is as follows ($\rho$ for the other operators can be formulated from these):

$$\begin{aligned}
\rho(\mu, \mathbf{s}, t) &= \mu(\mathbf{s}(t)) \\
\rho(\neg\phi, \mathbf{s}, t) &= -\rho(\phi, \mathbf{s}, t) \\
\rho(\phi \land \psi, \mathbf{s}, t) &= \min\{\rho(\phi, \mathbf{s}, t), \rho(\psi, \mathbf{s}, t)\} \\
\rho(\phi~\mathcal{U}_I~\psi, \mathbf{s}, t) &= \sup_{t_1 \in I + t} \min\{\rho(\psi, \mathbf{s}, t_1), \inf_{t_2 \in [t, t_1]} \rho(\phi, \mathbf{s}, t_2)\}
\end{aligned}$$
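As a rough illustration of these semantics, the sketch below evaluates robustness on a finite signal sampled at integer time steps. It is not the monitoring tool used in this work. It assumes each predicate is already given as the list of its values $\mu(\mathbf{s}(t))$, that intervals $[a,b]$ are integer step offsets, and that windows reaching past the end of the signal are truncated; all function names are hypothetical.

```python
NEG_INF, POS_INF = float("-inf"), float("inf")

def rho_not(r):
    return [-x for x in r]

def rho_and(r1, r2):
    return [min(x, y) for x, y in zip(r1, r2)]

def rho_or(r1, r2):
    # Derived from the rules above via  phi or psi == not(not phi and not psi).
    return [max(x, y) for x, y in zip(r1, r2)]

def rho_until(r_phi, r_psi, a, b):
    n, out = len(r_phi), []
    for t in range(n):
        best = NEG_INF                      # sup over an empty window
        for t1 in range(t + a, min(t + b, n - 1) + 1):
            inner = min(r_phi[t:t1 + 1])    # inf over t2 in [t, t1]
            best = max(best, min(r_psi[t1], inner))
        out.append(best)
    return out

def rho_eventually(r, a, b):
    # eventually phi := true U phi, where "true" has robustness +infinity
    return rho_until([POS_INF] * len(r), r, a, b)

def rho_always(r, a, b):
    # always phi := not eventually not phi
    return rho_not(rho_eventually(rho_not(r), a, b))

# Predicate  s(t) - 3 > 0  on the sampled signal s = [5, 4, 2, 6].
signal = [5, 4, 2, 6]
r = [v - 3 for v in signal]                 # [2, 1, -1, 3]
```

With this signal, `r[0]` reproduces the value 2 from the example above, while `rho_always(r, 0, 3)[0]` is negative because the signal drops below 3 at the third sample.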

3 Motivating Example

We use an RL system which is required to satisfy a safety specification to illustrate our tolerance definition and analysis. Consider the CarRun safe RL system implemented in bullet-safety-gym (https://github.com/SvenGronauer/Bullet-Safety-Gym), depicted in Figure 1. The CarRun system has a four-wheeled agent based on MIT Racecar (https://github.com/mit-racecar) placed between two safety boundaries. The safety boundaries are non-physical bodies that can be breached without causing a collision. The objective is to go through the avenue between the boundaries without penetrating them. The agent velocity also needs to be maintained below a user-defined threshold. Formally, it can be specified by an STL invariant: $\Box(|y_{pos}| < C_1 \land |v| < C_2)$, where $C_1$ and $C_2$ are the constant thresholds for the y coordinate and the velocity, respectively.
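As a worked instance of the robustness semantics in Sec. 2, the invariant can be unfolded as follows. This is a sketch under two assumptions that the text does not state: each conjunct is rewritten in the predicate form $\mu(\mathbf{s}(t)) > 0$, and the unbounded $\Box$ ranges over the whole trajectory horizon.

```latex
\rho\big(\Box(|y_{pos}| < C_1 \land |v| < C_2),\, \mathbf{s},\, 0\big)
  = \inf_{t \geq 0} \min\{\, C_1 - |y_{pos}(t)|,\; C_2 - |v(t)| \,\}
```

A negative value therefore means that at some time step the agent is outside the boundaries or above the velocity threshold, and its magnitude indicates by how much.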

Given the CarRun system, we can train an RL controller such that the car agent satisfies the safety specification above using methods from safe RL [12, 13]. However, to transfer this “safe” controller to the real world, we need to account for the reality gap between the simulator and the deployed environment. This reality gap might arise due to inaccurate modeling of contact surfaces, actuator errors, and incorrect physical parameter configuration (e.g., friction and mass). Such gaps can lead to the agent violating the safety specification in the real world, despite satisfying it in simulation. Additionally, since the RL controllers are black-box neural networks, it is extremely hard to capture their concrete behaviors. The difficulty in reasoning about the controller’s behaviors, coupled with the stochasticity of the system, leads to a challenging analysis problem of understanding their tolerance. This has long been one of the key drawbacks that limit the application of these controllers in the real world [6, 14].

Figure 1: Behavior of the CarRun system under different system parameters. In the nominal condition (left), $y_{pos}$ in all trajectories is below the threshold (green line) and thus the system is safe. However, in the deviated condition (right), there exists a trajectory where $y_{pos}$ exceeds the threshold and hence the safety requirement is violated.

Since it can be challenging to quantitatively measure these reality gaps, we take a parametric approach. We approximate the reality gap between the simulator and the deployed environment quantitatively using deviations as parameters. For example, we model the CarRun system as being parametric with two controllable system parameters, $tm$ (turn multiplier, a factor for the steering control) and $sm$ (speed multiplier, a factor for the speed control). These parameters govern the impact of the action provided by the controller, e.g., a larger $sm$ will result in more aggressive accelerations. The intuition behind these deviations is to account for actuation issues that arise while deploying agents in the real world. Figure 1 shows the behavior of CarRun under different system parameters. In Figure 1(a), the agent is deployed in the nominal condition with default system parameters. In this scenario, the controller successfully manages to drive the Car agent through the avenue and also maintains a safe velocity, i.e., the safety specification is satisfied. In Figure 1(b), we show the same controller deployed under a deviated CarRun environment with different turn and speed multipliers. In this scenario, the controller makes the car behave erratically, which eventually makes the car cross the safety boundary, i.e., the safety specification is violated.

This example highlights the brittleness of these controllers concerning safety specifications and the need for stakeholders to address pre-deployment questions like: What are the possible deviations that these RL controllers can tolerate? More specifically, how much change in the system parameters can the controller tolerate before it begins to violate the given safety specification? We formulate this question as a type of analysis problem called tolerance falsification, where the goal is to find deviations in system parameters (e.g., the changes in the turn and speed multiplier of CarRun) where the deviated system violates the given specification. This analysis problem is challenging due to the stochastic, black-box nature of the system as well as the opacity of NN-based RL controllers.

Additionally, a notion of “quality of solution” while searching for system parameters is necessary to factor in the practical assumptions about the operating context of this system. For example, deviations that are closer to the nominal parameters are more likely to occur in practice and hence need to be prioritized during analysis. This helps avoid impractically large deviation values that might cause a violation but offer little insight to system designers. Thus, our falsification process attempts to find violations with small deviations; i.e., minimal parameter changes that introduce a risk of specification violation into the system. The output of this analysis (i.e., violations) can help the engineer identify brittleness in the RL-based controller and can be used to redesign or retrain the controller to improve its tolerance.

4 Tolerance Definition

4.1 Definition of Specification-Based Tolerance

In this work, we use STL to specify the desired properties of a system, and system parameters to capture the deviations in system dynamics. Parameters can represent a variety of deviations such as environmental disturbances (e.g., wind or turbulence), internal deviations (e.g., mass variation of a vehicle), observation errors (e.g., sensor errors), or actuation errors (e.g., errors in steering control). To capture systems with such diverse dynamics using parameters, we leverage the notion of parametric control systems [15, 16].

A parametric discrete-time stochastic system $\mathbf{M}^{\Delta}$ defines a set of systems such that $\Delta \subseteq \mathbb{R}^k$ represents the parameter domain, and for any $\delta \in \Delta$, an instance of a parametric system $\mathbf{M}^{\delta}$ is an MDP $\mathbf{M}^{\delta} = \langle S, A, T^{\delta}, I^{\delta}, R \rangle$, where the initial state distribution $I^{\delta}$ and the state transition distributions $T^{\delta}$ are both defined by the parameter $\delta$. Parameter $\delta$ represents a deviation to a system and $\Delta$ represents the domain of all deviations of interest. In addition, we use $\delta_0 \in \Delta$ to represent the zero-deviation point, i.e., the parameter under which the system $\mathbf{M}^{\delta_0}$ exhibits the expected, normative behavior. Then, we define a system as being tolerant of a certain deviation as follows:

Definition 1

For a system $\mathbf{M}$, a policy $\pi$, a deviation parameter $\delta$, and an STL property $\phi$, we say the system can tolerate the deviation when the parametric form of $\mathbf{M}$ with parameter $\delta$ under the control of $\pi$ satisfies the property, i.e., $\mathbf{M}^{\delta}||\pi \models \phi$.

Then, the tolerance of a controller can be defined as all the possible deviations that the system can tolerate. Formally:

Definition 2

For a system $\mathbf{M}$, a policy $\pi$, and an STL property $\phi$, the tolerance of the controller is defined as the maximal $\mathbf{\Delta} \subseteq \mathbb{R}^k$ s.t. $\forall \delta \in \mathbf{\Delta}: \mathbf{M}^{\delta}||\pi \models \phi$.

In other words, the tolerance of a control policy $\pi$ is measured by the maximal parameter domain $\mathbf{\Delta}$ of a system such that each deviated system $\mathbf{M}^{\delta}$ with $\delta \in \mathbf{\Delta}$ still satisfies the property under the control of $\pi$.

4.2 Strict Evaluation of Tolerance

In this work, we focus on a specific evaluation of tolerance. Specifically, Def. 1 and 2 depend on the interpretation of $\mathbf{M}^{\delta}||\pi \models \phi$, i.e., a system satisfying an STL property; however, STL satisfaction is computed over a single trajectory. From the literature [17], one common evaluation criterion is that a system must not contain a trajectory that violates the STL property. In other words, even in a worst-case scenario that is unlikely to occur in a stochastic system, the system should still guarantee the property. This interpretation enforces a strong guarantee on the system, and thus we call it the strict satisfaction of STL in this work. Formally:

Definition 3

A discrete-time stochastic system $\mathbf{M}$ strictly satisfies an STL property $\phi$ under the control of a policy $\pi$ iff every controlled trajectory produces a non-negative STL robustness value, i.e., $\mathbf{M}||\pi \models \phi \Leftrightarrow \forall \sigma \in \mathcal{L}(\mathbf{M}||\pi): \rho(\phi, \mathbf{s}_{\sigma}, 0) \geq 0$, where $\mathbf{s}_{\sigma}$ is the signal of state values of trajectory $\sigma$.

With this interpretation, we can then restate Def. 2 as:

Definition 4

The tolerance of a policy $\pi$ that strictly satisfies an STL property $\phi$ is the maximal $\mathbf{\Delta}$ s.t. $\forall \delta \in \mathbf{\Delta}, \sigma \in \mathcal{L}(\mathbf{M}^{\delta}||\pi): \rho(\phi, \mathbf{s}_{\sigma}, 0) \geq 0$.

Although this definition delineates a strong tolerance guarantee, it can also be extended to more relaxed notions with probabilistic guarantees. In that case, other evaluation techniques for STL specification satisfaction such as [18, 19, 20] can be leveraged. We leave this as an extension of our work in the future.
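Because strict satisfaction (Def. 3) quantifies over every trajectory, simulation can only refute it, never establish it. The sketch below illustrates this asymmetry on a hypothetical one-dimensional parametric system. The dynamics, the multiplier-style parameter `delta` (nominal value 1.0), the property $\Box(|x| < 1)$, and all function names are assumptions made for illustration, not part of the benchmark.

```python
import random

def robustness_always_below(signal, threshold):
    """rho of  always(|x| < threshold)  on a finite discrete-time signal."""
    return min(threshold - abs(x) for x in signal)

def sample_trajectory(delta, horizon, rng):
    """Hypothetical M^delta || pi: delta scales the effect of the action."""
    x, xs = 0.5, []
    for _ in range(horizon):
        xs.append(x)
        u = -0.5 * x                                 # fixed policy pi
        x = x + delta * u + rng.uniform(-0.05, 0.05)
    xs.append(x)
    return xs

def refute_strict_satisfaction(delta, threshold, n_samples, horizon, seed=0):
    """Return a violating state signal if one is sampled, else None.

    None does NOT prove strict satisfaction; it only means that no sampled
    trajectory had negative robustness.
    """
    rng = random.Random(seed)
    for _ in range(n_samples):
        xs = sample_trajectory(delta, horizon, rng)
        if robustness_always_below(xs, threshold) < 0:
            return xs
    return None

nominal = refute_strict_satisfaction(1.0, 1.0, n_samples=50, horizon=30)
deviated = refute_strict_satisfaction(5.0, 1.0, n_samples=50, horizon=30)
```

In this toy system the nominal parameter keeps the state well inside the bound, so `nominal` is `None`, whereas a multiplier of 5.0 makes the closed loop unstable and a counterexample is returned.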

5 Tolerance Analysis

5.1 Tolerance Falsification

According to Def. 4, to compute the tolerance of a controller, we need to: (1) (formally) show that a stochastic system does not contain a trajectory that violates the STL property, and (2) compute the maximal parameter set $\mathbf{\Delta}$, which could be in any non-convex or even non-continuous shape, where all system instances $\mathbf{M}^{\delta}$ should satisfy step (1). This exhaustive computation is intractable due to the black-box RL controllers coupled with the stochasticity in the system.

Therefore, in this work, instead of computing or approximating the tolerance $\mathbf{\Delta}$, we consider the problem of falsifying a given estimation of tolerance $\widehat{\mathbf{\Delta}}$, i.e., finding a deviation $\delta \in \widehat{\mathbf{\Delta}}$ that the system cannot tolerate for a given controller. More formally, we define:

Problem 1 (Tolerance Falsification)

For a system $\mathbf{M}$, a policy $\pi$, and an STL property $\phi$, given a tolerance estimation $\widehat{\mathbf{\Delta}} \subseteq \mathbb{R}^k$, the goal of a tolerance falsification problem $\mathcal{F}(\mathbf{M}, \pi, \phi, \widehat{\mathbf{\Delta}})$ is to find a deviation $\delta \in \widehat{\mathbf{\Delta}}$ s.t. $\exists \sigma \in \mathcal{L}(\mathbf{M}^{\delta}||\pi): \rho(\phi, \mathbf{s}_{\sigma}, 0) < 0$.

5.2 Minimum Tolerance Falsification

Intuitively, a larger deviation (i.e., a deviation that is far away from the expected system parameter) would likely cause a larger deviation in the system behavior leading to a specification violation. However, controllers are generally not designed to handle arbitrarily large deviations in the first place, and analyzing their performance in these situations offers limited insight to the designer. Moreover, if the designer decides to improve the tolerance of a controller (which is a costly endeavor), deviations closer to the nominal system are given high priority due to their higher likelihood of occurrence. In light of these practical design and deployment assumptions, we focus on the minimum deviation problem.

Problem 2

Given a minimum tolerance falsification problem $\mathcal{F}_{min}(\mathbf{M}, \pi, \phi, \widehat{\mathbf{\Delta}})$ with $\delta_0 \in \widehat{\mathbf{\Delta}}$ as the zero-deviation point, the goal is to find a deviation $\delta \in \widehat{\mathbf{\Delta}}$ s.t. $\mathbf{M}^{\delta}||\pi \not\models \phi$ and $\delta$ minimizes a distance measure $\lVert \delta - \delta_0 \rVert_p$.
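To make the objective of Problem 2 concrete, the sketch below selects, from a set of already evaluated deviations, the violating one closest to the zero-deviation point under a $p$-norm. The candidate values and scores are made-up illustrative numbers, not experimental results, and the function names are hypothetical. The result is only the best among the evaluated candidates, not a certified minimum.

```python
def p_norm_distance(delta, delta0, p=2):
    """|| delta - delta0 ||_p for equal-length parameter tuples."""
    return sum(abs(d - d0) ** p for d, d0 in zip(delta, delta0)) ** (1.0 / p)

def closest_violation(evaluated, delta0, p=2):
    """evaluated: iterable of (delta, score); a negative score marks a violation."""
    violations = [d for d, score in evaluated if score < 0]
    if not violations:
        return None                      # nothing falsified among the candidates
    return min(violations, key=lambda d: p_norm_distance(d, delta0, p))

# Illustrative 2-parameter example (think of a turn and a speed multiplier).
delta0 = (1.0, 1.0)
evaluated = [
    ((1.0, 1.0), 0.8),     # nominal point: satisfied
    ((3.0, 1.0), -0.4),    # violation, far from nominal
    ((1.5, 1.4), -0.1),    # violation, close to nominal
    ((1.2, 1.1), 0.3),     # satisfied
]
best = closest_violation(evaluated, delta0)
```

Here `best` is `(1.5, 1.4)`: it violates the property less severely than `(3.0, 1.0)` but lies much closer to `delta0`, which is the kind of deviation Problem 2 prioritizes.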

5.3 Falsification by Optimization

Since the satisfaction of STL can be measured quantitatively, the tolerance falsification problem can be formulated as an optimization problem. Consider a real-valued system evaluation function $\Gamma(\mathbf{M},\pi,\phi) \in \mathbb{R}$. We assume that this function's value is negative if and only if the controlled system violates the property, i.e.,

$$\Gamma(\mathbf{M},\pi,\phi) < 0 \Leftrightarrow \mathbf{M}||\pi \not\models \phi$$

and the smaller the value, the larger the degree of property violation. Then, a tolerance falsification problem $\mathcal{F}(\mathbf{M},\pi,\phi,\widehat{\mathbf{\Delta}})$ can be formulated as the following optimization problem:

$$\mathop{\arg\min}\limits_{\delta \in \widehat{\mathbf{\Delta}}} ~ \Gamma(\mathbf{M}^{\delta},\pi,\phi) \qquad (1)$$

That is, finding a parameter $\delta \in \widehat{\mathbf{\Delta}}$ that minimizes the evaluation function $\Gamma$ and observing the resulting value gives information about the system's property satisfaction. Concretely, if the minimum function value is negative, then the associated parameter $\delta$ indicates a deviation where the system violates the property $\phi$. Specifically, in the case of strict evaluation of tolerance, the system evaluation function $\Gamma$ is defined as:

$$\Gamma(\mathbf{M},\pi,\phi) = \min\{\rho(\phi,\mathbf{s}_{\sigma},0) ~|~ \sigma \in \mathcal{L}(\mathbf{M}||\pi)\} \qquad (2)$$

Finally, we can formulate a minimum tolerance falsification problem $\mathcal{F}_{min}(\mathbf{M},\pi,\phi,\widehat{\mathbf{\Delta}})$ as a constrained optimization problem:

$$\mathop{\arg\min}\limits_{\delta \in \widehat{\mathbf{\Delta}}} ~ \lVert\delta-\delta_0\rVert_p ~~ s.t. ~~ \Gamma(\mathbf{M}^{\delta},\pi,\phi) < 0 \qquad (3)$$

Note that Eq. 2 is the typical formulation for solving a CPS falsification problem, which aims to find a trajectory that violates an STL specification [17]. Thus, the problem of finding any tolerance violation (Eq. 1) can be formulated as a min-min optimization problem, which can be solved by existing CPS falsifiers such as Breach [21] and PsyTaLiRo [22, 23].
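To make the evaluation function of Eq. 2 concrete, the following minimal sketch approximates $\Gamma$ by the smallest robustness over a finite set of sampled traces. The property $G(x < c)$, the one-dimensional traces, and the function names are assumptions chosen for illustration; the exact minimum ranges over all behaviors in $\mathcal{L}(\mathbf{M}||\pi)$, which a simulation-based analysis can only approximate.

```python
from typing import Callable, Iterable, Sequence

Trace = Sequence[float]


def robustness_always_below(trace: Trace, bound: float) -> float:
    """Robustness of G(x < bound) at time 0: the smallest margin bound - x_t."""
    return min(bound - x for x in trace)


def evaluate_system(traces: Iterable[Trace], rho: Callable[[Trace], float]) -> float:
    """Approximate Gamma as the minimum robustness over the sampled traces."""
    return min(rho(trace) for trace in traces)


if __name__ == "__main__":
    # Two hypothetical traces of a scalar signal; the second one exceeds 1.0.
    sampled = [[0.1, 0.4, 0.3], [0.2, 0.9, 1.2]]
    gamma = evaluate_system(sampled, lambda tr: robustness_always_below(tr, 1.0))
    print("violation found" if gamma < 0 else "no violation found", gamma)
```

A negative value certifies a violation on a concrete trace, whereas a non-negative value only means that no violation was found among the samples.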

However, the minimum falsification problem (Eq. 3) features multi-objective optimization or min-max optimization characteristics: minimizing the deviation distance ($\lVert\delta-\delta_0\rVert_p$) would likely cause a larger system evaluation value ($\Gamma$). Since these objectives are inherently conflicting, nuanced techniques are required to find solutions. Although existing CPS falsifiers can be configured to represent this additional cost/objective function (either via specification modification or through explicit cost function definition), their underlying optimization techniques do not have a multi-layer setup to handle this off the shelf. Therefore, we present a novel two-layer search for solving tolerance falsification problems, which is particularly effective in finding minimum violating deviations.
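The conflict between the two objectives can be seen on a finite candidate set. The sketch below is an illustration under stated assumptions: `toy_gamma` is a hypothetical stand-in for $\Gamma(\mathbf{M}^{\delta},\pi,\phi)$, the candidates are hand-picked, and the Euclidean distance is used for $\lVert\delta-\delta_0\rVert_p$.

```python
import math
from typing import Callable, Optional, Sequence, Tuple

Deviation = Tuple[float, ...]


def most_violating(
    candidates: Sequence[Deviation], gamma: Callable[[Deviation], float]
) -> Deviation:
    """Eq. 1 restricted to a finite set: minimize Gamma."""
    return min(candidates, key=gamma)


def closest_violating(
    candidates: Sequence[Deviation],
    gamma: Callable[[Deviation], float],
    delta0: Deviation,
) -> Optional[Deviation]:
    """Eq. 3 restricted to a finite set: minimize distance subject to Gamma < 0."""
    violating = [d for d in candidates if gamma(d) < 0]
    if not violating:
        return None
    return min(violating, key=lambda d: math.dist(d, delta0))


def toy_gamma(delta: Deviation) -> float:
    # Hypothetical evaluation function: violated once |delta[0]| exceeds 0.5.
    return 0.5 - abs(delta[0])


DELTA0: Deviation = (0.0,)
CANDIDATES = [(0.2,), (0.6,), (-0.9,), (1.5,)]

if __name__ == "__main__":
    print(most_violating(CANDIDATES, toy_gamma))             # farthest candidate
    print(closest_violating(CANDIDATES, toy_gamma, DELTA0))  # nearest violation
```

The two criteria select different candidates here: the most violating deviation is also the farthest one, while the minimum violating deviation barely crosses the violation threshold.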

6 Simulation-Based Tolerance Analysis Framework

In this section, we outline our analysis framework for solving tolerance falsification problems for black-box CPS and RL controllers (as shown in Figure 2). We first explain our novel two-layer falsification algorithm and then present a heuristic for solving the minimum falsification problem more effectively.

Figure 2: Overview of the two-layer falsification algorithm.

6.1 A Two-Layer Falsification Algorithm

Algorithm 1 presents our two-layer framework in detail. Lines 3-13 indicate the upper-layer search. In each iteration, the upper layer evaluates a set of deviation samples. For a deviation $\delta$, it instantiates a deviated system $\mathbf{M}^{\delta}$ (line 6), computes the system evaluation value $\gamma$ (line 7), and then computes the objective function value $v$ (line 8). The objective value indicates the quality of a deviation sample, e.g., whether it causes a violation of tolerance and has a small distance to the zero-deviation point. Finally, the objective values for this iteration are used to update the best result so far (line 11) and to generate the next candidate solutions (line 12). In particular, line 7 indicates the lower-layer task. It corresponds to the system evaluation function $\Gamma$ (which is the minimal STL robustness value according to Eq. 2).

Given the characteristics of our falsification problem, we propose this two-layer structure for multiple reasons. First, the separation of the deviation search from the lower-layer CPS falsification allows us to define richer evaluation metrics and heuristics that are solely relevant to deviation searching. These heuristics, if used in a single-layer objective, would lead to an ill-posed optimization problem exacerbated by the highly non-convex landscapes of traditional CPS falsification. Second, this separation of concerns allows us to find deviations closer to nominal points even for systems with high-dimensional state spaces, complex dynamics, and rugged robustness landscapes with multiple local minima. In these settings, a one-layer search would converge to local solutions without exploring the search space extensively. Finally, this two-layer structure provides enough extensibility to:

  • Integrate many off-the-shelf optimization techniques for the upper layer, as we have done for Uniform Random Sampling, CMA-ES [24], NSGA-II [25], and Extended Ant Colony [26].

  • Integrate state-of-the-art CPS falsifiers (we integrated CMA-ES, Breach [21], and PsyTaLiRo [23]) and simulation platforms (we used OpenAI-Gym [27], PyBullet [28], and Matlab Simulink).

  • Extend to other STL evaluation methods (function $\Gamma$), e.g., evaluation with probabilistic guarantees [18, 19, 20], cumulative STL [29], or mean STL [30].

Input : $\mathbf{M}, \pi, \phi, \widehat{\mathbf{\Delta}}$, and objective function $f$
Output : violation $\delta_{best} \in \widehat{\mathbf{\Delta}}$
1 $\delta_{best} \leftarrow nil$;
2 $X \leftarrow$ initial candidates from $\widehat{\mathbf{\Delta}}$;
3 while termination criteria = false do
4       $V \leftarrow \langle\rangle$;
5       for $\delta \in X$ do
6             $\mathbf{M}^{\delta} \leftarrow$ Instantiate($\mathbf{M}^{\widehat{\mathbf{\Delta}}}, \delta$);
7             $\gamma \leftarrow$ CPSFalsification($\mathbf{M}^{\delta}, \pi, \phi$);
8             $v \leftarrow f(\delta, \gamma)$;   // heuristic computation.
9             $V \leftarrow V \frown \langle v \rangle$;
10       end for
11       $\delta_{best} \leftarrow$ UpdateBest($X, V$);
12       $X \leftarrow$ NextCandidates($f, X, V, \widehat{\mathbf{\Delta}}$);
13 end while
Algorithm 1 A Two-Layer Tolerance Falsification Algorithm
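The structure of the two-layer search can be sketched in a few lines of Python. This is a minimal sketch under several assumptions rather than the implementation used in our package: the upper layer is plain uniform random sampling, the objective $f$ returns the Euclidean distance to $\delta_0$ for violating samples and infinity otherwise, and `toy_cps_falsification` is a hypothetical closed-form stand-in for instantiating $\mathbf{M}^{\delta}$ and running a lower-layer falsifier.

```python
import math
import random
from typing import Callable, List, Optional, Sequence, Tuple

Deviation = Tuple[float, ...]
Bounds = Sequence[Tuple[float, float]]


def sample_uniform(bounds: Bounds, n: int, rng: random.Random) -> List[Deviation]:
    """Draw n deviation samples uniformly from the box-shaped domain."""
    return [tuple(rng.uniform(lo, hi) for lo, hi in bounds) for _ in range(n)]


def two_layer_falsify(
    bounds: Bounds,
    delta0: Deviation,
    cps_falsification: Callable[[Deviation], float],
    iterations: int = 10,
    batch_size: int = 10,
    seed: int = 0,
) -> Optional[Deviation]:
    """Return the closest violating deviation found, or None if there is none."""
    rng = random.Random(seed)
    best: Optional[Deviation] = None
    best_value = math.inf
    candidates = sample_uniform(bounds, batch_size, rng)  # initial candidates
    for _ in range(iterations):  # upper-layer loop
        values = []
        for delta in candidates:
            # Lower layer: minimum robustness of the deviated system.
            gamma = cps_falsification(delta)
            # Assumed objective f(delta, gamma): distance if violating, else inf.
            values.append(math.dist(delta, delta0) if gamma < 0 else math.inf)
        for delta, value in zip(candidates, values):  # update the best result
            if value < best_value:
                best, best_value = delta, value
        # Next candidates: uniform sampling ignores the objective values.
        candidates = sample_uniform(bounds, batch_size, rng)
    return best


def toy_cps_falsification(delta: Deviation) -> float:
    # Hypothetical robustness: negative once |d0| + |d1| exceeds 1.
    return 1.0 - (abs(delta[0]) + abs(delta[1]))


BOUNDS: Bounds = [(-2.0, 2.0), (-2.0, 2.0)]
DELTA0: Deviation = (0.0, 0.0)

if __name__ == "__main__":
    print(two_layer_falsify(BOUNDS, DELTA0, toy_cps_falsification))
```

Replacing `sample_uniform` in the last step of the loop with an evolutionary update that uses the objective values gives the kind of guided upper-layer search discussed above.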

6.2 Heuristic for Efficient Minimum Tolerance Falsification

We present a novel heuristic for more effective discovery of minimum violating deviations. Our heuristic is based on the known issue of RL policy overfitting. Related literature has highlighted that RL policies can overfit to the specific parameterized system used for training, and that this dependence can reduce their applicability to real-world scenarios [4, 5, 6]. We exploit this overfitting tendency to guide the search for a $\delta$ that leads to a violation. Our heuristic is the cosine similarity between a deviated system's worst-case trajectory and a nominal system's worst-case trajectory. Formally:

$$dist(\delta) = \frac{\mathbf{Tr}_{\delta} \cdot \mathbf{Tr}_{\delta_0}}{\|\mathbf{Tr}_{\delta}\| \cdot \|\mathbf{Tr}_{\delta_0}\|}$$

Our intuition is that once a controller has been trained in a system parameterized by $\delta_0$, it overfits to that specific system. Then, when the controller is deployed in a deviated system, its worst-case trajectory will be similar to the nominal worst-case trajectory if the distance between the two MDPs, measured by the Euclidean distance between the parameters, is small. We measure the similarity between trajectories using cosine similarity. Thus, as the distance from the nominal MDP increases, the similarity score between the worst-case trajectories decreases. This heuristic provides more information about the search space: when two deviations have similar robustness values (which is possible due to the worst-case semantics of STL robustness), cosine similarity can help direct the search toward more violating directions.
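The similarity score itself is straightforward to compute. The sketch below assumes that each worst-case trajectory has already been flattened into a vector of equal length (e.g., by concatenating the state at each time step); how trajectories of different lengths are aligned is not covered here, and the function name is illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(tr_a: Sequence[float], tr_b: Sequence[float]) -> float:
    """Cosine of the angle between two flattened trajectories."""
    if len(tr_a) != len(tr_b):
        raise ValueError("trajectories must have the same length")
    dot = sum(a * b for a, b in zip(tr_a, tr_b))
    norm_a = math.sqrt(sum(a * a for a in tr_a))
    norm_b = math.sqrt(sum(b * b for b in tr_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero trajectory")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    nominal = [1.0, 2.0, 3.0]   # hypothetical nominal worst-case trajectory
    deviated = [1.0, 2.5, 2.0]  # hypothetical deviated worst-case trajectory
    print(cosine_similarity(nominal, deviated))
```

Because the score depends only on the direction of the trajectory vectors, a deviated trajectory that is a scaled copy of the nominal one still scores 1.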

Example. Concretely, we illustrate our heuristic's benefits through the CarRun system discussed in Section 3. First, the $\delta_0$ value (nominal parameters) of the system is $\delta_0 = [20.0, 0.5]$, where the first entry is the turn multiplier and the second is the steering multiplier. Then, consider two concrete deviations $\delta_1 = [16.566, 0.409]$ and $\delta_2 = [15.136, 0.447]$. Their normalized $l_2$ distances to $\delta_0$ (i.e., $\lVert\delta-\delta_0\rVert_2$) are 0.190 and 0.184, respectively. By solving the CPS falsification problems at these two deviations, we obtain minimum STL robustness values of 0.130 and 0.125, respectively. That is, given a similar deviation distance, their worst-case robustness values are also close. On the other hand, their corresponding worst-case trajectory similarity values are 0.900 and 0.995. Compared to the small difference in robustness values, this relatively large difference in similarity scores can better guide the upper-layer search to a violating deviation, i.e., the direction of $\delta_1$ is more likely to lead to a violation and should be prioritized in the search.

7 Evaluation

We implemented our proposed framework in a Python package (https://github.com/SteveZhangBit/STL-Robustness) and evaluated our technique through comprehensive experimentation. Our evaluation focuses on the minimum tolerance falsification problem. Specifically, we measure our technique's effectiveness through three key metrics: (1) the number of violations found, (2) the minimum distance of violations, and (3) the average distance of violations. Based on these metrics, we formulate the following research questions:

  • RQ1: Is our two-layer falsification framework more effective than leveraging an existing CPS falsifier?

  • RQ2: Does our heuristic improve the effectiveness of finding minimum violating deviations, compared to off-the-shelf optimization algorithms?

Although existing CPS falsifiers [21, 22, 23] cannot directly solve our minimum tolerance falsification problem (Problem 2), they allow customizing the objective function to optimize for both the deviation distance and STL robustness value to find minimum deviations. We call this technique one-layer search. For RQ1, we benchmark against the one-layer search baseline for the minimum tolerance falsification problem. For RQ2, we evaluate whether our proposed heuristic described in Section 6.2 further improves the effectiveness of our two-layer search, specifically the minimum distance.

7.1 Experimental Setup and Implementation Details

To answer these research questions, we first present a benchmark with systems and controllers trained to satisfy complex safety specifications. The benchmark contains six systems with nonlinear dynamics adopted from OpenAI-Gym, PyBullet, and Matlab Simulink. We extend the interfaces of these systems so that users can configure their behavior for tolerance analysis by changing the system parameters.

Then, we solve the corresponding minimum tolerance falsification problems for these systems. For each problem, we conduct the following experiments:

  • One-layer search leveraging an existing CPS falsifier by modifying the objective function to factor in the deviation distance and STL robustness value,

  • Two-layer search with CMA-ES for both the upper and lower layers,

  • Two-layer search with CMA-ES+Heuristic for the upper layer and CMA-ES for the lower layer.

Specifically, for the one-layer search, we employ the state-of-the-art CPS falsifiers, Breach [21] for Matlab systems and PsyTaLiRo [23] for Python systems, by extending their default objective functions. For the two-layer search, due to the complexity of the CPS and the non-convex nature of STL robustness, the upper-layer optimization is also non-convex and has multiple local minima. Additionally, we assume black-box systems and controllers. Given these two considerations, we adopt derivative-free evolutionary algorithms. Specifically, we primarily use CMA-ES as the upper-layer algorithm because it is widely used for black-box optimization and outperformed other evolutionary methods in our preliminary experiments. However, other algorithms can also be integrated. Furthermore, we also use CMA-ES for the lower-layer search, as it is widely used in CPS falsification tools [17, 21] and works competitively in both Python and Matlab environments. Finally, we implement our heuristic and use it alongside the evaluation function for the upper-layer search.

Each problem was run three times on a Linux machine with a 3.6GHz CPU and 24GB memory. For fair evaluation, we set the budget in terms of the number of interactions with the simulator for all our techniques. Specifically, for one run, the budget for the one-layer search is 10,000 simulations, and the budget for the two-layer search is 100 for the upper layer and 100 for the lower-layer falsification, i.e., at most 100 × 100 = 10,000 simulations.

7.2 Results

Table 1: Minimum tolerance falsification results.

              One-layer search              CMA-ES                        CMA-ES w/ Heuristic
System        Viol.  Min Dst.  Avg. Dst.    Viol.  Min Dst.  Avg. Dst.    Viol.  Min Dst.  Avg. Dst.
Cartpole      90     0.300     0.399        69     0.285     0.449        79     0.256     0.417
LunarLander   -      -         -            74     0.026     0.222        84     0.020     0.293
CarCircle     11     0.143     0.255        22     0.102     0.219        57     0.068     0.454
CarRun        25     0.191     0.249        68     0.161     0.449        109    0.156     0.399
ACC           N/A    N/A       N/A          43     0.110     0.323        110    0.138     0.415
WTK           300    0.299     0.443        54     0.296     0.454        45     0.319     0.533

Table 1 summarizes the results for solving the minimum tolerance falsification problems. The Viol. column shows the total number of violations found across the three runs. The Min Dst. and Avg. Dst. columns show the minimum and average normalized $l_2$ distance to the zero-deviation point (i.e., $\lVert\delta-\delta_0\rVert_2$) of the found violations, respectively. The run time of our approach is dominated by the underlying simulation time of a system, which vastly outweighs the overhead added by our evolutionary search algorithms. Thus, given the same budget of simulation calls, the total run time of our approach is comparable to that of tools like Breach and PsyTaLiRo.
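For clarity, the three reported metrics can be computed from the evaluated deviation samples as in the following sketch. The input format, a list of (distance, minimum robustness) pairs, and the function name are assumptions made for illustration; the sample values are hypothetical and are not taken from Table 1.

```python
from typing import Optional, Sequence, Tuple

# (normalized distance to the zero-deviation point, minimum STL robustness)
Sample = Tuple[float, float]
Summary = Tuple[int, Optional[float], Optional[float]]


def summarize_violations(samples: Sequence[Sample]) -> Summary:
    """Return (number of violations, minimum distance, average distance)."""
    distances = [dist for dist, gamma in samples if gamma < 0]
    if not distances:
        return 0, None, None
    return len(distances), min(distances), sum(distances) / len(distances)


if __name__ == "__main__":
    hypothetical = [(0.30, -0.1), (0.10, 0.2), (0.50, -0.4)]
    print(summarize_violations(hypothetical))
```

Only samples with negative minimum robustness count as violations, so a sample that is close to the zero-deviation point but not falsified does not affect the distance columns.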

In addition, to qualitatively exhibit our approach's effectiveness in finding deviations, we visualize the search space landscape for different problems in heat maps. Each heat map is generated by slicing the space (i.e., the estimated domain of system parameters) into a $20 \times 20$ grid and using a CPS falsifier to find the minimum STL robustness value for each grid cell. However, this processing is only done for visualization purposes and is not used in any of the algorithms. This brute-force sampling requires far more resources than our falsification approach. Finally, we draw the deviation samples and violations from our analysis on the heat maps. The final results are illustrated visually in Figure 3.
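The grid evaluation behind such a heat map can be sketched as follows. This is a simplified illustration under stated assumptions: the parameter domain is two-dimensional, each cell is represented by its center point, and `gamma` is a hypothetical callable that stands in for running a CPS falsifier on the system instantiated at that point.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def grid_centers(lo: float, hi: float, n: int) -> List[float]:
    """Centers of n equally sized cells covering the interval [lo, hi]."""
    step = (hi - lo) / n
    return [lo + (i + 0.5) * step for i in range(n)]


def robustness_grid(
    x_bounds: Tuple[float, float],
    y_bounds: Tuple[float, float],
    gamma: Callable[[Point], float],
    n: int = 20,
) -> List[List[float]]:
    """Evaluate gamma at every cell center; rows follow y, columns follow x."""
    xs = grid_centers(x_bounds[0], x_bounds[1], n)
    ys = grid_centers(y_bounds[0], y_bounds[1], n)
    return [[gamma((x, y)) for x in xs] for y in ys]


if __name__ == "__main__":
    # Hypothetical robustness landscape over a unit-square parameter domain.
    grid = robustness_grid((0.0, 1.0), (0.0, 1.0), lambda d: 0.8 - d[0] - d[1])
    violating_cells = sum(value < 0 for row in grid for value in row)
    print(len(grid), len(grid[0]), violating_cells)
```

With $n = 20$, this requires 400 lower-layer falsification runs, which is why such a grid is practical only for visualization.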

Figure 3: Search spaces, deviation samples and violations processed by each algorithm. In each graph, the axes indicate the parameter domains. A red cell indicates a positive STL robustness value and a blue cell a negative value. A grey cross indicates a deviation sample that is not falsified in the given budget; a yellow cross indicates a violation.

Answer to RQ1. As illustrated in the table, the one-layer search fails to find violations in the LunarLander problem, and it cannot represent the type of system parameters we need in the ACC problem (due to the falsification tool implementation). On the other hand, our two-layer search with CMA-ES solves all the problems and finds smaller deviations than the one-layer search in all problems. Moreover, as can be observed in the heat maps, since the distance value is directly appended to the STL robustness value in the one-layer search, it fails to find small deviations that barely violate the property, because such deviations would result in a larger objective value. Thus, it is hard for the one-layer search to converge to the minimum violating deviations. In contrast, our two-layer search can better converge to the boundary between the safe and unsafe regions. However, this also causes the two-layer search to find fewer violations, because it evaluates more samples in the safe region close to the boundary, where violations can be rare.
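A small numeric illustration shows why a combined objective can miss barely violating deviations. It assumes that the one-layer objective is the unweighted sum of the deviation distance and the STL robustness value; the exact combination used inside each tool may differ, and the two samples are hypothetical.

```python
from typing import Optional, Sequence, Tuple

# (distance to the zero-deviation point, minimum STL robustness)
Sample = Tuple[float, float]


def one_layer_objective(sample: Sample) -> float:
    """Assumed combined objective: robustness plus distance (to be minimized)."""
    distance, gamma = sample
    return gamma + distance


def closest_violation(samples: Sequence[Sample]) -> Optional[Sample]:
    """The sample a minimum tolerance falsifier should report."""
    violating = [s for s in samples if s[1] < 0]
    return min(violating, key=lambda s: s[0]) if violating else None


BARELY_VIOLATING: Sample = (0.10, -0.01)    # close to nominal, small violation
STRONGLY_VIOLATING: Sample = (0.40, -0.60)  # far from nominal, large violation

if __name__ == "__main__":
    samples = [BARELY_VIOLATING, STRONGLY_VIOLATING]
    print(min(samples, key=one_layer_objective))  # preferred by the sum
    print(closest_violation(samples))             # minimum violating deviation
```

Under the summed objective, the far but strongly violating sample scores lower than the near, barely violating one, so the search is pulled away from the boundary between the safe and unsafe regions.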

Answer to RQ2. From the table, our two-layer search with CMA-ES+Heuristic finds smaller violating deviations than the original CMA-ES in 4/6 problems. It also finds more violations in 5/6 problems. However, the average distances also increase in 4/6 problems due to more exploration of violations encouraged by our heuristic. Despite that, from the heat maps, our CMA-ES+Heuristic approach can still converge to small violating deviations on the safe and unsafe boundary while also finding more violations. Our heuristic helps in guiding the search and provides additional information to the algorithm when STL robustness is not enough to provide directionality. Concretely, a small similarity value would likely lead to a violation (even when the robustness value is similar) and thus results in more violations found and faster convergence to a small violation.

8 Related Work

There exist similar CPS tolerance notions from a control theory perspective, such as [31, 32]. For example, Saoud et al. [31] present a resilience notion of CPS based on LTL w.r.t. a real-valued disturbance space. Then, they present an optimization-based method to approximate the maximum set of disturbances that maintain a desired LTL property for linear control systems. These notions target traditional controllers with a white-box assumption of systems and controllers, whereas we employ a black-box assumption, which is more practical for complex CPS and NN-based RL controllers.

Falsification of CPS [17] is a well-studied problem in the literature. The goal is to find counterexample trajectories that violate an STL property by mutating the initial states or system inputs. A related application is parameter synthesis [33], which finds a set of system parameters where the system satisfies the property. It can be seen as a dual problem to the falsification problem. Tools like Breach [21] and PSY-TaLiRo [23, 22] support both types of analysis. However, our tolerance falsification problem can be seen as solving these two problems at the same time. Our upper-layer search aims to find system parameters that would lead to a violation of the system specification, and the lower-layer search aims to find initial states or system inputs that lead to a violating trajectory. Although our problem can be reduced to a CPS falsification problem with system parameters, this reduction is less effective at solving our minimum tolerance falsification problem than our two-layer structure, as illustrated by our experimental results.

VerifAI [34, 35] applies an idea similar to ours: it considers abstract features for an ML model that can lead to a property violation of a CPS. Unlike our work, VerifAI assumes a CPS with an ML perception model (such as object detection) connected to a traditional controller, and the abstract features are environmental parameters that would affect the performance of the ML model (e.g., brightness). In other words, they focus on deviations that affect the ML model, whereas our deviation notion is more general and includes any external or internal deviation or sensor error that changes the system dynamics.

Robust RL studies the problem of improving the performance of RL controllers in the presence of uncertainties [2, 3]. A similar research topic is domain randomization [4, 5, 6], which creates various systems with randomized parameters, leading to changed system dynamics, and then trains a controller that works across these systems. However, our work is different in that: (1) we focus on tolerance evaluation whereas they focus more on training; and (2) we focus on system specifications and express them as STL properties, while they rely on rewards, where maximizing the reward does not necessarily guarantee that a given system specification is satisfied.

9 Conclusion

In this paper, we have introduced a specification-based tolerance definition for CPS. This definition yields a new type of analysis problem, called tolerance falsification, where the goal is to find small changes to the system dynamics that result in a violation of a given STL specification. We have also presented a novel optimization-based approach to solve the problem and evaluated its effectiveness on our proposed CPS tolerance analysis benchmark.

Since our analysis framework is extensible, as part of future work, we plan to explore and integrate other types of evaluation functions $\Gamma$ (e.g., evaluation with probabilistic guarantees [19, 20, 18]) and different semantics of STL robustness (e.g., cumulative robustness [29]), and to leverage decomposition of STL for more effective falsification of complex specifications [36]. Moreover, we currently use the $l_2$ norm to compute the deviation distances. In the future, we also plan to explore other distance notions such as the Wasserstein distance [37, 38, 39], which computes the distribution distance between system dynamics.

References

  • [1] J. J. Collins, D. Howard, and J. Leitner, “Quantifying the reality gap in robotic manipulation tasks,” 2019 International Conference on Robotics and Automation (ICRA), pp. 6706–6712, 2018. [Online]. Available: https://api.semanticscholar.org/CorpusID:53208962
  • [2] J. Moos, K. Hansel, H. Abdulsamad, S. Stark, D. Clever, and J. Peters, “Robust reinforcement learning: A review of foundations and recent advances,” Machine Learning and Knowledge Extraction, vol. 4, no. 1, pp. 276–315, 2022. [Online]. Available: https://www.mdpi.com/2504-4990/4/1/13
  • [3] M. Xu, Z. Liu, P. Huang, W. Ding, Z. Cen, B. Li, and D. Zhao, “Trustworthy reinforcement learning against intrinsic vulnerabilities: Robustness, safety, and generalizability,” 2022.
  • [4] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real transfer of robotic control with dynamics randomization,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 3803–3810.
  • [5] F. Sadeghi and S. Levine, “Cad2rl: Real single-image flight without a single real image,” 2017.
  • [6] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” 2017.
  • [7] A. Y. Ng, D. Harada, and S. J. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” in Proceedings of the Sixteenth International Conference on Machine Learning, ser. ICML ’99.   San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1999, p. 278–287.
  • [8] S. Booth, W. B. Knox, J. Shah, S. Niekum, P. Stone, and A. Allievi, “The perils of trial-and-error reward design: Misdesign through overfitting and invalid task specifications,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 5, pp. 5920–5929, Jun. 2023. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/25733
  • [9] A. Donzé and O. Maler, “Robust satisfaction of temporal logic over real-valued signals,” in Formal Modeling and Analysis of Timed Systems, K. Chatterjee and T. A. Henzinger, Eds.   Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 92–106.
  • [10] C. Baier, L. de Alfaro, V. Forejt, and M. Kwiatkowska, Model Checking Probabilistic Systems.   Cham: Springer International Publishing, 2018, pp. 963–999.
  • [11] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 2018.
  • [12] J. García and F. Fernández, “A comprehensive survey on safe reinforcement learning,” J. Mach. Learn. Res., vol. 16, no. 1, p. 1437–1480, jan 2015.
  • [13] S. Gu, L. Yang, Y. Du, G. Chen, F. Walter, J. Wang, Y. Yang, and A. Knoll, “A review of safe reinforcement learning: Methods, theory and applications,” ArXiv, vol. abs/2205.10330, 2022. [Online]. Available: https://api.semanticscholar.org/CorpusID:248965265
  • [14] W. Yu, C. K. Liu, and G. Turk, “Policy transfer with strategy optimization,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=H1g6osRcFQ
  • [15] S. P. Bhattacharyya, H. Chapellat, and L. H. Keel, Robust Control: The Parametric Approach, 1st ed.   USA: Prentice Hall PTR, 1995.
  • [16] A. Weinmann, Uncertain models and robust control.   Springer Science & Business Media, 2012.
  • [17] A. Corso, R. Moss, M. Koren, R. Lee, and M. Kochenderfer, “A survey of algorithms for black-box safety validation of cyber-physical systems,” Journal of Artificial Intelligence Research, vol. 72, pp. 377–428, 2021.
  • [18] C. Fan, X. Qin, Y. Xia, A. Zutshi, and J. Deshmukh, “Statistical verification of autonomous systems using surrogate models and conformal inference,” 2021.
  • [19] G. Pedrielli, T. Khandait, Y. Cao, Q. Thibeault, H. Huang, M. Castillo-Effen, and G. Fainekos, “Part-x: A family of stochastic algorithms for search-based test generation with probabilistic guarantees,” IEEE Transactions on Automation Science and Engineering, pp. 1–22, 2023.
  • [20] L. Lindemann, N. Matni, and G. J. Pappas, “Stl robustness risk over discrete-time stochastic processes,” in 2021 60th IEEE Conference on Decision and Control (CDC), 2021, pp. 1329–1335.
  • [21] A. Donzé, “Breach, a toolbox for verification and parameter synthesis of hybrid systems,” in Computer Aided Verification, T. Touili, B. Cook, and P. Jackson, Eds.   Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 167–170.
  • [22] Y. Annpureddy, C. Liu, G. Fainekos, and S. Sankaranarayanan, “S-taliro: A tool for temporal logic falsification for hybrid systems,” in Tools and Algorithms for the Construction and Analysis of Systems, P. A. Abdulla and K. R. M. Leino, Eds.   Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 254–257.
  • [23] Q. Thibeault, J. Anderson, A. Chandratre, G. Pedrielli, and G. Fainekos, “Psy-taliro: A python toolbox for search-based test generation for cyber-physical systems,” 2021.
  • [24] N. Hansen and A. Ostermeier, “Adapting arbitrary normal mutation distributions in evolution strategies: the covariance matrix adaptation,” in Proceedings of IEEE International Conference on Evolutionary Computation, 1996, pp. 312–317.
  • [25] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: Nsga-ii,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, 2002.
  • [26] M. Schlüter, J. A. Egea, and J. R. Banga, “Extended ant colony optimization for non-convex mixed integer nonlinear programming,” Computers & Operations Research, vol. 36, no. 7, pp. 2217–2229, 2009. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0305054808001524
  • [27] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” 2016.
  • [28] E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” http://pybullet.org, 2016.
  • [29] I. Haghighi, N. Mehdipour, E. Bartocci, and C. Belta, “Control from signal temporal logic specifications with smooth cumulative quantitative semantics,” in 2019 IEEE 58th Conference on Decision and Control (CDC), 2019, pp. 4361–4366.
  • [30] N. Mehdipour, C.-I. Vasile, and C. Belta, “Arithmetic-geometric mean robustness for control from signal temporal logic specifications,” in 2019 American Control Conference (ACC), 2019, pp. 1690–1695.
  • [31] A. Saoud, P. Jagtap, and S. Soudjani, “Temporal logic resilience for cyber-physical systems,” in 2023 62nd IEEE Conference on Decision and Control (CDC), 2023, pp. 2066–2071.
  • [32] G. E. Fainekos and G. J. Pappas, “MTL robust testing and verification for LPV systems,” in 2009 American Control Conference, 2009, pp. 3748–3753.
  • [33] E. Bartocci, J. Deshmukh, A. Donzé, G. Fainekos, O. Maler, D. Ničković, and S. Sankaranarayanan, “Specification-based monitoring of cyber-physical systems: a survey on theory, tools and applications,” Lectures on Runtime Verification: Introductory and Advanced Topics, pp. 135–175, 2018.
  • [34] T. Dreossi, D. J. Fremont, S. Ghosh, E. Kim, H. Ravanbakhsh, M. Vazquez-Chanlatte, and S. A. Seshia, “VerifAI: A toolkit for the formal design and analysis of artificial intelligence-based systems,” in Computer Aided Verification, I. Dillig and S. Tasiran, Eds.   Cham: Springer International Publishing, 2019, pp. 432–442.
  • [35] T. Dreossi, A. Donzé, and S. A. Seshia, “Compositional falsification of cyber-physical systems with machine learning components,” Journal of Automated Reasoning, vol. 63, pp. 1031–1053, 2019.
  • [36] P. Kapoor, E. Kang, and R. Meira-Goes, “Safe planning through incremental decomposition of signal temporal logic specifications,” arXiv preprint arXiv:2403.10554, 2024.
  • [37] E. Lecarpentier and E. Rachelson, “Non-stationary Markov decision processes, a worst-case approach using model-based reinforcement learning,” in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., vol. 32.   Curran Associates, Inc., 2019.
  • [38] M. A. Abdullah, H. Ren, H. B. Ammar, V. Milenkovic, R. Luo, M. Zhang, and J. Wang, “Wasserstein robust reinforcement learning,” 2019.
  • [39] I. Yang, “A convex optimization approach to distributionally robust Markov decision processes with Wasserstein distance,” IEEE Control Systems Letters, vol. 1, no. 1, pp. 164–169, 2017.
  • [40] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” 2013.
  • [41] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” 2017.
  • [42] S. Gronauer, “Bullet-safety-gym: A framework for constrained reinforcement learning,” mediaTUM, Tech. Rep., 2022.
  • [43] Z. Liu, Z. Guo, Z. Cen, H. Zhang, J. Tan, B. Li, and D. Zhao, “On the robustness of safe reinforcement learning under observational perturbations,” 2023.
  • [44] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” 2018.
  • [45] J. Song, D. Lyu, Z. Zhang, Z. Wang, T. Zhang, and L. Ma, “When cyber-physical systems meet AI: A benchmark, an evaluation, and a way forward,” in Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice, ser. ICSE-SEIP ’22.   New York, NY, USA: Association for Computing Machinery, 2022, pp. 343–352. [Online]. Available: https://doi.org/10.1145/3510457.3513049
  • [46] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80.   PMLR, 10–15 Jul 2018, pp. 1587–1596. [Online]. Available: https://proceedings.mlr.press/v80/fujimoto18a.html

Appendix

Benchmark Problem Descriptions

The following sections describe the systems in our CPS tolerance evaluation benchmark.

9.0.1 Cart-Pole

The Cart-Pole problem is described in Section 3. In our experiments, we synthesize a PID controller and a DQN [40] controller for it, and we define four deviation dimensions: the Mass of the cart, the Mass of the pole, the Length of the pole, and the Force applied when pushing the cart.
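To make the four deviation dimensions concrete, the following minimal sketch performs one explicit Euler step of the classic cart-pole dynamics with the cart mass, pole mass, pole length, and push force exposed as parameters. The equations and the nominal values follow the widely used Gym formulation, which is an assumption here; the names `CartPoleParams` and `step` are hypothetical and are not taken from the benchmark implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class CartPoleParams:
    """Deviation dimensions; defaults are assumed nominal values."""
    masscart: float = 1.0    # Mass of the cart (kg)
    masspole: float = 0.1    # Mass of the pole (kg)
    length: float = 0.5      # half-length of the pole (m)
    force_mag: float = 10.0  # Force applied when pushing the cart (N)


GRAVITY = 9.8  # m/s^2
TAU = 0.02     # integration step (s)


def step(state, push_right, p):
    """Advance (x, x_dot, theta, theta_dot) by one Euler step."""
    x, x_dot, theta, theta_dot = state
    force = p.force_mag if push_right else -p.force_mag
    total_mass = p.masscart + p.masspole
    polemass_length = p.masspole * p.length
    sin_t, cos_t = math.sin(theta), math.cos(theta)

    temp = (force + polemass_length * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        p.length * (4.0 / 3.0 - p.masspole * cos_t ** 2 / total_mass)
    )
    x_acc = temp - polemass_length * theta_acc * cos_t / total_mass

    return (
        x + TAU * x_dot,
        x_dot + TAU * x_acc,
        theta + TAU * theta_dot,
        theta_dot + TAU * theta_acc,
    )


# The same push accelerates a heavier cart less than the nominal one.
nominal = step((0.0, 0.0, 0.0, 0.0), True, CartPoleParams())
heavy = step((0.0, 0.0, 0.0, 0.0), True, CartPoleParams(masscart=2.0))
```

A deviation in this sketch is simply a `CartPoleParams` instance that differs from the defaults, so a controller tuned for the nominal values can be replayed against any deviated instance.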

9.0.2 Lunar-Lander

In the Lunar-Lander system (https://www.gymlibrary.dev/environments/box2d/lunar_lander/), the goal is to control an aircraft so that it lands safely on the surface of a planet, within the flagged area. The aircraft can fire its main engine (on the bottom) and its left and right engines to control its pose. The safety property requires that 1) the rotation of the aircraft stays within a bound $\theta$ (e.g., the aircraft is never parallel to the ground or upside down), and 2) the aircraft stays close to the landing target as its height decreases. In our experiments, we develop an LQR controller and a PPO [41] controller for it, and we define three deviation dimensions: the Wind, which can change the x-y position of the aircraft; the Turbulence (rotational wind), which can change its rotation; and the Gravity.

9.0.3 Car-Circle

In the Car-Circle system, the task is to control a car to move along the circumference of the blue circle [42]. There are “walls” on the two sides, and the safety property requires that the car not move across them. In our experiments, we use a PPO variant from [43] (which is more robust than standard PPO in the context of robust RL), and we define three types of deviations: the Force that moves the car, and the Speed Multiplier and the Steering Multiplier, which affect the sensitivity of the forward velocity and the angular velocity response to the force, respectively.

9.0.4 Car-Run

In the Car-Run system, the task is to control a car to move along a track without hitting the walls on the two sides [42]. That is, the safety property requires that the car not move across the walls. In our experiments, as for the Car-Circle system, we use a PPO variant from [43] and consider the Speed Multiplier and the Steering Multiplier deviation types.

9.0.5 Adaptive Cruise Control

A vehicle equipped with adaptive cruise control (ACC) (https://www.mathworks.com/help/mpc/ug/adaptive-cruise-control-using-model-predictive-controller.html) has a sensor that measures the distance to the preceding vehicle in the same lane. The control goal is to 1) regulate the speed of the vehicle so that it reaches the driver-set velocity, and 2) maintain a safe distance to the leading vehicle. The safety property is therefore that the relative distance between the ego vehicle and the leading vehicle should always be greater than a safe distance. In our experiments, we adopt an MPC controller from Matlab and a SAC [44] controller from Song et al. [45]. We define three types of deviations: the Mass of the vehicle, and the minimum and maximum acceleration of the leading vehicle. Changing the latter two can mimic a more aggressive or more conservative leading vehicle that changes its speed more abruptly or more slowly.
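The ACC safety property has the form "always (relative distance > safe distance)". Under the standard quantitative semantics of STL, its robustness over a finite sampled trace is the minimum margin between the relative distance and the safe distance. The sketch below illustrates this computation; the function name and the sample traces are hypothetical, and the safe distance of 10 m is an assumed value rather than the one used in our experiments.

```python
def always_greater_robustness(rel_distance, safe_distance):
    """Robustness of G (rel_distance > safe_distance) on a finite trace.

    A positive value means the property holds with that margin;
    a negative value means it is violated at some sample.
    """
    if not rel_distance:
        raise ValueError("trace must contain at least one sample")
    return min(d - safe_distance for d in rel_distance)


# Hypothetical relative-distance traces (metres), one sample per time step.
safe_trace = [30.0, 25.0, 18.0, 22.0]
unsafe_trace = [30.0, 12.0, 8.0, 15.0]

margin_safe = always_greater_robustness(safe_trace, 10.0)      # 8.0
margin_unsafe = always_greater_robustness(unsafe_trace, 10.0)  # -2.0
```

A falsifier only needs the sign of this value to decide whether a deviation causes a violation, while its magnitude indicates how close a safe trace comes to violating the property.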

9.0.6 Water Tank

A water tank (WTK) system (https://www.mathworks.com/help/slcontrol/gs/watertank-simulink-model.html) is a container with a controller that regulates the inflow and outflow of water; such systems are widely used in domains such as the chemical industry. The safety property requires that the error between the actual water level and the desired water level always stay below a threshold. We adopt a PID controller from Matlab and a TD3 [46] controller from [45]. We define two types of deviations, the water flow rate into the tank and the water flow rate out of the tank, which affect how fast the water volume changes.
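To illustrate how the two flow-rate deviations affect the plant, the following sketch integrates a simple open-loop tank model of the form A dH/dt = b V - a sqrt(H), where b scales the inflow and a scales the outflow. The model form and the nominal values (a = 2, b = 5, A = 20) are assumptions recalled from the cited Simulink example, and the function name and the constant pump input are hypothetical; the benchmark itself runs the tank in closed loop with a controller.

```python
import math


def simulate_water_level(pump_voltage, inflow_coeff=5.0, outflow_coeff=2.0,
                         tank_area=20.0, h0=1.0, dt=0.1, steps=10000):
    """Euler simulation of A * dH/dt = b * V - a * sqrt(H) with constant V.

    inflow_coeff (b) and outflow_coeff (a) play the role of the two
    deviation dimensions; all default values are assumed nominal values.
    """
    h = h0
    levels = [h]
    for _ in range(steps):
        dh = (inflow_coeff * pump_voltage
              - outflow_coeff * math.sqrt(max(h, 0.0))) / tank_area
        h = max(h + dt * dh, 0.0)  # the level cannot become negative
        levels.append(h)
    return levels


# With a constant input the level settles at (b * V / a) ** 2.
nominal = simulate_water_level(1.0)                        # settles near 6.25
more_outflow = simulate_water_level(1.0, outflow_coeff=2.5)  # settles near 4.0
```

In this sketch, a 25% increase in the outflow coefficient moves the open-loop equilibrium from about 6.25 to about 4.0, which suggests why a controller tuned for the nominal flow rates may fail to keep the level error below the threshold under deviation.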