Corresponding author: Rachel Luo, Stanford University, Stanford, CA 94305, U.S.A.

Sample-Efficient Safety Assurances using Conformal Prediction

Rachel Luo¹, Shengjia Zhao¹, Jonathan Kuck², Boris Ivanovic¹, Silvio Savarese¹, Edward Schmerling¹, and Marco Pavone¹
¹ Stanford University, Stanford, CA 94305, USA
² Dexterity, Inc., Redwood City, CA 94063, USA
[email protected]

Abstract

When deploying machine learning models in high-stakes robotics applications, the ability to detect unsafe situations is crucial. Early warning systems can provide alerts when an unsafe situation is imminent (in the absence of corrective action). To reliably improve safety, these warning systems should have a provable false negative rate; i.e. of the situations that are unsafe, fewer than an $\epsilon$ fraction will occur without an alert. In this work, we present a framework that combines a statistical inference technique known as conformal prediction with a simulator of robot/environment dynamics, in order to tune warning systems to provably achieve an $\epsilon$ false negative rate using as few as $1/\epsilon$ data points. We apply our framework to a driver warning system and a robotic grasping application, and empirically demonstrate the guaranteed false negative rate while also observing a low false detection (positive) rate.

Keywords: Safety assurance, Conformal prediction, Statistical inference

1 Introduction

Monitoring a system for faults, or detecting whether unsafe situations will occur, is a key problem for high-stakes robotics applications, and indeed fault detection has long been the state of practice for building reliable systems (Visinsky et al., 1994a, b, 1995; Vemuri et al., 1998; Khalastchi and Kalech, 2018; Muradore and Fiorini, 2011; Crestani et al., 2015; Ding, 2013; Patton and Chen, 1997; Harirchi and Ozay, 2015, 2018). With the advent of learning-enabled components in robotic systems, robots are performing increasingly complex safety-critical tasks, so reliability has become increasingly important. For instance, in an autonomous driving setting, errors in perception or planning could lead to a collision. In a warehouse robotics setting, robots on the warehouse floor work alongside humans, and failing to recognize faults in learned systems could compromise safety or even lead to injuries. At the same time, it is less clear how to ensure reliability for these learned systems. These systems are complex, so safety cannot be guaranteed from first principles; empirical, data-driven methods are needed.

In this work, we present a sample-efficient and principled method for detecting unsafe situations based on the statistical inference technique of conformal prediction (Vovk et al., 2005). Our method provides provable false negative rates for warning systems (i.e. among the situations in which an alert should be issued, fewer than an $\epsilon$ fraction occur without an alert), while achieving low false positive rates (few unnecessary alerts are issued).

For example, in a driver assistance system, when an unsafe situation (e.g. another car getting too close) is imminent, our method will issue a warning the vast majority of the time (i.e. at least $1-\epsilon$ of the time). In a warehouse setting with a robotic pick-and-place system, when the system will fail to grasp and transport an object, our method will issue an alert at least $1-\epsilon$ of the time. As a running example in this paper, we use our method to design an alert system to warn a human operator of impending danger in a driving application (illustrated in Figure 1).

Figure 1: We design a warning system that achieves a provable false negative rate sample efficiently. Among the situations that are dangerous (i.e. lead to an unsafe future situation in the absence of corrective action), fewer than an $\epsilon$ fraction occur without an alert.

1.1 Related Work

Traditional fault detection techniques include hardware redundancy, signal processing, and plausibility tests (Ding, 2013; Visinsky et al., 1994a, b, 1995; Vemuri et al., 1998). However, hardware redundancy requires extra components, signal processing works well only for processes in steady state, and plausibility tests do not catch faults that lead to a physically plausible system. Additionally, these methods typically lack performance guarantees. Model-based fault detection techniques (Ding, 2013; Patton and Chen, 1997; Harirchi and Ozay, 2015, 2018) involve using a model of the system to determine whether a fault has occurred; they assume that users have a very accurate model of the system dynamics, which is difficult to obtain in practice.

Another common approach for detecting unsafe states employs supervised learning to train a classifier model for labeling states as unsafe, and then the classifier hyperparameters are adjusted until empirically the false negative rate is low. In practice this is typically accomplished by plotting a receiver operating characteristic (ROC) curve and tuning the classification threshold to achieve low false negative rate. However, this approach requires training a new classification model, and provides no performance guarantees.

To guarantee the false negative rate of a learned warning system, the standard statistical learning framework could be used under standard i.i.d. assumptions (von Luxburg and Schölkopf, 2011). A practitioner could collect additional data and use a validation dataset to provably certify the false negative rate. However, the key problem is data efficiency, because collecting data for unsafe situations can be very expensive (Foody, 2009; von Luxburg and Schölkopf, 2011; Calafiore and Campi, 2006).

1.2 Contributions

Main Question

Can we tune a warning system and guarantee a low false negative rate with only a handful of data points? For example, with only 30 data samples of dangerous situations, can we tune a warning system to have a provable 5% false negative rate? This problem is easy if we allow trivial systems that always issue a warning, but such systems are not practically useful. If we restrict our attention to non-trivial systems, this problem is seemingly impossible because even if a fixed warning system successfully identifies all 30 dangerous situations, due to statistical fluctuations, we cannot prove that its false negative rate is less than 5% (with high confidence). If proving that a fixed predictor achieves safety is difficult, tuning a predictor to provably achieve safety seems only more challenging.
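To make the statistical-fluctuation argument concrete, the following minimal sketch considers a hypothetical fixed warning system whose true false negative rate is exactly 5% and computes the chance that it nevertheless detects all 30 dangerous situations. The function name and the independence assumption are illustrative and not part of the paper.

```python
def prob_all_detected(fnr: float, n: int) -> float:
    """Probability that a system with false negative rate `fnr` raises an
    alert on all `n` unsafe samples, assuming the samples are independent."""
    return (1.0 - fnr) ** n


# A system that only just misses the 5% target still has a clean record
# on 30 samples roughly one time in five.
p = prob_all_detected(0.05, 30)
print(f"{p:.3f}")  # 0.215
```

Because a system sitting exactly at the 5% boundary passes this check with probability of about 0.21, a perfect record on 30 samples cannot certify a sub-5% false negative rate at, say, 95% confidence.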

Our Contribution

We answer our main question affirmatively. We adapt a statistical inference framework known as conformal prediction to a robotics setting in order to tune systems to achieve provable safety guarantees (e.g. 5% false negative rate) with extremely limited data (e.g. 30 samples). We only require a single assumption: the training samples are exchangeable with each test sample, i.e. for each test sample, if we permute the concatenated sequence of the training samples and the test sample, there is no reason to believe that any permutation is more or less likely to occur. This is a weaker assumption than the i.i.d. assumption typically used in the statistical learning framework.

The assumption is reasonable in practice, even in situations in which the statistical inference assumptions fail (e.g. situations with temporal correlations between different test samples). In a driving scenario, for instance, the training dataset could include scene snippets sampled from different scenes, and these will be i.i.d. At test time, there may be one scene with several snippets. These snippets are obviously not independent; however, they are individually exchangeable with the training dataset.

While this answer seems too good to be true, the key insight here is that we provide a type of guarantee that is different from standard statistical learning guarantees. Consider a sequence of test samples $Z_{1},\cdots,Z_{N}$ , and event indicators $F_{1},\cdots,F_{N}$ for whether our warning system fails on each test sample.

• Statistical learning guarantee. In the statistical learning framework, we assume that the test samples $Z_{1},\cdots,Z_{N}$ are i.i.d., so the failure events $F_{1},\cdots,F_{N}$ are also i.i.d.; we guarantee the failure probability for a sequence of i.i.d. failure events.

• Conformal prediction guarantee. In the conformal prediction framework, the test samples are not necessarily i.i.d., so the failure events $F_{1},\cdots,F_{N}$ can be correlated; we guarantee the marginal failure probability for each failure event. In other words, we know that each test sample has a low probability of failure (i.e. $F_{n}=1$ with low probability), but the failures could be correlated. For example, conditioning on $F_{n}=1$ might increase or decrease the probability that $F_{n+1}=1$ (while in the i.i.d. case, $F_{n}$ and $F_{n+1}$ are independent events).

The usefulness of the conformal guarantee depends on the intended application. Consider the driver alert system example: for individual drivers, collisions are rare and most drivers will not encounter more than one. Hence, there is little reason to worry about whether the warning failures are correlated between collisions. In other words, the conformal guarantee can convey confidence to individual users who rarely encounter multiple failures.

On the other hand, the conformal guarantee may convey less confidence to a company with a large fleet of vehicles. For example, if $F_{n}=1$ increases the probability that $F_{n+1}=1$ , then it is possible to have multiple simultaneous failures. However, this is not a limitation of our method, but rather an unavoidable consequence of the weaker (not i.i.d.) assumptions: if the test data is correlated (which we have no control over), then failure events of a warning system are inherently correlated. The weaker assumption is usually necessary because most robotics applications are deployed in time series or sequential decision making setups, so data from nearby time steps are correlated and not i.i.d. Since standard statistical learning guarantees are not applicable due to violation of the i.i.d. assumption, having some (conformal) guarantee is better than none.

Furthermore, we will show empirically in Section 4 that failures are not highly correlated on two real-world driving datasets. Therefore, despite the lack of formal guarantees, there is strong empirical evidence suggesting that simultaneous failures do not occur in practice.

Thus, our contribution is four-fold:

1. We introduce a new notion of safety guarantee that is satisfactory for many use cases and has extremely good sample efficiency.

2. We show how to leverage the statistical inference tool of conformal prediction for robotics applications.

3. We instantiate a framework for applying conformal prediction to robotic safety.

4. We validate our framework experimentally on both a driver alert safety system and a robotic grasping system, showing that the conformal guarantees hold in practice, without issuing too many false positive alerts (e.g. less than 1% for many setups).

A preliminary version of this article was presented at the 2022 Workshop on the Algorithmic Foundations of Robotics (Luo et al., 2022). In this revised and extended version, we additionally contribute: (1) additional exposition on the surrogate safety score in our proposed framework, (2) proofs for the propositions in Section 3, (3) additional exposition on the details of our experimental setups, (4) experimental results demonstrating the tradeoff between $\epsilon$ and the false positive rate (FPR) when there are few samples, (5) experimental results demonstrating empirically that failures are not highly correlated in a real-world driving setting, and (6) experimental results demonstrating that worse surrogate safety scores lead to a performance drop in terms of the false positive rate (but no performance drop in terms of the false negative rate).

1.3 Organization

The rest of this paper is organized as follows. In Section 2, we review conformal prediction. In Section 3, we describe our problem setup, introduce our framework, and demonstrate that specific choices for elements of our framework lead to instantiations such as tuning an ROC curve threshold to limit false negatives (though we enrich this classic method with new guarantees). We then explain the differences between the conformal prediction guarantees and the statistical learning guarantees, and discuss when our guarantees should be applied. Finally, in Sections 4 and 5, we evaluate our framework on a driver alert safety system and on a robotic grasping system.

2 Overview of Conformal Prediction

This section provides an overview of conformal prediction, the general framework that we adapt for robotics safety. It may be skipped without breaking the flow of the paper.

Consider a prediction problem where the input feature is denoted by $X$ and the label is denoted by $Z$ . Conformal prediction (Shafer and Vovk, 2008) is a class of methods that can produce prediction sets (i.e. a set of labels), such that the true label belongs to the predicted set with high probability. In its standard form, conformal prediction requires two components: a sequence of validation data $(X_{1},Z_{1}),\cdots,(X_{T},Z_{T})$ and a non-conformity score $\psi$ , which is any function from the input feature $X$ and the label $Z$ to a real number. Intuitively, the non-conformity score should measure the “unusualness” of the label $Z$ when the input feature is $X$ . An example non-conformity score is $\psi(X,Z)=|h(X)-Z|$ where $h$ is some fixed prediction function — intuitively, $Z$ is “unusual” if the prediction function has large error.

The conformal prediction algorithm computes the non-conformity score for all samples in the validation set. Given a new test example with input feature ${\hat{X}}$, the conformal prediction algorithm then “tries” all possible labels $z$, and measures the non-conformity score $\psi({\hat{X}},z)$. A label is rejected if the computed non-conformity score is greater than a $1-\epsilon$ fraction of the non-conformity scores in the validation set. Any label that is not rejected is included in the prediction set. Intuitively, the true label is unlikely to have a non-conformity score higher than a $1-\epsilon$ fraction of the validation samples; hence the true label is unlikely to get rejected.

If the validation data and the new test data point $({\hat{X}},{\hat{Z}})$ are exchangeable, i.e. every permutation of $(X_{1},Z_{1}),\cdots,(X_{T},Z_{T}),({\hat{X}},{\hat{Z}})$ is equally likely to be observed, then conformal prediction has very strong validity guarantees: the true label will be within the prediction set with probability $1-\epsilon\pm 1/(T+1)$. We note that this guarantee holds regardless of the non-conformity score $\psi$.
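The procedure described above can be sketched for a finite set of candidate labels as follows. This is a minimal illustration of one common variant (often called split conformal prediction, with a finite-sample-corrected quantile), not necessarily the exact variant analyzed in the references; the function name, argument names, and example numbers are all assumptions made for illustration.

```python
import math


def conformal_prediction_set(val_scores, candidate_scores, epsilon):
    """Return the labels that are not rejected.

    val_scores: non-conformity scores psi(X_t, Z_t) on the validation data.
    candidate_scores: dict mapping each candidate label z to psi(X_hat, z).
    """
    T = len(val_scores)
    # Rank of the validation score used as the rejection threshold.
    k = math.ceil((1 - epsilon) * (T + 1))
    if k > T:
        # Too little validation data to reject anything at this level.
        return set(candidate_scores)
    threshold = sorted(val_scores)[k - 1]
    return {z for z, s in candidate_scores.items() if s <= threshold}


# Example with psi(X, z) = |h(X) - z| and a hypothetical prediction h(X_hat) = 2.0.
val_scores = [0.1, 0.5, 0.3, 0.9, 0.2, 0.7, 0.4]
candidates = {z: abs(2.0 - z) for z in [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]}
print(sorted(conformal_prediction_set(val_scores, candidates, epsilon=0.25)))
# [1.5, 2.0, 2.5]
```

With seven validation scores and $\epsilon=0.25$, the threshold is the sixth smallest validation score (0.7), so only labels within 0.7 of the prediction are kept.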

There are many extensions of conformal prediction, and the most relevant extension to our safety application is Mondrian conformal prediction (Vovk et al., 2003, 2005), which partitions the input data into several categories such that each data point belongs to exactly one category, and guarantees validity separately for each category. Our work is based on Mondrian conformal prediction; because we wish to limit the false negative rate in warning systems, we need class-conditional validity for samples in the “unsafe” class.

Works that apply conformal prediction to classification models in learned systems include Angelopoulos et al. (2020), Ghosh et al. (2023), and Angelopoulos et al. (2022). Angelopoulos et al. (2020) and Ghosh et al. (2023) apply conformal prediction to image classifiers to obtain a predictive set containing the true label with a user-specified probability. Angelopoulos et al. (2022) studies conformal prediction in the context of recommender systems, and uses it to produce a set of user recommendations.

Works that apply conformal prediction to robotics settings include Chen et al. (2020), Cai and Koutsoukos (2020), Nouretdinov et al. (2011), and Gammerman et al. (2008). Chen et al. (2020) uses conformal prediction to predict a set of possible future motion trajectories from a set of 17 basis trajectories; Cai and Koutsoukos (2020) uses some ideas from conformal prediction for detecting out-of-distribution samples in cyber-physical systems; and Nouretdinov et al. (2011) and Gammerman et al. (2008) use conformal prediction for medical diagnosis. However, these works target very different problems, while we consider the problem of warning systems and provide a general framework for using conformal prediction in a variety of robotics applications.

3 Conformal Prediction Framework for Robotics Applications

3.1 Problem Setup

We consider a model-based planning application where we have some existing simulator or model, and given the current observations (denoted by random variable $X$ ), the simulator or model predicts the future states of the system (denoted by $Y$ ) in the absence of a warning. For instance, many applications have off-the-shelf simulators: an autonomous driving software might simulate the future trajectories of all traffic participants (up to some time horizon), or an aircraft control software might forward simulate the dynamics of the aircraft. We will use the random variable $Z$ to denote the true unknown future states of the system in the absence of a warning, e.g. the true future trajectories of traffic participants, or the true future dynamics of an aircraft.

In our setup, depending on the model or simulator available, $Y$ could have the same type as $Z$ (e.g. both $Y$ and $Z$ are random variables that represent the future trajectories of traffic participants), or $Y$ could have a different type from $Z$ (e.g. $Y$ might represent some but not all aspects about the future, such as the direction of movement for traffic participants, or the distance from collision).

Figure 2 shows a simplified illustration of our running driver alert system example. In this scene, there is an ego-agent (shown in blue), and an external agent (shown in red), whose position we would like to predict. Here, $X$ represents the current locations of the agents in the scene (i.e. the location of the red car at the bottom of the figure), $Y$ represents the predicted future locations (in this illustration our model predicts that the red car will move up and to the left), and $Z$ represents the true unknown future locations (perhaps the red car actually moves up and to the right).

3.1.1 Assessing Safety

We assume that if we know the true future state of the system $Z$ , we can assess whether it is safe or not. Specifically, there exists some safety score denoted by $f(Z)$ ; we specify some threshold (denoted by $f_{0}$ ), and wish to be alerted if the safety score drops below this threshold (i.e. if $f(Z)<f_{0}$ ). In other words, a situation is defined to be unsafe if the safety score $f(Z)$ is too low. Most applications have natural safety scores. For instance, an autonomous driving safety score $f$ could be the distance to or time from collision; an aircraft control safety score could be the (negative) absolute difference between the orientation of the aircraft and its ideal orientation.

In addition, we assume that the user provides a surrogate safety score $g:Y\mapsto\mathbb{R}$ that maps from the simulator prediction to a “safety score,” where a higher score indicates “safe” and a lower score indicates “unsafe”. Ideally the surrogate safety score $g(Y)$ should be highly correlated with the true safety score $f(Z)$ , but technically $g$ can be any function. None of our technical results depend on any assumptions about $g$ ; however, the choice of $g$ affects the empirical performance in terms of false positive rate (i.e. how often our warning system issues unnecessary alerts). When $Y$ and $Z$ have the same type, we can simply choose $g:=f$ ; when $Y$ and $Z$ have different types we need to choose $g$ on a case-by-case basis.

For example, in the driver assistance system in Figure 2, the safety score $f(Z)$ could be the distance to the nearest car (shown by the blue line in the figure). Since the simulator $Y$ can output predicted distances to other cars, the surrogate safety score $g(Y)$ could also be the distance to the nearest car (shown by the orange line in the figure). The difference between $f$ and $g$ is that the input to $g$ is the predicted future state of the system $Y$ (output by the simulator), rather than the true unknown future state of the system $Z$ . In this example, $f_{0}$ is the safety threshold shown by the red boundary; so if the red and blue cars are too close together, then the situation is considered unsafe.

3.1.2 Warning Function

We wish to design a warning function (denoted as a function $w(Y)$ ) that given the simulation or model output $Y$ , decides to issue a warning ( $w(Y)=1$ ) or not ( $w(Y)=0$ ) (see Figure 3). Note that $Y$ depends on previous states, so the warning function implicitly depends on previous observations through $Y$ . Formally we define “safety” as the following requirement:

Definition 1.

For some $0<\epsilon<1$ , we say that the warning system $w$ is $\epsilon$ -safe (with respect to $Y,Z$ , $f$ , and $f_{0}$ ) if

$$\Pr[w(Y)=1\mid f(Z)<f_{0}]\geq 1-\epsilon.$$

In words, whenever the true future safety score $f(Z)$ is below $f_{0}$ , the warning system should issue a warning ( $w(Y)=1$ ) with at least $1-\epsilon$ probability (see Figure 4). Another way to think of this is that the false negative rate is at most $\epsilon$ . The main difficulty here is that the warning function $w$ can depend on only the simulated future $Y$ rather than the true future $Z$ (which is not yet observed when the warning is issued), and the simulation might not come with any performance guarantees.

A trivial warning system that always issues a warning (i.e. $w_{\mathrm{trivial}}(Y)\equiv 1$ ) is always $\epsilon$ -safe for any $\epsilon>0$ . However, such a warning system is not useful. A useful warning system should issue as few warnings as possible when safe. Therefore, we should also consider its false positive rate

$$\mathrm{FPR}(w)=\Pr[w(Y)=1\mid f(Z)\geq f_{0}].$$

The false positive rate is of lower priority for safety because issuing an unnecessary warning might only be an inconvenience, while failing to issue a warning when the situation is unsafe can lead to catastrophic outcomes. In summary, our goal is to design a warning function $w(\cdot)$ such that:

Goal: Provably achieve $\epsilon$ -safety for small $\epsilon$ (e.g. $0.02$ ), while achieving low false positive rate (FPR).
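As a concrete reading of Definition 1 and of the false positive rate, the sketch below computes the empirical counterparts of both quantities from logged warning decisions and true safety scores. The function name and the toy data are hypothetical; note that an empirical rate on a finite log is only an estimate and is not by itself the provable guarantee sought here.

```python
def empirical_rates(warnings, true_scores, f0):
    """Empirical false negative and false positive rates.

    warnings: list of decisions w(Y) in {0, 1}.
    true_scores: list of true safety scores f(Z), aligned with `warnings`.
    f0: safety threshold; a sample is unsafe when f(Z) < f0.
    """
    unsafe = [w for w, s in zip(warnings, true_scores) if s < f0]
    safe = [w for w, s in zip(warnings, true_scores) if s >= f0]
    # FNR: fraction of unsafe samples with no warning.
    fnr = unsafe.count(0) / len(unsafe) if unsafe else float("nan")
    # FPR: fraction of safe samples with a warning.
    fpr = safe.count(1) / len(safe) if safe else float("nan")
    return fnr, fpr


# Toy log: three unsafe samples (scores below 2.0) and three safe ones.
fnr, fpr = empirical_rates([1, 1, 0, 0, 1, 0], [0.5, 1.0, 1.5, 3.0, 4.0, 5.0], f0=2.0)
print(round(fnr, 3), round(fpr, 3))  # 0.333 0.333
```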

3.1.3 Examples

A few examples that illustrate this problem setup are as follows:

1. In a driver alert system, users may want an assurance that among the instances in which the driver is in a dangerous situation, the system will issue a warning the vast majority of the time. The safety score in this case could be the time to collision (TTC), or the nearest distance from another car.

2. In a multi-arm robot collaboration system, users may want an assurance that among the instances in which the robot arms may collide, the system will issue a warning the majority of the time. The safety score could be the nearest distance to another robot arm.

3. In a warehouse robotic box-stacking system, users may want an assurance that among the instances in which the boxes will topple, the system will issue a warning the majority of the time. The safety score could be the probability of a stable stack.

4. In a coffee shop with a robotic barista, users may want an assurance that among the instances in which the robot may spill hot coffee, the system will issue a warning the majority of the time. The safety score could be the probability of a successful pour.

Examples 3 and 4 can be thought of as ROC curve threshold tuning. If the model used is a binary classifier that predicts whether there is a stable stack, we can use the predicted probability of a “safe” outcome as $g$ . Note that in this special case, our method also tunes the threshold, but adds guarantees on the false negative rate and practical guidelines for sample complexity.

3.2 Analysis of the Trade-off Between the FNR and FPR

In this section, we analyze the fundamental trade-off between the false negative rate (FNR) and the false positive rate (FPR) of a warning system. For example, a trivial system that always issues a warning will have a 0% FNR but 100% FPR. Conversely, a system that never issues a warning will have a 0% FPR but 100% FNR. This suggests a trade-off between the achievable FNR and FPR.

3.2.1 Infinite Validation Data Regime

Even with infinite validation data, we may not be able to achieve both perfect FNR and perfect FPR because of inherent limitations of the safety score. For instance, at one extreme, if the safe and unsafe examples have identical safety score distributions, then there is no way to distinguish them according to Definition 1. At the other extreme, if the safe and unsafe examples have disjoint safety score distributions, then we can distinguish them perfectly (i.e. achieve 0% FPR and 0% FNR). A typical real world scenario will likely fall somewhere in between the two extremes, as illustrated in Figure 5. The key quantity is the amount of overlap between the safety score distribution for safe vs. unsafe examples, which will dictate the optimal achievable trade-off between the FPR and the FNR.
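The role of the overlap can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that the scores of unsafe and safe examples follow known Gaussian distributions and that a warning is issued whenever the score falls below a cutoff; it then reports the FPR incurred when the cutoff is set to achieve an FNR of exactly $\epsilon$. The Gaussian model and all names are assumptions, not part of the framework.

```python
from statistics import NormalDist


def fpr_at_target_fnr(unsafe: NormalDist, safe: NormalDist, epsilon: float) -> float:
    """FPR of the rule "warn if score < c", with c chosen so that FNR = epsilon.

    FNR = P(score >= c | unsafe) = epsilon  =>  c is the (1 - epsilon)-quantile
    of the unsafe score distribution.
    """
    c = unsafe.inv_cdf(1 - epsilon)
    return safe.cdf(c)  # P(score < c | safe)


unsafe = NormalDist(mu=0.0, sigma=1.0)
for safe_mean in [0.0, 3.0, 6.0]:
    fpr = fpr_at_target_fnr(unsafe, NormalDist(mu=safe_mean, sigma=1.0), epsilon=0.05)
    print(safe_mean, fpr)
```

When the two distributions coincide (safe mean 0.0), the FPR equals $1-\epsilon$, which is no better than warning at random; as the safe distribution moves away from the unsafe one, the FPR at the same FNR shrinks toward zero.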

3.2.2 Finite Validation Data Regime

The lack of sufficient validation data is another source of error that degrades the best achievable FNR/FPR trade-off. Intuitively, because we need to provably guarantee the FNR for $\epsilon$ -safety, in the absence of sufficient validation data, we must be conservative and issue more warnings than necessary. For example, with zero validation data we have no choice but to issue a warning for nearly every example, leading to a high FPR. In fact, we show in Proposition 2 in Section 3.3 that if we have fewer than $T$ data samples, we cannot guarantee better than $O(1/T)$ FNR without incurring an FPR of close to $1$ for any distribution-free warning function that depends only on the ordering of the $g(Y)$ values.

Our conformal algorithm can guarantee an $O(1/T)$ FNR while the FPR is not much higher than in the infinite data regime, demonstrating the (asymptotic) optimality of the conformal algorithm presented in Algorithm 1.

3.3 Algorithm to Achieve Guaranteed Safety Assurances

In this section, we will describe an algorithm that achieves $\epsilon^{*}$-safety with a low (nontrivial) false positive rate. The setup is as described in Section 3.1, with current observations $X$, simulator output $Y$, and true unknown future states $Z$.

If the simulation $Y$ is perfect and has the same type as the ground truth future state $Z$ , i.e. $Z=Y$ almost surely, then we can simply set $g=f$ and choose $w(Y)=\mathbb{I}(g(Y)<f_{0})$ , and this $w$ will automatically satisfy our definition of safety. However, in most applications, it is difficult to provide any guarantees on the accuracy of the simulation. For example, in autonomous driving situations, traffic participants can behave in unexpected and hard to predict ways.

When we are uncertain about the simulation accuracy, we will require an additional training dataset. With a dataset of (simulated future state, true future state) pairs $(Y_{1},Z_{1})$ , $(Y_{2},Z_{2}),\cdots,(Y_{T},Z_{T})$ , where $T$ is the number of samples, we can guarantee $\epsilon$ -safety. Let $({\hat{Y}},{\hat{Z}})$ denote a new test sample. We require only a single assumption on the dataset:

Assumption 1.

The sequence $(Y_{1},Z_{1})$, $(Y_{2},Z_{2})$, $\cdots$, $(Y_{T},Z_{T})$, $({\hat{Y}},{\hat{Z}})$ is exchangeable, i.e. every permutation of the sequence is equally likely to be observed.

Exchangeability is a strong assumption. However, it is weaker than typical i.i.d. assumptions that underlie most machine learning methods with performance guarantees: if a sequence of data is i.i.d., then it is also exchangeable. In addition, if the distribution shifts, it is not prohibitively costly to collect a new training dataset from the shifted distribution. This is because we require only a very small dataset (e.g. in most of our experiments, the training dataset contains only about 50 examples of unsafe situations).

Based only on Assumption 1, we design an algorithm to guarantee safety on test data. The algorithm can be thought of as an instantiation of the conformal prediction framework.

Algorithm 1 Approximate $\epsilon$-safety

1: Input: Training dataset $(Y_{1},Z_{1}),(Y_{2},Z_{2}),\cdots,(Y_{T},Z_{T})$; surrogate safety score $g$; true safety score $f$; threshold $f_{0}$; a new simulation ${\hat{Y}}$

2: Output: $\{0,1\}$

3: Compute ${\mathcal{A}}=\big\{g(Y_{t})\colon f(Z_{t})<f_{0},\ t=1,\cdots,T\big\}$.

4: Sample $U$ uniformly from $\big\{0,1,\cdots,|\{a\in{\mathcal{A}}\colon a=g({\hat{Y}})\}|\big\}$.

5: Compute $q=\frac{|\{a\in{\mathcal{A}}\colon a<g({\hat{Y}})\}|+U+1}{|{\mathcal{A}}|+1}$.

6: If $q\leq 1-\epsilon$ then output 1, otherwise output 0.

Intuitively, the procedure is as follows. We first collect the surrogate safety scores (based on the simulator outputs) of the unsafe samples in the training dataset (Line 3). We then sample an integer $U$ uniformly between 0 and $|\{a\in{\mathcal{A}}\colon a=g({\hat{Y}})\}|$, the number of unsafe training data points whose surrogate safety score equals that of the new simulation, $g({\hat{Y}})$ (Line 4). The randomization factor $U$ is simply a small correction that guarantees exact coverage in case of ties (see Vovk et al. (2005)). We next compute the quantile value $q$ for the new test simulation, i.e. the proportion of unsafe training samples with a lower surrogate safety score than $g({\hat{Y}})$, adjusted by the randomization factor from the previous step (Line 5). If this quantile value is at most $1-\epsilon$ (i.e. no more than a $1-\epsilon$ fraction of the unsafe training samples have a lower surrogate safety score), we say that this may be an unsafe situation and issue an alert; otherwise, we say that it is safe (Line 6).
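The steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name, the argument names, and the toy numbers are assumptions, and the surrogate and true safety scores are taken to be precomputed lists.

```python
import random


def conformal_warning(train_g, train_f, f0, g_hat, epsilon, rng=random):
    """Return 1 (issue a warning) or 0 (no warning) for a new simulation.

    train_g: surrogate safety scores g(Y_t) of the training samples.
    train_f: true safety scores f(Z_t) of the same samples.
    f0: safety threshold; a training sample is unsafe when f(Z_t) < f0.
    g_hat: surrogate safety score g(Y_hat) of the new simulation.
    """
    # Surrogate scores of the unsafe training samples.
    A = [g for g, f in zip(train_g, train_f) if f < f0]
    # Random tie-breaking term, uniform on {0, ..., number of ties}.
    ties = sum(1 for a in A if a == g_hat)
    U = rng.randint(0, ties)  # randint includes both endpoints
    # Quantile of the new score among the unsafe scores.
    below = sum(1 for a in A if a < g_hat)
    q = (below + U + 1) / (len(A) + 1)
    # Warn when the new score is not clearly above the unsafe scores.
    return 1 if q <= 1 - epsilon else 0


# Toy data: eight unsafe samples (true score below f0 = 2.0) and one safe sample.
train_g = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 9.0]
train_f = [1.0] * 8 + [5.0]
for g_hat in [0.2, 2.7, 3.2, 10.0]:
    print(g_hat, conformal_warning(train_g, train_f, 2.0, g_hat, epsilon=0.25))
```

In this toy example no new score ties with a training score, so $U=0$ and the output is deterministic: the two lower scores give $q=1/9$ and $q=6/9$ and trigger a warning, while the two higher scores give $q=7/9$ and $q=1$ and do not.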

The following proposition shows that Algorithm 1 can guarantee safety.

Proposition 1.

Algorithm 1 is $\epsilon+1/(1+|{\mathcal{A}}|)$ -safe (with respect to ${\hat{Y}}$ and ${\hat{Z}}$ ), under Assumption 1.

Proof.

Given a dataset

(Y_{1},Z_{1}),(Y_{2},Z_{2}),\cdots,(Y_{T},Z_{T}),

where each data point contains the (simulated future state, true future state), denote the subset of “unsafe” data as

(Y_{c_{1}},Z_{c_{1}}),(Y_{c_{2}},Z_{c_{2}}),\cdots,(Y_{c_{M}},Z_{c_{M}}),

where $Z_{c_{t}}$ is the $t$ -th unsafe example (i.e. $f(Z_{c_{t}})<f_{0}$ ). For typographical clarity throughout this proof, we use $M$ as shorthand to represent the number of unsafe data points,

M=|{\mathcal{A}}|=\big{|}\big{\{}t\colon f(Z_{t})<f_{0}\big{\}}\big{|},

and we use $N$ to represent the number of unsafe data points with surrogate safety score less than $g(\hat{Y})$ ,

$$N=\big|\big\{t\colon g(Y_{c_{t}})<g(\hat{Y})\big\}\big|=\big|\big\{a\in{\mathcal{A}}\colon a<g({\hat{Y}})\big\}\big|.$$

Suppose that $\hat{Z}$ is also unsafe, i.e. $f(\hat{Z})<f_{0}$ . Let $\lbag\cdot\rbag$ denote an unordered bag (i.e. it is a set that can have repeated elements). We use $B$ to represent the unordered bag of unsafe data,

B=\lbag(Y_{c_{1}},Z_{c_{1}}),\cdots,(Y_{c_{M}},Z_{c_{M}}),(\hat{Y},\hat{Z})\rbag.

To bound the safety, we first note that the probability of a false negative is given by

$$\begin{aligned}
\Pr[w(\hat{Y})=0] &= \Pr[q>1-\epsilon] \\
&= {\mathbb{E}}\Bigl[\Pr\bigl[q>1-\epsilon \,\big|\, B\bigr]\Bigr] \qquad \text{(Tower)} \\
&= {\mathbb{E}}\Bigl[\Pr\bigl[N+U>(1-\epsilon)(M+1)-1 \,\big|\, B\bigr]\Bigr].
\end{aligned}$$

By the assumption of exchangeability, we are equally likely to observe any permutation of $B$ . Intuitively, $g(\hat{Y})$ is equally likely to be the largest, 2nd largest, etc., among $g(Y_{c_{1}}),\cdots,g(Y_{c_{M}}),g(\hat{Y})$ . Formally, the random variable $N+U$ takes on all values $\{0,1,\cdots,M\}$ with equal probability. (Footnote 3: Note that the maximum value of $U$ is the number of unsafe training data points with surrogate safety scores that are equal to the surrogate safety score of the new simulation, and so $|\{a\in{\mathcal{A}}\colon a<g({\hat{Y}})\}|+|\{a\in{\mathcal{A}}\colon a=g({\hat{Y}})\}|=M$ .) Therefore,

\begin{aligned}
\Pr\Bigl[N+U>(1-\epsilon)(M+1)-1\,\Big|\,B\Bigr]&=1-\Pr\Bigl[N+U\leq(1-\epsilon)(M+1)-1\,\Big|\,B\Bigr]\\
&\leq 1-\frac{\lceil(1-\epsilon)(M+1)-1\rceil}{M+1}\\
&=\frac{\lfloor M+1-(1-\epsilon)(M+1)+1\rfloor}{M+1}\\
&\leq\frac{1+\epsilon M+\epsilon}{M+1}\\
&=\epsilon+\frac{1}{M+1}.
\end{aligned}

We can combine this result with the previous statement of the probability of a false negative to get

\begin{aligned}
\Pr[w(\hat{Y})=0]&={\mathbb{E}}\Bigl[\Pr\Bigl[N+U>(1-\epsilon)(M+1)-1\,\Big|\,B\Bigr]\Bigr]\\
&\leq{\mathbb{E}}\left[\epsilon+\frac{1}{M+1}\right]\\
&=\epsilon+\frac{1}{M+1}
\end{aligned}

as required. ∎


To use Proposition 1 to provide safety guarantees, we choose $\epsilon$ based on the number of samples available $|{\mathcal{A}}|$ . Specifically, if the desired safety level is $\epsilon^{*}$ , then we can choose any $\epsilon<\epsilon^{*}$ in Algorithm 1 such that

\displaystyle\epsilon+1/(1+|{\mathcal{A}}|)\leq\epsilon^{*}

(1)

In other words, if our choice of $\epsilon$ satisfies Eq. (1), then Algorithm 1 will be $\epsilon^{*}$ -safe. Intuitively, choosing a large $\epsilon$ decreases the false positive rate (FPR). This is because according to Algorithm 1 Line 6, choosing a larger $\epsilon$ decreases the number of times that a warning is output. Therefore, based on the number of samples in the training dataset $|{\mathcal{A}}|$ , we choose the largest $\epsilon$ that satisfies Eq. (1); i.e. we choose

\epsilon=\epsilon^{*}-1/(1+|{\mathcal{A}}|).

We will call $1/(1+|{\mathcal{A}}|)$ the discretization error.

Proposition 1 also reveals the sample complexity of the conformal prediction algorithm. If the number of unsafe examples is too small ( $|{\mathcal{A}}|\leq 1/\epsilon^{*}-1$ ), then we must choose $\epsilon\leq 0$ to ensure $\epsilon^{*}$ -safety according to Proposition 1. Algorithm 1 with $\epsilon\leq 0$ will trivially always output $1$ (i.e. always issue a warning). On the other hand, if the number of unsafe examples exceeds the threshold ( $|{\mathcal{A}}|>1/\epsilon^{*}-1$ ), then there will be an $\epsilon>0$ that ensures $\epsilon^{*}$ -safety according to Proposition 1. Consequently, Algorithm 1 will not be trivial. In practice, we find that to get good results and a low false positive rate, it is sufficient to have a sample count $|{\mathcal{A}}|$ that exceeds the threshold by a small margin, such as $|{\mathcal{A}}|=1.5/\epsilon^{*}-1$ . For example, to achieve a 5% false negative rate ( $\epsilon^{*}=0.05$ ), it is sufficient to have only about 30 unsafe examples.
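As a small worked illustration of this choice of $\epsilon$, the following Python sketch computes the discretization error and the resulting $\epsilon=\epsilon^{*}-1/(1+|{\mathcal{A}}|)$. The function names are hypothetical and introduced only for this example.

```python
def discretization_error(num_unsafe):
    """The term 1 / (1 + |A|) from Proposition 1."""
    return 1.0 / (1 + num_unsafe)


def choose_epsilon(eps_star, num_unsafe):
    """Largest epsilon satisfying epsilon + 1 / (1 + |A|) <= eps_star.

    A non-positive return value means there are too few unsafe examples:
    the warning system would then have to issue an alert every time.
    """
    return eps_star - discretization_error(num_unsafe)


# With eps_star = 0.05: 9 unsafe examples are too few (epsilon is negative),
# while 29 unsafe examples (1.5 / eps_star - 1) leave a positive epsilon.
for num_unsafe in (9, 19, 29, 99):
    print(num_unsafe, choose_epsilon(0.05, num_unsafe))
```

With $\epsilon^{*}=0.05$ and $|{\mathcal{A}}|=29$, this gives $\epsilon=0.05-1/30\approx 0.017$, so the algorithm is run with a slightly stricter target than the desired safety level.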

Proposition 2 demonstrates that Algorithm 1 is asymptotically optimal, since it can guarantee an $O(1/T)$ FNR while the FPR is not much higher than in the infinite data regime.

Proposition 2.

There is no $\epsilon$ -safe warning system based on the ordering of $g(Y)$ values that can achieve a false positive rate lower than $1-(1+T)\epsilon$ .

We conjecture that Proposition 2 holds for all functions, but we prove it for the fairly general class of functions specified by Equation 5 below, comparing the surrogate safety score for our new test sample $g({\hat{Y}})$ relative to the $g(Y_{1}),\cdots,g(Y_{T})$ values.

Proof.

Consider a function $w$ that maps a dataset ${\mathcal{D}}=(g(Y_{1}),Z_{1}),\cdots,(g(Y_{T}),Z_{T})$ of unsafe examples, and a new data point $g({\hat{Y}})$ , to $\{0,1\}$ . Let $w$ be a warning function that gives a distribution-free false negative rate guarantee that depends only on the ordering between $g(Y_{1}),\cdots,g(Y_{T}),g({\hat{Y}})$ (rather than on their specific values). In other words, $w$ takes the form defined by

\displaystyle w({\mathcal{D}},{\hat{Y}})=\begin{cases}\phi\left(\big{|}\big{\{}t\colon g({\hat{Y}})<g(Y_{t})\big{\}}\big{|}\right),&\text{with probability }\gamma\\ 1,&\text{with probability }1-\gamma\end{cases}

(5)

for some deterministic function $\phi$ and real number $\gamma$ . We know that when the data is exchangeable, the random variable $|\{t\colon g({\hat{Y}})<g(Y_{t})\}|$ is uniformly distributed on $\{0,1,\cdots,T\}$ .

Case 1: Suppose $\phi$ outputs the value $0$ (no warning) for at least one possible input. Then the false negative rate is given by

\displaystyle\text{FNR}\geq\gamma/(1+T),

(6)

and the false positive rate is given by

\displaystyle\text{FPR}\geq 1-\gamma,

(7)

so combined we have

\displaystyle\text{FPR}\geq 1-\gamma\geq 1-(1+T)\text{FNR}\geq 1-(1+T)\epsilon.

(8)

Case 2: Suppose $\phi$ outputs the value $0$ for none of the inputs (i.e. it always issues a warning). Then the false negative rate and false positive rate are given by

\displaystyle\text{FNR}=0,\text{FPR}=1,

(9)

so we would still (trivially) have $\text{FPR}\geq 1-(1+T)\epsilon$ .

Thus, if $w$ takes the form of Equation 5, then the false positive rate must be lower bounded by $1-(1+T)\epsilon$ . In other words, when $\epsilon=o(1/T)$ , the false positive rate tends to $1$ when $T$ is large. ∎
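To give a feel for the bound in Proposition 2, the short Python sketch below evaluates $1-(1+T)\epsilon$ for a few values of $T$ and $\epsilon$. The function name is hypothetical, and clipping the result to $[0,1]$ is our own addition, made only because a rate cannot be negative.

```python
def fpr_lower_bound(num_unsafe, epsilon):
    """Evaluate 1 - (1 + T) * epsilon, clipped to the interval [0, 1]."""
    return min(1.0, max(0.0, 1.0 - (1 + num_unsafe) * epsilon))


# When epsilon is small relative to 1 / T the bound is close to 1;
# once epsilon reaches 1 / (1 + T) the bound becomes vacuous (zero).
for num_unsafe, epsilon in [(49, 0.001), (49, 0.01), (49, 0.02), (99, 0.05)]:
    print(num_unsafe, epsilon, fpr_lower_bound(num_unsafe, epsilon))
```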


3.3.1 Algorithm Design Choices

There are two major components of our algorithm that can be tuned to achieve a tighter FPR: the surrogate safety score, and the trained simulator or prediction model. A surrogate safety score that is better correlated with the true safety score will lead to a better FPR, as will a more accurate simulator model. Note that patterns of inaccuracies in the simulator model will also lead to patterns of errors in the false alarms. For example, an autonomous driving simulator that is particularly inaccurate around yield signs will lead to surrogate safety scores that are not well correlated with the true safety score around yield signs; more alarms will be issued in order to maintain the specified FNR and thus the FPR will be higher.

The guarantees and analysis of Algorithm 1 will hold regardless of the surrogate safety score and simulator model used. However, if these components are chosen poorly, not enough data is available, or the required $\epsilon^{*}$ is too stringent, then this procedure could become trivial (e.g. always issuing a warning). In practice, however, we find that we are able to obtain good results for an autonomous driving application with an off-the-shelf prediction model for reasonably low $\epsilon^{*}$ -values with not too much data (see Section 4).

3.4 Comparing Conformal Prediction with PAC Learning

We further compare the statistical learning and conformal prediction guarantees. We first clarify the notation and formally define the different assumptions. Consider a sequence of training data $(Y_{1},Z_{1}),(Y_{2},Z_{2}),\cdots,(Y_{T},Z_{T})$ and a sequence of test data $({\hat{Y}}_{1},{\hat{Z}}_{1}),({\hat{Y}}_{2},{\hat{Z}}_{2}),\cdots,({\hat{Y}}_{N},{\hat{Z}}_{N})$ . Let $c_{1},\cdots,c_{M}$ denote the unsafe subsequence of test data, i.e. $({\hat{Y}}_{c_{1}},{\hat{Z}}_{c_{1}}),({\hat{Y}}_{c_{2}},{\hat{Z}}_{c_{2}}),\cdots,({\hat{Y}}_{c_{M}},{\hat{Z}}_{c_{M}})$ is the subsequence of $({\hat{Y}}_{1},{\hat{Z}}_{1}),({\hat{Y}}_{2},{\hat{Z}}_{2}),\cdots,({\hat{Y}}_{N},{\hat{Z}}_{N})$ such that, for all $m$ , $f({\hat{Z}}_{c_{m}})<f_{0}$ .

Two possible assumptions that we could make on the training and test data sequences are shown in Assumptions 2 and 3. In particular, marginal exchangeability (Assumption 2) is the same as Assumption 1 from the previous section. The only difference here is that we explicitly state that we only require exchangeability with each test data point.

Assumption 2 (Marginal exchangeability).

For each $n$ , the sequence $(Y_{1},Z_{1})$ , $(Y_{2},Z_{2})$ , $\cdots$ , $(Y_{T},Z_{T})$ , $({\hat{Y}}_{n},{\hat{Z}}_{n})$ is exchangeable.

Assumption 3 (Independent and identically distributed).

The training and test data sequences $(Y_{1},Z_{1})$ , $(Y_{2},Z_{2})$ , $\cdots$ , $(Y_{T},Z_{T})$ , $({\hat{Y}}_{1},{\hat{Z}}_{1}),({\hat{Y}}_{2},{\hat{Z}}_{2}),\cdots,({\hat{Y}}_{N},{\hat{Z}}_{N})$ are drawn i.i.d. from a common distribution.

Given a warning function, we use the random variables ${\hat{F}}_{c_{1}},\cdots,{\hat{F}}_{c_{M}}$ to denote failure of the warning function, i.e. ${\hat{F}}_{c_{m}}={\mathbb{I}}(w({\hat{Y}}_{c_{m}})=0)$ . Note that ${\hat{F}}_{c_{m}}$ depends on $w$ , but we drop this dependence from our notation.

A learning algorithm is a function that takes as input the training data $(Y_{1},Z_{1}),(Y_{2},Z_{2}),\cdots,(Y_{T},Z_{T})$ and outputs a warning function $w:X\to\{0,1\}$ . There are two main paradigms for designing learning algorithms with guarantees.

PAC Learning: Under Assumption 3, a learning algorithm is $(\epsilon,\delta)$ -safe if with $1-\delta$ probability (with respect to randomness of the training data) the learned warning function $w$ satisfies for some $\epsilon^{\prime}<\epsilon$

\displaystyle{\hat{F}}_{c_{1}},\cdots,{\hat{F}}_{c_{M}}\sim\mathrm{Bernoulli}(\epsilon^{\prime}).

(10)

Conformal Learning: For completeness we restate the conformal learning guarantee. A learning algorithm is $\epsilon$ -safe if the learned function $w$ satisfies for some $\epsilon^{\prime}<\epsilon$

\displaystyle{\hat{F}}_{c_{m}}\sim\mathrm{Bernoulli}(\epsilon^{\prime}),\text{ for all }m=1,\cdots,M.

(11)

3.4.1 Comparing Assumptions

Conformal learning requires weaker assumptions. Assumption 2 is much weaker than Assumption 3, and hence is applicable to a much larger class of problems. For example, consider an autonomous driving application where the training data are snippets from randomly sampled driving scenes (no two training data points come from the same driving scene), and the test data $({\hat{Y}}_{1},{\hat{Z}}_{1}),({\hat{Y}}_{2},{\hat{Z}}_{2}),\cdots,({\hat{Y}}_{N},{\hat{Z}}_{N})$ is a sequence of driving snippets from a random driving scene. The test data points are not independent because they are from the same scene, and hence Assumption 3 is violated. However, Assumption 2 holds because the training data and any single test sample are snippets from randomly sampled driving scenes.

3.4.2 Comparing Sample Complexity

Conformal learning requires $\Theta(1/\epsilon)$ training examples of unsafe situations (Proposition 1), while standard analysis in PAC learning requires $\Theta(1/\epsilon^{2})$ examples. For example, consider the following $(\epsilon,\delta)$ -safe algorithm: based on the simulation $Y$ and the surrogate safety function $g$ , we consider the family of warning functions $w_{\theta}(Y)={\mathbb{I}}(g(Y)<\theta)$ . Our goal is to estimate the false negative rate of $w_{\theta}$ (denoted by $\epsilon^{*}(\theta)$ ) for each $\theta$ and select the smallest $\theta$ such that $\epsilon^{*}(\theta)\leq\epsilon$ .

To estimate $\epsilon^{*}$ , we compute the (empirical) false negative rate (denoted by $\hat{\epsilon}(\theta)$ ) on the training data, i.e.

\displaystyle\hat{\epsilon}(\theta)=\frac{1}{M}\sum_{m=1}^{M}{\mathbb{I}}(g(Y_{c_{m}})\geq\theta)

(12)

and use a standard concentration inequality (such as Hoeffding) to bound the difference between $\epsilon^{*}(\theta)$ and $\hat{\epsilon}(\theta)$ . Specifically, with probability $1-\delta$

\displaystyle\epsilon^{*}(\theta)\in\hat{\epsilon}(\theta)\pm\sqrt{\frac{\log(1/\delta)}{2M}}.

(13)

Note that Eq. (13) is already the tightest bound possible up to constants (Foody, 2009). To verify that $w_{\theta}$ has a false negative rate that is less than or equal to $\epsilon$ , we have to check that

\displaystyle\hat{\epsilon}(\theta)+\sqrt{\frac{\log(1/\delta)}{2M}}\leq\epsilon,

which requires

\displaystyle\sqrt{\frac{\log(1/\delta)}{2M}}\leq\epsilon\iff M\geq\frac{\log(1/\delta)}{2\epsilon^{2}}.

This means that we must have at least $\Theta(1/\epsilon^{2})$ samples.

In words, even a fixed $w_{\theta}$ requires $\Theta(1/\epsilon^{2})$ samples to verify its false negative rate according to Eq. (13). Thus, finding $w_{\theta}$ to provably achieve low false negative rate should require at least as many, if not more, training examples.
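The gap between the two sample complexities can be made concrete with a short Python sketch. It compares the conformal threshold $1/\epsilon-1$ from Proposition 1 with the requirement $M\geq\log(1/\delta)/(2\epsilon^{2})$ derived above. The function names and the choice $\delta=0.001$ are illustrative assumptions.

```python
import math


def conformal_threshold(eps):
    """Unsafe-sample count that must be exceeded for a non-trivial
    conformal warning system (Proposition 1): 1 / eps - 1."""
    return 1.0 / eps - 1.0


def hoeffding_min_samples(eps, delta):
    """Smallest integer M with sqrt(log(1 / delta) / (2 * M)) <= eps."""
    return math.ceil(math.log(1.0 / delta) / (2.0 * eps ** 2))


# The conformal requirement grows like 1 / eps, the Hoeffding-based
# requirement like 1 / eps^2 (times a log(1 / delta) factor).
for eps in (0.1, 0.05, 0.01):
    print(eps, conformal_threshold(eps), hoeffding_min_samples(eps, 0.001))
```

For $\epsilon=0.05$ and $\delta=0.001$, the Hoeffding-based requirement is on the order of a thousand unsafe examples, whereas the conformal threshold is 19.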

3.4.3 Comparing Usefulness of Guarantees

PAC learning and conformal learning both have advantages. PAC learning has the advantage that its i.i.d. error rate guarantee in Eq. (10) is stronger than the marginal error rate guarantee in Eq. (11). For example, if the downstream user is very sensitive to high variance (i.e. it is unacceptable for all test examples to fail simultaneously even if the probability is vanishingly small) then the i.i.d. error rate guarantee in Eq. (10) might be necessary. Nevertheless, the risk can be reduced by alternative methods such as financial tools (insurance). On the other hand, the conformal learning guarantee in Eq. (11) has the advantage that it always holds, while the PAC learning guarantee in Eq. (10) only holds with $1-\delta$ probability.

To summarize, conformal learning requires much weaker assumptions and fewer samples, and its guarantees always hold (rather than with $1-\delta$ probability). PAC learning offers stronger guarantees when its assumptions and sample complexity requirements are met.

4 Experiments: Driver Alert System

We empirically validate the guarantees of our framework on a driver alert safety system using real driving data. The system should warn the driver if the driver may get into an unsafe situation, without issuing too many false alarms. We show that the false negative rate (the percentage of unsafe situations that the system fails to identify) is indeed bounded according to Proposition 1, while the FPR remains low.

4.1 Experimental Setup

4.1.1 Methods

We evaluate our framework on the setup described in Section 3.1. We use Trajectron++ (Salzmann et al., 2020) as our future dynamics model (i.e. in the notation of Section 3.1, $Y$ is the output of Trajectron++ and $g=f$ ). We choose the safety score $f$ as a weighted distance metric, where agents in the direction of the ego-vehicle velocity vector are considered “closer” than agents in the orthogonal direction.

More specifically, we define the safety score by the Mahalanobis distance between the ego-vehicle and the agent, where the first eigenvector is aligned with the ego-vehicle’s velocity vector, and the second eigenvector is orthogonal to the ego-vehicle’s velocity vector (see Figure 6); the magnitude of the first eigenvector is the magnitude of the velocity, and the magnitude of the second eigenvector is approximately half of a car width (we use 1m). Intuitively, this means that agents that are along the ego-vehicle’s velocity vector appear closer than agents in the perpendicular direction.
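One way to read this definition is sketched below in Python for 2D positions. This is an illustration under stated assumptions, not the authors' code: we interpret the eigenvector "magnitudes" as axis scales (so the offset along the velocity direction is divided by the speed and the lateral offset by 1 m), and the isotropic fallback for a stationary ego-vehicle is our own addition. The function name and arguments are hypothetical.

```python
import math


def velocity_weighted_distance(ego_pos, ego_vel, agent_pos,
                               lateral_scale=1.0, min_speed=1e-6):
    """Mahalanobis-style distance with axes aligned to the ego velocity."""
    dx = agent_pos[0] - ego_pos[0]
    dy = agent_pos[1] - ego_pos[1]
    speed = math.hypot(ego_vel[0], ego_vel[1])
    if speed < min_speed:
        # Assumption: with no heading available, use an isotropic distance.
        return math.hypot(dx, dy) / lateral_scale
    # Unit vector along the velocity and the offset components in that frame.
    ux, uy = ego_vel[0] / speed, ego_vel[1] / speed
    along = dx * ux + dy * uy
    across = -dx * uy + dy * ux
    return math.sqrt((along / speed) ** 2 + (across / lateral_scale) ** 2)


# Ego at the origin moving at 10 m/s along x: an agent 10 m ahead scores
# lower (appears closer) than an agent 10 m to the side.
ahead = velocity_weighted_distance((0.0, 0.0), (10.0, 0.0), (10.0, 0.0))
beside = velocity_weighted_distance((0.0, 0.0), (10.0, 0.0), (0.0, 10.0))
print(ahead, beside)
```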

4.1.2 Datasets

We use the nuScenes (Caesar et al., 2020) and the Kaggle Lyft Motion Prediction (Jang et al., 2020) autonomous driving datasets. Each dataset contains multiple scenes, and each scene contains multiple trajectories. The trajectories in a scene are correlated with each other, but the different scenes are sufficiently distinct from each other to be considered exchangeable. To generate a dataset of exchangeable trajectories, we sample a single trajectory uniformly at random from each scene.
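The per-scene sampling step can be sketched as follows. The data layout (a dictionary mapping a scene identifier to its list of trajectories) and the function name are assumptions made for illustration only; the actual dataset loaders are not shown here.

```python
import random


def sample_one_trajectory_per_scene(scenes, rng=None):
    """Pick one trajectory uniformly at random from every scene.

    scenes: dict mapping a scene id to a non-empty list of trajectories.
    Returns a dict mapping each scene id to its single sampled trajectory.
    """
    rng = rng or random.Random()
    return {scene_id: rng.choice(trajectories)
            for scene_id, trajectories in scenes.items()}


# Toy example with placeholder trajectory labels.
toy_scenes = {"scene_a": ["a1", "a2", "a3"], "scene_b": ["b1"]}
sampled = sample_one_trajectory_per_scene(toy_scenes, random.Random(0))
print(sampled)
```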

The nuScenes dataset includes 952 scenes collected across Boston and Singapore, divided into a 697/105/150 train/val/test split (the same split used for the original Trajectron++). Each scene is 20 seconds long. The Kaggle Lyft Motion Prediction dataset includes approximately 16k scenes, divided into a 70%/15%/15% train/val/test split. Each scene is 25 seconds long. Both datasets include labeled ego-vehicle trajectories as well as labeled detections and trajectories for other agents in the scene. Note that for both of these datasets, because the training split was used to train the Trajectron++ model, we use the validation split as the input training data for Algorithm 1. A visualization of trajectory predictions output by Trajectron++ is shown in Figure 8 (Salzmann et al., 2020).

4.1.3 Data Splitting

To compute average performance, we use random train and test splits. For both datasets, we first pool together all available data points, randomly shuffle them, and separate them back into training and test splits (with the same size as the original splits). We ran 100 trials for each experiment, and averaged over the results.
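A minimal sketch of one such randomized trial split is given below, assuming the data points are held in plain Python lists. The function name is hypothetical.

```python
import random


def random_resplit(train, test, rng):
    """Pool both splits, shuffle, and cut back to the original sizes."""
    pooled = list(train) + list(test)
    rng.shuffle(pooled)
    return pooled[:len(train)], pooled[len(train):]


# One trial on toy data: split sizes are preserved, membership is reshuffled.
new_train, new_test = random_resplit(list(range(7)), list(range(7, 10)),
                                     random.Random(0))
print(len(new_train), len(new_test))
```

Repeating this with a fresh random state for each of the 100 trials and averaging the resulting false negative and false positive rates gives the reported averages.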

4.2 Results and Discussion

In Fig. 6(a), 6(b), and 6(c) we vary several parameters (safety threshold $f_{0}$ , safety guarantee $\epsilon$ , and proportion of unsafe situations) for nuScenes. We show qualitatively similar results for the Lyft dataset in Figure 6(d). Our main observations:

1. The false negative rate (i.e. safety) is always within the theoretical bound in Proposition 1. We achieve these false negative rates with very little data. nuScenes has 50-70 unsafe examples in the training dataset, and Lyft has about 50. Yet, even with these few examples, we can ensure a false negative rate to within 1 or 2% of the desired $\epsilon$ .

2. The false positive rate (FPR) is generally very good: well below 1% on the nuScenes dataset. We use an off-the-shelf trajectory predictor trained on a small academic dataset; a more accurate trajectory predictor trained on industry-sized datasets might be expected to provide a more discriminative safety score (as in Figure 5), and thus a further improved FPR. Note that as shown in Figure 9, there is a tradeoff between $\epsilon$ and the FPR when there are few (e.g. $<1/\epsilon$ ) samples, which is consistent with what our theory from Section 3.2 would predict. Figure 9 plots the epsilon bound as well as the false negative and false positive rates vs. the number of unsafe samples in the validation dataset; we see that when $\epsilon$ decreases as $1/T$ , the false positive rate is relatively flat and low.

3. One previously unmentioned benefit of our approach is that our method is robust to label frequency shift: the frequency of unsafe situations can differ between the training data and test data. Observe that the output of Algorithm 1 depends only on the unsafe examples; consequently, the safety guarantee in Proposition 1 still holds if we increase or decrease the number of safe examples. For example, the training data collection process could intentionally focus on unsafe situations, so that unsafe examples are over-represented in the training data. We empirically simulate this in Figure 6(c) where we increase the proportion of unsafe examples in the training set (by deleting safe examples). The performance of our algorithm does not change qualitatively.

In Figure 10, we also plot the false negative and false positive rates at various $\epsilon$ -values for the Lyft dataset using PAC estimates. In this experiment, we find a 99.9% confidence bound for the $(1-\epsilon)$ quantile surrogate safety score among the unsafe samples, and issue an alert if the surrogate safety score of the new test sample is below this upper bound. Thus, with high (99.9%) probability, the FNR will be less than or equal to $\epsilon$ . As the figure demonstrates, the FPR is very high until $\epsilon$ is large, since there are too few samples to obtain a tight confidence interval.

In the comparison of PAC learning and conformal learning, we argued that the main advantage of PAC learning is that the failures are i.i.d., so the total number of failures should have low variance (due to the Central Limit Theorem). However, we show empirically that users need not be overly concerned about highly correlated failures, as long as the test samples are not inherently highly correlated. We find that the variance on the false negative rate from different train/test splits is very low. With $\epsilon=0.06$ , for instance, it was only $0.0014$ . Table 1 displays the variance on the false negative rate calculated over the 100 trials for the Lyft dataset at each $\epsilon$ value. All of the variances are well below 0.003, suggesting that the test sequence false negative rates are clustered around $\epsilon$ (rather than having some sequences that fail on zero examples and others with catastrophic failures). As further evidence, in Figure 11, we provide a representative box plot of the false negative rates over the 100 trials with $\epsilon=0.04$ . The false negative rate values are indeed clustered around $0.04$ .

$\epsilon$	$0.02$	$0.04$	$0.06$	$0.08$	$0.10$
Variance	0.00096	0.0019	0.0014	0.0023	0.0024

Table 1: Variance on the test sequence false negative rates at different $\epsilon$

5 Experiments: Robotic Grasping

Finally, we validate the guarantees of our framework on a robotic grasping system that should warn the user when the robot will fail to pick and transport an object. Picking is a core problem in warehouse robotics (Correll et al., 2016; Eppner et al., 2016; Hernandez et al., 2017; Yu et al., 2016; Zeng et al., 2017; Mahler et al., 2017, 2018, 2019), and failures hurt throughput (potentially even stopping the assembly line). Failures can also lead to dropped or damaged goods.

5.1 Experimental Setup

We again evaluate our framework on the setup described in Section 3.1, using an open source dataset and model. We use the Grasp Quality Convolutional Neural Network (GQ-CNN) from Mahler et al. (2017, 2018, 2019) as our predictor model and the DexNet 4.0 dataset of synthetic objects grasped with a parallel-jaw gripper (Mahler et al., 2019). (An example of objects from this dataset is shown in Figure 12.) The inputs to the GQ-CNN are a point cloud representation of an object, $\mathbf{y}$ , and a candidate grasp, $\mathbf{u}$ . The GQ-CNN outputs the predicted probability, $Q_{\theta}(\mathbf{y},\mathbf{u})$ , that the candidate grasp will be able to successfully pick and transport the object. We use this predicted probability as the surrogate safety score, $g=Q_{\theta}(\mathbf{y},\mathbf{u})$ . We consider a candidate grasp “unsafe” if it will not be able to successfully pick the object (i.e. the true label is $Z=0$ ). Note that this is exactly the ROC curve threshold tuning setup (with an additional guarantee on the false negative rate).

The DexNet dataset of synthetic objects (Mahler et al., 2017) includes a variety of pick attempts that were not used in training the GQ-CNN model. These samples are divided into a 50%/50% train/test split. Each example is labeled as a success if the robot successfully picks and places the object, and a failure otherwise. We ran 100 trials of Algorithm 1 with randomized train/test splits, and averaged over the results.

5.2 Results and Discussion

With $\epsilon=0.05$ , we achieved a false negative rate of 0.05, and a false positive rate of 0.11. With $\epsilon=0.1$ , we achieved a false negative rate of 0.10 and a false positive rate of 0.04. The conformal guarantees of our framework hold. A plot of the false negative and false positive rates achieved for different $\epsilon$ values is shown in Figure 12(a). As before, the false negative rate is within the theoretical bound, and the false positive rate is reasonably low.

In Figures 12(b), 12(c), and 12(d), we progressively degrade the quality of the surrogate safety score used in Figure 12(a) ( $g=Q_{\theta}(\mathbf{y},\mathbf{u})$ ) to demonstrate the effects of a worse surrogate safety score or a more inaccurate simulator. In fact, in Figure 12(d), we replace the surrogate safety score $g$ entirely with random noise that is uniformly sampled between 0 and 1, i.e. $g(Y)=\mathcal{U}([0,1])$ . As the plot shows, the false negative rate still remains within the theoretical bound. However, the false positive rate is extremely high (with values ranging from 0.95 to 0.90), indicating that the warning system essentially always issues an alert. This makes sense intuitively, because in this case the surrogate safety score is completely non-informative. In Figures 12(b) and 12(c), we progressively degrade the quality of the original surrogate safety score (the predicted probability that a pick will be successful) by adding increasing amounts of random noise. (In other words, we progressively decrease the accuracy of the simulator by adding noise.) We take a linear combination of the original surrogate safety score ( $g=Q_{\theta}(\mathbf{y},\mathbf{u})$ ) and the uniformly randomly sampled noise, using weights of 0.75 and 0.25 for the original surrogate safety score and the random noise, respectively, in Figure 12(b), and weights of 0.5 and 0.5 in Figure 12(c). Note that the false negative rate always remains within the theoretical bound from Proposition 1, but the false positive rate increases with increasing noise, indicating that more alerts are being issued and it is more difficult to distinguish between safe and unsafe situations.
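The degradation described above amounts to a convex combination of the original score and uniform noise, which can be sketched as follows. The function name is hypothetical; the weights 0.75, 0.5, and 0 correspond to Figures 12(b), 12(c), and 12(d), respectively.

```python
import random


def degraded_score(original_score, weight, rng):
    """Mix the original surrogate score with uniform noise on [0, 1].

    weight = 1 keeps the original score unchanged; weight = 0 replaces it
    entirely with noise.
    """
    noise = rng.uniform(0.0, 1.0)
    return weight * original_score + (1.0 - weight) * noise


rng = random.Random(0)
# Original score, 0.75 / 0.25 mix, 0.5 / 0.5 mix, and pure noise.
for weight in (1.0, 0.75, 0.5, 0.0):
    print(weight, degraded_score(0.8, weight, rng))
```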

Taken together, these results demonstrate that the guarantees of Algorithm 1 hold regardless of the surrogate safety score and simulator model used, and thus, our method can be used to bound the false negative rate of a warning system even if the simulator or prediction model does not have any performance guarantees. However, a surrogate safety score $g$ that is better correlated with the true safety score $f$ and a more accurate simulator model will lead to better empirical performance in terms of the false positive rate, and issue fewer unnecessary alerts.

6 Conclusions and Future Work

In this work, we introduce a broadly applicable framework that uses conformal prediction to tune warning systems for robotics applications. This framework allows us to achieve provable safety assurances with very little data. We demonstrate empirically that the guarantees on the false negative rate hold for a driver alert system and for a robotic grasping system (even with only tens of examples of failure cases), while the false positive rate remains low.

There are several exciting future directions for this work. One area of particular interest is the application of conformal prediction in non-exchangeable scenarios (Tibshirani et al., 2019; Barber et al., 2023; Gibbs and Candès, 2021; Cauchois et al., 2020), as many robotics settings involve highly correlated time-series data, and robots deployed in the world may encounter distribution shift. There have been many recent advances in the conformal prediction literature on relaxing the exchangeability assumption, and leveraging this work could lead to useful developments in robotics.

Another intriguing extension of this work is exploring conditional safety (Feldman et al., 2021; Gupta et al., 2022), with the goal of providing safety assurances conditioned on specific factors (rather than a marginal guarantee). For example, a driver assistance system that provides a 95% guarantee on the false negative rate regardless of whether it is raining could be very useful. This system could leverage existing work on conditional coverage guarantees in conformal prediction; however, it would need to be sample-efficient to be useful for robotics settings.

Two additional interesting and important future directions are studying deployment in industry-scale applications and studying the impact of the predictor on the data that it is trying to predict (Perdomo et al., 2020) (e.g. examining whether and to what extent the warning system changes behavior or outcomes).

Acknowledgements

The NASA University Leadership Initiative (grant #80NSSC20M0163) provided funds to assist the authors with their research. This article solely reflects the opinions and conclusions of its authors and not any NASA entity. The authors would like to thank Matteo Zallio for his expertise in crafting Figure 1.

References

Angelopoulos et al. (2020) Angelopoulos AN, Bates S, Malik J and Jordan MI (2020) Uncertainty sets for image classifiers using conformal prediction. In: International Conference on Learning Representations.
Angelopoulos et al. (2022) Angelopoulos AN, Krauth K, Bates S, Wang Y and Jordan MI (2022) Recommendation systems with distribution-free reliability guarantees. arXiv preprint arXiv:2207.01609 .
Barber et al. (2023) Barber RF, Candès EJ, Ramdas A and Tibshirani RJ (2023) Conformal prediction beyond exchangeability. Annals of Statistics 51(2): 816–845. 10.1214/23-AOS2276.
Caesar et al. (2020) Caesar H, Bankiti V, Lang AH, Vora S, Liong VE, Xu Q, Krishnan A, Pan Y, Baldan G and Beijbom O (2020) nuscenes: A multimodal dataset for autonomous driving. In: IEEE Conference on Computer Vision and Pattern Recognition.
Cai and Koutsoukos (2020) Cai F and Koutsoukos X (2020) Real-time out-of-distribution detection in learning-enabled cyber-physical systems. In: International Conference on Cyber-Physical Systems.
Calafiore and Campi (2006) Calafiore G and Campi M (2006) The scenario approach to robust control design. IEEE Transactions on Automatic Control 51(5): 742–753. 10.1109/TAC.2006.875041.
Cauchois et al. (2020) Cauchois M, Gupta S, Ali A and Duchi JC (2020) Robust validation: Confident predictions even when distributions shift. arXiv preprint arXiv:2008.04267 .
Chen et al. (2020) Chen Y, Rosolia U, Fan C, Ames A and Murray R (2020) Reactive motion planning with probabilistic safety guarantees. In: Conference on Robot Learning.
Correll et al. (2016) Correll N, Bekris KE, Berenson D, Brock O, Causo A, Hauser K, Okada K, Rodriguez A, Romano JM and Wurman PR (2016) Analysis and observations from the first amazon picking challenge. IEEE Transactions on Automation Science and Engineering 15(1): 172–188.
Crestani et al. (2015) Crestani D, Godary-Dejean K and Lapierre L (2015) Enhancing fault tolerance of autonomous mobile robots. Robotics and Autonomous Systems 68: 140–155.
Ding (2013) Ding SX (2013) Introduction. In: Model-Based Fault Diagnosis Techniques: Design Schemes, Algorithms and Tools. Springer London, pp. 3–11.
Eppner et al. (2016) Eppner C, Höfer S, Jonschkowski R, Martín-Martín R, Sieverling A, Wall V and Brock O (2016) Lessons from the amazon picking challenge: Four aspects of building robotic systems. In: Robotics: Science and Systems.
Feldman et al. (2021) Feldman S, Bates S and Romano Y (2021) Improving conditional coverage via orthogonal quantile regression. In: Advances in Neural Information Processing Systems.
Foody (2009) Foody GM (2009) Sample size determination for image classification accuracy assessment and comparison. International Journal of Remote Sensing 30(20): 5273–5291. 10.1080/01431160903130937.
Gammerman et al. (2008) Gammerman A, Nouretdinov I, Burford B, Chervonenkis A, Vovk V and Luo Z (2008) Clinical mass spectrometry proteomic diagnosis by conformal predictors. Statistical Applications in Genetics and Molecular Biology 7(2). 10.2202/1544-6115.1385.
Ghosh et al. (2023) Ghosh S, Belkhouja T, Yan Y and Doppa JR (2023) Improving uncertainty quantification of deep classifiers via neighborhood conformal prediction: Novel algorithm and theoretical analysis. In: AAAI Conference on Artificial Intelligence.
Gibbs and Candès (2021) Gibbs I and Candès EJ (2021) Conformal inference for online prediction with arbitrary distribution shifts. In: Advances in Neural Information Processing Systems.
Gupta et al. (2022) Gupta V, Jung C, Noarov G, Pai MM and Roth A (2022) Online multivalid learning: Means, moments, and prediction intervals. In: Innovations in Theoretical Computer Science Conference.
Harirchi and Ozay (2015) Harirchi F and Ozay N (2015) Model invalidation for switched affine systems with applications to fault and anomaly detection. Analysis and Design of Hybrid Systems 48(27): 260–266.
Harirchi and Ozay (2018) Harirchi F and Ozay N (2018) Guaranteed model-based fault detection in cyber-physical systems: A model invalidation approach. Automatica 93: 476–488. 10.1016/j.automatica.2018.03.040.
Hernandez et al. (2017) Hernandez C, Bharatheesha M, Ko W, Gaiser H, Tan J, van Deurzen K, de Vries M, Van Mil B, van Egmond J, Burger R et al. (2017) Team Delft’s robot winner of the Amazon Picking Challenge 2016. In: RoboCup 2016: Robot World Cup XX. Springer International Publishing, pp. 613–624.
Jang et al. (2020) Jang A, Christy, Bergamini L, Maggie, Scheel O, Ondruska P, Culliton P and Iglovikov V (2020) Lyft motion prediction for autonomous vehicles. URL https://kaggle.com/competitions/lyft-motion-prediction-autonomous-vehicles.
Khalastchi and Kalech (2018) Khalastchi E and Kalech M (2018) On fault detection and diagnosis in robotic systems. ACM Computing Surveys 51(1): 1–24.
Luo et al. (2022) Luo R, Zhao S, Kuck J, Ivanovic B, Savarese S, Schmerling E and Pavone M (2022) Sample-efficient safety assurances using conformal prediction. In: Workshop on the Algorithmic Foundations of Robotics.
Mahler et al. (2017) Mahler J, Liang J, Niyaz S, Laskey M, Doan R, Liu X, Ojea JA and Goldberg K (2017) Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. In: Robotics: Science and Systems.
Mahler et al. (2018) Mahler J, Matl M, Liu X, Li A, Gealy D and Goldberg K (2018) Dex-Net 3.0: Computing robust robot suction grasp targets in point clouds using a new analytic model and deep learning. In: IEEE International Conference on Robotics and Automation.
Mahler et al. (2019) Mahler J, Matl M, Satish V, Danielczuk M, DeRose B, McKinley S and Goldberg K (2019) Learning ambidextrous robot grasping policies. Science Robotics 4(26).
Muradore and Fiorini (2011) Muradore R and Fiorini P (2011) A PLS-based statistical approach for fault detection and isolation of robotic manipulators. IEEE Transactions on Industrial Electronics 59(8): 3167–3175.
Nouretdinov et al. (2011) Nouretdinov I, Costafreda SG, Gammerman A, Chervonenkis A, Vovk V, Vapnik V and Fu CH (2011) Machine learning classification with confidence: Application of transductive conformal predictors to MRI-based diagnostic and prognostic markers in depression. NeuroImage 56(2): 809–813. 10.1016/j.neuroimage.2010.05.023.
Patton and Chen (1997) Patton R and Chen J (1997) Observer-based fault detection and isolation: Robustness and applications. Control Engineering Practice 5(5): 671–682.
Perdomo et al. (2020) Perdomo J, Zrnic T, Mendler-Dünner C and Hardt M (2020) Performative prediction. In: International Conference on Machine Learning.
Salzmann et al. (2020) Salzmann T, Ivanovic B, Chakravarty P and Pavone M (2020) Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. In: European Conference on Computer Vision.
Shafer and Vovk (2008) Shafer G and Vovk V (2008) A tutorial on conformal prediction. Journal of Machine Learning Research 9: 371–421.
Tibshirani et al. (2019) Tibshirani RJ, Barber RF, Candès EJ and Ramdas A (2019) Conformal prediction under covariate shift. In: Advances in Neural Information Processing Systems.
Vemuri et al. (1998) Vemuri AT, Polycarpou MM and Diakourtis SA (1998) Neural network based fault detection in robotic manipulators. IEEE Transactions on Robotics and Automation 14(2): 342–348.
Visinsky et al. (1994a) Visinsky ML, Cavallaro JR and Walker ID (1994a) Expert system framework for fault detection and fault tolerance in robotics. Computers & Electrical Engineering 20(5): 421–435.
Visinsky et al. (1994b) Visinsky ML, Cavallaro JR and Walker ID (1994b) Robotic fault detection and fault tolerance: A survey. Reliability Engineering & System Safety 46(2): 139–158.
Visinsky et al. (1995) Visinsky ML, Cavallaro JR and Walker ID (1995) A dynamic fault tolerance framework for remote robots. IEEE Transactions on Robotics and Automation 11(4): 477–490.
von Luxburg and Schölkopf (2011) von Luxburg U and Schölkopf B (2011) Statistical learning theory: Models, concepts, and results. In: Gabbay DM, Hartmann S and Woods J (eds.) Inductive Logic, Handbook of the History of Logic, volume 10. North-Holland, pp. 651–706.
Vovk et al. (2005) Vovk V, Gammerman A and Shafer G (2005) Algorithmic Learning in a Random World. Springer New York. 10.1007/b106715.
Vovk et al. (2003) Vovk V, Lindsay D, Nouretdinov I and Gammerman A (2003) Mondrian confidence machine. URL http://alrw.net/old/04.pdf.
Yu et al. (2016) Yu KT, Fazeli N, Chavan-Dafle N, Taylor O, Donlon E, Lankenau GD and Rodriguez A (2016) A summary of Team MIT’s approach to the Amazon Picking Challenge 2015. arXiv preprint arXiv:1604.03639.
Zeng et al. (2017) Zeng A, Yu KT, Song S, Suo D, Walker E, Rodriguez A and Xiao J (2017) Multi-view self-supervised deep learning for 6D pose estimation in the Amazon Picking Challenge. In: IEEE International Conference on Robotics and Automation.