License: arXiv.org perpetual non-exclusive license
arXiv:2109.14082v5 [cs.RO] 02 Jan 2024
Corresponding author: Rachel Luo, Stanford University, Stanford, CA 94305, U.S.A.

Sample-Efficient Safety Assurances using Conformal Prediction

Rachel Luo¹, Shengjia Zhao¹, Jonathan Kuck², Boris Ivanovic¹, Silvio Savarese¹, Edward Schmerling¹, and Marco Pavone¹
¹ Stanford University, Stanford, CA 94305, USA
² Dexterity, Inc., Redwood City, CA 94063, USA
[email protected]
Abstract

When deploying machine learning models in high-stakes robotics applications, the ability to detect unsafe situations is crucial. Early warning systems can provide alerts when an unsafe situation is imminent (in the absence of corrective action). To reliably improve safety, these warning systems should have a provable false negative rate; i.e. of the situations that are unsafe, fewer than an $\epsilon$ fraction will occur without an alert. In this work, we present a framework that combines a statistical inference technique known as conformal prediction with a simulator of robot/environment dynamics, in order to tune warning systems to provably achieve an $\epsilon$ false negative rate using as few as $1/\epsilon$ data points. We apply our framework to a driver warning system and a robotic grasping application, and empirically demonstrate the guaranteed false negative rate while also observing a low false detection (positive) rate.

keywords:
Safety assurance, Conformal prediction, Statistical inference

1 Introduction

Monitoring a system for faults, or detecting whether unsafe situations will occur, is a key problem for high-stakes robotics applications, and indeed the field of fault detection has long been the state of practice for building reliable systems (Visinsky et al., 1994a, b, 1995; Vemuri et al., 1998; Khalastchi and Kalech, 2018; Muradore and Fiorini, 2011; Crestani et al., 2015; Ding, 2013; Patton and Chen, 1997; Harirchi and Ozay, 2015, 2018). With the advent of learning-enabled components in robotic systems, robots are performing increasingly complex safety-critical tasks, so reliability has become increasingly important. For instance, in an autonomous driving setting, errors in perception or planning could lead to collision. In a warehouse robotics setting, robots on the factory floor work alongside humans, and not recognizing faults in learned systems could impact safety or even lead to injuries. At the same time, it is less clear how to ensure reliability for these learned systems. These systems are complex, so guaranteeing safety is not something that can be done from first principles; empirical, data-driven methods are needed.

In this work, we present a sample-efficient and principled method for detecting unsafe situations based on the statistical inference technique of conformal prediction (Vovk et al., 2005). Our method provides provable false negative rates for warning systems (i.e. among the situations in which an alert should be issued, fewer than an $\epsilon$ fraction occur without an alert), while achieving low false positive rates (few unnecessary alerts are issued).

For example, in a driver assistance system, when an unsafe situation (e.g. another car getting too close) is imminent, our method will issue a warning the vast majority of the time (i.e. at least $1-\epsilon$ of the time). In a warehouse setting with a robotic pick-and-place system, when the system will fail to grasp and transport an object, our method will issue an alert at least $1-\epsilon$ of the time. As a running example in this paper, we use our method to design an alert system to warn a human operator of impending danger in a driving application (illustrated in Figure 1).

Figure 1: We design a warning system that achieves a provable false negative rate sample-efficiently. Among the situations that are dangerous (i.e. lead to an unsafe future situation in the absence of corrective action), fewer than an $\epsilon$ fraction occur without an alert.

1.1 Related Work

Traditional fault detection techniques include hardware redundancy, signal processing, and plausibility tests (Ding, 2013; Visinsky et al., 1994a, b, 1995; Vemuri et al., 1998). However, hardware redundancy requires extra components, signal processing works well only for processes in steady state, and plausibility tests do not catch faults that lead to a physically plausible system. Additionally, these methods typically lack performance guarantees. Model-based fault detection techniques (Ding, 2013; Patton and Chen, 1997; Harirchi and Ozay, 2015, 2018) involve using a model of the system to determine whether a fault has occurred; they assume that users have a very accurate model of the system dynamics, which is difficult to obtain in practice.

Another common approach for detecting unsafe states employs supervised learning to train a classifier model for labeling states as unsafe, and then the classifier hyperparameters are adjusted until the empirical false negative rate is low. In practice this is typically accomplished by plotting a receiver operating characteristic (ROC) curve and tuning the classification threshold to achieve a low false negative rate. However, this approach requires training a new classification model, and provides no performance guarantees.

To guarantee the false negative rate of a learned warning system, the standard statistical learning framework could be used under standard i.i.d. assumptions (von Luxburg and Schölkopf, 2011). A practitioner could collect additional data and use a validation dataset to provably certify the false negative rate. However, the key problem is data efficiency, because collecting data for unsafe situations can be very expensive (Foody, 2009; von Luxburg and Schölkopf, 2011; Calafiore and Campi, 2006).

1.2 Contributions

Main Question

Can we tune a warning system and guarantee a low false negative rate with only a handful of data points? For example, with only 30 data samples of dangerous situations, can we tune a warning system to have a provable 5% false negative rate? This problem is easy if we allow trivial systems that always issue a warning, but such systems are not practically useful. If we restrict our attention to non-trivial systems, this problem is seemingly impossible because even if a fixed warning system successfully identifies all 30 dangerous situations, due to statistical fluctuations, we cannot prove that its false negative rate is less than 5% (with high confidence). If proving that a fixed predictor achieves safety is difficult, tuning a predictor to provably achieve safety seems only more challenging.
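The difficulty described above can be checked with a short calculation. The sketch below assumes the 30 dangerous samples are i.i.d. and considers a hypothetical fixed system whose true false negative rate is exactly 5%; the 5% significance level used in the second step is an assumption made here for illustration, not a number from the text.

```python
import math

fnr = 0.05  # hypothetical true false negative rate of a fixed system
n = 30      # number of dangerous validation samples, assumed i.i.d.

# Chance that such a system still raises an alert on all n samples.
p_all_detected = (1 - fnr) ** n
print(f"P(all {n} detected | FNR = {fnr}) = {p_all_detected:.3f}")

# Smallest number of samples for which a perfect record would be this
# unlikely at significance level alpha (alpha is an assumption of this sketch).
alpha = 0.05
n_needed = math.ceil(math.log(alpha) / math.log(1 - fnr))
print(f"perfect detections needed to reject FNR >= {fnr}: {n_needed}")
```

A system with a 5% false negative rate passes all 30 checks roughly one time in five, so a perfect record on 30 samples cannot certify a rate below 5% with high confidence. Under these assumptions, about twice as many samples would be required.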

Our Contribution

We answer our main question affirmatively. We adapt a statistical inference framework known as conformal prediction to a robotics setting in order to tune systems to achieve provable safety guarantees (e.g. 5% false negative rate) with extremely limited data (e.g. 30 samples). We only require a single assumption: the training samples are exchangeable with each test sample, i.e. for each test sample, if we permute the concatenated sequence of the training samples and the test sample, there is no reason to believe that any permutation is more or less likely to occur. This is a weaker assumption than the i.i.d. assumption typically used in the statistical learning framework.

The assumption is reasonable in practice, even in situations in which the statistical inference assumptions fail (e.g. situations with temporal correlations between different test samples). In a driving scenario, for instance, the training dataset could include scene snippets sampled from different scenes, and these will be i.i.d. At test time, there may be one scene with several snippets. These snippets are obviously not independent; however, they are individually exchangeable with the training dataset.

While this answer seems too good to be true, the key insight here is that we provide a type of guarantee that is different from standard statistical learning guarantees. Consider a sequence of test samples $Z_1,\cdots,Z_N$, and event indicators $F_1,\cdots,F_N$ for whether our warning system fails on each test sample.

  • Statistical learning guarantee. In the statistical learning framework, we assume that the test samples $Z_1,\cdots,Z_N$ are i.i.d., so the failure events $F_1,\cdots,F_N$ are also i.i.d.; we guarantee the failure probability for a sequence of i.i.d. failure events.

  • Conformal prediction guarantee. In the conformal prediction framework, the test samples are not necessarily i.i.d., so the failure events $F_1,\cdots,F_N$ can be correlated; we guarantee the marginal failure probability for each failure event. In other words, we know that each test sample has a low probability of failure (i.e. $F_n=1$ with low probability), but the failures could be correlated. For example, conditioning on $F_n=1$ might increase or decrease the probability that $F_{n+1}=1$ (while in the i.i.d. case, $F_n$ and $F_{n+1}$ are independent events).
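The contrast between the two guarantees can be written out for a pair of consecutive failures. This is a sketch in the notation above. It assumes both frameworks bound the per-sample failure probability by the same level $\epsilon$; the joint bounds are elementary consequences added here for illustration.

```latex
% Both frameworks bound each failure marginally:
\Pr[F_n = 1] \le \epsilon \quad \text{for each } n.

% Under i.i.d. failures, the joint probability factorizes:
\Pr[F_n = 1,\ F_{n+1} = 1] = \Pr[F_n = 1]\,\Pr[F_{n+1} = 1] \le \epsilon^2.

% With only the marginal guarantee, the best general bound is
\Pr[F_n = 1,\ F_{n+1} = 1] \le \min\{\Pr[F_n = 1],\ \Pr[F_{n+1} = 1]\} \le \epsilon,
% which is attained when the two failures always occur together.
```

The marginal guarantee therefore controls how often any single sample fails, but it does not by itself rule out failures that cluster.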

The usefulness of the conformal guarantee depends on the intended application. Consider the driver alert system example: for individual drivers, collisions are rare and most drivers will not encounter more than one. Hence, there is little reason to worry about whether the warning failures are correlated between collisions. In other words, the conformal guarantee can convey confidence to individual users who rarely encounter multiple failures.

On the other hand, the conformal guarantee may convey less confidence to a company with a large fleet of vehicles. For example, if $F_n=1$ increases the probability that $F_{n+1}=1$, then it is possible to have multiple simultaneous failures. However, this is not a limitation of our method, but rather an unavoidable consequence of the weaker (not i.i.d.) assumptions: if the test data is correlated (which we have no control over), then failure events of a warning system are inherently correlated. The weaker assumption is usually necessary because most robotics applications are deployed in time series or sequential decision making setups, so data from nearby time steps are correlated and not i.i.d. Since standard statistical learning guarantees are not applicable due to violation of the i.i.d. assumption, having some (conformal) guarantee is better than none.

Furthermore, we will show empirically in Section 4 that failures are not highly correlated on two real-world driving datasets. Therefore, despite the lack of formal guarantees, there is strong empirical evidence suggesting that simultaneous failures do not occur in practice.

Thus, our contribution is four-fold:

  1. We introduce a new notion of safety guarantee that is satisfactory for many use cases and has extremely good sample efficiency.

  2. We show how to leverage the statistical inference tool of conformal prediction for robotics applications.

  3. We instantiate a framework for applying conformal prediction to robotic safety.

  4. We validate our framework experimentally on both a driver alert safety system and a robotic grasping system, showing that the conformal guarantees hold in practice, without issuing too many false positive alerts (e.g. less than 1% for many setups).

A preliminary version of this article was presented at the 2022 Workshop on the Algorithmic Foundations of Robotics (Luo et al., 2022). In this revised and extended version, we additionally contribute: (1) additional exposition on the surrogate safety score in our proposed framework, (2) proofs for the propositions in Section 3, (3) additional exposition on the details of our experimental setups, (4) experimental results demonstrating the tradeoff between $\epsilon$ and the false positive rate (FPR) when there are few samples, (5) experimental results demonstrating empirically that failures are not highly correlated in a real-world driving setting, and (6) experimental results demonstrating that worse surrogate safety scores lead to a performance drop in terms of the false positive rate (but no performance drop in terms of the false negative rate).

1.3 Organization

The rest of this paper is organized as follows. In Section 2, we review conformal prediction. In Section 3, we describe our problem setup, introduce our framework, and demonstrate that specific choices for elements of our framework lead to instantiations such as tuning an ROC curve threshold to limit false negatives (though we enrich this classic method with new guarantees). We then explain the differences between the conformal prediction guarantees and the statistical learning guarantees, and discuss when our guarantees should be applied. Finally, in Sections 4 and 5, we evaluate our framework on a driver alert safety system and on a robotic grasping system.

2 Overview of Conformal Prediction

This section provides an overview of conformal prediction, the general framework that we adapt for robotics safety. It may be skipped without breaking the flow of the paper.

Consider a prediction problem where the input feature is denoted by $X$ and the label is denoted by $Z$. Conformal prediction (Shafer and Vovk, 2008) is a class of methods that can produce prediction sets (i.e. a set of labels), such that the true label belongs to the predicted set with high probability. In its standard form, conformal prediction requires two components: a sequence of validation data $(X_1,Z_1),\cdots,(X_T,Z_T)$ and a non-conformity score $\psi$, which is any function from the input feature $X$ and the label $Z$ to a real number. Intuitively, the non-conformity score should measure the “unusualness” of the label $Z$ when the input feature is $X$. An example non-conformity score is $\psi(X,Z)=|h(X)-Z|$ where $h$ is some fixed prediction function; intuitively, $Z$ is “unusual” if the prediction function has large error.

The conformal prediction algorithm computes the non-conformity score for all samples in the validation set. Given a new test example with input feature $\hat{X}$, the conformal prediction algorithm then “tries” all possible labels $z$, and measures the non-conformity score $\psi(\hat{X},z)$. A label is rejected if the computed non-conformity score is greater than a $1-\epsilon$ fraction of the non-conformity scores in the validation set. Any label that is not rejected is included in the prediction set. Intuitively, the true label is unlikely to have a non-conformity score higher than a $1-\epsilon$ fraction of the validation samples; hence the true label is unlikely to get rejected.

If the training data and the new test data point $(\hat{X},\hat{Z})$ are exchangeable, i.e. the probability of observing any permutation of $(X_1,Z_1),\cdots,(X_T,Z_T),(\hat{X},\hat{Z})$ is equally likely, then conformal prediction has very strong validity guarantees: the true label will be within the prediction set with probability $1-\epsilon\pm 1/(T+1)$. We note that this guarantee holds regardless of the non-conformity function $\psi$.
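The procedure can be sketched in a few lines for the example score $\psi(X,Z)=|h(X)-Z|$. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the common split-conformal rule with a $\lceil(1-\epsilon)(T+1)\rceil$-th order statistic as the rejection cutoff, a finite list of candidate labels in place of "all possible labels", and a hypothetical function name `conformal_prediction_set`. The predictor `h` is assumed to accept NumPy arrays.

```python
import math
import numpy as np

def conformal_prediction_set(val_x, val_z, x_new, candidate_labels, h, eps):
    """Return the candidate labels that are not rejected at level eps."""
    # Non-conformity scores psi(X_t, Z_t) = |h(X_t) - Z_t| on the validation set.
    scores = np.sort(np.abs(h(np.asarray(val_x)) - np.asarray(val_z)))
    T = len(scores)
    # Index of the order statistic used as the rejection cutoff.
    k = math.ceil((1 - eps) * (T + 1))
    if k > T:
        # Too few validation points for this eps: no label can be rejected.
        return list(candidate_labels)
    cutoff = scores[k - 1]
    # Keep every label whose score does not exceed the cutoff.
    return [z for z in candidate_labels if abs(h(x_new) - z) <= cutoff]
```

Note that the prediction set degenerates to every candidate label when $T$ is too small relative to $1/\epsilon$. This is the prediction-set analogue of a warning system that always issues a warning.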

There are many extensions of conformal prediction, and the most relevant extension to our safety application is Mondrian conformal prediction (Vovk et al., 2003, 2005), which partitions the input data into several categories such that each data point belongs to exactly one category, and guarantees validity separately for each category. Our work is based on Mondrian conformal prediction; because we wish to limit the false negative rate in warning systems, we need class-conditional validity for samples in the “unsafe” class.

Works that apply conformal prediction to classification models in learned systems include Angelopoulos et al. (2020), Ghosh et al. (2023), and Angelopoulos et al. (2022). Angelopoulos et al. (2020) and Ghosh et al. (2023) apply conformal prediction to image classifiers to obtain a predictive set containing the true label with a user-specified probability. Angelopoulos et al. (2022) studies conformal prediction in the context of recommender systems, and uses it to produce a set of user recommendations.

Works that apply conformal prediction to robotics settings include Chen et al. (2020), Cai and Koutsoukos (2020), Nouretdinov et al. (2011), and Gammerman et al. (2008). Chen et al. (2020) uses conformal prediction to predict a set of possible future motion trajectories from a set of 17 basis trajectories; Cai and Koutsoukos (2020) uses some ideas from conformal prediction for detecting out-of-distribution samples in cyber-physical systems; and Nouretdinov et al. (2011) and Gammerman et al. (2008) use conformal prediction for medical diagnosis. However, these works target very different problems, while we consider the problem of warning systems and provide a general framework for using conformal prediction on a variety of robotics applications.

3 Conformal Prediction Framework for Robotics Applications

3.1 Problem Setup

We consider a model-based planning application where we have some existing simulator or model, and given the current observations (denoted by random variable $X$), the simulator or model predicts the future states of the system (denoted by $Y$) in the absence of a warning. For instance, many applications have off-the-shelf simulators: an autonomous driving software might simulate the future trajectories of all traffic participants (up to some time horizon), or an aircraft control software might forward simulate the dynamics of the aircraft. We will use the random variable $Z$ to denote the true unknown future states of the system in the absence of a warning, e.g. the true future trajectories of traffic participants, or the true future dynamics of an aircraft.

In our setup, depending on the model or simulator available, $Y$ could have the same type as $Z$ (e.g. both $Y$ and $Z$ are random variables that represent the future trajectories of traffic participants), or $Y$ could have a different type from $Z$ (e.g. $Y$ might represent some but not all aspects about the future, such as the direction of movement for traffic participants, or the distance from collision).

Figure 2 shows a simplified illustration of our running driver alert system example. In this scene, there is an ego-agent (shown in blue), and an external agent (shown in red), whose position we would like to predict. Here, $X$ represents the current locations of the agents in the scene (i.e. the location of the red car at the bottom of the figure), $Y$ represents the predicted future locations (in this illustration our model predicts that the red car will move up and to the left), and $Z$ represents the true unknown future locations (perhaps the red car actually moves up and to the right).

Figure 2: Overview of our problem setup: in this simplified example, there is an ego-agent (shown in blue), and an external agent (shown in red), whose position we would like to predict. $X$ represents the current location of the external agent, $Y$ represents the predicted future location of the external agent, and $Z$ represents the true future location of the external agent. $f_0$ represents the distance threshold at which two cars are considered too close together (i.e. the situation is considered unsafe), and $f$ and $g$ are both the distance to the nearest car.

3.1.1 Assessing Safety

We assume that if we know the true future state of the system $Z$, we can assess whether it is safe or not. Specifically, there exists some safety score denoted by $f(Z)$; we specify some threshold (denoted by $f_0$), and wish to be alerted if the safety score drops below this threshold (i.e. if $f(Z)<f_0$). In other words, a situation is defined to be unsafe if the safety score $f(Z)$ is too low. Most applications have natural safety scores. For instance, an autonomous driving safety score $f$ could be the distance to or time to collision; an aircraft control safety score could be the (negative) absolute difference between the orientation of the aircraft and its ideal orientation.

In addition, we assume that the user provides a surrogate safety score $g:Y\mapsto\mathbb{R}$ that maps from the simulator prediction to a “safety score,” where a higher score indicates “safe” and a lower score indicates “unsafe”. Ideally the surrogate safety score $g(Y)$ should be highly correlated with the true safety score $f(Z)$, but technically $g$ can be any function. None of our technical results depend on any assumptions about $g$; however, the choice of $g$ affects the empirical performance in terms of false positive rate (i.e. how often our warning system issues unnecessary alerts). When $Y$ and $Z$ have the same type, we can simply choose $g:=f$; when $Y$ and $Z$ have different types we need to choose $g$ on a case-by-case basis.

For example, in the driver assistance system in Figure 2, the safety score $f(Z)$ could be the distance to the nearest car (shown by the blue line in the figure). Since the simulator can output predicted distances to other cars, the surrogate safety score $g(Y)$ could also be the distance to the nearest car (shown by the orange line in the figure). The difference between $f$ and $g$ is that the input to $g$ is the predicted future state of the system $Y$ (output by the simulator), rather than the true unknown future state of the system $Z$. In this example, $f_0$ is the safety threshold shown by the red boundary; so if the red and blue cars are too close together, then the situation is considered unsafe.
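A distance-based safety score of this kind might be computed as in the following sketch. The array layout (one ego trajectory of shape `(H, 2)` and other agents of shape `(A, H, 2)` over a horizon of `H` steps), the function name `nearest_distance`, and the threshold value are all assumptions made for illustration. The paper's actual trajectory representation is not specified in this section.

```python
import numpy as np

def nearest_distance(ego_xy, others_xy):
    """Smallest ego-to-agent distance over all agents and all time steps.

    ego_xy:    array of shape (H, 2), ego positions over the horizon.
    others_xy: array of shape (A, H, 2), positions of A other agents.
    """
    # Euclidean distance from the ego to each agent at each time step.
    dists = np.linalg.norm(np.asarray(others_xy) - np.asarray(ego_xy)[None, :, :], axis=-1)
    return float(dists.min())

# When Y and Z have the same type, the same function serves as both scores:
f = nearest_distance  # applied to the true future Z
g = nearest_distance  # applied to the predicted future Y

f0 = 6.0  # hypothetical distance threshold below which a situation is unsafe
```

With these definitions, a situation is labeled unsafe when `f(ego_true, others_true) < f0`, while the warning system only ever sees `g(ego_pred, others_pred)`.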

Figure 3: Given the simulation or model output $Y$, the warning function $w(Y)$ either decides to issue a warning ($w(Y)=1$), or not ($w(Y)=0$).
Figure 4: Whenever the true safety score $f(Z)$ is below $f_0$ (i.e. the red car and the blue car will indeed be too close together), the warning system should issue a warning ($w(Y)=1$) with at least $1-\epsilon$ probability.

3.1.2 Warning Function

We wish to design a warning function (denoted as a function $w(Y)$) that, given the simulation or model output $Y$, decides to issue a warning ($w(Y)=1$) or not ($w(Y)=0$) (see Figure 3). Note that $Y$ depends on previous states, so the warning function implicitly depends on previous observations through $Y$. Formally we define “safety” as the following requirement:

Definition 1.

For some $0<\epsilon<1$, we say that the warning system $w$ is $\epsilon$-safe (with respect to $Y$, $Z$, $f$, and $f_0$) if

$$\Pr[w(Y)=1\mid f(Z)<f_0]\geq 1-\epsilon.$$

In words, whenever the true future safety score $f(Z)$ is below $f_0$, the warning system should issue a warning ($w(Y)=1$) with at least $1-\epsilon$ probability (see Figure 4). Another way to think of this is that the false negative rate is at most $\epsilon$. The main difficulty here is that the warning function $w$ can depend on only the simulated future $Y$ rather than the true future $Z$ (which is not yet observed when the warning is issued), and the simulation might not come with any performance guarantees.

A trivial warning system that always issues a warning (i.e. $w_{\mathrm{trivial}}(Y)\equiv 1$) is always $\epsilon$-safe for any $\epsilon>0$. However, such a warning system is not useful. A useful warning system should issue as few warnings as possible when safe. Therefore, we should also consider its false positive rate

$$\mathrm{FPR}(w)=\Pr[w(Y)=1\mid f(Z)\geq f_0].$$

The false positive rate is of lower priority for safety because issuing an unnecessary warning might only be an inconvenience, while failing to issue a warning when the situation is unsafe can lead to catastrophic outcomes. In summary, our goal is to design a warning function $w(\cdot)$ such that:

Goal: Provably achieve $\epsilon$-safety for small $\epsilon$ (e.g. $0.02$), while achieving a low false positive rate (FPR).
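Both rates can be estimated on held-out data once the true outcomes are known. The sketch below is illustrative only: `empirical_rates` is a hypothetical helper, and it assumes the warning decisions and the true safety scores are available as aligned arrays.

```python
import numpy as np

def empirical_rates(warned, f_true, f0):
    """Estimate (FNR, FPR) from warning decisions and true safety scores.

    warned: 0/1 warning decisions w(Y), one per sample.
    f_true: true safety scores f(Z), one per sample.
    f0:     safety threshold; a sample is unsafe when f(Z) < f0.
    """
    warned = np.asarray(warned, dtype=bool)
    unsafe = np.asarray(f_true) < f0
    # FNR: fraction of unsafe samples that received no warning.
    fnr = float(np.mean(~warned[unsafe])) if unsafe.any() else float("nan")
    # FPR: fraction of safe samples that received a warning.
    fpr = float(np.mean(warned[~unsafe])) if (~unsafe).any() else float("nan")
    return fnr, fpr
```

An $\epsilon$-safe warning function corresponds to a population FNR of at most $\epsilon$. The empirical estimate returned here fluctuates around the population value when the number of unsafe samples is small.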

3.1.3 Examples

A few examples that illustrate this problem setup are as follows:

  1. In a driver alert system, users may want an assurance that among the instances in which the driver is in a dangerous situation, the system will issue a warning the vast majority of the time. The safety score in this case could be the time to collision (TTC), or the nearest distance from another car.

  2. In a multi-arm robot collaboration system, users may want an assurance that among the instances in which the robot arms may collide, the system will issue a warning the majority of the time. The safety score could be the nearest distance to another robot arm.

  3. In a warehouse robotic box-stacking system, users may want an assurance that among the instances in which the boxes will topple, the system will issue a warning the majority of the time. The safety score could be the probability of a stable stack.

  4. In a coffee shop with a robotic barista, users may want an assurance that among the instances in which the robot may spill hot coffee, the system will issue a warning the majority of the time. The safety score could be the probability of a successful pour.

Examples 3 and 4 can be thought of as ROC curve threshold tuning. If the model used is a binary classifier that predicts whether there is a stable stack, we can use the predicted probability of a “safe” outcome as $g$. Note that in this special case, our method also tunes the threshold, but adds guarantees on the false negative rate and practical guidelines for sample complexity.
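Threshold tuning with a finite-sample guarantee can be sketched as follows. This is a generic split-conformal quantile rule applied to the surrogate scores of unsafe validation samples, and it is not necessarily identical to the paper's Algorithm 1, which is not reproduced in this section. The names `conformal_warning_threshold` and `warn` are hypothetical, and the sketch assumes the unsafe validation scores are exchangeable with the score of an unsafe test sample.

```python
import math
import numpy as np

def conformal_warning_threshold(g_unsafe, eps):
    """Pick tau so that an unsafe test sample has g(Y) <= tau with prob. >= 1 - eps.

    g_unsafe: surrogate scores g(Y_t) of T validation samples that turned out unsafe.
    """
    g_sorted = np.sort(np.asarray(g_unsafe, dtype=float))
    T = len(g_sorted)
    # Rank of the order statistic that covers a (1 - eps) share of T + 1 ranks.
    k = math.ceil((1 - eps) * (T + 1))
    if k > T:
        # Not enough unsafe samples for this eps: fall back to always warning.
        return math.inf
    return float(g_sorted[k - 1])

def warn(g_y, tau):
    """Warning function w(Y): 1 if the surrogate score is at or below tau."""
    return int(g_y <= tau)
```

Under these assumptions, 30 unsafe samples and $\epsilon=0.05$ give a threshold equal to the largest unsafe validation score, so safe situations with higher surrogate scores raise no alert. With only 10 unsafe samples the threshold is infinite and the rule reduces to the trivial always-warn system, which illustrates why on the order of $1/\epsilon$ samples are needed.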

3.2 Analysis of the Trade-off Between the FNR and FPR

In this section, we analyze the fundamental trade-off between the false negative rate (FNR) and the false positive rate (FPR) of a warning system. For example, a trivial system that always issues a warning will have a 0% FNR but 100% FPR. Conversely, a system that never issues a warning will have a 0% FPR but 100% FNR. This suggests a trade-off between the achievable FNR and FPR.

3.2.1 Infinite Validation Data Regime

Even with infinite validation data, we may not be able to achieve both perfect FNR and perfect FPR because of inherent limitations of the safety score. For instance, at one extreme, if the safe and unsafe examples have identical safety score distributions, then there is no way to distinguish them according to Definition 1. At the other extreme, if the safe and unsafe examples have disjoint safety score distributions, then we can distinguish them perfectly (i.e. achieve 0% FPR and 0% FNR). A typical real world scenario will likely fall somewhere in between the two extremes, as illustrated in Figure 5. The key quantity is the amount of overlap between the safety score distribution for safe vs. unsafe examples, which will dictate the optimal achievable trade-off between the FPR and the FNR.

Figure 5: Even in the limit of infinite validation data, the best false positive rate achievable (for a given $\epsilon$-safety level) is determined by the distribution of the safe samples and the unsafe samples under the surrogate safety score function $g$.

3.2.2 Finite Validation Data Regime

The lack of sufficient validation data is another source of error that degrades the best achievable FNR/FPR trade-off. Intuitively, because we need to provably guarantee the FNR for $\epsilon$-safety, in the absence of sufficient validation data, we must be conservative and issue more warnings than necessary. For example, with zero validation data we have no choice but to issue a warning for nearly every example, leading to a high FPR. In fact, we show in Proposition 2 in Section 3.3 that if we have fewer than $T$ data samples, we cannot guarantee better than $O(1/T)$ FNR without incurring an FPR of close to $1$ for any distribution-free warning function that depends only on the ordering of the $g(Y)$ values.

Our conformal algorithm can guarantee an $O(1/T)$ FNR while the FPR is not much higher than in the infinite data regime, demonstrating the (asymptotic) optimality of the conformal algorithm presented in Algorithm 1.

3.3 Algorithm to Achieve Guaranteed Safety Assurances

In this section, we will describe an algorithm that achieves $\epsilon^{*}$-safety with a low (nontrivial) false positive rate. The setup is as described in Section 3.1, with current observations $X$, simulator $Y$, and true unknown future states $Z$.

If the simulation $Y$ is perfect and has the same type as the ground truth future state $Z$, i.e. $Z = Y$ almost surely, then we can simply set $g = f$ and choose $w(Y) = \mathbb{I}(g(Y) < f_0)$, and this $w$ will automatically satisfy our definition of safety. However, in most applications, it is difficult to provide any guarantees on the accuracy of the simulation. For example, in autonomous driving situations, traffic participants can behave in unexpected and hard-to-predict ways.

When we are uncertain about the simulation accuracy, we will require an additional training dataset. With a dataset of (simulated future state, true future state) pairs $(Y_1, Z_1), (Y_2, Z_2), \cdots, (Y_T, Z_T)$, where $T$ is the number of samples, we can guarantee $\epsilon$-safety. Let $(\hat{Y}, \hat{Z})$ denote a new test sample. We require only a single assumption on the dataset:

Assumption 1.

The sequence $(Y_1, Z_1), (Y_2, Z_2), \cdots, (Y_T, Z_T), (\hat{Y}, \hat{Z})$ is exchangeable, i.e. every permutation of the sequence is equally likely to be observed.

Exchangeability is a strong assumption. However, it is weaker than typical i.i.d. assumptions that underlie most machine learning methods with performance guarantees: if a sequence of data is i.i.d., then it is also exchangeable. In addition, if the distribution shifts, it is not prohibitively costly to collect a new training dataset from the shifted distribution. This is because we require only a very small dataset (e.g. in most of our experiments, the training dataset contains only about 50 examples of unsafe situations).

Based only on Assumption 1, we design an algorithm to guarantee safety on test data. The algorithm can be thought of as an instantiation of the conformal prediction framework.

Algorithm 1 Approximate $\epsilon$-safety
1: Input: Training dataset $(Y_1, Z_1), (Y_2, Z_2), \cdots, (Y_T, Z_T)$; surrogate safety score $g$; true safety score $f$; threshold $f_0$; a new simulation $\hat{Y}$
2: Output: $\{0, 1\}$
3: Compute
$$\mathcal{A} = \big\{ g(Y_t) \colon f(Z_t) < f_0,\ t = 1, \cdots, T \big\}.$$
4: Sample $U$ uniformly from
$$U \sim \big\{ 0, 1, \cdots, |\{ a \in \mathcal{A} \colon a = g(\hat{Y}) \}| \big\}.$$
5: Compute
$$q = \frac{|\{ a \in \mathcal{A} \colon a < g(\hat{Y}) \}| + U + 1}{|\mathcal{A}| + 1}.$$
6: If $q \leq 1 - \epsilon$ then output 1, otherwise output 0.

Intuitively, the procedure is as follows. We first compute a predicted safety score (based on the simulator outputs) for each unsafe sample in the training dataset (Line 3). We then sample an integer uniformly between $0$ and $|\{a \in \mathcal{A} \colon a = g(\hat{Y})\}|$, the number of unsafe training data points whose surrogate safety score equals the surrogate safety score $g(\hat{Y})$ of the new simulation (Line 4). (Footnote 2: This randomization factor $U$ is simply a small correction factor that guarantees exact coverage in case of ties; see Vovk et al. (2005).) We next compute the quantile value for the new test simulation, i.e. the proportion of unsafe training samples with a lower surrogate safety score than $\hat{Y}$, adjusted by the small randomization factor from the previous step (Line 5). If this quantile value is at most $1 - \epsilon$ (i.e. at most a $1 - \epsilon$ fraction of the unsafe samples from the training set have a lower safety score), we say that this may be an unsafe situation and issue an alert; otherwise, we say that it is safe (Line 6).
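The procedure can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `conformal_warning`, its argument layout, and the toy data are hypothetical, and the scores `g` and `f` are passed in as plain callables.

```python
import random


def conformal_warning(train_pairs, g, f, f0, y_new, eps, rng=random):
    """Return 1 (issue a warning) or 0 (no warning) for the new simulation y_new."""
    # Surrogate scores of the training samples whose true outcome was unsafe.
    unsafe_scores = [g(y) for y, z in train_pairs if f(z) < f0]
    s_new = g(y_new)
    n_less = sum(1 for a in unsafe_scores if a < s_new)
    n_equal = sum(1 for a in unsafe_scores if a == s_new)
    # Randomized tie-breaking term U; randint includes both endpoints.
    u = rng.randint(0, n_equal)
    # Quantile of the new score among the unsafe training scores.
    q = (n_less + u + 1) / (len(unsafe_scores) + 1)
    # Warn unless the new score ranks above a 1 - eps fraction of unsafe scores.
    return 1 if q <= 1 - eps else 0


def identity(v):
    return v


# Toy data (hypothetical): simulated and true scores coincide, and
# four of the five training samples are unsafe (true score below f0 = 0.5).
train = [(0.1, 0.1), (0.2, 0.2), (0.3, 0.3), (0.4, 0.4), (0.9, 0.9)]

low = conformal_warning(train, identity, identity, 0.5, y_new=0.05, eps=0.25)
high = conformal_warning(train, identity, identity, 0.5, y_new=0.95, eps=0.25)
print(low, high)  # 1 0
```

In the toy run, the low surrogate score gives $q = 1/5$ and triggers a warning, while the high score gives $q = 1$ and does not.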

The following proposition shows that Algorithm 1 can guarantee safety.

Proposition 1.

Algorithm 1 is $\big(\epsilon + 1/(1 + |\mathcal{A}|)\big)$-safe (with respect to $\hat{Y}$ and $\hat{Z}$), under Assumption 1.

Proof.

Given a dataset

$$(Y_1, Z_1), (Y_2, Z_2), \cdots, (Y_T, Z_T),$$

where each data point contains the (simulated future state, true future state), denote the subset of “unsafe” data as

$$(Y_{c_1}, Z_{c_1}), (Y_{c_2}, Z_{c_2}), \cdots, (Y_{c_M}, Z_{c_M}),$$

where $Z_{c_t}$ is the $t$-th unsafe example (i.e. $f(Z_{c_t}) < f_0$). For typographical clarity throughout this proof, we use $M$ as shorthand to represent the number of unsafe data points,

$$M = |\mathcal{A}| = \big|\big\{ t \colon f(Z_t) < f_0 \big\}\big|,$$

and we use $N$ to represent the number of unsafe data points with surrogate safety score less than $g(\hat{Y})$,

$$N = \big|\big\{ t \colon g(Y_{c_t}) < g(\hat{Y}) \big\}\big| = \big|\big\{ a \in \mathcal{A} \colon a < g(\hat{Y}) \big\}\big|.$$

Suppose that $\hat{Z}$ is also unsafe, i.e. $f(\hat{Z}) < f_0$. Let $\lbag \cdot \rbag$ denote an unordered bag (i.e. it is a set that can have repeated elements). We use $B$ to represent the unordered bag of unsafe data,

$$B = \lbag (Y_{c_1}, Z_{c_1}), \cdots, (Y_{c_M}, Z_{c_M}), (\hat{Y}, \hat{Z}) \rbag.$$

To bound the safety, we first note that the probability of a false negative is given by

$$\begin{aligned}
\Pr[w(\hat{Y}) = 0] &= \Pr[q > 1 - \epsilon] \\
&= \mathbb{E}\Big[ \Pr\big[ q > 1 - \epsilon \,\big|\, B \big] \Big] \qquad \text{(Tower)} \\
&= \mathbb{E}\Big[ \Pr\big[ N + U > (1 - \epsilon)(M + 1) - 1 \,\big|\, B \big] \Big].
\end{aligned}$$

By the assumption of exchangeability, we are equally likely to observe any permutation of $B$. Intuitively, $g(\hat{Y})$ is equally likely to be the largest, 2nd largest, etc., among $g(Y_{c_1}), \cdots, g(Y_{c_M}), g(\hat{Y})$. Formally, the random variable $N + U$ takes on all values $\{0, 1, \cdots, M\}$ with equal probability. (Footnote 3: Note that the maximum value of $U$ is the number of unsafe training data points with surrogate safety scores that are equal to the surrogate safety score of the new simulation, and so $N + U \leq |\{a \in \mathcal{A} \colon a < g(\hat{Y})\}| + |\{a \in \mathcal{A} \colon a = g(\hat{Y})\}| \leq M$.) Therefore,

$$\begin{aligned}
\Pr\Big[ N + U > (1 - \epsilon)(M + 1) - 1 \,\Big|\, B \Big]
&= 1 - \Pr\Big[ N + U \leq (1 - \epsilon)(M + 1) - 1 \,\Big|\, B \Big] \\
&\leq 1 - \frac{\lceil (1 - \epsilon)(M + 1) - 1 \rceil}{M + 1} \\
&= \frac{\lfloor M + 1 - (1 - \epsilon)(M + 1) + 1 \rfloor}{M + 1} \\
&\leq \frac{1 + \epsilon M + \epsilon}{M + 1} \\
&= \epsilon + \frac{1}{M + 1}.
\end{aligned}$$

We can combine this result with the previous statement of the probability of a false negative to get

$$\begin{aligned}
\Pr[w(\hat{Y}) = 0]
&= \mathbb{E}\Big[ \Pr\big[ N + U > (1 - \epsilon)(M + 1) - 1 \,\big|\, B \big] \Big] \\
&\leq \mathbb{E}\left[ \epsilon + \frac{1}{M + 1} \right] \\
&= \epsilon + \frac{1}{M + 1}
\end{aligned}$$

as required. ∎
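The bound can be checked by direct enumeration. The sketch below assumes, as in the proof, that $N + U$ is uniform on $\{0, 1, \cdots, M\}$ for an unsafe test point, and counts the ranks at which Algorithm 1 stays silent. The function name `exact_fnr` is hypothetical, and exact rational arithmetic is used only to avoid floating-point edge cases at the threshold.

```python
from fractions import Fraction


def exact_fnr(num_unsafe, eps):
    """Exact miss probability for an unsafe test point, with N + U uniform on {0, ..., M}."""
    m = num_unsafe
    # q = (N + U + 1) / (M + 1); no warning is issued when q > 1 - eps.
    misses = sum(1 for k in range(m + 1) if Fraction(k + 1, m + 1) > 1 - eps)
    return Fraction(misses, m + 1)


eps = Fraction(1, 10)
for m in (5, 19, 50, 200):
    fnr = exact_fnr(m, eps)
    bound = eps + Fraction(1, m + 1)
    # Columns: M, exact FNR, bound from Proposition 1, whether the bound holds.
    print(m, float(fnr), float(bound), fnr <= bound)
```

For every $M$ tried, the exact false negative rate stays at or below $\epsilon + 1/(M+1)$, and the gap shrinks as $M$ grows.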

To use Proposition 1 to provide safety guarantees, we choose $\epsilon$ based on the number of samples available $|\mathcal{A}|$. Specifically, if the desired safety level is $\epsilon^{*}$, then we can choose any $\epsilon < \epsilon^{*}$ in Algorithm 1 such that

$$\epsilon + 1/(1 + |\mathcal{A}|) \leq \epsilon^{*}. \tag{1}$$

In other words, if our choice of $\epsilon$ satisfies Eq. (1), then Algorithm 1 will be $\epsilon^{*}$-safe. Intuitively, choosing a large $\epsilon$ decreases the false positive rate (FPR). This is because according to Algorithm 1 Line 6, choosing a larger $\epsilon$ decreases the number of times that a warning is output. Therefore, based on the number of samples in the training dataset $|\mathcal{A}|$, we choose the largest $\epsilon$ that satisfies Eq. (1); i.e. we choose

$$\epsilon = \epsilon^{*} - 1/(1 + |\mathcal{A}|).$$

We will call $1/(1 + |\mathcal{A}|)$ the discretization error.

Proposition 1 also reveals the sample complexity of the conformal prediction algorithm. If the number of unsafe examples is too small ($|\mathcal{A}| \leq 1/\epsilon^{*} - 1$), then we must choose $\epsilon \leq 0$ to ensure $\epsilon^{*}$-safety according to Proposition 1. Algorithm 1 with $\epsilon \leq 0$ will trivially always output $1$ (i.e. always issue a warning), since $q \leq 1$ by construction. On the other hand, if the number of unsafe examples exceeds the threshold ($|\mathcal{A}| > 1/\epsilon^{*} - 1$), then there will be an $\epsilon > 0$ that ensures $\epsilon^{*}$-safety according to Proposition 1. Consequently, Algorithm 1 will not be trivial. In practice, we find that to get good results and a low false positive rate, it is sufficient to have a sample count $|\mathcal{A}|$ that exceeds the threshold by a small margin, such as $|\mathcal{A}| = 1.5/\epsilon^{*} - 1$. For example, to guarantee a 5% false negative rate ($\epsilon^{*} = 0.05$), it is sufficient to have only about 30 unsafe examples.
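The arithmetic behind these thresholds is short enough to write out. The helpers below are a hypothetical sketch (the names `choose_eps` and `min_nontrivial_count` are not from the paper); they compute the largest $\epsilon$ allowed by Eq. (1) and the smallest number of unsafe examples for which that $\epsilon$ is positive.

```python
from fractions import Fraction
import math


def choose_eps(eps_star, num_unsafe):
    """Largest eps satisfying Eq. (1): eps + 1 / (1 + |A|) <= eps_star."""
    return eps_star - Fraction(1, 1 + num_unsafe)


def min_nontrivial_count(eps_star):
    """Smallest |A| with |A| > 1 / eps_star - 1, i.e. the smallest count giving eps > 0."""
    return math.floor(1 / eps_star - 1) + 1


eps_star = Fraction(5, 100)  # desired safety level of 5%

print(min_nontrivial_count(eps_star))  # 20
print(choose_eps(eps_star, 19))        # 0, so the algorithm always warns
print(choose_eps(eps_star, 29))        # 1/60, about 0.0167
```

With $\epsilon^{*} = 0.05$, 19 unsafe examples leave no room after the discretization error, 20 is the first nontrivial count, and the suggested $1.5/\epsilon^{*} - 1 = 29$ examples leave $\epsilon = 1/60$.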

Proposition 2 demonstrates that Algorithm 1 is asymptotically optimal, since it can guarantee an $O(1/T)$ FNR while the FPR is not much higher than in the infinite data regime.

Proposition 2.

There is no $\epsilon$-safe warning system based on the ordering of $g(Y)$ values that can achieve a false positive rate lower than $1 - (1 + T)\epsilon$.

We conjecture that Proposition 2 holds for all functions, but we prove it for the fairly general class of functions specified by Equation 5 below, comparing the surrogate safety score for our new test sample $g(\hat{Y})$ relative to the $g(Y_1), \cdots, g(Y_T)$ values.

Proof.

Consider a function $w$ that maps a dataset $\mathcal{D} = (g(Y_1), Z_1), \cdots, (g(Y_T), Z_T)$ of unsafe examples, and a new data point $g(\hat{Y})$, to $\{0, 1\}$. Let $w$ be a warning function that gives a distribution-free false negative rate guarantee that depends only on the ordering between $g(Y_1), \cdots, g(Y_T), g(\hat{Y})$ (rather than on their specific values). In other words, $w$ takes the form defined by

$$w(\mathcal{D}, \hat{Y}) = \begin{cases} \phi\big( \big|\big\{ t \colon g(\hat{Y}) < g(Y_t) \big\}\big| \big), & \text{with probability } \gamma \\ 1, & \text{with probability } 1 - \gamma \end{cases} \tag{5}$$

for some deterministic function $\phi$ and real number $\gamma$. We know that when the data is exchangeable, the random variable $|\{ t \colon g(\hat{Y}) < g(Y_t) \}|$ is uniformly distributed on $\{0, 1, \cdots, T\}$.

Case 1: Suppose $\phi$ outputs the value $0$ (no warning) for at least one possible input. Then the false negative rate satisfies

$$\text{FNR} \geq \gamma / (1 + T), \tag{6}$$

and the false positive rate satisfies

$$\text{FPR} \geq 1 - \gamma, \tag{7}$$

so combined we have

$$\text{FPR} \geq 1 - \gamma \geq 1 - (1 + T)\,\text{FNR} \geq 1 - (1 + T)\epsilon. \tag{8}$$

Case 2: Suppose $\phi$ outputs the value $0$ for none of the inputs (i.e. it always issues a warning). Then the false negative rate and false positive rate are given by

$$\text{FNR} = 0, \quad \text{FPR} = 1, \tag{9}$$

so we would still (trivially) have $\text{FPR} \geq 1 - (1 + T)\epsilon$.

Thus, if $w$ takes the form of Equation 5, then the false positive rate must be lower bounded by $1 - (1 + T)\epsilon$. In other words, when $\epsilon = o(1/T)$, the false positive rate tends to $1$ as $T$ grows. ∎
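To see how quickly this lower bound becomes restrictive, the sketch below evaluates $1 - (1 + T)\epsilon$ for a target $\epsilon$ that shrinks faster than $1/T$. The function name `fpr_lower_bound` and the choice $\epsilon = 1/T^2$ are illustrative assumptions; the bound is clipped at zero because a rate cannot be negative.

```python
def fpr_lower_bound(num_samples, eps):
    """Lower bound 1 - (1 + T) * eps on the FPR from Proposition 2, clipped at 0."""
    return max(0.0, 1.0 - (1 + num_samples) * eps)


for t in (10, 100, 1000):
    eps = 1.0 / t**2  # hypothetical target FNR with eps = o(1/T)
    print(t, fpr_lower_bound(t, eps))

# When eps is well above 1 / (1 + T), the bound is vacuous.
print(fpr_lower_bound(10, 0.5))
```

The printed bounds climb toward 1 as $T$ increases, while a target $\epsilon$ well above $1/(1+T)$ yields a bound of zero and places no constraint on the FPR.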

3.3.1 Algorithm Design Choices

There are two major components of our algorithm that can be tuned to achieve a tighter FPR: the surrogate safety score, and the trained simulator or prediction model. A surrogate safety score that is better correlated with the true safety score will lead to a better FPR, as will a more accurate simulator model. Note that patterns of inaccuracies in the simulator model will also lead to patterns of errors in the false alarms. For example, an autonomous driving simulator that is particularly inaccurate around yield signs will lead to surrogate safety scores that are not well correlated with the true safety score around yield signs; more alarms will be issued in order to maintain the specified FNR and thus the FPR will be higher.
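The dependence of the FPR on surrogate quality can be sketched with a small simulation. Everything in it is an assumption made for illustration: the true safety score is uniform on $[0, 1)$, the surrogate score is the true score plus Gaussian noise, and `simulate_fpr` is a hypothetical helper that applies the quantile rule of Algorithm 1 to safe test points.

```python
import bisect
import random


def simulate_fpr(noise, eps=0.1, f0=0.2, n_train=2000, n_test=2000, seed=0):
    """Empirical FPR of the conformal warning rule when g(Y) = f(Z) + Gaussian noise."""
    rng = random.Random(seed)

    def draw():
        z = rng.random()                     # true safety score f(Z)
        return z + rng.gauss(0.0, noise), z  # (surrogate score, true score)

    # Sorted surrogate scores of the unsafe training samples.
    unsafe_scores = sorted(s for s, z in (draw() for _ in range(n_train)) if z < f0)
    m = len(unsafe_scores)

    false_pos = 0
    n_safe = 0
    for _ in range(n_test):
        s, z = draw()
        if z < f0:
            continue  # only safe test points count toward the FPR
        n_safe += 1
        lo = bisect.bisect_left(unsafe_scores, s)   # unsafe scores strictly below s
        hi = bisect.bisect_right(unsafe_scores, s)  # unsafe scores at or below s
        u = rng.randint(0, hi - lo)                 # tie-breaking term
        q = (lo + u + 1) / (m + 1)
        if q <= 1 - eps:
            false_pos += 1                          # warning on a safe sample
    return false_pos / n_safe


accurate = simulate_fpr(noise=0.01)  # surrogate tracks the true score closely
noisy = simulate_fpr(noise=1.0)      # surrogate is only weakly informative
print(accurate, noisy)
```

In this toy setting the noisy surrogate issues far more false alarms than the accurate one at the same $\epsilon$, which is the pattern described above.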

The guarantees and analysis of Algorithm 1 will hold regardless of the surrogate safety score and simulator model used. However, if these components are chosen poorly, not enough data is available, or the required $\epsilon^{*}$ is too stringent, then this procedure could become trivial (e.g. always issuing a warning). In practice, however, we find that we are able to obtain good results for an autonomous driving application with an off-the-shelf prediction model for reasonably low $\epsilon^{*}$-values with not too much data (see Section 4).

3.4 Comparing Conformal Prediction with PAC Learning

We further compare the statistical learning and conformal prediction guarantees. We first clarify the notation and formally define the different assumptions. Consider a sequence of training data $(Y_1, Z_1), (Y_2, Z_2), \cdots, (Y_T, Z_T)$ and a sequence of test data $(\hat{Y}_1, \hat{Z}_1), (\hat{Y}_2, \hat{Z}_2), \cdots, (\hat{Y}_N, \hat{Z}_N)$. Let $c_1, \cdots, c_M$ denote the unsafe subsequence of test data, i.e. $(\hat{Y}_{c_1}, \hat{Z}_{c_1}), (\hat{Y}_{c_2}, \hat{Z}_{c_2}), \cdots, (\hat{Y}_{c_M}, \hat{Z}_{c_M})$ is the subsequence of $(\hat{Y}_1, \hat{Z}_1), (\hat{Y}_2, \hat{Z}_2), \cdots, (\hat{Y}_N, \hat{Z}_N)$ such that, for all $m$, $f(\hat{Z}_{c_m}) < f_0$.

Two possible assumptions that we could make on the training and test data sequences are shown in Assumptions 2 and 3. In particular, marginal exchangeability (Assumption 2) is the same as Assumption 1 from the previous section. The only difference here is that we explicitly state that we only require exchangeability with each test data point.

Assumption 2 (Marginal exchangeability).

For each $n$, the sequence $(Y_1, Z_1), (Y_2, Z_2), \cdots, (Y_T, Z_T), (\hat{Y}_n, \hat{Z}_n)$ is exchangeable.

Assumption 3 (Independent and identically distributed).

The training and test data sequences $(Y_1, Z_1), (Y_2, Z_2), \cdots, (Y_T, Z_T)$ and $(\hat{Y}_1, \hat{Z}_1), (\hat{Y}_2, \hat{Z}_2), \cdots, (\hat{Y}_N, \hat{Z}_N)$ are drawn i.i.d. from a common distribution.

Given a warning function, we use the random variables $\hat{F}_{c_1}, \cdots, \hat{F}_{c_M}$ to denote failure of the warning function, i.e. $\hat{F}_{c_m} = \mathbb{I}(w(\hat{Y}_{c_m}) = 0)$. Note that $\hat{F}_{c_m}$ depends on $w$, but we drop this dependence from our notation.

A learning algorithm is a function that takes as input the training data $(Y_1, Z_1), (Y_2, Z_2), \cdots, (Y_T, Z_T)$ and outputs a warning function $w: X \to \{0, 1\}$. There are two main paradigms for designing learning algorithms with guarantees.

PAC Learning: Under Assumption 3, a learning algorithm is $(\epsilon, \delta)$-safe if, with probability $1 - \delta$ (with respect to the randomness of the training data), the learned warning function $w$ satisfies, for some $\epsilon' < \epsilon$,

$$\hat{F}_{c_1}, \cdots, \hat{F}_{c_M} \sim \mathrm{Bernoulli}(\epsilon'). \tag{10}$$

Conformal Learning: For completeness we restate the conformal learning guarantee. A learning algorithm is $\epsilon$-safe if the learned function $w$ satisfies, for some $\epsilon' < \epsilon$,

$$\hat{F}_{c_m} \sim \mathrm{Bernoulli}(\epsilon'), \text{ for all } m = 1, \cdots, M. \tag{11}$$

3.4.1 Comparing Assumptions

Conformal learning requires weaker assumptions. Assumption 2 is much weaker than Assumption 3, and hence is applicable to a much larger class of problems. For example, consider an autonomous driving application where the training data are snippets from randomly sampled driving scenes (no two training data points come from the same driving scene), and the test data $(\hat{Y}_1, \hat{Z}_1), (\hat{Y}_2, \hat{Z}_2), \cdots, (\hat{Y}_N, \hat{Z}_N)$ is a sequence of driving snippets from a random driving scene. The test data points are not independent because they are from the same scene, and hence Assumption 3 is violated. However, Assumption 2 holds because the training data and any single test sample are snippets from randomly sampled driving scenes.

3.4.2 Comparing Sample Complexity

Conformal learning requires $\Theta(1/\epsilon)$ training examples of unsafe situations (Proposition 1), while standard analysis in PAC learning requires $\Theta(1/\epsilon^2)$ examples. For example, consider the following $(\epsilon, \delta)$-safe algorithm: based on the simulation $Y$ and the surrogate safety function $g$, we consider the family of warning functions $w_\theta(Y) = \mathbb{I}(g(Y) < \theta)$. Our goal is to estimate the false negative rate of $w_\theta$ (denoted by $\epsilon^*(\theta)$) for each $\theta$ and select the smallest $\theta$ such that $\epsilon^*(\theta) \leq \epsilon$.

To estimate $\epsilon^*$, we compute the (empirical) false negative rate (denoted by $\hat{\epsilon}(\theta)$) on the training data, i.e.

$$\hat{\epsilon}(\theta) = 1/M \sum_{m} \mathbb{I}(g(Y_{c_m}) \geq \theta) \tag{12}$$

and use a standard concentration inequality (such as Hoeffding) to bound the difference between $\epsilon^*(\theta)$ and $\hat{\epsilon}(\theta)$. Specifically, with probability $1 - \delta$

$$\epsilon^*(\theta) \in \hat{\epsilon}(\theta) \pm \sqrt{\frac{\log(1/\delta)}{2M}}. \tag{13}$$

Note that Eq. (13) is already the tightest bound possible up to constants (Foody, 2009). To verify that $w_\theta$ has a false negative rate that is less than or equal to $\epsilon$, we have to check that

$$\hat{\epsilon}(\theta) + \sqrt{\frac{\log(1/\delta)}{2M}} \leq \epsilon,$$

which, because $\hat{\epsilon}(\theta) \geq 0$, requires

$$\sqrt{\frac{\log(1/\delta)}{2M}} \leq \epsilon \iff M \geq \frac{\log(1/\delta)}{2\epsilon^2}.$$

This means that we must have at least $\Theta(1/\epsilon^2)$ samples.

In words, even a fixed $w_\theta$ requires $\Theta(1/\epsilon^2)$ samples to verify its false negative rate according to Eq. (13). Thus, finding $w_\theta$ to provably achieve low false negative rate should require at least as many, if not more, training examples.
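As a rough numerical illustration of this gap, the sketch below evaluates the Hoeffding requirement $M \geq \log(1/\delta)/(2\epsilon^2)$ next to a $1/\epsilon$ count. The function names are hypothetical, and $\lceil 1/\epsilon \rceil$ is used only as an order-of-magnitude stand-in for the $\Theta(1/\epsilon)$ rate, not as the exact constant of Proposition 1.

```python
import math


def hoeffding_min_samples(eps: float, delta: float) -> int:
    """Smallest M satisfying sqrt(log(1/delta) / (2 M)) <= eps."""
    return math.ceil(math.log(1.0 / delta) / (2.0 * eps ** 2))


def conformal_order_of_samples(eps: float) -> int:
    """Order-of-magnitude count ceil(1/eps) for the Theta(1/eps) rate.

    The small offset guards against floating-point round-up in 1/eps.
    """
    return math.ceil(1.0 / eps - 1e-9)


# Compare the two counts at delta = 0.001 (i.e. 99.9% confidence).
for eps in (0.1, 0.05, 0.01):
    print(eps, hoeffding_min_samples(eps, delta=0.001), conformal_order_of_samples(eps))
```

For $\epsilon = 0.05$ and $\delta = 0.001$, the Hoeffding count is 1382 unsafe examples, against roughly 20 for the $1/\epsilon$ rate.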

3.4.3 Comparing Usefulness of Guarantees

PAC learning and conformal learning both have advantages. PAC learning has the advantage that its i.i.d. error rate guarantee in Eq. (10) is stronger than the marginal error rate guarantee in Eq. (11). For example, if the downstream user is very sensitive to high variance (i.e. it is unacceptable for all test examples to fail simultaneously even if the probability is vanishingly small) then the i.i.d. error rate guarantee in Eq. (10) might be necessary. Nevertheless, the risk can be reduced by alternative methods such as financial tools (insurance). On the other hand, the conformal learning guarantee in Eq. (11) has the advantage that it always holds, while the PAC learning guarantee in Eq. (10) only holds with probability $1 - \delta$.

To summarize, conformal learning requires much weaker assumptions and fewer samples, and its guarantees always hold (rather than with probability $1 - \delta$). PAC learning offers stronger guarantees when its assumptions and sample complexity requirements are met.

4 Experiments: Driver Alert System

We empirically validate the guarantees of our framework on a driver alert safety system using real driving data. The system should warn the driver if the driver may get into an unsafe situation, without issuing too many false alarms. We show that the false negative rate (the percentage of unsafe situations that the system fails to identify) is indeed bounded according to Proposition 1, while the FPR remains low.

4.1 Experimental Setup

4.1.1 Methods

We evaluate our framework on the setup described in Section 3.1. We use Trajectron++ (Salzmann et al., 2020) as our future dynamics model (i.e. in the notation of Section 3.1, $Y$ is the output of Trajectron++ and $g = f$). We choose the safety score $f$ as a weighted distance metric, where agents in the direction of the ego-vehicle velocity vector are considered “closer” than agents in the orthogonal direction.

More specifically, we define the safety score by the Mahalanobis distance between the ego-vehicle and the agent, where the first eigenvector is aligned with the ego-vehicle’s velocity vector, and the second eigenvector is orthogonal to the ego-vehicle (see Figure 6); the magnitude of the first eigenvector is the magnitude of the velocity, and the magnitude of the second eigenvector is approximately half of a car width (we use 1m). Intuitively, this means that agents that are along the ego-vehicle’s velocity vector appear closer than agents in the perpendicular direction.
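A minimal sketch of such a velocity-aligned distance is given below. It is an illustration under stated assumptions, not the authors' implementation: the function name and argument layout are hypothetical, the eigenvector "magnitudes" are interpreted as length scales (speed along the velocity axis, `lateral_scale` metres across it), and the isotropic fallback for a nearly stationary ego-vehicle is our own choice.

```python
import math


def weighted_distance(ego_xy, ego_vel, agent_xy, lateral_scale=1.0, min_speed=1e-3):
    """Mahalanobis-style distance with axes aligned to the ego velocity.

    Assumption: offsets along the velocity are divided by the speed and
    offsets across it by `lateral_scale` (1 m in the text above).
    """
    dx = agent_xy[0] - ego_xy[0]
    dy = agent_xy[1] - ego_xy[1]
    speed = math.hypot(ego_vel[0], ego_vel[1])
    if speed < min_speed:
        # Assumption: direction is undefined at rest, so use an isotropic metric.
        return math.hypot(dx, dy) / lateral_scale
    ux, uy = ego_vel[0] / speed, ego_vel[1] / speed  # unit vector along velocity
    along = dx * ux + dy * uy                        # longitudinal offset
    across = -dx * uy + dy * ux                      # lateral offset
    return math.hypot(along / speed, across / lateral_scale)


# An agent 10 m ahead scores as much closer than an agent 10 m to the side.
ahead = weighted_distance((0.0, 0.0), (10.0, 0.0), (10.0, 0.0))
beside = weighted_distance((0.0, 0.0), (10.0, 0.0), (0.0, 10.0))
print(ahead, beside)
```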

Figure 6: The Mahalanobis distance is transformed along these axes, such that agents in the direction of the velocity vector appear closer than agents in the perpendicular direction.

(a) False negative and false positive rates on the nuScenes dataset with varying $f_0$ and $\epsilon = 0.05$. The theoretical upper bound on $\epsilon$ is shown in green. The false negative rate (red) is below the upper bound and the false positive rate (blue) is very low.
(b) False negative and false positive rates on the nuScenes dataset with varying $\epsilon$ and $f_0 = 25$. The theoretical upper bound on $\epsilon$ is shown in green. The false positive rate (blue) improves with higher $\epsilon$.
(c) False negative and false positive rates on the nuScenes dataset with varying proportions of unsafe samples in the training set. The theoretical upper bound on $\epsilon$ is shown in green. Here, $\epsilon = 0.05$ and $f_0 = 25$.
(d) False negative and false positive rates on the Kaggle Lyft dataset with a varying $\epsilon$ value. Here, $f_0 = 3.5$, and there are approx. 50 unsafe examples in the training dataset.
Figure 7: False negative rate, false positive rate, and theoretical upper bound on $\epsilon$ for the nuScenes and Lyft datasets while varying several parameters.

4.1.2 Datasets

We use the nuScenes (Caesar et al., 2020) and the Kaggle Lyft Motion Prediction (Jang et al., 2020) autonomous driving datasets. Each dataset contains multiple scenes, and each scene contains multiple trajectories. The trajectories in a scene are correlated with each other, but the different scenes are sufficiently distinct from each other to be considered exchangeable. To generate a dataset of exchangeable trajectories, we sample a single trajectory uniformly at random from each scene.
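The per-scene sampling step can be sketched as follows. The data layout (a dictionary from scene id to a list of trajectories) and the function name are hypothetical and chosen only for illustration; scenes with no trajectories are skipped, which is our own assumption.

```python
import random


def sample_one_per_scene(scenes, seed=0):
    """Pick one trajectory uniformly at random from each non-empty scene.

    `scenes` maps a scene id to its list of trajectories (hypothetical layout).
    Keeping one trajectory per scene avoids within-scene correlation.
    """
    rng = random.Random(seed)
    return {
        scene_id: rng.choice(trajectories)
        for scene_id, trajectories in scenes.items()
        if trajectories
    }


toy_scenes = {"scene-a": ["t1", "t2", "t3"], "scene-b": ["t4"], "scene-c": []}
print(sample_one_per_scene(toy_scenes))
```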

The nuScenes dataset includes 952 scenes collected across Boston and Singapore, divided into a 697/105/150 train/val/test split (the same split used for the original Trajectron++). Each scene is 20 seconds long. The Kaggle Lyft Motion Prediction dataset includes approximately 16k scenes, divided into a 70%/15%/15% train/val/test split. Each scene is 25 seconds long. Both datasets include labeled ego-vehicle trajectories as well as labeled detections and trajectories for other agents in the scene. Note that for both of these datasets, because the training split was used to train the Trajectron++ model, we use the validation split as the input training data for Algorithm 1. A visualization of trajectory predictions output by Trajectron++ is shown in Figure 8 (Salzmann et al., 2020).

Figure 8: Visualization of trajectory predictions output by Trajectron++ (Salzmann et al., 2020).

4.1.3 Data Splitting

To compute average performance, we use random train and test splits. For both datasets, we first pool together all available data points, randomly shuffle them, and separate them back into training and test splits (with the same size as the original splits). We ran 100 trials for each experiment, and averaged over the results.
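The resplitting procedure can be sketched as below. This is an illustrative outline with hypothetical function names; `metric` stands for whatever quantity (e.g. a false negative rate) is computed from one train/test split.

```python
import random


def random_resplit(train, test, rng):
    """Pool both splits, shuffle, and cut back to the original split sizes."""
    pooled = list(train) + list(test)
    rng.shuffle(pooled)
    return pooled[:len(train)], pooled[len(train):]


def average_over_trials(train, test, metric, n_trials=100, seed=0):
    """Average metric(new_train, new_test) over independent random resplits."""
    rng = random.Random(seed)
    values = [metric(*random_resplit(train, test, rng)) for _ in range(n_trials)]
    return sum(values) / len(values)


# Toy usage: the metric here is just the mean of the resampled test split.
print(average_over_trials([1, 2, 3], [4, 5], lambda tr, te: sum(te) / len(te)))
```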

4.2 Results and Discussion

In Figures 7(a), 7(b), and 7(c) we vary several parameters (safety threshold $f_0$, safety guarantee $\epsilon$, and proportion of unsafe situations) for nuScenes. We show qualitatively similar results for the Lyft dataset in Figure 7(d). Our main observations:

  1. The false negative rate (i.e. safety) is always within the theoretical bound in Proposition 1. We achieve these false negative rates with very little data: nuScenes has 50-70 unsafe examples in the training dataset, and Lyft has about 50. Yet, even with these few examples, we can ensure a false negative rate to within 1 or 2% of the desired $\epsilon$.

  2. The false positive rate (FPR) is generally very low (well below 1% on the nuScenes dataset). We use an off-the-shelf trajectory predictor trained on a small academic dataset; a more accurate trajectory predictor trained on industry-sized datasets might be expected to provide a more discriminative safety score (as in Figure 5), and thus a further improved FPR. Note that as shown in Figure 9, there is a tradeoff between $\epsilon$ and the FPR when there are few (e.g. $< 1/T$) samples, which is consistent with what our theory from Section 3.2 would predict. Figure 9 plots the epsilon bound as well as the false negative and false positive rates vs. the number of unsafe samples in the validation dataset; we see that when $\epsilon$ decreases as $1/T$, the false positive rate is relatively flat and low.

  3. One previously unmentioned benefit of our approach is that our method is robust to label frequency shift: the frequency of unsafe situations can differ between the training data and test data. Observe that the output of Algorithm 1 depends only on the unsafe examples; consequently, the safety guarantee in Proposition 1 still holds if we increase or decrease the number of safe examples. For example, the training data collection process could intentionally focus on unsafe situations, so that unsafe examples are over-represented in the training data. We empirically simulate this in Figure 7(c) where we increase the proportion of unsafe examples in the training set (by deleting safe examples). The performance of our algorithm does not change qualitatively.
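The dependence on unsafe examples only can be illustrated with a generic conformal threshold rule. The sketch below is not a transcription of Algorithm 1 (which is not reproduced in this section): it uses the standard order-statistic construction, a hypothetical function name, and alerts when the score is less than or equal to the threshold (the text uses a strict inequality, which differs only on ties).

```python
import math


def conformal_threshold(samples, eps):
    """Threshold theta such that alerting when score <= theta misses an
    exchangeable unsafe test point with probability at most eps.

    `samples` is a list of (score, is_unsafe) pairs; safe points are ignored.
    """
    unsafe_scores = sorted(score for score, is_unsafe in samples if is_unsafe)
    m = len(unsafe_scores)
    k = math.ceil((1.0 - eps) * (m + 1))  # rank of the order statistic to use
    if k > m:
        return math.inf                   # too few unsafe examples: always alert
    return unsafe_scores[k - 1]


# Adding or removing safe examples leaves the threshold unchanged.
unsafe = [(float(s), True) for s in range(1, 10)]
safe = [(100.0, False)] * 50
print(conformal_threshold(unsafe, 0.25), conformal_threshold(unsafe + safe, 0.25))
```

With fewer than about $1/\epsilon - 1$ unsafe examples the rule returns an infinite threshold, i.e. it can only meet the guarantee by always alerting.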

Figure 9: Epsilon bound, false negative rate, and false positive rate on the nuScenes dataset while varying the number of unsafe samples. Consistent with our theory from Section 3.2, the sum of $\epsilon$ and the false positive rate is high when there are few samples.

In Figure 10, we also plot the false negative and false positive rates at various $\epsilon$-values for the Lyft dataset using PAC estimates. In this experiment, we find a 99.9% upper confidence bound for the $(1 - \epsilon)$ quantile of the surrogate safety score among the unsafe samples, and issue an alert if the surrogate safety score of the new test sample is below this upper bound. Thus, with high (99.9%) probability, the FNR will be less than or equal to $\epsilon$. As the figure demonstrates, the FPR is very high until $\epsilon$ is large, since there are too few samples to obtain a tight confidence interval.
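One standard distribution-free way to obtain such an upper confidence bound on a quantile uses order statistics and a binomial tail. The text does not state which construction was used, so the sketch below is an assumption for illustration, with hypothetical function names.

```python
import math


def binom_cdf(k, n, p):
    """P(Binomial(n, p) <= k), computed by direct summation."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def pac_quantile_upper_bound(unsafe_scores, eps, delta):
    """Order statistic that upper-bounds the (1 - eps) quantile of the
    unsafe scores with probability at least 1 - delta.

    Returns math.inf when no order statistic is large enough, in which
    case a warning system using this bound alerts on every input.
    """
    scores = sorted(unsafe_scores)
    m = len(scores)
    for k in range(1, m + 1):
        # The k-th smallest score is a valid bound once this tail condition holds.
        if binom_cdf(k - 1, m, 1.0 - eps) >= 1.0 - delta:
            return scores[k - 1]
    return math.inf


toy_scores = [i / 50 for i in range(50)]  # 50 unsafe examples, as for Lyft
print(pac_quantile_upper_bound(toy_scores, eps=0.05, delta=0.001))
print(pac_quantile_upper_bound(toy_scores, eps=0.20, delta=0.001))
```

Under this particular construction with 50 unsafe examples and $\delta = 0.001$, a finite bound exists only when $(1 - \epsilon)^{50} \leq 0.001$, i.e. for $\epsilon$ of roughly 0.13 or more; below that the system must always alert, which is consistent with a very high FPR at small $\epsilon$.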

Figure 10: Epsilon, false negative rate, and false positive rate for the Lyft dataset using PAC confidence bounds. The false positive rate is very high until $\epsilon$ is large, because there are very few unsafe examples in the training dataset.

In the comparison of PAC learning and conformal learning, we argued that the main advantage of PAC learning is that the failures are i.i.d., so the total number of failures should have low variance (due to the Central Limit Theorem). However, we show empirically that users need not be overly concerned about highly correlated failures, as long as the test samples are not inherently highly correlated. We find that the variance on the false negative rate from different train/test splits is very low. With $\epsilon = 0.06$, for instance, it was only $0.0014$. Table 1 displays the variance on the false negative rate calculated over the 100 trials for the Lyft dataset at each $\epsilon$ value. All of the variances are well below 0.003, suggesting that the test sequence false negative rates are clustered around $\epsilon$ (rather than having some sequences that fail on zero examples and others with catastrophic failures). As further evidence, in Figure 11, we provide a representative box plot of the false negative rates over the 100 trials with $\epsilon = 0.04$. The false negative rate values are indeed clustered around $0.04$.

| $\epsilon$ | 0.02 | 0.04 | 0.06 | 0.08 | 0.10 |
|---|---|---|---|---|---|
| Variance | 0.00096 | 0.0019 | 0.0014 | 0.0023 | 0.0024 |

Table 1: Variance on the test sequence false negative rates at different $\epsilon$.
Figure 11: Box plot of the 100 false negative rates calculated over randomized train/test splits with $\epsilon = 0.04$.
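The per-trial summary behind a table and box plot of this kind can be sketched as follows. The function name is hypothetical, and the use of the sample (rather than population) variance is an assumption, since the text does not say which was reported.

```python
import statistics


def summarize_fnr(trial_fnrs):
    """Summary statistics of per-trial false negative rates."""
    return {
        "mean": statistics.mean(trial_fnrs),
        "variance": statistics.variance(trial_fnrs),  # sample variance (assumption)
        "min": min(trial_fnrs),
        "median": statistics.median(trial_fnrs),
        "max": max(trial_fnrs),
    }


# Toy input; in practice this would hold one false negative rate per trial.
print(summarize_fnr([0.02, 0.04, 0.06]))
```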

5 Experiments: Robotic Grasping

Finally, we validate the guarantees of our framework on a robotic grasping system that should warn the user when the robot will fail to pick and transport an object. Picking is a core problem in warehouse robotics (Correll et al., 2016; Eppner et al., 2016; Hernandez et al., 2017; Yu et al., 2016; Zeng et al., 2017; Mahler et al., 2017, 2018, 2019), and failures hurt throughput (potentially even stopping the assembly line). Failures can also lead to dropped or damaged goods.

5.1 Experimental Setup

We again evaluate our framework on the setup described in Section 3.1, using an open source dataset and model. We use the Grasp Quality Convolutional Neural Network (GQ-CNN) from Mahler et al. (2017, 2018, 2019) as our predictor model and the DexNet 4.0 dataset of synthetic objects grasped with a parallel-jaw gripper (Mahler et al., 2019). (An example of objects from this dataset is shown in Figure 12.) The inputs to the GQ-CNN are a point cloud representation of an object, $\mathbf{y}$, and a candidate grasp, $\mathbf{u}$. The GQ-CNN outputs the predicted probability, $Q_\theta(\mathbf{y}, \mathbf{u})$, that the candidate grasp will be able to successfully pick and transport the object. We use this predicted probability as the surrogate safety score, $g = Q_\theta(\mathbf{y}, \mathbf{u})$. We consider a candidate grasp “unsafe” if it will not be able to successfully pick the object (i.e. the true label is $Z = 0$). Note that this is exactly the ROC curve threshold tuning setup (with an additional guarantee on the false negative rate).

Figure 12: An example of objects from the DexNet 4.0 dataset (Mahler et al., 2019).

The DexNet dataset of synthetic objects (Mahler et al., 2017) includes a variety of pick attempts that were not used in training the GQ-CNN model. These samples are divided into a 50%/50% train/test split. Each example is labeled as a success if the robot successfully picks and places the object, and a failure otherwise. We ran 100 trials of Algorithm 1 with randomized train/test splits, and averaged over the results.

(a) Original surrogate safety score. False negative and false positive rates on the DexNet dataset with varying $\epsilon$, using the predicted probability of a successful pick as the surrogate safety score, $g = Q_\theta(\mathbf{y}, \mathbf{u})$. The theoretical upper bound on $\epsilon$ is shown in green. The false negative rate (red) is within the theoretical bound and the false positive rate (blue) is reasonably low.
(b) Original surrogate safety score $\times$ 0.75 + noise $\times$ 0.25. False negative and false positive rates on the DexNet dataset with varying $\epsilon$. Here, the surrogate safety score is a weighted sum of the original surrogate safety score (as in Figure 13(a)) and uniformly randomly sampled noise. The theoretical upper bound on $\epsilon$ is shown in green. The false negative rate (red) is within the theoretical bound, but the false positive rate (blue) is higher than in Figure 13(a).
(c) Original surrogate safety score $\times$ 0.5 + noise $\times$ 0.5. False negative and false positive rates on the DexNet dataset with varying $\epsilon$. Here, the surrogate safety score is a weighted sum of the original surrogate safety score (as in Figure 13(a)) and uniformly randomly sampled noise; in this case, there is more noise than in Figure 13(b). The theoretical upper bound on $\epsilon$ is shown in green. The false negative rate (red) is within the theoretical bound, but the false positive rate (blue) is higher than in Figures 13(a) and 13(b).
(d) Only noise as the surrogate safety score. False negative and false positive rates on the DexNet dataset with varying $\epsilon$, using uniformly randomly sampled noise as the surrogate safety score, $g = \text{random.random()}$. The theoretical upper bound on $\epsilon$ is shown in green. Even in this case, the false negative rate (red) is within the theoretical bound. However, the false positive rate (blue) is very high (with values ranging from 0.95 to 0.90), indicating that the warning system trivially always issues an alert.
Figure 13: False negative rate, false positive rate, and theoretical upper bound on $\epsilon$ for the DexNet dataset of synthetic objects. The false negative rate always remains within the bound, but the false positive rate increases with worse surrogate safety scores.

5.2 Results and Discussion

With $\epsilon = 0.05$, we achieved a false negative rate of 0.05, and a false positive rate of 0.11. With $\epsilon = 0.1$, we achieved a false negative rate of 0.10 and a false positive rate of 0.04. The conformal guarantees of our framework hold. A plot of the false negative and false positive rates achieved for different $\epsilon$ values is shown in Figure 13(a). As before, the false negative rate is within the theoretical bound, and the false positive rate is reasonably low.

In Figures 13(b), 13(c), and 13(d), we progressively degrade the quality of the surrogate safety score used in Figure 13(a) ($g = Q_\theta(\mathbf{y}, \mathbf{u})$) to demonstrate the effects of a worse surrogate safety score or a more inaccurate simulator. In fact, in Figure 13(d), we replace the surrogate safety score $g$ entirely with random noise that is uniformly sampled between 0 and 1, i.e. $g(Y) \sim \mathcal{U}([0, 1])$. As the plot shows, the false negative rate still remains within the theoretical bound. However, the false positive rate is extremely high (with values ranging from 0.95 to 0.90), indicating that the warning system essentially always issues an alert. This makes sense intuitively, because in this case the surrogate safety score is completely non-informative. In Figures 13(b) and 13(c), we progressively degrade the quality of the original surrogate safety score (the predicted probability that a pick will be successful) by adding increasing amounts of random noise. (In other words, we progressively decrease the accuracy of the simulator by adding noise.) We take a linear combination of the original surrogate safety score ($g = Q_\theta(\mathbf{y}, \mathbf{u})$) and the uniformly randomly sampled noise, using weights of 0.75 and 0.25 for the original surrogate safety score and the random noise, respectively, in Figure 13(b), and weights of 0.5 and 0.5 in Figure 13(c). Note that the false negative rate always remains within the theoretical bound from Proposition 1, but the false positive rate increases with increasing noise, indicating that more alerts are being issued and it is more difficult to distinguish between safe and unsafe situations.
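The qualitative effect of degrading the score can be reproduced on synthetic data. The sketch below is a toy simulation, not the DexNet experiment: the Beta-distributed scores are a hypothetical stand-in for GQ-CNN outputs, the threshold uses a generic order-statistic conformal rule rather than a transcription of Algorithm 1, and all function names are our own.

```python
import math
import random


def order_statistic_threshold(unsafe_scores, eps):
    """Conformal threshold: alert when score <= returned value."""
    scores = sorted(unsafe_scores)
    k = math.ceil((1.0 - eps) * (len(scores) + 1))
    return math.inf if k > len(scores) else scores[k - 1]


def degraded_score(q, weight, rng):
    """Blend a score q in [0, 1] with Uniform(0, 1) noise."""
    return weight * q + (1.0 - weight) * rng.random()


def run_trial(weight, eps, rng, n_cal=200, n_test=2000):
    """Calibrate on synthetic unsafe scores, then measure FNR and FPR."""
    def draw(a, b):
        # Hypothetical stand-in for predictor outputs, blended with noise.
        return degraded_score(rng.betavariate(a, b), weight, rng)

    # Assumption: failed grasps tend to score low, successful ones high.
    theta = order_statistic_threshold([draw(2, 5) for _ in range(n_cal)], eps)
    fnr = sum(draw(2, 5) > theta for _ in range(n_test)) / n_test   # missed unsafe
    fpr = sum(draw(5, 2) <= theta for _ in range(n_test)) / n_test  # needless alerts
    return fnr, fpr


def average_rates(weight, eps=0.05, n_trials=50, seed=0):
    """Average (FNR, FPR) over independent synthetic trials."""
    rng = random.Random(seed)
    trials = [run_trial(weight, eps, rng) for _ in range(n_trials)]
    return (sum(t[0] for t in trials) / n_trials,
            sum(t[1] for t in trials) / n_trials)


for w in (1.0, 0.75, 0.5, 0.0):
    print(w, average_rates(w))
```

In this toy setting the average false negative rate stays near $\epsilon$ for every weight, while the false positive rate grows as the score becomes noisier and approaches $1 - \epsilon$ for pure noise.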

Taken together, these results demonstrate that the guarantees of Algorithm 1 hold regardless of the surrogate safety score and simulator model used, and thus, our method can be used to bound the false negative rate of a warning system even if the simulator or prediction model does not have any performance guarantees. However, a surrogate safety score $g$ that is better correlated with the true safety score $f$ and a more accurate simulator model will lead to better empirical performance in terms of the false positive rate, and issue fewer unnecessary alerts.

6 Conclusions and Future Work

In this work, we introduce a broadly applicable framework that uses conformal prediction to tune warning systems for robotics applications. This framework allows us to achieve provable safety assurances with very little data. We demonstrate empirically that the guarantees on the false negative rate hold for a driver alert system and for a robotic grasping system (even with only tens of examples of failure cases), while the false positive rate remains low.

There are several exciting future directions for this work. One area of particular interest is the application of conformal prediction in non-exchangeable scenarios (Tibshirani et al., 2019; Barber et al., 2023; Gibbs and Candès, 2021; Cauchois et al., 2020), as many robotics settings involve highly correlated time-series data, and robots deployed in the world may encounter distribution shift. There have been many recent advances in the conformal prediction literature on relaxing the exchangeability assumption, and leveraging this work could lead to useful developments in robotics.

Another intriguing extension of this work is exploring conditional safety (Feldman et al., 2021; Gupta et al., 2022), with the goal of providing safety assurances conditioned on specific factors (rather than a marginal guarantee). For example, a driver assistance system that provides a 95% guarantee on the false negative rate regardless of whether it is raining could be very useful. This system could leverage existing work on conditional coverage guarantees in conformal prediction; however, it would need to be sample-efficient to be useful for robotics settings.

Two additional interesting and important future directions are studying deployment in industry-scale applications and studying the impact of the predictor on the data that it is trying to predict (Perdomo et al., 2020) (e.g. examining whether and to what extent the warning system changes behavior or outcomes).

Acknowledgements

The NASA University Leadership Initiative (grant #80NSSC20M0163) provided funds to assist the authors with their research. This article solely reflects the opinions and conclusions of its authors and not any NASA entity. The authors would like to thank Matteo Zallio for his expertise in crafting Figure 1.

References

  • Angelopoulos et al. (2020) Angelopoulos AN, Bates S, Malik J and Jordan MI (2020) Uncertainty sets for image classifiers using conformal prediction. In: International Conference on Learning Representations.
  • Angelopoulos et al. (2022) Angelopoulos AN, Krauth K, Bates S, Wang Y and Jordan MI (2022) Recommendation systems with distribution-free reliability guarantees. arXiv preprint arXiv:2207.01609.
  • Barber et al. (2023) Barber RF, Candès EJ, Ramdas A and Tibshirani RJ (2023) Conformal prediction beyond exchangeability. Annals of Statistics 51(2): 816–845. 10.1214/23-AOS2276.
  • Caesar et al. (2020) Caesar H, Bankiti V, Lang AH, Vora S, Liong VE, Xu Q, Krishnan A, Pan Y, Baldan G and Beijbom O (2020) nuscenes: A multimodal dataset for autonomous driving. In: IEEE Conference on Computer Vision and Pattern Recognition.
  • Cai and Koutsoukos (2020) Cai F and Koutsoukos X (2020) Real-time out-of-distribution detection in learning-enabled cyber-physical systems. In: International Conference on Cyber-Physical Systems.
  • Calafiore and Campi (2006) Calafiore G and Campi M (2006) The scenario approach to robust control design. IEEE Transactions on Automatic Control 51(5): 742–753. 10.1109/TAC.2006.875041.
  • Cauchois et al. (2020) Cauchois M, Gupta S, Ali A and Duchi JC (2020) Robust validation: Confident predictions even when distributions shift. arXiv preprint arXiv:2008.04267.
  • Chen et al. (2020) Chen Y, Rosolia U, Fan C, Ames A and Murray R (2020) Reactive motion planning with probabilistic safety guarantees. In: Conference on Robot Learning.
  • Correll et al. (2016) Correll N, Bekris KE, Berenson D, Brock O, Causo A, Hauser K, Okada K, Rodriguez A, Romano JM and Wurman PR (2016) Analysis and observations from the first Amazon Picking Challenge. IEEE Transactions on Automation Science and Engineering 15(1): 172–188.
  • Crestani et al. (2015) Crestani D, Godary-Dejean K and Lapierre L (2015) Enhancing fault tolerance of autonomous mobile robots. Robotics and Autonomous Systems 68: 140–155.
  • Ding (2013) Ding SX (2013) Introduction. In: Model-Based Fault Diagnosis Techniques: Design Schemes, Algorithms and Tools. Springer London, pp. 3–11.
  • Eppner et al. (2016) Eppner C, Höfer S, Jonschkowski R, Martín-Martín R, Sieverling A, Wall V and Brock O (2016) Lessons from the Amazon Picking Challenge: Four aspects of building robotic systems. In: Robotics: Science and Systems.
  • Feldman et al. (2021) Feldman S, Bates S and Romano Y (2021) Improving conditional coverage via orthogonal quantile regression. In: Advances in Neural Information Processing Systems.
  • Foody (2009) Foody GM (2009) Sample size determination for image classification accuracy assessment and comparison. International Journal of Remote Sensing 30(20): 5273–5291. 10.1080/01431160903130937.
  • Gammerman et al. (2008) Gammerman A, Nouretdinov I, Burford B, Chervonenkis A, Vovk V and Luo Z (2008) Clinical mass spectrometry proteomic diagnosis by conformal predictors. Statistical Applications in Genetics and Molecular Biology 7(2). 10.2202/1544-6115.1385.
  • Ghosh et al. (2023) Ghosh S, Belkhouja T, Yan Y and Doppa JR (2023) Improving uncertainty quantification of deep classifiers via neighborhood conformal prediction: Novel algorithm and theoretical analysis. In: AAAI Conference on Artificial Intelligence.
  • Gibbs and Candès (2021) Gibbs I and Candès EJ (2021) Conformal inference for online prediction with arbitrary distribution shifts. In: Advances in Neural Information Processing Systems.
  • Gupta et al. (2022) Gupta V, Jung C, Noarov G, Pai MM and Roth A (2022) Online multivalid learning: Means, moments, and prediction intervals. In: Innovations in Theoretical Computer Science Conference.
  • Harirchi and Ozay (2015) Harirchi F and Ozay N (2015) Model invalidation for switched affine systems with applications to fault and anomaly detection. Analysis and Design of Hybrid Systems 48(27): 260–266.
  • Harirchi and Ozay (2018) Harirchi F and Ozay N (2018) Guaranteed model-based fault detection in cyber-physical systems: A model invalidation approach. Automatica 93: 476–488. 10.1016/j.automatica.2018.03.040.
  • Hernandez et al. (2017) Hernandez C, Bharatheesha M, Ko W, Gaiser H, Tan J, van Deurzen K, de Vries M, Van Mil B, van Egmond J, Burger R et al. (2017) Team Delft’s robot winner of the Amazon Picking Challenge 2016. In: RoboCup 2016: Robot World Cup XX. Springer International Publishing, pp. 613–624.
  • Jang et al. (2020) Jang A, Christy, Bergamini L, Maggie, Scheel O, Ondruska P, Culliton P and Iglovikov V (2020) Lyft motion prediction for autonomous vehicles. URL https://kaggle.com/competitions/lyft-motion-prediction-autonomous-vehicles.
  • Khalastchi and Kalech (2018) Khalastchi E and Kalech M (2018) On fault detection and diagnosis in robotic systems. ACM Computing Surveys 51(1): 1–24.
  • Luo et al. (2022) Luo R, Zhao S, Kuck J, Ivanovic B, Savarese S, Schmerling E and Pavone M (2022) Sample-efficient safety assurances using conformal prediction. In: Workshop on the Algorithmic Foundations of Robotics.
  • Mahler et al. (2017) Mahler J, Liang J, Niyaz S, Laskey M, Doan R, Liu X, Ojea JA and Goldberg K (2017) Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. In: Robotics: Science and Systems (RSS).
  • Mahler et al. (2018) Mahler J, Matl M, Liu X, Li A, Gealy D and Goldberg K (2018) Dex-Net 3.0: Computing robust robot suction grasp targets in point clouds using a new analytic model and deep learning. In: IEEE International Conference on Robotics and Automation.
  • Mahler et al. (2019) Mahler J, Matl M, Satish V, Danielczuk M, DeRose B, McKinley S and Goldberg K (2019) Learning ambidextrous robot grasping policies. Science Robotics 4(26).
  • Muradore and Fiorini (2011) Muradore R and Fiorini P (2011) A PLS-based statistical approach for fault detection and isolation of robotic manipulators. IEEE Transactions on Industrial Electronics 59(8): 3167–3175.
  • Nouretdinov et al. (2011) Nouretdinov I, Costafreda SG, Gammerman A, Chervonenkis A, Vovk V, Vapnik V and Fu CH (2011) Machine learning classification with confidence: Application of transductive conformal predictors to MRI-based diagnostic and prognostic markers in depression. NeuroImage 56(2): 809–813. 10.1016/j.neuroimage.2010.05.023.
  • Patton and Chen (1997) Patton R and Chen J (1997) Observer-based fault detection and isolation: Robustness and applications. Control Engineering Practice 5(5): 671–682.
  • Perdomo et al. (2020) Perdomo J, Zrnic T, Mendler-Dünner C and Hardt M (2020) Performative prediction. In: International Conference on Machine Learning.
  • Salzmann et al. (2020) Salzmann T, Ivanovic B, Chakravarty P and Pavone M (2020) Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. In: European Conference on Computer Vision.
  • Shafer and Vovk (2008) Shafer G and Vovk V (2008) A tutorial on conformal prediction. Journal of Machine Learning Research 9: 371–421.
  • Tibshirani et al. (2019) Tibshirani RJ, Barber RF, Candès EJ and Ramdas A (2019) Conformal prediction under covariate shift. In: Advances in Neural Information Processing Systems.
  • Vemuri et al. (1998) Vemuri AT, Polycarpou MM and Diakourtis SA (1998) Neural network based fault detection in robotic manipulators. IEEE Transactions on Robotics and Automation 14(2): 342–348.
  • Visinsky et al. (1994a) Visinsky ML, Cavallaro JR and Walker ID (1994a) Expert system framework for fault detection and fault tolerance in robotics. Computers & Electrical Engineering 20(5): 421–435.
  • Visinsky et al. (1994b) Visinsky ML, Cavallaro JR and Walker ID (1994b) Robotic fault detection and fault tolerance: A survey. Reliability Engineering & System Safety 46(2): 139–158.
  • Visinsky et al. (1995) Visinsky ML, Cavallaro JR and Walker ID (1995) A dynamic fault tolerance framework for remote robots. IEEE Transactions on Robotics and Automation 11(4): 477–490.
  • von Luxburg and Schölkopf (2011) von Luxburg U and Schölkopf B (2011) Statistical learning theory: Models, concepts, and results. In: Gabbay DM, Hartmann S and Woods J (eds.) Inductive Logic, Handbook of the History of Logic, volume 10. North-Holland, pp. 651–706.
  • Vovk et al. (2005) Vovk V, Gammerman A and Shafer G (2005) Algorithmic Learning in a Random World. Springer New York. 10.1007/b106715.
  • Vovk et al. (2003) Vovk V, Lindsay D, Nouretdinov I and Gammerman A (2003) Mondrian confidence machine. URL http://alrw.net/old/04.pdf.
  • Yu et al. (2016) Yu KT, Fazeli N, Chavan-Dafle N, Taylor O, Donlon E, Lankenau GD and Rodriguez A (2016) A summary of Team MIT’s approach to the Amazon Picking Challenge 2015. arXiv preprint arXiv:1604.03639.
  • Zeng et al. (2017) Zeng A, Yu KT, Song S, Suo D, Walker E, Rodriguez A and Xiao J (2017) Multi-view self-supervised deep learning for 6D pose estimation in the Amazon Picking Challenge. In: IEEE International Conference on Robotics and Automation.