Evaluation of autonomous systems under data distribution shifts

Daniel Sikar
City, University of London
Northampton Square, London EC1V 0HB, UK
[email protected]
Artur Garcez
City, University of London
Northampton Square, London EC1V 0HB, UK
[email protected]
Abstract

We posit that data can only be safe to use up to a certain threshold of the data distribution shift, after which control must be relinquished by the autonomous system and operation halted or handed to a human operator. With the use of a computer vision toy example we demonstrate that network predictive accuracy is impacted by data distribution shifts and propose distance metrics between training and testing data to define safe operation limits within said shifts. We conclude that beyond an empirically obtained threshold of the data distribution shift, it is unreasonable to expect network predictive accuracy not to degrade.

1 Introduction

The development of autonomous systems such as self-driving cars is motivated by a number of goals of a practical, safety, public interest and economic nature. From a practical perspective, the goal is "to transport people from one place to another without any help from a driver" [23]. From a public health perspective, the goal is to transform the current approach to automotive safety from reducing injuries after collisions to complete collision prevention [11]. From a public interest and economic perspective, AV fleets allow for new shared autonomous mobility business models [31] through shared autonomous electric vehicle (SAEV) fleets [24]. Shared Autonomous Vehicles (SAVs) have gained significant public interest as a possibly less expensive, safer and more efficient version of today's transportation networking companies (TNCs) and taxis.

This perceived superiority to human drivers is attributed to high-performance computing that allows AVs to process, learn from and adjust their guidance systems according to changes in external conditions at much faster rates than the typical human driver [38].

The presence in public spaces of autonomous systems driven by AI is a concern. Although models such as convolutional neural networks have been successfully applied to Computer Vision problems, the generalization ability and robustness of such architectures have been increasingly scrutinised.

Zhang et al. [39] debated the need to rethink generalization by demonstrating how traditional benchmarking approaches fail to explain why large neural networks generalize well in practice. By randomizing target labels, their experiments show that state-of-the-art convolutional neural networks for image classification trained with SGD (stochastic gradient descent) are large enough to fit a random labelling of the training data. They further show that even a simple two-layer neural network presents a "perfect finite sample expressivity" once the number of parameters is greater than the number of data points, as is often the case with CNNs. This poses a challenge, in the context of overfitting, for autonomous systems such as self-driving cars, especially if they rely only on images for perception of the surrounding environment. Since testing real-life scenarios is not practical due to the associated cost and risk, we examine the use of game engines [8], which are capable of generating labeled datasets and realistic environments where the autonomous vehicle may be tested.

In this study, data distribution shifts are examined in the context of testing data presented to a neural network trained on a known data distribution, that is, with the weights and biases adjusted while minimizing an error function.

When the shift is of a sufficient quantity, relative to the training data, a number of terms exist in the literature to express and deal with the shift, such as anomaly detection [7, 6, 26, 32], out-of-distribution data [12, 33, 17, 16, 14], novelty detection [36, 27, 18, 9, 28], outlier detection [3, 15, 1] and covariate shift [35, 34, 30, 4]. In section 3 we propose metrics to quantify the distance between training and testing data. We use a game engine to generate our training data and train a model based on SOTA self-driving architectures. Our aim is to determine safe i.e. acceptable shift quantities, where predictions made by the model, given the shifted data, are safe to use, as exemplified in Figure 1.

Figure 1: Simulated accident in the CARLA Simulator Town 10, where excessive brightness i.e. high RGB values cause predictive accuracy of self-driving models to degrade

2 Related Work

We examined literature related to out-of-distribution (OOD) data, and methods for identifying such data and/or making neural networks robust to its adverse and unwanted effect, such that network predictive accuracy is not affected by data in the OOD regime.

Fort et al. [12] are motivated by out-of-distribution (OOD) detection and point out its importance in high-stakes applications such as healthcare and self-driving. Techniques for detecting OOD inputs using neural networks include MSP (maximum over softmax probabilities), proposed by Hendrycks et al. [14], i.e. $score_{msp}(x) = \max_{c=1,\dots,K} p(y=c \mid x)$, showing that the probability prediction for correct examples tends to be higher than that of incorrect and out-of-distribution examples. Capturing prediction probability statistics about in-sample examples is in most cases sufficient for detecting abnormal examples.
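As a minimal sketch of the MSP score described above, the maximum softmax probability could be computed as follows. The logits are hypothetical values chosen for illustration, and the helper names `softmax` and `msp_score` are assumptions rather than code from the cited work.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msp_score(logits):
    """Maximum softmax probability for each row of logits."""
    return softmax(logits).max(axis=-1)

# Hypothetical logits for two inputs over K = 3 classes
logits = np.array([[4.0, 0.5, 0.1],    # confident prediction
                   [1.0, 0.9, 1.1]])   # nearly uniform prediction
scores = msp_score(logits)
```

A low score, as for the second row, would be the signal used to flag an input as possibly incorrect or out-of-distribution.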

Lee et al. [21] argue that Mahalanobis distance is more effective than Euclidean distance in various OOD and adversarial data detection tasks, given a defined distance-based confidence score $M(x)$ that is the Mahalanobis distance between a test sample $x$ and the closest class-conditional Gaussian distribution:

$$M(x) = \max_{c} \, -\left(f(x) - \hat{\mu}_c\right)^{T} \hat{\Sigma}^{-1} \left(f(x) - \hat{\mu}_c\right).$$
Figure 2: LeNet predictive accuracy change under rotation (left) and translation shift (right)

Note that the metric corresponds to measuring the log of the probability densities of the test sample. The value $\hat{\mu}_c$ represents the empirical class mean, while $\hat{\Sigma}$ represents the class covariance of the training samples. The algorithm uses the weights and biases of the softmax classifier, that is, the penultimate (fully connected, dense) layer, and assumes that the class-conditional distribution follows a multivariate Gaussian distribution.
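The confidence score can be illustrated with a short sketch. It assumes that features $f(x)$ have already been extracted, that a single covariance estimate is shared across classes, and that synthetic two-dimensional features stand in for real network activations. The function names are hypothetical.

```python
import numpy as np

def fit_class_gaussians(features, labels):
    """Estimate class means and one covariance shared by all classes (an assumption)."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # Centre every sample on the mean of its own class
    centred = features - means[np.searchsorted(classes, labels)]
    cov = centred.T @ centred / len(features)
    return means, cov

def mahalanobis_score(f_x, means, cov):
    """M(x): maximum over classes of the negated squared Mahalanobis distance."""
    diffs = means - f_x                      # one row per class
    precision = np.linalg.pinv(cov)          # pseudo-inverse guards against a singular cov
    d2 = np.einsum("ci,ij,cj->c", diffs, precision, diffs)
    return float(np.max(-d2))

# Synthetic features: two well-separated classes in 2-D
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                      rng.normal(5.0, 1.0, size=(50, 2))])
labels = np.repeat([0, 1], 50)
means, cov = fit_class_gaussians(features, labels)

near = mahalanobis_score(np.array([0.1, -0.1]), means, cov)    # close to class 0
far = mahalanobis_score(np.array([20.0, -20.0]), means, cov)   # far from both classes
```

The score is never positive, and it moves further below zero as the sample moves away from every class mean.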

Shalev et al. [33] propose the use of a number of distinct word embeddings as supervisors in ensemble models, thus generating shared representations. A semantic structure is utilised for word embeddings to produce semantic predictions, while the L2-norm of the output vectors is employed to detect OOD inputs, testing the approach to detect adversarial and incorrectly classified examples. Cosine distance is used in the space $Z$ to measure the distance between two embeddings $\mathbf{u}, \mathbf{v} \in Z$:

$$d_{cos}(\mathbf{u}, \mathbf{v}) = \frac{1}{2}\left(1 - \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \lVert \mathbf{v} \rVert}\right)$$

When the cosine distance is close to 0, the labels are considered similar; when it is close to 1, the labels are considered to be semantically far apart.
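The behaviour at the two extremes can be checked with a small sketch of the scaled cosine distance. The vectors are hypothetical and are not actual word embeddings.

```python
import numpy as np

def cosine_distance(u, v):
    """d_cos(u, v) = 0.5 * (1 - u.v / (||u|| ||v||)), which lies in [0, 1]."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 0.5 * (1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

same = cosine_distance([1.0, 2.0], [2.0, 4.0])        # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # unrelated directions
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # anti-parallel vectors
```

Parallel vectors give a distance of approximately 0, orthogonal vectors 0.5, and anti-parallel vectors 1.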

Huang et al. [17] propose a scoring function based on group softmax for different categories with the Minimum Others Score (MOS) function, allowing differentiation between in- and out-of-distribution data. The key observation is that a pre-defined category "others" carries useful information for the likelihood of a given image being OOD w.r.t. each group. An OOD input is mapped to "others" in all groups, and the lowest "others" score across all groups determines OOD images. The MOS OOD scoring function is:

$$S_{MOS}(\mathbf{x}) = -\min_{1 \leq k \leq K} p_{others}^{k}(\mathbf{x})$$

The sign is negated such that $S_{MOS}(\mathbf{x})$ is higher for in-distribution and lower for out-of-distribution data.
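A minimal sketch of the MOS score, assuming each group's logits are given as a list whose first entry is the "others" category (the index convention and the logit values are assumptions made for illustration):

```python
import numpy as np

def mos_score(group_logits, others_index=0):
    """S_MOS(x) = -min_k p_others^k(x), with a softmax taken within each group."""
    others_probs = []
    for logits in group_logits:
        z = np.asarray(logits, dtype=float)
        p = np.exp(z - z.max())
        p /= p.sum()
        others_probs.append(p[others_index])
    return -min(others_probs)

# Hypothetical in-distribution input: group 1 strongly prefers a real class
in_dist = mos_score([[0.0, 6.0, 1.0], [5.0, 0.0, 0.0]])
# Hypothetical OOD input: every group prefers "others"
ood = mos_score([[5.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
```

For the in-distribution input at least one group assigns "others" a low probability, so the score stays close to 0, whereas the OOD input scores close to -1.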

Liang et al. [22] propose the Out-of-distribution detector (ODIN) to distinguish in- and out-of-distribution images on a pre-trained neural network using a temperature scaling parameter $T$. The problem statement is defined by $P_{\mathbf{x}}$ and $Q_{\mathbf{x}}$ denoting two distinct data distributions defined on the image space $\chi$, where $P_{\mathbf{x}}$ is the in-distribution and $Q_{\mathbf{x}}$ is the out-of-distribution. During training, images are drawn from $P_{\mathbf{x}}$. If testing images are drawn from a mixture distribution, can a distinction be made between in- and out-of-distribution images? For each input the network generates a label prediction $\hat{y}(x) = \arg\max_{i} S_i(x;T)$ by computing the softmax output for each class, using the temperature scaling term $T$, which is set to 1 during training:

$$S_i(x;T) = \frac{\exp\left(f_i(x)/T\right)}{\sum_{j=1}^{N} \exp\left(f_j(x)/T\right)}$$

Temperature scaling perhaps resonates with von Neumann [37], whose conviction was that error should be treated by, and be subject to, thermodynamical methods and theory. In addition to temperature scaling, a small perturbation $\epsilon$, inspired by the idea of adversarial examples [13], is added to the inputs, giving the perturbed input $\bar{x}$. The inputs are classified as in-distribution if the softmax score is greater than the threshold $\delta$, and as out-of-distribution otherwise. The out-of-distribution detector can therefore be defined as a function of the input $x$, the perturbation $\epsilon$, the temperature scaling term $T$ and the threshold $\delta$:

$$g(x;\delta,T,\epsilon) = \begin{cases} 1 & \text{if } \max_{i} p(\bar{x};T) \leq \delta, \\ 0 & \text{if } \max_{i} p(\bar{x};T) > \delta. \end{cases}$$
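The temperature-scaled softmax and the thresholding step can be sketched as below. This is a partial illustration only: it assumes the logits passed in were already computed on the perturbed input, and it omits the gradient-based perturbation step entirely. The logits and the threshold are hypothetical.

```python
import numpy as np

def temperature_softmax(logits, T=1.0):
    """S_i(x; T) = exp(f_i(x) / T) / sum_j exp(f_j(x) / T)."""
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())   # shift by the maximum for numerical stability
    return e / e.sum()

def odin_detector(logits, delta, T=1.0):
    """g = 1 (out-of-distribution) if the maximum softmax score is <= delta, else 0."""
    return int(temperature_softmax(logits, T).max() <= delta)

logits = [8.0, 1.0, 0.0]                          # hypothetical confident prediction
sharp = temperature_softmax(logits, T=1.0)
flat = temperature_softmax(logits, T=1000.0)      # a large T flattens the scores

confident_flag = odin_detector(logits, delta=0.5)              # score above delta
uncertain_flag = odin_detector([0.2, 0.1, 0.0], delta=0.5)     # score below delta
```

Raising $T$ pulls all class scores towards $1/N$, which is why the threshold $\delta$ has to be chosen jointly with $T$.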

Hsu et al. [16] use a combination of ODIN and Mahalanobis distance for OOD data detection. Pointing out that ODIN requires OOD data to tune hyperparameters for temperature scaling and input pre-processing, their Generalized ODIN method addresses the problem of learning without OOD data by adding a variable that indicates whether the data is in- or out-of-distribution. This variable is closely related to the temperature scaling term in ODIN, except that it depends on the input instead of on tuned hyperparameters.

Ovadia et al. [25] argue that high-stakes applications such as medical diagnoses and self-driving require, in addition to predictive accuracy, quantification of predictive uncertainty, that is, class predictions and confidence values. The study concluded that accuracy and confidence degrade with dataset shifts (rotated or horizontally translated images), and that ensembles help with robustness to dataset shifts. Figure 2 shows accuracy shift as a function of rotated and translated MNIST datasets, as predicted with the LeNet architecture [19] with standard training, validation, testing and hyperparameter tuning protocols, and seven variants including vanilla, ensemble and dropout. The work presented by Ovadia et al. [25] with respect to accuracy and dataset shifts is the one we find our study most aligned with.

3 Methods

Figure 3: Two sets of images and pixel intensity value histograms, where the set on the left is a negative shift (pixel intensity values decrease) and the set on the right is a positive shift (pixel intensity values increase)

3.1 Applying data distribution shifts to RGB images

We interpret an RGB image as a data distribution of individual pixel intensity values. This can be represented as a histogram where the sum of bucket counts (the aggregated counts of distinct pixel values) is given in Equation 6, and where the number of buckets is equal to the number of values the pixel may take, i.e. the range of values of the pixel data type. Distribution shifts therefore cause counts to increase and decrease in each bucket, according to the direction of the shift (left, i.e. negative, or right, i.e. positive). The distribution is shifted by uniformly incrementing or decrementing each pixel intensity value by the same quantity. We define the RGB pixel intensity shift function as:

$$RGB_{is}(I,S) = \begin{cases} I_{jk} + S, & \text{if } I_{imin} \leq I_{jk} + S \leq I_{imax}, \\ I_{imin}, & \text{if } I_{imin} > I_{jk} + S, \\ I_{imax}, & \text{if } I_{imax} < I_{jk} + S. \end{cases} \tag{1}$$

where $I$ is the RGB image pixel matrix, $S$ is the shift quantity to be added to each pixel, $j$ is the number of dimensions in $I$ and $k$ is the number of elements in $j$. $S$ takes both negative and positive values. Negative values cause a left shift to the distribution and decrease the pixel intensity value; positive values cause a right shift and increase it. Output values lie in the closed range $[I_{imin}, I_{imax}]$, defined respectively as the minimum and maximum values the pixel data type can represent. A 24-bit RGB pixel is represented by 3 bytes, one each for the red, green and blue channels, and each channel can store values between 0 and 255. In this case, if the value of $I_{jk}$ is 100 and a shift of -120 is applied, $I_{jk}$ will take value 0. If $I_{jk}$ is 170 and a shift of 100 is applied, $I_{jk}$ will take value 255. We say in these cases that the pixel is saturated, resulting in darker or brighter images. The result and effect can be observed in Figure 3.
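The shift function in Equation 1 amounts to an addition followed by clipping to the range of the pixel data type. A minimal NumPy sketch, where the function name `rgb_intensity_shift` and the single test pixel are assumptions made for illustration:

```python
import numpy as np

def rgb_intensity_shift(image, shift):
    """Add `shift` to every pixel value and clip to the range of the pixel data type."""
    info = np.iinfo(image.dtype)
    # Widen to a signed type first so the addition cannot wrap around
    shifted = image.astype(np.int32) + int(shift)
    return np.clip(shifted, info.min, info.max).astype(image.dtype)

pixels = np.array([[[100, 170, 30]]], dtype=np.uint8)   # one hypothetical RGB pixel
darker = rgb_intensity_shift(pixels, -120)   # 100 -> 0, 170 -> 50, 30 -> 0
brighter = rgb_intensity_shift(pixels, 100)  # 100 -> 200, 170 -> 255, 30 -> 130
```

Widening before the addition matters: adding directly to a `uint8` array would wrap around instead of saturating.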

Figure 4: Left to right, the SDSandbox self-driving neural network training application, the Generated Track circuit, a steering angle histogram showing the distribution of steering angles when going around the track clockwise

3.2 Image pre-processing

One of the key pre-processing steps used in SOTA models such as [20] is moving images from RGB to YUV space. The authors do not discuss the motivation behind this procedure. [YUV article] points out that YUV is more related to human vision. It can be said that YUV reduces the information content of the image. We observed that when no data shift exists, the YUV mean is lower than the RGB mean. As we decrease the RGB mean, we observed that at a mean shift of minus 5 the RGB and YUV means are equal. Increasing the negative shift further leads to an increase in the distance between the RGB and YUV means.

We start with an image and take the mean value of all combined RGB channel values. We then move the image to YUV space and back to RGB space, and again compute the mean. When there is no shift in pixel intensity value, the YUV image has a lower mean than the corresponding RGB mean. Depending on the shift, and on whether the image becomes darker or brighter, the YUV mean may be higher (for darker images) or lower (for brighter images). The main takeaway is that the mean delta is lower in YUV space, creating less variability in the data and validating the use of YUV images to train self-driving CNNs and potentially other applications and architectures. This finding is a side-effect, and not the main focus of this study.

One important aspect of training data for AVs is data pre-processing, e.g. cropping such that the section of the image most relevant to the self-driving task is kept. In the example shown, 25 rows of pixels were removed from the bottom and 60 rows of pixels were removed from the top, resulting in a 75h x 320w pixel image. The image is then resized to 66h x 200w, the geometry to be presented to the network in the example shown. Finally the image is moved from RGB to YUV space. In the RGB scheme, each pixel is represented by three channel intensities of red, green and blue. In YUV, also referred to as YCbCr space, each pixel is represented by Y (luma), U (Cb - luminance value subtracted from blue channel) and V (Cr - luminance value subtracted from red channel). It is a "lossy" process which degrades the data, and was originally developed for colour to black-and-white television backward-compatibility. Moving the image from RGB to YUV space has been demonstrated to give "better subjective image quality than the RGB color space", being better for computer vision "implementations than RGB due to the perceptual similarities to the human vision" [29]. This scheme was used in previous AV data pre-processing pipelines, e.g. Dave [20] and PilotNet [5].
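The three pre-processing steps can be sketched in NumPy as follows. Several choices here are assumptions: the input frame is taken to be 160h x 320w so that the crop yields 75 rows, resizing uses nearest-neighbour sampling, and the colour conversion uses the analogue YUV coefficients with BT.601 luma weights. Other implementations may use different interpolation, constants or offsets.

```python
import numpy as np

def crop_rows(image, top=60, bottom=25):
    """Remove `top` rows from the top and `bottom` rows from the bottom."""
    return image[top:image.shape[0] - bottom]

def resize_nearest(image, out_h=66, out_w=200):
    """Nearest-neighbour resize (an assumed interpolation method)."""
    rows = np.arange(out_h) * image.shape[0] // out_h
    cols = np.arange(out_w) * image.shape[1] // out_w
    return image[rows][:, cols]

def rgb_to_yuv(image):
    """Convert RGB to YUV: luma plus scaled blue-minus-luma and red-minus-luma."""
    rgb = image.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return np.stack([y, u, v], axis=-1)

# Random stand-in for a camera frame, 160 rows by 320 columns
frame = np.random.default_rng(0).integers(0, 256, size=(160, 320, 3), dtype=np.uint8)
cropped = crop_rows(frame)          # shape (75, 320, 3)
resized = resize_nearest(cropped)   # shape (66, 200, 3)
yuv = rgb_to_yuv(resized)           # shape (66, 200, 3), floating point

# A neutral grey pixel carries no chroma: Y is about 128, U and V about 0
grey_yuv = rgb_to_yuv(np.full((1, 1, 3), 128, dtype=np.uint8))
```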

3.3 Game engines

Game engines are ubiquitous, and the concept of using virtual environments in lieu of real ones to train autonomous systems or humans, with the benefits of cost savings and security, is long established. We used a Unity game engine based self-driving training and evaluation system called SDSandbox, short for self-driving sandbox. Figure 4 shows, left to right, the SDSandbox user interface, with options including to drive, self-drive with a PID algorithm and self-drive with a neural network. In the middle is the Generated Track, and on the right is a histogram of steering angles for a dataset generated by the SDSandbox driving around the generated track, where the autonomous vehicle drives clockwise around the track and the start is on the top left of the circuit.

The Unity SDSandbox can generate a number of circuits. We obtained a labelled dataset of 20,061 frames for the Generated Road circuit and a labelled dataset of 1,394 frames for the Generated Track circuit, each frame being a jpg image 160 pixels wide by 120 pixels high, with a corresponding json file containing the steering angle and throttle applied at the time the image was saved. We trained networks to self-drive in the SDSandbox environment; a video of two Donkey Cars [10] being driven by distinct trained networks, on a similar random Generated Track, can be seen at [2]. Note that the video differs in that the frames are generated by the game engine in real time, and the simulated world is adjusted accordingly, depending on the steering and position of the simulated vehicle on the road. With the labelled datasets, the world is not adjusted, as the frames have been generated and saved previously; whatever steering prediction the model generates will not change the next frame, as it would in real time.

Figure 5: Plots of ground truth (SDSandbox PID steering output) and nvidia2 network predictions for images with pixel value intensity shifts of 40, 80 and 120, where the steering error (st. err.) for RGB shift is the MAE

3.4 Metrics

We examine 4 metrics for defining network prediction error, and 3 metrics for defining the distance between training and testing data distributions.

3.4.1 Error functions

$$MAE = \frac{1}{N}\sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| \tag{2}$$

$$MAPE = \frac{1}{N}\sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \tag{3}$$

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2} \tag{4}$$

$$MSE = \frac{1}{N}\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2 \tag{5}$$

Here MAE is the Mean Absolute Error, MAPE is the Mean Absolute Percentage Error, MSE is the Mean Squared Error and RMSE is the Root Mean Square Error. MSE is quadratic in the error and therefore penalises larger errors more heavily; RMSE is its square root and is expressed in the same units as the target, while MAE and MAPE are linear in the absolute error. $y$ is the ground truth, $\hat{y}$ is the network prediction value, and $N$ is the total number of images presented to the network, over which the error is computed.
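The four error functions translate directly into NumPy. The steering values below are hypothetical and are chosen to avoid zeros in the ground truth, since MAPE divides by $y_i$ and is undefined when a ground-truth steering angle is exactly zero.

```python
import numpy as np

def mae(y, y_hat):
    """Mean Absolute Error."""
    return float(np.mean(np.abs(y - y_hat)))

def mape(y, y_hat):
    """Mean Absolute Percentage Error, as a fraction; requires non-zero y."""
    return float(np.mean(np.abs((y - y_hat) / y)))

def mse(y, y_hat):
    """Mean Squared Error."""
    return float(np.mean((y - y_hat) ** 2))

def rmse(y, y_hat):
    """Root Mean Square Error."""
    return float(np.sqrt(mse(y, y_hat)))

# Hypothetical ground-truth and predicted steering angles
y = np.array([0.1, -0.2, 0.4, 0.5])
y_hat = np.array([0.2, -0.1, 0.4, 0.3])

errors = {"MAE": mae(y, y_hat), "MAPE": mape(y, y_hat),
          "MSE": mse(y, y_hat), "RMSE": rmse(y, y_hat)}
```

For these values the absolute errors are 0.1, 0.1, 0 and 0.2, giving an MAE of 0.1, an MSE of 0.015 and a MAPE of 0.475.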

3.4.2 Distance functions

We adapt three known distance functions to quantify the distance between two images, where the first is assumed to have been used during training, that is, to be of a known accuracy and not expected to generate a steering value that would cause the simulated vehicle to drive off the road.

We consider an RGB image as a set of distributions, each represented as a histogram, corresponding to the red, green and blue channels, where the ground truth and shifted RGB histogram bin counts are $P$ and $Q$ respectively.

We use the product $t_p$ obtained in Equation 6 to normalise our bin counts $P, Q$, where $N$ is the number of dimensions in the image array $I$ and $I_n$ is the number of elements in each image dimension $n$.

$$t_p = \prod_{n=1}^{N} I_n \tag{6}$$

We define the Bhattacharyya RGB distance as the negative natural logarithm of the Bhattacharyya RGB coefficient:

$$D_{B}rgb(P,Q) = -\ln\left(BCrgb(P,Q)\right) \tag{7}$$

where the Bhattacharyya coefficient is defined as:

$$BC(P,Q) = \sum_{x \in \mathcal{X}} \sqrt{P(x)Q(x)} \tag{8}$$

for every $x$ on the same domain $\mathcal{X}$. Since we are dealing with multiple dimensions (channels), in our Bhattacharyya RGB coefficient we explicitly divide our histogram bin counts by $t_p$, and the expression becomes:

$$\begin{aligned} BCrgb(P,Q) &= \sum_{x \in \mathcal{X}} \sqrt{\frac{P(x)}{t_p}\frac{Q(x)}{t_p}} \\ &= \sum_{x \in \mathcal{X}} \sqrt{\frac{P(x)Q(x)}{t_p^{2}}} \\ &= \sum_{x \in \mathcal{X}} \frac{\sqrt{P(x)Q(x)}}{t_p} \\ &= \sum_{x \in \mathcal{X}} \sqrt{P(x)Q(x)}\; t_p^{-1} \end{aligned} \tag{9}$$
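The coefficient and the corresponding distance can be sketched as below, assuming per-channel histograms with 256 bins each and a random stand-in image shifted by 40 intensity levels. The helper `rgb_histogram` and all example values are assumptions made for illustration.

```python
import numpy as np

def rgb_histogram(image, bins=256):
    """Per-channel bin counts, shape (3, bins), for an integer H x W x 3 image."""
    return np.stack([np.bincount(image[..., c].ravel(), minlength=bins)
                     for c in range(3)])

def bhattacharyya_rgb(p, q, t_p):
    """Return (BCrgb, D_B rgb): the coefficient sum(sqrt(P*Q)) / t_p and -ln of it."""
    bc = float(np.sum(np.sqrt(p * q)) / t_p)
    # A coefficient of zero (no overlap at all) would make the distance infinite
    return bc, float(-np.log(bc))

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
shifted = np.clip(image.astype(np.int32) + 40, 0, 255).astype(np.uint8)

t_p = image.size   # product of the array dimensions: 64 * 64 * 3
bc_same, d_same = bhattacharyya_rgb(rgb_histogram(image), rgb_histogram(image), t_p)
bc_shift, d_shift = bhattacharyya_rgb(rgb_histogram(image), rgb_histogram(shifted), t_p)
```

Identical histograms give a coefficient of 1 and a distance of 0, while the shifted image gives a coefficient below 1 and a positive distance.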

Relative entropy, also known as Kullback-Leibler divergence, for our specific use with RGB images we express as:

$$\begin{aligned} D_{\text{KLrgb}}(P \parallel Q) &= \sum_{x \in \mathcal{X}} P(x)\, t_p^{-1} \log\left(\frac{(P(x)+\epsilon)\, t_p^{-1}}{(Q(x)+\epsilon)\, t_p^{-1}}\right) \\ &= \sum_{x \in \mathcal{X}} P(x)\, t_p^{-1} \log\left(\frac{P(x)+\epsilon}{Q(x)+\epsilon}\right) \end{aligned} \tag{10}$$

We add a small positive term $\epsilon$ to $P(x), Q(x)$ to avoid division by zero, and also to avoid taking the logarithm base 10 of zero, which is undefined.

We note that the KL Divergence, in addition to being fragile for our purposes, is also asymmetric; that is, the divergence from P to Q is not the same as the divergence from Q to P.

Note that we use discrete rather than continuous values. That is, we use the original byte value for each channel and the discrete relative entropy equation. If we converted the distribution to a PDF, that is, with all aggregated counts summing to one, we would use the continuous equation. The PDF seems the better choice as it is a normalised value, though if we are dealing with a known network, the input is always of the same dimensions and can be considered normative, that is, the RGB mean is constrained by, and bound to, the input size.

We define the RGB histogram intersection between two distributions as:

\begin{equation}
H_{irgb}(P,Q) = t_{p}^{-1} \sum\limits_{x\in\mathcal{X}} \min(P(x),Q(x))
\tag{11}
\end{equation}

If $P$ and $Q$ are the same for every $x$, then the summation is equal to the product $t_{p}$ and the intersection is 1, meaning there is complete overlap between the two distributions. If $\min(P(x),Q(x))$ is equal to zero for every $x$, then the summation is equal to zero, meaning there is no overlap between the histograms.

Since the histogram RGB intersection does not take square roots or logs, it is the most computationally efficient of the three distance metrics we investigate.
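As an illustration only, the three quantities in Equations 9 to 11 can be sketched in Python with NumPy. The sketch assumes that $P$ and $Q$ are per-channel byte-value count histograms concatenated into one vector, that $t_{p}$ is the total count of $P$, and that the intersection is read as a sum of bin-wise minima. The function names and the value of `eps` are hypothetical and are not taken from our implementation.

```python
import numpy as np


def rgb_histogram(image):
    """Count byte values 0..255 per channel and concatenate (assumed layout)."""
    return np.concatenate(
        [np.bincount(image[..., c].ravel(), minlength=256) for c in range(3)]
    ).astype(np.float64)


def bhattacharyya_coefficient(p, q):
    """Eq. 9: sum of sqrt(P(x) Q(x)) scaled by 1 / t_p."""
    return float(np.sqrt(p * q).sum() / p.sum())


def relative_entropy(p, q, eps=1e-10):
    """Eq. 10: base-10 relative entropy; eps guards empty bins (assumed value)."""
    return float(np.sum(p / p.sum() * np.log10((p + eps) / (q + eps))))


def histogram_intersection(p, q):
    """Eq. 11: sum of bin-wise minima scaled by 1 / t_p."""
    return float(np.minimum(p, q).sum() / p.sum())
```

Under these assumptions an image compared with itself gives an intersection of 1, a Bhattacharyya coefficient of 1 and a relative entropy of 0. Only the intersection avoids both the square root and the logarithm.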

4 Experiments

A total of 10,360,826 predictions were made for our data shift and accuracy analysis. Using both datasets, we performed a total of 241 laps on each of the Generated Track and Generated Road circuits to obtain the MAE, MAPE and MSE for all steering predictions, similar to those shown in Figure 5, where at a shift of positive 120 (red plot) the network is clearly not steering, and at positive 80 it is on the limit of following the ground truth PID steering angle. The plot shows that the limit of safety in this case lies between positive 40 and positive 80, with approximately the same limits for negative shifts, which are not shown here due to space constraints.
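For reference, the three error measures can be computed as in the following minimal sketch. It assumes that the ground truth and predicted steering angles for a lap are available as equal-length sequences. The function name and the `eps` floor, which keeps the MAPE defined when the ground truth angle is zero, are assumptions made for illustration.

```python
import numpy as np


def steering_errors(y_true, y_pred, eps=1e-8):
    """Return MAE, MAPE (in percent) and MSE for one lap of steering angles."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    err = y_pred - y_true
    mae = float(np.mean(np.abs(err)))
    mse = float(np.mean(err ** 2))
    # Floor the denominator so straight-ahead frames (angle 0) do not divide by zero.
    mape = float(100.0 * np.mean(np.abs(err) / np.maximum(np.abs(y_true), eps)))
    return {"mae": mae, "mape": mape, "mse": mse}
```

Because steering angles are often near zero on straight sections, the MAPE can be dominated by those frames, which is one reason to read it alongside the MAE and MSE rather than on its own.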

4.1 Distribution shift metrics

We started our analysis by choosing one random image (1105_cam-image_array_.jpg) from the Generated Track dataset. Using the equations described in section 3.4.2, we computed the distance from the image to itself, which is one for the Histogram Intersection (complete overlap), and zero (distance) for Relative Entropy and Bhattacharyya Distance. We then applied positive and negative shifts, for a total of 241 shifts, from negative 120 to positive 120.
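The sweep over 241 shifts can be sketched as follows. This is an illustration under stated assumptions, not our exact code: a shift is taken to mean adding a constant to every channel value and clipping to the byte range, the histogram layout is per-channel counts concatenated, and all function names are hypothetical.

```python
import numpy as np


def shift_pixels(image, shift):
    """Add a constant to every channel value, clipped to 0..255 (assumption)."""
    return np.clip(image.astype(np.int16) + shift, 0, 255).astype(np.uint8)


def rgb_histogram(image):
    """Per-channel byte-value counts, concatenated into one vector."""
    return np.concatenate(
        [np.bincount(image[..., c].ravel(), minlength=256) for c in range(3)]
    ).astype(np.float64)


def intersection(p, q):
    """Histogram intersection scaled by the total count of p."""
    return float(np.minimum(p, q).sum() / p.sum())


def sweep(image, low=-120, high=120):
    """Intersection between an image and each shifted copy of itself."""
    base = rgb_histogram(image)
    shifts = np.arange(low, high + 1)  # 241 integer shifts for -120..120
    values = np.array(
        [intersection(base, rgb_histogram(shift_pixels(image, s))) for s in shifts]
    )
    return shifts, values
```

The entry for shift 0 is the image-to-itself case and evaluates to 1. The same loop can be reused with Relative Entropy or Bhattacharyya Distance in place of `intersection`.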

Figure 6: Histogram Intersection, Relative Entropy and Bhattacharyya Distance safe limits for RGB image pixel intensity value shifts
Table 1: Histogram Intersection in RGB space between random Generated Road frames (columns -120 to 120 give the pixel intensity shift applied to ID2)
Row ID1 ID2 -120 -80 -40 0 40 80 120
1 592 503 0.10 0.20 0.45 0.83 0.63 0.28 0.11
2 863 825 0.08 0.21 0.55 0.80 0.37 0.16 0.07
3 912 879 0.09 0.18 0.43 0.88 0.35 0.16 0.08
4 1096 410 0.14 0.33 0.69 0.62 0.29 0.16 0.08
5 519 772 0.11 0.29 0.62 0.81 0.44 0.18 0.09
6 365 1091 0.10 0.19 0.32 0.65 0.68 0.35 0.14
7 1082 1094 0.11 0.20 0.41 0.92 0.42 0.22 0.12
8 811 1300 0.07 0.15 0.26 0.63 0.72 0.32 0.17
9 146 1004 0.10 0.19 0.40 0.88 0.44 0.21 0.12
10 1199 593 0.15 0.34 0.70 0.65 0.33 0.19 0.10
11 797 1157 0.07 0.15 0.27 0.61 0.71 0.33 0.14
12 350 173 0.08 0.16 0.29 0.65 0.72 0.36 0.15
13 407 1260 0.07 0.15 0.27 0.61 0.72 0.38 0.17
14 157 1036 0.10 0.20 0.42 0.89 0.36 0.16 0.07
15 331 1139 0.09 0.18 0.31 0.65 0.69 0.36 0.16
16 623 1259 0.09 0.17 0.31 0.63 0.72 0.38 0.17
17 235 566 0.10 0.25 0.58 0.86 0.51 0.24 0.10
18 950 458 0.11 0.26 0.66 0.65 0.29 0.15 0.07
19 1237 158 0.11 0.20 0.41 0.90 0.41 0.20 0.11
20 850 1011 0.07 0.15 0.31 0.75 0.57 0.25 0.11

We observed that in RGB space the change for all three metrics is approximately linear, which provides some advantage over the same analysis in YUV space, where the change for all three metrics is approximately exponential, as shown in Figure 8, Appendix A. Two cyan vertical lines were plotted at negative and positive 40 to represent a "safe" shift range in which the autonomous system is expected to produce reliable predictions, as shown in Figures 9 and 10, Appendix A.

We then performed an experiment, with partial results shown in Tables 1 through 4. We chose fifty random image pairs from the Generated Track dataset (20 are displayed), where column ID1 is the frame number of the first random image and ID2 is the frame number of the second random image.

To the second random image, ID2, we apply six shifts, ranging from negative 120 to positive 120, and also include the ID2 image with no shift. We measure the Histogram Intersection between ID1 and ID2 for every shift and record the value, where values closer to 1 (most intersection) indicate the nearest pair, and values closer to zero (least intersection) indicate the furthest distance between the pair of images.
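One row of such a table could be produced along the lines of the sketch below. It is an illustration under assumptions: frames are indexed from 0, pairs are drawn once from a seeded generator so that the same list can be reused for every metric, a shift adds a constant to every channel value with clipping, and all names are hypothetical.

```python
import numpy as np

SHIFTS = (-120, -80, -40, 0, 40, 80, 120)  # six shifts plus the unshifted image


def sample_pairs(n_frames, n_pairs=50, seed=0):
    """Draw (ID1, ID2) frame pairs once; reuse the returned list for all metrics."""
    rng = np.random.default_rng(seed)
    return [
        tuple(int(i) for i in rng.choice(n_frames, size=2, replace=False))
        for _ in range(n_pairs)
    ]


def shift_pixels(image, shift):
    """Add a constant to every channel value, clipped to 0..255 (assumption)."""
    return np.clip(image.astype(np.int16) + shift, 0, 255).astype(np.uint8)


def rgb_histogram(image):
    """Per-channel byte-value counts, concatenated into one vector."""
    return np.concatenate(
        [np.bincount(image[..., c].ravel(), minlength=256) for c in range(3)]
    ).astype(np.float64)


def intersection(p, q):
    """Histogram intersection scaled by the total count of p."""
    return float(np.minimum(p, q).sum() / p.sum())


def table_row(first, second, shifts=SHIFTS):
    """Intersection between ID1 and each shifted copy of ID2, to two decimals."""
    base = rgb_histogram(first)
    return [
        round(intersection(base, rgb_histogram(shift_pixels(second, s))), 2)
        for s in shifts
    ]
```

Fixing the seed is one simple way to guarantee that row 1 refers to the same frame pair in every table.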

Table 2: Relative Entropy in RGB space between random Generated Road frames (columns -120 to 120 give the pixel intensity shift applied to ID2)
Row ID1 ID2 -120 -80 -40 0 40 80 120
1 592 503 1.80 1.22 0.51 0.04 0.32 1.01 1.53
2 863 825 1.82 0.98 0.34 0.07 0.83 1.58 1.97
3 912 879 1.69 0.94 0.45 0.03 0.83 1.58 1.97
4 1096 410 1.49 0.70 0.22 0.24 0.98 1.54 1.90
5 519 772 1.71 0.97 0.32 0.07 0.67 1.38 1.81
6 365 1091 1.82 1.23 0.64 0.16 0.25 0.78 1.43
7 1082 1094 1.57 0.90 0.48 0.01 0.62 1.11 1.83
8 811 1300 1.99 1.30 0.75 0.20 0.17 0.71 1.18
9 146 1004 1.63 0.92 0.50 0.03 0.54 1.29 1.80
10 1199 593 1.47 0.68 0.22 0.18 0.82 1.43 1.86
11 797 1157 1.95 1.33 0.74 0.20 0.19 0.84 1.46
12 350 173 1.88 1.31 0.71 0.17 0.17 0.72 1.37
13 407 1260 1.93 1.33 0.74 0.20 0.16 0.68 1.32
14 157 1036 1.64 0.90 0.46 0.04 0.81 1.63 2.00
15 331 1139 1.83 1.25 0.68 0.16 0.25 0.80 1.41
16 623 1259 1.85 1.28 0.69 0.17 0.17 0.65 1.26
17 235 566 1.70 1.09 0.35 0.03 0.46 1.12 1.65
18 950 458 1.63 0.82 0.22 0.18 1.03 1.60 1.97
19 1237 158 1.63 0.91 0.48 0.02 0.67 1.15 1.81
20 850 1011 1.90 1.16 0.64 0.08 0.32 1.13 1.72
Figure 7: Histogram Intersection, Relative Entropy and Bhattacharyya Distance in RGB space between frame 100 and others up to frame 1300, with a safe limit determined empirically, that is, the safe distance threshold that the Histogram Intersection should not fall below, or that the Relative Entropy and Bhattacharyya Distance should not exceed

Rows 11 and 13 are highlighted in Table 1 because they represent the largest distance between ID1 and ID2 in the range of negative to positive 40, which is considered a "safe" shift. We highlight the largest values in the negative and positive 80 columns, which represent the largest Histogram Intersection we would expect between a random image taken from, e.g., a training dataset and an image acquired by the autonomous system in a real-life situation. We could then claim that, for any such pairing, we would accept the prediction of the network controlling the autonomous system as long as the Histogram Intersection is greater than 0.40, ensuring that we are close to our safe limits; otherwise the autonomous system defers control to a human operator. The value 0.40 is approximately the y-axis value at the x values of negative and positive 40, where the Histogram Intersection plot and the cyan "safe" vertical lines intersect.

Table 3: Bhattacharyya Distance in RGB space between random Generated Road frames (columns -120 to 120 give the pixel intensity shift applied to ID2)
Row ID1 ID2 -120 -80 -40 0 40 80 120
1 592 503 1.59 0.85 0.32 0.03 0.15 0.54 1.06
2 863 825 1.74 0.81 0.23 0.04 0.39 0.88 1.51
3 912 879 1.66 0.85 0.32 0.01 0.40 0.83 1.43
4 1096 410 1.19 0.50 0.13 0.12 0.49 0.91 1.48
5 519 772 1.37 0.63 0.18 0.04 0.32 0.83 1.41
6 365 1091 1.62 0.87 0.45 0.10 0.11 0.43 0.96
7 1082 1094 1.51 0.75 0.35 0.01 0.30 0.64 1.23
8 811 1300 2.09 1.12 0.59 0.14 0.09 0.41 0.80
9 146 1004 1.55 0.75 0.35 0.02 0.29 0.69 1.21
10 1199 593 1.13 0.48 0.13 0.10 0.42 0.82 1.36
11 797 1157 2.17 1.20 0.59 0.14 0.09 0.45 0.95
12 350 173 1.78 0.98 0.49 0.10 0.08 0.39 0.91
13 407 1260 1.91 1.08 0.56 0.13 0.08 0.37 0.85
14 157 1036 1.52 0.73 0.33 0.02 0.39 0.89 1.52
15 331 1139 1.74 0.95 0.47 0.10 0.11 0.42 0.92
16 623 1259 1.76 0.96 0.48 0.11 0.08 0.36 0.81
17 235 566 1.43 0.72 0.22 0.02 0.24 0.65 1.22
18 950 458 1.43 0.65 0.15 0.10 0.49 0.90 1.49
19 1237 158 1.48 0.72 0.33 0.01 0.33 0.67 1.24
20 850 1011 2.00 1.04 0.50 0.05 0.15 0.58 1.13

We repeated the experiment for Relative Entropy, with results shown in Table 2, highlighting rows 4, 11 and 13, which represent the largest Relative Entropy values between ID1 and ID2, where larger is further (greater distribution shift) and smaller is nearer (smaller distribution shift). We highlight the smallest values in the negative and positive 80 shift columns, and claim that as long as the Relative Entropy between a random image from the training dataset and one acquired from the autonomous system camera is less than 0.60, we will accept the prediction of the network in control of the autonomous system; otherwise the prediction is considered unsafe and control is delegated to a human operator.

We repeated the experiment in RGB space and computed the Bhattacharyya Distance, with results shown in Table 3. We use the same images for all distance metrics: the 50 image pairs were chosen randomly only once, and the same list of pairs is used thereafter, so row 1 is always a comparison between frames 592 and 503 in all six tables (three RGB and three YUV) for the Histogram Intersection, Relative Entropy and Bhattacharyya Distance metrics. Again we highlight the rows with the largest distances, rows 11 and 13, which coincide across all three tables described so far. Again we look at the smallest values in the negative and positive 80 columns, and claim that a Bhattacharyya Distance of 0.30 or less, between a randomly chosen image from the training dataset and an image acquired by the autonomous system's camera, is safe to use: the network prediction controlling the autonomous system will be accepted; otherwise the prediction will be rejected and control will be returned to a human operator.

We concluded that in the RGB space, any of the three distance metrics would be adequate, and the least computationally intensive would be preferred, which is the Histogram Intersection.

5 Conclusion

A formulation can be given by defining a range of maximum and minimum distribution shift, using pixel intensity value averages as a proxy, over the entire training dataset, together with the allowed data distribution shift. The output is a boolean value, where true means the image and resulting prediction are safe to use, and false otherwise.

\begin{equation}
P_{safe} =
\begin{cases}
true, & \text{if $|D_{B}rgb(P,Q)| < |D_{B}rgb(P,Q_{sr})|$}. \\
true, & \text{if $|D_{\text{KLrgb}}(P\parallel Q)| < |D_{\text{KLrgb}}(P\parallel Q_{sr})|$}. \\
true, & \text{if $|H_{irgb}(P,Q)| > |H_{irgb}(P,Q_{sr})|$}. \\
false, & \text{otherwise}.
\end{cases}
\tag{12}
\end{equation}

Where $P_{safe}$ is a boolean value determining if it is safe to make a prediction from input $Q$ relative to $Q_{sr}$, and $Q_{sr}$ is the safe range between negative and positive $sr$, and the comparisons are performed in RGB space. Given the computational advantages and strong linearity, $H_{irgb}$ is the preferred distance metric.
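A minimal sketch of this decision rule, for one chosen metric at a time, is given below. It assumes the empirical limits reported in Section 4.1 (0.40, 0.60 and 0.30) stand in for the right-hand sides of Equation 12, and it uses the strict inequalities of that equation. The dictionary, function names and the use of `None` to signal a handover are hypothetical.

```python
import operator

# Empirical limits from Section 4.1, paired with the comparison in Eq. 12.
THRESHOLDS = {
    "histogram_intersection": (operator.gt, 0.40),
    "relative_entropy": (operator.lt, 0.60),
    "bhattacharyya_distance": (operator.lt, 0.30),
}


def is_safe(metric, value):
    """True when the measured value lies on the safe side of its limit."""
    compare, limit = THRESHOLDS[metric]
    return compare(abs(value), limit)


def gated_prediction(metric, value, prediction):
    """Return the network prediction if safe, else None to signal a handover."""
    return prediction if is_safe(metric, value) else None
```

For example, with the row 1 values of Table 1, an intersection of 0.83 (no shift) passes the check, while 0.28 (shift of positive 80) does not and would hand control to a human operator.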

References

  • [1] Charu C Aggarwal and Philip S Yu. Outlier detection for high dimensional data. In Proceedings of the 2001 ACM SIGMOD international conference on Management of data, pages 37–46, 2001.
  • [2] Anonymous. https://youtu.be/v0Otdmtdhnk, 2020.
  • [3] Sabyasachi Basu and Martin Meckesheimer. Automatic outlier detection for time series: an application to sensor data. Knowledge and Information Systems, 11(2):137–154, 2007.
  • [4] Steffen Bickel, Michael Brückner, and Tobias Scheffer. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10(9), 2009.
  • [5] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. End to end learning for self-driving cars, 2016.
  • [6] Saikiran Bulusu, Bhavya Kailkhura, Bo Li, Pramod K Varshney, and Dawn Song. Anomalous example detection in deep learning: A survey. IEEE Access, 8:132330–132347, 2020.
  • [7] Raghavendra Chalapathy and Sanjay Chawla. Deep learning for anomaly detection: A survey. arXiv preprint arXiv:1901.03407, 2019.
  • [8] Brent Cowan and Bill Kapralos. A survey of frameworks and game engines for serious game development. In 2014 IEEE 14th International Conference on Advanced Learning Technologies, pages 662–664. IEEE, 2014.
  • [9] Christopher P Diehl and John B Hampshire. Real-time object classification and novelty detection for collaborative video surveillance. In Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN’02 (Cat. No. 02CH37290), volume 3, pages 2620–2625. IEEE, 2002.
  • [10] Donkey Car. An opensource diy self driving platform for small scale cars, 2022.
  • [11] Janet Fleetwood. Public health, ethics, and autonomous vehicles. Am J Public Health, 107(4):532–537, 2017.
  • [12] Stanislav Fort, Jie Ren, and Balaji Lakshminarayanan. Exploring the limits of out-of-distribution detection. Advances in Neural Information Processing Systems, 34:7068–7081, 2021.
  • [13] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • [14] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
  • [15] Victoria Hodge and Jim Austin. A survey of outlier detection methodologies. Artificial intelligence review, 22(2):85–126, 2004.
  • [16] Yen-Chang Hsu, Yilin Shen, Hongxia Jin, and Zsolt Kira. Generalized odin: Detecting out-of-distribution image without learning from out-of-distribution data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10951–10960, 2020.
  • [17] Rui Huang and Yixuan Li. Mos: Towards scaling out-of-distribution detection for large semantic space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8710–8719, 2021.
  • [18] Hannah R Kerner, Danika F Wellington, Kiri L Wagstaff, James F Bell, Chiman Kwan, and Heni Ben Amor. Novelty detection for multispectral images with application to planetary exploration. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9484–9491, 2019.
  • [19] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [20] Y LeCun, E Cosatto, J Ben, U Muller, and B Flepp. Dave: Autonomous off-road vehicle control using end-to-end learning. Courant Institute/CBLL, http://www.cs.nyu.edu/yann/research/dave/index.html, Tech. Rep. DARPA-IPTO Final Report, 2004.
  • [21] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in neural information processing systems, 31, 2018.
  • [22] Shiyu Liang, Yixuan Li, and Rayadurgam Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017.
  • [23] Ming Lin, Jaewoo Yoon, and Byeongwoo Kim. Self-driving car location estimation based on a particle-aided unscented kalman filter. Sensors, 20(9), 2020.
  • [24] Benjamin Loeb and Kara M Kockelman. Fleet performance and cost evaluation of a shared autonomous electric vehicle (saev) fleet: A case study for austin, texas. Transportation Research Part A: Policy and Practice, 121:374–385, 2019.
  • [25] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. Advances in neural information processing systems, 32, 2019.
  • [26] Guansong Pang, Chunhua Shen, Longbing Cao, and Anton Van Den Hengel. Deep learning for anomaly detection: A review. ACM Computing Surveys (CSUR), 54(2):1–38, 2021.
  • [27] Stanislav Pidhorskyi, Ranya Almohsen, and Gianfranco Doretto. Generative probabilistic novelty detection with adversarial autoencoders. Advances in neural information processing systems, 31, 2018.
  • [28] Marco AF Pimentel, David A Clifton, Lei Clifton, and Lionel Tarassenko. A review of novelty detection. Signal processing, 99:215–249, 2014.
  • [29] Michal Podpora, Grzegorz Pawel Korbas, and Aleksandra Kawala-Janik. Yuv vs rgb-choosing a color space for human-machine interaction. In FedCSIS (Position Papers), pages 29–34, 2014.
  • [30] Joaquin Quinonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. MIT Press, 2008.
  • [31] William Riggs and Sven A Beiker. Business models for shared and autonomous mobility. In Automated Vehicles Symposium, pages 33–48. Springer, 2019.
  • [32] Lukas Ruff, Jacob R Kauffmann, Robert A Vandermeulen, Grégoire Montavon, Wojciech Samek, Marius Kloft, Thomas G Dietterich, and Klaus-Robert Müller. A unifying review of deep and shallow anomaly detection. Proceedings of the IEEE, 109(5):756–795, 2021.
  • [33] Gabi Shalev, Yossi Adi, and Joseph Keshet. Out-of-distribution detection using multiple semantic label representations. Advances in Neural Information Processing Systems, 31, 2018.
  • [34] Masashi Sugiyama and Motoaki Kawanabe. Machine learning in non-stationary environments: Introduction to covariate shift adaptation. MIT press, 2012.
  • [35] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. Advances in neural information processing systems, 20, 2007.
  • [36] Jihoon Tack, Sangwoo Mo, Jongheon Jeong, and Jinwoo Shin. Csi: Novelty detection via contrastive learning on distributionally shifted instances. Advances in neural information processing systems, 33:11839–11852, 2020.
  • [37] John von Neumann. Probabilistic logics and the synthesis of reliable organisms from unreliable components. Automata studies, 34:43–98, 1956.
  • [38] Darrell M West. Moving forward: self-driving vehicles in china, europe, japan, korea, and the united states. Center for Technology Innovation at Brookings: Washington, DC, USA, 2016.
  • [39] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization, 2017.

Appendix A Supporting Materials

Figure 8: Histogram Intersection, Relative Entropy and Bhattacharyya Distance safe limits for YUV image pixel intensity value shifts
Table 4: Bhattacharyya Distance in YUV space between random Generated Road frames (columns -120 to 120 give the pixel intensity shift applied to ID2)
Row ID1 ID2 -120 -80 -40 0 40 80 120
1 592 503 3.51 2.14 1.62 0.07 1.41 1.97 3.33
2 863 825 3.66 2.74 1.37 0.10 2.00 2.76 4.39
3 912 879 3.02 2.35 1.45 0.08 2.39 2.72 3.83
4 1096 410 3.20 2.41 1.31 0.20 2.23 2.95 4.46
5 519 772 2.90 1.84 1.35 0.13 1.63 2.44 3.59
6 365 1091 3.56 2.74 1.73 0.16 1.37 2.34 3.29
7 1082 1094 3.44 2.85 2.32 0.03 2.33 2.80 3.45
8 811 1300 4.95 2.91 2.35 0.24 1.23 2.11 3.27
9 146 1004 3.24 2.68 1.85 0.05 1.82 2.57 3.07
10 1199 593 2.87 2.07 1.25 0.18 1.67 2.50 3.43
11 797 1157 4.64 2.90 2.22 0.24 1.20 2.35 3.67
12 350 173 4.12 3.16 1.85 0.20 1.38 2.29 3.32
13 407 1260 4.71 3.01 2.37 0.18 1.20 2.01 2.87
14 157 1036 3.44 2.70 2.38 0.06 2.29 2.94 3.89
15 331 1139 3.64 2.66 1.81 0.15 1.32 2.25 3.21
16 623 1259 3.73 2.78 1.72 0.18 1.28 2.16 2.98
17 235 566 3.54 2.88 1.55 0.11 1.44 2.36 3.87
18 950 458 2.88 2.06 1.20 0.17 2.24 2.81 4.35
19 1237 158 4.72 2.95 1.93 0.06 2.28 2.75 3.47
20 850 1011 4.45 2.78 2.30 0.13 1.50 2.02 2.71
Figure 9: Plots of ground truth (SDSandbox PID steering output) and nvidia2 network predictions for images with pixel value intensity shifts of 0, -40, -80 and -120
Figure 10: Plots of ground truth (SDSandbox PID steering output) and nvidia1 network predictions for images with pixel value intensity shifts of 40, 80 and 120