11institutetext: Bar-Ilan University, 11email: {ravivav1, alufmem, hillelk}@biu.ac.il 22institutetext: Hebrew University of Jerusalem, 22email: {yizhak.elboher, g.katz}@mail.huji.ac.il

Formal Verification of Object Detection

Avraham Raviv 11    Yizhak Y. Elboher 22    Michelle Aluf-Medina 11    Yael Leibovich Weiss 11    Omer Cohen 11    Roy Assa 11    Guy Katz 22    Hillel Kugler 11
Abstract

Deep Neural Networks (DNNs) are ubiquitous in real-world applications, yet they remain vulnerable to errors and adversarial attacks. This work tackles the challenge of applying formal verification to ensure the safety of computer vision models, extending verification beyond image classification to object detection. We propose a general formulation for certifying the robustness of object detection models using formal verification and outline implementation strategies compatible with state-of-the-art verification tools. Our approach enables the application of these tools, originally designed for verifying classification models, to object detection. We define various attacks for object detection, illustrating the diverse ways adversarial inputs can compromise neural network outputs. Our experiments, conducted on several common datasets and networks, reveal potential errors in object detection models, highlighting system vulnerabilities and emphasizing the need for expanding formal verification to these new domains. This work paves the way for further research in integrating formal verification across a broader range of computer vision applications.

Keywords:
Adversarial Attacks, Formal Verification, Object Detection.

1 Introduction

Deep Neural Networks (DNNs) have demonstrated remarkable capabilities [1], consistently achieving state-of-the-art performance across a wide range of domains, such as Computer Vision (CV)[2, 3], Natural Language Processing (NLP)[4, 5, 6], Audio[7, 8] and Video [9, 10]. Despite their significant success in many tasks, integrating DNNs in safety-critical industries like autonomous vehicles, aviation, financial services, and healthcare is very limited: their statistical nature makes them vulnerable to both innocent and malicious alterations, such that even small, often imperceptible, perturbations to the input can lead to critical errors.

This vulnerability has spawned an entire branch of research dedicated to the attack and defense of DNNs. Numerous methods have been proposed to exploit these weaknesses, while in contrast, others have been developed to defend against such attacks and enhance robustness. A key family of these methods is based on formal verification[11, 12], which incorporates formal methods into the realm of DNNs to verify neural networks. Initially, the focus was primarily on classification networks and, to some extent, tabular data. More recently, interest has expanded to include applications in Natural Language Processing[13], Reinforcement Learning[14, 15], and robotics[16, 17]. However, the primary focus remains on computer vision, particularly image classification. Other computer vision tasks, such as object detection and segmentation, have yet to be explored using formal verification tools.

In this work, we make a first attempt to close this gap by extending current verifiers, which are designed for image classification, to be suitable for object detection tasks. We propose a general formulation that covers these tasks and outline the method for their verification. Beyond the formulation, we not only explain how to leverage verifiers for new tasks, but also define different attacks for each task. This involves understanding the task and identifying the various ways to attack the output, leading to potential errors. Once these attacks are clearly defined, the final step is to adapt the verifiers, which often require minimal additional adaptation. Alongside the theoretical formulation, we provide examples of several attacks on object detectors, which demonstrate the need and importance of extending formal methods for DNNs to broader computer vision tasks.

Refer to caption
Figure 1: Example of Attacked Images. Illustrations of adversarial attacks on OD models, including misclassification, misdetections (extra objects in the example), and mislocalization. Note: Noise in these examples is amplified 100×100\times100 × for visibility.

First, we provide a comprehensive framework to define object detection as a formal verification problem. By establishing clear definitions, we set the stage for systematically assessing the robustness of models against various attacks. Second, we show how existing verifiers, primarily designed for classification tasks, can be adjusted to handle these new formulations. This adaptation offers practical insights into utilizing current tools to verify a broader range of computer vision tasks with minimal changes. Lastly, we implement this framework and demonstrate its effectiveness through a series of examples. These examples highlight the practical application of our approach in evaluating and improving model robustness across different computer vision tasks. Figure 1 illustrates an example that encapsulates the core idea, where our method causes a detector to fail in various ways, highlighting the critical challenges this approach aims to address.

The rest of this paper is organized as follows: Sec. 2 provides an overview of related work on neural network verification and adversarial attacks in computer vision. Sec. 3 outlines our methodology, presenting the formulation of several computer vision tasks as formal verification problems and detailing how verifiers can be adapted to handle these tasks. Sec. 4 presents the results, demonstrating the effectiveness of our approach through examples. Finally, Sec. 5 concludes with insights and suggests future research directions.

2 Related Work

This work bridges the gap between object detection and formal verification of neural networks. Among various existing object detection methods, including R-CNN [18, 19, 20], SSD [21], and YOLO[22], the latter is the most efficient in terms of inference speed. YOLO has undergone several improvements in precision and performance [23, 24, 25, 26, 27, 28, 29, 30, 31, 32], as well as in scaling and multi-tasking capabilities [33, 34].

Our research builds upon the formal verification of deep neural networks. There are multiple verification approaches and tools, ranging from SMT-Based methods [35, 36], symbolic interval propagation [37, 38], MILP and LP-solvers, to abstract interpretation methods [39, 40, 41, 42] with branch-and-bound mechanisms [43, 44, 45]. We utilize Alpha-Beta-CROWN, which supports networks with multiple outputs and a variety of activation functions. This tool has won the last two international competitions in DNN verification [46, 47], which compare verification tools by their precision and efficiency.

A recent study by Cohen et al. [48] on proving properties of object detection models presents several key differences from our work. Firstly, we are pioneering the application of a state-of-the-art verification tool. Consequently, our method achieves completeness and successfully incorporates counterexample extraction and adversarial attacks. In addition, Cohen et al.’s method has currently been applied to small networks, while our approach has been integrated with the YOLO model, showcasing its practical applicability.

Range analysis methods [49] can compute a guaranteed range for the outputs of DNNs using a combination of local search and linear programming. Future work can investigate applying such methods to certify object detection models or identify adversarial attacks. This requires defining the constraints on allowed inputs and analyzing whether the guaranteed ranges enable adversarial attacks, and demonstrating this can be done in a scalable way.

In addition to YOLO, formal verification of DNNs has been applied in NLP [13], reinforcement learning [50, 51, 14, 16, 52, 53], explainability [54, 15], robustness assessment [55], binarized neural networks [56], generalization [57, 58], proof production [59] and delta-debugging[60], and was enhanced by classical optimization method in formal verification, e.g. abstraction [61, 52, 62, 63] and residual reasoning [64].

We believe these works are complementary to our own, and it will be of interest to explore potential connections between them in the future.

3 Methodology

This section navigates through the verification process step by step. First, it covers the basics with necessary preliminaries. Then, it delves into the specific targets focused on in computer vision and their motivation. Finally, it formulates these targets into the layers of neural networks, integrating them as an essential part of the verification process.

3.1 Preliminaries

In computer vision, various tasks aim to understand and interpret visual data, each with its own unique challenges. Image classification models categorize entire images into predefined classes. Formally, such a model takes an input image \mathcal{I}caligraphic_I and maps it to a class label c𝑐citalic_c, chosen from a discrete set of possible labels 𝒞𝒞\mathcal{C}caligraphic_C. Each label is assigned a confidence score, and the highest score indicates the predicted class.

Object Detection (OD) extends this by not only identifying the classes present in an image but also localizing each object by predicting bounding boxes. An OD model takes an input image \mathcal{I}caligraphic_I and produces a set of bounding boxes \mathcal{B}caligraphic_B. Each bounding box in \mathcal{B}caligraphic_B is a tuple (x,y,w,h,c)𝑥𝑦𝑤𝑐{(x,y,w,h,c)}( italic_x , italic_y , italic_w , italic_h , italic_c ), where (x,y)𝑥𝑦{(x,y)}( italic_x , italic_y ) represents the coordinates of the top-left corner, (w,h)𝑤{(w,h)}( italic_w , italic_h ) denotes its dimensions, and c𝑐citalic_c signifies the detected object’s class label.

An adversarial attack involves intentionally altering an original data point 0dsubscript0superscript𝑑\mathcal{I}_{0}\in\mathbb{R}^{d}caligraphic_I start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT to a different one, dsuperscript𝑑\mathcal{I}\in\mathbb{R}^{d}caligraphic_I ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT, with 0subscript0\mathcal{I}\neq\mathcal{I}_{0}caligraphic_I ≠ caligraphic_I start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, so that the altered data is interpreted differently by the model, potentially leading to incorrect or unintended outputs.

In formal verification, the objective is to ensure that the neural network satisfies certain desired properties under all possible input scenarios within a specified input set. The input set \mathcal{I}caligraphic_I is often defined by a set of constraints 𝒞insubscript𝒞𝑖𝑛\mathcal{C}_{in}caligraphic_C start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT, which bound the possible values the input can take. For instance, given an input 0subscript0\mathcal{I}_{0}caligraphic_I start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, the input set could be defined as all points \mathcal{I}caligraphic_I such that the difference between \mathcal{I}caligraphic_I and 0subscript0\mathcal{I}_{0}caligraphic_I start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is within a specified range: |0|δsubscript0𝛿|\mathcal{I}-\mathcal{I}_{0}|\leq\delta| caligraphic_I - caligraphic_I start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT | ≤ italic_δ, where δ𝛿\deltaitalic_δ represents the allowable perturbation.

The model’s output, denoted as 𝒪=f()𝒪𝑓\mathcal{O}=f(\mathcal{I})caligraphic_O = italic_f ( caligraphic_I ), is then verified against a set of output constraints 𝒞outsubscript𝒞𝑜𝑢𝑡\mathcal{C}_{out}caligraphic_C start_POSTSUBSCRIPT italic_o italic_u italic_t end_POSTSUBSCRIPT, which define the expected or acceptable behavior of the model. The task is to prove that for all inputs \mathcal{I}caligraphic_I satisfying 𝒞insubscript𝒞𝑖𝑛\mathcal{C}_{in}caligraphic_C start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT, the outputs 𝒪𝒪\mathcal{O}caligraphic_O will satisfy 𝒞outsubscript𝒞𝑜𝑢𝑡\mathcal{C}_{out}caligraphic_C start_POSTSUBSCRIPT italic_o italic_u italic_t end_POSTSUBSCRIPT. Formally, the verification problem can be stated as:

𝒞in,f()𝒞outformulae-sequencefor-allsubscript𝒞𝑖𝑛𝑓subscript𝒞𝑜𝑢𝑡\forall\mathcal{I}\in\mathcal{C}_{in},\quad f(\mathcal{I})\in\mathcal{C}_{out}∀ caligraphic_I ∈ caligraphic_C start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT , italic_f ( caligraphic_I ) ∈ caligraphic_C start_POSTSUBSCRIPT italic_o italic_u italic_t end_POSTSUBSCRIPT

This ensures that the neural network behaves correctly for all possible inputs within the defined constraints, providing a strong guarantee of the model’s reliability in practical scenarios.

For image classification verification, the goal is to ensure that a model reliably predicts the correct class label under input variations. The input constraints 𝒞insubscript𝒞𝑖𝑛\mathcal{C}_{in}caligraphic_C start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT define permissible modifications to an original image 0subscript0\mathcal{I}_{0}caligraphic_I start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, forming a set of inputs \mathcal{I}caligraphic_I where |0|δsubscript0𝛿|\mathcal{I}-\mathcal{I}_{0}|\leq\delta| caligraphic_I - caligraphic_I start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT | ≤ italic_δ. This ensures that all modified inputs remain sufficiently similar to the original.

The output constraints 𝒞outsubscript𝒞𝑜𝑢𝑡\mathcal{C}_{out}caligraphic_C start_POSTSUBSCRIPT italic_o italic_u italic_t end_POSTSUBSCRIPT require that the predicted class label remains consistent for any input \mathcal{I}caligraphic_I within 𝒞insubscript𝒞𝑖𝑛\mathcal{C}_{in}caligraphic_C start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT. This consistency is crucial for maintaining classification accuracy despite input perturbations within the specified range. So far, such verification techniques in computer vision have been limited to image classification tasks.

Unlike the classification domain, which typically revolves around a singular option – misclassification – OD presents a broader range of potential attack vectors. In OD, adversaries have multiple avenues to explore, including:

  1. 1.

    Misclassification: This attack vector mirrors classification-based adversarial attacks and involves deliberately causing OD models to incorrectly label detected objects. Formally, when the output of the OD model is a tuple 𝒟=(x,y,h,w,c)𝒟𝑥𝑦𝑤𝑐\mathcal{D}=(x,y,h,w,c)caligraphic_D = ( italic_x , italic_y , italic_h , italic_w , italic_c ), misclassification occurs when cground truth𝑐ground truthc\neq\text{ground truth}italic_c ≠ ground truth.

  2. 2.

    Mislocalization: This attack vector involves perturbing the localization aspect of OD models, thereby causing them to erroneously estimate the position and extent of objects within the image. Formally, mislocalization occurs when any of the localization parameters (x𝑥xitalic_x, y𝑦yitalic_y, hhitalic_h, or w𝑤witalic_w) deviates significantly from their ground truth values, leading to inaccurate object localization. Given that perfect matching is rarely attainable, the success of the mislocalization attack is assessed by calculating the Intersection over Union (IoU) between the original detection and the perturbed one. If the IoU is below a predefined threshold, it signifies the effectiveness of the attack.

    IoU is the shared area between the actual object and the detected one, formally defined using bounding box (bb) notations as:

    IoU=Area of Overlap between the detected bb and the ground truth bbArea of Union between the detected bb and the ground truth bbIoUArea of Overlap between the detected bb and the ground truth bbArea of Union between the detected bb and the ground truth bb\text{IoU}=\frac{\text{Area of Overlap between the detected bb and the ground % truth bb}}{\text{Area of Union between the detected bb and the ground truth bb}}IoU = divide start_ARG Area of Overlap between the detected bb and the ground truth bb end_ARG start_ARG Area of Union between the detected bb and the ground truth bb end_ARG

    Using mathematical symbols, this can be represented as:

    IoU=|AB||AB|IoU𝐴𝐵𝐴𝐵\text{IoU}=\frac{|A\cap B|}{|A\cup B|}IoU = divide start_ARG | italic_A ∩ italic_B | end_ARG start_ARG | italic_A ∪ italic_B | end_ARG

    where A𝐴Aitalic_A represents the area of the detected bounding box, and B𝐵Bitalic_B represents the area of the ground truth bounding box.

  3. 3.

    Misdetection/False Positives: In this scenario, the aim is to manipulate the model into either missing the detection of genuine objects or generating false positives—detecting objects that are not present in the image. Assume there are d𝑑ditalic_d objects in a given image, each represented as a ground truth tuple 𝒢i=(xi,yi,hi,wi,ci)subscript𝒢𝑖subscript𝑥𝑖subscript𝑦𝑖subscript𝑖subscript𝑤𝑖subscript𝑐𝑖\mathcal{G}_{i}=(x_{i},y_{i},h_{i},w_{i},c_{i})caligraphic_G start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) for i=1,2,,d𝑖12𝑑i=1,2,\ldots,ditalic_i = 1 , 2 , … , italic_d. The output of the OD model is represented as a set of tuples 𝒟={𝒟1,𝒟2,,𝒟n}𝒟subscript𝒟1subscript𝒟2subscript𝒟𝑛\mathcal{D}=\{\mathcal{D}_{1},\mathcal{D}_{2},\ldots,\mathcal{D}_{n}\}caligraphic_D = { caligraphic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , caligraphic_D start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , caligraphic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT }, where each 𝒟j=(xj,yj,hj,wj,cj)subscript𝒟𝑗subscript𝑥𝑗subscript𝑦𝑗subscript𝑗subscript𝑤𝑗subscript𝑐𝑗\mathcal{D}_{j}=(x_{j},y_{j},h_{j},w_{j},c_{j})caligraphic_D start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = ( italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_h start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) corresponds to a detected object by the model. An attack of this type could manifest in one of two ways:

    Missing Genuine Objects: For any 𝒢isubscript𝒢𝑖\mathcal{G}_{i}caligraphic_G start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, if there exists no 𝒟jsubscript𝒟𝑗\mathcal{D}_{j}caligraphic_D start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT such that cj=cisubscript𝑐𝑗subscript𝑐𝑖c_{j}=c_{i}italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and the IoU between 𝒢isubscript𝒢𝑖\mathcal{G}_{i}caligraphic_G start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and 𝒟jsubscript𝒟𝑗\mathcal{D}_{j}caligraphic_D start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT exceeds a predefined threshold, it indicates that a genuine object has been missed by the OD model.

    False Positives: If there exists 𝒟jsubscript𝒟𝑗\mathcal{D}_{j}caligraphic_D start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT for which cjsubscript𝑐𝑗c_{j}italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT does not match any ground truth class label and the IoU between 𝒟jsubscript𝒟𝑗\mathcal{D}_{j}caligraphic_D start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT and all 𝒢isubscript𝒢𝑖\mathcal{G}_{i}caligraphic_G start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is below a predefined threshold, it suggests the generation of false positives, where non-existent objects are erroneously detected. The effectiveness of the attack is determined by assessing the occurrences of these two scenarios.

3.2 Method

After establishing the preliminaries and defining formal verification for image classification, we expand the scope to include other key computer vision tasks. We begin by explaining the different types of attacks in detail, then integrate them into a comprehensive, unified formulation.

Object detection involves both locating and classifying one or more objects in an image. The most straightforward attack targets the classification aspect, similar to traditional neural network verification methods. However, the more intriguing challenge lies in extending beyond classification to handle the unique characteristics of OD. Unlike classification, OD deals with multiple objects in an image, introducing the possibility of attacks that manipulate the number of detected objects. Furthermore, the natural extension of these attacks focuses on localization, aiming to deceive the detector into identifying objects but placing them in incorrect locations.

Up to this point, the unique attacks specific to object detection, which is not found in classification, have been outlined: targeting multiple objects and localization errors. While the input constraints remain the same, the output constraints differ. However, these constraints share a common foundation, allowing for a generic formulation. As previously stated, the most general form of the output constraint is 𝒞in,f()𝒞outformulae-sequencefor-allsubscript𝒞𝑖𝑛𝑓subscript𝒞𝑜𝑢𝑡\forall\mathcal{I}\in\mathcal{C}_{in},\quad f(\mathcal{I})\in\mathcal{C}_{out}∀ caligraphic_I ∈ caligraphic_C start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT , italic_f ( caligraphic_I ) ∈ caligraphic_C start_POSTSUBSCRIPT italic_o italic_u italic_t end_POSTSUBSCRIPT. First, let’s demonstrate how each type of attack aligns with this form:

For the attack on the number of detected objects, the verification process aims to align the detected objects with the ground truth. Naively, the number of detected object (##\#\mathcal{B}# caligraphic_B) should be aligned with the number of ground truth #GT#𝐺𝑇\#GT# italic_G italic_T. Since false-positives and false-negatives cancel each other in ##\#\mathcal{B}# caligraphic_B, the desired property is changed: #positive=#GTpositive#subscriptpositive#𝐺subscript𝑇positive\#\mathcal{B}_{\text{positive}}=\#GT_{\text{positive}}# caligraphic_B start_POSTSUBSCRIPT positive end_POSTSUBSCRIPT = # italic_G italic_T start_POSTSUBSCRIPT positive end_POSTSUBSCRIPT. This constraint ensures that the true positive detections align with the ground truth, providing a more reliable assessment of object detection performance and effectively capturing both false positives and false negatives.

For localization, the focus is on the positions of the detected objects. The Intersection over Union (IoU) measure quantifies the overlap between detections and ground truth objects. If the IoU of each detected object exceeds a predefined threshold, τ𝜏\tauitalic_τ, and the detected class labels are accurate, then the objects are considered to be correctly located. This measure indicates the degree of agreement between the detected bounding box and the ground truth. The motivation here is to verify that the IoU of each detected object is indeed higher than τ𝜏\tauitalic_τ, ensuring accurate localization and reliable assessment of detection performance.

3.3 Encoding

To implement these operations and constraints, we explored two directions, each with its pros and cons, which we will discuss. The first approach, the more direct and naïve one, aims to be as implicit as possible. This means running a given network, taking its native output, and querying it using a verifier with minimal required modifications. The second approach is to be explicit: encode each unique operation as a neural network layer, which can be verified natively. While the first method is general and offloads most of the process to behind the scenes, it forces the verifier to abstract all network operations. In contrast, the second approach, being more explicit, shifts the responsibility to us and leaves the verifier to encode known operations. This method is, of course, more complex but offers greater control and enables more feasible scenarios.

Since the first method requires no additional formal definitions we focus here on the formulation of the second approach. A verifier originally designed for classification can be adapted to other tasks by encoding unique operators or stages as DNN layers, which can then be integrated into the verifier. The network itself is standard, consisting of linear layers and non-linear activations. Those layers and activations are commonly encoded into the verifier, and we add encoding for specialized tasks, such as matching detection and ground truth for OD. This approach ensures that these specialized operators are rigorously formulated and effectively encoded into the standard verifier.

Modeling IoU with Piecewise Linear Constraints

Given two bounding boxes Bgt=(xg,yg,wg,hg)subscript𝐵𝑔𝑡subscript𝑥𝑔subscript𝑦𝑔subscript𝑤𝑔subscript𝑔B_{gt}=(x_{g},y_{g},w_{g},h_{g})italic_B start_POSTSUBSCRIPT italic_g italic_t end_POSTSUBSCRIPT = ( italic_x start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT , italic_w start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT , italic_h start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT ) and Bdt=(xd,yd,wd,hd)subscript𝐵𝑑𝑡subscript𝑥𝑑subscript𝑦𝑑subscript𝑤𝑑subscript𝑑B_{dt}=(x_{d},y_{d},w_{d},h_{d})italic_B start_POSTSUBSCRIPT italic_d italic_t end_POSTSUBSCRIPT = ( italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT , italic_w start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT , italic_h start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ), the IoU can be calculated using the following definitions and constraints:

Intersection Calculation

The coordinates of the intersection rectangle between Bgtsubscript𝐵𝑔𝑡B_{gt}italic_B start_POSTSUBSCRIPT italic_g italic_t end_POSTSUBSCRIPT and Bdtsubscript𝐵𝑑𝑡B_{dt}italic_B start_POSTSUBSCRIPT italic_d italic_t end_POSTSUBSCRIPT are:

xi1subscript𝑥𝑖1\displaystyle x_{i1}italic_x start_POSTSUBSCRIPT italic_i 1 end_POSTSUBSCRIPT =max(xg,xd),absentsubscript𝑥𝑔subscript𝑥𝑑\displaystyle=\max(x_{g},x_{d}),= roman_max ( italic_x start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) ,
yi1subscript𝑦𝑖1\displaystyle y_{i1}italic_y start_POSTSUBSCRIPT italic_i 1 end_POSTSUBSCRIPT =max(yg,yd),absentsubscript𝑦𝑔subscript𝑦𝑑\displaystyle=\max(y_{g},y_{d}),= roman_max ( italic_y start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) ,
xi2subscript𝑥𝑖2\displaystyle x_{i2}italic_x start_POSTSUBSCRIPT italic_i 2 end_POSTSUBSCRIPT =min(xg+wg,xd+wd),absentsubscript𝑥𝑔subscript𝑤𝑔subscript𝑥𝑑subscript𝑤𝑑\displaystyle=\min(x_{g}+w_{g},x_{d}+w_{d}),= roman_min ( italic_x start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT + italic_w start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT + italic_w start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) ,
yi2subscript𝑦𝑖2\displaystyle y_{i2}italic_y start_POSTSUBSCRIPT italic_i 2 end_POSTSUBSCRIPT =min(yg+hg,yd+hd).absentsubscript𝑦𝑔subscript𝑔subscript𝑦𝑑subscript𝑑\displaystyle=\min(y_{g}+h_{g},y_{d}+h_{d}).= roman_min ( italic_y start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT + italic_h start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT + italic_h start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) .

The width (wisubscript𝑤𝑖w_{i}italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT) and height (hisubscript𝑖h_{i}italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT) of the intersection are given by:

wisubscript𝑤𝑖\displaystyle w_{i}italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT =max(0,xi2xi1),absent0subscript𝑥𝑖2subscript𝑥𝑖1\displaystyle=\max(0,x_{i2}-x_{i1}),= roman_max ( 0 , italic_x start_POSTSUBSCRIPT italic_i 2 end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT italic_i 1 end_POSTSUBSCRIPT ) ,
hisubscript𝑖\displaystyle h_{i}italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT =max(0,yi2yi1).absent0subscript𝑦𝑖2subscript𝑦𝑖1\displaystyle=\max(0,y_{i2}-y_{i1}).= roman_max ( 0 , italic_y start_POSTSUBSCRIPT italic_i 2 end_POSTSUBSCRIPT - italic_y start_POSTSUBSCRIPT italic_i 1 end_POSTSUBSCRIPT ) .

The area of the intersection (AIsubscript𝐴𝐼A_{I}italic_A start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT) is wihisubscript𝑤𝑖subscript𝑖w_{i}\cdot h_{i}italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⋅ italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT.

Union Calculation

The areas of each bounding box are:

Agsubscript𝐴𝑔\displaystyle A_{g}italic_A start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT =wghg,absentsubscript𝑤𝑔subscript𝑔\displaystyle=w_{g}\cdot h_{g},= italic_w start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT ⋅ italic_h start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT ,
Adsubscript𝐴𝑑\displaystyle A_{d}italic_A start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT =wdhd.absentsubscript𝑤𝑑subscript𝑑\displaystyle=w_{d}\cdot h_{d}.= italic_w start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ⋅ italic_h start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT .

The area of the union is:

AU=Ag+AdAI.subscript𝐴𝑈subscript𝐴𝑔subscript𝐴𝑑subscript𝐴𝐼A_{U}=A_{g}+A_{d}-A_{I}.italic_A start_POSTSUBSCRIPT italic_U end_POSTSUBSCRIPT = italic_A start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT + italic_A start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT - italic_A start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT .

IoU Calculation and Constraint

The IoU is defined as:

IoU=AIAU.𝐼𝑜𝑈subscript𝐴𝐼subscript𝐴𝑈IoU=\frac{A_{I}}{A_{U}}.italic_I italic_o italic_U = divide start_ARG italic_A start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT end_ARG start_ARG italic_A start_POSTSUBSCRIPT italic_U end_POSTSUBSCRIPT end_ARG .

To enforce the constraint IoU>τ𝐼𝑜𝑈𝜏IoU>\tauitalic_I italic_o italic_U > italic_τ, we express this as a linear constraint,which can be encoded using ReLU activation:

AI>τAUAIτAU>0.subscript𝐴𝐼𝜏subscript𝐴𝑈subscript𝐴𝐼𝜏subscript𝐴𝑈0A_{I}>\tau\cdot A_{U}\Leftrightarrow A_{I}-\tau\cdot A_{U}>0.italic_A start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT > italic_τ ⋅ italic_A start_POSTSUBSCRIPT italic_U end_POSTSUBSCRIPT ⇔ italic_A start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT - italic_τ ⋅ italic_A start_POSTSUBSCRIPT italic_U end_POSTSUBSCRIPT > 0 .

Fig. 2 illustrates how the IoU between a single ground truth and a single detection can be encoded as a piecewise layer.

Refer to caption
Figure 2: Encoding IoU operation.

Verify the Number of Objects and their Localization

The matching between all ground truth and detection objects requires handling multiple bounding boxes, which necessitates using this encoding and adding another layer. When working with lists of detection boxes dts={Bdt1,Bdt2,,Bdtn}𝑑𝑡𝑠subscript𝐵𝑑𝑡1subscript𝐵𝑑𝑡2subscript𝐵𝑑𝑡𝑛dts=\{B_{dt1},B_{dt2},\ldots,B_{dtn}\}italic_d italic_t italic_s = { italic_B start_POSTSUBSCRIPT italic_d italic_t 1 end_POSTSUBSCRIPT , italic_B start_POSTSUBSCRIPT italic_d italic_t 2 end_POSTSUBSCRIPT , … , italic_B start_POSTSUBSCRIPT italic_d italic_t italic_n end_POSTSUBSCRIPT } and ground truth boxes gts={Bgt1,Bgt2,,Bgtm}𝑔𝑡𝑠subscript𝐵𝑔𝑡1subscript𝐵𝑔𝑡2subscript𝐵𝑔𝑡𝑚gts=\{B_{gt1},B_{gt2},\ldots,B_{gtm}\}italic_g italic_t italic_s = { italic_B start_POSTSUBSCRIPT italic_g italic_t 1 end_POSTSUBSCRIPT , italic_B start_POSTSUBSCRIPT italic_g italic_t 2 end_POSTSUBSCRIPT , … , italic_B start_POSTSUBSCRIPT italic_g italic_t italic_m end_POSTSUBSCRIPT }, we introduce binary variables zijsubscript𝑧𝑖𝑗z_{ij}italic_z start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT to indicate whether the IoU condition is met for each pair (dti,gtj)𝑑subscript𝑡𝑖𝑔subscript𝑡𝑗(dt_{i},gt_{j})( italic_d italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_g italic_t start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ):

zij=ReLU(AIτAU)subscript𝑧𝑖𝑗𝑅𝑒𝐿𝑈subscript𝐴𝐼𝜏subscript𝐴𝑈z_{ij}=ReLU(A_{I}-\tau\cdot A_{U})\ italic_z start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = italic_R italic_e italic_L italic_U ( italic_A start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT - italic_τ ⋅ italic_A start_POSTSUBSCRIPT italic_U end_POSTSUBSCRIPT )

In order to verify whether at least one or all objects are accurately detected. At least one pair meets the IoU condition of:

i=1nj=1mzij0.superscriptsubscript𝑖1𝑛superscriptsubscript𝑗1𝑚subscript𝑧𝑖𝑗0\sum_{i=1}^{n}\sum_{j=1}^{m}z_{ij}\geq 0.∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_z start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ≥ 0 .

Where we aim to verify that all objects are detected correctly, with no false positives, each detection dti𝑑subscript𝑡𝑖dt_{i}italic_d italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT must match one ground truth gtj𝑔subscript𝑡𝑗gt_{j}italic_g italic_t start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, and vice versa. Here, we define zijsubscript𝑧𝑖𝑗z_{ij}italic_z start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT as a binary indicator where zij=1subscript𝑧𝑖𝑗1z_{ij}=1italic_z start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = 1 if detection dti𝑑subscript𝑡𝑖dt_{i}italic_d italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT matches ground truth gtj𝑔subscript𝑡𝑗gt_{j}italic_g italic_t start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, and 0 otherwise:

zij={1if AIij<τAUij,0otherwise.subscript𝑧𝑖𝑗cases1if subscript𝐴subscript𝐼𝑖𝑗𝜏subscript𝐴subscript𝑈𝑖𝑗0otherwisez_{ij}=\begin{cases}1&\text{if }A_{I_{ij}}<\tau\cdot A_{U_{ij}},\\ 0&\text{otherwise}.\end{cases}italic_z start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = { start_ROW start_CELL 1 end_CELL start_CELL if italic_A start_POSTSUBSCRIPT italic_I start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT < italic_τ ⋅ italic_A start_POSTSUBSCRIPT italic_U start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT , end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL otherwise . end_CELL end_ROW

This binarization is approximate, as explained at the end of section 3. It is important to note that this approximation makes this particular verification incomplete. Using the binarized values of zijsubscript𝑧𝑖𝑗z_{ij}italic_z start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT, the following constraints ensure accurate matching:

  1. 1.

    Each detection matches exactly one ground truth:

    j=1mzij=1,i{1,,n}.formulae-sequencesuperscriptsubscript𝑗1𝑚subscript𝑧𝑖𝑗1for-all𝑖1𝑛\sum_{j=1}^{m}z_{ij}=1,\quad\forall i\in\{1,\ldots,n\}.∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_z start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = 1 , ∀ italic_i ∈ { 1 , … , italic_n } .
  2. 2.

    Each ground truth is matched by exactly one detection:

    i=1nzij=1,j{1,,m}.formulae-sequencesuperscriptsubscript𝑖1𝑛subscript𝑧𝑖𝑗1for-all𝑗1𝑚\sum_{i=1}^{n}z_{ij}=1,\quad\forall j\in\{1,\ldots,m\}.∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_z start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = 1 , ∀ italic_j ∈ { 1 , … , italic_m } .

These constraints ensure that detections accurately match the ground truths without false positives or false negatives.

Fig. 3 illustrates how the final constraints Coutsubscript𝐶𝑜𝑢𝑡C_{out}italic_C start_POSTSUBSCRIPT italic_o italic_u italic_t end_POSTSUBSCRIPT are encoded to verify both cases: missing all objects and not detecting all of them.

Refer to caption
Figure 3: Encoding Localization attack.

Approximation of the Heaviside Function Using the Sigmoid Function

The Heaviside step function, traditionally denoted as H(x)𝐻𝑥H(x)italic_H ( italic_x ), is defined as zero for x<0𝑥0x<0italic_x < 0 and one for x0𝑥0x\geq 0italic_x ≥ 0. In computational settings, where a smooth and differentiable approximation is required, the sigmoid function can be utilized as an effective surrogate. The sigmoid function S(x)𝑆𝑥S(x)italic_S ( italic_x ), defined by the formula:

S(x)=11+eslopex𝑆𝑥11superscript𝑒slope𝑥S(x)=\frac{1}{1+e^{-\text{slope}\cdot x}}italic_S ( italic_x ) = divide start_ARG 1 end_ARG start_ARG 1 + italic_e start_POSTSUPERSCRIPT - slope ⋅ italic_x end_POSTSUPERSCRIPT end_ARG

provides a smooth transition between 0 and 1. The parameter ‘slope’ adjusts the steepness of the sigmoid function’s curve, making it versatile for different levels of approximation granularity. This property makes the sigmoid function particularly useful in scenarios where a gradual change is more physically realistic or numerically stable than an abrupt switch.

Figure 4 illustrates the approximation of the Heaviside function by the sigmoid function over a range of slope values, slope{1,10,100,1000}slope1101001000\text{slope}\in\{1,10,100,1000\}slope ∈ { 1 , 10 , 100 , 1000 }. As depicted, for high slope values, S(x)𝑆𝑥S(x)italic_S ( italic_x ) serves as a very good approximation of H(x)𝐻𝑥H(x)italic_H ( italic_x ), closely mirroring its behavior due to the sharper transition at x=0𝑥0x=0italic_x = 0.

Refer to caption
Figure 4: Comparison of the Heaviside function and its Sigmoid approximation.

In our context, we used this approximation for binarizing variables, allowing us to encode them as a single layer. However, verification properties that rely on this approximation are inherently incomplete.

4 Results

We demonstrate the initial steps in implementing object detection verification framework. As mentioned earlier, we have established two methods for this purpose and we will present the results of each method respectively. The technical details are as follows:

In the first method, experimental input images were sourced from the COCO dataset[65], reduced to a low resolution of 64×48644864\times 4864 × 48 pixels due to memory limitations. We used the neural network-based detector YOLO[22], specifically YOLOv5, the fifth generation of this widely popular and efficient one-stage detector. The medium model, containing 46 million parameters, was chosen and trained on 80 classes. To demonstrate the second method, we used the LARD[66] and MNIST-OD[67] datasets, which contain images of 256×256256256256\times 256256 × 256 and 90×90909090\times 9090 × 90 pixels, respectively. For these datasets, we migrated a designed and trained network from [48]. Experiments were conducted on a Linux-based system with an RTX 5000 GPU to ensure efficient processing. For verification, the alpha-beta-CROWN tool[40] was utilized.

We start by showing the results of verification using the first method, applying the COCO dataset on the YOLO detector. As a first example, we verify an image with a single object, as illustrated in Fig. 5. With an allowed per-pixel noise level of 0.01, the verifier identifies a case where the detector correctly identifies the object, but in the wrong location, such that the IoU between the actual and predicted bounding boxes is less than τ𝜏\tauitalic_τ (0.5 in our case). For images with multiple objects, it is necessary to aggregate all IoUs into a single metric or enhance the verifier’s capabilities. The current verifier imposes limitations, particularly when trying to attack all objects simultaneously, necessitating careful design of the unified constraints. For instance, this can be accomplished by utilizing the query that the max(IoU)(IoU)\max\text{(IoU)}roman_max (IoU) might be less than the threshold.

Refer to caption
Figure 5: Mislocalization example. Mislocalization attacks on three different images resulting in shifted, larger, and/or smaller detected bounding boxes.

Figure 7 provides a visual representation of the misclassification results. Experiments show that adversarial attacks can lead to a significant increase in misclassified objects. In some cases, objects that were correctly identified before the attack are now misclassified. Figure 7 provides insights into the presence of extra or missing detections caused by adversarial attacks. In some instances, the detector produces additional bounding boxes that do not correspond to real objects, leading to cluttered results. Conversely, the attacks can also result in the omission of objects that should have been detected.

Refer to caption
Figure 6: Misclassification example. Green tag indicates the ground truth and original detection class, red tag indicates the misclassified detection in the adversarial example.
Refer to caption
Figure 7: Extra/Missing detections examples. Green tag indicates the ground truth object and class, red tag indicates the extra detections and their class in the adversarial example.

Though we found promising initial results with this method, it suffers from two main drawbacks. Firstly, since we didn’t modify the network itself but only adapted the verifier, the verification metrics were forced to operate in higher dimensions, resulting in significant memory and efficiency issues. Secondly, all our attempts at hyperparameter selection either resulted in counterexamples or timeouts, meaning we only succeeded in identifying vulnerabilities (attack models) but could not verify the network and obtain guarantees. These limitations were addressed in the second verification method, which proved to be much more efficient. By explicitly encoding each operation, we gained greater control over the verification process. This approach not only resolved the memory and efficiency issues, but also allowed us to easily find UNSAT cases, thereby providing formal guarantees of network robustness.

Using different ranges of noise, we obtained varying results for the MNIST-OD and LARD datasets. For MNIST-OD, the detector should find the location of digit and classify it. For the LARD dataset, the detector is supposed to identify the runway for landing. As shown in Figures 8 and 9, detectors perform poorly under high levels of noise (right), failing completely. However, with a small amount of noise (middle), the detector performs adequately. The most important result is that the verifier returned UNSAT for the lower noise level, which guarantees that the network is robust for this level of noise and the specific input image. It’s important to note that while we only show the first images from the datasets, the results were consistent for all images in the datasets: SAT examples were found for high noise levels, and they were verified for lower noise levels.

Refer to caption
Figure 8: Mislocalization example on MNIST-OD. Ground truth digit (green bounding boxes), with mislocalized detection (red bounding boxes) under adversarial attacks of 1e41𝑒41e-41 italic_e - 4 (middle) and 1e21𝑒21e-21 italic_e - 2 (right).
Refer to caption
Figure 9: Mislocalization example on LARD. Ground truth runway (green bounding boxes) is pointed out by a yellow arrow (left), with mislocalized runways (red bounding boxes) under adversarial attacks of 1e41𝑒41e-41 italic_e - 4 (middle) and 1e21𝑒21e-21 italic_e - 2 (right).

5 Conclusion

This work tackles the challenge of utilizing formal verification to ensure the safety of computer vision models, specifically focusing on object detection as an extension of image classification. We have first proposed a comprehensive approach to treat these tasks as classic formal verification problems and outlined methods to implement them using state-of-the-art verification tools. Our developed methods for refuting or proving object detection robustness have successfully proven the robustness of the YOLOv5 model, or refuted it by finding counterexamples, across three datasets, including one used for autonomous landing.

Additionally, our contribution extends the formulation to new areas within the formal verification of DNNs, demonstrated through the implementation and examples presented. While further research is necessary, this work lays a strong foundation for integrating formal verification into computer vision, promising a safer future for these technologies.

References

  • [1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012.
  • [2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, page 1097–1105, Red Hook, NY, USA, 2012. Curran Associates Inc.
  • [3] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
  • [4] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023.
  • [5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.
  • [6] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
  • [7] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. Conformer: Convolution-augmented transformer for speech recognition, 2020.
  • [8] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations, 2020.
  • [9] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer, 2021.
  • [10] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer, 2021.
  • [11] Guy Katz, Clark Barrett, David L. Dill, and Kyle Julian. Reluplex: An efficient smt solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pages 97–117, 2017.
  • [12] Changliu Liu, Tomer Arnon, Christopher Lazarus, Christopher Strong, Clark Barrett, and Mykel J. Kochenderfer. Algorithms for verifying deep neural networks, 2020.
  • [13] Marco Casadio, Tanvi Dinkar, Ekaterina Komendantskaya, Luca Arnaboldi, Omri Isac, Matthew L. Daggitt, Guy Katz, Verena Rieser, and Oliver Lemon. Nlp verification: Towards a general methodology for certifying robustness, 2024.
  • [14] Guy Amir, Michael Schapira, and Guy Katz. Towards scalable verification of deep reinforcement learning, 2021.
  • [15] Shahaf Bassan, Guy Amir, Davide Corsi, Idan Refaeli, and Guy Katz. Formally explaining neural networks within reactive systems, 2023.
  • [16] Guy Amir, Davide Corsi, Raz Yerushalmi, Luca Marzari, David Harel, Alessandro Farinelli, and Guy Katz. Verifying learning-based robotic navigation systems, 2023.
  • [17] Davide Corsi, Guy Amir, Andoni Rodriguez, Cesar Sanchez, Guy Katz, and Roy Fox. Verification-guided shielding for deep reinforcement learning, 2024.
  • [18] Ross B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
  • [19] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask r-cnn. In ICCV, pages 2980–2988. IEEE Computer Society, 2017.
  • [20] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, NIPS, pages 91–99, 2015.
  • [21] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: single shot multibox detector. CoRR, abs/1512.02325, 2015.
  • [22] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.
  • [23] Joseph Redmon and Ali Farhadi. Yolo9000: Better, faster, stronger, 2016.
  • [24] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement, 2018.
  • [25] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection, 2020.
  • [26] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021, 2021.
  • [27] Chuyi Li, Lulu Li, Hongliang Jiang, Kaiheng Weng, Yifei Geng, Liang Li, Zaidan Ke, Qingyuan Li, Meng Cheng, Weiqiang Nie, Yiduo Li, Bo Zhang, Yufei Liang, Linyuan Zhou, Xiaoming Xu, Xiangxiang Chu, Xiaoming Wei, and Xiaolin Wei. Yolov6: A single-stage object detection framework for industrial applications, 2022.
  • [28] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, 2022.
  • [29] Xianzhe Xu, Yiqi Jiang, Weihua Chen, Yilun Huang, Yuan Zhang, and Xiuyu Sun. Damo-yolo : A report on real-time object detection design, 2023.
  • [30] Xiang Long, Kaipeng Deng, Guanzhong Wang, Yang Zhang, Qingqing Dang, Yuan Gao, Hui Shen, Jianguo Ren, Shumin Han, Errui Ding, and Shilei Wen. Pp-yolo: An effective and efficient implementation of object detector, 2020.
  • [31] Xin Huang, Xinxin Wang, Wenyu Lv, Xiaying Bai, Xiang Long, Kaipeng Deng, Qingqing Dang, Shumin Han, Qiwen Liu, Xiaoguang Hu, Dianhai Yu, Yanjun Ma, and Osamu Yoshie. Pp-yolov2: A practical object detector, 2021.
  • [32] Shangliang Xu, Xinxin Wang, Wenyu Lv, Qinyao Chang, Cheng Cui, Kaipeng Deng, Guanzhong Wang, Qingqing Dang, Shengyu Wei, Yuning Du, and Baohua Lai. Pp-yoloe: An evolved version of yolo, 2022.
  • [33] Chien-Yao Wang, I-Hau Yeh, and Hong-Yuan Mark Liao. You only learn one representation: Unified network for multiple tasks, 2021.
  • [34] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. Scaled-yolov4: Scaling cross stage partial network, 2021.
  • [35] Guy Katz, Derek A. Huang, Duligur Ibeling, Kyle Julian, Christopher Lazarus, Rachel Lim, Parth Shah, Shantanu Thakoor, Haoze Wu, Aleksandar Zeljić, David L. Dill, Mykel J. Kochenderfer, and Clark Barrett. The marabou framework for verification and analysis of deep neural networks. In Isil Dillig and Serdar Tasiran, editors, Computer Aided Verification, pages 443–452, Cham, 2019. Springer International Publishing.
  • [36] Haoze Wu, Omri Isac, Aleksandar Zeljić, Teruhiro Tagomori, Matthew Daggitt, Wen Kokke, Idan Refaeli, Guy Amir, Kyle Julian, Shahaf Bassan, Pei Huang, Ori Lahav, Min Wu, Min Zhang, Ekaterina Komendantskaya, Guy Katz, and Clark Barrett. Marabou 2.0: A versatile formal analyzer of neural networks, 2024.
  • [37] P. Henriksen and Alessio Lomuscio. Efficient neural network verification via adaptive refinement and adversarial search. In European Conference on Artificial Intelligence, 2020.
  • [38] Patrick Henriksen and Alessio Lomuscio. Deepsplit: An efficient splitting method for neural network verification via indirect effect analysis. In Zhi-Hua Zhou, editor, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 2549–2555. International Joint Conferences on Artificial Intelligence Organization, 8 2021. Main Track.
  • [39] Huan Zhang, Tsui-Wei Weng, Pin-Yu Chen, Cho-Jui Hsieh, and Luca Daniel. Efficient neural network robustness certification with general activation functions, 2018.
  • [40] Shiqi Wang, Huan Zhang, Kaidi Xu, Xue Lin, Suman Jana, Cho-Jui Hsieh, and J Zico Kolter. Beta-CROWN: Efficient bound propagation with per-neuron split constraints for complete and incomplete neural network verification. Advances in Neural Information Processing Systems, 34, 2021.
  • [41] Claudio Ferrari, Mark Niklas Muller, Nikola Jovanovic, and Martin Vechev. Complete verification via multi-neuron relaxation guided branch-and-bound, 2022.
  • [42] Mark Niklas Müller, Gleb Makarchuk, Gagandeep Singh, Markus Püschel, and Martin Vechev. Prima: general and precise neural network certification via scalable convex hull approximations. Proceedings of the ACM on Programming Languages, 6(POPL):1–33, January 2022.
  • [43] A. H. Land and A. G. Doig. An automatic method of solving discrete programming problems. Econometrica, 28(3):497–520, 1960.
  • [44] David R. Morrison, Sheldon H. Jacobson, Jason J. Sauppe, and Edward C. Sewell. Branch-and-bound algorithms: A survey of recent advances in searching, branching, and pruning. Discrete Optimization, 19:79–102, 2016.
  • [45] Rudy Bunel, **gyue Lu, Ilker Turkaslan, Philip H. S. Torr, Pushmeet Kohli, and M. Pawan Kumar. Branch and bound for piecewise linear neural network verification, 2020.
  • [46] Mark Niklas Müller, Christopher Brix, Stanley Bak, Changliu Liu, and Taylor T. Johnson. The third international verification of neural networks competition (vnn-comp 2022): Summary and results, 2023.
  • [47] Christopher Brix, Stanley Bak, Changliu Liu, and Taylor T. Johnson. The fourth international verification of neural networks competition (vnn-comp 2023): Summary and results, 2023.
  • [48] Noémie Cohen, Mélanie Ducoffe, Ryma Boumazouza, Christophe Gabreau, Claire Pagetti, Xavier Pucel, and Audrey Galametz. Verification for object detection – ibp iou, 2024.
  • [49] Souradeep Dutta, Susmit Jha, Sriram Sanakaranarayanan, and Ashish Tiwari. Output range analysis for deep neural networks, 2017.
  • [50] Guy Amir, Davide Corsi, Raz Yerushalmi, Luca Marzari, David Harel, Alessandro Farinelli, and Guy Katz. Verifying learning-based robotic navigation systems, 2023.
  • [51] Raz Yerushalmi, Guy Amir, Achiya Elyasaf, David Harel, Guy Katz, and Assaf Marron. Scenario-assisted deep reinforcement learning, 2022.
  • [52] Avraham Raviv, Eliya Bronshtein, Or Reginiano, Michelle Aluf-Medina, and Hillel Kugler. Learning through imitation by using formal verification. In International Conference on Current Trends in Theory and Practice of Computer Science, pages 342–355. Springer, 2023.
  • [53] Avraham Raviv, Yuval Gerber, Liri Benzinou, Michelle Aluf-Medina, and Hillel Kugler. Prediction and control of stochastic agents using formal methods. In Nina Narodytska, Guy Amir, Guy Katz, and Omri Isac, editors, Proceedings of the 6th Workshop on Formal Methods for ML-Enabled Autonomous Systems, volume 16 of Kalpa Publications in Computing, pages 29–34. EasyChair, 2023.
  • [54] Shahaf Bassan and Guy Katz. Towards formal xai: Formally approximate minimal explanations of neural networks, 2023.
  • [55] Yizhak Elboher, Raya Elsaleh, Omri Isac, Mélanie Ducoffe, Audrey Galametz, Guillaume Povéda, Ryma Boumazouza, Noémie Cohen, and Guy Katz. Robustness assessment of a runway object classifier for safe aircraft taxiing, 2024.
  • [56] Guy Amir, Haoze Wu, Clark Barrett, and Guy Katz. An smt-based approach for verifying binarized neural networks, 2021.
  • [57] Guy Amir, Osher Maayan, Tom Zelazny, Guy Katz, and Michael Schapira. Verifying generalization in deep learning, 2023.
  • [58] Guy Amir, Osher Maayan, Tom Zelazny, Guy Katz, and Michael Schapira. Verifying the generalization of deep learning to out-of-distribution domains, 2024.
  • [59] Omri Isac, Clark Barrett, Min Zhang, and Guy Katz. Neural network verification with proof production, 2022.
  • [60] Raya Elsaleh and Guy Katz. Delbugv: Delta-debugging neural network verifiers, 2023.
  • [61] Yerushalmi Raz, Amir Guy, Elyasaf Achiya, Harel David, Katz Guy, and Marron Assaf. Enhancing deep reinforcement learning with scenario-based modeling, 2023.
  • [62] Yizhak Yisrael Elboher, Justin Gottschlich, and Guy Katz. An Abstraction-Based Framework for Neural Network Verification, page 43–65. Springer International Publishing, 2020.
  • [63] Elazar Cohen, Yizhak Yisrael Elboher, Clark Barrett, and Guy Katz. Tighter abstract queries in neural network verification, 2023.
  • [64] Yizhak Yisrael Elboher, Elazar Cohen, and Guy Katz. Neural network verification using residual reasoning, 2022.
  • [65] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015.
  • [66] Mélanie Ducoffe, Maxime Carrere, Léo Féliers, Adrien Gauffriau, Vincent Mussot, Claire Pagetti, and Thierry Sammour. Lard–landing approach runway detection–dataset for vision based landing. arXiv preprint arXiv:2304.09938, 2023.
  • [67] Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.