
(1) National Tsing Hua University, Taiwan; (2) IBM Research, U.S.A.

Steal Now and Attack Later: Evaluating Robustness of Object Detection against Black-box Adversarial Attacks

Erh-Chung Chen (1), Pin-Yu Chen (2), I-Hsin Chung (2), Che-Rung Lee (1)
Abstract

Latency attacks against object detection are a variant of adversarial attacks that aim to inflate inference time by generating additional ghost objects in a target image. However, generating ghost objects in the black-box scenario remains a challenge because information about these unqualified objects is opaque to the attacker. In this study, we demonstrate the feasibility of generating ghost objects in adversarial examples by extending the concept of "steal now, decrypt later" attacks. These adversarial examples, once produced, can be employed to exploit potential vulnerabilities in the AI service, giving rise to significant security concerns. The experimental results demonstrate that the proposed attack succeeds across various commonly used models and the Google Vision API without any prior knowledge about the target model. Additionally, the average cost of each attack is less than one dollar, posing a significant threat to AI security.

Keywords:
AI safety Adversarial attack Object detection

1 Introduction

Cybersecurity aims to protect computer systems and sensitive data from unexpected threats and attacks, such as CPU failures [8] or backdoor attacks [22]. With artificial intelligence (AI) rapidly revolutionizing various fields [5, 10], cybersecurity confronts new challenges, underscoring the importance of robust AI model design [20]. Adversarial attacks highlight AI model vulnerabilities, where subtle perturbations in inputs can alter predictions. Furthermore, attacks are achievable solely through predicted labels [14], which poses a significant threat to real-world applications.

The task we focus on is evaluating the robustness of object detection, a technology that has found widespread use in various everyday applications. Object detection aims to label the positions and classes of all objects appearing in images. Commonly used deep learning-based models include Faster R-CNN [37] and the YOLO series [19]. Although these models achieve remarkable performance, they are not immune to adversarial attacks that can produce ghost objects [11] or hide specific objects [43]. Furthermore, a prior study showed that the target model can be deceived by a physical patch [45].

To make the attack more practical, black-box attacks have been proposed, in which the attackers possess no prior knowledge about the victim model [13, 1]. Notably, traditional black-box attacks generate adversarial examples by leveraging differences in output information. During the inference phase, however, lower-probability objects are removed in an internal step, and the information about these objects remains opaque to end-users, which prevents attackers from exploiting the target model with consistent outputs. This property makes black-box attacks on object detection challenging, particularly for crafting ghost objects.

In this paper, we present a novel attack strategy, called the "steal now, attack later" approach. The core objective of this distinctive attack is to deceive the victim model into generating ghost objects under the black-box scenario such that the system cannot respond to the requests immediately or computing resources are exhausted. To accomplish the attack, we argue that the initial image should contain external objects from pre-collected data and project the perturbations onto the given constraints. This innovation opens up a new avenue for potential attacks, as adversarial examples crafted in this approach can be used to exploit vulnerabilities present in AI services.

Our main contributions in this paper are as follows:

  • This is the first paper to study the feasibility of latency attacks against object detection under the black-box scenario.

  • This paper demonstrates that adversarial examples can be crafted by information retrieved from public datasets.

  • Experimental results show that the proposed attack succeeds across various models, including Faster-RCNN [37], RetinaNet [29], FCOS [42], DETR [9], and YOLO [19], as well as public vision APIs provided by Google Cloud Platform (GCP).

  • A comprehensive analysis shows that deploying a private model locally is the most economical solution, and that the costs of the proposed attack are affordable. This encourages attackers to invest in improving attack algorithms to exploit vulnerabilities in AI systems.

The rest of this paper is organized as follows. The background on object detection and adversarial attack is introduced in Section 2. Section 3 describes the details of the proposed algorithm. The experimental results, ablation studies, and discussion are shown in Section 4. Our conclusion is in the last section.

2 Background

2.1 Object Detection

Figure 1: The execution flow of object detection.
Figure 2: Attack Flow Overview.

Object detection aims to label the positions and classes of all objects within images. Despite the various models proposed in recent years, they generally follow a similar execution flow, as illustrated in Figure 1. Most object detection models utilize a convolutional neural network (CNN) as the backbone, renowned for its high performance in classification tasks [40, 44]. The inputs are fed into the backbone model to extract features. With these features, the head model outputs locations, class probabilities, and a confidence score for all candidate objects. Based on the design of the head component, object detection models can be categorized into three types: one-stage detectors [19], two-stage detectors [37], and transformer-based models [9, 4]. Subsequently, the candidate objects are sorted, and duplicated or low-confidence objects are filtered out.
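The final filtering stage described above can be illustrated with a minimal NumPy sketch. It is not the post-processing of any particular detector: the function names, the confidence threshold of 0.25, and the IoU threshold of 0.5 are assumptions chosen for illustration. The sketch drops low-confidence candidates and then greedily suppresses duplicates, which is the step whose workload grows with the number of surviving candidates.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all given as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def postprocess(boxes, scores, conf_thresh=0.25, iou_thresh=0.5):
    """Drop low-confidence candidates, then greedily suppress duplicates.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    keep_mask = scores >= conf_thresh
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = np.argsort(-scores)  # highest confidence first
    kept = []
    while order.size > 0:
        best = order[0]
        kept.append(best)
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps < iou_thresh]  # discard near-duplicates of `best`
    return boxes[kept], scores[kept]

# Toy candidates: two overlapping boxes, one separate box, one low-confidence box.
boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10],
                  [20, 20, 30, 30], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7, 0.1])
final_boxes, final_scores = postprocess(boxes, scores)
```

In this toy input, the second box is suppressed as a duplicate of the first and the fourth is removed for low confidence, so the end-user only ever sees two objects. The information about the two discarded candidates is exactly what remains hidden in the black-box setting.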

2.2 Adversarial Attack

The primary objective of adversarial attacks is to deceive the target models by adding malicious perturbations to the input [34, 12]. These perturbations, however, have no impact on human judgment [21, 32, 17]. Adversarial attacks for object detection involve different objectives. For example, traffic sign attacks aim to mislead automated driving systems into incorrectly recognizing traffic signs [36]. Hidden attacks, on the other hand, produce physical patches that make a specific object invisible, causing automated surveillance systems to struggle with reliable detection [41].

Adversarial attacks under the black-box scenario assume that nothing about the target model is known except the output predictions [35, 27, 28]. Common attacks involve estimating gradients through differences in output information [13]. Score-based attacks perform better when the gradient is unreliable [1]. Additionally, the existence of universal adversarial perturbations among multiple models has been demonstrated [25].

Similar to hard-label attacks [14], object detectors internally filter out unqualified objects, rendering information about these objects opaque to end-users. This property prevents attackers from exploiting the target model with steady outputs and makes it challenging for adversarial attacks against object detection, particularly for appearing attacks [6, 7] aimed at generating as many objects as possible in the target images.

Exploring latency attacks on AI models is an evolving field. Several studies have shown that many objects appearing simultaneously in an image might inflate execution time, posing a security concern, especially for real-time object detectors [38, 11]. A recent study [33] delved into how latency attacks can distort the camera-based autonomous driving perception pipeline. Simulation results from this study reveal alarmingly high crash rates under such attacks, underscoring the practical threats of deploying AI models in self-driving cars. However, producing latency attacks against object detection under the black-box scenario is a relatively understudied area.

3 Methods and Algorithms

3.1 Threat Model and Motivations

In this paper, our focus is on evaluating the feasibility of the latency attack against object detection in a real-world scenario, including the total cost and time consumption of each attack as well as the success rate. The primary objective of latency attacks against object detection is to increase the processing time by introducing extra objects. We assume that the execution time for the inference system is directly proportional to the number of objects. To simplify the discussion in this paper, we focus on increasing the total number of objects.

We assume that the latency attack is under black-box configurations since the victim models are generally deployed on the cloud, and end-users do not have any prior knowledge about the target models and the training data. Another constraint is that modifications to the target images should stay within a given epsilon ball so that the attack goes unnoticed. However, attackers can deploy private models locally and conduct operations as they see fit. Moreover, the utilization of publicly available datasets is deemed permissible in the pursuit of this objective.

It is important to highlight that this work primarily contributes to empirically assessing the cost of crafting adversarial examples and analyzing methods for generating ghost examples when information about the adversarial examples remains opaque. The elapsed time and potential damage caused by adversarial examples can vary depending on downstream tasks and target devices, but these aspects are beyond the scope of our study.

3.2 Attack Flow Overview

Our proposed method adopts the spirit of the "steal now, decrypt later" attack strategy in crafting the adversarial examples. The attack flow is outlined in Figure 2. Attackers gather relevant information through legitimate approaches and then store candidate objects in the database in advance. With the collected data, the attacker attempts to attach objects to the target image, thereby increasing the total number of objects predicted by the victim model. Subsequently, perturbations are projected onto the given epsilon ball within the color space, and the best result is selected as the output adversarial image containing numerous ghost objects. The output images at each stage are visualized in Figure 3.

There are three major challenges for this approach: (1) The cost of data collection from the open world and retrieval of potential objects that may exploit the victim model. (2) The determination of which objects are attached to the target image will serve as a better initial starting point. This process involves considerations of the proper sizes and suitable positions for the attached objects. (3) The formulation of an algorithm that projects the perturbations onto the given epsilon ball.

Figure 3: Images output at each stage of the attack process. (a) is the original image; (b) shows the image with external objects; (c) displays the modified image projected onto the epsilon ball; and (d) showcases the final adversarial image. The predictions for the original and the adversarial images are presented in (e) and (f), respectively.

3.3 Data Collection from Open World

The primary objective of data collection is to gather a diverse set of objects with minimal editorial distance. We are not concerned with the accuracy of the predicted results; even false positives are deemed acceptable. Instead, our focus is on the associated costs and time investment. Therefore, any publicly available datasets, such as MS COCO [30], PASCAL VOC [16], or Open Images [23], which offer abundant, well-labeled, and diverse data, are eligible for the proposed attack.

Data collection involves repeatedly feeding multiple images from the collected data into object detectors to retrieve candidate objects and storing the associated information in a database. One direct approach involves sending requests to the service provider. However, there are limitations on the query rate, and each query may come with a cost, making this operation both expensive and time-consuming. On the other hand, attackers can gather information by deploying private models. By comparing the outputs of different models, attackers can effectively filter out divergent results and select consistent ones. Therefore, we believe that while collecting data by directly accessing the victim model can provide high-quality information, it is not strictly necessary.
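The idea of selecting consistent results across private models can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the detection format (dictionaries with `label`, `box`, and `score`), the function names, and the IoU threshold of 0.5 are all assumptions.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def consistent_detections(dets_a, dets_b, iou_thresh=0.5):
    """Keep detections from model A that model B confirms.

    A detection counts as confirmed when model B reports the same label
    with a sufficiently overlapping box (assumed criterion).
    """
    kept = []
    for det in dets_a:
        for other in dets_b:
            same_label = det["label"] == other["label"]
            if same_label and box_iou(det["box"], other["box"]) >= iou_thresh:
                kept.append(det)
                break
    return kept

# Toy outputs from two hypothetical private models on the same image.
dets_a = [{"label": "dog", "box": (10, 10, 50, 50), "score": 0.9},
          {"label": "cat", "box": (60, 60, 90, 90), "score": 0.6}]
dets_b = [{"label": "dog", "box": (12, 12, 52, 52), "score": 0.8},
          {"label": "bird", "box": (60, 60, 90, 90), "score": 0.7}]
candidates = consistent_detections(dets_a, dets_b)
```

Only the "dog" detection survives here, since the two models disagree on the label of the second object. Candidates that pass such a check are the ones worth storing in the database.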

Furthermore, we advocate the inclusion of augmented data in repositories due to the many extra benefits, despite the corresponding increased costs. Spatial crops can capture more objects at different scales or key components of objects, minimizing the influence of unimportant pixels when crafting adversarial examples. Color transformations, such as color jitter, random equalization, or random posterization, should also be considered. These operations help assess the impact of color distortion on each object. An object that becomes unrecognizable to the victim model after applying color transformations is not deemed a qualified candidate and can be filtered out in advance. These measures serve to decrease the failure rate and total number of queries when generating adversarial examples.
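A minimal sketch of the pre-filtering step, assuming 8-bit images stored as NumPy arrays: a candidate crop is kept only if a detector still reports something after every color transformation. The two transformations, their parameters, and the toy detector are stand-ins chosen for illustration; a real pipeline would call the private model instead.

```python
import numpy as np

def posterize(img, bits=2):
    """Keep only the top `bits` bits of each 8-bit channel value."""
    shift = 8 - bits
    return (img >> shift) << shift

def brighten(img, factor=1.2):
    """Scale intensities and clip back into the valid 8-bit range."""
    return np.clip(img.astype(float) * factor, 0, 255).astype(np.uint8)

def survives(crop, detect, transforms):
    """True if the detector still finds an object after every transformation."""
    return all(len(detect(t(crop))) > 0 for t in transforms)

# Toy detector (assumption): reports an object when the crop is bright enough.
toy_detect = lambda img: ["object"] if img.mean() > 50 else []

bright_crop = np.full((8, 8, 3), 200, dtype=np.uint8)
dark_crop = np.full((8, 8, 3), 60, dtype=np.uint8)
transforms = [posterize, brighten]
bright_ok = survives(bright_crop, toy_detect, transforms)
dark_ok = survives(dark_crop, toy_detect, transforms)
```

The dark crop is rejected because posterization maps all of its values to zero, so it would be filtered out before any query is spent on it.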

We argue that the collected information is not specific to just one image; it can be reused for all images as long as the victim model remains unchanged. We view data collection as a one-time cost. Additionally, we can determine if the model has been updated by comparing the output of queries for identical data at different times. Moreover, the superior performance of victim models may be leveraged to economize the cost of data collection. Specifically, the predictions from those models should have high similarity to ground-truth labels in well-known datasets. This suggests that attackers can gauge whether these public datasets furnish a comprehensive representation of the victim model’s outputs through a relatively limited number of trials.

3.4 Position-centric Candidate Selection

Algorithm 1 Position-centric Object Selection
  require: Target image $x$
  require: Grid dimension $s_g$
  require: Threshold $t_p$
  require: Trials $k$
  $n_w, n_h = x.\text{width}/s_g,\ x.\text{height}/s_g$
  for $iter \leftarrow 1$ to $k$ do
     $\mathbf{obj} = \text{model.forward}(x)$
     for $(i,j) \leftarrow (1,1)$ to $(n_w, n_h)$ do
        $b \leftarrow \text{BCount}(\mathbf{obj}, i, j)$
        if $b \leq t_p$ then
           $x_{i,j} \leftarrow \text{PatchGen}(s_g)$
        end if
     end for
  end for
  return $x$

After gathering the necessary information, the next step involves carefully determining which object, placed in which location, would serve as a suitable candidate for creating an adversarial image. An object placed in an inappropriate position may either attract undue attention from humans or evade detection by the target model. Estimating precise locations for each candidate object seems a potential solution. However, the object-centric approach evaluates selected objects sequentially, leaving the priority for attachment ambiguous. Consequently, different permutations of candidate objects yield varied results, with the costs of each attack directly proportional to the number of objects involved. Moreover, the inherent uncertainty necessitates the repetition of the procedure multiple times, rendering the implementation of a black-box attack impractical.

To address this, we propose an alternative position-centric algorithm inspired by the overload attack [11]. The target image is divided into a two-dimensional grid, and each grid is considered an independent component when generating adversarial examples. By evaluating the status of each grid, we determine which grid is suitable for perturbation. While some candidate objects in specific grids may not be recognized in every query, this method maximizes the total number of output objects, rendering it more cost-effective. Compared to the object-centric approach, the position-centric algorithm provides output information for all grids in a single query.

Algorithm 1 outlines the procedure of the position-centric algorithm, where $x$ is the target image, $s_g$ is the grid dimension, $k$ is the total number of trials, and $t_p$ is a threshold. During each trial, the first step is to obtain the victim model's prediction for the given image. We then traverse each grid and count the objects present in it using the function BCount. If the number of objects is fewer than the threshold $t_p$, we assign the grid a new patch using PatchGen. This function randomly selects multiple objects from the database and attaches them to the specific grid, possibly applying color transformations or other operations to enrich the diversity of patches. However, if all grids undergo PatchGen, the information about the target image is entirely eliminated. To prevent this, we maintain a certain probability of reverting to the original image.
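The loop of Algorithm 1 can be sketched in Python as follows. Several details are assumptions made for illustration only: objects are assigned to a grid by the centre of their bounding box, indices are zero-based, `detect` and `patch_gen` are hypothetical callables standing in for the victim model and for PatchGen, and the probabilistic revert to the original image is omitted. The comparison uses the condition as written in Algorithm 1.

```python
import numpy as np

def bcount(boxes, i, j, s_g):
    """Count detections whose box centre lies in grid (i, j) of size s_g x s_g."""
    if len(boxes) == 0:
        return 0
    boxes = np.asarray(boxes, dtype=float)
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    in_x = (cx >= i * s_g) & (cx < (i + 1) * s_g)
    in_y = (cy >= j * s_g) & (cy < (j + 1) * s_g)
    return int(np.sum(in_x & in_y))

def select_positions(image, detect, patch_gen, s_g, t_p, k):
    """One possible reading of Algorithm 1.

    `detect(x)` returns a list of (x1, y1, x2, y2) boxes (one query per trial);
    `patch_gen(s_g)` returns an array shaped like one grid of the image.
    """
    x = image.copy()
    n_w, n_h = x.shape[1] // s_g, x.shape[0] // s_g
    for _ in range(k):
        boxes = detect(x)  # a single query covers all grids
        for i in range(n_w):
            for j in range(n_h):
                if bcount(boxes, i, j, s_g) <= t_p:
                    x[j * s_g:(j + 1) * s_g, i * s_g:(i + 1) * s_g] = patch_gen(s_g)
    return x

# Toy run: a 4x4 grayscale image, 2x2 grids, one object detected in the top-left grid.
toy_image = np.zeros((4, 4))
toy_detect = lambda img: [[0, 0, 2, 2]]
toy_patch = lambda s: np.ones((s, s))
out = select_positions(toy_image, toy_detect, toy_patch, s_g=2, t_p=0, k=1)
```

In the toy run, the top-left grid already contains one object and is left untouched, while the other three grids receive a new patch. Note that the victim model is queried once per trial, regardless of how many grids are updated.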

Additionally, we keep track of the status of the patch attached to each grid, making it easier to determine which patch is optimal in subsequent steps. The information about each patch consists of the following items: the perturbed patch, the number of objects, the distance in color space, a map storing the active pixels, and a flag marking whether it is eligible. An eligible patch implies that the perturbations are within the epsilon ball and that it contains more than one object. Otherwise, a patch without any objects can be considered invalid immediately.
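The per-patch bookkeeping listed above could be represented by a small record type. The sketch below is hypothetical: the field names are invented, the distance is assumed to be the $L_\infty$ distance to the original patch, and active pixels are assumed to be those that differ from the original.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PatchRecord:
    patch: np.ndarray        # the perturbed patch
    num_objects: int         # objects the model reports inside the grid
    distance: float          # distance to the original patch in color space
    active_mask: np.ndarray  # True where a pixel differs from the original
    eligible: bool           # within the epsilon ball and enough objects

def make_record(patch, original, num_objects, eps):
    """Build a record; the L-inf distance is an assumption of this sketch."""
    diff = patch.astype(float) - original.astype(float)
    distance = float(np.max(np.abs(diff)))
    # Follows the text literally: "more than one object".
    eligible = bool(distance <= eps and num_objects > 1)
    return PatchRecord(patch, num_objects, distance, diff != 0, eligible)

original = np.zeros((2, 2))
patch = np.array([[0.0, 4.0], [0.0, 0.0]])
rec = make_record(patch, original, num_objects=2, eps=8)
rec_tight = make_record(patch, original, num_objects=2, eps=2)
```

With a radius of 8 the toy patch is eligible, whereas with a radius of 2 the same patch is only usable as an initial guess for later projection.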

It is crucial to emphasize that the primary purpose of this step is not to craft an eligible adversarial image directly but rather to produce candidate patches in each grid. That is similar to the role of the initial attack of hard-label attacks [14], where no constraints are set on the color space. A patch with a significant distance from the original image cannot be a suitable candidate because projecting the perturbations in the patch onto the epsilon ball severely distorts the underlying texture or shape of the external objects. However, a patch with perturbations outside the given epsilon ball can still serve as an initial guess.

The mechanism that decides which objects should be attached to specific grids depends on the objective. If one prefers the result of objects being uniformly distributed over the entire image, the attacker not only needs to place objects in empty regions but also reduce the confidence of objects already recognized to increase the probability that other objects will appear. For the simplicity of this work, we select the patch with the smallest editorial distance in each grid among objects whose confidence is higher than a threshold.

3.5 Color Manipulation

Algorithm 2 Color Manipulation Algorithm
  require: Target image $x$
  require: Perturbed image $x^{adv}$
  require: Total iterations $k_a$
  require: Tolerance $d$
  require: Radius of epsilon ball $\epsilon$
  for $it \leftarrow 1$ to $k_a$ do
     $\mathbf{obj} = \text{model.forward}(x)$
     for $(i,j) \leftarrow (1,1)$ to $(n_w, n_h)$ do
        $b \leftarrow \text{BCount}(\mathbf{bbox}, i, j)$
        if $b < 1$ then
           $x^{adv}_{i,j} \leftarrow \text{PatchGen}(s_g)$
        else
           $x^{adv}_{i,j} \leftarrow x_{i,j} + \text{Projection}(x^{adv}_{i,j} - x_{i,j})$
        end if
     end for
     $x^{adv}_{i,j} \leftarrow \text{Clamp}(x^{adv}_{i,j},\ x_{i,j} - \epsilon d,\ x_{i,j} + \epsilon d)$
  end for
  return $x^{adv}$

The perturbed images generated through the candidate selection process described above generally fall outside the given epsilon ball. The primary objective of this step is to project these perturbed images back onto the epsilon ball while maintaining the strength of the attack. Drawing on findings from prior studies [39, 24], we believe that there are many perturbed pixels that can be dropped without influencing the final predictions. Hence, this step centers on assessing the significance of perturbations at the pixel level. Some perturbed pixels, while discernible to humans, may not give rise to ghost objects, rendering them safe to eliminate to minimize the distance. Conversely, it is imperative to identify and retain the critical perturbed pixels.

The spirit of color manipulation is to refine the perturbations by shrinking their amplitudes and adjusting their average over specific regions while maintaining the same predictions. The mathematical definition of Projection is as follows:

$$\mathbf{X_p} = \mathbf{M_e} \otimes F_e(\mathbf{X_p}) + (\mathbf{1} - \mathbf{M_e}) \otimes F_i(\mathbf{X_p}), \tag{1}$$

where $\mathbf{X_p}$ refers to the perturbations; $\mathbf{M_e}$ is a matrix marking eligible pixels as $1$; $F_e(\cdot)$ and $F_i(\cdot)$ are two functions that operate on eligible pixels and ineligible pixels, respectively, where ineligible pixels are those whose perturbation amplitudes fall outside the given epsilon ball. To efficiently project the ineligible pixels onto the epsilon ball, the function $F_i$ is defined as follows:

$$F_i(\cdot) = s_i \mathbf{M_r} \otimes \mathbf{X_p}, \tag{2}$$

where $s_i$ is a scaling term and $\mathbf{M_r}$ is a random indicator matrix. When $s_i$ is set to $0$, it implies a simple dropout. On the other hand, we perform a linear transformation for eligible pixels. Specifically, $F_e(\cdot)$ can be formulated as

$$F_e(\cdot) = s_e \mathbf{X_p} + b_e, \tag{3}$$

where $b_e$ re-centers the mean of the perturbations and $s_e$ scales the strength of the perturbations such that the distribution in the color space of the perturbed images and that of the original images are consistent. These two terms $b_e$ and $s_e$ depend on the background in the target image. For example, a camouflage perturbation should have a relatively small amplitude in low-frequency regions.
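Equations (1) to (3) can be combined into a short NumPy sketch. It assumes that $\otimes$ denotes the element-wise product and that eligibility is tested per pixel against the radius $\epsilon$. The default values of `s_e`, `b_e`, `s_i`, and `keep_prob` are placeholders for illustration, since the text states that $s_e$ and $b_e$ depend on the background of the target image.

```python
import numpy as np

def projection(x_p, eps, s_e=0.9, b_e=0.0, s_i=0.5, keep_prob=0.5, rng=None):
    """Sketch of Eq. (1): blend F_e on eligible pixels with F_i on the rest.

    x_p is the perturbation (adversarial patch minus original patch), as floats.
    All default parameter values are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    m_e = (np.abs(x_p) <= eps).astype(float)               # M_e: 1 for eligible pixels
    m_r = (rng.random(x_p.shape) < keep_prob).astype(float)  # M_r: random indicator
    f_e = s_e * x_p + b_e                                   # Eq. (3)
    f_i = s_i * m_r * x_p                                   # Eq. (2)
    return m_e * f_e + (1.0 - m_e) * f_i                    # Eq. (1)

# With s_i = 0 the ineligible pixels are simply dropped, as noted in the text.
x_p = np.array([4.0, -20.0, 8.0, 30.0])
projected = projection(x_p, eps=8.0, s_e=0.5, b_e=1.0, s_i=0.0)
```

In the example, the two pixels inside the ball are transformed linearly to 3 and 5, and the two pixels outside the ball become 0. With a non-zero `s_i`, the random mask decides which out-of-ball pixels are kept in scaled form, so repeated calls give different candidates.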

Algorithm 2 sketches the procedure of the color manipulation, where $d$ refers to a tolerance that allows the perturbations to exist within a larger epsilon ball during the internal steps of crafting adversarial examples. Similar to the position-centric object selection algorithm shown in Algorithm 1, we traverse each grid in each iteration. If no objects are detected in a particular grid, we generate a new patch. Otherwise, we apply the function Projection to the perturbed patch. Lastly, the perturbed images are projected onto the epsilon ball with a radius of $\epsilon d$. This algorithm is repeated with a gradual reduction in the magnitude of $d$.
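The final clamping step and the shrinking tolerance can be illustrated with a few lines of NumPy. The schedule of $d$ values (4, 2, 1) is a hypothetical choice; the text only states that $d$ is reduced gradually. The detector queries and the Projection step that would run between clamps are left out.

```python
import numpy as np

def clamp_to_ball(x_adv, x, eps, d):
    """Clamp x_adv element-wise into [x - eps*d, x + eps*d]."""
    return np.clip(x_adv, x - eps * d, x + eps * d)

x = np.full((2, 2), 100.0)                       # original pixel values
x_adv = np.array([[100.0, 140.0], [60.0, 108.0]])  # perturbed pixel values
eps = 8.0
for d in (4.0, 2.0, 1.0):  # assumed schedule: the radius shrinks from 32 to 8
    x_adv = clamp_to_ball(x_adv, x, eps, d)
```

After the last pass with $d = 1$, every pixel lies within the target radius $\epsilon$ of the original image, while pixels that were already close to the original are never modified by the clamp.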

While the perturbed example can cause numerous ghost objects to be predicted by the victim model, attaining the desired objective in each iteration proves elusive. The major reason is that our objective, the number of objects, is non-differentiable, rendering gradient-based black-box algorithms impractical. Therefore, we have put forth an alternative approach where the outputs for a specific $d$ are internally recorded. When obtaining worse results for smaller radii, we can revert to the previous status and repeat the procedure to improve success rates.

4 Experiments

4.1 Setup

To demonstrate the performance of the proposed attack on various models, we conducted experiments on four one-stage models, one two-stage model, and one transformer-based model: YOLOv8 [19], RetinaNet [29], FCOS [42], SSD300 [31], Faster-RCNN [37], and DETR [9]. We also ran the proposed algorithm against the Azure vision and GCP vision APIs. We randomly selected 100 images from the validation set of the MS COCO 2017 dataset [30] and Open Images [23] as the target images. The input dimensions are fixed to $(640, 640)$ in both the data collection and inference phases. Additionally, we include ablation studies covering various aspects, including data collection from different datasets or models and cost analysis. The predictions from different models are visualized in Appendix A.

To the best of our knowledge, this is the first paper that focuses on appearing attacks under the black-box scenario, and there is no general metric to evaluate performance. A direct comparison of the total number of objects caused by adversarial images among models should be avoided since the maximum number of objects output by different models is not a fixed number. In this paper, we bound the perturbation in the $L_\infty$ norm and define an attack as successful if the object increment caused by the adversarial example is greater than 20.

4.2 Performance Evaluation

Table 1: Attack success rates (ASR) across various models under different radii $\epsilon$, with patches collected from different datasets.

| Model | MS COCO $\epsilon=8$ | MS COCO $\epsilon=16$ | MS COCO $\epsilon=24$ | MS COCO $\epsilon=32$ | Open Images $\epsilon=8$ | Open Images $\epsilon=16$ | Open Images $\epsilon=24$ | Open Images $\epsilon=32$ |
|---|---|---|---|---|---|---|---|---|
| DETR | 1% | 7% | 13% | 29% | 1% | 6% | 15% | 28% |
| Faster-RCNN | 2% | 10% | 21% | 35% | 1% | 8% | 24% | 36% |
| Retinanet | 3% | 9% | 22% | 37% | 5% | 8% | 28% | 39% |
| FCOS | 5% | 12% | 29% | 49% | 4% | 9% | 31% | 52% |
| YOLOv8 | 8% | 15% | 42% | 81% | 9% | 13% | 45% | 83% |
| SSD300 | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| Azure | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| GCP | 0% | 0% | 11% | 15% | 0% | 0% | 12% | 18% |

This experiment evaluates the attack success rates (ASR) across various models under different epsilon radii. ASR is the ratio of the number of successful attacks to the total number of attacks. In this experiment, the maximum number of queries during the attack iterations is set to 4,000, and patches are collected from the MS COCO and Open Images datasets. The experimental results are shown in Table 1.
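As a small worked example of the metric, the sketch below computes the ASR from per-image object counts, using the success criterion from Section 4.1 (an object increment greater than 20). The function name and the toy counts are made up for illustration.

```python
def attack_success_rate(clean_counts, adv_counts, min_increment=20):
    """Fraction of images whose object count grows by more than `min_increment`."""
    successes = sum(
        (adv - clean) > min_increment
        for clean, adv in zip(clean_counts, adv_counts)
    )
    return successes / len(clean_counts)

# Toy counts for four target images (invented numbers, not experimental data).
clean_counts = [3, 5, 2, 10]
adv_counts = [30, 20, 22, 31]
asr = attack_success_rate(clean_counts, adv_counts)
```

The increments here are 27, 15, 20, and 21. Only the first and last exceed 20, so the ASR for this toy set is 0.5; an increment of exactly 20 does not count as a success.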

As can be seen, the ASR sharply declines as the radius of the epsilon ball is reduced. This phenomenon occurs because tiny perturbations fail to generate significant output signals, and because the proposed attack is a stochastic process that produces some false signals during the internal steps. Nevertheless, these results demonstrate that the attack achieves its highest ASR on the YOLOv8 model, and it is also effective on other models, with the exception of SSD300 and Azure. The findings suggest the feasibility of attacking a public model, but the ASR is not notably high due to the lack of model-specific information.

The ASR on DETR is the second lowest among the locally deployed models. A possible reason is that we assume no inter-grid influence in this paper. This assumption does not align with the attention mechanism found in transformer-based models, which can dynamically adjust the importance of global input tokens, so that objects have different ranges of influence in space.

Remarkably, the proposed attack fails completely when applied to the SSD300 model. The main issue stems from our objective design, which favors patches with multiple objects, so that each object is relatively small in size. This approach tends to divide a large object into multiple small objects during the iterative color manipulation. However, a previous study has demonstrated that the SSD model struggles to accurately identify small objects [18]. Consequently, all objects vanish during these iterations. Similarly, Azure has a limitation: objects smaller than 5% of the image size are eliminated internally. This implies that, by definition, the attack is destined to fail.

The experimental results shown in Table 1 suggest that collecting data from different datasets has no significant impact on ASR. However, we would like to emphasize that the strength of adversarial attacks highly depends on how well the color distribution of the selected target image matches that of the collected data. As shown in Figure 4(a), the upper region of the left image is almost entirely black, so it is challenging to find an object in the database that aligns with this color distribution. Conversely, Figure 4(b) exhibits wide color fluctuations across the entire spatial domain, allowing attackers to embed ghost objects at any location.

Figure 4 (panels a and b): The strength of adversarial examples highly depends on color fluctuations.
Figure 5 (panels c and d): Objects tend to be clustered in specific regions.

4.3 Ablation study

4.3.1 Analysis of spatial distribution

Initially, we expected the position-centric algorithm to generate numerous objects uniformly distributed across the entire image. However, as depicted in Figure 5, the generated objects tend to cluster in specific regions, a pattern consistent with findings reported in [11]. When objects crowd into a small region, there is a high probability that these objects reference the same item. Consequently, numerous candidates might be eliminated during an internal stage. Conversely, empty regions may contain objects with relatively low confidence, leading the model to not report these objects to users. This clustering phenomenon presents a challenge in generating ghost objects. To tackle this issue, it is advisable to implement a spatial attention algorithm similar to the Overload attack or to identify specific areas that should be preserved. Besides, it is imperative to design an appropriate approach to hinder high-confidence objects.

4.3.2 Minimum Number of Required Images

This experiment aims to evaluate the influence of the number of collected images on ASR. We followed the implementation details outlined in Section 4.1, set the epsilon radius to 32, and collected the data from the MS COCO dataset. Table 2 presents the experimental results.

As can be seen, data diversity grows as the number of collected images increases, leading to an improvement in ASR. However, these improvements are not notably significant, primarily because each dataset has its own distinct data distribution. For instance, there is an abundance of objects with green textures such as grasslands, baseball fields, and trees. Nevertheless, as illustrated in Figure 4(a), these objects are incompatible with black backgrounds. Consequently, despite the increased number of collected images, the enhancements remain marginal. These experimental findings underscore the importance of data diversity. To enhance ASR, it is crucial to consider gathering data from diverse datasets or implementing color transformations.

4.3.3 Influence on Using Different Datasets or Models

Table 2: Ablation study on the effect of the number of collected images on ASR.
Model         Number of images
              100   200   500
DETR          29%   31%   32%
Faster-RCNN   35%   37%   37%
Retinanet     37%   39%   42%
FCOS          49%   52%   55%
YOLOv8        81%   83%   84%

Table 3: Ablation study on the effect of data-collection configurations on ASR.
Model         config 1  config 2  config 3  config 4
DETR          29%       28%       29%       28%
Faster-RCNN   35%       36%       31%       36%
Retinanet     37%       35%       32%       39%
FCOS          49%       47%       43%       52%
YOLOv8        81%       81%       81%       83%

This experiment aims to evaluate the influence of the data-collection configuration on ASR, with the epsilon radius set to 32. In the first configuration, patches were collected in advance from the MS COCO dataset using the target model. The second configuration followed the same approach but replaced the MS COCO dataset with the ImageNet dataset [15]. The third configuration collected candidate objects from the MS COCO dataset using the YOLOv8 model. The last configuration used the GCP Vision API to collect patches. The experimental results are summarized in Table 3.

As can be seen, there is no significant performance gap among the configurations, even though the locations and sizes of objects predicted by different models are not precise. These experimental results show that data collection is a crucial step, but it is not currently the performance bottleneck of this work. Besides, the cost of the proposed attack is affordable, which motivates attackers to exploit vulnerabilities in AI systems and invest in refining attack algorithms.

4.4 Time Consumption and Costs

Based on the above analysis, we can conclude that approximately 500 images are required to collect the patches and only one model needs to be deployed locally. We estimated the time consumption and total costs for data collection under the above scenario.

The inference time for processing 500 images with an Nvidia V100 is less than one hour. Considering the cost of renting the same GPU on GCP for one hour ($2.48 per GPU), the total expense, including the time for initialization, would not exceed $3. In contrast, the cost of using the GCP Vision API is $1.50 per 1,000 requests. However, employing the GCP Vision API involves formatting data packets and transferring data between the server and the client. As a result, a single request takes up to 10 seconds, leading to a total data collection time of approximately 2 hours.
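The figures quoted above can be combined in a short back-of-the-envelope calculation. The sketch below only restates those numbers. The variable names are hypothetical, requests are assumed to be issued sequentially, and initialization time, retries, and throttling are ignored.

```python
# Figures quoted in the text.
GPU_PRICE_PER_HOUR = 2.48      # USD, one V100 rented on GCP
API_PRICE_PER_1000 = 1.50      # USD per 1,000 Vision API requests
N_IMAGES = 500                 # images needed to collect patches
MAX_SECONDS_PER_REQUEST = 10   # upper bound on one API round trip

# Local deployment: inference finishes within one GPU-hour.
gpu_hours = 1.0
gpu_cost = GPU_PRICE_PER_HOUR * gpu_hours

# Public API: one request per image, issued one after another.
api_cost = N_IMAGES / 1000 * API_PRICE_PER_1000
api_hours_upper_bound = N_IMAGES * MAX_SECONDS_PER_REQUEST / 3600

print(f"local GPU : ${gpu_cost:.2f} for up to {gpu_hours:.1f} h")
print(f"vision API: ${api_cost:.2f} for up to {api_hours_upper_bound:.2f} h of requests")
```

The request time computed here covers only the round trips themselves, so the end-to-end collection time reported in the text can be longer.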

From this analysis, we conclude that deploying a private model locally is the most economically viable solution for the proposed algorithm.

4.5 Discussion and Limitations

Parallel works [6, 7] have suggested that a co-occurrence graph, which captures relationships between pairs of objects, can provide suggested placements for attaching these objects. However, data collection remains an essential step. The main focus of these studies lies in minimizing the overall number of queries needed to create an adversarial example capable of deceiving the model. Besides, our approach distinguishes itself by offering a practical advantage in terms of incurring manageable costs associated with data collection.

Nonetheless, we observed that specific classes tend to appear frequently in the perturbed images. Moreover, with the GCP Vision API, certain classes, such as "2D barcode", can suppress the appearance of other objects, even when the pairwise distances are much larger. This could lead to a failure of our attack algorithm. These observations suggest that the GCP Vision API employs multiple processing flows under certain conditions. Real-world AI applications are more complex than what we considered in this work, and more data must be collected to analyze the preferences of the victim model in order to improve ASR. In that case, the time investment and total costs would exceed the estimates reported in this work.

Data collection with public AI systems is not recommended. GCP officially states that the size limits for input and output data are 20MB and 10MB, respectively. If the limit is exceeded, the system will reject the response. However, we encountered some exceptions during our experiments despite not violating these limits. For instance, frequent accesses may trigger bandwidth limitations, slowing down the speed of data transfer. Additionally, certain images may cause glitches, leading the system to directly reject the requests. This suggests that GCP has an internal system for managing user requests and available resources.

The proposed attack strategy has a significant limitation: it is inherently size-variant and cannot generate adversarial examples within a narrow epsilon ball. Specifically, the input images are resized to a pre-defined size before being fed into the target model. However, the exact dimensions remain undisclosed to end-users. In an extreme scenario, a system with limited computational resources may compress inputs to small dimensions, like (64, 64), to reduce the computing costs. The compression invariably eliminates the fine structure of perturbations. Consequently, multiple adversarial examples produced during internal steps are mapped to the same resized image, rendering numerous queries invalid. Nevertheless, we believe that integrating Expectation over Transformation (EoT) [3, 2] into the proposed attack could enhance its effectiveness.
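The collapse of distinct candidates under resizing can be illustrated with a toy example. The sketch below assumes block averaging as the resize operation and a synthetic single-channel image; real services may use other interpolation schemes, so this is only an illustration of the effect, not a model of any particular system.

```python
import numpy as np

def downsample(img, out=64):
    """Block-average an (H, W) image to (out, out) and re-quantize to uint8."""
    h, w = img.shape
    assert h % out == 0 and w % out == 0
    blocks = img.astype(np.float64).reshape(out, h // out, out, w // out)
    return np.rint(blocks.mean(axis=(1, 3))).astype(np.uint8)

rng = np.random.default_rng(0)

# Synthetic 512x512 clean image that is constant inside each 8x8 block.
coarse = rng.integers(50, 200, size=(64, 64))
base = np.kron(coarse, np.ones((8, 8), dtype=np.int64))

# Two candidates with opposite fine-grained perturbations (L-inf radius 8).
yy, xx = np.indices(base.shape)
checker = np.where((yy + xx) % 2 == 0, 8, -8)
adv_a = base + checker
adv_b = base - checker

same_after_resize = np.array_equal(downsample(adv_a), downsample(adv_b))
differ_everywhere = bool((adv_a != adv_b).all())
```

The two candidates differ at every pixel, yet both resize to the same 64x64 image, so a query for the second candidate returns no new information.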

Besides, black-box attacks are much weaker than white-box attacks. Under the criteria set in this paper, the ASR of white-box attacks reaches 100% even with a single image, fewer iterations, and a smaller epsilon ball. However, we emphasize that the primary contribution of this study is exposing vulnerabilities in AI systems and analyzing the feasibility of latency attacks on real-world AI applications.

These limitations and our observations suggest three possible defenses. The first performs multiple inferences with various input dimensions. If the results diverge significantly, there is a high probability that the input image is malicious. The second integrates a mechanism that measures context inconsistency [26]. Requests are rejected if distinct and unrelated objects coexist within the same image. The third relies on image quality assessment. Since adversarial perturbations at the current stage live in a relatively large epsilon ball, they might be detectable by some existing image quality measurements.
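The first defense could be prototyped along the following lines. This is a hedged sketch: `count_objects` is a hypothetical callable that wraps a detector and returns the number of reported objects for a given input size, and both the candidate sizes and the `max_spread` threshold are illustrative values, not settings evaluated in this paper.

```python
def size_consistency_check(count_objects, image, sizes=(320, 480, 640), max_spread=10):
    """Flag an input whose object count varies strongly with the input size.

    count_objects(image, size) -> int is supplied by the caller (assumption).
    Returns (flagged, counts).
    """
    counts = [count_objects(image, size) for size in sizes]
    spread = max(counts) - min(counts)
    return spread > max_spread, counts

# Stub detectors standing in for a real model (illustration only).
def stable_detector(image, size):
    return 5  # same count at every resolution

def fragile_detector(image, size):
    return 60 if size == 640 else 4  # ghost objects appear at one size only

benign_flag, benign_counts = size_consistency_check(stable_detector, None)
attack_flag, attack_counts = size_consistency_check(fragile_detector, None)
```

With these stubs, only the input whose count jumps at one resolution is flagged. The extra inferences multiply the per-request cost, which is a trade-off any deployment would have to weigh.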

5 Conclusion

This paper has presented a novel attack strategy, the "steal now, attack later" approach, which aims to deceive the victim model into generating ghost objects under the black-box scenario, so that the system cannot respond to requests immediately or its computing resources are exhausted. This is the first paper that studies the feasibility of latency attacks against object detection under the black-box scenario. The method resolves the lack of information about unqualified objects. The experimental results have demonstrated the effectiveness of this approach across various models. Furthermore, the proposed attack is shown to be applicable to the public vision API provided by Google Cloud Platform (GCP). Our analysis indicates that deploying a private model locally is the most economical solution, and that the costs associated with the proposed attack are affordable. This encourages attackers to invest in improving attack algorithms to exploit vulnerabilities in AI systems.

References

  • [1] Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: a query-efficient black-box adversarial attack via random search. arXiv preprint arXiv:1912.00049 (2019)
  • [2] Athalye, A., Carlini, N., Wagner, D.: Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In: International conference on machine learning. pp. 274–283. PMLR (2018)
  • [3] Athalye, A., Engstrom, L., Ilyas, A., Kwok, K.: Synthesizing robust adversarial examples. In: International conference on machine learning. pp. 284–293. PMLR (2018)
  • [4] Beal, J., Kim, E., Tzeng, E., Park, D.H., Zhai, A., Kislyuk, D.: Toward transformer-based object detection. arXiv preprint arXiv:2012.09958 (2020)
  • [5] Biswas, S.S.: Role of chat gpt in public health. Annals of biomedical engineering 51(5), 868–869 (2023)
  • [6] Cai, Z., Rane, S., Brito, A.E., Song, C., Krishnamurthy, S.V., Roy-Chowdhury, A.K., Asif, M.S.: Zero-query transfer attacks on context-aware object detectors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15024–15034 (2022)
  • [7] Cai, Z., Xie, X., Li, S., Yin, M., Song, C., Krishnamurthy, S.V., Roy-Chowdhury, A.K., Asif, M.S.: Context-aware transfer attacks for object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 149–157 (2022)
  • [8] Canella, C., Genkin, D., Giner, L., Gruss, D., Lipp, M., Minkin, M., Moghimi, D., Piessens, F., Schwarz, M., Sunar, B., et al.: Fallout: Leaking data on meltdown-resistant cpus. In: Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security. pp. 769–784 (2019)
  • [9] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European conference on computer vision. pp. 213–229. Springer (2020)
  • [10] Chai, W., Guo, X., Wang, G., Lu, Y.: Stablevideo: Text-driven consistency-aware diffusion video editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23040–23050 (2023)
  • [11] Chen, E.C., Chen, P.Y., Chung, I.H., Lee, C.R.: Overload: Latency attacks on object detection for edge devices. arXiv preprint arXiv:2304.05370 (2023)
  • [12] Chen, E.C., Lee, C.R.: Towards fast and robust adversarial training for image classification. In: Proceedings of the Asian Conference on Computer Vision (ACCV) (November 2020)
  • [13] Chen, P.Y., Zhang, H., Sharma, Y., Yi, J., Hsieh, C.J.: Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In: Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. pp. 15–26 (2017)
  • [14] Cheng, M., Le, T., Chen, P.Y., Yi, J., Zhang, H., Hsieh, C.J.: Query-efficient hard-label black-box attack: An optimization-based approach. arXiv preprint arXiv:1807.04457 (2018)
  • [15] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. IEEE (2009)
  • [16] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International journal of computer vision 88, 303–338 (2010)
  • [17] Hu, Z., Huang, S., Zhu, X., Sun, F., Zhang, B., Hu, X.: Adversarial texture for fooling person detectors in the physical world. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13307–13316 (2022)
  • [18] Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z., Song, Y., Guadarrama, S., et al.: Speed/accuracy trade-offs for modern convolutional object detectors. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7310–7311 (2017)
  • [19] Jocher, G., Chaurasia, A., Qiu, J.: YOLO by Ultralytics (Jan 2023), https://github.com/ultralytics/ultralytics
  • [20] Kaur, D., Uslu, S., Rittichier, K.J., Durresi, A.: Trustworthy artificial intelligence: a review. ACM Computing Surveys (CSUR) 55(2), 1–38 (2022)
  • [21] Kaviani, S., Han, K.J., Sohn, I.: Adversarial attacks and defenses on ai in medical imaging informatics: A survey. Expert Systems with Applications 198, 116815 (2022)
  • [22] King, S.T., Tucek, J., Cozzie, A., Grier, C., Jiang, W., Zhou, Y.: Designing and implementing malicious hardware. Leet 8, 1–8 (2008)
  • [23] Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A., et al.: The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision 128(7), 1956–1981 (2020)
  • [24] Li, C., Yao, W., Wang, H., Jiang, T.: Adaptive momentum variance for attention-guided sparse adversarial attacks. Pattern Recognition 133, 108979 (2023)
  • [25] Li, D., Zhang, J., Huang, K.: Universal adversarial perturbations against object detection. Pattern Recognition 110, 107584 (2021)
  • [26] Li, S., Zhu, S., Paul, S., Roy-Chowdhury, A., Song, C., Krishnamurthy, S., Swami, A., Chan, K.S.: Connecting the dots: Detecting adversarial perturbations using context inconsistency. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16. pp. 396–413. Springer (2020)
  • [27] Liang, S., Li, L., Fan, Y., Jia, X., Li, J., Wu, B., Cao, X.: A large-scale multiple-objective method for black-box attack against object detection. In: European Conference on Computer Vision. pp. 619–636. Springer (2022)
  • [28] Liang, S., Wu, B., Fan, Y., Wei, X., Cao, X.: Parallel rectangle flip attack: A query-based black-box attack against object detection. arXiv preprint arXiv:2201.08970 (2022)
  • [29] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. pp. 2980–2988 (2017)
  • [30] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. pp. 740–755. Springer (2014)
  • [31] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd: Single shot multibox detector. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. pp. 21–37. Springer (2016)
  • [32] Long, T., Gao, Q., Xu, L., Zhou, Z.: A survey on adversarial attacks in computer vision: Taxonomy, visualization and future directions. Computers & Security p. 102847 (2022)
  • [33] Ma, C., Wang, N., Chen, Q.A., Shen, C.: Slowtrack: Increasing the latency of camera-based perception in autonomous driving using adversarial examples. arXiv preprint arXiv:2312.09520 (2023)
  • [34] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks (2019)
  • [35] Mahmood, K., Mahmood, R., Rathbun, E., van Dijk, M.: Back in black: A comparative evaluation of recent state-of-the-art black-box attacks. IEEE Access 10, 998–1019 (2021)
  • [36] Morgulis, N., Kreines, A., Mendelowitz, S., Weisglass, Y.: Fooling a real car with adversarial traffic signs. arXiv preprint arXiv:1907.00374 (2019)
  • [37] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28 (2015)
  • [38] Shapira, A., Zolfi, A., Demetrio, L., Biggio, B., Shabtai, A.: Phantom sponges: Exploiting non-maximum suppression to attack deep object detectors. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 4571–4580 (2023)
  • [39] Su, J., Vargas, D.V., Sakurai, K.: One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation 23(5), 828–841 (2019)
  • [40] Sultana, F., Sufian, A., Dutta, P.: A review of object detection models based on convolutional neural network. Intelligent computing: image processing based applications pp. 1–16 (2020)
  • [41] Thys, S., Van Ranst, W., Goedemé, T.: Fooling automated surveillance cameras: adversarial patches to attack person detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops. pp. 0–0 (2019)
  • [42] Tian, Z., Shen, C., Chen, H., He, T.: Fcos: Fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9627–9636 (2019)
  • [43] Wu, Z., Lim, S.N., Davis, L.S., Goldstein, T.: Making an invisibility cloak: Real world adversarial attacks on object detectors. In: European Conference on Computer Vision. pp. 1–17. Springer (2020)
  • [44] Zaidi, S.S.A., Ansari, M.S., Aslam, A., Kanwal, N., Asghar, M., Lee, B.: A survey of modern deep learning based object detection models. Digital Signal Processing 126, 103514 (2022)
  • [45] Zolfi, A., Kravchik, M., Elovici, Y., Shabtai, A.: The translucent patch: A physical and universal attack on object detectors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15232–15241 (2021)

Appendix 0.A Data Visualization

In this section, we present visualizations of the predictions made by various models for the adversarial examples, as shown in Figures 6 to 10. It is evident that all models have failed when subjected to ϵ = 8. Moreover, the objects generated by adversarial examples tend to be relatively small in size. Based on these observations, we believe that this is the main reason why our attack cannot currently be applied to SSD.

Figure 6: The predictions of adversarial examples by DETR under different epsilon radii. Panels: (a) ϵ = 8, (b) ϵ = 16, (c) ϵ = 24, (d) ϵ = 32.
Figure 7: The predictions of adversarial examples by Faster R-CNN under different epsilon radii. Panels: (a) ϵ = 8, (b) ϵ = 16, (c) ϵ = 24, (d) ϵ = 32.
Figure 8: The predictions of adversarial examples by Retinanet under different epsilon radii. Panels: (a) ϵ = 8, (b) ϵ = 16, (c) ϵ = 24, (d) ϵ = 32.
Figure 9: The predictions of adversarial examples by FCOS under different epsilon radii. Panels: (a) ϵ = 8, (b) ϵ = 16, (c) ϵ = 24, (d) ϵ = 32.
Figure 10: The predictions of adversarial examples by YOLOv8 under different epsilon radii. Panels: (a) ϵ = 8, (b) ϵ = 16, (c) ϵ = 24, (d) ϵ = 32.