License: arXiv.org perpetual non-exclusive license
arXiv:2403.11515v1 [cs.CV] 18 Mar 2024

SSAP: A Shape-Sensitive Adversarial Patch for Comprehensive Disruption of Monocular Depth Estimation in Autonomous Navigation Applications

Amira Guesmi¹, Muhammad Abdullah Hanif¹, Ihsen Alouani², Bassem Ouni³, Muhammad Shafique¹
¹ eBrain Lab, New York University (NYU) Abu Dhabi, UAE
² CSIT, Queen’s University Belfast, UK
³ AI and Digital Science Research Center, Technology Innovation Institute (TII)
Abstract

Monocular depth estimation (MDE) has advanced significantly, primarily through the integration of convolutional neural networks (CNNs) and more recently, Transformers. However, concerns about their susceptibility to adversarial attacks have emerged, especially in safety-critical domains like autonomous driving and robotic navigation. Existing approaches for assessing CNN-based depth prediction methods have fallen short in inducing comprehensive disruptions to the vision system, often limited to specific local areas. In this paper, we introduce SSAP (Shape-Sensitive Adversarial Patch), a novel approach designed to comprehensively disrupt monocular depth estimation (MDE) in autonomous navigation applications. Our patch is crafted to selectively undermine MDE in two distinct ways: by distorting estimated distances or by creating the illusion of an object disappearing from the system’s perspective. Notably, our patch is shape-sensitive, meaning it considers the specific shape and scale of the target object, thereby extending its influence beyond immediate proximity. Furthermore, our patch is trained to effectively address different scales and distances from the camera. Experimental results demonstrate that our approach induces a mean depth estimation error surpassing 0.5, impacting up to 99% of the targeted region for CNN-based MDE models. Additionally, we investigate the vulnerability of Transformer-based MDE models to patch-based attacks, revealing that SSAP yields a significant error of 0.59 and exerts substantial influence over 99% of the target region on these models.

I Introduction

Figure 1: Our SSAP makes the object fully disappear. In contrast, the adversarial patches proposed by Yamanaka et al. [1] and Cheng et al. [2] are weak: they only impact the depth of a small region of the target object, restricted to the overlapping region between the patch and the input image.

MDE has become increasingly valuable in practical applications such as robotics and autonomous driving (AD). MDE enables the extraction of depth information from a single image, improving understanding of the scene. Its importance extends to critical robotic functions, including obstacle avoidance [3], object detection [4], visual SLAM [5, 6], and visual relocalization [7].

Various methods for depth estimation rely on technologies such as RGB-D cameras, Radar, LiDAR, or ultrasound devices to directly capture depth information within a scene. However, these alternatives have significant limitations. RGB-D cameras have a limited measurement range, while LiDAR and Radar provide sparse data and are costly sensing solutions. These factors may not be suitable for compact autonomous systems, such as low-cost, lightweight, and small-sized mobile robots. Additionally, ultrasound devices suffer from inherent measurement inaccuracies. Furthermore, these technologies consume substantial energy and have large form factors, making them unsuitable for resource-restricted, small-scale systems that must adhere to stringent real-world design constraints. In contrast, RGB cameras offer lightweight and cost-effective options. They have the capability to provide more comprehensive environmental data.

Leading players in the autonomous vehicle sector are driving advancements in self-driving technology by utilizing cost-effective camera solutions. Notably, Monocular Depth Estimation (MDE) has been seamlessly integrated into Tesla’s production-grade Autopilot system [8, 9]. Other major autonomous driving (AD) enterprises, such as Toyota [10] and Huawei [11], are also adopting this approach to accelerate self-driving advancements, following Tesla’s lead.

In recent years, the advancement of deep learning has significantly improved the performance of monocular depth estimation (MDE), primarily through the utilization of CNN-based models [12, 13, 14] and Transformer-based models [15, 16, 17]. However, CNNs have shown vulnerabilities to adversarial attacks, and the security properties of Transformers are yet to be thoroughly studied.

Previous efforts in patch-based adversarial attacks, which focused solely on CNN-based MDE [1, 2], produced relatively weak adversarial patches. These patches had a limited impact on the depth estimation of specific objects such as vehicles and pedestrians, with their influence confined to the overlapping area between the patch and the input image. There is significant potential for improvement in expanding the sphere of influence of these patches.

Our approach introduces a novel technique for crafting adversarial patches tailored for both CNN-based and Transformer-based monocular depth estimation. Specifically, we generate shape-sensitive adversarial patches aimed at deceiving the target methods. These patches prompt the methods to erroneously estimate the depth of designated objects (such as vehicles or pedestrians), or even to completely conceal the presence of those objects.

In summary, the novel contributions of this work are:

  • We introduce a shape-sensitive adversarial patch (SSAP) designed to disrupt the output of the MDE model.

  • We leverage information from a pre-trained detector during the patch generation process. This enables us to adaptively craft patches that are robust to varying scales and distances from the camera, mimicking real-world scenarios. Additionally, this approach helps prevent the patch from being trained for irrelevant objects, ensuring that it exclusively targets specific objects and thereby enhances its effectiveness.

  • We introduce a novel penalized loss function aimed at enhancing the efficiency of our adversarial patch and expanding its impact region (refer to Figure 1).

  • We conduct an ablation study to demonstrate the effectiveness of our modified loss function in extending the influence of our proposed patch.

  • To the best of our knowledge, we are the first to investigate the robustness of transformer-based MDE models. We demonstrate their vulnerability to our patch-based adversarial attacks, despite claims of robustness to natural noise and adversarial attacks. (Footnote 1: We provide a demo video showcasing the effectiveness of our patch in concealing a target object for the transformer-based MDE model MIMDepth [17]. The video demonstrates various patch sizes and distances from the camera.)

  • Our proposed patch achieves a high mean depth estimation error exceeding 0.5, significantly impacting nearly 99% of the target region for CNN-based MDE. Additionally, it results in a mean depth estimation error of 0.59 with a substantial influence on 99% of the region for transformer-based MDE.

  • Our devised attack methodology is generic, making it applicable to various object categories present on public roads. However, for proof-of-concept, we focus on two representative object types —cars and pedestrians— for targeting purposes.

An overview of our framework is depicted in Figure 2, while a comprehensive description can be found in Section II-B.

II Proposed Approach

II-A Problem formulation

In monocular depth estimation, when presented with a benign image denoted as $I$, the objective of the adversarial attack is to cause the depth estimation method to inaccurately predict the depth of the intended object by employing a strategically designed image represented as $I^{*}$. The adversarial example incorporating the generated patch can be formulated as follows:

$$I^{*} = (1 - M_{P}) \odot I + M_{P} \odot P \tag{1}$$

Here, $\odot$ denotes component-wise multiplication, $P$ is the adversarial patch, and $M_{P}$ is the patch mask, used to constrain the size, shape, and location of the adversarial patch. The adversarial depth, i.e., the output of the victim model $F$ when taking the adversarial example as input, is:

$$d_{adv} = F((1 - M_{P}) \odot I + M_{P} \odot P) \tag{2}$$
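As an illustration of Equation 1, the following minimal NumPy sketch composes an adversarial image from a benign image, a patch, and a binary mask. The function name `apply_patch`, the rectangular placement, and the `top`/`left` arguments are assumptions made for this example and are not taken from the authors' implementation.

```python
import numpy as np

def apply_patch(image, patch, top, left):
    """Return (I*, M_P) with I* = (1 - M_P) * I + M_P * P (Equation 1).

    Hypothetical helper: the patch is pasted as an axis-aligned rectangle
    and is assumed to fit entirely inside the image.
    """
    h, w = patch.shape[:2]
    mask = np.zeros_like(image)                 # M_P: ones only under the patch
    mask[top:top + h, left:left + w] = 1.0
    canvas = np.zeros_like(image)               # P padded to the image size
    canvas[top:top + h, left:left + w] = patch
    adversarial = (1.0 - mask) * image + mask * canvas
    return adversarial, mask

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))                   # benign image I in [0, 1]
patch = rng.random((2, 3, 3))                   # patch P in [0, 1]
adv, mask = apply_patch(image, patch, top=3, left=2)
```

Pixels outside the mask are left untouched, so the victim model $F$ only sees a change where $M_{P}$ is one.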

The problem of generating an adversarial example can be formulated as the constrained optimization in Equation 3, given an original input image $I$ and an MDE model $F(\cdot)$:

$$\arg\min_{P} \left\|P\right\|_{p} \quad \text{s.t.} \quad F((1 - M_{P}) \odot I + M_{P} \odot P) \neq F(I), \quad d_{adv} \neq d \tag{3}$$

The goal is to identify a minimal adversarial noise, $P$, such that when applied on any object within a designated input domain $U$, it strategically compromises the underlying DNN-based MDE model $F(\cdot)$. This compromise can take the form of either distorting the estimated distance or causing the object to vanish from the prediction.

It’s important to note that an analytical solution isn’t feasible for this optimization task due to the non-convex nature of the DNN-based model $F(\cdot)$ involved. As a result, the formulation of Equation 3 can be expressed as follows, allowing for the utilization of empirical approximation techniques to numerically address the problem:

$$\operatorname*{arg\,max}_{P} \sum_{I \in \mathcal{U}} l\big(F((1 - M_{P}) \odot I + M_{P} \odot P),\, F(I)\big) \tag{4}$$

Here, $l$ represents a predefined loss function, and $\mathcal{U} \subset U$ denotes the attacker’s training dataset. Optimization methods such as Adam [18] can be employed to solve this problem numerically. During each iteration of the training process, the optimizer updates the adversarial patch $P$.
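To make the empirical objective in Equation 4 concrete, the sketch below runs it on a toy problem. Everything here is an assumption chosen for illustration: the victim model is replaced by a hypothetical linear map `weights`, the loss $l$ is a squared difference, and plain projected gradient ascent is used instead of Adam so that the gradient can be written by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
HEIGHT, WIDTH, OUT_DIM = 6, 6, 4
weights = rng.normal(size=(OUT_DIM, HEIGHT * WIDTH))  # toy stand-in for F
images = rng.random((5, HEIGHT, WIDTH))               # attacker's training set
mask = np.zeros((HEIGHT, WIDTH))                      # patch mask M_P
mask[2:4, 2:4] = 1.0
patch = rng.random((HEIGHT, WIDTH))                   # P, randomly initialised

def model(x):
    """Toy linear 'depth model': F(x) = weights @ vec(x)."""
    return weights @ x.ravel()

def objective(p):
    """Sum over the training set of l(F(I*), F(I)), with l = 0.5 * ||.||^2."""
    total = 0.0
    for img in images:
        adv = (1 - mask) * img + mask * p
        total += 0.5 * np.sum((model(adv) - model(img)) ** 2)
    return total

def gradient(p):
    """Analytic gradient of the objective with respect to the patch."""
    grad = np.zeros_like(p)
    for img in images:
        adv = (1 - mask) * img + mask * p
        residual = model(adv) - model(img)
        grad += mask * (weights.T @ residual).reshape(HEIGHT, WIDTH)
    return grad

initial = objective(patch)
for _ in range(50):
    # Ascent step, then projection back onto valid pixel values.
    patch = np.clip(patch + 0.01 * gradient(patch), 0.0, 1.0)
final = objective(patch)
```

With a real depth network the gradient would come from automatic differentiation rather than a closed form, but the structure of the loop (compose, forward pass, loss, gradient, update of $P$ only) stays the same.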

II-B Overview

While our approach does not directly incorporate considerations of stealthiness, we have focused on ensuring a meaningful and fair comparison with existing methods. To achieve this, we replicate the methodology outlined by Cheng et al. [2], while not specifically addressing stealthiness. Cheng et al.’s method trains a patch for consistent placement on the same object at a fixed distance from the camera while only changing the background, a scenario that may not accurately reflect real-world conditions. In contrast, our method ensures that the generated patch remains effective across various scales and distances from the camera. This adaptability to diverse conditions enhances its practical applicability and effectiveness in real-world settings. Additionally, we replicate the methodology presented in [1], wherein a patch was trained for arbitrary placement within the scene.

As illustrated in Figure 1, our SSAP achieves the complete disappearance of the targeted object, while the other two patches exhibit a less significant impact. To ensure consistency in our comparison, we maintained uniformity by utilizing identical optimization parameters, patch dimensions, and shapes.

Our study aims to develop adversarial patches with a broad impact, covering the entire object regardless of its size, shape, or position, while maintaining their effectiveness as attack tools. To achieve this goal, we utilize a pre-trained object detector to accurately pinpoint the location of the targeted object—the object we aim to hide or manipulate—along with its predicted depth. This precise identification enables us to create a patch that remains effective even with varying distances between the camera and the object.

Furthermore, we introduce a novel loss function designed to amplify the patch’s effect and expand the area it influences. As shown in Figure 2, we begin with a pre-trained object detector and initiate the process by creating two distinct masks. The first mask, denoted as $M_{p}$, represents the precise placement of the patch at the center of the target object. The second mask, $M_{f}$, corresponds to the specific region occupied by the target object, essentially the area influenced by the target object. Subsequently, the patch is fed to the patch transformation block, also known as the patch transformer, where we implement the geometric alterations outlined in Section II-D.

After applying these transformations, we utilize the patch applier to superimpose the generated patch onto the input image. This process involves incorporating information obtained from the object detector, as explained in Section II-C. Following this step, we perform a forward pass. In the subsequent stage, we use the generated masks to compute the required loss functions. Then, we calculate the gradient of the patch. This gradient information is used to update the actual patch, denoted as $P$.

Figure 2: Overview of the proposed approach.

II-C Patch Applier

In a physical setting, the attacker’s control over the perspective, scale, and positioning of the patch in relation to the camera is limited. Therefore, we aim to enhance the robustness of our patch to accommodate a wide range of potential scenarios. During the patch generation phase, we overlay the patch onto the surface of the target object, such as the rear of a vehicle or human attire. This methodology enables us to simulate diverse scenes with different settings.

The training process involves a variety of transformations, such as rotations and occlusions, carefully incorporated to simulate the plausible appearance of our adversarial patch $P$ in a realistic context. Subsequently, leveraging information from the object detector, we obtain precise object locations (such as vehicles or individuals) within a given image $I$. At this stage, it becomes feasible to place our adversarial patch $P$ onto the identified object. Two distinct masks arise from this process: $M_{f}$, encircling the object to demarcate its influence, and $M_{p}$, designed to restrict the patch’s characteristics, including its location, dimensions, and shape. It is crucial to emphasize that the object detector’s role is limited to the optimization procedure of the patch and is not extended to the actual attack phase. To eliminate the need for manual patch placement onto objects as done in [2], we utilize the YOLOv4-tiny detector pretrained on the MSCOCO dataset [19].

The detector’s capabilities streamline the process by automatically identifying object placements, contributing to a more efficient and effective workflow. Let $U = \{I_{i}\}_{i=1}^{M}$ and $V = \{J_{j}\}_{j=1}^{N}$ respectively be the $M$ training and $N$ testing images for a particular scene of attack. We run the YOLOv4-tiny object detector on $U$ and $V$ with an objectness threshold of 0.5 and a non-max suppression IoU threshold of 0.4. This yields the tuples:

$$T_{i}^{U} = \{(B_{i,k}^{U})\}_{k=1}^{D_{i}}, \quad T_{j}^{V} = \{(B_{j,l}^{V})\}_{l=1}^{E_{j}} \tag{5}$$

for each $I_{i}$ and $J_{j}$, where $D_{i}$ is the number of detections in $I_{i}$ (fixed to 14 in our experiment, the same as in [20]) and $B_{i,k}^{U}$ is the bounding box of the $k$-th detection in $I_{i}$ (likewise $E_{j}$ and $B_{j,l}^{V}$ for $J_{j}$). The sets of all detected objects are:

$$T^{U} = \{T_{i}^{U}\}_{i=1}^{M}, \quad T^{V} = \{T_{j}^{V}\}_{j=1}^{N} \tag{6}$$
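The detection step relies on a score threshold followed by non-max suppression. The sketch below shows a generic greedy version of that post-processing with the thresholds quoted above (0.5 and 0.4). It is not the YOLOv4-tiny implementation; the function names, the `(x1, y1, x2, y2)` box layout, and the example boxes are assumptions for illustration.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def filter_detections(boxes, scores, score_thr=0.5, iou_thr=0.4):
    """Drop low-score boxes, then apply greedy non-max suppression."""
    confident = scores >= score_thr
    boxes, scores = boxes[confident], scores[confident]
    order = np.argsort(-scores)              # highest score first
    kept = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        kept.append(best)
        if rest.size == 0:
            break
        # Keep only boxes that do not overlap the selected one too much.
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thr]
    return boxes[kept]

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52],
                  [100, 100, 140, 140], [0, 0, 5, 5]], dtype=float)
scores = np.array([0.9, 0.8, 0.7, 0.3])
kept_boxes = filter_detections(boxes, scores)
```

In this example the second box is suppressed because it overlaps the first, and the fourth is discarded by the score threshold, leaving one bounding box per object.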

The information used to optimize $P$ for a scene of attack consists of the training images $U$ and annotations $T^{U}$. The patch $P$ is randomly initialized. Given the current $P$, the patch is rendered on top of each detected object of the chosen class for each training image $I_{i}$ using the mask $M_{p_{i,k}}^{U}$. $M_{p_{i,k}}^{U}$ is a matrix of zeros except for the patch location (the center of the patch is the center of the bounding box).

The focus mask $M_{f_{i,k}}^{U}$ is defined as the space limited by the generated bounding boxes $B_{i,k}^{U}$ in a way that it covers the whole detected object. $M_{f_{i,k}}^{U}$ is a matrix of zeros except for the targeted regions (the area covered by the objects), where the pixel values are ones. We multiply the generated masks with the adversarial depth map $d_{adv}$ and use the resulting maps to compute the two losses.
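The two masks can be sketched directly from a bounding box. In the example below, `build_masks`, the `(x1, y1, x2, y2)` pixel box format, and the fixed patch size are hypothetical choices for illustration; the sketch also assumes the patch fits inside the image.

```python
import numpy as np

def build_masks(image_shape, box, patch_hw):
    """Return (M_p, M_f) for one detection.

    box is (x1, y1, x2, y2) in pixels; patch_hw is (height, width).
    """
    height, width = image_shape
    x1, y1, x2, y2 = box
    focus = np.zeros((height, width))        # M_f: ones over the whole box
    focus[y1:y2, x1:x2] = 1.0
    cy, cx = (y1 + y2) // 2, (x1 + x2) // 2  # bounding-box center
    ph, pw = patch_hw
    top, left = cy - ph // 2, cx - pw // 2   # patch centered on the box
    patch_mask = np.zeros((height, width))   # M_p: ones only under the patch
    patch_mask[top:top + ph, left:left + pw] = 1.0
    return patch_mask, focus

m_p, m_f = build_masks((20, 30), box=(5, 4, 25, 16), patch_hw=(4, 6))
```

Because the patch is centered on the box and smaller than it, every pixel of $M_{p}$ also lies inside $M_{f}$, which is what allows the difference $M_{f} - M_{P}$ to be used later as the non-overlapped region.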

II-D Patch Transformation Block

The positioning and perspective of a camera on an autonomous vehicle in relation to another vehicle or target object undergo continuous variation. The images captured and provided to the victim model are taken from various distances, angles, and lighting conditions. Therefore, any modification introduced by an attacker, such as an adversarial patch, must be resilient to these evolving circumstances. To simulate this variability, a range of physical transformations is applied, each representing different conditions that may occur. These transformations include introducing noise, applying random rotations, altering scales, and simulating variations in lighting. The patch transformer is utilized to implement these transformations effectively.

The transformations executed include:

  • Random Scaling: The patch’s dimensions are randomly adjusted to approximately match its real-world proportions within the scene.

  • Random Rotations: The patch $P$ is subjected to random rotations (up to $\pm 20^{\circ}$) centered around the bounding boxes $B_{i,k}^{U}$. This emulates uncertainties related to patch placement and sizing during printing.

  • Color Space Transformations: Pixel intensity values are manipulated through various color space transformations. These include introducing random noise (within a $\pm 0.1$ range), applying random contrast adjustments (within the range $[0.8, 1.2]$), and introducing random brightness adjustments (within a $\pm 0.1$ range).
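The color space transformations can be sketched with the ranges listed above. The function name `color_jitter`, the uniform sampling, the order in which the adjustments are applied, and the assumption that pixel values lie in $[0, 1]$ are illustrative choices; scaling and rotation are omitted to keep the example short.

```python
import numpy as np

def color_jitter(patch, rng):
    """Apply random contrast, brightness, and noise to a patch in [0, 1]."""
    contrast = rng.uniform(0.8, 1.2)                  # contrast factor
    brightness = rng.uniform(-0.1, 0.1)               # global brightness shift
    noise = rng.uniform(-0.1, 0.1, size=patch.shape)  # per-pixel noise
    jittered = patch * contrast + brightness + noise
    return np.clip(jittered, 0.0, 1.0)                # keep valid pixel values

rng = np.random.default_rng(0)
patch = rng.random((16, 16, 3))
jittered = color_jitter(patch, rng)
```

Sampling a fresh transformation at every training iteration exposes the patch to many appearance variations, which is the purpose of the patch transformer described in this section.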

II-E Penalized Depth Loss

The bounding boxes $B_{i,k}^{U}$ generated serve as the foundation for creating a focus mask $M_{f}$, which encompasses the specific region where we intend to modify the predicted depth. Our objective is to extend the region influenced by the patch, going beyond mere pixel overlap. To achieve this, we decompose the depth loss $L_{depth}$ into two distinct terms: $L_{d_{1}}$ and $L_{d_{2}}$. $L_{d_{1}}$ represents the loss incurred by pixels that are overlapped by the patch, while $L_{d_{2}}$ pertains to the loss stemming from pixels that don’t overlap.

To direct the optimization process towards prioritizing the reduction of the non-overlapped pixel loss $L_{d_{2}}$, we employ a squaring operation on the term denoting the disparity between the output depth and the target depth, denoted as $\left|d_{t} - d_{adv}\right| \odot M_{P}$. This utilization of quadratic functions is strategic, as these functions exhibit a slower rate of increase (slope or rate of change), consequently delaying the convergence of overlapped pixel loss in comparison to non-overlapped pixels. The losses are defined as the distance between the predicted depth and the target depth and calculated as follows:

$$L_{d_{1}} = \left(\left|d_{t} - d_{adv}\right| \odot M_{P}\right) \tag{7}$$
$$L_{d_{2}} = \left|d_{t} - d_{adv}\right| \odot (M_{f} - M_{P}) \tag{8}$$
$$L_{depth} = L_{d_{1}}^{2} + L_{d_{2}} \tag{9}$$
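A minimal NumPy sketch of Equations 7 to 9 follows. Two points are assumptions rather than statements from the text: the squaring is applied element-wise to the overlapped term, and the resulting maps are reduced to a scalar by summing and dividing by the number of focus-mask pixels. The target depth of zero in the example is likewise a hypothetical choice.

```python
import numpy as np

def penalized_depth_loss(d_adv, d_target, m_p, m_f):
    """Scalar version of L_depth = L_d1^2 + L_d2 (Equations 7-9)."""
    diff = np.abs(d_target - d_adv)
    l_d1 = diff * m_p              # Eq. 7: pixels covered by the patch
    l_d2 = diff * (m_f - m_p)      # Eq. 8: object pixels outside the patch
    n_pixels = max(m_f.sum(), 1.0) # assumed normalisation by mask area
    return (np.sum(l_d1 ** 2) + np.sum(l_d2)) / n_pixels

d_adv = np.full((5, 5), 0.5)       # current adversarial depth prediction
d_target = np.zeros((5, 5))        # hypothetical target depth
m_f = np.zeros((5, 5)); m_f[1:4, 1:4] = 1.0   # 9 object pixels
m_p = np.zeros((5, 5)); m_p[2, 2] = 1.0       # 1 patch pixel
loss = penalized_depth_loss(d_adv, d_target, m_p, m_f)
```

In this example the overlapped pixel contributes $0.5^{2} = 0.25$ while each non-overlapped pixel contributes $0.5$, so the non-overlapped region dominates the loss, which matches the stated intent of the penalized formulation.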

II-F Adversarial Patch Generation

We iteratively perform gradient updates on the adversarial patch $P$ in the pixel space in a way that optimizes our objective function defined as follows:

$$L_{total} = \alpha L_{depth} + \gamma L_{tv} \tag{10}$$

$L_{depth}$ is the adversarial depth loss. $L_{tv}$ is the total variation loss on the generated image to encourage smoothness [21]. It is defined as:

$$L_{tv} = \sum_{i,j} \sqrt{(P_{i+1,j} - P_{i,j})^{2} + (P_{i,j+1} - P_{i,j})^{2}} \tag{11}$$

where the sub-indices $i$ and $j$ refer to the pixel coordinates of the patch $P$. $\alpha$ and $\beta$ are hyper-parameters used to scale the loss terms. For our experiments, we set $\alpha = 1$ and $\beta = 2$. We minimize the objective function $L_{total}$ using the Adam [18] optimizer. We freeze all weights and biases in the depth estimator and only update the pixel values of the adversarial patch. The patch is randomly initialized.
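The total variation term of Equation 11 and the weighted sum of Equation 10 can be sketched as follows for a single-channel patch. The function names are hypothetical, and passing the value 2 as the weight of the total variation term is an assumption about how the quoted hyper-parameter values map onto Equation 10.

```python
import numpy as np

def total_variation(patch):
    """Equation 11 for a 2-D patch: sum of local gradient magnitudes."""
    dx = patch[1:, :-1] - patch[:-1, :-1]   # P[i+1, j] - P[i, j]
    dy = patch[:-1, 1:] - patch[:-1, :-1]   # P[i, j+1] - P[i, j]
    return np.sum(np.sqrt(dx ** 2 + dy ** 2))

def total_loss(l_depth, patch, depth_weight, tv_weight):
    """Equation 10: weighted sum of the depth loss and the smoothness term."""
    return depth_weight * l_depth + tv_weight * total_variation(patch)

flat = np.full((4, 4), 0.3)                  # perfectly smooth patch
checker = np.array([[0.0, 1.0], [1.0, 0.0]]) # maximally rough 2x2 patch
tv_flat = total_variation(flat)
tv_checker = total_variation(checker)
# Illustrative weights; the mapping of the value 2 to the TV term is assumed.
loss = total_loss(0.5, flat, depth_weight=1.0, tv_weight=2.0)
```

A smooth patch incurs no total variation penalty, while abrupt pixel-to-pixel changes are penalized, which encourages patches that survive printing and camera capture.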

III Experimental Results

In our experimental setup, we employ four MDE models: three CNN-based models, namely, monodepth2 [22], Depthhints [23], and Manydepth [24], along with the Transformer-based MIMDepth [17]. These models were chosen based on their practicality and the availability of open-source code. It’s worth noting that the first three models are the same ones featured in the work presented in [2].

For our evaluations, we use real-world driving scenes extracted from the KITTI 2015 dataset [25]. This dataset comprises synchronized stereo video recordings alongside LiDAR measurements, all captured from a moving vehicle navigating urban surroundings. The dataset encapsulates an extensive array of road types, including local and rural roads as well as highways. Within these scenes, a diverse range of objects is present, such as vehicles and pedestrians, allowing for a comprehensive evaluation of the attack performance. Additionally, we employ the CASIA datasets [26] to test person objects.

We train our patch on a Tesla V100 GPU. The patch is optimized for 500 epochs using the Adam optimizer with a learning rate of 0.01. The patch scale is set to 0.2 during this process.
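The optimization setup described above (random initialization, Adam with a learning rate of 0.01, 500 iterations, updates applied to the patch pixels only) can be sketched as follows. This is not the authors' training code: the loss is a hypothetical stand-in (`toy_loss_and_grad`), since the real objective requires a frozen depth estimator; each "epoch" is reduced to a single update step; and clipping pixels to $[0, 1]$ is an assumption.

```python
import numpy as np

def optimize_patch(loss_and_grad, patch_shape, steps=500, lr=0.01,
                   beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    """Adam updates applied to the patch pixels only.

    loss_and_grad: callable returning (loss, d loss / d patch). In the
    real attack this would wrap the frozen depth estimator.
    """
    rng = np.random.default_rng(seed)
    patch = rng.random(patch_shape)       # random initialization in [0, 1)
    m = np.zeros_like(patch)              # first-moment estimate
    v = np.zeros_like(patch)              # second-moment estimate
    history = []
    for t in range(1, steps + 1):
        loss, grad = loss_and_grad(patch)
        history.append(loss)
        m = beta1 * m + (1.0 - beta1) * grad
        v = beta2 * v + (1.0 - beta2) * grad ** 2
        m_hat = m / (1.0 - beta1 ** t)    # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        patch = patch - lr * m_hat / (np.sqrt(v_hat) + eps)
        patch = np.clip(patch, 0.0, 1.0)  # assumption: valid pixel range
    return patch, history

def toy_loss_and_grad(patch):
    """Hypothetical stand-in loss: mean squared distance to mid-grey."""
    diff = patch - 0.5
    return float(np.mean(diff ** 2)), 2.0 * diff / diff.size

patch, history = optimize_patch(toy_loss_and_grad, (8, 8, 3))
print(history[0], history[-1])
```

Only `patch` is ever modified inside the loop, which mirrors the statement that the model parameters stay frozen while the patch pixels are updated.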

III-A Evaluation Metrics

To assess the efficacy of our proposed attack, we use the same metrics as in [2]: the mean depth estimation error ($E_d$) of the target object and the ratio of the affected region ($R_a$). To compute these metrics, we take the depth prediction for the clean (unpatched) target object as the ground truth.

The mean depth estimation error quantifies the degree to which our proposed patch impacts the accuracy of depth estimation. A higher value of this metric indicates a more effective attack. Similarly, a higher ratio of the affected region indicates improved attack performance. In contrast to the approach presented in [2], where a patch is trained on a single object situated at a fixed distance from the camera while varying only the background, our approach trains the patch for diverse distances. This includes objects positioned at various locations relative to the camera, such as close, far, left, right, or center. The output disparity map is first normalized to the range $[0, 1]$, where 0 signifies the farthest point in the image and 1 the closest.
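The normalization step can be sketched as below. The text does not say which normalization is used, so per-image min-max scaling is an assumption, and `normalize_disparity` is a hypothetical helper name.

```python
import numpy as np

def normalize_disparity(disparity):
    """Rescale a disparity map to [0, 1] (0 = farthest, 1 = closest).

    Assumption: per-image min-max scaling.
    """
    disparity = np.asarray(disparity, dtype=float)
    lo, hi = disparity.min(), disparity.max()
    if hi == lo:                       # flat map: avoid division by zero
        return np.zeros_like(disparity)
    return (disparity - lo) / (hi - lo)

disp = np.array([[2.0, 4.0], [6.0, 10.0]])
print(normalize_disparity(disp))
```

Because disparity is larger for nearby points, the largest raw value maps to 1 (closest) and the smallest to 0 (farthest), matching the convention stated above.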

The mean depth estimation error is computed as:

$$E_{d}=\frac{\sum_{i,j}(|d-d_{adv}|\odot M_{f})}{\sum_{i,j}M_{f}} \tag{12}$$
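A minimal NumPy sketch of Eq. (12) follows. The function and argument names are hypothetical; `d_clean` and `d_adv` are assumed to be normalized depth (disparity) maps of identical shape, and `focus_mask` a binary mask $M_f$ covering the target object.

```python
import numpy as np

def mean_depth_error(d_clean, d_adv, focus_mask):
    """Mean absolute depth change inside the focus mask, Eq. (12)."""
    d_clean = np.asarray(d_clean, dtype=float)
    d_adv = np.asarray(d_adv, dtype=float)
    mask = np.asarray(focus_mask, dtype=float)
    mask_area = mask.sum()
    if mask_area == 0:
        raise ValueError("focus mask is empty")
    return float(np.sum(np.abs(d_clean - d_adv) * mask) / mask_area)

# Toy example: half of the masked pixels change by 0.5.
d_clean = np.zeros((4, 4))
d_adv = d_clean.copy()
d_adv[:2, :2] = 0.5
focus_mask = np.zeros((4, 4))
focus_mask[:2, :] = 1.0
print(mean_depth_error(d_clean, d_adv, focus_mask))
```

Dividing by the mask area rather than the image size keeps the metric comparable across objects of different sizes.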

The ratio of the affected region is the percentage of pixels whose depth values have been modified beyond a certain threshold, relative to the total number of pixels covered by the focus mask ($M_f$). A pixel is considered affected when its depth value changes by more than 0.1.

The ratio of the affected region is computed as:

$$R_{a}=\frac{\sum_{i,j}\mathbf{I}((|d-d_{adv}|\odot M_{f})>0.1)}{\sum_{i,j}M_{f}} \tag{13}$$
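Eq. (13) can be sketched in the same way. The names below are hypothetical; the inputs are assumed to be normalized depth maps and a binary focus mask $M_f$, with the 0.1 threshold taken from the text.

```python
import numpy as np

def affected_ratio(d_clean, d_adv, focus_mask, threshold=0.1):
    """Fraction of masked pixels whose depth changed by more than
    `threshold`, following Eq. (13)."""
    d_clean = np.asarray(d_clean, dtype=float)
    d_adv = np.asarray(d_adv, dtype=float)
    mask = np.asarray(focus_mask, dtype=float)
    mask_area = mask.sum()
    if mask_area == 0:
        raise ValueError("focus mask is empty")
    masked_error = np.abs(d_clean - d_adv) * mask   # zero outside the mask
    return float(np.sum(masked_error > threshold) / mask_area)

d_clean = np.zeros((2, 3))
d_adv = np.array([[0.5, 0.05, 0.9],
                  [0.2, 0.0, 0.9]])
focus_mask = np.array([[1, 1, 0],
                       [1, 1, 0]])
print(affected_ratio(d_clean, d_adv, focus_mask))
```

Large changes outside the mask (the 0.9 entries in the toy example) are zeroed by the mask and therefore never counted as affected.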

Additionally, we employ the Mean Square Error (MSE) to assess the performance of the model in relation to the predicted output depth map derived from an unperturbed input. The formulation of this metric is provided below.

$$MSE=\frac{1}{N}\sum_{i,j}(d_{adv_{i,j}}-d_{i,j})^{2} \tag{14}$$

where $N$ is the total number of pixels.
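For completeness, Eq. (14) in NumPy. This is a sketch with a hypothetical function name; unlike $E_d$ and $R_a$, the average here runs over all $N$ pixels of the depth map rather than over the focus mask.

```python
import numpy as np

def depth_mse(d_clean, d_adv):
    """Mean square error between adversarial and clean depth maps, Eq. (14)."""
    d_clean = np.asarray(d_clean, dtype=float)
    d_adv = np.asarray(d_adv, dtype=float)
    return float(np.mean((d_adv - d_clean) ** 2))

d_clean = np.zeros((2, 2))
d_adv = np.array([[0.0, 0.5],
                  [0.5, 0.0]])
print(depth_mse(d_clean, d_adv))
```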

III-B Evaluation Results

We extend our attack evaluation to include the four MDE models, targeting both classes of objects for each model. Initially, we generate adversarial patches specifically tailored for pedestrians and cyclists. In this phase, the patch scale is maintained at 0.2 for consistent testing across the different models.

Figure 3: Impact of SSAP on person/pedestrian class for Transformer-based MDE. The person seamlessly blends with the background.

As depicted in Figure 3, our patch exhibits remarkable effectiveness by completely concealing the person. Intuitively, smaller objects are relatively easier to hide or manipulate in terms of their depth. However, achieving this for smaller objects requires smaller patches, which consequently result in a relatively reduced impact.

In the following experiment, we create an adversarial patch targeting the "car" class. Using a patch scale of 0.2, we observe that nearly all objects carrying the SSAP are completely concealed. This outcome remains consistent regardless of the object's specific characteristics, including shape, size, and proximity to the camera. The success in concealing various objects corroborates the robustness of our proposed patch.

Figure 4: Impact of SSAP on car class for Transformer-based MDE.

We quantitatively assess the performance of our patch using the mean square error ($MSE$), the mean depth estimation error ($E_d$), and the ratio of the affected region ($R_a$) for the target object across 100 scenes extracted from the KITTI dataset. The reported results are the averages of these metrics over those scenes.

TABLE I: Attack performance in terms of mean depth estimation error ($E_d$) and ratio of affected region ($R_a$).

| Models | $MSE$ | $E_d$ | $R_a$ |
|---|---|---|---|
| Monodepth2 | 0.49 | 0.55 | 0.99 |
| Depthhints | 0.47 | 0.53 | 0.98 |
| Manydepth | 0.46 | 0.53 | 0.98 |
| MIMdepth | 0.53 | 0.59 | 0.99 |

As reported in Table I, when our patch targets a Transformer-based model (MIMdepth), it alters the estimated depth by approximately 59% on average. Consider a highway scenario where the speed limit is 160 km/h and the recommended safe distance is 96 meters. With our patch applied, a car positioned 90 meters away, i.e., inside the safe distance, is estimated to be approximately 143.1 meters away (90 m × 1.59), i.e., well outside it. This starkly demonstrates the consequences of our attack. Additionally, our patch affects over 99% of the target region, underscoring its extensive impact on depth perception.
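The highway example can be reproduced with a few lines of arithmetic. The sketch below assumes, as the figures in the text suggest, that the 59% error is read as a relative overestimate of the true metric distance; the function name is hypothetical.

```python
def perceived_distance(true_distance_m, relative_depth_error):
    """Distance reported by the attacked model, assuming the error acts
    as a relative overestimate of the true distance."""
    return true_distance_m * (1.0 + relative_depth_error)

SAFE_DISTANCE_M = 96.0   # recommended safe distance at 160 km/h (from the text)
TRUE_DISTANCE_M = 90.0   # actual distance of the car ahead
E_D_MIMDEPTH = 0.59      # mean depth estimation error for MIMdepth

estimated = perceived_distance(TRUE_DISTANCE_M, E_D_MIMDEPTH)
print(f"estimated distance: {estimated:.1f} m")
print("truly within safe distance:", TRUE_DISTANCE_M < SAFE_DISTANCE_M)
print("perceived as beyond safe distance:", estimated > SAFE_DISTANCE_M)
```

Under this reading, the car is actually closer than the recommended safe distance, yet the attacked model places it far beyond that distance.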

Figure 5: Impact of SSAP on car class for transformer-based MDE for different distances from the camera.

Moreover, even when targeting MIMdepth, a method touted for its enhanced robustness through the incorporation of a Transformer architecture and masked image modeling (MIM), our adversarial patch remains effective, causing a notable depth estimation error of 0.59. Figure 5 visually demonstrates the influence of our adversarial patch on the depth prediction of the MIMdepth model. For more qualitative results related to this model, please refer to the supplementary material.

III-C SSAP vs. Existing Attacks

We conducted a series of experiments to provide a quantitative comparison between our attack strategy and prior approaches [2] and [1]. To carry out this comparison, we employed the monodepth2 model to evaluate both the depth estimation error and the ratio of the affected region. The results presented in Table II demonstrate that our proposed attack consistently achieves the most substantial alteration in depth estimation and encompasses the largest affected region when compared to the referenced prior works.

TABLE II: SSAP performance vs. existing attacks.

| Attack | $E_d$ | $R_a$ | $MSE$ |
|---|---|---|---|
| SSAP | 0.55 | 0.99 | 0.49 |
| Cheng et al. [2] | 0.21 | 0.47 | 0.12 |
| Yamanaka et al. [1] | 0.13 | 0.26 | 0.05 |

IV Discussion

IV-A Ablation Study

To evaluate the effectiveness of the proposed penalized depth loss, we conducted an ablation study using the monodepth2 model as our target monocular depth estimation (MDE) model. We tested various combinations of loss terms and present the results in Table III. Our proposed loss yielded the highest depth estimation error, with a value of 0.55, compared to 0.24 for the conventional loss and 0.18 when using $L_{d_2}$ for the depth loss. Furthermore, our approach resulted in the highest ratio of affected regions, with a value of 0.99, compared to 0.53 and 0.36 for the other combinations.

As illustrated in Figure 6, the area affected by the generated patch when using only the $L_{d_1}$ depth loss is limited to the immediate vicinity of the patch itself. This outcome is a slight improvement over the patches featured in [1, 2]. We then evaluate the effect of our proposed depth loss outlined in Section II-E. With this penalized depth loss, the resulting patch effectively conceals the entire object, as confirmed by our experiments.

TABLE III: Attack performance in terms of mean depth estimation error ($E_d$) and ratio of affected region ($R_a$) for different loss combinations.

| $L_{d_1}$ | $L_{d_2}$ | $L_{tv}$ | $E_d$ | $R_a$ |
|---|---|---|---|---|
|  |  |  | 0.18 | 0.36 |
|  |  |  | 0.24 | 0.53 |
|  |  |  | 0.55 | 0.99 |

Figure 6: Depth prediction with and without the penalized loss function: (Top) the input images, (Middle) results without the penalized depth loss (i.e., $L_{d_1}$ and $L_{tv}$), (Bottom) results with our proposed depth loss (i.e., $L_{d_1}$, $L_{d_2}$, and $L_{tv}$).

IV-B The Influence of Patch Scale

We evaluate our attack by targeting the "car" object class using three distinct patch scales: 0.1, 0.2, and 0.3. We assess the mean depth error and the ratio of the affected region on the three CNN-based depth estimation models. Table IV presents the results for the mean depth estimation error. $E_d$ consistently increases with the patch scale across all target models. This outcome aligns with expectations, as larger patches exert a more pronounced influence on the resulting depth error. The same trend appears in the ratio of affected regions, reported in Table V, where larger patches correspond to larger affected regions.

TABLE IV: $E_d$ for different patch scales.

| Scale | Monodepth2 | Depthhints | Manydepth |
|---|---|---|---|
| 0.1 | 0.46 | 0.37 | 0.3 |
| 0.2 | 0.55 | 0.53 | 0.53 |
| 0.3 | 0.66 | 0.64 | 0.63 |

TABLE V: $R_a$ for different patch scales.

| Scale | Monodepth2 | Depthhints | Manydepth |
|---|---|---|---|
| 0.1 | 0.97 | 0.95 | 0.97 |
| 0.2 | 0.99 | 0.98 | 0.98 |
| 0.3 | 0.99 | 0.99 | 0.99 |

V Related Work

Physical attacks have targeted tasks such as object detection [27, 28], image classification [29, 30], and face recognition [31, 32], whereas attacks on depth estimation have received relatively limited attention. Zhang et al. [33] propose an attack designed to enhance performance in a universal attack scenario. Wong et al. [34] introduce a strategy for generating targeted adversarial perturbations, which then randomly alter the associated depth map. These two attacks center on digital perturbations, rendering them less suitable for real-world applications.

Yamanaka et al. [1] devise a method for generating printable adversarial patches, but the generated patch is trained to be applicable to random locations within the scene. Cheng et al. [2], on the other hand, concentrate on the inconspicuousness of the generated patch, ensuring that it remains unobtrusive and avoids drawing attention. However, this patch is object-specific and must be retrained separately for each target object. Furthermore, its effectiveness is constrained by its limited affected region and by its training for a specific context (a fixed distance between the object and the camera), rendering it ineffective at varying distances.

Different from prior efforts, our emphasis lies in evaluating the comprehensive influence of the generated patch. We prioritize ensuring that the patch affects the entirety of the target object, thus guaranteeing a thorough deception of the DNN-based vision system. Our framework introduces a methodology that ensures the efficacy of the patch across various objects within the same class, accommodating distinct shapes and sizes. Importantly, our patch is designed to function across varying distances between the object’s placement and the camera of the vision system.

VI Conclusion

In this paper, we introduce a novel physical adversarial patch named SSAP, crafted with the explicit purpose of undermining MDE-based vision systems. SSAP distinguishes itself as an adaptive adversarial patch, demonstrating the capacity to fully hide objects or manipulate their perceived depth within a given scene, irrespective of their inherent size, shape, or placement. Our empirical investigations demonstrate the effectiveness and resilience of our patch across diverse target objects: the achieved mean depth estimation error exceeds 50%, with over 99% of the target region undergoing alteration. Furthermore, our patch exhibits durability against defense techniques grounded in input transformations. The consequences of the proposed attack could include significant harm, such as loss, destruction, and endangerment to both life and property. Our findings serve as a clarion call to the research community, prompting the exploration of more robust and adaptive defense mechanisms.

References

  • [1] K. Yamanaka, R. Matsumoto, K. Takahashi, and T. Fujii, “Adversarial patch attacks on monocular depth estimation networks,” IEEE Access, vol. 8, pp. 179 094–179 104, 2020.
  • [2] Z. Cheng, J. Liang, H. Choi, G. Tao, Z. Cao, D. Liu, and X. Zhang, “Physical attack on monocular depth estimation with optimal adversarial patches,” 2022. [Online]. Available: https://arxiv.longhoe.net/abs/2207.04718
  • [3] X. Yang, J. Chen, Y. Dang, H. Luo, Y. Tang, C. Liao, P. Chen, and K.-T. Cheng, “Fast depth prediction and obstacle avoidance on a monocular drone using probabilistic convolutional neural network,” IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 1, pp. 156–167, 2021.
  • [4] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, “Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 8437–8445.
  • [5] K. Tateno, F. Tombari, I. Laina, and N. Navab, “Cnn-slam: Real-time dense monocular slam with learned depth prediction,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6565–6574, 2017.
  • [6] F. Wimbauer, N. Yang, L. von Stumberg, N. Zeller, and D. Cremers, “Monorec: Semi-supervised dense reconstruction in dynamic environments from a single moving camera,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 6108–6118.
  • [7] L. von Stumberg, P. Wenzel, N. Yang, and D. Cremers, “Lm-reloc: Levenberg-marquardt based direct visual relocalization,” CoRR, vol. abs/2010.06323, 2020. [Online]. Available: https://arxiv.longhoe.net/abs/2010.06323
  • [8] “Andrej karpathy - ai for full-self driving at tesla,” https://youtu.be/hx7BXih7zx8, accessed: March 1, 2023.
  • [9] “Tesla ai day 2021,” https://www.youtube.com/live/j0z4FweCy4M?feature=share, accessed: March 1, 2023.
  • [10] V. Guizilini, R. Ambrus, S. Pillai, and A. Gaidon, “Packnet-sfm: 3d packing for self-supervised monocular depth estimation,” CoRR, vol. abs/1905.02693, 2019. [Online]. Available: http://arxiv.longhoe.net/abs/1905.02693
  • [11] S. Aich, J. M. U. Vianney, M. A. Islam, M. Kaur, and B. Liu, “Bidirectional attention network for monocular depth estimation,” CoRR, vol. abs/2009.00743, 2020. [Online]. Available: https://arxiv.longhoe.net/abs/2009.00743
  • [12] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep ordinal regression network for monocular depth estimation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2002–2011.
  • [13] M. Song, S. Lim, and W. Kim, “Monocular depth estimation using laplacian pyramid-based depth residuals,” IEEE transactions on circuits and systems for video technology, vol. 31, no. 11, pp. 4381–4393, 2021.
  • [14] L. Huynh, P. Nguyen-Ha, J. Matas, E. Rahtu, and J. Heikkilä, “Guiding monocular depth estimation using depth-attention volume,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVI 16.  Springer, 2020, pp. 581–597.
  • [15] W. Chang, Y. Zhang, and Z. Xiong, “Transformer-based monocular depth estimation with attention supervision,” in 32nd British Machine Vision Conference (BMVC 2021), 2021.
  • [16] A. Varma, H. Chawla, B. Zonooz, and E. Arani, “Transformers in self-supervised monocular depth estimation with unknown camera intrinsics,” arXiv preprint arXiv:2202.03131, 2022.
  • [17] H. Chawla, K. Jeeveswaran, E. Arani, and B. Zonooz, “Image masking for robust self-supervised monocular depth estimation,” 2023.
  • [18] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014. [Online]. Available: https://arxiv.longhoe.net/abs/1412.6980
  • [19] T.-Y. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, 2014.
  • [20] Y.-C.-T. Hu, J.-C. Chen, B.-H. Kung, K.-L. Hua, and D. S. Tan, “Naturalistic physical adversarial patch for object detectors,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 7828–7837.
  • [21] A. Mahendran and A. Vedaldi, “Understanding deep image representations by inverting them,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 5188–5196.
  • [22] C. Godard, O. M. Aodha, and G. J. Brostow, “Digging into self-supervised monocular depth estimation,” CoRR, vol. abs/1806.01260, 2018. [Online]. Available: http://arxiv.longhoe.net/abs/1806.01260
  • [23] J. Watson, M. Firman, G. Brostow, and D. Turmukhambetov, “Self-supervised monocular depth hints,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 2162–2171.
  • [24] J. Watson, O. M. Aodha, V. Prisacariu, G. Brostow, and M. Firman, “The temporal opportunist: Self-supervised multi-frame monocular depth,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).  Los Alamitos, CA, USA: IEEE Computer Society, jun 2021, pp. 1164–1174. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/CVPR46437.2021.00122
  • [25] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3354–3361.
  • [26] S. Yu, D. Tan, and T. Tan, “A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition,” in 18th international conference on pattern recognition (ICPR’06), vol. 4.  IEEE, 2006, pp. 441–444.
  • [27] Y.-C.-T. Hu, J.-C. Chen, B.-H. Kung, K.-L. Hua, and D. S. Tan, “Naturalistic physical adversarial patch for object detectors,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 7828–7837.
  • [28] A. Guesmi, R. Ding, M. A. Hanif, I. Alouani, and M. Shafique, “Dap: A dynamic adversarial patch for evading person detectors,” 2023.
  • [29] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok, “Synthesizing robust adversarial examples,” in ICML, 2018.
  • [30] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, F. Tramèr, A. Prakash, T. Kohno, and D. Song, “Physical adversarial examples for object detectors,” in Proceedings of the 12th USENIX Conference on Offensive Technologies, ser. WOOT’18.  USA: USENIX Association, 2018, p. 1.
  • [31] M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter, “Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’16.  New York, NY, USA: Association for Computing Machinery, 2016, p. 1528–1540. [Online]. Available: https://doi.org/10.1145/2976749.2978392
  • [32] S. Komkov and A. Petiushko, “Advhat: Real-world adversarial attack on arcface face id system,” in 2020 25th International Conference on Pattern Recognition (ICPR).  Los Alamitos, CA, USA: IEEE Computer Society, jan 2021, pp. 819–826. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/ICPR48806.2021.9412236
  • [33] Z. Zhang, X. Zhu, Y. Li, X. Chen, and Y. Guo, “Adversarial attacks on monocular depth estimation,” CoRR, vol. abs/2003.10315, 2020. [Online]. Available: https://arxiv.longhoe.net/abs/2003.10315
  • [34] A. Wong, S. Cicek, and S. Soatto, “Targeted adversarial perturbations for monocular depth prediction,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS’20.  Red Hook, NY, USA: Curran Associates Inc., 2020.