License: arXiv.org perpetual non-exclusive license
arXiv:2402.15272v1 [cs.CV] 23 Feb 2024

EMIFF: Enhanced Multi-scale Image Feature Fusion for Vehicle-Infrastructure Cooperative 3D Object Detection

Zhe Wang1, Siqi Fan1, Xiaoliang Huo1,2, Tongda Xu1, Yan Wang1*, Jingjing Liu1, Yilun Chen1, Ya-Qin Zhang1*
1 Zhe Wang, Siqi Fan, Tongda Xu, Yan Wang*, Jingjing Liu, Yilun Chen, and Ya-Qin Zhang* are with the Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China. {wangzhe, fansiqi, xutongda, wangyan}@air.tsinghua.edu.cn
2 Xiaoliang Huo is with the School of Software, Beihang University, Beijing, China. [email protected]
Abstract

In autonomous driving, cooperative perception makes use of multi-view cameras from both vehicles and infrastructure, providing a global vantage point with rich semantic context of road conditions beyond a single vehicle viewpoint. Currently, two major challenges persist in vehicle-infrastructure cooperative 3D (VIC3D) object detection: 1) inherent pose errors when fusing multi-view images, caused by time asynchrony across cameras; 2) information loss during transmission, resulting from limited communication bandwidth. To address these issues, we propose a novel camera-based 3D detection framework for the VIC3D task, Enhanced Multi-scale Image Feature Fusion (EMIFF). To fully exploit holistic perspectives from both vehicles and infrastructure, we propose Multi-scale Cross Attention (MCA) and Camera-aware Channel Masking (CCM) modules to enhance infrastructure and vehicle features at scale, spatial, and channel levels to correct the pose error introduced by camera asynchrony. We also introduce a Feature Compression (FC) module with channel and spatial compression blocks for transmission efficiency. Experiments show that EMIFF achieves SOTA on the DAIR-V2X-C dataset, significantly outperforming previous early-fusion and late-fusion methods with comparable transmission costs.

Correspondence to Yan Wang, Ya-Qin Zhang. Code will be released at https://github.com/Bosszhe/EMIFF

I INTRODUCTION

Subject to sensor limitations, autonomous vehicles lack a global perception capability for monitoring holistic road conditions and accurately detecting surrounding objects, which poses great safety risks [1, 2]. Vehicle-to-everything (V2X) [3, 4] aims to build a communication system between vehicles and other devices in a complex traffic environment. Vehicle and infrastructure cooperation can significantly expand the perception range and improve perception capability [5]. Cameras on both sides provide a global vantage point with a rich semantic context of road conditions beyond a single-vehicle viewpoint [6]. Vehicle-infrastructure cooperative 3D object detection (VIC3D) from cameras is therefore a significant task for autonomous driving.

Compared with vanilla single-vehicle 3D object detection, VIC3D tasks face additional challenges. One challenge is the inherent pose error when fusing multi-view images from vehicles and infrastructure [7], caused by time asynchrony across agents [8]. As shown in Figure 1, these pose errors can result in inaccurate relative positions between objects and annotations. Another challenge is the limited communication bandwidth between agents, which leads to information loss during transmission [3]. Raw sensor data contains ample information for fusion but requires more bandwidth, so fusion methods must balance performance against transmission cost. Therefore, fusion methods that tackle these cross-agent perception challenges are the key to VIC3D.

Many fusion works on V2X are based on simulated datasets, such as OPV2V [3], V2X-Sim [9] and V2XSet [10], which neglect the above challenges and have a sim-to-real gap. Most existing research focuses only on LiDAR-based methods because of their fusion convenience and performance advantage, such as early fusion (EF) of raw signals [5, 11, 12], intermediate fusion (IF) of features [4, 3, 13, 14], and late fusion (LF) of prediction outputs [5, 15]. However, due to the projection gap between the 2D image plane and 3D space, image fusion cannot be as direct as point-cloud fusion. In real scenarios, DAIR-V2X [5] adopts an LF method that combines prediction outputs from each camera. This is sensitive to calibration, so that even when the prediction from the infrastructure side is perfect, the vehicle will receive a biased 3D detection.

Figure 1: Labels (3D bounding boxes) projected from 3D space to the vehicle (a) and infrastructure (b) image planes using calibration parameters $P_{inf/veh}$ often suffer from misalignment between the ground truth and the projected position in the 2D images (as illustrated by the misaligned green bounding boxes). The misalignment arises because the cameras' capture times $T_{inf/veh}$ differ, so a moving object captured by the vehicle camera (in green) and the infrastructure camera (in red) appears at different locations.

In this paper, we propose a novel framework for the VIC3D task, Enhanced Multi-scale Image Feature Fusion (EMIFF). We choose intermediate fusion since it does not rely heavily on accurate calibration parameters. For feature-level fusion, high-dimensional features extracted from raw data can be compressed, transmitted, and dynamically enhanced [8], which can be used to alleviate the negative effect of pose errors. We design modules that compress transmitted features to reduce transmission cost and that enhance features at the scale, spatial, and channel levels.

Specifically, the Feature Compression (FC) module compresses 2D features transmitted from the infrastructure to the vehicle. Since the receptive field is larger in smaller-scale features, which theoretically gives them a higher tolerance to slight location errors, the Multi-scale Cross Attention (MCA) module performs attentive scale-wise feature selection across features. MCA also corrects features at the spatial level with attentive offsets to overcome the pixel-wise shift caused by pose errors. To correct location errors arising from multiple cameras, features are further enhanced by a Camera-aware Channel Masking (CCM) module via a learned channel-wise mask guided by camera parameters. Then, the enhanced features are transformed into voxel features using calibration parameters. Finally, the fused feature is splatted into BEV space and fed into detection heads for object detection. Experiments demonstrate the effectiveness of each EMIFF module in reducing pose errors and achieving better prediction accuracy than existing EF and LF methods. Our contributions can be summarized as follows:

  • We propose EMIFF, a novel framework for camera-based VIC3D object detection, using an intermediate fusion method to tackle cross-agent perception challenges.

  • We design MCA and CCM modules to dynamically enhance image features for better detection performance, with an additional FC module to reduce transmission costs in VIC3D system.

  • We achieve state-of-the-art results on DAIR-V2X-C dataset, the latest VIC3D benchmark with real data, where EMIFF outperforms existing LF and EF methods with comparable transmission costs.

II Related Work

II-A V2X Cooperative Perception

Current research on V2X cooperative perception mainly focuses on simulated datasets, such as OPV2V [3], V2X-Sim [9] and V2XSet [10]. Existing intermediate-fusion methods focused on simulated point clouds, such as V2VNet [13] which transmitted compressed features to nearby vehicles and generated joint perception/prediction. DiscoNet [4] introduced graphs into feature fusion and proposed edge weights to highlight different informative regions during feature propagation. Recent Where2comm [11] considered the spatial confidence of features and selected features with high confidence and complementary to others, which effectively saves transmission costs. FFNET [8] introduces the concept of feature flow to tackle the issue of temporal asynchrony in real-world scenarios, albeit its applicability is limited to point cloud data. Different from point clouds, images from vehicle and infrastructure have a huge view gap, thus features need to be transformed into unified space for fusion. One direct way for fusing multi-view images is late fusion, such as DAIR-V2X [5], which proposed a result-level fusion model for cameras with separate detectors [16]. Few approaches have focused on IF methods for cameras, especially in real scenarios.

Figure 2: The general framework of EMIFF. Separate image backbones and necks extract multi-scale image features from vehicle and infrastructure images. The FC module compresses the source infrastructure feature $f^{S}_{inf}$ and decompresses it into multi-scale features $f^{M}_{inf}$. The MCA module, consisting of MFC and MFS blocks, enhances multi-scale features $f^{M}_{veh/inf}$ by seeking the correlation between the two sides, and CCM takes camera parameters $(R, t, K)$ as input to reweight features $f_{veh/inf}$ with channel relationship. Finally, Point-Sampling Voxel Fusion projects image features $f'_{veh/inf}$ into 3D space to generate a unified voxel feature $V_{vic}$, which is passed through the 3D neck and head in turn for detection.

II-B Camera-based Feature Fusion

Direct Prediction methods extract image features with object query [17, 18, 19, 20] or directly on front-view image [21]. DETR3D [17] used a sparse set of 3D object queries to sample 2D multi-view image features and predicted 3D bounding boxes with set-to-set loss. PETR [19, 20] transformed image features into 3D position-aware representation by encoding 3D coordinates into position embedding. FCOS3D [21] transformed 3D labels to front-view images and directly predicted 3D information by extending FCOS [22] to 3D detection.

Lift-based methods project features from image plane to BEV (bird’s eye view) plane through depth estimation. Most methods [23, 24, 25, 26, 27] applied 2D-to-3D transformation following LSS [28], which predicted a depth distribution for each pixel and lifted image features into frustum features with camera parameters, then splatted all frustums into a rasterized BEV feature. BEVDepth [29] claimed the quality of intermediate depth estimation is the key to improving multi-view 3D object detection and added explicit depth supervision with groundtruth depth generated from point clouds. PON [30] learned the transformation leveraging geometry relationship between image locations and BEV locations in the horizontal direction.

Projection-based methods generate dense voxel or BEV representation from image features through 3D-to-2D projection [1]. ImVoxelNet [16] aggregated the projected features from several images via a simple element-wise averaging, where spatial information might not be exploited sufficiently. Transformer-based methods [31, 32] mapped perspective view to BEV with designed BEV queries and leveraged cross- and self-attention to aggregate spatial and temporal information into BEV queries. Since global attention needs huge memory with high time cost, deformable attention was adopted in BEVFormer [31].

III Method

EMIFF aims to fuse vehicle and infrastructure features by utilizing V2X communication. It includes four main modules: Feature Compression (FC), Multi-scale Cross Attention (MCA), Camera-aware Channel Masking (CCM), and Point-Sampling Voxel Fusion, as illustrated in Figure 2.

Figure 3: Illustration of the FC module. Feature $f_{inf}^{S}$ is compressed into $f_{inf}^{T}$ through the channel and spatial compressors, which is transmitted to the vehicle and decoded into $f_{inf}^{S\prime}$ through the channel and spatial decompressors. Finally, multi-scale infrastructure features $f_{inf}^{M}$ can be recovered from $f_{inf}^{S\prime}$ with several Conv Blocks with stride 2.

III-A Feature Compression

The images from the vehicle and the infrastructure are denoted as $I_{veh}$ and $I_{inf}$, respectively, and the shape of both images is $[H \times W \times 3]$. Since infrastructure cameras are typically installed at a higher elevation than vehicle cameras, there is a huge view gap between images captured from vehicles and from infrastructure. We therefore use separate pre-trained backbones and necks on the vehicle and infrastructure sides to extract multi-scale image features. The output multi-scale features are denoted as $f^{M}_{s}$, $s = veh/inf$.

EMIFF transmits image features and camera parameters instead of the voxel feature after projection, because the voxel feature is too large to be transmitted efficiently. The Feature Compression (FC) module (shown in Figure 3) compresses the largest infrastructure feature $f^{0}_{inf}$ (denoted as $f^{S}_{inf}$) to $f_{inf}^{T}$, transmits $f_{inf}^{T}$ to the vehicle, and regenerates multi-scale features $f^{M}_{inf}$ through decompression.
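The compress, transmit, decompress flow can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the feature shape, the channel ratio `ccr`, the per-axis stride `s`, the number of recovered scales, and the random weights are all hypothetical, and average pooling and nearest-neighbour upsampling stand in for the learned convolutional compressors and decompressors of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """1x1 convolution: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def avg_pool(x, r):
    """Non-overlapping r x r average pooling (stand-in for a strided conv)."""
    c, h, w = x.shape
    return x.reshape(c, h // r, r, w // r, r).mean(axis=(2, 4))

def upsample(x, r):
    """Nearest-neighbour upsampling by r (stand-in for a learned decompressor)."""
    return x.repeat(r, axis=1).repeat(r, axis=2)

# Hypothetical sizes, not taken from the paper.
C, H, W = 64, 32, 48   # source infrastructure feature f_inf^S
ccr, s = 8, 2          # channel compression ratio, per-axis spatial stride

w_enc = rng.standard_normal((C // ccr, C)) * 0.1   # untrained channel compressor
w_dec = rng.standard_normal((C, C // ccr)) * 0.1   # untrained channel decompressor

f_src = rng.standard_normal((C, H, W))
f_tx = avg_pool(conv1x1(f_src, w_enc), s)    # f_inf^T, the tensor that is transmitted
f_rec = conv1x1(upsample(f_tx, s), w_dec)    # f_inf^S' decoded on the vehicle side

# Recover a multi-scale pyramid by repeated stride-2 downsampling.
f_multi = [f_rec]
for _ in range(3):
    f_multi.append(avg_pool(f_multi[-1], 2))

ratio = f_src.size / f_tx.size               # element-count reduction
print(f_tx.shape, ratio, [f.shape for f in f_multi])
```

In this toy setting the number of transmitted elements shrinks by `ccr * s * s`, which is the quantity that the transmission-cost metric would track.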

III-B Multi-scale Cross Attention

The MCA module contains Multi-scale Feature Correction (MFC) and Multi-scale Feature Selection (MFS) blocks, as shown in Figure 4 and Figure 5. The MFC block selects and integrates spatial-wise features with attentive offsets, so that the pixel-wise shift caused by pose errors can be overcome to some extent. The MFS block applies cross-attention between vehicle and infrastructure features to achieve attentive scale-wise feature selection.

The MFC module is first applied to multi-scale features. Since pose errors can cause a displacement between the projected and ground-truth positions on 2D plane, we apply DCN (deformable convolutional networks) [33] for each scale feature to allow every pixel to get spatial information surrounding it. Then, features at different scales are upsampled to the same size through UpConv blocks.

MFS applies a MeanPooling operation to obtain the representation of each scale of infrastructure features, while vehicle features at different scales are first fused by a mean operation and then refined by MeanPooling. To find the correlation between vehicle features and infrastructure features at different scales, cross attention is applied with the infrastructure representations as Key and the vehicle representation as Query, which generates attention weights $\omega^{m}_{inf}$ for each scale $m$. We calculate the inner product between the features $\hat{f}^{M}_{inf}$ and the weights $\omega^{m}_{inf}$. The final outputs of MCA are the augmented infrastructure image feature $f_{inf}$ and vehicle image feature $f_{veh}$.
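The scale-wise selection described above can be illustrated with a short NumPy sketch. It assumes that the multi-scale infrastructure features have already been resized to a common resolution (as the MFC block does), and it uses a plain scaled dot product followed by a softmax over scales; the scaling factor, the softmax, and the absence of learned query/key projections are simplifying assumptions rather than details confirmed by the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def scale_selection(f_inf_scales, f_veh):
    """Weight M same-sized infrastructure features by their affinity to the vehicle feature.

    f_inf_scales: (M, C, H, W) infrastructure features, one per scale.
    f_veh:        (C, H, W) vehicle feature (already averaged over scales).
    Returns the per-scale weights (M,) and the weighted sum (C, H, W).
    """
    keys = f_inf_scales.mean(axis=(2, 3))            # MeanPooling per scale -> (M, C)
    query = f_veh.mean(axis=(1, 2))                  # MeanPooling -> (C,)
    scores = keys @ query / np.sqrt(query.size)      # one score per scale
    weights = softmax(scores)                        # attention weights over scales
    fused = np.tensordot(weights, f_inf_scales, axes=1)  # sum_m w_m * f_inf^m
    return weights, fused

rng = np.random.default_rng(0)
M, C, H, W = 4, 8, 6, 10                             # hypothetical sizes
f_inf_scales = rng.standard_normal((M, C, H, W))
f_veh = rng.standard_normal((C, H, W))
weights, f_inf = scale_selection(f_inf_scales, f_veh)
print(weights, f_inf.shape)
```

Because the weights sum to one, the output stays on the same numeric scale as the inputs, and scales whose pooled descriptor agrees more with the vehicle descriptor contribute more.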

Figure 4: Details of MFC. Every pixel-wise feature is integrated with the spatial information of surrounding pixels via DCN, and multi-scale features are scaled to the same size through UpConv blocks.
Figure 5: Schema of the MCA module. In the lower branch, the vehicle feature $f_{veh}$ is generated from $f^{M}_{veh}$ through the MFC block and Mean. In the upper branch, $f^{M}_{inf}$ is refined into keys through the MFC block and MeanPooling, and queries are generated from $f_{veh}$ through MeanPooling. The output weights $\omega_{inf}^{m}$ of cross-attention are applied to $\hat{f}^{M}_{inf}$ with an inner product to form the infrastructure feature $f_{inf}$.

III-C Camera-aware Channel Masking

Objects closer to the camera are easier to detect, and it is common for the same object to be close to an infrastructure camera but far away from a vehicle. Since different channels represent object information at different distances, which is strongly correlated with camera parameters, it is intuitive to take camera parameters as priors to augment image features.

Inspired by the decoupled nature of SENet [34] and LSS [28], CCM learns a channel-wise mask to weigh the importance of the channels. First, the camera intrinsics and extrinsics are flattened into one dimension and concatenated. Then, they are scaled up to the feature's dimension $C$ using an MLP to generate a channel mask $M_{veh/inf}$. Finally, $M_{veh/inf}$ is used to re-weight the image features $f_{veh/inf}$ channel-wise and obtain the results $f'_{veh/inf}$. The overall CCM module can be written as:

$$
\begin{aligned}
f'_{s} &= M_{s} \odot f_{s}, \quad s = veh, inf \\
m_{s} &= \text{MLP}\left(\xi\left(R_{s}\right) \oplus \xi\left(t_{s}\right) \oplus \xi\left(K_{s}\right)\right)
\end{aligned}
\tag{1}
$$

$\xi$ denotes the flatten operation and $\oplus$ means concatenation. The input of the MLP is the combination of the camera rotation matrix $R_{s} \in \mathbb{R}^{3 \times 3}$, translation $t_{s}$, and camera intrinsics $K_{s}$. $M_{s}$ is obtained from the MLP's output $m_{s}$ through several Fully Connected (FC) layers and activation (ReLU, Sigmoid) layers.
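Equation (1) can be sketched in NumPy as below. The sketch assumes a 3-vector translation and a 3×3 intrinsic matrix (so the flattened input has 9 + 3 + 9 = 21 entries), uses a single linear layer as the MLP, and follows it with one ReLU, one fully connected layer, and a Sigmoid. The layer count, the weight initialisation, and the example camera values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_mask(R, t, K, params):
    """Map camera parameters to a per-channel mask M_s with values in [0, 1]."""
    W1, b1, W2, b2 = params
    x = np.concatenate([R.ravel(), t.ravel(), K.ravel()])  # xi(R) + xi(t) + xi(K), 21 values
    m = W1 @ x + b1                     # m_s: camera parameters lifted to C dimensions
    h = np.maximum(m, 0.0)              # ReLU
    return sigmoid(W2 @ h + b2)         # M_s after FC + Sigmoid

C, H, W = 64, 4, 5                       # hypothetical feature shape
R = np.eye(3)                            # example rotation
t = np.array([0.0, 0.0, 1.5])            # example translation
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])          # example intrinsics

params = (rng.standard_normal((C, 21)) * 0.01, np.zeros(C),
          rng.standard_normal((C, C)) * 0.01, np.zeros(C))

f = rng.standard_normal((C, H, W))       # image feature f_s
M_s = channel_mask(R, t, K, params)
f_prime = M_s[:, None, None] * f         # f'_s = M_s (channel-wise product) f_s
print(M_s.shape, f_prime.shape)
```

The mask depends only on the camera parameters, so for a fixed camera it could be computed once and reused across frames; whether the original implementation does so is not stated in the text.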

Figure 6: The schema of CCM module.

III-D Point-Sampling Voxel Fusion

The augmented vehicle feature $f'_{veh}$ and infrastructure feature $f'_{inf}$ are projected into 3D space for fusion, generating two voxel features denoted as $V_{veh}$ and $V_{inf}$, respectively. The projection follows ImVoxelNet [16].

We obtain the final voxel feature $V_{vic} \in \mathbb{R}^{N_{x} \times N_{y} \times N_{z} \times C_{1}}$ by averaging the sampled voxel features $V_{veh}$ and $V_{inf}$. Then, the same 3D neck as in [16], which is composed of 3D CNN and downsampling layers, transforms the voxel feature $V_{vic}$ into the BEV feature $B_{vic} \in \mathbb{R}^{N_{x} \times N_{y} \times C_{2}}$. The BEV feature can be used as input to common 2D detection heads to predict 3D detection results. The loss of the detection heads is similar to SECOND [35] and consists of a smooth L1 loss for bounding boxes $L_{\text{bbox}}$, a focal loss for classification $L_{\text{cls}}$, and a cross-entropy loss for direction $L_{\text{dir}}$.
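A minimal NumPy sketch of point-sampling voxel fusion is given below. It assumes a pinhole camera model, extrinsics $(R, t)$ that map the shared 3D frame into each camera frame, nearest-pixel sampling, and zero features for voxels that fall outside an image. The grid, the cameras, and the function name `sample_voxels` are hypothetical, and the details of ImVoxelNet [16] may differ.

```python
import numpy as np

def sample_voxels(feat, K, R, t, centers):
    """Project voxel centres into an image and pick the nearest pixel feature.

    feat: (C, H, W) image feature; centers: (N, 3) points in the shared 3D frame.
    Returns (N, C); voxels behind the camera or outside the image stay zero.
    """
    C, H, W = feat.shape
    cam = centers @ R.T + t                       # shared frame -> camera frame
    z = cam[:, 2]
    safe_z = np.where(z > 1e-6, z, 1.0)           # avoid dividing by non-positive depth
    pix = cam @ K.T                               # homogeneous pixel coordinates
    u = np.round(pix[:, 0] / safe_z).astype(int)
    v = np.round(pix[:, 1] / safe_z).astype(int)
    valid = (z > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out = np.zeros((centers.shape[0], C), dtype=feat.dtype)
    out[valid] = feat[:, v[valid], u[valid]].T
    return out

rng = np.random.default_rng(0)
C, H, W = 4, 8, 10                                # hypothetical feature shape
K = np.array([[10.0, 0.0, 5.0], [0.0, 10.0, 4.0], [0.0, 0.0, 1.0]])
feat_veh = rng.standard_normal((C, H, W))
feat_inf = rng.standard_normal((C, H, W))

# Small hypothetical voxel grid of size Nx x Ny x Nz = 4 x 3 x 2.
xs, ys, zs = np.linspace(-1, 1, 4), np.linspace(-1, 1, 3), np.linspace(2, 6, 2)
centers = np.stack(np.meshgrid(xs, ys, zs, indexing="ij"), axis=-1).reshape(-1, 3)

V_veh = sample_voxels(feat_veh, K, np.eye(3), np.zeros(3), centers).reshape(4, 3, 2, C)
V_inf = sample_voxels(feat_inf, K, np.eye(3), np.array([0.5, 0.0, 1.0]), centers).reshape(4, 3, 2, C)
V_vic = 0.5 * (V_veh + V_inf)                     # average the two voxel volumes
print(V_vic.shape)
```

Because each voxel reads its feature through the calibration matrices, any pose error shifts the sampled pixel, which is the effect the upstream MCA and CCM modules are meant to mitigate.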

IV Experiments

Figure 7: Visualization results of ImVoxelNet (LF) (left column), ImVoxelNet_M (EF) (middle column), and EMIFF (IF) (right column). Bounding boxes in BEV (bottom row) are projected to the vehicle and infrastructure image planes (top two rows). Ground truth is in green and predictions are in red. In BEV, the red and green bounding boxes from EMIFF are clearly better aligned than those of the LF and EF methods. This shows that ImVoxelNet (LF) and ImVoxelNet_M (EF) detect more false positives and fewer true positives than EMIFF (IF).

IV-A Implementation Details

Datasets. We conduct our experiments on the vehicle-infrastructure cooperation dataset DAIR-V2X [5], in which all frames are captured from real scenarios. We use the VIC-Sync portion of the DAIR-V2X-C dataset for training and evaluation, which is composed of 9,311 pairs of infrastructure and vehicle frames captured at the same time. Annotations of each frame pair are given in world coordinates and need to be transformed into the vehicle coordinate system for training and evaluation.

Evaluation Metrics. The evaluation metrics are Average Precision (AP) [36] and Average Byte (AB), which measure detection performance and transmission cost respectively, the same as in [5]. The AP metrics are based on the detection range surrounding the vehicle, including Overall (0-100m), 0-30m, 30-50m, and 50-100m. All metrics are calculated with $\text{IoU} = 0.5$ and can be divided into two parts: $AP_{\text{3D}}$ and $AP_{\text{BEV}}$. AB is the average size of the transmitted data, which in our method is the feature map $f_{inf}^{T}$.

Training. We use ResNet-50 [37] as the backbone and FPN [38] as the 2D neck to extract image features. The channel number $C$ of the neck's output is 64. We set the channel number of the 3D voxel feature $C_{1}$ to 64 and that of the BEV feature $C_{2}$ to 256, following [35, 39].

IV-B Object Detection Results

We compare the performance of baseline Late Fusion (LF) methods using ImVoxelNet and our proposed single-side models EMIFF_Veh/Inf on the DAIR-V2X-C dataset. We also implement several multi-view camera-based methods that have been applied to the nuScenes dataset [40, 36] (e.g., BEVFormer [31], BEVDepth [29]) for the VIC3D task. The evaluation results on the VIC-Sync portion of the DAIR-V2X-C dataset are shown in Table I and Figure 7. As the table shows, the Intermediate Fusion (IF) method EMIFF achieves state-of-the-art performance on the multi-view camera fusion benchmark compared with Late Fusion (LF) and Early Fusion (EF) methods. EMIFF obtains 15.61 $AP_{\text{3D}}$ and 21.44 $AP_{\text{BEV}}$ in the overall setting.

EMIFF_Veh and EMIFF_Inf remove the MCA module but preserve the CCM and FC modules, so that the models can be applied to the vehicle side and the infrastructure side respectively without interaction between them, and their predictions can be used for Late Fusion. EMIFF achieves higher $AP_{\text{3D}}$ and $AP_{\text{BEV}}$ than ImVoxelNet [16] under the Only-Veh, Only-Inf, and LF settings. This indicates that EMIFF's single-side model has a stronger feature extraction ability.

Interestingly, Only-Inf methods achieve the best scores in 50-100m $AP_{\text{3D}}$ and $AP_{\text{BEV}}$, a phenomenon that can also be seen in [5]. As mentioned before, these metrics are related to detecting objects far from the ego vehicle. We count 16,934 objects within the 50-100m range from the vehicle, which are used to calculate the 50-100m $AP_{\text{3D}}$ metric. Among these objects, almost three-quarters (12,651) are closer to the infrastructure camera and are therefore easier for Only-Inf models to detect.

| Fusion | Model | $AP_{\text{3D}}$ (IoU=0.5) Overall | $AP_{\text{3D}}$ 0-30m | $AP_{\text{3D}}$ 30-50m | $AP_{\text{3D}}$ 50-100m | $AP_{\text{BEV}}$ (IoU=0.5) Overall | $AP_{\text{BEV}}$ 0-30m | $AP_{\text{BEV}}$ 30-50m | $AP_{\text{BEV}}$ 50-100m | AB (Byte) |
|---|---|---|---|---|---|---|---|---|---|---|
| Only-Veh | ImVoxelNet [16] | 7.29 | 16.98 | 2.35 | 0.13 | 8.85 | 19.89 | 3.44 | 0.28 | \ |
| | EMIFF_Veh | 8.65 | 19.11 | 4.33 | 0.20 | 10.46 | 22.42 | 5.57 | 0.42 | |
| Only-Inf | ImVoxelNet [16] | 8.66 | 13.05 | 5.79 | 5.50 | 14.41 | 17.98 | 10.34 | 11.19 | \ |
| | EMIFF_Inf | 9.76 | 13.59 | 6.90 | 6.63 | 14.81 | 18.78 | 11.50 | 11.43 | |
| LF | ImVoxelNet [5] | 11.08 | 22.27 | 4.40 | 2.33 | 14.76 | 27.02 | 7.13 | 4.73 | 0.28K |
| | EMIFF_Veh/Inf | 11.99 | 24.79 | 6.08 | 2.30 | 15.79 | 30.39 | 8.50 | 4.84 | 0.28K |
| EF | BEVDepth [29] | 7.36 | 16.23 | 1.79 | 0.18 | 13.17 | 26.42 | 5.00 | 4.82 | 550.84K |
| | BEVFormer_S [31] | 8.80 | 18.07 | 3.71 | 1.76 | 13.45 | 24.76 | 6.46 | 4.63 | |
| | ImVoxelNet [16] | 12.72 | 23.63 | 7.38 | 3.11 | 18.17 | 30.54 | 11.39 | 7.00 | |
| IF | EMIFF | 15.61 | 29.12 | 9.07 | 4.01 | 21.44 | 36.24 | 13.51 | 8.28 | 32.64K |

TABLE I: Quantitative evaluation on DAIR-V2X-C. Best values are marked by bold, and the second best is underlined. All scores in %.

We also compare EMIFF with representative cooperative perception models on the DAIR-V2X-C dataset; the results are reported in Table II. EMIFF has a clear performance advantage, outperforming DiscoNet [4] by 23.69% in relative $AP_{3D}$ (* denotes results taken from [41]).

| Fusion Modality | Model | $AP_{\text{3D}}$ |
| --- | --- | --- |
| Bounding Box | LateFusion [5] (CVPR’22) | 11.08 |
| BEV | CoBEVT [6] (CoRL’22) | 4.80 |
| | V2VNet* [13] (ECCV’20) | 8.47 |
| | When2com* [42] (CVPR’20) | 9.84 |
| | Where2comm* [11] (NeurIPS’22) | 10.25 |
| | V2X-ViT* [10] (ECCV’22) | 10.75 |
| | DiscoNet* [4] (NeurIPS’21) | 12.62 |
| Voxel | EMIFF | 15.61 |

TABLE II: Comparison results of representative cooperative methods on DAIR-V2X-C. All scores in %.
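The 23.69% margin over DiscoNet is a relative improvement rather than a difference in AP points. A short sketch that reproduces it from the Table II scores (the helper name `relative_gain` is hypothetical):

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# AP_3D scores copied from Table II.
emiff, disconet, late_fusion = 15.61, 12.62, 11.08

print(f"EMIFF vs DiscoNet:   +{relative_gain(emiff, disconet):.2f}%")
print(f"EMIFF vs LateFusion: +{relative_gain(emiff, late_fusion):.2f}%")
```

In absolute terms the gap to DiscoNet is 2.99 AP points; dividing by DiscoNet's 12.62 gives the relative figure quoted in the text.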

IV-C Ablation Study

We remove the MCA, CCM, and FC modules from EMIFF and regard the resulting model as the baseline in the ablation study. We also conduct experiments to investigate when to fuse information from vehicle and infrastructure.

| MCA | CCM | FC | $AP_{\text{3D}}$ | $AP_{\text{BEV}}$ |
| --- | --- | --- | --- | --- |
| | | | 13.60 | 20.05 |
| | | | 13.98 | 20.23 |
| | | | 14.65 | 20.64 |
| | | | 15.27 | 21.03 |
| | | | 15.61 | 21.44 |

TABLE III: Ablation study on EMIFF.

Effect of Each Component. The ablation results for the MCA, CCM, and FC modules are summarized in Table III. The model in the 1st row, denoted EMIFF_B, removes the MCA, FC, and CCM modules and keeps only the fusion methodology at feature level. Comparing the 2nd and 3rd rows with the 1st row, both MCA and CCM improve performance over the baseline: MCA increases $AP_{\text{3D}}$ and $AP_{\text{BEV}}$ by 1.05 and 0.59, more than the 0.38 and 0.18 increases induced by the CCM module. These results demonstrate the validity of MCA, which uses a cross-attention mechanism to select the more useful infrastructure features at different scales based on vehicle features. FC is designed to eliminate redundant information in the features, yet it also improves detection performance. This is because the FC module increases the depth of the whole network and introduces extra computation, which can be regarded as feature refinement.
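The gains quoted above can be recomputed from the Table III scores. The following sketch does so; note that the mapping of the 2nd row to CCM and the 3rd row to MCA is an assumption inferred from the deltas stated in the text, and the helper name is hypothetical:

```python
# (AP_3D, AP_BEV) for the five rows of Table III, top to bottom.
rows = [
    (13.60, 20.05),  # 1st row: baseline EMIFF_B
    (13.98, 20.23),  # 2nd row (assumed: CCM only, inferred from the quoted gains)
    (14.65, 20.64),  # 3rd row (assumed: MCA only, inferred from the quoted gains)
    (15.27, 21.03),  # 4th row
    (15.61, 21.44),  # 5th row: full EMIFF
]
baseline = rows[0]


def gain_over_baseline(row):
    """Absolute (AP_3D, AP_BEV) gain of a row over the 1st row, in AP points."""
    return tuple(round(value - base, 2) for value, base in zip(row, baseline))


ccm_gain = gain_over_baseline(rows[1])
mca_gain = gain_over_baseline(rows[2])
full_gain = gain_over_baseline(rows[4])
print("CCM:", ccm_gain, "MCA:", mca_gain, "full EMIFF:", full_gain)
```

The full model's gain over the baseline is larger than either single-module gain, which matches the trend of the table.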

Figure 8: $AP_{\text{3D}}$ (IoU=0.5) with respect to compression rate (shown as number $\times$). CCR is changed from $\times 1$ to $\times 64$, and SCR is set from $\times 1$ to $\times 256$ with CCR set to $\times 64$.

Voxel or BEV Fusion? To investigate when to fuse features in the IF method (at the voxel or BEV level), we compare the performance of EMIFF with EMIFF_BEV. The former belongs to the IF-Voxel pipeline, while the latter belongs to the IF-BEV pipeline, which condenses the voxel features $V_{veh}$ and $V_{inf}$ into BEV features separately with two 3D necks and then averages the two BEV features for fusion. Results (Table IV) show that fusion at the voxel level performs better, which indicates that the transformation from voxel to BEV features can cause higher information loss.
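The difference between the two pipelines is the order of fusion and height collapse. The toy sketch below illustrates only that ordering and is not the authors' implementation: it assumes voxel-level fusion is also an element-wise average (this section does not state the operator), and it replaces the learned 3D neck with a simple max over the height axis. All function names are hypothetical.

```python
def average(a, b):
    """Element-wise mean of two equally shaped nested lists."""
    if isinstance(a, list):
        return [average(x, y) for x, y in zip(a, b)]
    return (a + b) / 2.0


def collapse_height(voxels):
    """Toy stand-in for a 3D neck: max over z. Input is [z][y][x], output is [y][x]."""
    return [[max(column) for column in zip(*rows)] for rows in zip(*voxels)]


# Tiny single-channel voxel grids with 2 height bins, 1 row and 2 columns.
v_veh = [[[1.0, 0.0]], [[0.0, 0.0]]]
v_inf = [[[0.0, 0.0]], [[1.0, 4.0]]]

# IF-Voxel order: fuse the voxel features first, then collapse to BEV.
voxel_level = collapse_height(average(v_veh, v_inf))

# IF-BEV order: collapse each side to BEV first, then average the BEV features.
bev_level = average(collapse_height(v_veh), collapse_height(v_inf))

print("voxel-level fusion:", voxel_level)
print("BEV-level fusion:  ", bev_level)
```

With a nonlinear collapse the two orders give different BEV maps, because the BEV-level pipeline discards the height at which each response occurred before the two views are combined. Which order detects better is an empirical question, answered in Table IV.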

| Fusion | Model | $AP_{\text{3D}}$ | $AP_{\text{BEV}}$ |
| --- | --- | --- | --- |
| LF | ImVoxelNet | 11.08 | 14.76 |
| EF | ImVoxelNet_M | 12.72 | 18.17 |
| IF (BEV) | EMIFF_BEV | 11.50 | 16.23 |
| IF (Voxel) | EMIFF | 13.37 | 19.66 |

TABLE IV: Analysis on choice of feature fusion.

| Backbone | Model | $AP_{\text{3D}}$ | $AP_{\text{BEV}}$ | Param | FLOPs |
| --- | --- | --- | --- | --- | --- |
| ResNet-50 | EMIFF_B | 13.60 | 20.05 | 47.82M | 94.01G |
| ResNet-50 | EMIFF | 15.61 | 21.44 | 49.32M | 123.76G |
| ResNet-101 | EMIFF_B | 14.06 | 20.56 | 85.79M | 171.10G |
| ResNet-101 | EMIFF | 16.46 | 22.32 | 87.31M | 201.46G |

TABLE V: Comparison results of model capacities.

IV-D Influence of Model Capacities

To further explain the effect of the proposed modules, we change the image backbone of both EMIFF and the baseline model EMIFF_B from ResNet-50 to ResNet-101 to assess how much performance improvement can be obtained by increasing parameters and capacity. Results in Table V show that a more complex network with higher capacity performs better, while EMIFF with ResNet-50, despite fewer parameters and lower FLOPs, still outperforms the baseline EMIFF_B with ResNet-101. This observation provides additional evidence supporting the effectiveness of the proposed modules.
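The comparison can be made explicit by computing, from the Table V entries, what the proposed modules cost and gain on each backbone. This is a minimal sketch; the `Entry` type and `module_overhead` helper are hypothetical names introduced for illustration:

```python
from typing import NamedTuple


class Entry(NamedTuple):
    ap_3d: float
    ap_bev: float
    params_m: float  # parameters, in millions
    flops_g: float   # FLOPs, in GFLOPs


# Rows of Table V, keyed by (backbone, model).
TABLE_V = {
    ("ResNet-50", "EMIFF_B"): Entry(13.60, 20.05, 47.82, 94.01),
    ("ResNet-50", "EMIFF"): Entry(15.61, 21.44, 49.32, 123.76),
    ("ResNet-101", "EMIFF_B"): Entry(14.06, 20.56, 85.79, 171.10),
    ("ResNet-101", "EMIFF"): Entry(16.46, 22.32, 87.31, 201.46),
}


def module_overhead(backbone: str) -> dict:
    """AP_3D gain and extra cost of EMIFF over EMIFF_B on the same backbone."""
    base = TABLE_V[(backbone, "EMIFF_B")]
    full = TABLE_V[(backbone, "EMIFF")]
    return {
        "ap_3d_gain": round(full.ap_3d - base.ap_3d, 2),
        "extra_params_m": round(full.params_m - base.params_m, 2),
        "extra_flops_g": round(full.flops_g - base.flops_g, 2),
    }


for backbone in ("ResNet-50", "ResNet-101"):
    print(backbone, module_overhead(backbone))

# Cross-backbone check: the smaller full model versus the larger baseline.
small_full = TABLE_V[("ResNet-50", "EMIFF")]
large_base = TABLE_V[("ResNet-101", "EMIFF_B")]
print("EMIFF (R50) beats EMIFF_B (R101):", small_full.ap_3d > large_base.ap_3d)
```

On both backbones the modules add about 1.5M parameters, whereas switching the baseline from ResNet-50 to ResNet-101 adds roughly 38M parameters for a smaller $AP_{\text{3D}}$ gain.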

IV-E Impact of Feature Compression

As shown in Figure 8, we investigate the effect of the Channel Compressor and the Spatial Compressor. First, we change the Channel Compression Rate (CCR) from $\times 1$ to $\times 64$; the model performance is almost stable at low compression rates, which indicates that channel compression can extract more useful information and remove redundancy. After CCR reaches its maximum, we continue to compress features with the Spatial Compressor, so the overall compression rate ranges from $\times 64$ to $\times 16384$. As the compressed features get smaller, $AP_{\text{3D}}$ declines from 15.33 to 12.63 but remains higher than that of LF, and the transmission cost falls to 0.51KB, which is comparable to LF’s cost.
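The overall range of $\times 64$ to $\times 16384$ is the product of the channel and spatial rates. A small sketch of that arithmetic follows; the function names are hypothetical, and the size helper assumes the payload shrinks in exact inverse proportion to the overall rate, which ignores any encoding overhead:

```python
def total_compression_rate(ccr: int, scr: int = 1) -> int:
    """Overall rate when channel (CCR) and spatial (SCR) compression are stacked."""
    return ccr * scr


def compressed_size(uncompressed_bytes: float, ccr: int, scr: int = 1) -> float:
    """Payload size after compression, assuming pure inverse scaling with the rate."""
    return uncompressed_bytes / total_compression_rate(ccr, scr)


# End points of the sweep described in the text.
channel_only_max = total_compression_rate(ccr=64)        # spatial compressor unused
combined_max = total_compression_rate(ccr=64, scr=256)   # both at their maximum
print(channel_only_max, combined_max)

# Illustrative only: a hypothetical 1 MiB feature map at the maximum overall rate.
print(compressed_size(1024 * 1024, ccr=64, scr=256), "bytes")
```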

V CONCLUSIONS

EMIFF is a novel multi-view intermediate-fusion framework for the camera-based VIC3D task. To correct the negative effect of pose errors and time asynchrony, we design a Multi-scale Cross-Attention module and a Camera-aware Channel Masking module to fuse and augment multi-view features. EMIFF also effectively reduces transmission cost via Feature Compression and achieves state-of-the-art results on the DAIR-V2X-C benchmark, significantly outperforming previous EF and LF methods. Future work will extend the framework to more data modalities.

ACKNOWLEDGMENT

This work is funded by the National Key R&D Program of China (2022ZD0115502) and Lenovo Research.

References

  • [1] Y. Ma, T. Wang, X. Bai, H. Yang, Y. Hou, Y. Wang, Y. Qiao, R. Yang, D. Manocha, and X. Zhu, “Vision-centric bev perception: A survey,” arXiv preprint arXiv:2208.02797, 2022.
  • [2] Y. Han, H. Zhang, H. Li, Y. Jin, C. Lang, and Y. Li, “Collaborative perception in autonomous driving: Methods, datasets and challenges,” arXiv preprint arXiv:2301.06262, 2023.
  • [3] R. Xu, H. Xiang, X. Xia, X. Han, J. Li, and J. Ma, “Opv2v: An open benchmark dataset and fusion pipeline for perception with vehicle-to-vehicle communication,” in 2022 International Conference on Robotics and Automation (ICRA).   IEEE, 2022, pp. 2583–2589.
  • [4] Y. Li, S. Ren, P. Wu, S. Chen, C. Feng, and W. Zhang, “Learning distilled collaboration graph for multi-agent perception,” in Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [5] H. Yu, Y. Luo, M. Shu, Y. Huo, Z. Yang, Y. Shi, Z. Guo, H. Li, X. Hu, J. Yuan, and Z. Nie, “Dair-v2x: A large-scale dataset for vehicle-infrastructure cooperative 3d object detection,” in IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2022.
  • [6] R. Xu, Z. Tu, H. Xiang, W. Shao, B. Zhou, and J. Ma, “Cobevt: Cooperative bird’s eye view semantic segmentation with sparse transformers,” arXiv preprint arXiv:2207.02202, 2022.
  • [7] N. Vadivelu, M. Ren, J. Tu, J. Wang, and R. Urtasun, “Learning to communicate and correct pose errors,” in Proceedings of the 2020 Conference on Robot Learning, ser. Proceedings of Machine Learning Research, J. Kober, F. Ramos, and C. Tomlin, Eds., vol. 155.   PMLR, 16–18 Nov 2021, pp. 1195–1210.
  • [8] H. Yu, Y. Tang, E. Xie, J. Mao, J. Yuan, P. Luo, and Z. Nie, “Vehicle-infrastructure cooperative 3d object detection via feature flow prediction,” arXiv preprint arXiv:2303.10552, 2023.
  • [9] Y. Li, D. Ma, Z. An, Z. Wang, Y. Zhong, S. Chen, and C. Feng, “V2x-sim: Multi-agent collaborative perception dataset and benchmark for autonomous driving,” IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 10914–10921, 2022.
  • [10] R. Xu, H. Xiang, Z. Tu, X. Xia, M.-H. Yang, and J. Ma, “V2x-vit: Vehicle-to-everything cooperative perception with vision transformer,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIX.   Springer, 2022, pp. 107–124.
  • [11] Y. Hu, S. Fang, Z. Lei, Y. Zhong, and S. Chen, “Where2comm: Communication-efficient collaborative perception via spatial confidence maps,” arXiv preprint arXiv:2209.12836, 2022.
  • [12] R. Chen, Y. Mu, R. Xu, W. Shao, C. Jiang, H. Xu, Z. Li, and P. Luo, “Co^3: Cooperative unsupervised 3d representation learning for autonomous driving,” arXiv preprint arXiv:2206.04028, 2022.
  • [13] T.-H. Wang, S. Manivasagam, M. Liang, B. Yang, W. Zeng, and R. Urtasun, “V2vnet: Vehicle-to-vehicle communication for joint perception and prediction,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16.   Springer, 2020, pp. 605–621.
  • [14] S. Fan, H. Yu, W. Yang, J. Yuan, and Z. Nie, “Quest: Query stream for vehicle-infrastructure cooperative perception,” arXiv preprint arXiv:2308.01804, 2023.
  • [15] W. Chen, R. Xu, H. Xiang, L. Liu, and J. Ma, “Model-agnostic multi-agent perception framework,” arXiv preprint arXiv:2203.13168, 2022.
  • [16] D. Rukhovich, A. Vorontsova, and A. Konushin, “Imvoxelnet: Image to voxels projection for monocular and multi-view general-purpose 3d object detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 2397–2406.
  • [17] Y. Wang, V. C. Guizilini, T. Zhang, Y. Wang, H. Zhao, and J. Solomon, “Detr3d: 3d object detection from multi-view images via 3d-to-2d queries,” in Conference on Robot Learning.   PMLR, 2022, pp. 180–191.
  • [18] X. Chen, T. Zhang, Y. Wang, Y. Wang, and H. Zhao, “Futr3d: A unified sensor fusion framework for 3d detection,” arXiv preprint arXiv:2203.10642, 2022.
  • [19] Y. Liu, T. Wang, X. Zhang, and J. Sun, “Petr: Position embedding transformation for multi-view 3d object detection,” arXiv preprint arXiv:2203.05625, 2022.
  • [20] Y. Liu, J. Yan, F. Jia, S. Li, Q. Gao, T. Wang, X. Zhang, and J. Sun, “Petrv2: A unified framework for 3d perception from multi-camera images,” arXiv preprint arXiv:2206.01256, 2022.
  • [21] T. Wang, X. Zhu, J. Pang, and D. Lin, “Fcos3d: Fully convolutional one-stage monocular 3d object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 913–922.
  • [22] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one-stage object detection,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9627–9636.
  • [23] J. Huang, G. Huang, Z. Zhu, and D. Du, “Bevdet: High-performance multi-camera 3d object detection in bird-eye-view,” arXiv preprint arXiv:2112.11790, 2021.
  • [24] J. Huang and G. Huang, “Bevdet4d: Exploit temporal cues in multi-camera 3d object detection,” arXiv preprint arXiv:2203.17054, 2022.
  • [25] E. Xie, Z. Yu, D. Zhou, J. Philion, A. Anandkumar, S. Fidler, P. Luo, and J. M. Alvarez, “M^2bev: Multi-camera joint 3d detection and segmentation with unified birds-eye view representation,” arXiv preprint arXiv:2204.05088, 2022.
  • [26] Y. Zhang, Z. Zhu, W. Zheng, J. Huang, G. Huang, J. Zhou, and J. Lu, “Beverse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving,” arXiv preprint arXiv:2205.09743, 2022.
  • [27] C. Reading, A. Harakeh, J. Chae, and S. L. Waslander, “Categorical depth distribution network for monocular 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8555–8564.
  • [28] J. Philion and S. Fidler, “Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d,” in European Conference on Computer Vision.   Springer, 2020, pp. 194–210.
  • [29] Y. Li, Z. Ge, G. Yu, J. Yang, Z. Wang, Y. Shi, J. Sun, and Z. Li, “Bevdepth: Acquisition of reliable depth for multi-view 3d object detection,” arXiv preprint arXiv:2206.10092, 2022.
  • [30] T. Roddick and R. Cipolla, “Predicting semantic map representations from images using pyramid occupancy networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11138–11147.
  • [31] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Q. Yu, and J. Dai, “Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers,” arXiv preprint arXiv:2203.17270, 2022.
  • [32] L. Peng, Z. Chen, Z. Fu, P. Liang, and E. Cheng, “Bevsegformer: Bird’s eye view semantic segmentation from arbitrary camera rigs,” arXiv preprint arXiv:2203.04050, 2022.
  • [33] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [34] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
  • [35] Y. Yan, Y. Mao, and B. Li, “Second: Sparsely embedded convolutional detection,” Sensors, vol. 18, no. 10, p. 3337, 2018.
  • [36] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” International Journal of Robotics Research (IJRR), 2013.
  • [37] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [38] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
  • [39] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “Pointpillars: Fast encoders for object detection from point clouds,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 12697–12705.
  • [40] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” arXiv preprint arXiv:1903.11027, 2019.
  • [41] Y. Hu, Y. Lu, R. Xu, W. Xie, S. Chen, and Y. Wang, “Collaboration helps camera overtake lidar in 3d detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9243–9252.
  • [42] Y.-C. Liu, J. Tian, N. Glaser, and Z. Kira, “When2com: Multi-agent perception via communication graph grouping,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.