License: CC BY-NC-SA 4.0
arXiv:2302.02367v6 [cs.CV] 13 Dec 2023

FastPillars: A Deployment-friendly Pillar-based 3D Detector

Sifan Zhou1, Zhi Tian2, Xiangxiang Chu2, Xinyu Zhang2, Bo Zhang2, Xiaobo Lu1
Chengjian Feng2, Zequn Jie2, Miao Sun3, Patrick Yin Chiang3, Lin Ma2
Abstract

The deployment of 3D detectors is one of the major challenges in real-world self-driving scenarios. Existing BEV-based (i.e., Bird's Eye View) detectors favor sparse convolutions (known as SPConv) to speed up training and inference, which poses a hard barrier to deployment, especially for on-device applications. In this paper, to tackle the challenge of efficient 3D object detection from an industry perspective, we devise a deployment-friendly pillar-based 3D detector, termed FastPillars. First, we introduce a novel lightweight Max-and-Attention Pillar Encoding (MAPE) module specifically for enhancing small 3D objects. Second, we propose a simple yet effective principle for designing the backbone in pillar-based 3D detection. We construct FastPillars based on these designs, achieving high performance and low latency without SPConv. Extensive experiments on two large-scale datasets demonstrate the effectiveness and efficiency of FastPillars for on-device 3D detection in terms of both performance and speed. Specifically, FastPillars delivers state-of-the-art accuracy on the Waymo Open Dataset with a 1.8× speedup and a 3.8 mAPH/L2 improvement over CenterPoint (SPConv-based). We will release our code.

1 Introduction

3D object detection using LiDAR point cloud has a wide range of applications and has shown remarkable progress in self-driving and robotics (Qi et al. 2018; Shi et al. 2020a, b). However, the community tends to explore high-performance detectors while overlooking the requirement for fast runtime speed, essential for onboard deployment in autonomous systems. Therefore, it is imperative to develop a real-time top-performing 3D detector from the industrial perspective.

According to the type of input data, existing real-time mainstream 3D detectors can be divided into two classes: point-based (Qi et al. 2019; Shi, Wang, and Li 2019; Shi et al. 2020b; Yang et al. 2020) and grid-based methods (Zhou and Tuzel 2018; Lang et al. 2019; Deng et al. 2021). In point-based methods, the PointNet family (Qi et al. 2017a, b) is leveraged to learn discriminative representations from raw point clouds. However, these methods may not lend themselves to efficient hardware implementation, as they often require point query/retrieval in 3D space (e.g., PointNet++ (Qi et al. 2017b)). Grid-based 3D detectors convert irregular point clouds into regular grids (i.e., voxels/pillars). However, sparse point clouds result in numerous empty grids, which leads to significant redundant computation. Some methods (Yan, Mao, and Li 2018; Yin, Zhou, and Krahenbuhl 2021a) use SPConv (Graham and van der Maaten 2017) to skip computation on empty grids and reduce the computational burden. Despite being effective, SPConv poses a challenge when converted to TRT (i.e., TensorRT) for deployment and hampers further speedup through TRT optimization.

Figure 1: The overall comparison with other real-time one-stage 3D detection methods. FastPillars finds a better trade-off between accuracy and speed, outperforming CenterPoint with 1.8× faster inference and 3.8 higher mAPH L2 on the Waymo val set. FPS is measured on an NVIDIA Tesla V100 GPU.

PointPillars (Lang et al. 2019) was proposed to use highly optimized 2D convolutions alone, achieving lower latency. Despite being fast, PointPillars has unsatisfactory performance due to the lack of an efficient backbone. Recently, PillarNet (Shi, Li, and Ma 2022) achieved high performance while keeping real-time speed with a SPConv-based backbone. Nevertheless, the use of SPConv makes PillarNet hard to quantize and deploy via TRT. Specifically, the deployment of SPConv faces the following difficulties. i) SPConv is not a built-in operation in TensorRT. This makes it necessary to write a tedious custom plugin in CUDA C++ with several limitations, such as fixed-shape input and reduced compatibility with the commonly used TensorRT quantization deployment. ii) SPConv requires irregular memory access patterns, which are challenging to optimize on modern hardware. iii) SPConv takes as input 3D coordinates that are difficult to quantize, and it requires taking fusion into consideration for fast execution, further complicating the deployment process. In contrast, TRT offers several advantages: i) TRT can significantly accelerate model inference while maintaining accuracy through deep optimization and high-parallelism techniques such as layer and tensor fusion. ii) TRT supports a wide range of embedded and automotive environments, making it extensively used in the deployment of deep learning models in various industrial products.

In this paper, we propose a real-time and high-performance 3D object detector designed with deployment in mind, termed FastPillars. FastPillars is fully based on standard convolutions, and thus it can be effortlessly deployed in onboard applications and seamlessly enjoy the speedup of TRT and network quantization. FastPillars consists of four essential blocks: PFE (Pillar Feature Encoding), backbone, neck and head (see Fig. 2). In the PFE block, we observe that previous pillar-based methods did not pay attention to local geometry patterns. To this end, we propose a simple but effective Max-and-Attention Pillar Encoding (MAPE) module, which attentively integrates significant local features and thus alleviates the information loss in the pillar generation process. In addition, the MAPE module barely increases the overall latency (+4ms). In the backbone, we find that the design rules of 2D backbones are not suitable for the point cloud detection task. Based on this observation, we propose a computation-efficient principle and design a lightweight backbone. Finally, we enrich the semantic features in the neck block and adopt a center-based detection head. Extensive experiments demonstrate that FastPillars achieves state-of-the-art performance on two large-scale datasets, nuScenes and Waymo. As shown in Fig. 1, it can be seamlessly accelerated through TensorRT, with a 1.8× speedup and a 3.8 mAPH L2 improvement over CenterPoint (SPConv-based). As a result, the proposed method offers an even better trade-off between speed and accuracy for real-time embedded applications. We summarize our contributions as follows:

  • We introduce a novel lightweight Max-and-Attention Pillar Encoding (MAPE) module specially for enhancing small 3D objects.

  • We propose a simple yet effective principle for designing pillar-based 3D backbone with remarkable performance and real-time inference speed.

  • Based on the above designs, we construct a high-performance and low-latency 3D detector termed FastPillars. FastPillars eliminates the need for SPConv, which hampers on-device deployment, and provides a strong and simple alternative to SPConv-based detectors.

  • Extensive experiments on the nuScenes and Waymo datasets show that FastPillars sets a new state of the art for on-board 3D detection in terms of the accuracy and latency trade-off and, most importantly, is end-to-end deployable in TensorRT. We believe our method can serve as a strong competitor to its peers and have a significant impact on the community.

Figure 2: The framework of FastPillars. Given an input point cloud, FastPillars predicts 3D bounding boxes. As shown at the bottom, FastPillars consists of four parts: MAPE module, backbone, neck and center-based head. First, the point cloud is pillarized with MAPE, and then the encoded features are sent to the backbone for feature extraction. These features are fused by the neck, and 3D boxes are regressed by the center-based head. The backbone is designed to be scalable by changing the number of blocks at the early stages according to different needs. Best viewed in color.

2 Related Work

Voxel-based 3D Detectors. Voxel-based 3D detectors (Chen et al. 2017; Kuang et al. 2020; Deng et al. 2021) generally convert the unstructured point cloud to regular pillar/voxel grids. This allows point features to be learned with mature 2D/3D CNNs. VoxelNet (Zhou and Tuzel 2018) is a pioneering work, which voxelizes the point cloud and then uses a Voxel Feature Extractor (VFE) and 3D CNNs to learn the geometrical representation. Its shortcoming is slow inference due to the huge computational burden of the 3D convolutions. To save memory, SECOND (Yan, Mao, and Li 2018) uses 3D sparse convolutions (Graham and van der Maaten 2017) that operate only on non-empty voxels to speed up training and inference. However, the use of SPConv has a drawback: it is not deployment-friendly, making it tricky to apply on embedded systems. To this end, PointPillars (Lang et al. 2019) was proposed for on-device deployment. In PointPillars, a mature 2D detector pipeline is applied to predict 3D objects, making it easy to convert into ONNX/TensorRT for deployment. This deployment-friendly nature has made PointPillars a popular method in practice. After that, CenterPoint (Yin, Zhou, and Krahenbuhl 2021a) was proposed, which uses a nearly real-time and anchor-free pipeline, achieving state-of-the-art performance. Recently, PillarNet (Shi, Li, and Ma 2022) used 2D SPConv based on the "encoder-neck-head" architecture to boost accuracy with real-time speed. However, due to the use of SPConv, it inevitably faces difficulty in deployment for industrial applications and in further speedup with TRT optimization.

Industry-level Lightweight Network Structures for Object Detection. For years, the YOLO series (Bochkovskiy, Wang, and Liao 2020; Glenn 2022) has been the de facto industry standard for lightweight 2D object detection, whose backbone designs mainly inherit the ideas from RepVGG (Ding et al. 2021). RepVGG refactored the famous plain network VGG (Simonyan and Zisserman 2014) using a reparameterization-based structural design. During training, a plain Conv-BN-ReLU is replaced by its over-parameterized three-branch counterpart, i.e., Conv3×3-BN, Conv1×1-BN and Identity-BN, followed by the ReLU function after the summation of the three branches. The three-branch structure substantially helps network optimization, while the reparameterization converts the three branches identically into one at inference, improving inference efficiency. Owing to this advantage, the trend has swept 2D object detectors and shown high performance at extreme speeds, as in PPYOLO-E (Xu et al. 2022), YOLOv6 (Li et al. 2022a) and YOLOv7 (Wang, Bochkovskiy, and Liao 2022). Despite this success, to the best of our knowledge these schemes have not yet been applied to LiDAR object detection. Our successful application significantly improves computational efficiency and reduces the difficulty of deployment, especially for resource-constrained hardware.

3 Our Approach

This section presents FastPillars, an end-to-end trainable and SPConv-free neural network for real-time high-performance 3D detection. As shown in Fig. 2, our network consists of four blocks: MAPE module, backbone, neck and center-based detection head.

Problem Setting. We present the basic task definitions of LiDAR-based 3D detection before introducing the detailed method. A point set with $N$ points in 3D space is defined as $\mathbf{P}=\{\mathbf{p}_i=[x_i,y_i,z_i,r_i,t_i]^T\in\mathbb{R}^{N\times 5}\}$, where $x_i, y_i, z_i$ denote the coordinate values of each point along the axes X, Y, Z, respectively, $r_i$ is the laser reflection intensity, and $t_i$ is the relative timestamp, which is optional and depends on the specific settings of a dataset.
The set of objects in the 3D scene is defined as $\mathbf{B}=\{\mathbf{b}_j=[x_j,y_j,z_j,h_j,w_j,l_j,\theta_j,c_j]^T\in\mathbb{R}^{M\times 8}\}$, where $M$ is the total number of objects, $\mathbf{b}_j$ is the $j$-th object in the scene, $x_j, y_j, z_j$ is the object's center, $h_j, w_j, l_j$ is the object's size, $\theta_j$ is the object's heading angle and $c_j$ is the object's class. The task of 3D object detection is to detect the 3D boxes $\mathbf{B}$ from the point cloud $\mathbf{P}$ accurately.

Figure 3: The MAPE module comprises three units: point encoding, max-pooling encoding and attention-pooling encoding. For one pillar containing $N_v$ points, the point encoding unit first augments the raw points with the pillar center and point cloud range, then maps the augmented features to the feature space with an MLP. The max-pooling encoding unit obtains pillar-wise features by max pooling across the point dimension. The attention-pooling encoding unit obtains pillar-wise features by a weighted summation across the point dimension. The final pillar-wise feature is obtained by averaging the max- and attention-pooling features.

3.1 Max-and-Attention Pillar Encoding

Point cloud voxel/pillar encoding is crucial for grid-based 3D detection methods. The pioneering PointPillars aggressively applies max pooling to aggregate point features in each pillar. However, max pooling loses fine-grained information, and these local geometric patterns are vital for pillar-based detection, especially of small objects. Therefore, paying attention to this information is important for accurate detection.

In this paper, we propose a simple yet efficient pillar encoding module, named Max-and-Attention Pillar Encoding (MAPE), which takes into account every pillar's local detailed geometric information with negligible computational burden and benefits the performance on small objects (e.g., pedestrians and cyclists). Meanwhile, the lightweight design of the MAPE module makes it highly suitable for real-time embedded applications. As shown in Fig. 3, our MAPE module consists of three units: 1) the point encoding, 2) the max-pooling encoding, and 3) the attention-pooling encoding.

We suppose that a point cloud $\mathbf{P}$ in 3D space has the range of $L$, $W$, $H$ along the axes X, Y, Z. $\mathbf{P}$ is equally divided into a pillar grid with pillar size $l$, $w$, $H$. Here, as in PointPillars, we only voxelize the point cloud in the XY plane, without dividing the height dimension. Let $v=\{p_i=[x_i,y_i,z_i,r_i,t_i]\in\mathbb{R}^{N_v\times 5}\}$ be a non-empty pillar with the spatial shape $[l,w,H]$, where $i\in\{1,\dots,N_v\}$ and $N_v$ is the number of points in pillar $v$.
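
The XY-plane pillarization described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `pillarize`, the range origin (`x_min`, `y_min`) and the toy pillar size are hypothetical and are not taken from the paper. Each point is assigned to a pillar by flooring its offset from the range origin, and only non-empty pillars are ever created.

```python
import math
from collections import defaultdict

def pillarize(points, x_min, y_min, l, w):
    """Group points into pillars on the XY plane (height is not divided).

    points: list of [x, y, z, r, t]; l, w: pillar size along X and Y.
    Returns {(ix, iy): [points...]} containing non-empty pillars only.
    """
    pillars = defaultdict(list)
    for p in points:
        ix = math.floor((p[0] - x_min) / l)  # pillar index along X
        iy = math.floor((p[1] - y_min) / w)  # pillar index along Y
        pillars[(ix, iy)].append(p)
    return dict(pillars)

# Toy input (hypothetical values): three points, pillar size 0.2 x 0.2.
points = [
    [0.10, 0.10, -1.0, 0.5, 0.0],
    [0.15, 0.05, 0.3, 0.2, 0.0],
    [0.90, 0.10, 0.0, 0.7, 0.0],
]
pillars = pillarize(points, x_min=0.0, y_min=0.0, l=0.2, w=0.2)
print({k: len(v) for k, v in pillars.items()})
```

Because no per-pillar sampling is applied, the number of points per pillar varies, which matches the variable $N_v$ in the text. A real pipeline would also discard points outside the detection range, which this sketch omits.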

Point Encoding. First, we augment the points in each pillar into $\hat{p_i}=\{[x_i,y_i,z_i,r_i,t_i,x_i^c,y_i^c,z_i^c,x_i^r,y_i^r,z_i^r]\in\mathbb{R}^{N_v\times 11}\}$, where $[x_i,y_i,z_i]$ is the original point coordinates in the ego frame, $[x_i^c,y_i^c,z_i^c]$ is the offset of $p_i$ from the current pillar center, and $[x_i^r,y_i^r,z_i^r]$ is the relative coordinates of $p_i$ obtained by subtracting the range of the point cloud. Notably, we do not adopt any sampling strategy to keep the number of points within each pillar the same, because this operation may drop useful points and impair the original geometric patterns. Second, the augmented point-wise features $\hat{p_i}$ within $v$ are mapped to the high-dimensional feature space through an MLP layer. This process is formulated as

$$p_i^{e} = m(\hat{p_i};\, w_m), \tag{1}$$

where $m(\cdot)$ denotes an MLP, $w_m$ denotes the learnable weights of the function $m(\cdot)$, and $p_i^{e}\in\mathbb{R}^{N_v\times D}$ is the point-wise feature.
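
As a rough illustration of the augmentation step that precedes the MLP, the sketch below builds the 11-dimensional feature for a single point. It rests on assumptions the text does not spell out: "subtracting the range of the point cloud" is read here as subtracting the minimum corner of the range, and the helper name `augment_point` and all numeric values are hypothetical. The MLP $m(\cdot)$ itself is omitted.

```python
def augment_point(p, pillar_center, range_min):
    """Build [x, y, z, r, t, xc, yc, zc, xr, yr, zr] for one point.

    pillar_center: (x, y, z) center of the pillar that contains p.
    range_min: minimum corner of the point cloud range (an assumed reading
               of "subtracting the range of the point cloud").
    """
    x, y, z, r, t = p
    center_offset = [x - pillar_center[0], y - pillar_center[1], z - pillar_center[2]]
    range_offset = [x - range_min[0], y - range_min[1], z - range_min[2]]
    return [x, y, z, r, t] + center_offset + range_offset

# Toy values chosen to be exactly representable in floating point.
p = [1.0, 2.0, 0.5, 0.3, 0.0]
feat = augment_point(p, pillar_center=(0.75, 2.25, 0.0), range_min=(-1.0, -2.0, -3.0))
print(len(feat), feat[5:])
```

Applying this to every point of a pillar yields the $N_v \times 11$ input that Eq. (1) maps to $N_v \times D$.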

Max-pooling Encoding. This unit aggregates all point features within a pillar into a single feature vector, while remaining invariant to point permutations in each pillar, which is formulated as

$$f^{m} = \max(p_i^{e}), \tag{2}$$

where $\max(\cdot)$ means the max-pooling operation across these point features, and $f^{m}\in\mathbb{R}^{D}$ is the resulting feature vector of each pillar.

Attention-pooling Encoding. This unit is designed to maintain the local fine-grained information. Max pooling struggles to integrate the point-wise features $p_i^{e}$ within each pillar $v$, as it only takes the maximum value. However, the rich local detailed patterns are highly valuable for detecting smaller objects from the BEV perspective. Therefore, we turn to the attention mechanism to automatically learn the important local features. First, we use a function $g(\cdot)$ consisting of a shared MLP to predict attention scores for the points in a pillar, i.e., $s_i=g(p_i^{e};w_g)$, where $w_g$ denotes the learnable weights of the MLP and $s_i\in\mathbb{R}^{N_v\times D}$ is the attention scores. Second, the learnt attention scores can be regarded as a soft mask which dynamically weights the point-wise feature $p_i^{e}$. Finally, the weighted summed features are as follows:

$$f^{a} = \sum s_{ij}\cdot p_{ij}^{e}, \quad \sum_{i=1}^{N_v} s_{ij} = 1, \tag{3}$$

where $f^{a}\in\mathbb{R}^{N_v\times D}$ is the resulting pillar attention-pooling features, $j\in\{1,\dots,D\}$ is the feature dimension index, and $s_{ij}$ and $p_{ij}^{e}$ are the attention score and feature of the $j$-th dimension at the $i$-th point, respectively. Notably, because the weighting operates across the different points ($N_v$), this process introduces interaction between different points inside a pillar (i.e., a local area). Finally, we combine the learnt pillar-wise max and attentive features by averaging them, i.e., $f=\frac{f^{m}+f^{a}}{2}$, where $f\in\mathbb{R}^{1\times D}$ is the final pillar-wise feature including the global-aware and local-aware information inside one pillar. The max-pooling operation preserves the maximum response feature in each pillar, while the attention-pooling features maintain the local fine-grained information. By combining the two features, richer information can be effectively retained to enhance the pillar representation. Despite being simple, our MAPE module significantly improves the performance on small objects (+1.6 mAPH L2 for pedestrian), as shown in Tab. 4 and Fig. 7.
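
The two pooling branches and their average can be sketched for a single pillar as follows. This is a minimal illustration, not the authors' implementation: the text only requires the scores to sum to 1 over the points of a pillar, so normalizing with a per-channel softmax is an assumption made here, and the raw scores (`logits`) are supplied directly instead of being predicted by the shared MLP $g(\cdot)$. All function names are hypothetical.

```python
import math

def max_pool(feats):
    """Channel-wise maximum over the points of one pillar (Eq. 2)."""
    return [max(col) for col in zip(*feats)]

def attention_pool(feats, logits):
    """Channel-wise weighted sum over the points of one pillar (Eq. 3).

    The weights are a softmax over the point dimension (an assumption),
    so they sum to 1 for every feature channel j.
    """
    pooled = []
    for f_col, l_col in zip(zip(*feats), zip(*logits)):
        m = max(l_col)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in l_col]
        total = sum(exps)
        pooled.append(sum(e / total * f for e, f in zip(exps, f_col)))
    return pooled

def mape_pool(feats, logits):
    """Average the max-pooled and attention-pooled pillar features."""
    f_m = max_pool(feats)
    f_a = attention_pool(feats, logits)
    return [(a + b) / 2 for a, b in zip(f_m, f_a)]

# Toy pillar with N_v = 2 points and D = 2 channels.
feats = [[1.0, 4.0], [3.0, 2.0]]
logits = [[0.0, 0.0], [0.0, 0.0]]  # equal logits give uniform weights
f = mape_pool(feats, logits)
print(f)
```

With uniform weights the attention branch reduces to the per-channel mean, so the example averages the channel-wise max and mean. Learned scores would instead emphasize the points that matter for each channel.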

Figure 4: The diagram of our computation reallocation in backbone design. This design brings an obvious performance gain without extra latency.
Figure 5: Performance versus the number of blocks on the Waymo val set, trained with 20% of the training data. Taking stage 1 as an example, we vary the number of blocks in stage 1 from 0 to 8 in steps of 2, while setting the number of blocks in the other stages to 2.

3.2 Backbone Design

Computation Reallocation. As presented in PillarNet, a lightweight and powerful backbone is very important for effective pillar feature learning. By adopting classical 2D CNN backbones, i.e., VGGNet (Simonyan and Zisserman 2014) and ResNet-18/34 (He et al. 2016), PillarNet achieves better scalability and flexibility in model complexity. However, we argue that these backbones are specifically designed to extract high-level semantic features and obtain the geometries of objects in RGB images, which may not be suitable for LiDAR point clouds. This is because there exists a substantial modal difference between RGB images and LiDAR point clouds. Unlike an RGB image, a LiDAR point cloud can easily gauge spatial distances, relationships and shapes of objects by collecting laser measurement signals to represent 3D models and maps of environments. This means that rich and accurate geometric information of objects is already explicitly encoded in the LiDAR point cloud, which is also discussed in FCOS-LiDAR (Tian et al. 2022). Therefore, we assume that instead of allocating too many computation resources to modeling the geometries of objects in the later stages, as ResNet does for RGB images, we should reallocate the capacity to the early stages to better incorporate the geometry information carried by the raw points.

To verify the above assumption, we systematically study computation allocation in backbone design by adjusting the stage compute ratio of ResNet. As shown in Fig. 5, we set ResNet-18 as the initial model and vary the number of blocks in each stage from 0 to 8 in steps of 2. From the results, we find that the performance is quite sensitive to the capacity of stages (1, 2), rather than stages (3, 4), which confirms our hypothesis that it is more beneficial to allocate computation to the early stages in pillar-based LiDAR detection. Hence, to achieve a better trade-off between performance and latency, we set the stage compute ratio of the four stages to (6, 6, 3, 1), which has a comparable computation cost to ResNet-34 with (3, 4, 6, 3). As shown in Fig. 4 and Tab. 5, our backbone after computation reallocation improves accuracy (+1.3 mAPH L2). See supplements for more details.
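
The two stage configurations quoted above can be compared with a few lines of arithmetic. This sketch uses the number of blocks per stage as a crude proxy only: the actual computation cost also depends on channel widths and feature-map resolution at each stage, which are not modeled here. The helper name `early_stage_share` is hypothetical.

```python
# Blocks per stage, as quoted in the text.
resnet34_blocks = (3, 4, 6, 3)
fastpillars_blocks = (6, 6, 3, 1)

def early_stage_share(blocks):
    """Fraction of all blocks that sit in stages 1 and 2."""
    return sum(blocks[:2]) / sum(blocks)

for name, blocks in [("ResNet-34", resnet34_blocks), ("reallocated", fastpillars_blocks)]:
    print(name, sum(blocks), early_stage_share(blocks))
```

Both configurations use 16 blocks in total, but the reallocated one places 12 of them in stages 1 and 2, compared with 7 for the ResNet-34 layout.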

Figure 6: The design of the lightweight backbone. (a) A ResNet block. (b) A structural re-parameterization block, which uses 3×3, 1×1 and identity branches to replace a 3×3 conv. (c) At inference time, a structural re-parameterization block is converted to a 3×3 conv.

Lightweight Backbone Design. The re-parameterized structure (Ding et al. 2021) has excellent feature representation ability in training, and can effectively reduce inference latency without performance loss through the re-parameterization operation. This structure has been proven effective in 2D detection tasks (Xu et al. 2022; Li et al. 2022a; Wang, Bochkovskiy, and Liao 2022). However, it has not yet been well exploited for 3D point cloud detection. Inspired by this, we introduce this advanced design from 2D object detection into our backbone designed for point clouds. As shown in Fig. 6, during training, a plain res block (Fig. 6 (a)) is replaced by its over-parameterized three-branch counterpart, i.e., Conv3×3, Conv1×1 and Identity, followed by the ReLU function after the summation of the three branches (Fig. 6 (b)). The three-branch structure substantially helps network optimization, while the reparameterization converts the three branches identically into one at inference (from Fig. 6 (b) to Fig. 6 (c)), improving inference efficiency. As a result, as shown in Tab. 6, our backbone network with a single-path structure reduces inference latency (by 14%) while improving accuracy (+0.6 mAPH/L2).
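
The branch merging can be checked numerically with a small pure-Python sketch. It assumes equal input and output channel counts, stride 1 and zero padding 1, and it leaves out batch normalization, which in RepVGG-style blocks would first be folded into each branch. All function names are hypothetical. The 1×1 kernel and the identity are added to the center tap of the 3×3 kernel, and the merged convolution is compared against the sum of the three branches.

```python
import random

def conv3x3(x, k):
    """3x3 convolution, stride 1, zero padding 1.

    x: [C_in][H][W], k: [C_out][C_in][3][3] (cross-correlation, no kernel flip).
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(len(k))]
    for o in range(len(k)):
        for i in range(C):
            for r in range(H):
                for c in range(W):
                    acc = 0.0
                    for a in range(3):
                        for b in range(3):
                            rr, cc = r + a - 1, c + b - 1
                            if 0 <= rr < H and 0 <= cc < W:
                                acc += k[o][i][a][b] * x[i][rr][cc]
                    out[o][r][c] += acc
    return out

def three_branch(x, k3, k1):
    """Training-time form: conv3x3(x) + conv1x1(x) + x (before ReLU)."""
    y = conv3x3(x, k3)
    C, H, W = len(x), len(x[0]), len(x[0][0])
    for o in range(C):
        for r in range(H):
            for c in range(W):
                y[o][r][c] += sum(k1[o][i] * x[i][r][c] for i in range(C)) + x[o][r][c]
    return y

def merge_kernels(k3, k1):
    """Fold the 1x1 and identity branches into the center tap of a 3x3 kernel."""
    C = len(k3)
    merged = [[[row[:] for row in k3[o][i]] for i in range(C)] for o in range(C)]
    for o in range(C):
        for i in range(C):
            merged[o][i][1][1] += k1[o][i] + (1.0 if o == i else 0.0)
    return merged

random.seed(0)
C, H, W = 2, 4, 4
x = [[[random.uniform(-1, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]
k3 = [[[[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
       for _ in range(C)] for _ in range(C)]
k1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C)]

y_train = three_branch(x, k3, k1)
y_infer = conv3x3(x, merge_kernels(k3, k1))
max_diff = max(abs(a - b)
               for ya, yb in zip(y_train, y_infer)
               for ra, rb in zip(ya, yb)
               for a, b in zip(ra, rb))
print(max_diff)
```

The difference is at the level of floating-point rounding, which is why the single-path network at inference can reproduce the three-branch network exactly while running only one convolution per block.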

The Insight of Backbone Design.

  • Our core insight is to reallocate backbone computation for point clouds, motivated by the fundamental modality difference between point clouds and images. Specifically, we conclude that capacity should be reallocated to the early stages (1, 2) to better integrate the geometric information carried by the raw points, instead of being concentrated in the later stages (3, 4) as in the ResNet setting. This offers the community a new perspective and principle for designing backbone architectures for point clouds.

  • The re-parameterized structure demonstrates that simply adapting an advanced backbone design from 2D object detection brings non-trivial improvements to 3D object detection, which encourages exploring more successful practices from the image domain to upgrade network designs for point clouds.

| Method | Reference | #Frames | Latency (ms) | Speedup (vs. CenterPoint 2021a) | Mean L2 (mAPH) | Vehicle L2 (mAP/APH) | Pedestrian L2 (mAP/APH) | Cyclist L2 (mAP/APH) | #MACs (G) | FPS |
|---|---|---|---|---|---|---|---|---|---|---|
| SECOND¹ | Sensors 2018 | 1 | - | - | 57.2 | 63.9 / 63.3 | 60.7 / 51.3 | 58.3 / 57.0 | - | - |
| PointPillars¹ | CVPR 2019 | 1 | - | - | 57.8 | 63.6 / 63.1 | 62.8 / 50.3 | 61.9 / 59.9 | - | - |
| CenterPoint | CVPR 2021a | 1 | 64.3 | 1.0× | 65.5 | 66.7 / 66.2 | 68.3 / 62.6 | 68.7 / 67.6 | 307.9 | 15.5 |
| PillarNeXt | CVPR 2023 | 1 | 103.2 | 0.6× | 69.1 | 70.3 / 69.8 | 74.9 / 69.8 | 70.6 / 69.6 | 281.0 | 9.7 |
| FSD | NeurIPS 2022b | 1 | 74.3 | 0.9× | 69.7 | 68.9 / 68.5 | 73.2 / 68.0 | 73.8 / 72.5 | - | 13.5 |
| VoxelNeXt | CVPR 2023 | 1 | 48.9 | 1.3× | 68.2 | 69.7 / 69.2 | 72.2 / 65.9 | 70.7 / 69.6 | 38.7 | 20.4 |
| PillarNet | ECCV 2022 | 1 | 38.7 | 1.7× | 67.2 | 70.4 / 69.9 | 71.6 / 64.9 | 67.8 / 66.7 | 319.5 | 25.8 |
| FastPillars | Ours | 1 | 36.5 | 1.8× | 69.3 | 71.5 / 71.1 | 73.2 / 67.2 | 70.5 / 69.5 | 894.7 | 27.4 |
| CenterPoint | CVPR 2021a | 2 | 72.2 | 1.0× | 68.4 | 67.7 / 67.2 | 71.0 / 67.5 | 71.5 / 70.5 | 318.6 | 13.9 |
| PillarNet | ECCV 2022 | 2 | 45.6 | 1.6× | 70.0 | 71.6 / 71.1 | 74.5 / 71.4 | 68.3 / 67.5 | 331.0 | 21.9 |
| FastPillars | Ours | 2 | 41.2 | 1.8× | 72.5 | 72.5 / 72.0 | 75.5 / 72.4 | 73.9 / 73.0 | 895.2 | 24.3 |
| CenterPoint | CVPR 2021a | 3 | 80.8 | 1.0× | - | - | - | - | 325.7 | 12.4 |
| CenterPoint++ | CVPR 2021b | 3 | 50.1 | 1.6× | 71.6 | 71.8 / 71.4 | 73.5 / 70.8 | 73.7 / 72.8 | 294.6 | 19.9 |
| FastPillars | Ours | 3 | 46.0 | 1.8× | 73.3 | 73.2 / 72.8 | 76.3 / 73.2 | 74.6 / 73.8 | 895.6 | 21.7 |

Table 1: Results of single-stage 3D detectors on the Waymo val set. FastPillars achieves a 1.8× speedup over CenterPoint while being more accurate. Methods with <60 L2 mAPH are marked gray due to their low performance. ¹: results taken from the FSD paper.

3.3 Neck and Center-based Head

In the neck block, we follow PillarNet (Shi, Li, and Ma 2022) and adopt an enriched neck design that fuses features of different levels (8× and 16×) for effective interaction of spatial and semantic features. In the head block, we directly use the center-based detection head (Yin, Zhou, and Krahenbuhl 2021a). Besides, as in AFDetV2 (Hu et al. 2022), we use an IoU branch to bridge the gap between the classification and regression predictions.

3.4 Loss Functions

We follow (Yin, Zhou, and Krahenbuhl 2021a) to design our loss functions.

To be specific, for the classification branch, we use the focal loss (Lin et al. 2017) as the heatmap loss $\mathcal{L}_{cls}$. For the 3D box regression, we make use of the L1 loss $\mathcal{L}_{reg}$ to supervise their localization offsets, size and orientation. For the IoU branch, we also utilize the L1 loss $\mathcal{L}_{iou}$ to supervise, where the target 3D IoU score $I$ is $2\times(I-0.5)\in[-1,1]$. Besides, the DIoU loss $\mathcal{L}_{od-iou}$ (Zheng et al. 2020) is added in the regression branch. The overall loss consists of four parts as follows:

$$\mathcal{L}_{total}=\lambda_{1}\mathcal{L}_{cls}+\lambda_{2}\mathcal{L}_{iou}+\lambda_{3}(\mathcal{L}_{od-iou}+\mathcal{L}_{reg}) \tag{4}$$

where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ represent the weights of these losses.
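As a minimal sketch of how Eq. (4) and the rescaled IoU target combine, the snippet below treats each loss term as an already-computed scalar. The function names and the default weights of 1.0 are placeholders chosen for illustration; the actual weight values are not stated in this section.

```python
def iou_target(iou):
    """Rescale a 3D IoU in [0, 1] to the regression target 2 * (I - 0.5) in [-1, 1]."""
    if not 0.0 <= iou <= 1.0:
        raise ValueError("IoU must lie in [0, 1]")
    return 2.0 * (iou - 0.5)

def total_loss(l_cls, l_iou, l_od_iou, l_reg, lam1=1.0, lam2=1.0, lam3=1.0):
    """Eq. (4): weighted sum of heatmap, IoU, and (DIoU + L1 regression) losses.
    The default weights are placeholders, not values from the paper."""
    return lam1 * l_cls + lam2 * l_iou + lam3 * (l_od_iou + l_reg)

# Toy usage with made-up scalar loss values.
print(iou_target(0.75))
print(total_loss(1.0, 2.0, 3.0, 4.0, lam1=1.0, lam2=2.0, lam3=0.5))
```

Note that the DIoU and L1 regression terms share the single weight $\lambda_{3}$, so they cannot be rebalanced against each other without changing the formulation.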

4 Experiments

nuScenes Dataset. The nuScenes (Caesar et al. 2020) dataset contains 700 training scenes, 150 val scenes and 150 test scenes. Each frame contains approximately 30K points captured by a 32-channel LiDAR sampled at 20 Hz. The dataset contains 40K annotated key-frames and 10 categories in total. We report the nuScenes detection score (NDS) and mean Average Precision (mAP), where NDS is the main ranking metric.

Waymo Open Dataset. The Waymo Open Dataset (Sun et al. 2020) contains 1150 sequences in total: 798 for training, 202 for validation and 150 for testing. Each sequence is sampled at 10 Hz with a 64-channel LiDAR, and the dataset contains 6.1M vehicle, 2.8M pedestrian, and 67k cyclist boxes. Each frame covers a scene of 150m×150m. The official evaluation tools evaluate methods at two difficulty levels: LEVEL1 for boxes with more than five LiDAR points, and LEVEL2 for boxes with at least one LiDAR point.

Implementation Details. FastPillars uses the same training schedules as prior methods (Shi, Li, and Ma 2022) with the Adam optimizer under the Det3D (Zhu et al. 2019) framework on 8 A100 GPUs. We use the one-cycle learning rate policy (Smith and Topin 2019) with an initial learning rate of 10e-4, a weight decay of 0.01, and a momentum ranging from 0.85 to 0.95. For a fair comparison, we follow (Yin, Zhou, and Krahenbuhl 2021a) and only use double-flip test-time augmentation without any model ensemble on the nuScenes Dataset. Besides, we also use the ground-truth copy-paste data augmentation from (Yan, Mao, and Li 2018) during training and disable it in the last 5 epochs following (Wang et al. 2021) (i.e., the fade strategy).
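For readers unfamiliar with the one-cycle policy, the sketch below shows its general shape: the learning rate rises to a peak and then anneals, while momentum moves in the opposite direction between 0.95 and 0.85. Only the momentum range comes from the text above; the cosine interpolation, `peak_lr`, `div_factor`, `pct_start` and the final annealing floor are illustrative assumptions and not the Det3D configuration.

```python
import math

def _cos_interp(start, end, frac):
    """Cosine interpolation from start (frac=0) to end (frac=1)."""
    return end + (start - end) * (1.0 + math.cos(math.pi * frac)) / 2.0

def one_cycle(step, total_steps, peak_lr, div_factor=10.0, pct_start=0.4,
              mom_high=0.95, mom_low=0.85):
    """Return (learning_rate, momentum) for a 0-based training step.
    All default hyperparameters are illustrative assumptions."""
    warm = max(1, int(total_steps * pct_start))
    low_lr = peak_lr / div_factor
    if step < warm:
        # Phase 1: learning rate rises, momentum falls.
        frac = step / warm
        return (_cos_interp(low_lr, peak_lr, frac),
                _cos_interp(mom_high, mom_low, frac))
    # Phase 2: learning rate anneals towards a small floor, momentum recovers.
    frac = (step - warm) / max(1, total_steps - warm)
    return (_cos_interp(peak_lr, low_lr * 1e-4, frac),
            _cos_interp(mom_low, mom_high, frac))

# Toy schedule over 100 steps with an illustrative peak rate.
schedule = [one_cycle(s, 100, peak_lr=1e-3) for s in range(100)]
for s in (0, 20, 40, 70, 99):
    lr, mom = schedule[s]
    print(f"step {s:3d}: lr={lr:.6f} momentum={mom:.3f}")
```

Coupling high learning rates with low momentum (and vice versa) is the characteristic feature of the policy; the exact phase lengths and bounds are tuning choices.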

| Methods | Reference | Stages | Frames | Veh. L1 mAP | Veh. L1 mAPH | Veh. L2 mAP | Veh. L2 mAPH | Ped. L1 mAP | Ped. L1 mAPH | Ped. L2 mAP | Ped. L2 mAPH | Cyc. L1 mAP | Cyc. L1 mAPH | Cyc. L2 mAP | Cyc. L2 mAPH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PointPillars† | CVPR 2019 | One | 1 | 68.60 | 68.10 | 60.50 | 60.10 | 68.00 | 55.50 | 61.40 | 50.10 | - | - | - | - |
| RCD | CoRL 2020 | Two | 1 | 71.97 | 71.59 | 65.06 | 64.70 | - | - | - | - | - | - | - | - |
| CenterPoint | CVPR 2021a | Two | 1 | 80.20 | 79.70 | 72.20 | 71.80 | 78.30 | 72.10 | 72.20 | 66.40 | - | - | - | - |
| PV-RCNN | CVPR 2020a | Two | 1 | 80.60 | 80.15 | 72.81 | 72.39 | 78.16 | 72.01 | 71.81 | 66.05 | 71.80 | 70.42 | 69.13 | 67.80 |
| AFDetV2 | AAAI 2022 | One | 1 | 80.49 | 80.43 | 72.98 | 72.55 | 79.76 | 74.35 | 73.71 | 68.61 | 72.43 | 71.23 | 69.84 | 68.67 |
| PV-RCNN++ | IJCV 2022 | Two | 1 | 81.62 | 81.20 | 73.86 | 73.47 | 80.41 | 74.99 | 74.12 | 69.00 | 71.93 | 70.76 | 69.28 | 68.15 |
| PillarNet-34‡ | ECCV 2022 | One | 1 | 81.91 | 81.47 | 74.43 | 74.01 | 80.22 | 73.43 | 74.16 | 67.78 | 69.00 | 67.79 | 66.48 | 65.35 |
| FastPillars | Ours | One | 1 | 82.75 | 82.31 | 75.42 | 75.01 | 81.05 | 74.84 | 75.09 | 69.22 | 72.94 | 71.82 | 70.26 | 69.23 |
| CenterPoint | CVPR 2021a | Two | 2 | 81.05 | 80.59 | 73.42 | 72.99 | 80.47 | 77.28 | 74.56 | 71.52 | 74.60 | 73.68 | 72.17 | 71.28 |
| PV-RCNN | CVPR 2020a | Two | 2 | 81.06 | 80.57 | 73.69 | 73.23 | 80.31 | 76.28 | 73.98 | 70.16 | - | - | - | - |
| VISTA | CVPR 2022 | One | 2 | 81.70 | 81.30 | 74.40 | 74.00 | 81.40 | 78.30 | 75.50 | 72.50 | 74.90 | 73.90 | 72.50 | 71.60 |
| Pyramid R-CNN | ICCV 2021 | Two | 2 | 81.77 | 81.32 | 74.87 | 74.43 | - | - | - | - | - | - | - | - |
| AFDetV2 | AAAI 2022 | One | 2 | 81.65 | 81.22 | 74.30 | 73.89 | 81.26 | 78.05 | 75.47 | 72.41 | 76.41 | 75.37 | 74.05 | 73.04 |
| PillarNet-34‡ | ECCV 2022 | One | 2 | 82.80 | 82.37 | 75.65 | 75.25 | 82.17 | 78.73 | 76.49 | 73.21 | 71.25 | 70.27 | 69.05 | 68.10 |
| PV-RCNN++ | IJCV 2022 | Two | 2 | 83.74 | 83.32 | 76.31 | 75.92 | 82.60 | 79.38 | 76.63 | 73.55 | 74.44 | 73.43 | 72.06 | 71.09 |
| FastPillars | Ours | One | 2 | 83.60 | 83.17 | 76.52 | 76.12 | 82.82 | 79.45 | 77.20 | 73.90 | 76.52 | 75.45 | 74.14 | 73.11 |
| 3D-MAN | CVPR 2021 | Multi | 16 | 78.71 | 78.28 | 70.37 | 69.98 | 69.97 | 65.98 | 63.98 | 60.26 | - | - | - | - |
| RSN | CVPR 2021 | Two | 3 | 80.70 | 80.30 | 71.90 | 71.60 | 78.90 | 75.60 | 70.70 | 67.80 | - | - | - | - |
| CenterPoint++ | CVPR 2021a | Two | 3 | 82.78 | 82.33 | 75.47 | 75.05 | 81.07 | 78.21 | 75.13 | 72.41 | 74.40 | 73.33 | 72.04 | 71.01 |
| SST | CVPR 2022a | Two | 3 | 80.99 | 80.62 | 73.08 | 73.74 | 83.05 | 79.38 | 76.65 | 73.14 | - | - | - | - |
| SWFormer_3f | ECCV 2022 | One | 3 | 82.89 | 82.49 | 75.02 | 74.65 | 82.13 | 78.13 | 75.87 | 72.07 | - | - | - | - |
| PillarNeXt | CVPR 2023 | One | 3 | 83.28 | 82.83 | 76.18 | 75.76 | 84.40 | 81.44 | 78.84 | 75.98 | 73.77 | 72.73 | 71.56 | 70.55 |
| FastPillars | Ours | One | 3 | 84.02 | 83.59 | 77.09 | 76.68 | 83.26 | 80.01 | 77.76 | 74.62 | 76.61 | 75.49 | 74.18 | 73.16 |

Table 2: Single- and multi-frame LiDAR-only non-ensemble performance comparison on the Waymo test set. † denotes results reported by RSN (Sun et al. 2021); ‡ denotes results reproduced with the official codebase (Shi, Li, and Ma 2022).

4.1 Overall Results

First, we compare FastPillars with real-time 3D detectors on the Waymo val set. Second, we evaluate FastPillars on the Waymo and nuScenes test sets. Finally, we ablate the MAPE module and the backbone design.

| Method | Reference | Stages | NDS | mAP |
|---|---|---|---|---|
| PointPillars | CVPR 2019 | One | 45.3 | 30.5 |
| 3DSSD | CVPR 2020 | One | 56.4 | 42.6 |
| CenterPoint | CVPR 2021a | Two | 65.5 | 58.0 |
| FCOS-LiDAR | NeurIPS 2022 | One | 65.7 | 60.2 |
| VMVF | CVPR 2022 | One | 67.3 | 60.9 |
| AFDetV2 | AAAI 2022 | One | 68.5 | 62.4 |
| UVTR-L | CVPR 2022b | One | 69.7 | 63.9 |
| VISTA | CVPR 2022 | One | 69.8 | 63.0 |
| Focals Conv | CVPR 2022 | One | 70.0 | 63.8 |
| PillarNet | ECCV 2022 | One | 70.8 | 65.0 |
| VoxelNeXt | CVPR 2023 | One | 71.4 | 66.2 |
| FastPillars | Ours | One | 71.8 | 66.8 |

Table 3: State-of-the-art comparisons for 3D detection on the nuScenes test set. The table is mainly sorted by the nuScenes detection score (NDS), which is the official ranking metric.

Comparison with one-stage real-time methods. We compare FastPillars with state-of-the-art SPConv-based 3D detectors under different input-frame settings. For a fair comparison, we measure the end-to-end latency on an NVIDIA Tesla V100 GPU using FP16 precision and report the whole running time. We adopt SpConv v2.1.23 (Contributors 2022) to execute the SPConv-based backbones. All modules after the backbone are executed with TensorRT 8.6. All methods are executed on the first 1,000 samples for 50 runs. Notably, FastPillars is 1.8× faster and 3.8 mAPH L2 higher than CenterPoint in the single- and two-frame settings, and is faster and 4.9 mAPH L2 higher overall, showing a better trade-off between accuracy and latency across input settings. Our effective model design makes it a viable replacement for previous real-time state-of-the-art 3D detectors such as CenterPoint and PillarNet.
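The speedup and FPS columns of Table 1 follow directly from the measured latencies. The short sketch below recomputes them for CenterPoint and FastPillars; the latency values are copied from Table 1, while the helper names are hypothetical.

```python
# Latency in milliseconds, copied from Table 1: (method, number of frames) -> ms.
LATENCY_MS = {
    ("CenterPoint", 1): 64.3, ("FastPillars", 1): 36.5,
    ("CenterPoint", 2): 72.2, ("FastPillars", 2): 41.2,
    ("CenterPoint", 3): 80.8, ("FastPillars", 3): 46.0,
}

def fps(latency_ms):
    """Frames per second implied by a per-frame latency in milliseconds."""
    return 1000.0 / latency_ms

def speedup(baseline_ms, latency_ms):
    """How many times faster a method is than the baseline."""
    return baseline_ms / latency_ms

for frames in (1, 2, 3):
    base = LATENCY_MS[("CenterPoint", frames)]
    ours = LATENCY_MS[("FastPillars", frames)]
    print(f"{frames}-frame: speedup {speedup(base, ours):.2f}x, FPS {fps(ours):.1f}")
```

In all three settings the ratio is roughly 1.75 to 1.76, which rounds to the reported 1.8×.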

Evaluation on Waymo test set. For a more comprehensive comparison, we also evaluate FastPillars against published methods on the Waymo test set. As shown in Table 2, with single-frame input, FastPillars outperforms previous one-stage and two-stage 3D detectors on vehicle and pedestrian detection with remarkable performance gains (+1.44 mAPH L2 for pedestrians). With two-frame input, FastPillars consistently shows superior performance compared with its single-frame counterpart. These experiments show that FastPillars, using computation reallocation and a reparameterization-based structural design, achieves superior performance on the large-scale Waymo Open Dataset. With three-frame input, our best model outperforms the two-stage SST (Fan et al. 2022a) on the challenging pedestrian class with remarkable performance gains (+1.38 mAPH L2). Notably, SST specially designs a single-stride transformer architecture for small object detection in 3D space, whereas our method predicts objects of all sizes on a feature map of the same resolution (8× stride). We attribute this leading performance to the effective design of our MAPE module. FastPillars is 2.03 and 2.55 mAPH L2 ahead of the transformer-based SWFormer (Pei Sun and Anguelov 2022) on the vehicle and pedestrian classes, respectively, with the same temporal information (3 frames). PillarNeXt surpasses FastPillars on the pedestrian class in the 3-frame setting due to its smaller pillar size (0.075 m), which in turn makes PillarNeXt more computationally expensive and unable to run in real time. Moreover, FastPillars outperforms PillarNeXt on the vehicle and cyclist classes, owing to our efficient backbone design.

Evaluation on nuScenes test set. In Table 3, we compare FastPillars with other LiDAR-only non-ensemble methods on the nuScenes test set. FastPillars achieves 71.8 NDS and 66.8 mAP, both better than previous methods. These experiments verify the excellent performance of FastPillars even without SPConv.

4.2 Ablation Experiments

We conduct ablation studies for the MAPE module and efficient backbone design to analyze their effect on the latency and performance on the Waymo val set.

| Methods | Vehicle L2 mAP | Vehicle L2 mAPH | Ped. L2 mAP | Ped. L2 mAPH | Cyc. L2 mAP | Cyc. L2 mAPH |
|---|---|---|---|---|---|---|
| Max-pool | 71.2 | 70.7 | 71.9 | 65.4 | 68.8 | 67.8 |
| MAPE (Ours) | 71.6 | 71.1 | 72.9 | 67.0 | 69.6 | 68.6 |
| improvement | +0.4 | +0.4 | +1.0 | +1.6 | +0.8 | +0.8 |

Table 4: MAPE ablation on the Waymo val set. MAPE shows remarkable improvements, especially for small objects.

MAPE module. As shown in Tab. 4, compared with the common max-pooling operation, the MAPE module improves the performance on vehicles, pedestrians and cyclists to different degrees on the Waymo val set. Notably, MAPE boosts the pedestrian category with a remarkable gain (+1.6 mAPH L2). We further visualize the attention scores for different categories with the MAPE module. As shown in Fig. 7, MAPE pays more attention to object semantic information (the contour of a car or pedestrian). Conversely, max-pooling loses much of the geometric detail. The qualitative and quantitative results show that our MAPE module effectively encodes both the local fine-grained geometrical patterns of objects and the most prominent features (i.e., the maximum value) by combining the attention- and max-pooling operations, and improves the perceptual ability in the BEV perspective. See the supplementary material for more ablation studies.

Figure 7: Visualization of the attention scores on a car (a-b) and a pedestrian (c-d) using max-pooling or MAPE. The size of each point indicates its score; a point with a higher score receives more attention. Comparing (a) with (b) and (c) with (d), the MAPE module pays more attention to object semantics and local geometries. Best viewed in color.

Computation Reallocation in Backbone. We conduct experiments by varying the number of blocks in one stage from 0 to 8 in steps of 2, while fixing the number of blocks in the other stages to 2, as shown in Tab. 5. We observe different behavior in different stages.

| Stage | Ratio | Mean L2 mAPH | Veh. L2 mAPH | Ped. L2 mAPH | Cyc. L2 mAPH | GFLOPS |
|---|---|---|---|---|---|---|
| 1 | 0, 2, 2, 2 | 61.0 | 67.1 | 56.1 | 59.6 | 278.0 |
| 1 | 2, 2, 2, 2† | 62.8 | 67.7 | 58.5 | 62.3 | 355.1 |
| 1 | 4, 2, 2, 2 | 63.4 | 68.2 | 59.5 | 62.6 | 432.2 |
| 1 | 6, 2, 2, 2 | 64.1 (+3.1) | 68.4 | 59.9 | 64.0 | 509.3 |
| 2 | 2, 0, 2, 2 | 62.0 | 67.2 | 57.6 | 61.2 | 278.0 |
| 2 | 2, 2, 2, 2† | 62.8 | 67.7 | 58.5 | 62.3 | 355.1 |
| 2 | 2, 4, 2, 2 | 63.4 | 68.2 | 59.5 | 62.7 | 432.2 |
| 2 | 2, 6, 2, 2 | 64.0 (+2.0) | 68.8 | 59.7 | 63.5 | 509.3 |
| 3 | 2, 2, 0, 2 | 62.4 | 67.0 | 58.3 | 62.0 | 278.0 |
| 3 | 2, 2, 2, 2† | 62.8 | 67.7 | 58.5 | 62.3 | 355.1 |
| 3 | 2, 2, 4, 2 | 63.4 | 68.4 | 58.9 | 62.7 | 432.2 |
| 3 | 2, 2, 6, 2 | 63.0 (+0.6) | 68.4 | 58.7 | 62.0 | 509.3 |
| 4 | 2, 2, 2, 0 | 62.9 | 67.9 | 58.9 | 61.8 | 278.0 |
| 4 | 2, 2, 2, 2† | 62.8 | 67.7 | 58.5 | 62.3 | 355.1 |
| 4 | 2, 2, 2, 4 | 62.9 | 67.8 | 58.7 | 62.2 | 432.2 |
| 4 | 2, 2, 2, 6 | 62.8 (-0.1) | 67.6 | 58.7 | 62.2 | 509.3 |
| Res-34 | 3, 4, 6, 3 | 63.5 | 68.8 | 59.6 | 62.2 | 660.3 |
| Ours | 6, 6, 3, 1 | 64.8 (+1.3) | 69.1 | 60.8 | 64.5 | 660.3 |

Table 5: Performance of various stage compute ratios on the Waymo val set, trained with 20% of the training data. † denotes the stage compute ratio of Res-18.

  • In stage 1, increasing the number of blocks remarkably improves the overall performance, from 61.0 to 64.1, a gain of 3.1 mAPH L2. Specifically, the mAPH L2 of vehicles, pedestrians and cyclists improves by 1.3, 3.8 and 4.4, respectively.

  • In stage 2, increasing the number of blocks also brings a significant improvement in the overall performance, from 62.0 to 64.0 mAPH L2. Although the improvement is smaller than in stage 1 (+2.0 vs. +3.1), it still shows consistent gains for different types of objects (vehicles +1.6, pedestrians +2.1, cyclists +2.3 mAPH L2).

  • In stage 3, increasing the number of blocks brings a slight improvement in the overall performance, from 62.4 to 63.0 mAPH L2. The vehicle class still shows a continuous improvement, from 67.0 to 68.4, while the pedestrian and cyclist classes show relatively weak gains.

  • In stage 4, increasing the number of blocks yields little overall performance improvement and performance for each category is almost unchanged.

Based on the above results, we conclude that (1) increasing the number of blocks in stages (1, 2, 3) improves the overall detection performance, especially in stages (1, 2); (2) across object classes, the vehicle class improves consistently as blocks are added in stages (1, 2, 3) and shows no improvement in stage 4; (3) the pedestrian and cyclist classes improve consistently as blocks are added in stages (1, 2) and show very little or even negative gain in stages (3, 4).

The results support our previous assumption: for pillar-based LiDAR detection tasks, capacity should be reallocated to the early stages to better integrate the geometric information carried by the raw points, instead of being concentrated in the later stages as in the ResNet setting. In particular, allocating more computing resources to the early stages effectively improves accuracy on small objects. Therefore, to achieve better performance under a low computation budget, we set the stage compute ratio of the four stages to (6, 6, 3, 1), which has a computation cost comparable to ResNet-34 with (3, 4, 6, 3). Besides, we consider the proposed compute ratios of the different stages to be coarse. Taking into account factors such as resolution, channel dimensions, and stage depths of the backbone, neural architecture search (NAS) techniques could be used to achieve an improved capacity reallocation. This remains an open problem for future research by the community.
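The stage-wise sensitivity discussed above can be read off Table 5 with a few lines of arithmetic. The sketch below copies the mean mAPH L2 values from the table, computes the gain from 0 to 6 blocks in each stage, and ranks the stages; the variable and function names are hypothetical.

```python
# Mean mAPH L2 from Table 5: stage -> {blocks in that stage: score},
# with the other three stages fixed at 2 blocks.
TABLE5_MEAN = {
    1: {0: 61.0, 2: 62.8, 4: 63.4, 6: 64.1},
    2: {0: 62.0, 2: 62.8, 4: 63.4, 6: 64.0},
    3: {0: 62.4, 2: 62.8, 4: 63.4, 6: 63.0},
    4: {0: 62.9, 2: 62.8, 4: 62.9, 6: 62.8},
}

def gain(stage):
    """Change in mean mAPH L2 when a stage grows from 0 to 6 blocks."""
    scores = TABLE5_MEAN[stage]
    return round(scores[6] - scores[0], 1)

gains = {stage: gain(stage) for stage in TABLE5_MEAN}
ranking = sorted(gains, key=gains.get, reverse=True)
print(gains)    # per-stage gains
print(ranking)  # stages ordered from most to least sensitive

# Both ratios compared in the text use the same total number of blocks.
ours, resnet34 = (6, 6, 3, 1), (3, 4, 6, 3)
print(sum(ours), sum(resnet34))
```

The ranking places stages 1 and 2 ahead of stages 3 and 4, matching the conclusion that depth is best spent early. The chosen ratio (6, 6, 3, 1) keeps the total block count of ResNet-34 while shifting blocks towards those early stages.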

| Setting | Ratio | mAPH L2 ↑ | Latency (ms) ↓ | Params (M) ↓ | FLOPs (G) ↓ |
|---|---|---|---|---|---|
| Vanish | 3,4,6,3 | 63.5 | 17.7 | 24.2 | 660.3 |
| Backbone | 6,6,3,1 | 64.8 | 17.6 | 12.1 | 660.3 |
| +Rep | 6,6,3,1 | 65.4 | 16.0 | 12.1 | 660.3 |
| improve | - | +1.9 | -1.7 | -12.1 | - |

Table 6: Ablation on the lightweight backbone design. Our backbone can effectively improve performance (+1.9 mAPH L2) while reducing latency (11%) and memory cost (50%).

Lightweight Backbone Design. Tab. 6 demonstrates the efficacy of our backbone in reducing the number of parameters (50%) and improving performance (+1.3 mAPH L2) through computation reallocation. Furthermore, introducing the re-parameterized structure brings an 11% reduction in latency and a further 0.6 mAPH L2 gain. Overall, our backbone achieves a remarkable 1.9 mAPH L2 improvement and an 11% acceleration, proving its effectiveness.
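The relative figures quoted for Tab. 6 depend on which value is taken as the reference. The sketch below recomputes them from the first and last rows of the table. Reading the 11% figure as a throughput gain (old latency divided by new latency, minus one) is an assumption made here for illustration, and the helper names are hypothetical.

```python
# First and last rows of Table 6.
baseline = {"mAPH_L2": 63.5, "latency_ms": 17.7, "params_M": 24.2}
final = {"mAPH_L2": 65.4, "latency_ms": 16.0, "params_M": 12.1}

def relative_reduction(before, after):
    """Fraction by which a quantity shrinks, relative to its original value."""
    return (before - after) / before

def relative_speedup(before_ms, after_ms):
    """Throughput gain implied by a latency drop (assumed reading of 'acceleration')."""
    return before_ms / after_ms - 1.0

param_cut = relative_reduction(baseline["params_M"], final["params_M"])
latency_cut = relative_reduction(baseline["latency_ms"], final["latency_ms"])
accel = relative_speedup(baseline["latency_ms"], final["latency_ms"])
acc_gain = final["mAPH_L2"] - baseline["mAPH_L2"]

print(f"parameters: -{param_cut:.1%}")
print(f"latency: -{latency_cut:.1%} (equivalently +{accel:.1%} throughput)")
print(f"accuracy: +{acc_gain:.1f} mAPH L2")
```

Under these definitions the 1.7 ms saving corresponds to a latency drop of a little under 10% of the original latency, or a throughput gain of a little under 11%.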

5 Conclusion

In this paper, we propose FastPillars, a real-time one-stage pillar-based 3D detector, to simultaneously improve detection accuracy and runtime efficiency while keeping deployment in mind. In particular, we show that SPConv can be safely sidestepped with a redesigned lightweight and effective backbone. Moreover, we also propose a Max-and-Attention Pillar Encoding (MAPE) module to compensate for the information loss in pillar encoding.

Extensive experiments show that our FastPillars achieves a better trade-off between speed and accuracy, and can be deployed through TensorRT for real-time on-device applications. Given its effectiveness and efficiency, we hope that our method can serve as a strong and simple alternative to current mainstream SPConv-based 3D detectors. Besides, we believe that improved computation reallocation, taking into account factors such as resolution, channel dimensions, and stage depths for the backbone, could be achieved using the neural architecture search (NAS) technique. This remains an open problem for future research.

References

  • Bewley et al. (2020) Bewley, A.; Sun, P.; Mensink, T.; Anguelov, D.; and Sminchisescu, C. 2020. Range Conditioned Dilated Convolutions for Scale Invariant 3D Object Detection. In Conference on Robot Learning (CoRL).
  • Bochkovskiy, Wang, and Liao (2020) Bochkovskiy, A.; Wang, C.-Y.; and Liao, H.-Y. M. 2020. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934.
  • Caesar et al. (2020) Caesar, H.; Bankiti, V.; Lang, A.; Vora, S.; Liong, V. E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; and Beijbom, O. 2020. Nuscenes: A multimodal dataset for autonomous driving. In CVPR, 11621–11631.
  • Chen et al. (2017) Chen, X.; Ma, H.; Wan, J.; Li, B.; and Xia, T. 2017. Multi-view 3D object detection network for autonomous driving. In CVPR, 1907–1915.
  • Chen et al. (2022) Chen, Y.; Li, Y.; Zhang, X.; Sun, J.; and Jia, J. 2022. Focal Sparse Convolutional Networks for 3D Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Chen et al. (2023) Chen, Y.; Liu, J.; Zhang, X.; Qi, X.; and Jia, J. 2023. VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Contributors (2022) Contributors, S. 2022. Spconv: Spatially Sparse Convolution Library.
  • Deng et al. (2021) Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; and Li, H. 2021. Voxel R-CNN: Towards High Performance Voxel-based 3D Object Detection. In AAAI.
  • Deng et al. (2022) Deng, S.; Liang, Z.; Sun, L.; and Jia, K. 2022. VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8448–8457.
  • Ding et al. (2021) Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; and Sun, J. 2021. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13733–13742.
  • Fan et al. (2022a) Fan, L.; Pang, Z.; Zhang, T.; Wang, Y.-X.; Zhao, H.; Wang, F.; Wang, N.; and Zhang, Z. 2022a. Embracing single stride 3d object detector with sparse transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8458–8468.
  • Fan et al. (2022b) Fan, L.; Wang, F.; Wang, N.; and ZHANG, Z.-X. 2022b. Fully sparse 3d object detection. Advances in Neural Information Processing Systems, 35: 351–363.
  • Fazlali et al. (2022) Fazlali, H.; Xu, Y.; Ren, Y.; and Liu, B. 2022. A versatile multi-view framework for lidar-based 3d object detection with guidance from panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 17192–17201.
  • Glenn (2022) Glenn, J. 2022. YOLOv5 release v6.1. https://github.com/ultralytics/yolov5/releases/tag/v6.1.
  • Graham and van der Maaten (2017) Graham, B.; and van der Maaten, L. 2017. Submanifold sparse convolutional networks. arXiv preprint arXiv:1706.01307.
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
  • Hu et al. (2022) Hu, Y.; Ding, Z.; Ge, R.; Shao, W.; Huang, L.; Li, K.; and Liu, Q. 2022. Afdetv2: Rethinking the necessity of the second stage for object detection from point clouds. In Proceedings of the AAAI Conference on Artificial Intelligence, 36(1): 969–979.
  • Kuang et al. (2020) Kuang, H.; Wang, B.; An, J.; Zhang, M.; and Zhang, Z. 2020. Voxel-FPN: Multi-scale voxel feature aggregation for 3D object detection from LIDAR point clouds. Sensors, 20(3): 704.
  • Lang et al. (2019) Lang, A. H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; and Beijbom, O. 2019. Pointpillars: Fast encoders for object detection from point clouds. In CVPR, 12697–12705.
  • Li et al. (2022a) Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. 2022a. YOLOv6: a single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976.
  • Li, Luo, and Yang (2023) Li, J.; Luo, C.; and Yang, X. 2023. PillarNeXt: Rethinking Network Designs for 3D Object Detection in LiDAR Point Clouds. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Li et al. (2022b) Li, Y.; Chen, Y.; Qi, X.; Li, Z.; Sun, J.; and Jia, J. 2022b. Unifying Voxel-based Representation with Transformer for 3D Object Detection. In Advances in Neural Information Processing Systems.
  • Lin et al. (2017) Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, 2980–2988.
  • Mao et al. (2021) Mao, J.; Niu, M.; Bai, H.; Liang, X.; Xu, H.; and Xu, C. 2021. Pyramid R-CNN: Towards Better Performance and Adaptability for 3D Object Detection. In ICCV, 2723–2732.
  • Pei Sun and Anguelov (2022) Sun, P.; Tan, M.; Wang, W.; Liu, C.; Xia, F.; Leng, Z.; and Anguelov, D. 2022. Swformer: Sparse window transformer for 3d object detection in point clouds. In ECCV.
  • Qi et al. (2019) Qi, C. R.; Litany, O.; He, K.; and Guibas, L. J. 2019. Deep hough voting for 3D object detection in point clouds. In ICCV, 9277–9286.
  • Qi et al. (2018) Qi, C. R.; Liu, W.; Wu, C.; Su, H.; and Guibas, L. J. 2018. Frustum pointnets for 3D object detection from rgb-d data. In CVPR, 918–927.
  • Qi et al. (2017a) Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017a. Pointnet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 652–660.
  • Qi et al. (2017b) Qi, C. R.; Yi, L.; Su, H.; and Guibas, L. J. 2017b. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, 5099–5108.
  • Shi, Li, and Ma (2022) Shi, G.; Li, R.; and Ma, C. 2022. PillarNet: Real-Time and High-Performance Pillar-based 3D Object Detection. In ECCV.
  • Shi et al. (2020a) Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; and Li, H. 2020a. Pv-rcnn: Point-voxel feature set abstraction for 3D object detection. In CVPR, 10529–10538.
  • Shi et al. (2022) Shi, S.; Jiang, L.; Deng, J.; Wang, Z.; Guo, C.; Shi, J.; Wang, X.; and Li, H. 2022. PV-RCNN++: Point-Voxel Feature Set Abstraction With Local Vector Representation for 3D Object Detection. IJCV.
  • Shi, Wang, and Li (2019) Shi, S.; Wang, X.; and Li, H. 2019. Pointrcnn: 3D object proposal generation and detection from point cloud. In CVPR, 770–779.
  • Shi et al. (2020b) Shi, S.; Wang, Z.; Shi, J.; Wang, X.; and Li, H. 2020b. From points to parts: 3D object detection from point cloud with part-aware and part-aggregation network. IEEE TPAMI.
  • Simonyan and Zisserman (2014) Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • Smith and Topin (2019) Smith, L. N.; and Topin, N. 2019. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial intelligence and machine learning for multi-domain operations applications, volume 11006, 369–386. SPIE.
  • Sun et al. (2020) Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. 2020. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2446–2454.
  • Sun et al. (2021) Sun, P.; Wang, W.; Chai, Y.; Elsayed, G.; Bewley, A.; Zhang, X.; Sminchisescu, C.; and Anguelov, D. 2021. RSN: Range Sparse Net for Efficient, Accurate LiDAR 3D Object Detection. In CVPR, 5725–5734.
  • Tian et al. (2022) Tian, Z.; Chu, X.; Wang, X.; Wei, X.; and Shen, C. 2022. Fully Convolutional One-Stage 3D Object Detection on LiDAR Range Images. NeurIPS.
  • Wang et al. (2021) Wang, C.; Ma, C.; Zhu, M.; and Yang, X. 2021. Pointaugmenting: Cross-modal augmentation for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11794–11803.
  • Wang, Bochkovskiy, and Liao (2022) Wang, C.-Y.; Bochkovskiy, A.; and Liao, H.-Y. M. 2022. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696.
  • Xu et al. (2022) Xu, S.; Wang, X.; Lv, W.; Chang, Q.; Cui, C.; Deng, K.; Wang, G.; Dang, Q.; Wei, S.; Du, Y.; et al. 2022. PP-YOLOE: An evolved version of YOLO. arXiv preprint arXiv:2203.16250.
  • Yan, Mao, and Li (2018) Yan, Y.; Mao, Y.; and Li, B. 2018. Second: Sparsely embedded convolutional detection. Sensors, 18(10): 3337.
  • Yang et al. (2020) Yang, Z.; Sun, Y.; Liu, S.; and Jia, J. 2020. 3dssd: Point-based 3D single stage object detector. In CVPR, 11040–11048.
  • Yang et al. (2021) Yang, Z.; Zhou, Y.; Chen, Z.; and Ngiam, J. 2021. 3D-MAN: 3D Multi-frame Attention Network for Object Detection. In CVPR, 1863–1872.
  • Yin, Zhou, and Krahenbuhl (2021a) Yin, T.; Zhou, X.; and Krahenbuhl, P. 2021a. Center-based 3d object detection and tracking. In CVPR, 11784–11793.
  • Yin, Zhou, and Krahenbuhl (2021b) Yin, T.; Zhou, X.; and Krahenbuhl, P. 2021b. CenterPoint++ Submission to the Waymo Real-time 3D Detection Challenge. Accessed: 2021-12-05.
  • Zheng et al. (2020) Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; and Ren, D. 2020. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI conference on artificial intelligence, volume 34, 12993–13000.
  • Zhou and Tuzel (2018) Zhou, Y.; and Tuzel, O. 2018. Voxelnet: End-to-end learning for point cloud based 3D object detection. In CVPR, 4490–4499.
  • Zhu et al. (2019) Zhu, B.; Jiang, Z.; Zhou, X.; Li, Z.; and Yu, G. 2019. Class-balanced Grouping and Sampling for Point Cloud 3D Object Detection. arXiv preprint arXiv:1908.09492.

Appendix A More Ablation Study

| Methods | Mean L2 mAPH | Vehicle L2 mAP | Vehicle L2 mAPH | Ped. L2 mAP | Ped. L2 mAPH | Cyc. L2 mAP | Cyc. L2 mAPH |
|---|---|---|---|---|---|---|---|
| CenterPoint-PP | 55.9 | 62.5 | 62.0 | 61.8 | 51.5 | 56.1 | 54.3 |
| CenterPoint-PP + MAPE | 57.0 | 63.0 | 62.5 | 63.5 | 53.5 | 56.7 | 54.9 |
| improvement | +1.1 | +0.5 | +0.5 | +1.7 | +2.0 | +0.6 | +0.6 |

Table 7: Applying the MAPE module to CenterPoint-Pillar on the Waymo val set, trained with 20% of the training data.

MAPE with PointPillars.

To evaluate the generalization of the proposed MAPE module, we further apply the proposed MAPE module to pillar encoding of CenterPoint-Pillar. As shown in Table 7, MAPE shows remarkable improvements in pedestrian class (+2.0 mAPH L2) and an overall performance gain of 1.1 mAPH L2. This result shows that by introducing attention-pooling operation in pillar encoding, the fine-grained geometrical patterns of objects can be better encoded, thus improving the performance of small objects.

Atten Avg Max mAP NDS
(a) 60.5 67.1
(b) 60.5 66.7
(c) 61.2 67.7
(d) 61.2 67.7
(e) 59.8 67.0
(f) 61.1 67.7
MAPE 61.5 68.1
Table 8: Performance comparison of different pooling operations on the nuScenes val set.

Different Pooling Operations in MAPE.

We compare the performance of different pooling operations in Tab. 8. Using only the attention-pooling operation achieves 60.5 mAP, while using only the max-pooling operation achieves a higher 61.2 mAP. A likely reason is that max-pooling retains the most representative feature, while attention-pooling focuses on the fine-grained information of individual points. However, the combination of attention- and max-pooling (i.e., MAPE) obtains the best performance with a mere 4 ms of added latency compared with the max-pooling operation commonly used in PointPillars (Lang et al. 2019). This is because the proposed MAPE module not only learns the most representative features effectively, but also attends to local geometrical patterns automatically, which benefits small objects.
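To make the difference between the pooling variants concrete, the following sketch pools the per-point features of a single pillar in three ways. It is only an illustration: the linear scoring vector `score_w`, the softmax over points, and the element-wise averaging used to combine the two pooled vectors are assumptions made here, and the exact MAPE formulation is the one given in the method section.

```python
import math

def max_pool(feats):
    """Channel-wise maximum over the points of one pillar. feats is [N][C]."""
    return [max(channel) for channel in zip(*feats)]

def attention_pool(feats, score_w):
    """Softmax-weighted sum of point features.
    Scores come from a hypothetical linear projection score_w of length C."""
    scores = [sum(w * v for w, v in zip(score_w, f)) for f in feats]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(a * f[c] for a, f in zip(weights, feats))
            for c in range(len(feats[0]))]

def max_and_attention_pool(feats, score_w):
    """Combine both pooled vectors; averaging is an assumed combination rule."""
    pooled_max = max_pool(feats)
    pooled_att = attention_pool(feats, score_w)
    return [(m + a) / 2.0 for m, a in zip(pooled_max, pooled_att)]

# Toy pillar with two points and two feature channels.
pillar = [[1.0, 0.0], [3.0, 2.0]]
uniform_scores = [0.0, 0.0]  # zero scores give equal attention to every point
print(max_pool(pillar))
print(attention_pool(pillar, uniform_scores))
print(max_and_attention_pool(pillar, uniform_scores))
```

Max-pooling keeps only the strongest response per channel, whereas the attention branch lets every point contribute in proportion to its score, which is the complementary behavior the ablation attributes to MAPE.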

Appendix B Detailed Breakup Time of FastPillars

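| Device | Method | PFE/VFE (ms) | Backbone (ms) | Neck-Head (ms) | P-P (ms) | Overall (ms) | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| V100-32G | CenterPoint-1f | 20.7 | 29.5 | 7.9 | 6.2 | 64.3 | 15.5 |
| V100-32G | FastPillars-1f | 6.7 | 16.0 | 8.4 | 5.4 | 36.5 | 27.4 |
| V100-32G | CenterPoint-2f | 25.2 | 32.3 | 7.9 | 6.8 | 72.2 | 13.9 |
| V100-32G | FastPillars-2f | 10.9 | 16.0 | 8.4 | 5.9 | 41.2 | 24.3 |
| A100-80G | FastPillars-1f | 5.1 | 14.3 | 7.0 | 4.5 | 30.9 | 32.4 |
| A100-80G | FastPillars-2f | 9.2 | 14.3 | 7.0 | 5.1 | 35.6 | 28.1 |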
Table 9: Inference time (ms) on Waymo val set.

As shown in Tab. 9, we measure the average running time on the Waymo val set to break down the inference time of FastPillars on different devices. For example, FastPillars-1f achieves 32.4 FPS on a single NVIDIA A100 GPU, with 5.1 ms for pillar encoding, 21.3 ms for the model forward pass (14.3 ms for the backbone plus 7.0 ms for the neck and head), and 4.5 ms for post-processing.
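The relation between the per-stage times and the reported frame rate is simple arithmetic, sketched below for the A100 FastPillars-1f row. The dictionary keys and the helper name are illustrative only.

```python
def summarize(stage_ms):
    """Return total latency in ms and the corresponding frames per second."""
    total_ms = sum(stage_ms.values())
    fps = 1000.0 / total_ms
    return total_ms, fps


# Per-stage times (ms) for FastPillars-1f on A100, taken from Table 9.
a100_1f = {"pfe": 5.1, "backbone": 14.3, "neck_head": 7.0, "post": 4.5}

total_ms, fps = summarize(a100_1f)
forward_ms = a100_1f["backbone"] + a100_1f["neck_head"]
```

The same helper can be applied to any other row of Table 9 to check that the stage times, the overall latency and the FPS column agree.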

Appendix C FastPillars Deployment on Edge Devices

Figure 8: Measured latency on NVIDIA Jetson AGX Orin. FastPillars can run in real-time on edge GPUs.

SPConv is not a built-in operation in TensorRT, so deploying an SPConv-based detector requires a custom plugin written in CUDA C++. Such plugins are tedious to write and come with limitations such as fixed-shape inputs and reduced compatibility with the commonly used TensorRT quantization workflow. SPConv-based models are therefore hard to quantize and deploy via TensorRT. In contrast, our model can be easily exported to the standard ONNX/TRT format, allowing it to run on edge devices where TensorRT is supported. Furthermore, we deploy FastPillars on an NVIDIA Jetson AGX Orin, a resource-constrained edge GPU platform widely used in real-world autonomous driving. As shown in Fig. 8, FastPillars runs at 18 FPS and is faster than CenterPoint (both report the network forward time). This shows that FastPillars can be deployed on edge hardware, i.e., that it is deployment-friendly. We believe that it serves as a strong and simple alternative to current mainstream SPConv-based 3D detectors for efficient LiDAR-centric perception in real-world deployment.

Figure 9: Visualization of the learned attention scores in the MAPE module for different classes. The size and color of each point represent its score. A point with a higher score receives more attention. Best viewed in color.

Appendix D More Implementation details

We use the one-cycle learning rate policy (Smith and Topin 2019) with an initial learning rate of 1e-4. The learning rate gradually increases to 1e-3 over the first 40% of the epochs and then gradually decreases to 1e-5 over the rest of training. The whole point cloud is flipped randomly along the X or Y axis, randomly rotated around the Z axis by an angle in $[-\pi/4, \pi/4]$, translated by an offset in $[-0.5\,\mathrm{m}, 0.5\,\mathrm{m}]$, and globally scaled by a random factor sampled from $[0.95, 1.05]$. In the loss function, $\lambda_1$, $\lambda_2$ and $\lambda_3$ are set to 1.0, 1.0 and 0.25, respectively. For the backbone, the numbers of channels in the four stages are 64, 128, 256 and 512.
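The learning rate schedule can be sketched as follows. This is a minimal sketch under two assumptions: the initial and final rates are read as 1e-4 and 1e-5 (with a peak of 1e-3), and the interpolation between them is cosine-shaped, a common choice for one-cycle schedules that the text does not confirm. The function name and its parameters are hypothetical.

```python
import math


def one_cycle_lr(step, total_steps, lr_init=1e-4, lr_max=1e-3,
                 lr_final=1e-5, pct_up=0.4):
    """Learning rate at `step` for a two-phase one-cycle schedule."""
    up_steps = pct_up * total_steps
    if step < up_steps:
        # Phase 1: ramp up from lr_init to lr_max.
        t = step / up_steps
        start, end = lr_init, lr_max
    else:
        # Phase 2: anneal from lr_max down to lr_final.
        t = (step - up_steps) / (total_steps - up_steps)
        start, end = lr_max, lr_final
    # Cosine interpolation (assumed shape): start at t=0, end at t=1.
    return end + (start - end) * (1.0 + math.cos(math.pi * t)) / 2.0


schedule = [one_cycle_lr(s, 100) for s in range(101)]
```

The peak is reached at 40% of the steps, and the final rate is one order of magnitude below the initial one.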

For the nuScenes dataset, we set the detection range of the point cloud to $[-54\,\mathrm{m}, 54\,\mathrm{m}]$ for the X and Y axes and $[-5\,\mathrm{m}, 3\,\mathrm{m}]$ for the Z axis, and set the pillar size to 0.15 m. Following CenterPoint, we use class-agnostic NMS with the score threshold set to 0.2 and the rectification factor $\alpha$ set to 0.5 for all 10 classes during post-processing at inference. FastPillars is trained for 20 epochs, which takes about 25 hours on 8 A100 GPUs with batch size 32.
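The detection range and pillar size together determine the resolution of the BEV grid. The sketch below works this out for the nuScenes setting and shows how a point would be assigned to a pillar. The helper names are hypothetical, and the floor-based indexing is the usual convention, assumed here rather than taken from the paper.

```python
import math


def bev_grid_size(x_range, y_range, pillar_size):
    """Number of pillars along X and Y for a given range and pillar size."""
    nx = round((x_range[1] - x_range[0]) / pillar_size)
    ny = round((y_range[1] - y_range[0]) / pillar_size)
    return nx, ny


def pillar_index(point_xy, x_range, y_range, pillar_size):
    """Integer (ix, iy) index of the pillar that contains a point."""
    ix = math.floor((point_xy[0] - x_range[0]) / pillar_size)
    iy = math.floor((point_xy[1] - y_range[0]) / pillar_size)
    return ix, iy


X_RANGE = (-54.0, 54.0)  # metres, nuScenes setting
Y_RANGE = (-54.0, 54.0)
PILLAR = 0.15            # metres

grid = bev_grid_size(X_RANGE, Y_RANGE, PILLAR)
index = pillar_index((0.05, -53.9), X_RANGE, Y_RANGE, PILLAR)
```

For nuScenes this gives a 720 x 720 grid before any downsampling in the backbone.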

For the Waymo Open Dataset, we set the detection range of the point cloud to $[-75.2\,\mathrm{m}, 75.2\,\mathrm{m}]$ for the X and Y axes and $[-2\,\mathrm{m}, 4\,\mathrm{m}]$ for the Z axis, and set the pillar size to 0.2 m. We train FastPillars from scratch for 36 epochs with batch size 32 and a maximum learning rate of 3e-3. For post-processing at inference, following AFDetV2 (Hu et al. 2022), we use class-specific NMS with the IoU threshold set to 0.8, 0.55 and 0.55 and the rectification factor $\beta$ set to 0.68, 0.71 and 0.65 for Vehicle, Pedestrian and Cyclist, respectively.
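How a rectification factor enters post-processing can be sketched as follows. The formula used here, classification score raised to $1-\beta$ times predicted IoU raised to $\beta$, is the form commonly used for IoU-aware score rectification. It is an assumption in this sketch, since this appendix does not restate the formula, and the function and dictionary names are hypothetical.

```python
def rectify_score(cls_score, iou_pred, beta):
    """Blend classification confidence with predicted IoU (assumed formula)."""
    iou_pred = min(max(iou_pred, 0.0), 1.0)  # clamp predicted IoU to [0, 1]
    return (cls_score ** (1.0 - beta)) * (iou_pred ** beta)


# Per-class settings quoted in the text for the Waymo Open Dataset.
BETA = {"Vehicle": 0.68, "Pedestrian": 0.71, "Cyclist": 0.65}
NMS_IOU = {"Vehicle": 0.8, "Pedestrian": 0.55, "Cyclist": 0.55}

# Two pedestrian boxes with the same classification score but different
# predicted localization quality.
well_localized = rectify_score(0.8, 0.9, BETA["Pedestrian"])
poorly_localized = rectify_score(0.8, 0.3, BETA["Pedestrian"])
```

Under this formula a larger factor gives more weight to predicted localization quality, so the box with the higher predicted IoU is ranked first before class-specific NMS is applied with the thresholds in `NMS_IOU`.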

Appendix E More Qualitative Analysis of MAPE

For the vehicle class, we visualize two cars in different views (point view and bird's-eye view) in Fig. 9. Comparing (a)(b) with (c)(d), the MAPE module clearly pays more attention to the semantic information of the object (the car's outline) and to local geometric context, whereas the max-pooling operation loses much useful fine-grained information. Notably, FastPillars is a BEV-based method rather than a point-based method.

Pedestrian and cyclist detection is more challenging than vehicle detection because these objects are small and non-rigid. In Fig. 9, we also visualize the pedestrian and cyclist categories in different views (point view and bird's-eye view). Comparing (a1)(b1) with (c1)(d1), the MAPE module focuses on more stable geometric features of the pedestrian (the pedestrian's body parts) and yields a richer representation than max-pooling under BEV. Similarly, comparing (a2)(b2) with (c2)(d2), the cyclist also retains more semantic information with the MAPE module.

Unlike the max-pooling operation, MAPE preserves some background information, which enhances the local context representation for detection. Experiments demonstrate the effectiveness of our MAPE module, which adds little latency and is suitable for real-time deployment applications.

Figure 10: Qualitative comparison of FastPillars with the baseline (CenterPoint (Yin, Zhou, and Krahenbuhl 2021a)) on the Waymo val set. Blue boxes are CenterPoint predictions; green and red boxes are ground truth and FastPillars predictions, respectively. Best viewed in color.
Figure 11: Qualitative comparison of FastPillars with PillarNet (Shi, Li, and Ma 2022) on the Waymo val set. Blue boxes are PillarNet predictions; green and red boxes are ground truth and FastPillars predictions, respectively. Best viewed in color.
Figure 12: Visualization results of FastPillars on the nuScenes val set.
Figure 13: Visualization results of FastPillars on the Waymo val set.
Figure 14: Visualization results of FastPillars on the Waymo val set.

Appendix F Visualization Results with Baseline

As shown in Fig. 10, we visualize results on the Waymo val set. Our method outperforms the baseline (CenterPoint), especially on the pedestrian class, which illustrates the effectiveness of our MAPE module and backbone design.

Appendix G Qualitative Experiments on Waymo Dataset

Here, we make a qualitative comparison with PillarNet, a previous state-of-the-art SPConv-based method, on the Waymo val set. As shown in Fig. 11, PillarNet produces many false and missed detections, whereas FastPillars produces fewer of both.

Appendix H Visualization on nuScenes Dataset

Some visualization results are shown in Fig. 12, where we visualize the detection results of FastPillars in challenging scenarios on the nuScenes val set. FastPillars works reliably under a wide variety of challenging circumstances and is capable of detecting small targets such as pedestrians, barriers and bicycles.

Appendix I Visualization on Waymo Dataset

We visualize the detection results of FastPillars on the Waymo val set in Fig. 13 and Fig. 14. Thanks to the powerful pillar feature encoding ability of CRVNet and the fine-grained geometric information provided by the MAPE module, FastPillars performs well on large scenes and can accurately locate 3D objects with sparse points.

Appendix J Potential Negative Social Impact

In this paper, we propose a real-time pillar-based detector that achieves low-latency object detection in autonomous driving scenarios. Our model is trained and evaluated entirely on public datasets, and there is no known potential negative impact on society.