SGCCNet: Single-Stage 3D Object Detector With Saliency-Guided Data Augmentation and Confidence Correction Mechanism

Ao Liang, Wenyu Chen, Jian Fang, Huaici Zhao

Manuscript received April 9, 2024. This work was funded in part by CAS Innovation Fund, under Award Number E01Z040101. Ao Liang, Wenyu Chen and Huaici Zhao are with the Key Laboratory of Opto-Electronic Information Processing, Chinese Academy of Sciences, Shenyang 110016, China. Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China. University of Chinese Academy of Sciences, Beijing 100049, China, and also with Key Laboratory of Optical Information and Simulation Technology, Liaoning Province, Shenyang 110016, China (e-mail: {liangao,chenwenyu,hczhao}@sia.cn). Jian Fang is with the Key Laboratory of Opto-Electronic Information Processing, Chinese Academy of Sciences, Shenyang 110016, China. Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China, and also with Key Laboratory of Optical Information and Simulation Technology, Liaoning Province, Shenyang 110016, China (e-mail: [email protected]). Huaici Zhao is the corresponding author.
Abstract

Single-stage point-based 3D object detectors have attracted widespread research interest because they are lightweight and fast at inference. However, they still face challenges such as inadequate learning of low-quality objects (ILQ) and misalignment between localization accuracy and classification confidence (MLC). In this paper, we propose SGCCNet to alleviate these two issues. For ILQ, SGCCNet adopts a Saliency-Guided Data Augmentation (SGDA) strategy to enhance the robustness of the model on low-quality objects by reducing its reliance on salient features. Specifically, we construct a classification task and then approximate the saliency scores of points by moving points towards the point cloud centroid in a differentiable process. During training, SGCCNet is forced to learn from low-saliency features by dropping points. Meanwhile, to avoid the internal covariate shift and the forgetting of contextual features caused by dropping points, we add a geometric normalization module and a skip connection block in each stage. For MLC, we design a Confidence Correction Mechanism (CCM) specifically for point-based multi-class detectors. This mechanism corrects the confidence of the current proposal by utilizing the predictions of other key points within the local region in the post-processing stage. Extensive experiments on the KITTI dataset demonstrate the generality and effectiveness of our SGCCNet. On the KITTI test set, SGCCNet achieves 80.82% $AP_{3D}$ on the Moderate level, outperforming all other point-based detectors and surpassing IA-SSD and Fast Point R-CNN by 2.35% and 3.42%, respectively. Additionally, SGCCNet demonstrates excellent portability to other point-based detectors.

Index Terms:
3D object detection, saliency guided, confidence correction, point cloud processing

I Introduction

With the rapid development of intelligent transportation and robot technology, LiDAR-based 3D object detection has become a key technology for intelligent agents to acquire environmental information, attracting widespread research interest [1, 2]. Within related methods, single-stage point-based detectors can achieve a better balance between accuracy and inference efficiency, gaining increasing attention [3, 4, 5, 6].

Figure 1: Visualization of the saliency of three classes of objects in KITTI. The model’s reliance on highly salient features is detrimental to the detection of low-quality objects.

For example, IA-SSD [7] and DBQ-SSD [7] achieve inference speeds of 82 FPS and 162 FPS, respectively, on the KITTI dataset [8] while maintaining strong detection performance, comprehensively surpassing structured point cloud detectors and greatly enhancing the application prospects of point-based detectors on devices with high real-time requirements. However, these types of detectors still suffer from two prominent issues, namely Insufficient Learning of Low-Quality Objects (ILQ) and Misalignment between Localization Accuracy and Classification Confidence (MLC).

For ILQ, limited training data cannot cover all possible feature distributions, especially for low-quality targets that appear less frequently, so detectors learn too little about them. Moreover, under the paradigm of repeated sampling and limited augmentation [9], models develop feature dependencies: they make better decisions on prominent features while neglecting the learning of other features. In SPSNet, Liang et al. [10] obtained a way to represent point saliency through loss competition. When only the top 10 most salient points were discarded, the model’s average $AP_{3D}$ decreased by 8.61%, which provides an intuitive illustration of this dependency. As shown in Fig. 1, we visualize the saliency scores of several objects in the KITTI dataset. It can be seen that the distribution of salient features is relatively fixed, with Car objects concentrated near the roof, Cyclist objects near the head and wheels, and Pedestrian objects near the head and feet. For low-quality objects, these highly salient regions are likely to be missing or sparse. If the model’s feature learning capability is limited to salient regions, its robustness will be greatly reduced.

Figure 2: Three typical scenarios of MLC in point-based single-stage 3D detectors. (a) False positive targets. (b) Suboptimal predicted boxes. (c) Missed accurately located targets.

The MLC problem is inherent in single-stage detectors, which, unlike two-stage detectors, cannot refine Regions of Interest (RoIs) a second time. Target confidence and geometric regression are predicted by two separate network branches with no connection between them, which can mislead the Non-Maximum Suppression (NMS) operation in post-processing. We list three typical examples in Fig. 2. The first is the detection of too many false positive targets: because detections are ranked solely by confidence, the results rely heavily on the model’s semantic learning ability. This is disadvantageous for detectors using point-based 3D backbones because point clouds are geometrically rich but textureless, so objects with similar geometry are likely to be detected as false positives. The second scenario is when NMS selects a target box whose position and size are not optimal: although the selected key point has the highest confidence, the IoU between its box and the ground truth (GT) is lower than that of other key points. In the third scenario, the model predicts the target position accurately but does not classify it as a foreground category. All three scenarios directly reduce the model’s performance metrics.
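The second scenario can be reproduced with a small sketch. It is only an illustration under simplifying assumptions: boxes are axis-aligned 2D rectangles in a bird's-eye view, the coordinates and scores are made up, and `iou_2d` and `nms` are hypothetical helper names rather than functions from any detector codebase.

```python
import numpy as np

def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: keep the highest-scoring box, suppress boxes overlapping it."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou_2d(boxes[best], boxes[i]) <= iou_thr]
    return keep

# Made-up example: one ground-truth box and two proposals for the same object.
gt = (0.0, 0.0, 4.0, 2.0)
proposals = [(0.6, 0.2, 4.6, 2.2),    # confident but shifted
             (0.05, 0.0, 4.05, 2.0)]  # better localized, less confident
scores = np.array([0.90, 0.70])

kept = nms(proposals, scores)
ious = [iou_2d(p, gt) for p in proposals]
```

In this toy case NMS keeps only the first proposal because it has the higher score, even though the suppressed proposal overlaps the ground truth more closely. The ranking signal (confidence) and the quantity being evaluated (IoU) are decoupled.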

In this paper, we propose SGCCNet to alleviate the impact of the above two issues. Firstly, SGCCNet employs a saliency-guided data augmentation (SGDA) method to enrich the feature distribution of foreground objects in the training data. Specifically, we approximate the saliency scores of points with a differentiable process that moves points towards the point cloud centroid, which acts as a surrogate for point dropout. During the learning process, we randomly drop a certain number of salient points of each object to create a new ground truth for GT sampling. For the MLC problem, we design a Confidence Correction Mechanism (CCM) in the post-processing stage. For proposals with low IoU, CCM reduces their confidence. For proposals that are considered background by the model but have a significant overlap with bounding boxes in the neighborhood, CCM increases their probability of being true positives.
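The dropping step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `drop_salient_points`, the candidate pool size, the number of dropped points, and the choice to sample uniformly among the most salient candidates are all assumptions made for the example.

```python
import numpy as np

def drop_salient_points(points, saliency, num_candidates=20, num_drop=10, rng=None):
    """Randomly remove `num_drop` points chosen among the `num_candidates`
    most salient points of one object (hypothetical parameter values)."""
    rng = np.random.default_rng() if rng is None else rng
    n = points.shape[0]
    num_candidates = min(num_candidates, n)
    num_drop = min(num_drop, num_candidates)
    # Indices of the most salient points, highest score first.
    candidates = np.argsort(saliency)[::-1][:num_candidates]
    dropped = rng.choice(candidates, size=num_drop, replace=False)
    keep_mask = np.ones(n, dtype=bool)
    keep_mask[dropped] = False
    return points[keep_mask]

# Usage with synthetic data: 50 points with (x, y, z, intensity).
rng = np.random.default_rng(0)
obj_points = rng.normal(size=(50, 4))
obj_saliency = np.arange(50, dtype=float)  # point 49 is the most salient
augmented = drop_salient_points(obj_points, obj_saliency, rng=rng)
```

The augmented object keeps every low-saliency point and loses part of its high-saliency region, which is the kind of sample that could then be placed in a GT-sampling database.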

We validated the superiority of SGCCNet on the KITTI dataset. Specifically, the proposed SGCCNet outperforms all previously published point-based approaches. Under the multi-class training scheme, SGCCNet achieved 80.82% $AP_{3D}$ on the Car target in the KITTI test set, surpassing prior methods that utilize structure-based backbones while also demonstrating higher efficiency. For the Pedestrian and Cyclist targets, compared to the latest point-based models, SGCCNet also showed improvements of 1% and 3.1% $AP_{3D}$, respectively. In summary, the key contributions of our work are as follows:

  • A new saliency-guided data augmentation method is proposed, which enhances the diversity of the training data distribution at the feature level, thus improving the robustness of the model to low-quality object detection.

  • A new point-based backbone that combines geometric normalization modules and skip connection blocks is proposed to alleviate the internal covariate shift problem and feature forgetting problem.

  • A confidence correction mechanism for post-processing is proposed to effectively address the misalignment between localization accuracy and classification confidence (MLC) issue commonly seen in single-stage detectors.

  • A high-efficiency and high-precision single-stage point-based 3D detector SGCCNet is proposed, and experimental results show that our method outperforms other similar methods significantly on the KITTI dataset.

The remainder of this paper is organized as follows. In Section II, we review the design principles of classic and state-of-the-art 3D object detectors, and list some methods to overcome the ILQ and MLC problems. In Section III, we introduce the key components of SGCCNet, detail the implementation process of saliency-guided data augmentation, and finally introduce its end-to-end training loss. In Section IV, we demonstrate the superiority of SGCCNet through comprehensive comparative experiments and ablation studies. Finally, in Section V, we summarize this paper and provide prospects for future work.

II Related Work

As mentioned above, LiDAR-based 3D detectors can be classified into grid-based, range-based, and point-based according to the data form before inputting to models. In this section, we will review the key ideas of classic and state-of-the-art models within each domain, and analyze in detail the measures currently taken by models to alleviate the issues of ILQ and MLC.

II-A 3D Detectors

Grid-based. There are three major types of grid representations: voxels, pillars, and BEV feature maps. Voxel-based models [11, 12] can preserve the spatial information of the original point cloud scene to the greatest extent permitted by the grid size. Pillar-based methods [13, 14] normalize the voxel space along the z-axis, significantly reducing the number of cells. The Bird’s-eye view (BEV) feature map is a dense 2D representation, where each pixel corresponds to a specific region and encodes information about the points in this region [15, 16, 17, 18, 19]. In recent years, researchers have found that BEV representation can naturally integrate with downstream tasks such as behavior decision-making and trajectory planning to form end-to-end intelligent systems, leading to a resurgence in research on perception models based on BEV [20, 21, 22, 23].
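The difference in cell count between voxels and pillars can be seen in a small sketch. The cell size of 0.5, the scene extent, the synthetic uniform points, and the helper name `occupied_cells` are assumptions chosen for illustration only.

```python
import numpy as np

def occupied_cells(points, cell_size, use_z=True):
    """Count occupied cells of a voxel grid (use_z=True) or a pillar grid (use_z=False)."""
    dims = 3 if use_z else 2
    idx = np.floor(points[:, :dims] / cell_size).astype(np.int64)
    return np.unique(idx, axis=0).shape[0]

rng = np.random.default_rng(0)
# Synthetic scene: x in [0, 20), y in [-10, 10), z in [-2, 2).
points = rng.uniform([0, -10, -2], [20, 10, 2], size=(5000, 3))

n_voxels = occupied_cells(points, 0.5, use_z=True)
n_pillars = occupied_cells(points, 0.5, use_z=False)
```

Because every pillar merges all voxels that share the same (x, y) index, the pillar grid can never have more occupied cells than the voxel grid, at the cost of losing the vertical structure inside each cell.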

Range-based (RV). Since range images are 2D representations like RGB images, range-based 3D object detectors can naturally borrow the models in 2D object detection to handle range images [24, 25].

Both of the above methods convert point clouds into a structured representation, with the main advantage being the ability to generate more dense feature maps. Even the space not occupied by the original point cloud learns features, which is crucial for sparse and incomplete objects. However, these methods inevitably result in information loss during the process of structuring the point cloud, and their reliance on large parameter 3D backbones makes it challenging to strike a balance between inference efficiency and detection accuracy. In contrast, in LiDAR scenes with multi-beams and fewer points, the advantages of structured point cloud detectors become less pronounced, and point-based detectors begin to stand out.

Figure 3: Overview of proposed SGCCNet. SGCCNet adopts a PointNet++-style 3D backbone to learn point cloud features. In addition, SGCCNet consists of three core components, namely a saliency-guided data augmentation strategy, SA layer with geometric normalization modules and skip connection blocks, and a confidence correction mechanism during post-processing.

Point-based. The current state-of-the-art point-based detectors [3, 4, 5, 6, 5, 10] follow the design paradigm of PointNet [26], PointNet++ [27], and PointMLP [28]. The point cloud is processed through a set abstraction (SA) layer consisting of multiple stages of downsampling modules, local feature learning modules, and feature aggregation modules to learn rich spatial and semantic information.

In terms of results, single-stage point-based detectors have achieved performance comparable to structure-based models in multi-beam LiDAR scenes, and their real-time inference capability and lightweight model architecture are desirable for many mobile embedded devices. The main performance bottlenecks are two factors: inadequate learning of low-quality targets (ILQ) and misalignment between localization accuracy and classification confidence (MLC).

II-B Dealing with ILQ

As mentioned above, limited training data cannot cover all possible feature distributions. For models that rely on salient features, they cannot actively explore information in non-significant parts of the target. Inspired by AlexNet [29], all current 3D detection models adopt some common data augmentation techniques to improve the feature distribution of training data, such as using geometric transformations to perturb both the overall scene and local features of the target. Choi et al. [30] randomly occluded parts of the foreground target to expand the training data by 2.5 times. Reuse et al. [31] conducted detailed controlled experiments to demonstrate the effectiveness of local target transformations in improving detector performance. Hu et al. [32] proposed pattern-aware GT sampling, enhancing data augmentation by subsampling objects based on LiDAR characteristics. Wang et al. [33] consistently pasted virtual high-quality targets into point clouds to enhance the model’s perception of low-quality targets.

These methods aim to improve model robustness by enhancing the diversity of the training data distribution. However, these augmentation processes are random, and data diversity does not necessarily imply feature diversity, leading to significant bottlenecks in benefits. This paper proposes a saliency-guided data augmentation method that, from the perspective of feature learning, removes salient features on which the model relies, forcing the model to actively explore information in low saliency regions and enhance the diversity of feature distribution in training samples from the source.

II-C Dealing with MLC

He et al. [34] proposed an auxiliary network to convert the features of key points belonging to the grid in the 3D backbone into point-wise representation to improve localization accuracy. They also introduced a part-sensitive warping operation to align the confidences to the predicted bounding boxes. CIA-SSD [35] combined the predicted IoU with the key point classification probability as the final confidence. Wang et al. [36] proposed a confidence-based filtering mechanism to filter out poorly localized proposals. Sheng et al. [37] proposed a new Rotation-Decoupled IoU (RDIoU) method to generate more effective optimization targets. In addition, models such as FVTP [38], DFDNet [39], and DDIGNet [40] have also adopted IoU prediction branches to improve the MLC problem.

Although the above methods have achieved certain effects, they are mostly designed for anchor-based detection heads, and point-based detection heads have not been given much attention. Specifically, compared to anchor-based detection heads, the sparsity of keypoints obtained by point-based detection heads varies, leading to different numbers of predicted bounding boxes for each target, making these methods unsuitable for direct use in point-based detection heads. In this paper, a new confidence correction mechanism is proposed for point-based detection heads, taking into account the prediction situations of other proposals within the keypoint domain and the density of key points in the domain. On the one hand, it can filter out low-quality proposals, and on the other hand, it can explore potential high-quality proposals.

III Method

III-A Overview

As mentioned above, our goal is to improve the model’s ability to learn from low-quality targets and enhance target localization accuracy by correcting confidence. To achieve this, we propose SGCCNet, a generic and unified single-stage point-based 3D object detector as illustrated in Fig. 3, which includes three core components, namely a saliency-guided data augmentation method, SA layers with geometric normalization modules and skip connection blocks, and a confidence correction mechanism.

In this section, we will detail the design principles and implementation process of SGCCNet. Section III-B introduces the mathematical symbols and their meanings to be used. In Section III-C, we present the method for obtaining point saliency in the saliency-guided data augmentation and the specific augmentation process. Section III-D describes the specific structure of the 3D backbone of SGCCNet, focusing on the design rationale of the geometric normalization module and skip connection block. Section III-E presents the point-based detection head and the confidence correction mechanism. Finally, in Section III-F, we discuss the model loss in the end-to-end training process.

III-B Preliminary

Let $\mathcal{D}$ be the training dataset used for experiments, consisting of $N$ point cloud scenes $\mathcal{P}_i$, $\{\mathcal{P}_i \in \mathbb{R}^{n_i \times (3+C)} \mid i = 1, 2, \ldots, N\}$. Here, $n_i$ represents the number of points in the current scene, and $C$ represents the number of other features besides the spatial positions of the points, such as intensity. The annotated ground truth for each scene is denoted as $\mathcal{B}_i$, $\{\mathcal{B}_i \in \mathbb{R}^{m_i \times 8} \mid i = 1, 2, \ldots, N\}$, where $m_i$ is the number of ground-truth annotations in the scene. The eight-dimensional features include the center position of the bounding box $\{x, y, z\}$, dimensions $\{l, w, h\}$, and category $c$. The total number of annotated ground-truth boxes in the dataset is $A = \sum_{i=1}^{N} m_i$.

III-C Saliency-Guided Data Augmentation (SGDA)

For LiDAR-based 3D object detection, limited training samples cannot cover all possible target feature distributions. In a training mode with limited data and repeated sampling, the model gradually relies on prominent features with common distributions and uses them as decision criteria without exploring features in other less prominent regions of the target. Liang et al. [10] have demonstrated the model’s dependency on prominent features in their work SPSNet. SPSNet assigns a Gaussian soft label to foreground points and regresses geometric information belonging to Dirac bounding boxes, obtaining the saliency score of points through stability and perturbation adversities. Ultimately, with the scenario of discarding only 10 foreground points, the model’s average AP drops rapidly by 8.61%. This dependency on salient features is detrimental to the robustness of the model’s detection.

Figure 4: (a) Overview of proposed SGCCNet-elite for classification task. (b) Shifting the point towards the centroid is similar to discarding the point, and the movement process is differentiable, which can be used to approximate the saliency score of the point.

We hope to develop a purposeful data augmentation approach: from the perspective of features, we use saliency analysis to identify the common features that the model relies on, remove them, and force the model to explore features in regions of the target that originally had low saliency. However, point-wise saliency analysis methods for point-based 3D object detection have so far received little attention. We have therefore shifted our focus to the more in-depth saliency analysis methods developed for point cloud classification tasks. Although these are two different tasks, we believe that the saliency scores obtained have commonalities for two reasons: 1) Considering the model structure. Point-based detectors feed vote points into the SA layer to determine the semantic information of local regions, a process that is identical to the prediction process of point cloud classification models. 2) Considering feature learning. Saliency scores are determined by the specific features of the model, and models with similar structures should focus on similar salient features. We will demonstrate our viewpoint in the experimental section; next, we introduce the specific data augmentation process.

Dataset Preparing. We conduct our experiments on the KITTI dataset $\mathcal{D}$. Firstly, we extract all foreground objects using the annotated ground-truth $\{\mathcal{B}_i \in \mathbb{R}^{m_i \times 8} \mid i = 1, 2, \ldots, A\}$ in $\mathcal{D}$, and then construct a classification dataset similar to ModelNet40 [41] and ScanObjectNN [42]. Each point in the dataset contains 4-dimensional features, representing spatial position and intensity. KITTI categorizes foreground objects into Easy, Moderate, and Hard levels based on factors such as occlusion and the number of points within the bounding box. We discard samples with fewer points than a certain threshold, and train the classification model on all levels. It is worth noting that we only perform data augmentation on Easy and Moderate levels, while retaining all samples in the Hard level.
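A minimal sketch of this extraction step is given below. It makes several assumptions that the text does not specify: each box is passed as `(x, y, z, l, w, h, yaw)` with `(x, y, z)` the geometric center and `yaw` a rotation about the z-axis, the category is supplied separately as a label, and the threshold `min_points=5` is a placeholder. The function names are hypothetical.

```python
import numpy as np

def points_in_box(points, box):
    """Return the rows of `points` (n, 3+C) that fall inside one box.
    Assumed box layout: (x, y, z, l, w, h, yaw)."""
    cx, cy, cz, l, w, h, yaw = box
    shifted = points[:, :3] - np.array([cx, cy, cz])
    # Rotate by -yaw so the box becomes axis-aligned in its local frame.
    c, s = np.cos(-yaw), np.sin(-yaw)
    local_x = c * shifted[:, 0] - s * shifted[:, 1]
    local_y = s * shifted[:, 0] + c * shifted[:, 1]
    inside = (
        (np.abs(local_x) <= l / 2)
        & (np.abs(local_y) <= w / 2)
        & (np.abs(shifted[:, 2]) <= h / 2)
    )
    return points[inside]

def build_classification_samples(scene, boxes, labels, min_points=5):
    """Crop every annotated object and discard crops with too few points."""
    samples = []
    for box, label in zip(boxes, labels):
        obj = points_in_box(scene, box)
        if obj.shape[0] >= min_points:
            samples.append((obj, label))
    return samples

# Synthetic scene: 10 points on one object, 3 points on another.
scene = np.zeros((13, 4))
scene[:10, 0] = np.linspace(-0.5, 0.5, 10)
scene[10:, 0] = 10.0
scene[10:, 1] = 10.0
boxes = [(0, 0, 0, 4, 2, 2, 0.3), (10, 10, 0, 4, 2, 2, 0.0)]
samples = build_classification_samples(scene, boxes, labels=[0, 1])
```

Here the second object is dropped because it contains only three points, which mirrors the filtering of very sparse samples described above.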

Classification Model Preparing. We have streamlined the structure of SGCCNet and constructed an elite model SGCCNet-elite for classification tasks as shown in Fig. 4 (a). This model consists of only one SA layer for feature learning, and the training samples do not undergo downsampling in the SA layer. Finally, the point-wise features of each point are pooled using max pooling to form the overall feature of the sample, which is then fed into a Fully Connected (FC) Layer for classification.
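The overall data flow (point-wise features, max pooling, FC classifier) can be sketched in NumPy. This is a deliberately simplified stand-in: a real SA layer also groups neighboring points, whereas the sketch uses a single shared per-point layer, and the layer widths and names (`init_params`, `classify`) are assumptions, not the SGCCNet-elite configuration.

```python
import numpy as np

def init_params(rng, in_dim=4, hidden=32, num_classes=3):
    """Random weights for a shared per-point layer and an FC classifier."""
    return {
        "W1": rng.normal(scale=0.1, size=(in_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(scale=0.1, size=(hidden, num_classes)),
        "b2": np.zeros(num_classes),
    }

def classify(points, params):
    """points: (k, 4) array -> class probabilities of shape (num_classes,)."""
    # Shared per-point layer with ReLU; no downsampling, every point is kept.
    feats = np.maximum(points @ params["W1"] + params["b1"], 0.0)
    # Max pooling over points gives one global feature per sample.
    global_feat = feats.max(axis=0)
    logits = global_feat @ params["W2"] + params["b2"]
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
params = init_params(rng)
sample = rng.normal(size=(64, 4))
probs = classify(sample, params)
probs_reversed = classify(sample[::-1], params)
```

Because the pooling is a maximum over points, the prediction does not depend on the order in which points are listed, so `probs` and `probs_reversed` agree.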

Saliency Analysis. Let $E_{\theta}(p^i)$ be a classification model, where $p^i := (p^i_1, p^i_2, \ldots, p^i_{k_i})$ is a training sample with $p^i \in \mathbb{R}^{k_i \times 4}$, and let $\mathcal{L}_c(\cdot)$ be the classification loss function:

$$\mathcal{L}_c = -\frac{1}{A} \sum_{i} \sum_{c=1}^{3} y_{ic} \log\left(\hat{y_{ic}}\right) \tag{1}$$

Here, $y_{ic}$ is the label, with a value of 1 if sample $p^i$ belongs to class $c$ and 0 otherwise, and $\hat{y_{ic}}$ is the predicted probability. The most intuitive way to determine point saliency is to gradually discard points and observe the change in the loss $\mathcal{L}_c$. However, due to the large number of points and the fact that the contribution of each point to the whole is not isolated, this method is time-consuming and inaccurate. Inspired by Zheng et al. [43], we transform the process of discarding points into a differentiable process of moving points to calculate the saliency score of each point. Specifically, in the classification task, we normalize the feature information of each sample:

$$\hat{p^i} = \frac{p^i - \frac{1}{k_i} \sum_{j=1}^{k_i} p^i_j}{\max\limits_{j \in \{1, 2, \ldots, k_i\}} \sqrt{\sum_{c=1}^{4} {p^i_{jc}}^2}} \tag{2}$$

The center of the sample is now moved to the origin. Based on the mechanism of LiDAR scanning imaging, we have reason to believe that the points at the origin have little contribution to the classification of the sample. The impact of moving points from other positions in the sample to the origin on model decisions is almost the same as discarding points. Since points are not angle invariant, there are difficulties in measuring gradients in Euclidean space, so we consider point shifting in the Spherical Coordinate System.
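The normalization of Eq. (2) can be sketched as below. The function name `normalize_sample` is hypothetical, and the sketch follows the equation as written, taking the scale from the largest norm of the original points; some implementations instead compute the scale after centering, which is a different choice.

```python
import numpy as np

def normalize_sample(p):
    """Eq. (2): subtract the per-sample mean and divide by the largest point norm.
    p has shape (k_i, 4): x, y, z, intensity."""
    centroid = p.mean(axis=0, keepdims=True)
    scale = np.sqrt((p ** 2).sum(axis=1)).max()
    return (p - centroid) / scale

rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, size=(100, 4))
normalized = normalize_sample(raw)
```

After this step the mean of the sample sits at the origin, which is what makes "moving a point to the origin" a reasonable stand-in for removing it.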

In the Spherical Coordinate System, a point $p^i_j$ is represented as $\left(r^i_j, \varphi^i_j, \phi^i_j\right)$ with $a$ as the sphere core, where $r^i_j$ is the distance from $p^i_j$ to $a$, and $\varphi^i_j$ and $\phi^i_j$ are the two angles of the point relative to $a$. After shifting $p^i_j$ in the direction of $r^i_j$ towards the sphere core $a$ by $\delta$, the change in model loss is $-\delta \frac{\partial \mathcal{L}_c}{\partial r^i_j}$, where $r^i_j = \sqrt{\sum_{c=1}^{3} \left(p^i_{jc} - a_c\right)^2}$, as:

$$\begin{cases} p^i_{j1} - a_1 = r^i_j \cos\phi^i_j \sin\varphi^i_j \\ p^i_{j2} - a_2 = r^i_j \cos\phi^i_j \cos\varphi^i_j \\ p^i_{j3} - a_3 = r^i_j \sin\phi^i_j \end{cases} \tag{3}$$

where $p^i_j = \{p^i_{jc}\}_{c=1,2,3}$ and $a = \{a_c\}_{c=1,2,3}$ represent the 3D coordinates of $p^i_j$ and $a$. So:

$$\frac{\partial \left(p^i_{jc} - a_c\right)}{\partial r^i_j} = \frac{p^i_{jc} - a_c}{r^i_j} \tag{4}$$
$$\frac{\partial \mathcal{L}_c}{\partial r^i_j} = \sum_{c=1}^{3} \frac{\partial \mathcal{L}_c}{\partial p^i_{jc}} \frac{p^i_{jc} - a_c}{r^i_j} \tag{5}$$

Therefore, to first order, shifting $p^i_j$ all the way to the sphere core (i.e., $\delta = r^i_j$) changes the model loss by $s^i_j = -\frac{\partial \mathcal{L}_c}{\partial r^i_j} r^i_j$, which we use as the saliency score of the point. In practical computation, we also set the intensity value (the fourth dimension) of the center point to 0 and include it as additional spatial information in the gradient calculation.
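As a minimal sketch of the saliency score, note that Eq. 5 gives $\frac{\partial \mathcal{L}_c}{\partial r^i_j} r^i_j = \sum_{c} \frac{\partial \mathcal{L}_c}{\partial p^i_{jc}} (p^i_{jc} - a_c)$, so the score can be computed from Cartesian gradients without an explicit spherical conversion. The function name `saliency_scores` and the argument `grad_points` are assumptions for illustration; in practice the gradient would come from backpropagating the classification loss through the classification model.

```python
import numpy as np

def saliency_scores(points, grad_points, center):
    """Per-point saliency s_j = -(dL/dr_j) * r_j.

    points:      (k, 3) xyz coordinates of one sample
    grad_points: (k, 3) gradient of the classification loss w.r.t. the
                 coordinates (assumed to be provided by an autodiff framework)
    center:      (3,) sphere core a
    """
    offsets = points - center
    # By Eq. 5, (dL/dr) * r equals the inner product of the Cartesian
    # gradient with the offset from the sphere core.
    return -np.sum(grad_points * offsets, axis=1)

# Toy check with the hypothetical loss L = sum_j ||p_j - a||^2, whose gradient
# is 2 (p_j - a); the score is then -2 r_j^2.
pts = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
a = np.zeros(3)
scores = saliency_scores(pts, 2.0 * (pts - a), a)
```

With this toy loss, moving a point towards the center lowers the loss, so every score is negative and the point farther from the center has the larger magnitude.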

Algorithm 1 Dropping Points Strategy
Require: Selected training sample $p^i = \{p^i_j \mid j = 1, 2, \ldots, k_i\}$, $p^i_j = (x^i_j, y^i_j, z^i_j, I^i_j) \in \mathbb{R}^4$; category label $c_i$; point cloud classification model $E_\theta(\cdot)$ and loss function $\mathcal{L}_c(\cdot)$; linear hyperparameters $\alpha, \beta$ of the number of deleted points and dropout interval $drop\_interval$;
Ensure: $k_i > 0$;
1:  Calculate the number of points that should be discarded: $d^i = \lfloor \alpha k_i + \beta \rfloor$;
2:  for $l = 1, 2, \cdots, \lfloor \frac{d^i}{drop\_interval} \rfloor$ do
3:      Get normalized point cloud $\hat{p^i}$ with Eq. 2;
4:      Calculate classification loss: $\mathcal{L} = \mathcal{L}_c(E_\theta(\hat{p^i}), c_i)$;
5:      Calculate the saliency score for each point: $s^i_j = -\frac{\partial \mathcal{L}}{\partial r^i_j} r^i_j$;
6:      Keep the $k_i - drop\_interval$ points with the lowest saliency scores as $p^i_{dropped}$;
7:      $k_i \leftarrow k_i - drop\_interval$;
8:      $p^i \leftarrow p^i_{dropped}$;
9:  end for
10:  return $p^i$
Figure 5: Changes in the scenes before and after dropping points. Points of the Car, Pedestrian, Cyclist, and Background classes are shown in green, yellow, red, and blue, respectively.

Dropping Points. For a training sample $p^i$, after obtaining the saliency scores $\{s^i_j \mid j = 1, 2, \ldots, k_i\}$ of its points, we delete high-saliency points from the sample according to a preset ratio. The number of points to be deleted, denoted $d^i$, is a linear function of the number of points in the sample, $k_i$, i.e., $d^i = \lfloor \alpha k_i + \beta \rfloor$. To keep the saliency ranking accurate, the $d^i$ points are not deleted all at once; instead, $drop\_interval$ points are deleted per iteration and the saliency scores are recomputed on the remaining points, as described in Algorithm 1. After the salient features are discarded, the sample is restored to its original space and used for training alongside the original ground truth in the GT sampling process. Fig. 5 illustrates the changes in a set of scenes before and after point removal. The entire saliency-guided data augmentation process is shown in Fig. 6.
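The iterative deletion can be sketched as follows. This is only an illustration under stated assumptions: `saliency_fn` is a hypothetical callable standing in for the normalization, classification loss, and gradient steps, and the loop runs $\lfloor d^i / drop\_interval \rfloor$ rounds so that at most $d^i$ points are removed.

```python
import math
import numpy as np

def drop_salient_points(points, saliency_fn, alpha, beta, drop_interval):
    """Iteratively remove the most salient points of one sample.

    points:      (k, C) array of points
    saliency_fn: hypothetical callable mapping (k, C) points to (k,) scores;
                 it is re-evaluated after every round of deletion
    """
    points = np.asarray(points, dtype=float)
    d = math.floor(alpha * len(points) + beta)   # number of points to discard
    for _ in range(d // drop_interval):
        if len(points) <= drop_interval:          # never empty the sample
            break
        scores = saliency_fn(points)
        # Keep the points with the lowest saliency, preserving their order.
        keep = np.argsort(scores, kind="stable")[: len(points) - drop_interval]
        points = points[np.sort(keep)]
    return points

# Toy usage: saliency is simply the x coordinate (an assumption for the demo).
toy = np.stack([np.arange(10.0), np.zeros(10), np.zeros(10)], axis=1)
kept = drop_salient_points(toy, lambda p: p[:, 0], alpha=0.4, beta=0.0,
                           drop_interval=2)
```

In the toy usage, $d^i = 4$, so two rounds remove the four points with the largest x coordinate and six points remain.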

Figure 6: Workflow of Saliency-Guided Data Augmentation (SGDA).

III-D 3D Backbone

Similar to previous state-of-the-art point-based detectors, SGCCNet utilizes a PointNet++-style 3D backbone to extract features from point clouds. This backbone incorporates multi-stage downsampling, multi-scale feature learning, and local feature aggregation to extract point-wise features with rich semantic and geometric information.

Figure 7: Structure of the Geometric Normalization Module (GNM).

It is worth noting that the downsampling scheme of SGCCNet is designed based on IA-SSD. Due to the limited accuracy of semantic information learning by the point-based backbone, SGCCNet uses the Farthest Point Sampling (FPS) algorithm in the first two stages, and in the last two stages, sampling is based on the predicted point-wise foreground probability. The former ensures global coverage of sampling points, while the latter ensures a high recall rate of foreground points, reducing target information loss. In addition, SGCCNet also includes a geometric normalization module and skip connection block.
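The two sampling schemes can be illustrated with a small sketch. The first function is the standard greedy FPS procedure; the second assumes that foreground-based sampling keeps the points with the highest predicted foreground probability (a top-k rule, which is an assumption here). Function names are hypothetical.

```python
import numpy as np

def farthest_point_sampling(xyz, m, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    n = xyz.shape[0]
    selected = np.empty(m, dtype=np.int64)
    min_dist = np.full(n, np.inf)   # squared distance to the nearest chosen point
    current = start
    for t in range(m):
        selected[t] = current
        dist = np.sum((xyz - xyz[current]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, dist)
        current = int(np.argmax(min_dist))
    return selected

def foreground_topk_sampling(fg_prob, m):
    """Keep the m points with the highest predicted foreground probability."""
    return np.argsort(-fg_prob, kind="stable")[:m]

xyz = np.array([[0.0, 0, 0], [1.0, 0, 0], [10.0, 0, 0], [5.0, 0, 0]])
fps_idx = farthest_point_sampling(xyz, 3)
fg_idx = foreground_topk_sampling(np.array([0.1, 0.9, 0.5, 0.7]), 2)
```

FPS spreads the selected indices over the whole scene regardless of class, whereas the probability-based rule concentrates the sampling budget on likely foreground points.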

Geometric Normalization Module (GNM). We found that the PointNet structure ignores the Internal Covariate Shift (ICS) problem caused by the irregular and sparse geometry of point clouds when aggregating local features. Specifically, in each learning stage, the same MLP layer must process a large number of regions with different geometric information simultaneously, and it is difficult for a simple MLP to converge stably and generalize well on such a diverse feature distribution. Therefore, we propose to alleviate this problem by geometrically normalizing the locally aggregated features: each ball used for learning local features is treated as a batch, and a Batch Normalization (BN)-like operation is applied before the features are fed into the MLP layer. The structure of GNM is shown in Fig. 7. In detail, we first move the distribution of local features within each ball to a normalized space with the ball center as the mean, and then increase the diversity of the feature distribution through learnable weight and offset parameters of the same dimension as the features. Finally, to prevent information loss, we concatenate the normalized features with the original features and feed them into the MLP layer to learn local features.

Let $\{f_{i,j}\}_{j=1,2,\ldots,k} \in \mathbb{R}^{k \times d}$ be the local features of the sampling point $p_i$, where $k$ is the number of neighbors in its local region and $d$ is the feature dimension of the sampling point and its neighbor points. We normalize the features of the neighbor points in the local region through the following equations:

$$\{f_{i,j}\} = \alpha \odot \frac{\{f_{i,j}\} - f_i}{\sigma + \epsilon} + \beta \tag{6}$$
$$\sigma = \sqrt{\frac{1}{k \times n \times d} \sum_{i=1}^{n} \sum_{j=1}^{k} \left(f_{i,j} - f_i\right)^2} \tag{7}$$

Similar to a BN layer, $\alpha \in \mathbb{R}^d$ and $\beta \in \mathbb{R}^d$ are learnable parameters, $\odot$ denotes the Hadamard product, $f_i$ is the feature of the sampling point $p_i$ (the ball center), $n$ is the number of local groups, and $\epsilon = 10^{-5}$ is a small constant for numerical stability. Note that $\sigma$ is a scalar that describes the feature deviation across all local groups and channels. By doing so, we transform the points via a normalization operation while maintaining their original geometric properties. The features after geometric normalization are concatenated with the original features for subsequent learning.
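A minimal NumPy sketch of Eqs. 6 and 7 is given below. It assumes the grouped features are stored as an `(n, k, d)` array and that the normalized features are placed before the original ones in the concatenation; the function name and the concatenation order are assumptions for illustration.

```python
import numpy as np

def geometric_normalization(group_feats, center_feats, alpha, beta, eps=1e-5):
    """Normalize grouped local features around their ball centers.

    group_feats:  (n, k, d) features of the k neighbors in each of n balls
    center_feats: (n, d) feature f_i of each ball center
    alpha, beta:  (d,) learnable scale and offset (plain arrays in this sketch)
    """
    centered = group_feats - center_feats[:, None, :]
    # Eq. 7: a single scalar deviation over all groups, neighbors and channels.
    sigma = np.sqrt(np.mean(centered ** 2))
    # Eq. 6: scale and shift the centered features per channel.
    normalized = alpha * centered / (sigma + eps) + beta
    # Concatenate with the original features so no information is lost.
    return np.concatenate([normalized, group_feats], axis=-1)

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8, 3))
centers = feats.mean(axis=1)
out = geometric_normalization(feats, centers, np.ones(3), np.zeros(3))
```

Because $\sigma$ is shared by all groups, relative differences in scale between local regions are preserved after normalization, which matches the remark that the original geometric properties are maintained.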

Skip Connection Block (SCB). When determining the appropriate number of training epochs for SGCCNet, we found that some high-quality targets were detected with high confidence early in training, but as training progressed, their confidence dropped below the threshold, which reduced the model's performance. After analysis, we believe there may be three reasons for this: 1) The sparsity of the targets greatly increases after successive downsampling, so the sparse sampling points lose the ability to perceive the overall geometry of the targets. 2) As the ball query radius increases, the receptive field expands, but noise and features of other targets in the local area are also introduced, so key information of the corresponding targets is lost after the pooling layers. 3) For some targets, shallow features represent the target information better. This is supported by our classification experiments, in which models with very few parameters achieve classification accuracy far higher than that of detection models. We refer to this type of problem as feature forgetting and design a skip connection block to alleviate it. The structure of SCB is shown in Fig. 8.

Figure 8: Structure of the Skip Connection Block (SCB).

The SCB alleviates the problem of feature forgetting by enabling interactions between the key point features of adjacent SA layers in the 3D backbone, as shown in Fig. 8. Let the features of the sampled points of the $l$-th SA layer be $\{F^l_i\}_{i=1,\ldots,k} \in \mathbb{R}^{k \times d_l}$, where $k$ is the number of sampled points at the current stage and $d_l$ is the feature dimension. In a PointNet++-style backbone, the sampled points of the current stage are also sampled points of the previous stage. Therefore, let the features of the same points at the previous stage, i.e., the $(l-1)$-th SA layer, be $\{F^{l-1}_i\}_{i=1,\ldots,k} \in \mathbb{R}^{k \times d_{l-1}}$. The learning process of SCB is as follows:

$$g_i = \Phi_{pos}\left(\Phi_{pre}\left(F^{l-1}_i\right) + F^l_i\right) \tag{8}$$

$\Phi_{pre}$ is a module composed of residual networks that lifts the dimension of the previous-stage features $F^{l-1}_i$ from $d_{l-1}$ to $d_l$, and $\Phi_{pos}$ further fuses the summed features. The design of SCB is inspired by the work of He et al. [34], but we naturally leverage the hierarchical downsampling of the point-based backbone, without any feature projection or transformation processes. As shown in Fig. 8, with the addition of SCB, shallow and deep features of the same sampling point, as well as features of small and large local regions, interact with each other. Even when the sampling points become sparser or noise is introduced in subsequent downsampling, valuable information from earlier stages is still retained.
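The data flow of Eq. 8 can be sketched as below. As a loudly stated simplification, $\Phi_{pre}$ and $\Phi_{pos}$ are each replaced by a single linear layer with ReLU rather than the residual modules used in SGCCNet, and all names (`skip_connection_block`, `w_pre`, `w_pos`, `sample_idx`) are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def skip_connection_block(prev_feats, sample_idx, cur_feats, w_pre, w_pos):
    """Fuse previous-stage and current-stage features of the same points.

    prev_feats: (k_prev, d_prev) features of the previous SA layer
    sample_idx: (k,) indices of the points kept by the current SA layer
    cur_feats:  (k, d_cur) features of the current SA layer
    w_pre:      (d_prev, d_cur) stand-in for Phi_pre (dimension lifting)
    w_pos:      (d_cur, d_cur) stand-in for Phi_pos (feature fusion)
    """
    # The current points are a subset of the previous ones, so their shallow
    # features can be gathered by index without any interpolation.
    lifted = relu(prev_feats[sample_idx] @ w_pre)
    return relu((lifted + cur_feats) @ w_pos)

prev = np.ones((4, 2))
cur = np.ones((2, 3))
fused = skip_connection_block(prev, np.array([0, 2]), cur,
                              np.ones((2, 3)), np.eye(3))
```

The gather by `sample_idx` is what makes the skip connection cheap: no nearest-neighbor search or feature propagation is needed to align the two stages.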

Algorithm 2 Confidence Correction Mechanism
Require: Predicted bounding boxes $\mathcal{B}$ of one LiDAR frame with the size of $k \times 7$, where $k$ is the number of bounding boxes and $(x, y, z, w, l, h, r)$ are the parameters of a bounding box; predicted classification confidence values $\mathcal{C}$ and IoU values $\mathcal{U}$ of the corresponding predicted bounding boxes, each with the size of $k \times 1$; initial confidence score threshold $score\_thres_1$; final confidence score threshold $score\_thres_2$; IoU threshold $iou\_thres$; missed sample incremental confidence $\Delta c$; missed sample IoU threshold $iou\_thres_{missed}$; threshold of the number of missed sample neighbors $neighbor\_thres_{missed}$; $\mathcal{B} = \{\mathbf{b}_i\}_{i=1,\ldots,k}$; $\mathcal{C} = \{\mathbf{c}_i\}_{i=1,\ldots,k}$; $\mathcal{U} = \{\mathbf{u}_i\}_{i=1,\ldots,k}$;
Ensure: Initial selected bounding boxes index $\mathcal{D} = \emptyset$; selected bounding boxes $\mathcal{B}_o = \emptyset$; rectified confidence values $\mathcal{S}_o = \emptyset$ of the corresponding selected bounding boxes;
1:  $k_{select} = 0$;
2:  for $i = 1, 2, \cdots, k$ do
3:     if $\mathbf{c}_i > score\_thres_1$ then
4:        $\mathbf{c}_i \leftarrow \mathbf{c}_i^{0.7} \times \mathbf{u}_i^{0.3}$;
5:        $k_{select} \leftarrow k_{select} + 1$;
6:        $\mathcal{D} \leftarrow \mathcal{D} \cup \{i\}$;
7:     end if
8:  end for
9:  $\mathcal{B} \leftarrow \mathcal{B}(\mathcal{D})$; $\mathcal{C} \leftarrow \mathcal{C}(\mathcal{D})$; $\mathcal{U} \leftarrow \mathcal{U}(\mathcal{D})$;
10:  for $i = 1, 2, \cdots, k_{select}$ do
11:     $iou_{all} = 0$, $N_{neighbor} = 0$;
12:     for $j = 1, 2, \cdots, k_{select}$ do
13:        if $\mathrm{IoU}(\mathbf{b}_i, \mathbf{b}_j) > iou\_thres$ then
14:            $iou_{all} \leftarrow iou_{all} + \mathrm{IoU}(\mathbf{b}_i, \mathbf{b}_j)$;
15:            $N_{neighbor} \leftarrow N_{neighbor} + 1$;
16:        end if
17:     end for
18:     $iou_{mean} = \frac{iou_{all}}{N_{neighbor}}$;
19:     $s = iou_{mean} \cdot \mathbf{c}_i$;
20:     if $iou_{mean} > iou\_thres_{missed}$ and $N_{neighbor} > neighbor\_thres_{missed}$ then
21:         $s \leftarrow s + \Delta c$;
22:     end if
23:     if $s > score\_thres_2$ then
24:         $\mathcal{B}_o \leftarrow \mathcal{B}_o \cup \{\mathbf{b}_i\}$; $\mathcal{S}_o \leftarrow \mathcal{S}_o \cup \{s\}$;
25:     end if
26:  end for
27:  return $\mathcal{B}_o, \mathcal{S}_o$

III-E Confidence Correction Mechanism (CCM)

Previous advanced single-stage point-based detectors use two completely independent branches in the detection head to regress the confidence and the geometric parameters of the predicted boxes, respectively. During post-processing, the NMS operator is guided by confidence scores: it retains the predicted boxes that have the highest confidence score in their local region and a confidence score above a threshold. This is disadvantageous for single-stage point-based detectors for two reasons. First, a high confidence score does not necessarily indicate high localization accuracy. Second, point clouds carry rich geometric information but little semantic information, which makes it difficult for the point-based backbone to learn precise confidence and leads to a large number of missed detections and false positives. Researchers refer to this issue as misalignment between localization accuracy and classification confidence (MLC).

As mentioned in Section II, methods with an IoU prediction branch offer a way to alleviate the MLC problem of single-stage detectors. These methods predict the confidence of each proposal together with its IoU with the GT box, and the final confidence sent to NMS is a joint indicator of both. SGCCNet also adopts this approach to improve model performance. Furthermore, we consider not only the prediction of a single voting point but also the predictions of the surrounding voting points when correcting the confidence. The work closest to our idea is NIV-SSD [44], but it applies only to single-class detection with anchor-based detection heads and does not consider the uneven predictions of the voting points of each target in point-based detectors.

For a stably trained SGCCNet, the key points output by the 3D backbone move to positions near the center of the target box after voting. As shown in Fig. 9, for the same target, different voting points cluster to a certain degree in both their positions and their regressed boxes. Therefore, we use the predictions of neighboring voting points to correct the confidence of the current voting point. We believe that voting points whose local regions contain more neighbors with high IoU values are more accurate, while those with few such neighbors have unreliable confidence. The confidence correction mechanism is detailed in Algorithm 2. Let the parameters of the box regressed by the current voting point be $\mathbf{b}_i = (x, y, z, w, l, h, r)$. We calculate its IoU with the predicted boxes of the other voting points $\mathcal{B} = \{\mathbf{b}_j\}_{j=1,\ldots,k} \in \mathbb{R}^{k \times 7}$ and regard the predicted boxes whose IoU exceeds a certain threshold as neighbors of the current voting point. Finally, the average IoU between the box of the voting point and those of its neighbors is used as a weight to calibrate the confidence of the current voting point. In addition, CCM sets strict conditions to pick out boxes with high localization accuracy but confidence lower than a threshold. Before the NMS operation, we give them a certain increment to increase their probability of being considered as foreground. We will analyze the advantages of CCM in detail in the experiments.
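The mechanism can be sketched in NumPy as follows. To keep the example self-contained, the IoU is computed for axis-aligned boxes: the heading $r$ is ignored and $(w, l, h)$ are taken as extents along $x$, $y$, $z$, which is a simplification rather than the rotated 3D IoU a detector would use. Function names and the threshold values in the usage are hypothetical.

```python
import numpy as np

def axis_aligned_iou_3d(a, b):
    """IoU of two boxes (x, y, z, w, l, h, r), ignoring the heading r."""
    a_min, a_max = a[:3] - a[3:6] / 2, a[:3] + a[3:6] / 2
    b_min, b_max = b[:3] - b[3:6] / 2, b[:3] + b[3:6] / 2
    overlap = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None)
    inter = np.prod(overlap)
    return inter / (np.prod(a[3:6]) + np.prod(b[3:6]) - inter)

def confidence_correction(boxes, cls_conf, iou_pred, score_thres1, score_thres2,
                          iou_thres, delta_c, iou_thres_missed,
                          neighbor_thres_missed):
    # Pre-filter by classification confidence, then blend it with predicted IoU.
    keep = cls_conf > score_thres1
    boxes = boxes[keep]
    conf = cls_conf[keep] ** 0.7 * iou_pred[keep] ** 0.3
    out_boxes, out_scores = [], []
    for i in range(len(boxes)):
        ious = np.array([axis_aligned_iou_3d(boxes[i], b) for b in boxes])
        neighbors = ious > iou_thres      # includes the box itself (IoU = 1)
        n_neighbor = int(neighbors.sum())
        iou_mean = ious[neighbors].mean() if n_neighbor > 0 else 0.0
        s = iou_mean * conf[i]            # neighbor agreement weights the score
        # Boost well-supported boxes that might otherwise be missed.
        if iou_mean > iou_thres_missed and n_neighbor > neighbor_thres_missed:
            s += delta_c
        if s > score_thres2:
            out_boxes.append(boxes[i])
            out_scores.append(s)
    return np.array(out_boxes).reshape(-1, 7), np.array(out_scores)

boxes = np.array([[0, 0, 0, 2, 2, 2, 0],
                  [0, 0, 0, 2, 2, 2, 0],
                  [50, 50, 0, 2, 2, 2, 0]], dtype=float)
cls_conf = np.array([0.9, 0.8, 0.05])
iou_pred = np.array([0.8, 0.8, 0.5])
sel_boxes, sel_scores = confidence_correction(
    boxes, cls_conf, iou_pred, 0.1, 0.3, 0.5, 0.1, 0.9, 5)
```

In this toy frame, the isolated low-confidence box is removed by the first threshold, and the two overlapping boxes support each other with a mean IoU of 1, so their blended confidence is passed on unchanged.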

Figure 9: The working mechanism of the confidence correction mechanism. CCM corrects the prediction of the current vote point by considering the predictions of other vote points in the neighborhood. CCM also takes into account the possibility of missed detections, where $\eta$ and $\xi$ in the figure correspond to the IoU threshold and the number-of-local-neighbors threshold in line 20 of Algorithm 2.
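
The correction step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' Algorithm 2: the function name `correct_confidence`, the product used as the joint score, the neighbor threshold `iou_thres=0.5`, and the zero weight given to boxes without neighbors are all hypothetical. The pairwise IoU matrix is taken as an input so that no rotated-box IoU routine is needed, and the defaults of the missed-detection branch reuse the values reported in the implementation details.

```python
import numpy as np

def correct_confidence(scores, pred_iou, pair_iou,
                       iou_thres=0.5, iou_thres_missed=0.9,
                       neighbor_thres_missed=10, score_thres=0.45,
                       delta_c=0.2):
    """Hypothetical CCM-style correction (illustration only).

    scores   : (k,) classification confidence of each voting point
    pred_iou : (k,) IoU predicted by the IoU branch
    pair_iou : (k, k) IoU between every pair of predicted boxes
    """
    scores = np.asarray(scores, dtype=float)
    pair = np.array(pair_iou, dtype=float)   # copy, the input is not modified
    np.fill_diagonal(pair, 0.0)              # a box is not its own neighbor

    # Preliminary correction: joint indicator of score and predicted IoU
    # (the product is an assumption).
    joint = scores * np.asarray(pred_iou, dtype=float)

    # Neighbors are boxes whose IoU with the current box exceeds iou_thres.
    is_neighbor = pair > iou_thres
    n_neighbors = is_neighbor.sum(axis=1)
    mean_iou = (pair * is_neighbor).sum(axis=1) / np.maximum(n_neighbors, 1)

    # Weight the confidence by the mean neighbor IoU (0 without neighbors).
    corrected = joint * mean_iou

    # Possible missed detections: enough very tight neighbors, low confidence.
    n_tight = (pair > iou_thres_missed).sum(axis=1)
    missed = (n_tight >= neighbor_thres_missed) & (corrected < score_thres)
    corrected[missed] += delta_c
    return np.clip(corrected, 0.0, 1.0)
```

With two mutually overlapping boxes and one isolated box, the isolated box is suppressed while the clustered boxes keep a confidence scaled by their mutual IoU.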

III-F End-to-End Training

SGCCNet is a one-stage detector trained end-to-end in a multi-task fashion. The first term is the semantic loss $\mathcal{L}_s$. The last two SA layers in SGCCNet use foreground sampling, so the semantic information of the points sampled in the previous stage must be obtained. The object categories in the KITTI dataset are highly unbalanced: Car objects are about five times as numerous as Pedestrian objects and ten times as numerous as Cyclist objects, and Car objects usually also contain more points than the other two. We therefore adopt the weighted cross-entropy (WCE) loss to manually emphasize rare categories. The WCE loss can be formulated as:

$$\alpha_c=\frac{1}{F_c-\epsilon} \tag{9}$$
$$\mathcal{L}_s=-\sum_{c=1}^{C}\alpha_c\,y_c\log\hat{y_c} \tag{10}$$

where $y_c$ is the ground truth label determined by whether the point is inside the bounding box, $\hat{y_c}$ is the predicted probability, $F_c$ is the frequency, and $\alpha_c$ is the weight of the $c^{th}$ class. $C$ is the number of categories in the dataset. Yang et al. [3] pointed out that points closer to the target center are easier to regress accurate bounding boxes, have higher confidence, and achieve higher accuracy in local semantic prediction. Following the design of Zhang et al. [6], we assign weights to each point participating in the calculation of the semantic loss, with points closer to the bounding box center having higher weights. This way, during training, these points will learn higher semantic scores.

$$\mathcal{L}_{sample}=\sum_{i}\mathcal{L}_{s_i}\cdot Mask_i \tag{11}$$
$$Mask_i=\sqrt[3]{\frac{\min(f^*,b^*)}{\max(f^*,b^*)}\times\frac{\min(l^*,r^*)}{\max(l^*,r^*)}\times\frac{\min(u^*,d^*)}{\max(u^*,d^*)}} \tag{12}$$

where $Mask_i$ is the centrality weight of the $i$-th sampled point, and $f^*,b^*,l^*,r^*,u^*,d^*$ are the distances from the sampled point to the six faces of the target box. After training, points closer to the center of the target will have higher $\hat{y_c}$ values, and these points will be prioritized during inference.
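
Equations (9) to (12) can be written out as a short NumPy sketch. It is an illustration under stated assumptions rather than the training code: the function names are hypothetical, $F_c$ is assumed to be a relative class frequency, and the clipping of probabilities before the logarithm is a numerical guard that does not appear in the equations.

```python
import numpy as np

def class_weights(freq, eps=0.001):
    """Eq. (9): alpha_c = 1 / (F_c - eps)."""
    return 1.0 / (np.asarray(freq, dtype=float) - eps)

def weighted_ce(y_true, y_prob, alpha):
    """Eq. (10), per point: -sum_c alpha_c * y_c * log(y_hat_c)."""
    y_prob = np.clip(y_prob, 1e-12, 1.0)   # numerical guard (assumption)
    return -(alpha * np.asarray(y_true) * np.log(y_prob)).sum(axis=-1)

def centrality_mask(f, b, l, r, u, d):
    """Eq. (12): cube root of the product of min/max face-distance ratios."""
    def ratio(p, q):
        return np.minimum(p, q) / np.maximum(p, q)
    return np.cbrt(ratio(f, b) * ratio(l, r) * ratio(u, d))

def sample_loss(y_true, y_prob, alpha, mask):
    """Eq. (11): sum_i L_{s_i} * Mask_i."""
    return float((weighted_ce(y_true, y_prob, alpha) * mask).sum())
```

A point at the exact box center has equal distances to opposite faces, so its mask is 1; the mask decays toward 0 as the point approaches any face.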

The remaining losses are designed as in most point-based detectors. The first is the regression loss of the vote points, $\mathcal{L}_{vote}$, which determines the localization accuracy of the predicted boxes; we use the $L1$ loss. The second is the bounding box classification loss $\mathcal{L}_{cls}$, also calculated using the WCE loss. The third is the bounding box regression loss $\mathcal{L}_{reg}$ computed from the vote point features, which is further decomposed into location, size, angle-bin, angle-res, and corner parts. To alleviate the MLC problem, we also add an IoU branch with loss $\mathcal{L}_{IoU}$ that predicts the IoU between the box regressed by each vote point and the GT, also using the $L1$ loss; its output is used to preliminarily correct the confidence (see Algorithm 2). All the losses involved in the model are as follows:

$$\mathcal{L}=\mathcal{L}_{sample}+\mathcal{L}_{vote}+\mathcal{L}_{cls}+\mathcal{L}_{reg} \tag{13}$$
$$\mathcal{L}_{reg}=\mathcal{L}_{loc}+\mathcal{L}_{size}+\mathcal{L}_{angle-bin}+\mathcal{L}_{angle-res}+\mathcal{L}_{corner}+\mathcal{L}_{IoU} \tag{14}$$

TABLE I: Quantitative comparison with state-of-the-art methods on the KITTI test set for Car BEV and 3D detection, under the evaluation metric of 3D Average Precision ($AP$) of 40 sampling recall points. The best results are underlined and our SGCCNet results are in bold. All columns use IoU = 0.7 and R40.

| Method | Backbone | Type | 3D Easy | 3D Moderate | 3D Hard | BEV Easy | BEV Moderate | BEV Hard |
|---|---|---|---|---|---|---|---|---|
| VoxelNet [11] | Voxel-based | 1-stage | 77.47 | 65.11 | 57.73 | 87.95 | 78.39 | 71.29 |
| PointPillars [13] | Voxel-based | 1-stage | 82.58 | 74.31 | 68.99 | 90.07 | 86.56 | 82.81 |
| SECOND [9] | Voxel-based | 1-stage | 84.65 | 75.96 | 68.71 | 89.39 | 83.77 | 78.59 |
| 3DIoULoss [45] | Voxel-based | 2-stage | 86.16 | 76.5 | 71.39 | 90.23 | 86.61 | 86.37 |
| TANet [46] | Voxel-based | 1-stage | 84.39 | 75.94 | 68.82 | 91.58 | 86.54 | 81.19 |
| Part-A2 [47] | Voxel-based | 2-stage | 87.81 | 78.49 | 73.51 | 91.7 | 87.79 | 84.61 |
| CIASSD [35] | Voxel-based | 1-stage | 89.59 | 80.28 | 72.87 | 93.74 | 89.84 | 82.39 |
| SASSD [48] | Voxel-based | 1-stage | 88.75 | 79.79 | 74.61 | 93.74 | 89.84 | 82.39 |
| Associate-3Det [49] | Voxel-based | 1-stage | 85.99 | 77.4 | 70.53 | 91.4 | 88.09 | 82.96 |
| SAT-GCN [50] | Voxel-based | 1-stage | 86.55 | 78.12 | 73.72 | 92.83 | 88.06 | 83.51 |
| SVGA-Net [51] | Voxel-based | 1-stage | 87.33 | 80.47 | 75.91 | - | - | - |
| Fast Point R-CNN [52] | Point-Voxel | 2-stage | 85.29 | 77.4 | 70.24 | 90.87 | 87.84 | 80.52 |
| STD [53] | Point-Voxel | 2-stage | 87.92 | 79.71 | 75.09 | - | - | - |
| PV-RCNN [17] | Point-Voxel | 2-stage | \underline{90.25} | 81.43 | 76.82 | \underline{94.98} | \underline{90.62} | 86.14 |
| EQ-PVRCNN [54] | Point-Voxel | 2-stage | 90.13 | \underline{82.01} | \underline{77.53} | 94.55 | 89.09 | \underline{86.42} |
| VIC-Net [55] | Point-Voxel | 1-stage | 88.25 | 80.61 | 75.83 | - | - | - |
| HVPR [56] | Point-Voxel | 1-stage | 86.38 | 77.92 | 73.04 | - | - | - |
| PointRCNN [57] | Point-based | 2-stage | 86.96 | 75.64 | 70.7 | 92.13 | 87.39 | 82.72 |
| 3D IoU-Net [58] | Point-based | 2-stage | 87.96 | 79.03 | 72.78 | 94.76 | 88.38 | 81.93 |
| 3DSSD [3] | Point-based | 1-stage | 88.36 | 79.57 | 74.55 | 92.66 | 89.02 | 85.86 |
| IA-SSD [6] | Point-based | 1-stage | 88.34 | 80.13 | 75.04 | 92.79 | 89.33 | 84.35 |
| IA-SSD* [6] | Point-based | 1-stage | 87.14 | 78.47 | 73.53 | 92.21 | 88.71 | 83.72 |
| DBQ-SSD [7] | Point-based | 1-stage | 87.93 | 79.39 | 74.4 | - | - | - |
| SGCCNet | Point-based | 1-stage | \textbf{89.24} | \textbf{80.82} | \textbf{75.58} | \textbf{93.57} | \textbf{89.88} | \textbf{85.27} |
| Gain over Fast Point R-CNN | - | - | 3.95 | 3.42 | 5.34 | 2.7 | 2.04 | 4.75 |
| Gain over IA-SSD | - | - | 0.9 | 0.69 | 0.54 | 0.78 | 0.55 | 0.92 |
| Gain over IA-SSD* | - | - | 2.1 | 2.35 | 2.05 | 1.36 | 1.17 | 1.55 |

IV Experiments

In this section, we provide detailed experiments to demonstrate the efficiency and accuracy of SGCCNet. Section IV-A introduces the experimental settings and implementation details. Section IV-B analyzes the detection performance of SGCCNet on the KITTI dataset and compares it with other state-of-the-art models. Section IV-C analyzes the inference efficiency of SGCCNet. Section IV-D reports ablation experiments that demonstrate the effectiveness of the SGCCNet design. Finally, in Section IV-E, we add our model components to other advanced models to demonstrate the generality of our design.

TABLE II: Quantitative comparison with state-of-the-art methods on the KITTI val set for Car BEV and 3D detection, under the evaluation metric of 3D Average Precision ($AP$) of 11 and 40 sampling recall points. The best results are underlined and our SGCCNet results are in bold. All columns use IoU = 0.7.

| Method | Backbone | Type | 3D R11 Easy | 3D R11 Moderate | 3D R11 Hard | 3D R40 Easy | 3D R40 Moderate | 3D R40 Hard | BEV R11 Easy | BEV R11 Moderate | BEV R11 Hard | BEV R40 Easy | BEV R40 Moderate | BEV R40 Hard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VoxelNet [11] | Voxel-based | 1-stage | 81.97 | 65.46 | 62.85 | - | - | - | 89.6 | 84.81 | 78.57 | - | - | - |
| PointPillars [13] | Voxel-based | 1-stage | 86.44 | 77.28 | 74.65 | 87.75 | 78.38 | 75.18 | 89.66 | 87.16 | 84.39 | 92.05 | 88.05 | 86.67 |
| SECOND [9] | Voxel-based | 1-stage | 88.61 | 78.62 | 77.22 | 90.55 | 81.61 | 78.61 | 90.01 | 87.92 | 86.45 | 92.42 | 88.55 | 87.65 |
| SECOND-iou [9] | Voxel-based | 1-stage | 84.93 | 76.3 | 75.98 | 86.77 | 79.23 | 77.17 | 87.9 | 76.3 | 75.95 | 90.23 | 86.61 | 86.37 |
| TANet [46] | Voxel-based | 1-stage | 88.17 | 77.75 | 75.31 | - | - | - | - | - | - | - | - | - |
| Part-A2 [47] | Voxel-based | 2-stage | 89.55 | 79.41 | 78.85 | 92.15 | 82.91 | 82.05 | 90.2 | 87.96 | 87.56 | 92.9 | 90.01 | 88.35 |
| Part-A2-free [47] | Voxel-based | 1-stage | 89.12 | 78.73 | 77.98 | 91.68 | 80.31 | 78.1 | 90.1 | 86.79 | 84.6 | 92.84 | 88.15 | 86.16 |
| SASSD [48] | Voxel-based | 1-stage | 89.69 | 79.41 | 78.33 | - | - | - | \underline{90.59} | 88.43 | 87.49 | - | - | - |
| Associate-3Det [49] | Voxel-based | 1-stage | 00 | 79.17 | - | - | - | - | - | - | - | - | - | - |
| CIASSD [35] | Voxel-based | 1-stage | 00 | 79.81 | - | - | - | - | - | - | - | - | - | - |
| PV-RCNN [17] | Point-Voxel | 2-stage | 89.26 | 79.16 | \underline{79.39} | 91.37 | 82.78 | 80.24 | 89.98 | 87.7 | 86.59 | 92.72 | 88.59 | 88.04 |
| Fast Point R-CNN [52] | Point-Voxel | 2-stage | 00 | 79 | - | - | - | - | - | - | - | - | - | - |
| STD [53] | Point-Voxel | 2-stage | 00 | 79.8 | - | - | - | - | - | - | - | - | - | - |
| VIC-Net [55] | Point-Voxel | 1-stage | 00 | 79.25 | - | - | - | - | - | - | - | - | - | - |
| PointRCNN [57] | Point-based | 2-stage | 88.95 | 78.67 | 77.78 | 91.83 | 80.61 | 78.18 | 89.92 | 78.67 | 77.78 | 93.07 | 88.85 | 86.73 |
| PointRCNN-iou [57] | Point-based | 2-stage | 89.09 | 78.78 | 78.26 | 89.89 | 80.68 | 78.41 | 90.19 | 87.49 | 85.91 | 94.99 | 88.82 | 86.71 |
| SPSNet [10] | Point-based | 2-stage | 89.19 | 79.29 | 78.2 | 90.52 | 83.03 | 80.15 | 90.31 | 88.72 | 87.31 | 93.2 | 91.21 | 88.9 |
| 3DSSD [3] | Point-based | 1-stage | 88.79 | 78.58 | 77.47 | 91.32 | 82.95 | 80.37 | 90.08 | 87.87 | 86.35 | 93.86 | 91.1 | 88.78 |
| SASA [4] | Point-based | 1-stage | 88.88 | 79.3 | 78.64 | 91.23 | 83.09 | 82.34 | 90.09 | 88.3 | 87.19 | 94.83 | 91.08 | 88.93 |
| IA-SSD [6] | Point-based | 1-stage | 88.78 | 79.12 | 78.12 | 89.52 | 82.86 | 80.05 | 90.34 | 88.19 | 86.78 | 93.17 | 89.54 | 88.64 |
| DBQ-SSD [7] | Point-based | 1-stage | 88.56 | 78.74 | 77.45 | 89.95 | 81.31 | 77.14 | 89.96 | 87.8 | 85.64 | 93.49 | 88.42 | 87.36 |
| SGCCNet | Point-based | 1-stage | \underline{\textbf{89.92}} | \underline{\textbf{80.8}} | \textbf{78.97} | \underline{\textbf{93.27}} | \underline{\textbf{84.17}} | \underline{\textbf{82.92}} | \textbf{90.55} | \underline{\textbf{88.73}} | \underline{\textbf{87.94}} | \underline{\textbf{96.44}} | \underline{\textbf{91.65}} | \underline{\textbf{89.42}} |
| Gain over IA-SSD* | - | - | 1.14 | 1.68 | 0.85 | 3.75 | 1.31 | 2.87 | 0.21 | 0.54 | 1.16 | 3.27 | 2.11 | 0.78 |
| Gain over 3DSSD | - | - | 1.13 | 2.22 | 1.5 | 1.95 | 1.22 | 2.55 | 0.47 | 0.86 | 1.59 | 2.58 | 0.55 | 0.64 |

IV-A Setting

Datasets. We validate our method on the KITTI dataset, which best demonstrates the advantages of point-based approaches. The KITTI dataset uses a 64-beam LiDAR, resulting in relatively dense point cloud scenes where a sparse backbone can still learn ideal features. The manually annotated range in the KITTI dataset is small, so scenes contain significantly fewer points than in other datasets, which allows FPS-related algorithms with quadratic complexity to maintain fast inference speeds. Currently, comparisons of point-based detectors are primarily focused on the KITTI dataset. The KITTI dataset is sponsored by the Karlsruhe Institute of Technology and the Toyota Technological Institute at Chicago for research in the field of autonomous driving. This widely used dataset contains 7,481 training samples with annotations in the camera field of view and 7,518 testing samples. Following the common protocol, we further divide the training samples into a training set (3,712 samples) and a validation set (3,769 samples). Additionally, the samples are divided into three difficulty levels (easy, moderate, and hard) based on the occlusion level, visibility, and bounding box size. The moderate average precision is the official ranking metric for both 3D and BEV detection on the KITTI website. It is worth noting that the SGCCNet model submitted to the KITTI website for comparison is trained on 80% of the complete training set.

Evaluation metrics. For the KITTI scenes, we evaluate the performance of each class using both the 3D and BEV average precision (AP) metrics. To ensure an objective comparison, we report both the $AP$ with 40 recall points ($AP_{40}$) and the $AP$ with 11 recall points ($AP_{11}$). Consistent with the majority of state-of-the-art methods, we use Intersection over Union (IoU) thresholds of 0.7, 0.5, and 0.5 for Car, Pedestrian, and Cyclist, respectively.
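
The difference between $AP_{11}$ and $AP_{40}$ lies in where the precision-recall curve is sampled. The sketch below is a simplified illustration, assuming the usual KITTI convention that the 11-point variant samples recall at 0, 0.1, ..., 1 and the 40-point variant at 1/40, 2/40, ..., 1. The function name `interpolated_ap` is hypothetical, and the matching of detections to GT boxes by IoU and difficulty, which the official evaluation performs first, is omitted.

```python
import numpy as np

def interpolated_ap(recall, precision, num_points=40):
    """Mean interpolated precision over fixed recall positions."""
    recall = np.asarray(recall, dtype=float)
    precision = np.asarray(precision, dtype=float)
    if num_points == 11:
        # 11-point variant: recall sampled at 0, 0.1, ..., 1.0
        positions = np.linspace(0.0, 1.0, 11)
    else:
        # N-point variant: recall sampled at 1/N, 2/N, ..., 1.0
        positions = np.arange(1, num_points + 1) / num_points
    total = 0.0
    for r in positions:
        reachable = recall >= r - 1e-12
        # Interpolated precision: best precision at any recall >= r
        total += precision[reachable].max() if reachable.any() else 0.0
    return total / len(positions)
```

Because the 11-point variant includes the recall position 0, it tends to give slightly different values from the 40-point variant on the same curve, which is why both are reported.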

Implementation details. SGCCNet is trained for 80 epochs with a batch size of 8 on 2 NVIDIA A40 GPUs. The first 70 epochs use saliency-guided point removal and new ground truth pooling for augmentation, while the last 10 epochs fine-tune using the initial ground truth pooling. The initial learning rate is set to 0.01, which is decayed by 0.1 at 35 and 45 epochs and updated with the one cycle policy. We use the Adam optimizer with $\beta_1=0.9$ and $\beta_2=0.85$ for optimization. The weight decay coefficient is set to 0.01, and the momentum coefficient is set to 0.9. Other data augmentation methods follow the default setting of Zhang et al. [6]. In the WCE loss, $\epsilon$ is set to 0.001. In CCM, the initial object score threshold is $score\_thres_1=0.01$, and the final object score threshold is $score\_thres_2=0.45$. For potential missed detections, the missed sample incremental confidence $\Delta c$ is set to 0.2, the missed sample IoU threshold $iou\_thres_{missed}$ is set to 0.9, and the threshold on the number of missed sample neighbors $neighbor\_thres_{missed}$ is set to 10. All experiments are implemented using the OpenPCDet framework (https://github.com/open-mmlab/OpenPCDet). The SGCCNet-elite for classification tasks is trained for 10 epochs on a single NVIDIA 3090 GPU. The initial learning rate is set to 0.01, the weight decay coefficient is set to 0.0002, and the momentum coefficient is set to 0.9.
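
For reference, the detector hyperparameters listed above can be collected in one place. The following is only an organizational sketch: the class and field names are hypothetical and do not correspond to OpenPCDet configuration keys, while the values are the ones stated in this subsection.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    epochs: int = 80              # total training epochs
    sgda_epochs: int = 70         # epochs with saliency-guided point removal
    batch_size: int = 8
    lr: float = 0.01              # initial learning rate
    weight_decay: float = 0.01
    momentum: float = 0.9
    betas: tuple = (0.9, 0.85)    # Adam beta_1 and beta_2
    wce_eps: float = 0.001        # epsilon in the WCE class weights

@dataclass(frozen=True)
class CCMConfig:
    score_thres1: float = 0.01    # initial object score threshold
    score_thres2: float = 0.45    # final object score threshold
    delta_c: float = 0.2          # confidence increment for missed samples
    iou_thres_missed: float = 0.9
    neighbor_thres_missed: int = 10

def uses_sgda(epoch, cfg=TrainConfig()):
    """True while saliency-guided point removal is active (0-based epoch)."""
    return epoch < cfg.sgda_epochs
```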

Benchmark detector. The work most similar to our SGCCNet is IA-SSD, which to date provides the best balance between detection accuracy and inference speed among point-based pipelines and achieves state-of-the-art results on several KITTI metrics. The training details, model parameter configurations, and overall performance of SGCCNet are roughly similar to those of IA-SSD, so we focus on comparing the performance of SGCCNet and IA-SSD.

IV-B Comparison with state-of-the-art (SOTA) methods

Note. SGCCNet is trained on multiple classes simultaneously. Models trained on a single class only are marked as MODEL in the table. The IA-SSD model has attracted widespread research interest since its release, but many researchers have been unable to reproduce the performance reported in the original paper, and the authors have not responded to these concerns (https://github.com/yifanzhang713/IA-SSD/issues/54). We believe that the IA-SSD model may depend on the training environment, especially the versions of PyTorch and CUDA. To ensure fair comparisons, we trained and tested the model in our local environment. The reproduced model is marked as MODEL* in the table.

TABLE III: Quantitative comparison with other point-based methods on the KITTI val set for Cyclist and Pedestrian 3D detection, under the evaluation metric of 3D Average Precision ($AP$) of 11 and 40 sampling recall points. Our SGCCNet results are highlighted in bold. All columns use IoU = 0.5.

| Method | Cyclist R11 Easy | Cyclist R11 Moderate | Cyclist R11 Hard | Cyclist R40 Easy | Cyclist R40 Moderate | Cyclist R40 Hard | Pedestrian R11 Easy | Pedestrian R11 Moderate | Pedestrian R11 Hard | Pedestrian R40 Easy | Pedestrian R40 Moderate | Pedestrian R40 Hard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3DSSD | 86.36 | 71.2 | 66.1 | 91.4 | 71.8 | 67.6 | 55.9 | 50.62 | 47.88 | 55.1 | 50.63 | 46.2 |
| IA-SSD* | 86.33 | 69.13 | 65.32 | 89.02 | 69.3 | 65.29 | 61.9 | 57.8 | 52.58 | 61.06 | 56.6 | 51.84 |
| DBQ-SSD* | 85.86 | 70.15 | 66.43 | 90.24 | 70.9 | 66.12 | 59.73 | 54.71 | 50.32 | 59.26 | 54.23 | 48.51 |
| SGCCNet | \textbf{86.52} | \textbf{72.23} | \textbf{66.32} | \textbf{90.84} | \textbf{71.2} | \textbf{66.73} | \textbf{62.78} | \textbf{58.1} | \textbf{52.5} | \textbf{62.06} | \textbf{56.84} | \textbf{51.73} |
| Gain over 3DSSD | 0.16 | 1.03 | 0.22 | -0.56 | -0.6 | -0.87 | 6.88 | 7.48 | 4.62 | 6.96 | 6.21 | 5.53 |
| Gain over IA-SSD* | 0.19 | 3.1 | 1 | 1.82 | 1.9 | 1.44 | 0.88 | 0.3 | -0.08 | 1 | 0.24 | -0.11 |
| Gain over DBQ-SSD* | 0.66 | 2.08 | -0.11 | 0.6 | 0.3 | 0.61 | 3.05 | 3.39 | 2.18 | 2.8 | 2.61 | 3.22 |

Performance on KITTI test set. We compare the performance of SGCCNet with other SOTA models on the KITTI test set in Table I. SGCCNet achieves an $AP_{3D}^{40}$ of 80.82% on the moderate level for Car class objects, which is the best among all point-based detectors. In detail, SGCCNet outperforms IA-SSD by $0.9\%$, $0.69\%$, and $0.54\%$ in $AP_{3D}^{40}$ on the easy, moderate, and hard levels, respectively, and by $0.78\%$, $0.55\%$, and $0.92\%$ in $AP_{BEV}^{40}$. Compared to the reproduced IA-SSD, SGCCNet surpasses it by $2.1\%$, $2.35\%$, and $2.05\%$ in $AP_{3D}^{40}$ and by $1.36\%$, $1.17\%$, and $1.55\%$ in $AP_{BEV}^{40}$. Surprisingly, the performance of SGCCNet even exceeds that of some structure-based models, with $mAP_{3D}^{40}$ surpassing Fast Point R-CNN by approximately $4.26\%$ and Part-A2 by about $1.94\%$.
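
The per-level gains quoted above follow directly from the rows of Table I. As a small worked check, the snippet below recomputes them for a few baselines; the dictionary simply copies the 3D $AP$ (R40) values from Table I, and the helper names `gains` and `mean_gain` are illustrative only.

```python
# 3D AP (R40), Car class, KITTI test set: (Easy, Moderate, Hard) from Table I.
TABLE_I_3D = {
    "SGCCNet": (89.24, 80.82, 75.58),
    "IA-SSD": (88.34, 80.13, 75.04),
    "IA-SSD*": (87.14, 78.47, 73.53),
    "Part-A2": (87.81, 78.49, 73.51),
}

def gains(ours, baseline, table=TABLE_I_3D):
    """Per-level AP difference, rounded to two decimals."""
    return tuple(round(a - b, 2) for a, b in zip(table[ours], table[baseline]))

def mean_gain(ours, baseline, table=TABLE_I_3D):
    """Average of the three per-level differences (mAP gain)."""
    g = gains(ours, baseline, table)
    return round(sum(g) / len(g), 2)

print(gains("SGCCNet", "IA-SSD"))      # (0.9, 0.69, 0.54)
print(gains("SGCCNet", "IA-SSD*"))     # (2.1, 2.35, 2.05)
print(mean_gain("SGCCNet", "Part-A2"))  # 1.94
```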

Performance on KITTI val set. We further report results on the KITTI validation set to better present the detection performance of SGCCNet, as shown in Table II. SGCCNet remains the best-performing point-based detector, especially at the Easy level, where its $AP_{3D}^{40}$ exceeds IA-SSD and 3DSSD by approximately $3.75\%$ and $1.95\%$, respectively. We attribute this improvement to the confidence correction mechanism, which allows the model to no longer rely solely on semantic scores. We also compare SGCCNet with other state-of-the-art point-based models on the Cyclist and Pedestrian classes, as shown in Table III. Compared to the single-class trained 3DSSD, SGCCNet outperforms it by $1.03\%$ and $7.48\%$ at the Moderate level of these two classes, respectively. Although its accuracy is slightly lower than that of 3DSSD at the Hard level, our model is trained end-to-end for multiple classes. Compared to other multi-class trained point-based detectors such as IA-SSD and DBQ-SSD, SGCCNet outperforms them by $3.1\%$ and $2.08\%$ at the Moderate level of the Cyclist class, and by $0.24\%$ and $2.61\%$ at the Moderate level of the Pedestrian class. These results demonstrate that SGCCNet also excels in detecting small objects.

Figure 10: Visualization of the detection results of SGCCNet on the KITTI dataset. We compare the results of SGCCNet with those of IA-SSD, highlighting the role of SGCCNet in correcting the confidence of missed targets and filtering false positive targets.

Visualization. The metrics above quantitatively demonstrate the superiority of SGCCNet. We additionally illustrate the results qualitatively through visualization. In Figure 10, we show four scenes from the KITTI val set and compare the detection results of IA-SSD and SGCCNet. The yellow boxes represent the ground truth (GT), while the green boxes represent the predicted results. We highlight the areas where the predictions of the two detectors differ. In scenes (a) and (b), IA-SSD misses one target per scene, and both are high-quality targets with obvious Car features. We believe that the missed detections are caused by interference from the features of other objects in the region, which leads the geometry and semantic information of the targets to be overlooked. SGCCNet, however, detects these targets accurately. In scenes (c) and (d), IA-SSD detects numerous false positive targets because it only considers semantic scores as the confidence criterion during post-processing. In contrast, SGCCNet adjusts the confidence by considering the predictions of neighboring vote points, filtering out these false positive targets. Overall, in these scenes, SGCCNet accurately detects each target in the GT, providing intuitive evidence of the superiority of our method.

TABLE IV: Comparison of the runtime performance of models in terms of GPU memory usage, single-frame inference time, and input data scale. All other models are reproduced from the configuration files and weight files provided by the OpenPCDet project.

| Method | Mem | Time (ms) | Input scale | Mod. |
|---|---|---|---|---|
| PointPillars | 354MB | 21 | 2~9k | 77.28 |
| SECOND | 713MB | 33 | 11-17k | 78.62 |
| 3DSSD | 521MB | 92 | 16384 | 78.58 |
| PointRCNN | 567MB | 94 | 16384 | 78.67 |
| Part-A2 | 720MB | 90 | 11-17k | 79.41 |
| PV-RCNN | 1233MB | 126 | 11-17k | 79.16 |
| IA-SSD | 102MB | 16 | 16384 | 79.12 |
| SGCCNet | 217MB | 23 | 16384 | 80.8 |

IV-C Runtime Analysis

One major advantage of point-based detectors is real-time detection. In this section, we illustrate this point by comparing GPU memory usage, single-frame inference time, and input data volume. Table IV shows these metrics for several representative models. Since the original papers did not analyze this aspect of performance, and the metrics depend on the software and hardware environment, we retested the models locally using the OpenPCDet project to ensure a fair comparison. The configuration files and pre-trained weights for these models all come from the OpenPCDet project. The hardware used for runtime testing was a single NVIDIA RTX 3090 GPU with an Intel i7-12700KF CPU, and the software used python=3.9.0 and pytorch=2.1.0+cu118. From the results, SGCCNet uses more memory than IA-SSD (217 MB vs. 102 MB) and its single-frame inference time is 7 ms longer, but it brings a performance gain of 1.68% $AP_{3D}^{11}$ under the same input scale, which we consider a worthwhile trade-off. Moreover, the 23 ms single-frame inference time meets the real-time requirements of current LiDAR. PointPillars achieves a similar inference time to SGCCNet in testing, but it occupies about 1.6 times the memory (354 MB vs. 217 MB) and its $AP_{3D}^{11}$ is 3.52% lower.
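
The real-time claim can be made concrete by converting the single-frame latencies of Table IV into frame rates. The snippet below is a small worked example; the 10 Hz sensor rate used as the real-time threshold is an assumption (it is the scan rate commonly associated with the KITTI LiDAR), and the helper names are hypothetical.

```python
# Single-frame inference time in milliseconds, copied from Table IV.
LATENCY_MS = {
    "PointPillars": 21, "SECOND": 33, "3DSSD": 92, "PointRCNN": 94,
    "Part-A2": 90, "PV-RCNN": 126, "IA-SSD": 16, "SGCCNet": 23,
}

def fps(latency_ms):
    """Frames per second achievable at a given per-frame latency."""
    return 1000.0 / latency_ms

def is_real_time(latency_ms, sensor_hz=10.0):
    """True if frames are processed at least as fast as the sensor emits them.

    sensor_hz=10.0 is an assumed LiDAR scan rate, not a value from the paper.
    """
    return fps(latency_ms) >= sensor_hz

for name, ms in LATENCY_MS.items():
    print(f"{name:12s} {fps(ms):6.1f} FPS  real-time: {is_real_time(ms)}")
```

Under this assumption, a 23 ms latency corresponds to roughly 43 frames per second, well above the assumed sensor rate.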

TABLE V: The basic information of the created classification dataset.
Split Category Number Total
Train Car 13382 16239
Pedestrian 2159
Cyclist 698
Validation Car 13874 17014
Pedestrian 2260
Cyclist 880
Figure 11: (a) The classification results of SGCCNet-elite on the created dataset. (b) The PR curve of the detection task on the KITTI val set before and after adding CCM.
Figure 12: Validation of the saliency results on the classification task. Discarding points according to their saliency scores disrupts the model’s classification performance far more than discarding points at random, and points with higher saliency scores cause greater disruption when discarded.
Figure 13: Visualization of the saliency heatmaps. Even after some points are discarded based on their saliency scores, the point clouds still retain the characteristics of their respective categories, yet the model misclassifies them as other categories. This indicates that the model relies on salient features, which is detrimental to its robustness. Our goal is to let the model learn from samples whose salient points have been discarded and thereby improve the diversity of training data at the feature level. In the figure, Ca, Cy, and P denote Car, Cyclist, and Pedestrian, respectively.
TABLE VI: The results of the ablation experiment. It can be seen that the model containing all components achieves the best performance.
Model SGDA GNM SCB CCM E M H
A 89.52 82.86 80.05
B 92.11 83.5 82.1
C 92.53 83.65 82.41
D 92.9 83.8 82.4
E 93.27 84.17 82.92
Figure 14: Validation of the saliency results on the KITTI detection task. Discarding points according to their saliency scores degrades the model’s detection performance far more than discarding points at random. This demonstrates that saliency obtained from the classification task is equally applicable to the detection task, validating our previous hypothesis.

IV-D Ablation Study

In this section, we will analyze the roles of various model components in SGCCNet and the selection of hyperparameters. The models involved are all trained on the KITTI train set and tested on the val set.

Effect of Saliency-Guided Data Augmentation (SGDA). The purpose of SGDA is to increase the diversity of training data at the feature level, reduce the model’s reliance on salient features, and improve its robustness to low-quality targets. As described in Section III-C, SGDA constructs an SGCCNet-elite model that performs a classification task to obtain saliency scores for each target. We first need to demonstrate that the saliency scores obtained from classification are also applicable to the detection task. The classification dataset we created follows the same train/val partition as KITTI, with the sample statistics shown in Table V.

Fig. 11 shows the training and testing results of SGCCNet-elite on the classification dataset. It achieves good classification results, with an overall accuracy of 97.8% and a mean class accuracy of 88.42%. The classification accuracies for Car, Pedestrian, and Cyclist are 99.93%, 95.67%, and 69.67%, respectively. To verify the point-wise saliency scores obtained by applying the method of Section III-C to SGCCNet-elite, we conducted a dropout experiment, shown in Fig. 12. Following Algorithm 1, we gradually dropped the most salient points (SD), from 5% to 80% of the points, and observed the changes in the classification metrics, comparing against random dropout (RD). As shown in Fig. 12, both SD and RD decrease the classification performance of the model as the number of dropped points increases. However, SD is more destructive, with a faster decline in performance. Even when only 5% of the points are dropped, the classification accuracy for the Cyclist class decreases by 14.67%. This comparison demonstrates that the saliency scores we obtained are meaningful for the classification task. In Fig. 13, we list some samples for saliency analysis, showing that the distribution of salient regions is similar for targets of the same type. For example, the salient regions of Car targets are concentrated on the roof, while those of Pedestrian and Cyclist targets are concentrated on the human body. The model’s dependence on salient features reduces its robustness: after we discard some salient points, the samples still clearly retain the basic features of these targets, yet the model misclassifies them as other categories. Our SGDA aims to alleviate this issue.
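The SD and RD settings of this dropout experiment can be sketched as follows. This is an illustrative sketch only, not a reproduction of Algorithm 1: it assumes the per-point saliency scores are already available as a list, and the function names `salient_dropout` and `random_dropout` are hypothetical.

```python
import random


def salient_dropout(points, saliency, ratio):
    """SD: drop the `ratio` fraction of points with the highest saliency."""
    n_drop = int(len(points) * ratio)
    # Point indices ordered from most salient to least salient.
    order = sorted(range(len(points)), key=lambda i: saliency[i], reverse=True)
    dropped = set(order[:n_drop])
    return [p for i, p in enumerate(points) if i not in dropped]


def random_dropout(points, ratio, rng=None):
    """RD: drop the same number of points, chosen uniformly at random."""
    rng = rng or random.Random(0)
    n_drop = int(len(points) * ratio)
    dropped = set(rng.sample(range(len(points)), n_drop))
    return [p for i, p in enumerate(points) if i not in dropped]


if __name__ == "__main__":
    # Toy object: 20 points on a line; saliency grows with the point index.
    points = [(float(i), 0.0, 0.0) for i in range(20)]
    saliency = [i / 19 for i in range(20)]
    for ratio in (0.05, 0.25, 0.8):
        kept_sd = salient_dropout(points, saliency, ratio)
        kept_rd = random_dropout(points, ratio)
        print(ratio, len(kept_sd), len(kept_rd))
```

Feeding both reduced point sets to the same classifier at each ratio and comparing the accuracy curves gives the kind of comparison shown in Fig. 12.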

Similarly, we applied this dropout experiment to the targets in each scene of the KITTI val set for the detection task; the results are shown in Fig. 14. SD significantly decreases the detection performance for every class, and the rate of decrease is much higher than with RD. We believe these results are sufficient to demonstrate that the saliency obtained from the classification task is also applicable to the detection task. Unlike previous random augmentation methods, saliency-guided data augmentation purposefully discards the features of salient regions and thereby enhances the diversity of training data at the feature level. Table VI quantifies the performance gain brought by SGDA, which leads to a 1.76% $mAP_{3D}^{11}$ improvement. These results demonstrate the effectiveness of SGDA.

TABLE VII: We test GNM and SCB at different stages to determine the best combination.
Stage1 Stage2 Stage3 E M H
A Norm 92.11 83.5 82.1
Skip
B Norm 92.53 83.62 82.31
Skip
C Norm 92.96 83.47 82.24
Skip
D Norm 92.9 83.8 82.4
Skip
E Norm 92.71 83.54 82.29
Skip

Effects of Geometric Normalization Module (GNM) and Skip Connection Block (SCB). GNM and SCB are designed to address internal covariate shift and feature forgetting, respectively. Both structures are often used in advanced classification models. In Table VII, we analyze how model performance changes when GNM and SCB are placed at different stages. Adding GNM and SCB to the second and third stages benefits the model, yielding a 0.42% $AP_{3D}^{11}$ improvement for the Car class at the Easy level. However, when GNM and SCB are added to all three stages, the performance of the model may even decrease. After conducting multiple experiments to rule out the influence of random factors, we found that the GNM or SCB in the first stage may damage the feature distribution. Through experiments (models C and D in Table VII), we ultimately decided to use only GNM in the first stage and abandon SCB there. The early point-wise features have not yet captured the essential features of the targets, and directly concatenating them with the features of later stages may affect the predictions in those stages. As shown in Table VI, the combination of GNM and SCB brings a 0.46% $mAP_{3D}^{11}$ improvement to the model.

TABLE VIII: Comparison of model performance before and after considering missed targets.
Method Car(IoU=0.7)
R11 R40
Easy Moderate Hard Easy Moderate Hard
SGCCNet w/o missed 89.62 80.65 78.78 92.89 83.93 82.94
SGCCNet w/ missed 89.92 80.8 78.97 93.27 84.17 82.92
Figure 15: Our CCM takes into account the possibility that the model misses targets. Predictions with accurate localization are given a certain confidence gain to ensure that they are detected by the model.

Confidence Correction Mechanism (CCM). Current single-stage point-based detectors use semantic scores as the sole criterion for filtering predicted boxes during post-processing, which tends to cause misalignment between localization accuracy and classification confidence (MLC). SGCCNet proposes a Confidence Correction Mechanism (CCM) for point-based multi-class detectors to alleviate this problem. As shown in Algorithm 2, CCM corrects the confidence of the current vote point by using the prediction information of neighboring vote points. Table VI quantitatively demonstrates the effectiveness of CCM, which improves the model’s $mAP_{3D}^{11}$ by 0.42%.
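The component gains quoted in this ablation (1.76% for SGDA, 0.46% for GNM and SCB, 0.42% for CCM) can be reproduced from Table VI. The sketch below assumes that each gain is the mean of the Easy, Moderate, and Hard differences between two configurations, and that the relevant pairs are A to B, B to D, and D to E. This mapping is inferred from the numbers rather than stated in the text, and `mean_gain` is a hypothetical helper.

```python
# Easy / Moderate / Hard results copied from Table VI.
TABLE_VI = {
    "A": (89.52, 82.86, 80.05),
    "B": (92.11, 83.5, 82.1),
    "C": (92.53, 83.65, 82.41),
    "D": (92.9, 83.8, 82.4),
    "E": (93.27, 84.17, 82.92),
}


def mean_gain(before, after, table=TABLE_VI):
    """Mean improvement over the three difficulty levels, in percent."""
    diffs = [a - b for a, b in zip(table[after], table[before])]
    return round(sum(diffs) / len(diffs), 2)


if __name__ == "__main__":
    # Assumed correspondence between table rows and added components.
    print("SGDA      (A -> B):", mean_gain("A", "B"))
    print("GNM + SCB (B -> D):", mean_gain("B", "D"))
    print("CCM       (D -> E):", mean_gain("D", "E"))
```

That the three computed values match the quoted gains supports this reading of the table.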

More intuitively, Fig. 11 (b) shows the Precision-Recall (PR) curve of the model before and after adopting CCM. Although CCM itself involves no learning, it can use the predicted geometric information to filter out false-positive boxes and boxes with high semantic scores but inaccurate localization. As shown in the figure, with CCM the model maintains higher precision over a wider range of recall, indicating fewer false positives than without CCM. We also illustrate the effect of accounting for missed targets in Table VIII. After adding a certain confidence gain to targets with accurate localization but low semantic scores, the model improves on five of the six metrics, for example by 0.30% (Easy) and 0.15% (Moderate) at R11, while the Hard level at R40 remains essentially unchanged (82.94% vs. 82.92%). Fig. 15 shows two such targets: their predictions carry distinct geometric information that matches the targets, and they are ultimately detected with CCM.
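For readers who want to relate a PR curve to the R11 and R40 columns of the tables, the following minimal sketch computes interpolated average precision. It assumes the commonly used KITTI conventions (11 recall positions from 0 to 1 for R11, 40 positions from 1/40 to 1 for R40); the text does not spell out these definitions, and `interpolated_ap` is a hypothetical helper rather than the official evaluation code.

```python
def interpolated_ap(recalls, precisions, num_points):
    """Interpolated AP over evenly spaced recall thresholds.

    num_points=11 samples recall at 0, 0.1, ..., 1.0 (R11).
    num_points=40 samples recall at 1/40, 2/40, ..., 1.0 (R40).
    """
    if num_points == 11:
        thresholds = [i / 10 for i in range(11)]
    elif num_points == 40:
        thresholds = [i / 40 for i in range(1, 41)]
    else:
        raise ValueError("num_points must be 11 or 40")
    total = 0.0
    for t in thresholds:
        # Interpolated precision: the best precision at any recall >= t,
        # or 0 if the detector never reaches that recall.
        candidates = [p for r, p in zip(recalls, precisions) if r >= t]
        total += max(candidates, default=0.0)
    return total / len(thresholds)


if __name__ == "__main__":
    # Toy PR curve with four operating points.
    recalls = [0.2, 0.4, 0.6, 0.8]
    precisions = [1.0, 0.9, 0.7, 0.5]
    print("AP|R11 =", interpolated_ap(recalls, precisions, 11))
    print("AP|R40 =", interpolated_ap(recalls, precisions, 40))
```

A curve that holds high precision up to a larger recall contributes more non-zero terms to the sum, which is why removing false positives with CCM raises AP.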

Figure 16: The basic structure of 3DSSD and SASA. The only difference lies in the downsampling method, where the former utilizes Farthest Point Sampling based on feature distance (F-FPS), while the latter employs Semantic Weighted Farthest Point Sampling (S-FPS).
TABLE IX: The versatility test of SGCCNet. It can be seen that SGDA and CCM can provide significant performance gains for 3DSSD and SASA.
Method Car(IoU=0.7)
R11 R40
Easy Moderate Hard Easy Moderate Hard
3DSSD 88.79 78.58 77.47 91.32 82.95 80.37
3DSSD+SGDA 89.13 79.67 78.33 91.97 83.29 82.18
Gain 0.34 1.09 0.86 0.65 0.34 1.81
3DSSD+SGDA+CCM 89.65 80.14 78.73 92.16 83.65 82.29
Gain 0.86 1.56 1.26 0.84 0.7 1.92
SASA 88.88 79.3 78.64 91.23 83.09 82.34
SASA+SGDA 89.45 79.47 78.77 91.64 83.29 82.36
Gain 0.57 0.17 0.13 0.41 0.2 0.02
SASA+SGDA+CCM 89.74 79.85 79.01 92.03 83.35 82.52
Gain 0.86 0.55 0.37 0.8 0.26 0.18

IV-E Generalizability Analysis

The core components of SGCCNet are plug-and-play, so we also tested their versatility on other point-based detectors. We selected the 3DSSD and SASA detectors for analysis because, as shown in Fig. 16, they share the same basic structure and differ only in the downsampling method. To preserve the basic structure of the models, we only analyzed the effects of SGDA and CCM. The experimental results in Table IX show consistent improvements for both models after adding SGDA and CCM. 3DSSD improves by 0.86%, 1.56%, and 1.26% $AP_{3D}^{11}$ on the Easy, Moderate, and Hard levels, respectively (0.34%, 1.09%, and 0.86% with SGDA alone). SASA improves by 0.86%, 0.55%, and 0.37%, respectively. These results indicate that the components of SGCCNet can be easily plugged into popular architectures.

IV-F Discussion

In this section, we have detailed the superiority and effectiveness of SGCCNet from the perspectives of detection metrics, visualization, efficiency analysis, and ablation experiments. Specifically, SGCCNet is currently the best-performing multi-class single-stage end-to-end point-based detector on the KITTI dataset, achieving an $AP_{3D}^{11}$ of 80.82% on the KITTI test set, surpassing IA-SSD by approximately 2.35% and even outperforming detectors that adopt structure-based backbones. Additionally, SGCCNet has a single-frame inference time of 23ms, slightly slower than IA-SSD but fully compliant with the real-time requirements of current LiDAR systems. Furthermore, the core components of SGCCNet are plug-and-play and provide performance gains for other similar detectors such as 3DSSD and SASA. Overall, the design of SGCCNet targets the ILQ and MLC issues that exist in current point-based detectors. The results show that SGCCNet effectively mitigates the impact of these issues on the model, improving the safety baseline and application prospects of point-based detectors.

V Conclusion

In this paper, we propose a new single-stage point-based 3D object detector called SGCCNet. It incorporates a saliency-guided data augmentation method that enhances the diversity of training data at the feature level, reducing the model’s reliance on salient features and improving its robustness to low-quality objects. It also includes a geometric normalization module and a skip connection block to address internal covariate shift and feature forgetting. To tackle the misalignment between localization accuracy and classification confidence in single-stage detectors, SGCCNet introduces a confidence correction mechanism suitable for multi-class point-based detectors. Experimental results demonstrate that SGCCNet is currently the best-performing point-based detector, with its components playing a substantial role in enhancing model performance and exhibiting good transferability.

Limitations and outlook. Like other point-based detectors, SGCCNet performs well only on multi-beam LiDAR scenes such as those in the KITTI dataset, and its overall performance is far inferior to that of structure-based detectors on datasets like NuScenes [59] and Waymo [60]. Firstly, these scenes are more complex, which makes it difficult to find a suitable fixed configuration, and the sparser distribution of objects hampers feature learning for point-based detectors. Secondly, the quadratic complexity of the FPS algorithm makes sampling time-consuming in scenes with such large numbers of points. In future work, we will develop more efficient point-based backbones to address these two issues.
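To make the cost of FPS concrete, the following is a minimal pure-Python sketch of vanilla farthest point sampling based on Euclidean distance. It is an illustration only: it is neither the F-FPS nor the S-FPS variant mentioned earlier, practical detectors use batched GPU kernels, and the function name and `start` parameter are hypothetical. Selecting M points from N requires M passes over all N points, so the cost is O(N·M), which becomes quadratic when M grows in proportion to N.

```python
import math


def farthest_point_sampling(points, m, start=0):
    """Return the indices of m points chosen by vanilla FPS.

    points: list of (x, y, z) tuples; start: index of the first sample.
    """
    n = len(points)
    if not 0 < m <= n:
        raise ValueError("m must satisfy 1 <= m <= len(points)")
    selected = [start]
    # min_dist[i]: squared distance from point i to its nearest selected point.
    min_dist = [math.inf] * n
    for _ in range(m - 1):
        last = points[selected[-1]]
        best_idx, best_dist = -1, -1.0
        for i, p in enumerate(points):  # one O(N) scan per new sample
            d = sum((a - b) ** 2 for a, b in zip(p, last))
            if d < min_dist[i]:
                min_dist[i] = d
            if min_dist[i] > best_dist:
                best_idx, best_dist = i, min_dist[i]
        selected.append(best_idx)
    return selected


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
    print(farthest_point_sampling(cloud, 3))
```

The nested loop makes the scaling visible: doubling both the scene size and the number of samples roughly quadruples the work.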

References

  • [1] Y. Wang, Q. Mao, H. Zhu, J. Deng, Y. Zhang, J. Ji, H. Li, and Y. Zhang, “Multi-modal 3d object detection in autonomous driving: a survey,” International Journal of Computer Vision, vol. 131, no. 8, pp. 2122–2152, 2023.
  • [2] Z. Meng, X. Xia, R. Xu, W. Liu, and J. Ma, “Hydro-3d: Hybrid object detection and tracking for cooperative perception using 3d lidar,” IEEE Transactions on Intelligent Vehicles, 2023.
  • [3] Z. Yang, Y. Sun, S. Liu, and J. Jia, “3dssd: Point-based 3d single stage object detector,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 040–11 048.
  • [4] C. Chen, Z. Chen, J. Zhang, and D. Tao, “Sasa: Semantics-augmented set abstraction for point-based 3d object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 1, 2022, pp. 221–229.
  • [5] Z. Huang, Y. Wang, J. Wen, P. Wang, and X. Cai, “An object detection algorithm combining semantic and geometric information of the 3d point cloud,” Advanced Engineering Informatics, vol. 56, p. 101971, 2023.
  • [6] Y. Zhang, Q. Hu, G. Xu, Y. Ma, J. Wan, and Y. Guo, “Not all points are equal: Learning highly efficient point-based detectors for 3d lidar point clouds,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18 953–18 962.
  • [7] J. Yang, L. Song, S. Liu, W. Mao, Z. Li, X. Li, H. Sun, J. Sun, and N. Zheng, “Dbq-ssd: Dynamic ball query for efficient 3d object detection,” in The Eleventh International Conference on Learning Representations, 2022.
  • [8] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [9] Y. Yan, Y. Mao, and B. Li, “Second: Sparsely embedded convolutional detection,” Sensors, vol. 18, no. 10, 2018. [Online]. Available: https://www.mdpi.com/1424-8220/18/10/3337
  • [10] A. Liang, H. Zhang, H. Hua, W. Chen, and H. Zhao, “Spsnet: Boosting 3d point-based object detectors with stable point sampling,” Engineering Applications of Artificial Intelligence, vol. 126, p. 106807, 2023.
  • [11] Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point cloud based 3d object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4490–4499.
  • [12] Y. Chen, J. Liu, X. Zhang, X. Qi, and J. Jia, “Voxelnext: Fully sparse voxelnet for 3d object detection and tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21 674–21 683.
  • [13] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “Pointpillars: Fast encoders for object detection from point clouds,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 12 697–12 705.
  • [14] J. Li, C. Luo, and X. Yang, “Pillarnext: Rethinking network designs for 3d object detection in lidar point clouds,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • [15] T. Yin, X. Zhou, and P. Krahenbuhl, “Center-based 3d object detection and tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 11 784–11 793.
  • [16] G. Zhang, C. Junnan, G. Gao, J. Li, and X. Hu, “Hednet: A hierarchical encoder-decoder network for 3d object detection in point clouds,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [17] S. Shi, C. Guo, L. Jiang, Z. Wang, J. Shi, X. Wang, and H. Li, “Pv-rcnn: Point-voxel feature set abstraction for 3d object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 10 529–10 538.
  • [18] S. Shi, L. Jiang, J. Deng, Z. Wang, C. Guo, J. Shi, X. Wang, and H. Li, “Pv-rcnn++: Point-voxel feature set abstraction with local vector representation for 3d object detection,” International Journal of Computer Vision, vol. 131, no. 2, pp. 531–551, 2023.
  • [19] A. Paigwar, D. Sierra-Gonzalez, Ö. Erkent, and C. Laugier, “Frustum-pointpillars: A multi-stage approach for 3d object detection using rgb camera and lidar,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 2926–2933.
  • [20] H. Li, C. Sima, J. Dai, W. Wang, L. Lu, H. Wang, J. Zeng, Z. Li, J. Yang, H. Deng, H. Tian, E. Xie, J. Xie, L. Chen, T. Li, Y. Li, Y. Gao, X. Jia, S. Liu, J. Shi, D. Lin, and Y. Qiao, “Delving into the devils of bird’s-eye-view perception: A review, evaluation and recipe,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 4, pp. 2151–2170, 2024.
  • [21] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y. Qiao, and J. Dai, “Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers,” in European conference on computer vision. Springer, 2022, pp. 1–18.
  • [22] T. Li, P. Jia, B. Wang, L. Chen, K. Jiang, J. Yan, and H. Li, “Lanesegnet: Map learning with lane segment perception for autonomous driving,” in The Twelfth International Conference on Learning Representations, 2023.
  • [23] J. Zeng, L. Chen, H. Deng, L. Lu, J. Yan, Y. Qiao, and H. Li, “Distilling focal knowledge from imperfect expert for 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 992–1001.
  • [24] G. P. Meyer, A. Laddha, E. Kee, C. Vallespi-Gonzalez, and C. K. Wellington, “Lasernet: An efficient probabilistic 3d object detector for autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 12 677–12 686.
  • [25] Y. Bai, B. Fei, Y. Liu, T. Ma, Y. Hou, B. Shi, and Y. Li, “Rangeperception: Taming lidar range view for efficient and accurate 3d object detection,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [26] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 652–660.
  • [27] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” Advances in neural information processing systems, vol. 30, 2017.
  • [28] X. Ma, C. Qin, H. You, H. Ran, and Y. Fu, “Rethinking network design and local geometry in point cloud: A simple residual mlp framework,” in International Conference on Learning Representations, 2021.
  • [29] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012.
  • [30] J. Choi, Y. Song, and N. Kwak, “Part-aware data augmentation for 3d object detection in point cloud,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 3391–3397.
  • [31] M. Reuse, M. Simon, and B. Sick, “About the ambiguity of data augmentation for 3d object detection in autonomous driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 979–987.
  • [32] J. S. Hu and S. L. Waslander, “Pattern-aware data augmentation for lidar 3d object detection,” in 2021 IEEE International Intelligent Transportation Systems Conference (ITSC). IEEE, 2021, pp. 2703–2710.
  • [33] C. Wang, C. Ma, M. Zhu, and X. Yang, “Pointaugmenting: Cross-modal augmentation for 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11 794–11 803.
  • [34] C. He, H. Zeng, J. Huang, X.-S. Hua, and L. Zhang, “Structure aware single-stage 3d object detection from point cloud,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11 870–11 879.
  • [35] W. Zheng, W. Tang, S. Chen, L. Jiang, and C.-W. Fu, “Cia-ssd: Confident iou-aware single-stage object detector from point cloud,” in Proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 4, 2021, pp. 3555–3562.
  • [36] H. Wang, Y. Cong, O. Litany, Y. Gao, and L. J. Guibas, “3dioumatch: Leveraging iou prediction for semi-supervised 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14 615–14 624.
  • [37] H. Sheng, S. Cai, N. Zhao, B. Deng, J. Huang, X.-S. Hua, M.-J. Zhao, and G. H. Lee, “Rethinking iou-based optimization for single-stage 3d object detection,” in European Conference on Computer Vision. Springer, 2022, pp. 544–561.
  • [38] J. Li, H. Dai, L. Shao, and Y. Ding, “From voxel to point: Iou-guided 3d object detection for point cloud with voxel-to-point decoder,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 4622–4631.
  • [39] L. Liu, J. Lu, C. Xu, Q. Tian, and J. Zhou, “Deep fitting degree scoring network for monocular 3d object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 1057–1066.
  • [40] Q. Ming, L. Miao, Z. Ma, L. Zhao, Z. Zhou, X. Huang, Y. Chen, and Y. Guo, “Deep dive into gradients: Better optimization for 3d object detection with gradient-corrected iou supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5136–5145.
  • [41] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d shapenets: A deep representation for volumetric shapes,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1912–1920.
  • [42] M. A. Uy, Q.-H. Pham, B.-S. Hua, T. Nguyen, and S.-K. Yeung, “Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1588–1597.
  • [43] T. Zheng, C. Chen, J. Yuan, B. Li, and K. Ren, “Pointcloud saliency maps,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1598–1606.
  • [44] S. Liu, D. Wang, Q. Wang, and K. Huang, “Niv-ssd: Neighbor iou-voting single-stage object detector from point cloud,” arXiv preprint arXiv:2401.12447, 2024.
  • [45] D. Zhou, J. Fang, X. Song, C. Guan, J. Yin, Y. Dai, and R. Yang, “Iou loss for 2d/3d object detection,” in 2019 international conference on 3D vision (3DV). IEEE, 2019, pp. 85–94.
  • [46] Z. Liu, X. Zhao, T. Huang, R. Hu, Y. Zhou, and X. Bai, “Tanet: Robust 3d object detection from point clouds with triple attention,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 07, 2020, pp. 11 677–11 684.
  • [47] S. Shi, Z. Wang, J. Shi, X. Wang, and H. Li, “From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 8, pp. 2647–2664, 2020.
  • [48] C. He, H. Zeng, J. Huang, X.-S. Hua, and L. Zhang, “Structure aware single-stage 3d object detection from point cloud,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 873–11 882.
  • [49] L. Du, X. Ye, X. Tan, J. Feng, Z. Xu, E. Ding, and S. Wen, “Associate-3ddet: Perceptual-to-conceptual association for 3d point cloud object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 13 329–13 338.
  • [50] L. Wang, Z. Song, X. Zhang, C. Wang, G. Zhang, L. Zhu, J. Li, and H. Liu, “Sat-gcn: Self-attention graph convolutional network-based 3d object detection for autonomous driving,” Knowledge-Based Systems, vol. 259, p. 110080, 2023.
  • [51] Q. He, Z. Wang, H. Zeng, Y. Zeng, and Y. Liu, “Svga-net: Sparse voxel-graph attention network for 3d object detection from point clouds,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 1, 2022, pp. 870–878.
  • [52] Y. Chen, S. Liu, X. Shen, and J. Jia, “Fast point r-cnn,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9775–9784.
  • [53] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia, “Std: Sparse-to-dense 3d object detector for point cloud,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1951–1960.
  • [54] Z. Yang, L. Jiang, Y. Sun, B. Schiele, and J. Jia, “A unified query-based paradigm for point cloud understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8541–8551.
  • [55] T. Jiang, N. Song, H. Liu, R. Yin, Y. Gong, and J. Yao, “Vic-net: Voxelization information compensation network for point cloud 3d object detection,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 13 408–13 414.
  • [56] J. Noh, S. Lee, and B. Ham, “Hvpr: Hybrid voxel-point representation for single-stage 3d object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 14 605–14 614.
  • [57] S. Shi, X. Wang, and H. Li, “Pointrcnn: 3D object proposal generation and detection from point cloud,” in CVPR, 2019, pp. 770–779.
  • [58] J. Li, S. Luo, Z. Zhu, H. Dai, A. S. Krylov, Y. Ding, and L. Shao, “3d iou-net: Iou guided 3d object detector for point clouds,” arXiv preprint arXiv:2004.04962, 2020.
  • [59] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in CVPR, 2020.
  • [60] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, V. Vasudevan, W. Han, J. Ngiam, H. Zhao, A. Timofeev, S. Ettinger, M. Krivokon, A. Gao, A. Joshi, Y. Zhang, J. Shlens, Z. Chen, and D. Anguelov, “Scalability in perception for autonomous driving: Waymo open dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.