License: arXiv.org perpetual non-exclusive license
arXiv:2312.13071v1 [cs.CV] 20 Dec 2023

Point Deformable Network with Enhanced Normal Embedding
for Point Cloud Analysis

Xingyilang Yin1, Xi Yang1, Liangchen Liu1, Nannan Wang1, Xinbo Gao2 Corresponding author.
Abstract

Recently, MLP-based methods have shown strong performance in point cloud analysis. Simple MLP architectures can learn geometric features in local point groups yet fail to model long-range dependencies directly. In this paper, we propose Point Deformable Network (PDNet), a concise MLP-based network that captures long-range relations with strong representation ability. Specifically, we put forward the Point Deformable Aggregation Module (PDAM) to improve representation capability in both long-range dependency and adaptive aggregation among points. For each query point, PDAM aggregates information from deformable reference points rather than from points in limited local areas. The deformable reference points are generated in a data-dependent manner, and we initialize them according to the input point positions. Additional offsets and modulation scalars are learned on the whole point features, which shift the deformable reference points to the regions of interest. We also suggest estimating the normal vectors of point clouds and applying Enhanced Normal Embedding (ENE) to the geometric extractors to improve the representation ability of single points. Extensive experiments and ablation studies on various benchmarks demonstrate the effectiveness and superiority of our PDNet.

Introduction

Point cloud analysis has received great interest because 3D data acquisition devices are now widely used in areas such as autonomous driving and robotics. Unlike images, which have regular 2D grids, point clouds are inherently sparse, unordered, and unstructured, so processing them directly is challenging. Recently, MLP-based methods have achieved strong performance with simple components. PointNeXt (Qian et al. 2022) revisits the classical PointNet++ (Qi et al. 2017b) and improves it with modern training and scaling strategies. PointNeXt improves substantially over PointNet++ and even outperforms specially designed convolution-based (Xu et al. 2021a), graph-based (Zhou et al. 2021), and powerful point transformer methods (Zhao et al. 2021; Lai et al. 2022; Wu et al. 2022). This reveals that concise MLP modules can already describe the local geometric properties of point clouds. The subsequent PointMetaBase (Lin et al. 2023) further modifies PointNeXt with explicit position encoding and an MLP before the grouping operation. Although existing MLP-based approaches show strong generalization ability in various tasks, they do not model long-range dependencies. As illustrated in Figure 1(a) and 1(b), previous MLP-based methods (Qi et al. 2017b; Ma et al. 2022; Qian et al. 2022; Lin et al. 2023) only aggregate information within local point groups constructed by kNN or ball query, and therefore fail to learn features from distant regions. However, capturing long-range relations has been shown to be crucial for understanding global shape context (Wang et al. 2018; Lai et al. 2022).

Figure 1: Comparison of grouping methods. (a) kNN: query point (orange point) and its $k$-nearest ($k=3$) neighbors (blue points). (b) Ball query: query point and its neighbors in a local region (red circle). (c) Deformable: query point and its deformable reference points (blue points) located in non-local regions. The green points and purple arrows represent the initial reference points generated on all point positions and their learnable offsets conditioned on the whole point features, respectively.

To solve the aforementioned problem, we need to aggregate information from distant regions for each query point. In the image-processing literature, learning deformable convolution filters has proven effective in various challenging vision tasks thanks to adaptive spatial aggregation over long-range and more informative regions (Dai et al. 2017; Zhu et al. 2019, 2020; Wang et al. 2023). This motivates us to design a deformable mechanism for point clouds. In contrast to images, which have structured 2D grids, point clouds are sparse and unstructured, so deformable mechanisms designed for images cannot be applied to point clouds directly. KPConv (Thomas et al. 2019) first introduced a deformable mechanism to point clouds. It adopts pseudo-grid convolution through predefined kernel points based on local point positions, with weight matrices learned from local point features. The deformable mechanism is further applied to learn offsets constrained to local areas, which helps refine the kernel points.

However, KPConv still focuses on developing sophisticated modules to extract local structures, and long-range dependency is not considered. To this end, we propose a simple and effective MLP-based network named Point Deformable Network (PDNet). Specifically, we first put forward the Point Deformable Aggregation Module (PDAM), which achieves both long-range dependency and adaptive spatial aggregation in a way that suits point clouds. As shown in Figure 1(c), unlike extractors that aggregate information in fixed local regions, our PDAM aggregates information from deformable reference points for each query point. The initial reference points are generated on the positions of all points, and the additional offsets and modulation scalars are learned on the whole point features. The reference points are thus shifted to the relevant regions and bring more informative geometric features for aggregation, which strengthens the representation ability among points. Further, we suggest applying least squares fitting to estimate the normal vectors of a point cloud and using Enhanced Normal Embedding (ENE) to improve the representation ability of single points. Extensive experiments on various challenging benchmarks demonstrate the effectiveness of our methods. Our PDNet outperforms other competitive MLP-based models and achieves state-of-the-art performance.

Related Work

Point-based networks on point clouds.

In contrast to projection-based methods, which project point clouds onto multi-view images (Su et al. 2015; Goyal et al. 2021) or structured 3D voxels (Wu et al. 2015; Maturana and Scherer 2015), point-based methods process unstructured point clouds directly. PointNet (Qi et al. 2017a), the pioneering point-based network, models the permutation invariance of point clouds by using shared MLPs to encode pointwise features and aggregating them with symmetric functions such as max-pooling. To better capture local geometric structures, PointNet++ (Qi et al. 2017b) proposes a hierarchical structure that gradually downsamples with farthest point sampling and aggregates features from neighbor points found by kNN or ball query. Currently, most point-based methods focus on the design of local geometric extractors. Convolution-based approaches (Li et al. 2018; Liu et al. 2019; Thomas et al. 2019; Xu et al. 2021a) propose several invariant and dynamic convolution kernels to aggregate point features. Graph convolution-based methods (Wang, Samari, and Siddiqi 2018; Wang et al. 2019; Zhou et al. 2021) treat points and their relations as vertices and edges of a graph, respectively. Point features can then be extracted by applying graph convolution on the graph. Point Transformers (Zhao et al. 2021; Guo et al. 2021; Lai et al. 2022; Wu et al. 2022) capture local and global information through self-attention. Recently, MLP-based approaches (Ma et al. 2022; Ran, Liu, and Wang 2022; Tang et al. 2022b; Qian et al. 2022; Zhang et al. 2023; Lin et al. 2023) have obtained competitive results with simple network architectures. PointMLP proposes a geometric affine module to enhance the residual MLP network. PointNeXt follows the design philosophy of PointNet++ and integrates improved training and scaling strategies. PointMetaBase revisits the existing methods and proposes a meta-architecture for point cloud analysis. Although these MLP-based networks show high performance in learning local geometry, they leave long-range dependency unexplored. Our PDNet is an MLP-based network that enjoys both long-range dependency and adaptive position aggregation inspired by deformable mechanisms.

Deformable networks on images.

The deformable mechanism was first presented by the Deformable Convolutional Network (DCN) (Dai et al. 2017) to enhance the capability of convolution with additional offsets and adaptive spatial aggregation conditioned on the input data. DCNv2 (Zhu et al. 2019) improves on it by introducing a modulation mechanism. The deformable mechanism has also been applied to ViTs (Zhu et al. 2020; Yue et al. 2021; Xia et al. 2022), where it shows powerful capability in refining visual tokens. Recently, InternImage (Wang et al. 2023) proposes a large-scale CNN-based foundation model built on DCNv3, which benefits from both long-range dependency and adaptive spatial aggregation and outperforms related work. However, deformable mechanisms designed for images do not fit unstructured point clouds. This work aims to develop a deformable mechanism for point clouds that aggregates point features from relevant areas through data-dependent initial positions and learned offsets conditioned on the input points.

Methods

In this section, we first briefly describe the background of MLP-based approaches. Second, we propose the Point Deformable Aggregation Module to achieve both long-range dependency and adaptive spatial aggregation in a data-dependent way. Third, we introduce least squares fitting to estimate the point normal vectors and suggest applying additional normal embedding to strengthen the representation ability of the network. Finally, we present the overall architectures of PDNet for classification and segmentation tasks.

Preliminary

In this subsection, we briefly revisit some point MLP-based approaches such as PointNet++ (Qi et al. 2017b), PointNeXt (Qian et al. 2022), and PointMetaBase (Lin et al. 2023).

PointNet++

captures local geometric features through the set abstraction (SA) module. The SA module consists of a subsampling layer that selects the input points and a neighborhood aggregation module that extracts local patterns. The neighborhood aggregation is formulated as:

$$f_i^{l+1} = \mathcal{A}\left(\left\{\mathcal{M}\left([f_j^l,\ p_j^l - p_i^l]\right),\ \forall j \in \mathcal{N}_i\right\}\right), \tag{1}$$

where $\mathcal{N}_i$ is the index set of neighbors of point $i$. $p_i^l$, $p_j^l$, and $f_j^l$ are the coordinates of point $i$ selected through farthest point sampling, the coordinates of neighbor $j$, and the features of neighbor $j$ in stage $l$ of the network, respectively. $\mathcal{M}$ represents the shared MLPs that encode the concatenation of the point features of neighbor $j$ and the relative coordinates $p_j^l - p_i^l$. $\mathcal{A}$ is a symmetric aggregation function such as max-pooling.
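As a rough illustration of Eq. (1), the following NumPy sketch groups neighbors with kNN, encodes the concatenated features and relative coordinates with a shared layer, and max-pools over each neighborhood. The function names, the single linear layer with ReLU standing in for $\mathcal{M}$, and the use of the first 16 points as query centers in place of farthest point sampling are illustrative assumptions, not the original implementation.

```python
import numpy as np

def knn_indices(points, queries, k):
    """Indices of the k nearest points for every query, shape (M, k)."""
    # Pairwise squared distances between queries (M, 3) and points (N, 3).
    d2 = ((queries[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, :k]

def shared_mlp(x, weight, bias):
    """One linear layer + ReLU applied identically to every neighbor (assumed M)."""
    return np.maximum(x @ weight + bias, 0.0)

def neighborhood_aggregation(points, feats, query_idx, k, weight, bias):
    """Eq. (1) with A = max-pooling over each neighborhood N_i."""
    queries = points[query_idx]                        # (M, 3) centers p_i
    nbr = knn_indices(points, queries, k)              # (M, k) index sets N_i
    rel = points[nbr] - queries[:, None, :]            # (M, k, 3) p_j - p_i
    grouped = np.concatenate([feats[nbr], rel], -1)    # (M, k, C + 3)
    return shared_mlp(grouped, weight, bias).max(axis=1)  # (M, C_out)

rng = np.random.default_rng(0)
points = rng.normal(size=(64, 3))
feats = rng.normal(size=(64, 8))
query_idx = np.arange(16)          # stand-in for farthest point sampling
weight = rng.normal(size=(8 + 3, 16))
bias = np.zeros(16)
out = neighborhood_aggregation(points, feats, query_idx, 4, weight, bias)
```

Because max-pooling is symmetric, reordering the input points leaves the aggregated features unchanged, which is the permutation invariance the SA module relies on.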

PointNeXt

further appends an inverted residual MLP (InvResMLP) block after the SA module to enhance point features:

$$f_i^{l+1} = \mathcal{M}_2\left(\mathcal{A}\left(\left\{\mathcal{M}_1\left([f_j^l,\ p_j^l - p_i^l]\right),\ \forall j \in \mathcal{N}_i\right\}\right)\right) + f_i^l, \tag{2}$$

where PointNeXt uses a one-layer MLP $\mathcal{M}_1$ for neighbor feature aggregation and a 2-layer MLP $\mathcal{M}_2$ for the point feature update. $f_i^l$ is the input point feature in stage $l$.

PointMetaBase

slightly modifies InvResMLP and applies position encoding $\delta$ to the relative coordinates $p_j^l - p_i^l$:

$${f_i^l}' = \mathcal{M}_3(f_i^l), \quad {f_j^l}' = \mathrm{Group}({f_i^l}',\ p_i^l), \tag{3}$$
$$f_i^{l+1} = \mathcal{M}_2\left(\mathcal{A}\left(\left\{{f_j^l}' + \delta(p_j^l - p_i^l),\ \forall j \in \mathcal{N}_i\right\}\right)\right) + f_i^l. \tag{4}$$

Notice that PointNeXt uses the mapping function $\mathcal{M}_1$ (e.g., an MLP) after the grouping layer, while PointMetaBase adopts $\mathcal{M}_3$ before the grouping operation to reduce computation. ${f_i^l}'$ and ${f_j^l}'$ are the updated point features of point $i$ and its neighbor $j$, respectively.
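The ordering in Eqs. (3) and (4) can be sketched as follows. The mapping $\mathcal{M}_3$ runs on the $N$ ungrouped features, so it touches $N$ rows instead of the $N \times k$ rows it would see after grouping. This is a minimal sketch under assumptions: `linear_relu` is a hypothetical bias-free layer standing in for each of $\mathcal{M}_3$, $\delta$, and $\mathcal{M}_2$ (which is a 2-layer MLP in the text), neighbors come from a brute-force kNN, and $\mathcal{A}$ is max-pooling.

```python
import numpy as np

def linear_relu(x, w):
    """Bias-free linear layer followed by ReLU (assumed stand-in for an MLP)."""
    return np.maximum(x @ w, 0.0)

def knn(points, k):
    """Brute-force k nearest neighbors of every point, shape (N, k)."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, :k]

def pointmetabase_block(points, feats, k, w3, w_delta, w2):
    """Eqs. (3)-(4): map features first, then group, add position encoding, pool."""
    nbr = knn(points, k)                      # (N, k) index sets N_i
    f_prime = linear_relu(feats, w3)          # M_3 on N rows, before grouping
    grouped = f_prime[nbr]                    # Group: (N, k, C)
    rel = points[nbr] - points[:, None, :]    # (N, k, 3) p_j - p_i
    pos = linear_relu(rel, w_delta)           # delta(p_j - p_i): (N, k, C)
    pooled = (grouped + pos).max(axis=1)      # A = max-pooling: (N, C)
    return linear_relu(pooled, w2) + feats    # M_2 plus the residual f_i^l

rng = np.random.default_rng(0)
N, C, k = 32, 8, 4
points = rng.normal(size=(N, 3))
feats = rng.normal(size=(N, C))
w3 = rng.normal(size=(C, C))
w_delta = rng.normal(size=(3, C))
w2 = rng.normal(size=(C, C))
new_feats = pointmetabase_block(points, feats, k, w3, w_delta, w2)
```

Setting `w2` to zeros makes the block return `feats` unchanged, which makes the residual connection in Eq. (4) easy to check.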

Figure 2: Illustration of Point Deformable Aggregation Module. Given input $N$ points with coordinates $\{p_i^l\}_{i=1}^{N}$ and features $\{f_i^l\}_{i=1}^{N}$, FPS generates the initial reference points (green points) based on $\{p_i^l\}_{i=1}^{N}$, and OGN determines the $3R$ offsets (blue arrows) learned on $\{f_i^l\}_{i=1}^{N}$. Then, features of adaptive position $f({p_r^l}')$ are computed through grouping and interpolation. Finally, for point $i$, point position $p_i^l$, point feature $f_i^l$, positions $\{{p_r^l}'\}_{r=1}^{R}$ (blue points) and features $\{f({p_r^l}')\}_{r=1}^{R}$ of deformable reference points participate together in forming updated point features $f_i^{l+1}$.
Figure 3: Illustration of Point Deformable Network (PDNet) and macro-design of PDNet-L. For classification (bottom left), we use consecutive PDSA blocks, which incorporate the Set Abstraction module (Qi et al. 2017b) with position encoding and normal embedding. For segmentation (top), we adopt a U-Net-style architecture with Feature Propagation (Qi et al. 2017b) as the decoder and PDSA, PLAM, and PDAM as the encoder.

Point Deformable Aggregation Module

As discussed in Section 3.1, previous MLP-based approaches capture geometric features through local point groups. They aggregate features in local areas, such as a fixed number of neighboring points or the points within a small radius, and therefore fail to model long-range dependency directly. To address this, we propose the Point Deformable Aggregation Module (PDAM), which captures long-range relations and achieves adaptive spatial aggregation at the same time in a data-dependent way. Given an input image $x \in \mathbb{R}^{C \times H \times W}$, the deformable mechanism proposed in DCNv2 (Zhu et al. 2019) can be described as:

$$y(p) = \sum_{k=1}^{K} w_k\, m_k\, x(p + p_k + \Delta p_k), \tag{5}$$

where $y(p)$ and $x(p)$ denote the output and input feature maps at location $p$. $K$ represents the number of sampling locations, and $w_k$ and $p_k$ are the projection weight and the predefined offset for the $k$-th location, respectively. For example, for a convolution with a $3 \times 3$ kernel and dilation 1, $K = 9$ and $p_k \in \{(-1,-1), (-1,0), \ldots, (1,1)\}$. $\Delta p_k$ is the learnable offset, predicted as $2K$ channels conditioned on the input feature $x$. $m_k$ is the modulation scalar, predicted as $K$ channels through a convolutional layer and sigmoid activation over the same input. However, applying the deformable mechanism designed for images to points is non-trivial. In contrast to images, which have structured 2D grids, point clouds are unstructured data that are unevenly distributed in space. A predefined sampling grid like $p_k$ therefore does not suit point clouds.
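To make Eq. (5) concrete, the sketch below evaluates it for a single channel at a single output location, using bilinear interpolation to read the input at fractional positions. The helper names and the zero padding outside the map are illustrative assumptions. In DCNv2 the offsets and modulation scalars are predicted by convolutional layers, whereas here they are simply passed in as arrays.

```python
import numpy as np

def bilinear(x, r, c):
    """Bilinearly sample a 2D map x at fractional (row, col); zero outside."""
    h, w = x.shape
    r0, c0 = int(np.floor(r)), int(np.floor(c))
    value = 0.0
    for rr in (r0, r0 + 1):
        for cc in (c0, c0 + 1):
            if 0 <= rr < h and 0 <= cc < w:
                value += (1 - abs(r - rr)) * (1 - abs(c - cc)) * x[rr, cc]
    return value

def modulated_deform_at(x, p, w, m, dp):
    """Eq. (5) for one channel and one output location p = (row, col)."""
    # Predefined 3x3 grid offsets p_k with dilation 1, so K = 9.
    grid = [(i, j) for i in (-1, 0, 1) for j in (-1, 0, 1)]
    return sum(
        w[k] * m[k] * bilinear(x, p[0] + pk[0] + dp[k][0], p[1] + pk[1] + dp[k][1])
        for k, pk in enumerate(grid)
    )

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones(9)            # projection weights w_k
m = np.ones(9)            # modulation scalars m_k
dp = np.zeros((9, 2))     # learnable offsets, here all zero
y_plain = modulated_deform_at(x, (2, 2), w, m, dp)
```

With zero offsets and unit modulation the expression reduces to an ordinary $3 \times 3$ convolution at that location. Non-zero `dp` moves each sampling location independently, which is exactly what a fixed grid cannot provide on an irregular point set.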

Inspired by the deformable mechanism of (Dai et al. 2017; Zhu et al. 2019), we propose PDAM to offer point clouds adaptive position aggregation and long-range relations via deformable reference points, as illustrated in Figure 2. Specifically, let the input point cloud with $N$ points at the $l$-th stage be $\{p_i^l, f_i^l\}_{i=1}^{N}$, where $p_i^l \in \mathbb{R}^{1 \times 3}$ and $f_i^l \in \mathbb{R}^{1 \times C}$ are the coordinate and feature of point $i$, respectively. For every point $i$, we first initialize $R$ reference points through farthest point sampling (FPS) over all point positions, which copes with the irregularity of the point cloud and makes the initial reference points $\{p_r^l\}_{r=1}^{R}$ uniformly distributed. This process can be described as:

$$\{p_r^l\}_{r=1}^{R} = \mathrm{FPS}\left(\{p_i^l\}_{i=1}^{N}\right). \tag{6}$$
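A minimal sketch of the FPS step in Eq. (6) is given below, using the standard greedy formulation. The function name, the `start` argument, and the toy input are illustrative assumptions; practical implementations usually run batched on the GPU.

```python
import numpy as np

def farthest_point_sampling(points, num_samples, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    n = points.shape[0]
    chosen = np.empty(num_samples, dtype=np.int64)
    chosen[0] = start
    # Squared distance from every point to its nearest chosen point so far.
    min_d2 = np.full(n, np.inf)
    for r in range(1, num_samples):
        last = points[chosen[r - 1]]
        min_d2 = np.minimum(min_d2, ((points - last) ** 2).sum(axis=1))
        chosen[r] = int(np.argmax(min_d2))
    return chosen

# Five points on a line; R = 3 initial reference points are requested.
line = np.array([[0.0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0], [10, 0, 0]])
ref_idx = farthest_point_sampling(line, 3)
ref_points = line[ref_idx]   # initial reference points p_r
```

Each new index maximizes the distance to the points already selected, so the samples spread over the whole cloud rather than clustering in dense regions.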

Different from KPConv (Thomas et al. 2019), which computes and refines kernel points within a local sphere, our initial reference points are based on the positions of all points and thus enjoy a larger receptive field. Then, to obtain the offset for each reference point, we feed the point features $\{f_i^l\}_{i=1}^{N}$, $f_i^l \in \mathbb{R}^{1 \times C}$, of all points to the offset generation network (OGN) to output the offsets $\{\Delta p_r^l\}_{r=1}^{R}$, $\Delta p_r^l \in \mathbb{R}^{1 \times 3}$, as follows:

$$\{\Delta p_r^l\}_{r=1}^{R} = \mathrm{OGN}\left(\{f_i^l\}_{i=1}^{N}\right), \tag{7}$$

where OGN is implemented as two linear layers with learnable weight matrices $W_1 \in \mathbb{R}^{N \times N}$ and $W_2 \in \mathbb{R}^{N \times 3R}$ that produce the $3R$ offsets, so the offsets can shift the reference points to any reasonable position. This brings the query point more relevant information from global contexts learned on the whole point features. However, since point clouds are discrete, there may be no point exactly at the shifted positions $\{{p_r^l}'\}_{r=1}^{R}$, ${p_r^l}' = p_r^l + \Delta p_r^l$. To handle this, we adopt an inverse-distance-weighted average based on $K$ nearest neighbors (KNN) and interpolate the features near position ${p_r^l}'$ to obtain the features of the deformable reference point in the local region of ${p_r^l}'$ as follows:

$$\{p_{k}^{l}\}_{k=1}^{K}=\mathrm{KNN}({p_{r}^{l}}'), \tag{8}$$
$$f^{(j)}({p_{r}^{l}}')=\frac{\sum_{k=1}^{K}w_{k}({p_{r}^{l}}')f_{k}^{(j)}}{\sum_{k=1}^{K}w_{k}({p_{r}^{l}}')},\quad j=1,\ldots,C, \tag{9}$$

where $\{p_{k}^{l}\}_{k=1}^{K}$ are the $K$ nearest neighbor points of position ${p_{r}^{l}}'$, $w_{k}({p_{r}^{l}}')=\frac{1}{d({p_{r}^{l}}',p_{k}^{l})}$ is the interpolation weight, and $d(\cdot,\cdot)$ computes the distance between two points. Thus the deformable reference points carry the information $f({p_{r}^{l}}')$ of relevant regions, which is involved in the aggregation with the query point $i$.
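As an illustration of equations 8 and 9, the following is a minimal NumPy sketch of the inverse-distance-weighted KNN interpolation. The function name `idw_interpolate`, the brute-force neighbor search, and the small constant `eps` (added to avoid division by zero when a shifted position coincides with an existing point) are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def idw_interpolate(query_pos, points, feats, k=3, eps=1e-8):
    """Interpolate features at arbitrary positions (sketch of Eq. 8-9).

    query_pos: (R, 3) shifted reference positions p'_r
    points:    (N, 3) coordinates of the existing points
    feats:     (N, C) features of the existing points
    returns:   (R, C) interpolated features f(p'_r)
    """
    # Pairwise Euclidean distances d(p'_r, p_n), shape (R, N).
    dist = np.linalg.norm(query_pos[:, None, :] - points[None, :, :], axis=-1)
    # Eq. 8: indices of the K nearest existing points for each position.
    knn_idx = np.argsort(dist, axis=1)[:, :k]
    knn_dist = np.take_along_axis(dist, knn_idx, axis=1)
    # w_k = 1 / d; eps is an assumption that guards against d = 0.
    w = 1.0 / (knn_dist + eps)
    w = w / w.sum(axis=1, keepdims=True)
    # Eq. 9: normalized weighted average, applied to every channel j.
    return np.einsum("rk,rkc->rc", w, feats[knn_idx])
```

A position that lies exactly halfway between two points receives the mean of their features, and a position that coincides with an existing point receives (almost exactly) that point's features.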

Further, we use two linear layers followed by a sigmoid layer to obtain $R$ channels of modulation scalars $\{\Delta m_{r}^{l}\}_{r=1}^{R}$ for the deformable reference point features. Finally, the aggregation procedure for the query point can be defined as:

$$f_{i}^{l+1}=\mathcal{M}(\mathcal{A}(\{\Delta m_{r}^{l}f({p_{r}^{l}}')+\delta({p_{r}^{l}}'-p_{i}^{l})\}_{r=1}^{R}))+f_{i}^{l}, \tag{10}$$

where $\delta({p_{r}^{l}}'-p_{i}^{l})$ is the relative position embedding of point $i$ and its deformable reference point ${p_{r}^{l}}'$, $\mathcal{A}$ is the symmetric aggregation function (max-pooling), and $\mathcal{M}$ is a mapping function such as an MLP.
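The aggregation in equation 10 can be sketched for a single query point as follows. This is only an illustrative reading of the formula: `pdam_aggregate` is a hypothetical name, and the callables `delta` and `mlp` stand in for the learned position embedding $\delta$ and mapping function $\mathcal{M}$, whose exact layer configurations are not specified here.

```python
import numpy as np

def pdam_aggregate(f_i, p_i, ref_pos, ref_feats, mod, delta, mlp):
    """Sketch of Eq. 10 for one query point.

    f_i:       (C,)   feature of the query point
    p_i:       (3,)   position of the query point
    ref_pos:   (R, 3) shifted reference positions p'_r
    ref_feats: (R, C) interpolated features f(p'_r)
    mod:       (R,)   modulation scalars, assumed to be sigmoid outputs
    delta:     callable mapping (R, 3) offsets to (R, C) embeddings
    mlp:       callable mapping a (C,) vector to a (C,) vector
    """
    # Relative position embedding delta(p'_r - p_i).
    rel = delta(ref_pos - p_i)
    # Modulate each reference feature, then add the position embedding.
    grouped = mod[:, None] * ref_feats + rel
    # Symmetric aggregation A: max-pooling over the R reference points.
    pooled = grouped.max(axis=0)
    # Mapping M followed by the residual connection.
    return mlp(pooled) + f_i
```

Because max-pooling is permutation invariant, the result does not depend on the order of the reference points.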

Enhanced Normal Embedding

PDAM aggregates long-range contexts from regions of interest for each query point, strengthening the representation capability among points. In this subsection, we further propose Enhanced Normal Embedding to improve the representation ability of each point itself. Normal features provide geometric information about point clouds. Using additional point normals rather than only consuming point coordinates in the network has been proven effective in various works (Qi et al. 2017a, b; Li, Chen, and Lee 2018; Wu, Qi, and Fuxin 2019). However, this is not possible when the dataset provides no point normals. Inspired by (Mitra and Nguyen 2003), we adopt least-squares fitting to estimate the normal vectors of a point cloud. Considering point $i$ and its $k-1$ nearest neighbor points $\mathcal{N}_{i}$, the covariance matrix $M$ can be computed as:

$$M=\frac{1}{k}\sum_{i=1}^{k}(p_{i}-\bar{p})(p_{i}-\bar{p})^{T}, \tag{11}$$

where $p_{i}\in\mathbb{R}^{1\times 3}$ is the coordinate of point $i$ in the $l$-th stage, and $\bar{p}=\frac{1}{k}\sum_{i=1}^{k}p_{i}$ denotes the centroid of point $i$ and its neighbors. Thus $M$ is a $3\times 3$ symmetric positive semidefinite matrix. The normal to the local least-squares plane for point $i$ can be estimated as the eigenvector corresponding to the minimum eigenvalue of $M$ (Mitra and Nguyen 2003). We utilize singular value decomposition to obtain the normal feature $n_{i}$ for each point $i$.
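A minimal NumPy sketch of this normal estimation is given below. The function name `estimate_normals`, the default `k=16`, and the brute-force $O(N^{2})$ neighbor search are assumptions for illustration only. Note that the sign of each estimated normal is ambiguous, since both $n_{i}$ and $-n_{i}$ are valid singular vectors; the sketch does not attempt to orient them.

```python
import numpy as np

def estimate_normals(points, k=16):
    """Estimate one unit normal per point by least-squares plane fitting.

    points:  (N, 3) coordinates
    returns: (N, 3) unit normals (sign is arbitrary)
    """
    n = points.shape[0]
    # Brute-force pairwise distances; fine for small N in a sketch.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Each point plus its k-1 nearest neighbors (the point itself has distance 0).
    idx = np.argsort(dist, axis=1)[:, :k]
    normals = np.empty((n, 3))
    for i in range(n):
        nbrs = points[idx[i]]                 # (k, 3)
        centered = nbrs - nbrs.mean(axis=0)   # subtract the centroid
        cov = centered.T @ centered / k       # 3x3 covariance matrix (Eq. 11)
        # Singular values are returned in descending order, so the last row
        # of vt belongs to the smallest one; for a symmetric PSD matrix this
        # is the eigenvector of the minimum eigenvalue.
        _, _, vt = np.linalg.svd(cov)
        normals[i] = vt[-1]
    return normals
```

For points sampled from the plane $z=0$, every estimated normal is $(0,0,\pm 1)$.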

Like the position encoding widely used in attention-based and MLP-based methods (Lai et al. 2022; Wu et al. 2022; Lin et al. 2023) to learn complex positional relations among point groups or all points, we propose to strengthen the point normal via Enhanced Normal Embedding (ENE). According to (Yang et al. 2020; Ran, Liu, and Wang 2022), point position and point normal are features with different distributions, which can be decoupled along the channel dimension and fused through summation after embedding. In this paper, we implement ENE with a 2-layer MLP.

Point Deformable Network Architectures

As illustrated in Figure 3, we propose Point Deformable Network (PDNet), which shares a similar hierarchical structure with (Lin et al. 2023; Qian et al. 2022) and incorporates the Point Deformable Aggregation Module (PDAM) and Enhanced Normal Embedding (ENE). For the segmentation task, we use a U-Net architecture, which contains an encoder and a decoder. For the classification task, we only use an encoder. The decoder comprises the widely used Feature Propagation layers (Qi et al. 2017b; Qian et al. 2022; Lin et al. 2023) to gradually upsample features via interpolation. Incorporating position embedding and ENE, we tweak the Set Abstraction module (Qi et al. 2017b; Qian et al. 2022) as the reduction block, termed Point Deformable Set Abstraction (PDSA). The encoder is composed of PDSA, the Point Local Aggregation Module (PLAM), and PDAM. In the first and second stages of PDNet, we implement PLAM by modifying the PointMetaBase block (defined in equation 4) with additional ENE as follows:

$$f_{i}^{l+1}=\mathcal{M}_{2}(\mathcal{A}(\{{f_{j}^{l}}'+\delta(p_{j}^{l}-p_{i}^{l})+\gamma(n_{j}^{l}),\ \forall j\in\mathcal{N}_{i}\}))+f_{i}^{l}, \tag{12}$$

where $n_{j}^{l}$ denotes the normals of the points in the neighbor group of point $i$ at the $l$-th stage, and $\gamma$ is the ENE, implemented with an MLP that maps the three input dimensions of the point normal vector to the dimension of the high-level features. Incorporating ENE, the PDAM defined in equation 10 can be further modified as:

$$f_{i}^{l+1}=\mathcal{M}(\mathcal{A}(\{\Delta m_{r}^{l}f({p_{r}^{l}}')+\delta({p_{r}^{l}}'-p_{i}^{l})+\gamma(n_{r}^{l})\}_{r=1}^{R}))+f_{i}^{l}. \tag{13}$$
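To make the role of ENE concrete, the sketch below implements $\gamma$ as a 2-layer MLP that lifts 3-dimensional normals to $C$ channels and fuses it by summation inside a PLAM-style block (equation 12); equation 13 adds the same $\gamma$ term to the PDAM aggregation. The helper names `make_mlp` and `plam`, the hidden width, the ReLU activation, and the random untrained weights without biases or normalization are all assumptions for illustration.

```python
import numpy as np

def make_mlp(d_in, d_hidden, d_out, seed=0):
    """Return a 2-layer MLP (Linear, ReLU, Linear) with random weights.

    The weights are untrained placeholders; a real model would learn them.
    """
    rng = np.random.default_rng(seed)
    w1 = rng.standard_normal((d_in, d_hidden)) * 0.1
    w2 = rng.standard_normal((d_hidden, d_out)) * 0.1
    return lambda x: np.maximum(x @ w1, 0.0) @ w2

def plam(f_i, p_i, nbr_feats, nbr_pos, nbr_normals, delta, gamma, mlp2):
    """Sketch of Eq. 12 for one query point.

    f_i:         (C,)   feature of the query point
    p_i:         (3,)   position of the query point
    nbr_feats:   (k, C) neighbor features f'_j
    nbr_pos:     (k, 3) neighbor positions p_j
    nbr_normals: (k, 3) neighbor normals n_j
    """
    # Position embedding and normal embedding are fused by summation.
    grouped = nbr_feats + delta(nbr_pos - p_i) + gamma(nbr_normals)
    # Max-pooling over the neighbor group, mapping M_2, residual connection.
    return mlp2(grouped.max(axis=0)) + f_i
```

Keeping position and normal in separate embeddings, rather than concatenating them into one 6-dimensional input, follows the decoupling along the channel dimension described above.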

We introduce parallel PLAM and PDAM in the third and fourth stages of PDNet. The point features are fed into PLAM to aggregate information locally (see equation 12) and passed through PDAM (shown in equation 13) to aggregate information globally at the same time. This design of MLP-based blocks with local and long-range dependencies helps our network learn strong generalization ability.

For a fair comparison, we adopt the same scaling strategies as (Lin et al. 2023; Qian et al. 2022) to construct our PDNet. We set the number of deformable reference points to $R=32$ to be consistent with the number of points in their local point groups. The configurations of the three variants of PDNet are as follows:

  • PDNet-S: C = 32, B = 0

  • PDNet-L: C = 32, B = (2, 4, 2, 2)

  • PDNet-XXL: C = 64, B = (4, 8, 4, 4)

We denote C as the channel size of the stem MLP and B as the number of blocks in a stage. Notice that B = 0 means only one PDSA block but no PLAM or PDAM blocks are used at each stage.
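The three variants can be written down as a small configuration table. The sketch below is only a convenient restatement: the class name `PDNetConfig`, the field names, and the helper `total_blocks` are hypothetical, and expanding B = 0 to four stages is an assumption based on the four-element tuples of the other variants.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class PDNetConfig:
    stem_channels: int            # C: channel size of the stem MLP
    blocks: Tuple[int, ...]       # B: number of blocks in each stage
    num_ref_points: int = 32      # R: deformable reference points

PDNET_VARIANTS = {
    # B = 0: only one PDSA block per stage, no PLAM or PDAM blocks.
    "PDNet-S": PDNetConfig(stem_channels=32, blocks=(0, 0, 0, 0)),
    "PDNet-L": PDNetConfig(stem_channels=32, blocks=(2, 4, 2, 2)),
    "PDNet-XXL": PDNetConfig(stem_channels=64, blocks=(4, 8, 4, 4)),
}

def total_blocks(name: str) -> int:
    """Sum of B over all stages for the named variant."""
    return sum(PDNET_VARIANTS[name].blocks)
```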

Method (time order) mAcc (%) OA (%)
PointNet (Qi et al. 2017a) 63.4 68.2
PointNet++ (Qi et al. 2017b) 75.4 77.9
PointCNN (Li et al. 2018) 75.1 78.5
DGCNN (Wang et al. 2019) 73.6 78.1
PRA-Net (Cheng et al. 2021) 77.9 81.0
PointMLP (Ma et al. 2022) 84.4 85.7
PointNeXt (Qian et al. 2022) 86.8 88.2
GAM (Hu et al. 2023) 86.5 88.4
Point-PN (Zhang et al. 2023) - 87.1
PointMetaBase (Lin et al. 2023) 86.8 88.2
PDNet (ours) 86.8 88.5
Table 1: Shape classification results on PB_T50_RS of ScanObjectNN. mAcc is the mean of class accuracy (%) and OA is the overall accuracy (%).

Experiments

In this section, we evaluate our PDNet on ScanObjectNN (Uy et al. 2019) for shape classification, S3DIS (Armeni et al. 2016) for semantic segmentation, and ShapeNetPart (Yi et al. 2016) for part segmentation. We also provide various ablation studies to better understand the PDNet.

Classification and Segmentation

Experimental setups.

We train our models using cross-entropy loss with label smoothing (Szegedy et al. 2016), the AdamW optimizer (Loshchilov and Hutter 2018), an initial learning rate lr = 0.001, and weight decay $10^{-4}$ with cosine decay for all tasks. For the S3DIS semantic segmentation task, point clouds are downsampled with a voxel size of 0.4 m following previous methods (Zhao et al. 2021; Qian et al. 2021, 2022; Lin et al. 2023). For S3DIS, our PDNet is trained using a fixed number of 24000 points per batch, with a batch size of 8 and an initial lr = 0.01, for 100 epochs on an NVIDIA 3090 GPU and a 12-core Intel Xeon @ 2.50GHz CPU. For the ScanObjectNN shape classification task, following (Qian et al. 2021; Lin et al. 2023), our PDNet is trained with 1024 points and a weight decay of 0.05 for 250 epochs on an NVIDIA 3090 GPU. The points are randomly sampled during training and uniformly sampled during testing. For ShapeNetPart, we train our model using 2048 randomly sampled points with normals for 300 epochs on 4 NVIDIA 3090 GPUs. ShapeNetPart provides the normal vectors of its point clouds, so we do not apply least-squares fitting to estimate the point normals; the original point normals are used for normal embedding. All the details of data augmentation are the same as those in PointNeXt (Qian et al. 2022) and PointMetaBase (Lin et al. 2023).
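For reference, a cosine decay schedule of the kind mentioned above can be sketched in a few lines. The function name `cosine_lr`, the per-epoch granularity, the minimum learning rate of 0, and the absence of warmup are assumptions, since the setup does not state them.

```python
import math

def cosine_lr(epoch, total_epochs, base_lr=1e-3, min_lr=0.0):
    """Learning rate at a given epoch under cosine decay.

    Decays smoothly from base_lr at epoch 0 to min_lr at total_epochs.
    min_lr = 0 and the lack of warmup are assumptions of this sketch.
    """
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With an initial learning rate of 0.001 and 250 epochs, the rate is halved at epoch 125 and reaches the minimum at the final epoch.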

S3DIS 6-Fold S3DIS Area-5
Method (time order) OA (%) mAcc (%) mIoU (%) OA (%) mAcc (%) mIoU (%)
PointNet (Qi et al. 2017a) 78.5 66.2 47.6 - 49.0 41.1
PointCNN (Li et al. 2018) 88.1 75.6 65.4 85.9 63.9 57.3
DGCNN (Wang et al. 2019) 84.1 - 56.1 83.6 - 47.9
KPConv (Thomas et al. 2019) - 79.1 70.6 - 72.8 67.1
PCT (Guo et al. 2021) - 67.7 61.3 - - -
PAConv (Xu et al. 2021a) - - - - 73.0 66.6
AdaptConv (Zhou et al. 2021) - - - 90.0 73.2 67.9
Point Transformer (Zhao et al. 2021) 90.2 81.9 73.5 90.8 76.5 70.4
ASSANet (Qian et al. 2021) - - - - - 66.8
CBL (Tang et al. 2022a) 89.6 79.4 73.1 90.6 75.2 69.4
StratifiedFormer (Lai et al. 2022) - - - 91.5 78.1 72.0
Point TransformerV2 (Wu et al. 2022) - - - 91.1 77.9 71.6
GAM (Hu et al. 2023) 90.6 83.2 74.4 - - -
PointNet++ (Qi et al. 2017b) 81.0 67.1 54.5 83.0 - 53.5
PointNeXt-L (Qian et al. 2022) 89.8 82.2 73.9 90.0±0.1 - 69.0±0.5
PointNeXt-XL (Qian et al. 2022) 90.3 83.0 74.9 90.6±0.1 - 70.5±0.3
PointMetaBase-L (Lin et al. 2023) 90.6 - 75.6 90.5±0.1 - 69.5±0.3
PointMetaBase-XXL (Lin et al. 2023) 91.3 - 77.0 90.8±0.6 - 71.3±0.7
PDNet-L (ours) 91.4 85.5 76.7 90.7 77.1 70.8
PDNet-XXL (ours) 91.9 86.2 78.3 91.3 78.1 72.3
Table 2: Semantic segmentation results on S3DIS (6-Fold and Area 5). OA is the overall accuracy (%), mAcc is the mean of class accuracy (%), and mIoU is the mean of instance IoU (%).
Figure 4: Visual comparison between MLP-based networks, PointMetaBase and our PDNet.

Shape Classification.

We first conduct experiments on the real-world shape classification dataset ScanObjectNN (Uy et al. 2019). ScanObjectNN contains approximately 15,000 objects, which have 2902 unique instances categorized into 15 classes. We choose the hardest perturbed variant (PB_T50_RS) and report the overall accuracy (OA) and the mean of class accuracy (mAcc). As shown in Table 1, our PDNet achieves the highest OA of 88.5% and matches the best mAcc of 86.8% among the compared methods. This suggests that point normals provide useful geometric information and that applying ENE helps improve the representation ability of the model.
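The two classification metrics differ only in how they average: OA weights every sample equally, while mAcc weights every class equally. The sketch below illustrates this with NumPy; the function name `oa_and_macc` and the choice to skip classes that have no ground-truth samples are assumptions of this example.

```python
import numpy as np

def oa_and_macc(y_true, y_pred, num_classes):
    """Return (overall accuracy, mean of class accuracy) as fractions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # OA: fraction of all samples that are classified correctly.
    oa = float((y_true == y_pred).mean())
    # mAcc: accuracy computed per class, then averaged over classes.
    per_class = [
        (y_pred[y_true == c] == c).mean()
        for c in range(num_classes)
        if (y_true == c).any()   # skip classes absent from the labels
    ]
    return oa, float(np.mean(per_class))
```

On an imbalanced test set a model can have a high OA and a lower mAcc when it performs poorly on rare classes, which is why both numbers are reported.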

Semantic Segmentation.

We also validate our PDNet on the widely used Stanford Large-Scale 3D Indoor Spaces (S3DIS) dataset (Armeni et al. 2016) for the semantic segmentation task. S3DIS is a challenging benchmark that contains 271 rooms with 13 semantic categories in 6 areas. We report the OA, the mAcc, and the mean of instance IoU (mIoU) for standard 6-fold cross-validation and Area-5 on S3DIS. As shown in Table 2, our PDNet-XXL achieves the best results on S3DIS 6-Fold with an OA of 91.9%, mAcc of 86.2%, and mIoU of 78.3%. On S3DIS Area-5 it obtains the best mIoU of 72.3%, with an mAcc of 78.1% that matches StratifiedFormer and an OA of 91.3% that is slightly below StratifiedFormer (91.5%). Notably, the higher mIoU than powerful point Transformer architectures (StratifiedFormer and Point TransformerV2) shows the potential of MLP-based methods in point cloud analysis. Compared with recent MLP-based networks (PointNeXt-L and PointMetaBase-L), our PDNet-L gains +2.8% and +1.1% in mIoU on S3DIS 6-Fold, respectively. Consistent progress is obtained when scaling up the models. This demonstrates the importance of long-range dependency in point cloud semantic segmentation and the effectiveness of our method in aggregating information from deformable reference points conditioned on the input points. We also provide a visualization of semantic segmentation results in Figure 4, which illustrates the advantage of our approach. Due to the direct modeling of long-range dependency, our method can recognize the objects in red circles while others fail.
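As a reminder of how an IoU-based score is obtained from point-wise predictions, the following sketch computes a mean IoU from a confusion matrix. It assumes that IoU is computed per semantic class and then averaged, which is the common convention for S3DIS; the function name `mean_iou` and the decision to ignore classes that appear in neither labels nor predictions are assumptions of this example, not details taken from the evaluation code of the paper.

```python
import numpy as np

def mean_iou(y_true, y_pred, num_classes):
    """Mean of per-class IoU = TP / (TP + FP + FN), as a fraction."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # conf[t, p] counts points with ground truth t that are predicted as p.
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (y_true, y_pred), 1)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    denom = tp + fp + fn
    valid = denom > 0            # ignore classes that never occur
    return float((tp[valid] / denom[valid]).mean())
```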

Figure 5: Visualization results on ShapeNetPart.
Method (time order) c. mIoU i. mIoU
PointNet (Qi et al. 2017a) 80.4 83.7
PointNet++ (Qi et al. 2017b) 81.9 85.1
SO-Net (Li, Chen, and Lee 2018) - 84.6
PointCNN (Li et al. 2018) 84.6 86.1
DGCNN (Wang et al. 2019) 82.3 85.1
KPConv (Thomas et al. 2019) 85.1 86.4
PCT (Guo et al. 2021) - 86.4
PAConv (Xu et al. 2021a) 84.6 86.1
AdaptConv (Zhou et al. 2021) 83.4 86.4
GDANet (Xu et al. 2021b) 85.0 86.5
Point Trans. (Zhao et al. 2021) 83.7 86.6
PointMLP (Ma et al. 2022) 84.6 86.1
PointNeXt (Qian et al. 2022) 85.2±0.1 87.0±0.1
GAM (Hu et al. 2023) - 87.0
Point-PN (Zhang et al. 2023) - 86.6
PointMetaBase (Lin et al. 2023) 85.1±0.3 87.1±0.0
PDNet (ours) 85.4 87.2
Table 3: Part segmentation results on ShapeNetPart.

Part Segmentation.

ShapeNetPart (Yi et al. 2016) is an object-level dataset for part segmentation. It contains 16,880 models with 16 different shape categories. Each category has 2-6 parts, with 50 part labels in total. We evaluate the performance with the mean of class IoU (c. mIoU) and the mean of instance IoU (i. mIoU) in Table 3. PDNet also achieves the best performance of 85.4% in c. mIoU and 87.2% in i. mIoU. Visualizations of part segmentation results are presented in Figure 5.
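The difference between the two part segmentation metrics can be shown with a short sketch. It assumes that an IoU has already been computed for every shape; the function name `part_seg_miou` and the toy category names in the usage note are illustrative only.

```python
from collections import defaultdict

def part_seg_miou(shape_ious, shape_categories):
    """Return (class mIoU, instance mIoU) from per-shape IoU values.

    shape_ious:       one IoU value per shape
    shape_categories: the category label of each shape
    """
    # Instance mIoU: plain average over all shapes.
    inst_miou = sum(shape_ious) / len(shape_ious)
    # Class mIoU: average within each category, then across categories.
    per_cat = defaultdict(list)
    for iou, cat in zip(shape_ious, shape_categories):
        per_cat[cat].append(iou)
    cat_means = [sum(v) / len(v) for v in per_cat.values()]
    cls_miou = sum(cat_means) / len(cat_means)
    return cls_miou, inst_miou
```

For example, with three chairs (IoU 0.9, 0.8, 0.7) and one lamp (IoU 0.5), the instance average is dominated by the chairs, while the class average gives the lamp category the same weight as the chair category.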

PLAM PDAM NE mIoU Params FLOPs
69.5±0.3 2.7 2.0
70.1±0.2 3.6 2.1
69.7±0.3 2.7 2.0
70.4±0.3 4.9 2.4
70.6±0.2 4.9 2.4
Table 4: Evaluation of proposed components on S3DIS Area-5. mIoU is the mean of IoU (%).

Ablation Studies

Effectiveness of Proposed Components.

We evaluate the proposed components of PDNet-L in Table 4, reporting mean±std over three random runs. The PLAM and PDAM columns of Table 4 indicate whether each module is used in the third and fourth stages of PDNet-L. The results show that using PDAM to aggregate information from the deformable reference regions is better than adopting PLAM to aggregate information within local point groups. This suggests that the network prefers to learn global relations in the deep stages. Using both PLAM and PDAM yields a better result than using only PLAM or only PDAM. Our PDNet-L with all the proposed components achieves the best performance of 70.6±0.2% in mIoU.

Stage2 Stage3 Stage4 mIoU (%)
69.7
70.1
70.8
69.9
Table 5: Ablation study on applying PDAM in different stages on S3DIS Area-5.
method mIoU (%)
random 70.1
center 70.6
FPS 70.8
successive 70.4
parallel 70.8
Table 6: Ablation study on different types of initial reference points and ablation study on combining strategy of PDAM and PLAM on S3DIS Area-5.

PDAM.

We first explore adopting the Point Deformable Aggregation Module (PDAM) at different stages. As shown in Table 5, adopting PDAM only in the last stage improves mIoU by 0.4%, and applying it in the last two stages leads to the best performance of 70.8% in mIoU. However, using PDAM at an early stage decreases mIoU, which indicates that our network performs better when adopting the Point Local Aggregation Module (PLAM) in the early stages to capture local geometries and PDAM in the deeper stages to model long-range dependencies.

We also investigate several initialization methods for the reference points on S3DIS Area-5. The results in Table 6 suggest that using FPS to acquire the initial reference points based on the input point positions is superior to random initialization. Moreover, using the center of the input points as the initial position for all the reference points performs worse than initializing each reference point independently so that they are uniformly distributed in space. We further conduct an ablation study on the strategy for combining PDAM and PLAM in Table 6. For 3D point clouds, applying PLAM and PDAM in parallel to aggregate information from local and distant regions at the same time is better than the successive design widely used for 2D images (Chu et al. 2021; Yang et al. 2022; Xia et al. 2022), which adopts PDAM to model long-range relations after capturing local features with PLAM.
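For readers unfamiliar with FPS (farthest point sampling), the following is a minimal NumPy sketch of the standard greedy algorithm that could be used to pick initial reference points spread over the input. The function name `farthest_point_sampling` and the deterministic start index are assumptions of this example; implementations often start from a random point.

```python
import numpy as np

def farthest_point_sampling(points, num_samples, start=0):
    """Greedy farthest point sampling.

    points:  (N, 3) coordinates
    returns: (num_samples,) indices of the selected points
    """
    n = points.shape[0]
    selected = np.empty(num_samples, dtype=np.int64)
    # Squared distance from every point to its nearest selected point.
    min_dist = np.full(n, np.inf)
    current = start
    for s in range(num_samples):
        selected[s] = current
        d = np.sum((points - points[current]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, d)
        # The next sample is the point farthest from all selected points.
        current = int(np.argmax(min_dist))
    return selected
```

On eleven equally spaced points along a line, starting from one end, the sampler picks the opposite end second and the midpoint third, which shows how the samples cover the space evenly.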

Conclusion

In this paper, we propose PDNet, a concise MLP-based network for point cloud processing. Equipped with the Point Deformable Aggregation Module (PDAM), our model achieves both long-range dependency and adaptive spatial aggregation in a data-dependent way. For each query point, PDAM aggregates information from deformable reference points, which are initialized according to the point positions and then shifted via additional offsets and modulation scalars conditioned on the input point features. Enhanced Normal Embedding further helps improve the representation ability of each point itself. Extensive experiments and ablation studies illustrate the effectiveness of PDNet across various tasks. We hope our work can inspire further exploration of suitable deformable mechanisms for point clouds.

References

  • Armeni et al. (2016) Armeni, I.; Sener, O.; Zamir, A. R.; Jiang, H.; Brilakis, I.; Fischer, M.; and Savarese, S. 2016. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1534–1543.
  • Cheng et al. (2021) Cheng, S.; Chen, X.; He, X.; Liu, Z.; and Bai, X. 2021. Pra-net: Point relation-aware network for 3d point cloud analysis. IEEE Transactions on Image Processing, 30: 4436–4448.
  • Chu et al. (2021) Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; and Shen, C. 2021. Twins: Revisiting the design of spatial attention in vision transformers. Advances in Neural Information Processing Systems, 34: 9355–9366.
  • Dai et al. (2017) Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; and Wei, Y. 2017. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 764–773.
  • Goyal et al. (2021) Goyal, A.; Law, H.; Liu, B.; Newell, A.; and Deng, J. 2021. Revisiting point cloud shape classification with a simple and effective baseline. In International Conference on Machine Learning, 3809–3820. PMLR.
  • Guo et al. (2021) Guo, M.-H.; Cai, J.-X.; Liu, Z.-N.; Mu, T.-J.; Martin, R. R.; and Hu, S.-M. 2021. Pct: Point cloud transformer. Computational Visual Media, 7: 187–199.
  • Hu et al. (2023) Hu, H.; Fanyi, W.; Jingwen, S.; Hongtao, Z.; Yaonong, W.; Laifeng, H.; Yanhao, Z.; and Zhiwang, Z. 2023. GAM: Gradient Attention Module of Optimization for Point Clouds Analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 835–843.
  • Lai et al. (2022) Lai, X.; Liu, J.; Jiang, L.; Wang, L.; Zhao, H.; Liu, S.; Qi, X.; and Jia, J. 2022. Stratified transformer for 3d point cloud segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8500–8509.
  • Li, Chen, and Lee (2018) Li, J.; Chen, B. M.; and Lee, G. H. 2018. So-net: Self-organizing network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9397–9406.
  • Li et al. (2018) Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; and Chen, B. 2018. Pointcnn: Convolution on x-transformed points. Advances in Neural Information Processing Systems, 31.
  • Lin et al. (2023) Lin, H.; Zheng, X.; Li, L.; Chao, F.; Wang, S.; Wang, Y.; Tian, Y.; and Ji, R. 2023. Meta Architecture for Point Cloud Analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 17682–17691.
  • Liu et al. (2019) Liu, Y.; Fan, B.; Xiang, S.; and Pan, C. 2019. Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8895–8904.
  • Loshchilov and Hutter (2018) Loshchilov, I.; and Hutter, F. 2018. Decoupled Weight Decay Regularization. In International Conference on Learning Representations.
  • Ma et al. (2022) Ma, X.; Qin, C.; You, H.; Ran, H.; and Fu, Y. 2022. Rethinking network design and local geometry in point cloud: A simple residual MLP framework. arXiv preprint arXiv:2202.07123.
  • Maturana and Scherer (2015) Maturana, D.; and Scherer, S. 2015. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 922–928. IEEE.
  • Mitra and Nguyen (2003) Mitra, N. J.; and Nguyen, A. 2003. Estimating surface normals in noisy point cloud data. In Proceedings of the Nineteenth Annual Symposium on Computational Geometry, 322–328.
  • Qi et al. (2017a) Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017a. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 652–660.
  • Qi et al. (2017b) Qi, C. R.; Yi, L.; Su, H.; and Guibas, L. J. 2017b. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems, 30.
  • Qian et al. (2021) Qian, G.; Hammoud, H.; Li, G.; Thabet, A.; and Ghanem, B. 2021. Assanet: An anisotropic separable set abstraction for efficient point cloud representation learning. Advances in Neural Information Processing Systems, 34: 28119–28130.
  • Qian et al. (2022) Qian, G.; Li, Y.; Peng, H.; Mai, J.; Hammoud, H.; Elhoseiny, M.; and Ghanem, B. 2022. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. Advances in Neural Information Processing Systems, 35: 23192–23204.
  • Ran, Liu, and Wang (2022) Ran, H.; Liu, J.; and Wang, C. 2022. Surface representation for point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18942–18952.
  • Su et al. (2015) Su, H.; Maji, S.; Kalogerakis, E.; and Learned-Miller, E. 2015. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, 945–953.
  • Szegedy et al. (2016) Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826.
  • Tang et al. (2022a) Tang, L.; Zhan, Y.; Chen, Z.; Yu, B.; and Tao, D. 2022a. Contrastive boundary learning for point cloud segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8489–8499.
  • Tang et al. (2022b) Tang, Y.; Qian, Y.; Zhang, Q.; Zeng, Y.; Hou, J.; and Zhe, X. 2022b. WarpingGAN: Warping multiple uniform priors for adversarial 3D point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6397–6405.
  • Thomas et al. (2019) Thomas, H.; Qi, C. R.; Deschaud, J.-E.; Marcotegui, B.; Goulette, F.; and Guibas, L. J. 2019. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 6411–6420.
  • Uy et al. (2019) Uy, M. A.; Pham, Q.-H.; Hua, B.-S.; Nguyen, T.; and Yeung, S.-K. 2019. Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1588–1597.
  • Wang, Samari, and Siddiqi (2018) Wang, C.; Samari, B.; and Siddiqi, K. 2018. Local spectral graph convolution for point set feature learning. In Proceedings of the European Conference on Computer Vision (ECCV), 52–66.
  • Wang et al. (2023) Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H.; et al. 2023. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14408–14419.
  • Wang et al. (2018) Wang, X.; Girshick, R.; Gupta, A.; and He, K. 2018. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7794–7803.
  • Wang et al. (2019) Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S. E.; Bronstein, M. M.; and Solomon, J. M. 2019. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (ToG), 38(5): 1–12.
  • Wu, Qi, and Fuxin (2019) Wu, W.; Qi, Z.; and Fuxin, L. 2019. Pointconv: Deep convolutional networks on 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9621–9630.
  • Wu et al. (2022) Wu, X.; Lao, Y.; Jiang, L.; Liu, X.; and Zhao, H. 2022. Point transformer v2: Grouped vector attention and partition-based pooling. Advances in Neural Information Processing Systems, 35: 33330–33342.
  • Wu et al. (2015) Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; and Xiao, J. 2015. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1912–1920.
  • Xia et al. (2022) Xia, Z.; Pan, X.; Song, S.; Li, L. E.; and Huang, G. 2022. Vision transformer with deformable attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4794–4803.
  • Xu et al. (2021a) Xu, M.; Ding, R.; Zhao, H.; and Qi, X. 2021a. Paconv: Position adaptive convolution with dynamic kernel assembling on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3173–3182.
  • Xu et al. (2021b) Xu, M.; Zhang, J.; Zhou, Z.; Xu, M.; Qi, X.; and Qiao, Y. 2021b. Learning geometry-disentangled representation for complementary understanding of 3d object point cloud. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 3056–3064.
  • Yang et al. (2022) Yang, C.; Qiao, S.; Yu, Q.; Yuan, X.; Zhu, Y.; Yuille, A.; Adam, H.; and Chen, L.-C. 2022. MOAT: Alternating mobile convolution and attention brings strong vision models. In International Conference on Learning Representations.
  • Yang et al. (2020) Yang, Z.; Sun, Y.; Liu, S.; Qi, X.; and Jia, J. 2020. Cn: Channel normalization for point cloud recognition. In Proceedings of the European Conference on Computer Vision (ECCV), 600–616.
  • Yi et al. (2016) Yi, L.; Kim, V. G.; Ceylan, D.; Shen, I.-C.; Yan, M.; Su, H.; Lu, C.; Huang, Q.; Sheffer, A.; and Guibas, L. 2016. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (ToG), 35(6): 1–12.
  • Yue et al. (2021) Yue, X.; Sun, S.; Kuang, Z.; Wei, M.; Torr, P. H.; Zhang, W.; and Lin, D. 2021. Vision transformer with progressive sampling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 387–396.
  • Zhang et al. (2023) Zhang, R.; Wang, L.; Wang, Y.; Gao, P.; Li, H.; and Shi, J. 2023. Parameter is not all you need: Starting from non-parametric networks for 3d point cloud analysis. arXiv:2303.08134.
  • Zhao et al. (2021) Zhao, H.; Jiang, L.; Jia, J.; Torr, P. H.; and Koltun, V. 2021. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 16259–16268.
  • Zhou et al. (2021) Zhou, H.; Feng, Y.; Fang, M.; Wei, M.; Qin, J.; and Lu, T. 2021. Adaptive graph convolution for point cloud analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4965–4974.
  • Zhu et al. (2019) Zhu, X.; Hu, H.; Lin, S.; and Dai, J. 2019. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9308–9316.
  • Zhu et al. (2020) Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; and Dai, J. 2020. Deformable DETR: Deformable transformers for end-to-end object detection. In International Conference on Learning Representations.