Point Deformable Network with Enhanced Normal Embedding
for Point Cloud Analysis

Xingyilang Yin¹, Xi Yang¹, Liangchen Liu¹, Nannan Wang¹, Xinbo Gao² Corresponding author.

Abstract

Recently, MLP-based methods have shown strong performance in point cloud analysis. Simple MLP architectures are able to learn geometric features in local point groups yet fail to model long-range dependencies directly. In this paper, we propose Point Deformable Network (PDNet), a concise MLP-based network that can capture long-range relations with strong representation ability. Specifically, we put forward Point Deformable Aggregation Module (PDAM) to improve representation capability in both long-range dependency and adaptive aggregation among points. For each query point, PDAM aggregates information from deformable reference points rather than points in limited local areas. The deformable reference points are generated in a data-dependent manner, and we initialize them according to the input point positions. Additional offsets and modulation scalars are learned on the whole point features, which shift the deformable reference points to the regions of interest. We also suggest estimating the normal vector for point clouds and applying Enhanced Normal Embedding (ENE) to the geometric extractors to improve the representation ability of each single point. Extensive experiments and ablation studies on various benchmarks demonstrate the effectiveness and superiority of our PDNet.

Introduction

Point cloud analysis receives great interest due to numerous 3D data acquisition devices applied in various areas, such as autonomous driving and robotics. Unlike images that have regular 2D grids, point clouds are inherently sparse, unordered, and unstructured data. Thus directly processing point clouds is challenging. Recently, MLP-based methods have obtained significant performance with simple components. PointNeXt (Qian et al. 2022) revisits the classical PointNet++ (Qi et al. 2017b) and improves it with modern training and scaling strategies. PointNeXt makes enormous improvements over PointNet++ and even outperforms dedicatedly designed convolution-based (Xu et al. 2021a), graph-based (Zhou et al. 2021), and powerful point transformer methods (Zhao et al. 2021; Lai et al. 2022; Wu et al. 2022). This reveals that concise MLP modules can already describe the local geometric properties of point clouds. The subsequent PointMetaBase (Lin et al. 2023) further modifies PointNeXt with explicit position encoding and an MLP before the grouping operation. Although the existing MLP-based approaches show strong generalization ability in various tasks, they ignore modeling long-range dependencies. As illustrated in Figure 1(a) and 1(b), the previous MLP-based methods (Qi et al. 2017b; Ma et al. 2022; Qian et al. 2022; Lin et al. 2023) only focus on aggregating information in local point groups constructed by kNN or ball query, and thus fail to learn features from distant regions. However, capturing long-range relations has been demonstrated to be crucial in understanding global shape context (Wang et al. 2018; Lai et al. 2022).

To solve the aforementioned problem, we need to explore aggregating information from distant regions for each query point. In the literature on processing images, learning deformable convolution filters has been shown effective in various challenging vision tasks due to adaptive spatial aggregation in long-range and more informative regions (Dai et al. 2017; Zhu et al. 2019, 2020; Wang et al. 2023). This motivates us to design the deformable mechanism for point clouds. In contrast to images that have structured 2D grids, point clouds are sparse and unstructured data. Thus a naive implementation of deformable mechanisms suited for images cannot be directly applied to point clouds. To alleviate this, KPConv (Thomas et al. 2019) first introduces a deformable mechanism to point clouds. It adopts pseudo-grid convolution through predefined kernel points based on local point positions with weight matrices learned by local point features. The deformable mechanism is further applied to learn offsets constrained in local areas, which helps refine the kernel points.

However, KPConv still focuses on developing sophisticated modules to extract local structures, and long-range dependency is not considered. To this end, we propose a simple and effective MLP-based network named Point Deformable Network (PDNet). Specifically, we first put forward Point Deformable Aggregation Module (PDAM), which achieves both long-range dependency and adaptive spatial aggregation suited to point clouds at the same time. As shown in Figure 1(c), different from extractors that aggregate information in fixed local regions, our PDAM aggregates information from deformable reference points for each query point. The initial reference points are generated on the positions of all points, and the additional offsets and modulation scalars are learned on the whole point features. Thus the reference points are shifted to the relevant regions and bring more informative geometric features for aggregation, which strengthens the representation ability among points. Further, we suggest applying least square fitting to estimate the normal vector of a point cloud and using Enhanced Normal Embedding (ENE) to improve the representation ability of each single point. Extensive experiments on various challenging benchmarks demonstrate the effectiveness of our methods. Our PDNet outperforms other competitive MLP-based models and achieves state-of-the-art results.

Related Work

Point-based networks on point clouds.

In contrast to projection-based methods that project point clouds to multi-view images (Su et al. 2015; Goyal et al. 2021) or structured 3D voxels (Wu et al. 2015; Maturana and Scherer 2015), point-based methods process unstructured point clouds directly. PointNet (Qi et al. 2017a), the pioneering point-based network, proposes to model the permutation invariance of point clouds by using shared MLPs to encode pointwise features and aggregating them by symmetric functions like max-pooling. To better capture local geometric structures, PointNet++ (Qi et al. 2017b) proposes a hierarchical structure by gradually downsampling with farthest point sampling and aggregating features from neighbor points with the kNN or ball query method. Currently, most point-based methods focus on the design of local geometric extractors. Convolution-based approaches (Li et al. 2018; Liu et al. 2019; Thomas et al. 2019; Xu et al. 2021a) propose several invariant and dynamic convolution kernels to aggregate point features. Graph convolution-based methods (Wang, Samari, and Siddiqi 2018; Wang et al. 2019; Zhou et al. 2021) treat points and their relations as vertices and edges of a graph, respectively. Point features can then be extracted by applying graph convolution on the graph. Point Transformers (Zhao et al. 2021; Guo et al. 2021; Lai et al. 2022; Wu et al. 2022) capture local and global information through self-attention. Recently, MLP-based approaches (Ma et al. 2022; Ran, Liu, and Wang 2022; Tang et al. 2022b; Qian et al. 2022; Zhang et al. 2023; Lin et al. 2023) obtain competitive results with simple network architectures. PointMLP proposes a geometric affine module to enhance the residual MLP network. PointNeXt follows the design philosophy of PointNet++ and integrates it with improved training and scaling strategies. PointMetaBase revisits the existing methods and proposes a meta-architecture for point cloud analysis. Although these MLP-based networks show high performance in learning local geometry, the exploration of long-range dependency is omitted. Our PDNet is an MLP-based network that enjoys both long-range dependency and adaptive position aggregation inspired by deformable mechanisms.

Deformable networks on images.

The deformable mechanism is first presented by the Deformable Convolutional Network (DCN) (Dai et al. 2017) to enhance the capability of convolution with additional offsets and adaptive spatial aggregation conditioned on input data. DCNv2 (Zhu et al. 2019) improves its ability by introducing a modulation mechanism. The deformable mechanism has also been applied to ViTs (Zhu et al. 2020; Yue et al. 2021; Xia et al. 2022), where it shows powerful capability in refining visual tokens. Recently, InternImage (Wang et al. 2023) proposes a large-scale CNN-based architecture with DCNv3, which benefits from both long-range dependency and adaptive spatial aggregation and outperforms related work. However, deformable mechanisms designed for images do not fit unstructured point clouds. This work aims to develop a deformable mechanism for point clouds to aggregate point features from relevant areas through learned initial positions and offsets conditioned on the input points.

Methods

In this section, we first briefly describe the background of MLP-based approaches. Second, we propose Point Deformable Aggregation Module to achieve both long-range dependency and adaptive spatial aggregation in a data-dependent way. Third, we introduce least square fitting to estimate the point normal vector and suggest applying additional normal embedding to strengthen the representation ability of the network. Finally, we present the overall architectures of PDNet for classification and segmentation tasks.

Preliminary

In this subsection, we briefly revisit some point MLP-based approaches such as PointNet++ (Qi et al. 2017b), PointNeXt (Qian et al. 2022), and PointMetaBase (Lin et al. 2023).

PointNet++

captures local geometric features through the set abstraction (SA) module. The SA module consists of a subsampling layer that selects the input points and a neighborhood aggregation module that extracts local patterns. The neighborhood aggregation is formulated as:

$$f_{i}^{l+1}=\mathcal{A}(\{\mathcal{M}([f_{j}^{l},p_{j}^{l}-p_{i}^{l}]),\forall j\in\mathcal{N}_{i}\}), \tag{1}$$

where $\mathcal{N}_{i}$ is the index set of neighbors of point $i$ . $p_{i}^{l}$ , $p_{j}^{l}$ , $f_{j}^{l}$ are the point coordinates selected through farthest point sampling, the coordinates, and the features of neighbor $j$ in the stage $l$ of the network, respectively. $\mathcal{M}$ represents the shared MLPs that encode the concatenation of point features of neighbor $j$ and the relative coordinates $p_{j}^{l}-p_{i}^{l}$ . $\mathcal{A}$ is the symmetric aggregation function such as max-pooling.
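As a minimal illustration of Eq. (1), the following NumPy sketch groups the k nearest neighbors of each sampled center, applies a shared layer to the concatenation of neighbor features and relative coordinates, and reduces with max-pooling. It is a sketch under stated assumptions rather than the PointNet++ implementation: $\mathcal{M}$ is reduced to a single linear layer with ReLU, kNN is computed by brute force, the first 16 points stand in for FPS-selected centers, and all function names are hypothetical.

```python
import numpy as np

def knn_indices(centers, points, k):
    """Indices of the k nearest points (Euclidean) for every center."""
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(-1)  # (M, N)
    return np.argsort(d2, axis=1)[:, :k]                             # (M, k)

def set_abstraction(centers, points, feats, k, weight, bias):
    """Eq. (1): max over neighbors j of M([f_j, p_j - p_i])."""
    idx = knn_indices(centers, points, k)                 # (M, k)
    rel = points[idx] - centers[:, None, :]               # (M, k, 3), p_j - p_i
    grouped = np.concatenate([feats[idx], rel], axis=-1)  # (M, k, C + 3)
    # Assumption: M is one shared linear layer followed by ReLU.
    hidden = np.maximum(grouped @ weight + bias, 0.0)     # (M, k, C_out)
    return hidden.max(axis=1)                             # symmetric A = max

rng = np.random.default_rng(0)
pts = rng.normal(size=(64, 3))          # point coordinates
f = rng.normal(size=(64, 8))            # point features, C = 8
W = rng.normal(size=(8 + 3, 16)) * 0.1  # shared layer weights
b = np.zeros(16)
# Assumption: the first 16 points stand in for FPS-selected centers.
out = set_abstraction(pts[:16], pts, f, k=4, weight=W, bias=b)
print(out.shape)
```

Because the reduction is a maximum over the neighbor axis, the result does not depend on the order in which the input points are stored.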

PointNeXt

further appends Inverted residual MLP (InvResMLP) block after SA module to enhance point features:

$$f_{i}^{l+1}=\mathcal{M}_{2}(\mathcal{A}(\{\mathcal{M}_{1}([f_{j}^{l},p_{j}^{l}-p_{i}^{l}]),\forall j\in\mathcal{N}_{i}\}))+f_{i}^{l}, \tag{2}$$

where PointNeXt uses one layer MLP $\mathcal{M}_{1}$ for neighbor feature aggregation and 2-layer MLP $\mathcal{M}_{2}$ for point feature update. $f_{i}^{l}$ is the input point features in stage $l$ .

PointMetaBase

slightly modifies InvResMLP and applies position encoding $\delta$ for relative coordinates $p_{j}^{l}-p_{i}^{l}$ :

$${f_{i}^{l}}^{\prime}=\mathcal{M}_{3}(f_{i}^{l}),\quad {f_{j}^{l}}^{\prime}=\mathrm{Group}({f_{i}^{l}}^{\prime},p_{i}^{l}), \tag{3}$$

$$f_{i}^{l+1}=\mathcal{M}_{2}(\mathcal{A}(\{{f_{j}^{l}}^{\prime}+\delta(p_{j}^{l}-p_{i}^{l}),\forall j\in\mathcal{N}_{i}\}))+f_{i}^{l}. \tag{4}$$

Notice that PointNeXt uses the mapping function $\mathcal{M}_{1}$ (e.g., an MLP) after the grouping layer, while PointMetaBase adopts $\mathcal{M}_{3}$ before the grouping operation to reduce computation. ${f_{i}^{l}}^{\prime}$ and ${f_{j}^{l}}^{\prime}$ are the updated point features of point $i$ and its neighbor $j$, respectively.

Point Deformable Aggregation Module

As discussed in Section 3.1, previous MLP-based approaches capture geometric features through local point groups. They aggregate features in local areas such as the fixed numbers of local neighbor points or points in a small radius, which fail to directly model long-range dependency. To solve it, we propose Point Deformable Aggregation Module (PDAM), which captures long-range relations and achieves adaptive spatial aggregation at the same time in a data-dependent way. Given an input image $x\in\mathbb{R}^{C\times H\times W}$ , deformable mechanism proposed in DCNv2 (Zhu et al. 2019) can be described as:

$$y(p)=\sum_{k=1}^{K}w_{k}m_{k}x(p+p_{k}+\Delta p_{k}), \tag{5}$$

where $y(p)$ and $x(p)$ denote the output feature maps and input feature maps at location $p$. $K$ represents the number of sampling locations, and $w_{k}$ and $p_{k}$ are the weight projection and the predefined offset for the $k$-th location, respectively. For example, for a convolution with a 3 $\times$ 3 kernel and dilation 1, $K$ = 9 and $p_{k}\in\{(-1,-1),(-1,0),...,(1,1)\}$. $\Delta p_{k}$ is the learnable offset with $2K$ channels conditioned on the input feature $x$. $m_{k}$ is the modulation scalar with $K$ channels obtained through a convolutional layer and sigmoid activation over the same input. However, applying the deformable mechanism for images to points is a non-trivial problem. In contrast to images that have structured 2D grids, point clouds are unstructured data that are unevenly distributed in space. Directly using predefined grid sampling like $p_{k}$ does not suit point clouds.

Inspired by the deformable mechanism presented by (Dai et al. 2017; Zhu et al. 2019), we propose the PDAM to offer point clouds adaptive position aggregation and long-range relations via deformable reference points, as illustrated in Figure 2. Specifically, we denote the input point cloud with $N$ points at the $l$-th stage as $\{p_{i}^{l},f_{i}^{l}\}_{i=1}^{N}$, where $p_{i}^{l}\in\mathbb{R}^{1\times 3}$ and $f_{i}^{l}\in\mathbb{R}^{1\times C}$ are the coordinate and feature of point $i$, respectively. For every point $i$, we first initialize $R$ reference points through farthest point sampling (FPS) over all the point positions, which mitigates the irregularity of the point cloud and leads the initial reference points $\{p_{r}^{l}\}_{r=1}^{R}$ to be uniformly distributed. This process can be described as:

$$\{p_{r}^{l}\}_{r=1}^{R}=\mathrm{FPS}(\{p_{i}^{l}\}_{i=1}^{N}). \tag{6}$$
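The FPS initialization in Eq. (6) can be sketched in a few lines of NumPy. This is the standard greedy formulation, given here only as an illustration: the starting index and the function name are assumptions, and practical implementations usually run batched on the GPU.

```python
import numpy as np

def farthest_point_sampling(points, num_samples, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set.

    points: (N, 3) array. Returns the indices of the selected points.
    The starting index is an assumption; it is often chosen at random.
    """
    n = points.shape[0]
    selected = np.empty(num_samples, dtype=np.int64)
    min_d2 = np.full(n, np.inf)  # squared distance to the nearest selected point
    current = start
    for r in range(num_samples):
        selected[r] = current
        d2 = ((points - points[current]) ** 2).sum(axis=1)
        min_d2 = np.minimum(min_d2, d2)
        current = int(np.argmax(min_d2))  # farthest from everything chosen so far
    return selected

line = np.array([[0.0, 0, 0], [1.0, 0, 0], [2.0, 0, 0], [10.0, 0, 0]])
ref_idx = farthest_point_sampling(line, num_samples=3)
print(ref_idx.tolist())  # [0, 3, 2]
```

On the toy input, the sampler starts at the first point, jumps to the most distant point at x = 10, and then picks x = 2, which is farther from both selected points than x = 1.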

Different from KPConv (Thomas et al. 2019), which computes and refines kernel points within a local sphere, our initial reference points are based on the positions of all points and therefore enjoy a larger receptive field. Then, to obtain the offset for each reference point, we feed the point features $\{f_{i}^{l}\}_{i=1}^{N},f_{i}^{l}\in\mathbb{R}^{1\times C}$ of all points to the offset generation network (OGN) to output the offsets $\{\Delta p_{r}^{l}\}_{r=1}^{R},\Delta p_{r}^{l}\in\mathbb{R}^{1\times 3}$ as follows:

$$\{\Delta p_{r}^{l}\}_{r=1}^{R}=\mathrm{OGN}(\{f_{i}^{l}\}_{i=1}^{N}), \tag{7}$$

where OGN is implemented as two linear layers with learnable weight matrices $W_{1}\in\mathbb{R}^{N\times N}$ and $W_{2}\in\mathbb{R}^{N\times 3R}$ that output $3R$ offset values, so the reference points can be shifted to any reasonable position. This brings the query point more relevant information from global contexts learned on the whole point features. However, since point clouds are discrete, there may be no point exactly at the shifted positions $\{{p_{r}^{l}}^{\prime}\}_{r=1}^{R}$, ${p_{r}^{l}}^{\prime}=p_{r}^{l}+\Delta p_{r}^{l}$. To alleviate this, we adopt an inverse distance weighted average based on $K$ nearest neighbors (KNN) and interpolate the features near position ${p_{r}^{l}}^{\prime}$ to obtain the features of the deformable reference point in the local region of ${p_{r}^{l}}^{\prime}$ as follows:

$$\{p_{k}^{l}\}_{k=1}^{K}=\mathrm{KNN}({p_{r}^{l}}^{\prime}), \tag{8}$$

$$f^{(j)}({p_{r}^{l}}^{\prime})=\frac{\sum_{k=1}^{K}w_{k}({p_{r}^{l}}^{\prime})f_{k}^{(j)}}{\sum_{k=1}^{K}w_{k}({p_{r}^{l}}^{\prime})},\quad j=1,...,C, \tag{9}$$

where $\{p_{k}^{l}\}_{k=1}^{K}$ are the $K$ nearest neighbor points of position ${p_{r}^{l}}^{\prime}$. $w_{k}({p_{r}^{l}}^{\prime})=\frac{1}{d({p_{r}^{l}}^{\prime},p_{k}^{l})}$ is the interpolation weight, and $d(\cdot,\cdot)$ computes the distance between two points. Thus the deformable reference points carry the information $f({p_{r}^{l}}^{\prime})$ of the relevant regions, which will be involved in the aggregation with the query point $i$.
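The interpolation of Eqs. (8) and (9) can be illustrated with a short NumPy sketch. It uses a brute-force neighbor search, and it adds a small constant `eps` to the distance so that the weight stays finite when a shifted reference point coincides with an input point. Both the function name and `eps` are assumptions and are not specified in the text.

```python
import numpy as np

def interpolate_features(query_pos, points, feats, k=3, eps=1e-8):
    """Inverse-distance-weighted average of the k nearest point features.

    query_pos: (R, 3) shifted reference positions, points: (N, 3), feats: (N, C).
    eps is an assumption that avoids division by zero when d = 0.
    """
    d = np.linalg.norm(query_pos[:, None, :] - points[None, :, :], axis=-1)  # (R, N)
    idx = np.argsort(d, axis=1)[:, :k]             # Eq. (8): k nearest neighbors
    d_k = np.take_along_axis(d, idx, axis=1)       # (R, k)
    w = 1.0 / (d_k + eps)                          # w_k = 1 / d
    w = w / w.sum(axis=1, keepdims=True)           # normalisation in Eq. (9)
    return (w[:, :, None] * feats[idx]).sum(axis=1)  # (R, C)

pts = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
feats = np.array([[0.0], [10.0]])
queries = np.array([[1.0, 0.0, 0.0], [0.5, 0.0, 0.0], [0.0, 0.0, 0.0]])
interp = interpolate_features(queries, pts, feats, k=2)
print(np.round(interp[:, 0], 3))  # approximately 5.0, 2.5, 0.0
```

The midpoint receives the mean of the two features, the query at x = 0.5 is weighted 3:1 toward the nearer point, and a query that coincides with an input point essentially copies that point's feature.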

Further, we use two linear layers with a sigmoid layer to obtain $R$ channels of the modulation scalars $\{\Delta m_{r}^{l}\}_{r=1}^{R}$ for deformable reference point features. Finally, the aggregation procedure for the query point can be defined as:

$$f_{i}^{l+1}=\mathcal{M}(\mathcal{A}(\{\Delta m_{r}^{l}f({p_{r}^{l}}^{\prime})+\delta({p_{r}^{l}}^{\prime}-p_{i}^{l})\}_{r=1}^{R}))+f_{i}^{l}, \tag{10}$$

where $\delta({p_{r}^{l}}^{\prime}-p_{i}^{l})$ is the relative position embedding between point $i$ and its deformable reference point ${p_{r}^{l}}^{\prime}$. $\mathcal{A}$ is the symmetric aggregation function (max-pooling), and $\mathcal{M}$ is a mapping function such as an MLP.
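To make the data flow of Eq. (10) concrete, the sketch below takes the shifted reference positions, their interpolated features, and the modulation scalars as given inputs and performs the modulated aggregation with a residual connection. Several choices are assumptions made only for illustration: the reference set and modulation scalars are shared by all query points, $\delta$ and $\mathcal{M}$ are bias-free 2-layer MLPs with ReLU, and the weights are random rather than learned.

```python
import numpy as np

def mlp2(x, w1, w2):
    """Two-layer MLP with ReLU (biases omitted; an assumption)."""
    return np.maximum(x @ w1, 0.0) @ w2

def pdam_aggregate(p, f, p_ref, f_ref, m, delta_w, out_w):
    """Eq. (10) for all query points at once.

    p: (N, 3) query positions, f: (N, C) query features,
    p_ref: (R, 3) shifted reference positions, f_ref: (R, C) their features,
    m: (R,) modulation scalars in (0, 1).
    """
    rel = p_ref[None, :, :] - p[:, None, :]             # (N, R, 3), p'_r - p_i
    pos = mlp2(rel, *delta_w)                           # delta(.), (N, R, C)
    cand = m[None, :, None] * f_ref[None, :, :] + pos   # modulated feature + embedding
    pooled = cand.max(axis=1)                           # A = max over the R references
    return mlp2(pooled, *out_w) + f                     # M(.) plus residual f_i

rng = np.random.default_rng(0)
N, R, C = 10, 4, 8
p, f = rng.normal(size=(N, 3)), rng.normal(size=(N, C))
p_ref, f_ref = rng.normal(size=(R, 3)), rng.normal(size=(R, C))
m = 1.0 / (1.0 + np.exp(-rng.normal(size=R)))           # sigmoid output
delta_w = (rng.normal(size=(3, C)), rng.normal(size=(C, C)))
out_w = (rng.normal(size=(C, C)), rng.normal(size=(C, C)))
out = pdam_aggregate(p, f, p_ref, f_ref, m, delta_w, out_w)
print(out.shape)
```

Since every query point attends to the same $R$ reference points regardless of distance, the receptive field is no longer limited to a local neighborhood, and the max-pooling keeps the result independent of the order of the reference points.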

Enhanced Normal Embedding

PDAM aggregates long-range contexts from regions of interest for each query point, strengthening the representation capability among points. In this subsection, we further propose Enhanced Normal Embedding to improve the representation ability of each point itself. Normal features provide geometric information about point clouds. Using additional point normals rather than only consuming point coordinates in the network has been proven effective in various works (Qi et al. 2017a, b; Li, Chen, and Lee 2018; Wu, Qi, and Fuxin 2019). However, it does not work if no point normals exist in the dataset. Inspired by (Mitra and Nguyen 2003), we adopt the least square fitting to estimate the normal vector of a point cloud. Considering point $i$ and its $k-1$ nearest neighbor points $\mathcal{N}_{i}$ , the covariance matrix $M$ can be computed as:

$$M=\frac{1}{k}\sum_{i=1}^{k}(p_{i}-\bar{p})(p_{i}-\bar{p})^{T}, \tag{11}$$

where $p_{i}\in\mathbb{R}^{1\times 3}$ is the coordinate of point $i$ in the $l$-th stage, and $\bar{p}=\frac{1}{k}\sum_{i=1}^{k}p_{i}$ denotes the centroid of point $i$ and its neighbors. Thus $M$ is a $3\times 3$ symmetric positive semidefinite matrix. The normal to the local least square plane for point $i$ can be estimated as the eigenvector corresponding to the minimum eigenvalue of $M$ (Mitra and Nguyen 2003). We utilize singular value decomposition to obtain the normal feature $n_{i}$ for each point $i$.
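The normal estimation of Eq. (11) can be sketched with NumPy as follows. The neighbor search is brute force, the value of `k` is an arbitrary choice, and the sign of each normal is left unresolved; these are assumptions of the sketch, since the text does not specify them.

```python
import numpy as np

def estimate_normals(points, k=16):
    """Per-point normal from the covariance of the point and its k-1 neighbors.

    points: (N, 3). Returns (N, 3) unit vectors with ambiguous sign.
    """
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)  # (N, N)
    idx = np.argsort(d2, axis=1)[:, :k]   # the point itself plus k-1 neighbors
    nbrs = points[idx]                    # (N, k, 3)
    centered = nbrs - nbrs.mean(axis=1, keepdims=True)
    cov = np.einsum("nki,nkj->nij", centered, centered) / k  # Eq. (11), (N, 3, 3)
    # Singular values are sorted in descending order, so for a symmetric PSD
    # matrix the last left singular vector belongs to the smallest eigenvalue.
    u, _, _ = np.linalg.svd(cov)
    return u[:, :, -1]

rng = np.random.default_rng(0)
xy = rng.uniform(-1.0, 1.0, size=(200, 2))
plane = np.column_stack([xy, np.zeros(200)])  # points on the plane z = 0
normals = estimate_normals(plane, k=8)
print(np.abs(normals[:3]))
```

For points sampled on the plane z = 0, every estimated normal is parallel to the z-axis up to sign, which is the expected least square plane normal.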

Similar to the position encoding widely used in attention-based and MLP-based methods (Lai et al. 2022; Wu et al. 2022; Lin et al. 2023) to learn complex positional relations among point groups or all points, we propose to strengthen the point normal via Enhanced Normal Embedding (ENE). According to (Yang et al. 2020; Ran, Liu, and Wang 2022), point position and point normal are features with different distributions, which can be decoupled along the channel dimension and fused through summation after embedding. In this paper, we implement ENE with a 2-layer MLP.
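The decoupled embedding and summation described above can be illustrated as follows: positions and normals pass through separate 2-layer MLPs and are then added to the point features. The hidden width, the ReLU activation, the random weights, and the helper names are assumptions; only the 2-layer structure and the fusion by summation come from the text.

```python
import numpy as np

def init_params(c_in, hidden, c_out, rng):
    """Random parameters for a 2-layer MLP (illustrative, not trained)."""
    return (rng.normal(size=(c_in, hidden)) * 0.1, np.zeros(hidden),
            rng.normal(size=(hidden, c_out)) * 0.1, np.zeros(c_out))

def mlp2(x, w1, b1, w2, b2):
    """Linear -> ReLU -> Linear."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def embed_and_fuse(feats, rel_pos, normals, pos_params, normal_params):
    """Embed positions and normals separately, then fuse by summation."""
    return feats + mlp2(rel_pos, *pos_params) + mlp2(normals, *normal_params)

rng = np.random.default_rng(0)
K, C, H = 5, 8, 16                    # neighbors, feature width, hidden width
feats = rng.normal(size=(K, C))
rel_pos = rng.normal(size=(K, 3))     # relative coordinates
normals = rng.normal(size=(K, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
pos_params = init_params(3, H, C, rng)
normal_params = init_params(3, H, C, rng)
fused = embed_and_fuse(feats, rel_pos, normals, pos_params, normal_params)
print(fused.shape)
```

Keeping two separate embedding branches lets each MLP adapt to the distribution of its own input before the results are summed.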

Point Deformable Network Architectures

As illustrated in Figure 3, we propose Point Deformable Network (PDNet), which shares a similar hierarchical structure with (Lin et al. 2023; Qian et al. 2022) and incorporates Point Deformable Aggregation Module (PDAM) and Enhanced Normal Embedding (ENE). For the segmentation task, we use a U-Net architecture, which contains an encoder and a decoder. For the classification task, we only use an encoder. The decoder comprises the widely used Feature Propagation layers (Qi et al. 2017b; Qian et al. 2022; Lin et al. 2023) to gradually upsample features via interpolation. Incorporating position embedding and ENE, we tweak the Set Abstraction module (Qi et al. 2017b; Qian et al. 2022) as the reduction block, termed Point Deformable Set Abstraction (PDSA). The encoder is composed of PDSA, Point Local Aggregation Module (PLAM), and PDAM. In the first and second stages of PDNet, we implement PLAM by modifying the PointMetaBase block (defined in equation 4) with additional ENE as follows:

$$f_{i}^{l+1}=\mathcal{M}_{2}(\mathcal{A}(\{{f_{j}^{l}}^{\prime}+\delta(p_{j}^{l}-p_{i}^{l})+\gamma(n_{j}^{l}),\forall j\in\mathcal{N}_{i}\}))+f_{i}^{l}, \tag{12}$$

where $n_{j}^{l}$ is the normal of neighbor $j$ of point $i$ at the $l$-th stage, and $\gamma$ is the ENE implemented with an MLP that maps the three-dimensional point normal vector to the dimension of the high-level features. Incorporating ENE, the PDAM defined in equation 10 can be further modified as:

$$f_{i}^{l+1}=\mathcal{M}(\mathcal{A}(\{\Delta m_{r}^{l}f({p_{r}^{l}}^{\prime})+\delta({p_{r}^{l}}^{\prime}-p_{i}^{l})+\gamma(n_{r}^{l})\}_{r=1}^{R}))+f_{i}^{l}. \tag{13}$$

We introduce parallel PLAM and PDAM in the third and fourth stages of PDNet. The point features are fed into PLAM to aggregate information locally (see equation 12) and passed through PDAM (shown in equation 13) to aggregate information globally at the same time. This design of MLP-based blocks with local and long-range dependencies helps our network learn strong generalization ability.

For a fair comparison, we adopt the same scaling strategies as (Lin et al. 2023; Qian et al. 2022) to construct our PDNet. We define the number of deformable reference points $R=32$ to be consistent with the number of points in their local point groups. The configuration of three variants of PDNet is shown as follows:

• PDNet-S: C = 32, B = 0
• PDNet-L: C = 32, B = (2, 4, 2, 2)
• PDNet-XXL: C = 64, B = (4, 8, 4, 4)

We denote C as the channel size of the stem MLP and B as the number of blocks in a stage. Notice that B = 0 means only one PDSA block but no PLAM or PDAM blocks are used at each stage.
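For reference, the three variants can be written down as a small configuration table. The class and field names below are hypothetical, and expanding B = 0 into one zero per stage assumes the four-stage encoder implied by the per-stage tuples of the larger variants.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class PDNetConfig:
    """Hypothetical container for the variant settings listed above."""
    name: str
    stem_channels: int                  # C: channel size of the stem MLP
    blocks_per_stage: Tuple[int, ...]   # B: PLAM/PDAM blocks in each stage
    num_reference_points: int = 32      # R, as stated in the text

VARIANTS = {
    # B = 0: only the PDSA block is used in every stage (four stages assumed).
    "PDNet-S": PDNetConfig("PDNet-S", 32, (0, 0, 0, 0)),
    "PDNet-L": PDNetConfig("PDNet-L", 32, (2, 4, 2, 2)),
    "PDNet-XXL": PDNetConfig("PDNet-XXL", 64, (4, 8, 4, 4)),
}

for cfg in VARIANTS.values():
    print(cfg.name, cfg.stem_channels, sum(cfg.blocks_per_stage))
```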

Method (time order)	mAcc (%)	OA (%)
PointNet (Qi et al. 2017a)	63.4	68.2
PointNet++ (Qi et al. 2017b)	75.4	77.9
PointCNN (Li et al. 2018)	75.1	78.5
DGCNN (Wang et al. 2019)	73.6	78.1
PRA-Net (Cheng et al. 2021)	77.9	81.0
PointMLP (Ma et al. 2022)	84.4	85.7
PointNeXt (Qian et al. 2022)	86.8	88.2
GAM (Hu et al. 2023)	86.5	88.4
Point-PN (Zhang et al. 2023)	-	87.1
PointMetaBase (Lin et al. 2023)	86.8	88.2
PDNet (ours)	86.8	88.5

Table 1: Shape classification results on PB_T50_RS of ScanObjectNN. mAcc is the mean of class accuracy (%) and OA is the overall accuracy (%).

Experiments

In this section, we evaluate our PDNet on ScanObjectNN (Uy et al. 2019) for shape classification, S3DIS (Armeni et al. 2016) for semantic segmentation, and ShapeNetPart (Yi et al. 2016) for part segmentation. We also provide various ablation studies to better understand the PDNet.

Classification and Segmentation

Experimental setups.

We train our models using CrossEntropy loss with label smoothing (Szegedy et al. 2016), the AdamW optimizer (Loshchilov and Hutter 2018), an initial learning rate lr = 0.001, and weight decay $10^{-4}$ with Cosine Decay for all tasks, unless stated otherwise below. For the S3DIS semantic segmentation task, point clouds are downsampled with a voxel size of 0.04 m following the previous methods (Zhao et al. 2021; Qian et al. 2021, 2022; Lin et al. 2023). For S3DIS, our PDNet is trained using a fixed number of 24000 points per batch with batch size set to 8 and an initial lr = 0.01 for 100 epochs on an NVIDIA 3090 GPU and a 12-core Intel Xeon @ 2.50GHz CPU. For the ScanObjectNN shape classification task, following (Qian et al. 2021; Lin et al. 2023), our PDNet is trained with 1024 points and a weight decay of 0.05 for 250 epochs on an NVIDIA 3090 GPU. The points are randomly sampled during training and uniformly sampled during testing. For ShapeNetPart, we train our model using 2048 randomly sampled points with normals for 300 epochs on 4 NVIDIA 3090 GPUs. ShapeNetPart provides the normal vectors of point clouds, so we do not apply least square fitting to estimate the point normals. The original point normals are used for normal embedding. All the details of data augmentation are the same as those in PointNeXt (Qian et al. 2022) and PointMetaBase (Lin et al. 2023).
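The setup names Cosine Decay without further detail. One common form, shown here only as an illustration, anneals the learning rate from its initial value to a minimum over the training epochs; the absence of warmup and the default minimum of 0 are assumptions.

```python
import math

def cosine_lr(epoch, total_epochs, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch (no warmup assumed)."""
    t = min(max(epoch, 0), total_epochs) / total_epochs  # training progress in [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))

# Schedule for the S3DIS setting above: initial lr = 0.01, 100 epochs.
schedule = [cosine_lr(e, 100, 0.01) for e in (0, 25, 50, 75, 100)]
print([round(v, 5) for v in schedule])
```

The rate stays close to its initial value early in training, reaches half of it at the midpoint, and decays smoothly toward the minimum at the end.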

	S3DIS 6-Fold			S3DIS Area-5
Method (time order)	OA (%)	mAcc (%)	mIoU (%)	OA (%)	mAcc (%)	mIoU (%)
PointNet (Qi et al. 2017a)	78.5	66.2	47.6	-	49.0	41.1
PointCNN (Li et al. 2018)	88.1	75.6	65.4	85.9	63.9	57.3
DGCNN (Wang et al. 2019)	84.1	-	56.1	83.6	-	47.9
KPConv (Thomas et al. 2019)	-	79.1	70.6	-	72.8	67.1
PCT (Guo et al. 2021)	-	67.7	61.3	-	-	-
PAConv (Xu et al. 2021a)	-	-	-	-	73.0	66.6
AdaptConv (Zhou et al. 2021)	-	-	-	90.0	73.2	67.9
Point Transformer (Zhao et al. 2021)	90.2	81.9	73.5	90.8	76.5	70.4
ASSANet (Qian et al. 2021)	-	-	-	-	-	66.8
CBL (Tang et al. 2022a)	89.6	79.4	73.1	90.6	75.2	69.4
StratifiedFormer (Lai et al. 2022)	-	-	-	91.5	78.1	72.0
Point TransformerV2 (Wu et al. 2022)	-	-	-	91.1	77.9	71.6
GAM (Hu et al. 2023)	90.6	83.2	74.4	-	-	-
PointNet++ (Qi et al. 2017b)	81.0	67.1	54.5	83.0	-	53.5
PointNeXt-L (Qian et al. 2022)	89.8	82.2	73.9	90.0 $\pm$ 0.1	-	69.0 $\pm$ 0.5
PointNeXt-XL (Qian et al. 2022)	90.3	83.0	74.9	90.6 $\pm$ 0.1	-	70.5 $\pm$ 0.3
PointMetaBase-L (Lin et al. 2023)	90.6	-	75.6	90.5 $\pm$ 0.1	-	69.5 $\pm$ 0.3
PointMetaBase-XXL (Lin et al. 2023)	91.3	-	77.0	90.8 $\pm$ 0.6	-	71.3 $\pm$ 0.7
PDNet-L (ours)	91.4	85.5	76.7	90.7	77.1	70.8
PDNet-XXL (ours)	91.9	86.2	78.3	91.3	78.1	72.3

Table 2: Semantic segmentation results on S3DIS (6-Fold and Area 5). OA is the overall accuracy (%), mAcc is the mean of class accuracy (%), and mIoU is the mean of instance IoU (%).

Shape Classification.

We first conduct experiments on the real-world shape classification dataset ScanObjectNN (Uy et al. 2019). ScanObjectNN contains approximately 15,000 objects, which have 2902 unique instances that are categorized into 15 classes. We choose the hardest perturbed variant (PB_T50_RS) and report the overall accuracy (OA) and the mean of class accuracy (mAcc) results. As shown in Table 1, our PDNet achieves the highest OA of 88.5% and matches the best mAcc of 86.8% among all baselines. It shows that point normals provide geometric information, and applying ENE helps improve the representation ability of the model.
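The two reported metrics differ in how they weight classes, which the following sketch computes from a confusion matrix. The function name and the toy matrix are illustrative, and the sketch assumes every class has at least one ground-truth sample.

```python
import numpy as np

def classification_metrics(conf):
    """OA and mAcc from a confusion matrix.

    conf[i, j] counts samples of true class i predicted as class j.
    Assumes every class has at least one ground-truth sample.
    """
    conf = np.asarray(conf, dtype=np.float64)
    correct = np.diag(conf)
    oa = correct.sum() / conf.sum()             # fraction of all samples correct
    macc = np.mean(correct / conf.sum(axis=1))  # per-class recall, then averaged
    return oa, macc

toy = [[9, 1],   # class 0: 10 samples, 9 correct
       [2, 2]]   # class 1: 4 samples, 2 correct
oa, macc = classification_metrics(toy)
print(round(oa, 4), round(macc, 4))  # 0.7857 0.7
```

On this imbalanced toy example, OA is dominated by the larger class, while mAcc weights both classes equally and is therefore lower.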

Semantic Segmentation.

We also validate our PDNet on the widely used Stanford Large-Scale 3D Indoor Spaces (S3DIS) (Armeni et al. 2016) dataset for the semantic segmentation task. S3DIS is a challenging benchmark that contains 271 rooms with 13 semantic categories in 6 areas. We report the OA, the mAcc, and the mean of instance IoU (mIoU) results of standard 6-fold cross-validation and Area-5 on S3DIS. As shown in Table 2, our PDNet-XXL outperforms all baselines on S3DIS 6-Fold with an OA of 91.9%, mAcc of 86.2%, and mIoU of 78.3%. On S3DIS Area-5, it achieves the best mIoU of 72.3% with an OA of 91.3% and mAcc of 78.1%; among the listed methods, only StratifiedFormer reports a higher OA (91.5%), with the same mAcc. Notably, the higher mIoU compared with powerful point Transformer architectures (StratifiedFormer and Point TransformerV2) shows the potential of MLP-based methods in point cloud analysis. Compared with recent MLP-based networks (PointNeXt-L and PointMetaBase-L), our PDNet-L gains +2.8% and +1.1% improvement in mIoU on S3DIS 6-Fold, respectively. Consistent progress is obtained when scaling up the models. It demonstrates the importance of long-range dependency in the point semantic segmentation task and the effectiveness of our method in aggregating information from deformable reference points conditioned on the input points. We also provide visualization of semantic segmentation results in Figure 4. Due to the direct modeling of long-range dependency, our method can recognize the objects in red circles while others fail.

Method (time order)	c. mIoU	i. mIoU
PointNet (Qi et al. 2017a)	80.4	83.7
PointNet++ (Qi et al. 2017b)	81.9	85.1
SO-Net (Li, Chen, and Lee 2018)	-	84.6
PointCNN (Li et al. 2018)	84.6	86.1
DGCNN (Wang et al. 2019)	82.3	85.1
KPConv (Thomas et al. 2019)	85.1	86.4
PCT (Guo et al. 2021)	-	86.4
PAConv (Xu et al. 2021a)	84.6	86.1
AdaptConv (Zhou et al. 2021)	83.4	86.4
GDANet (Xu et al. 2021b)	85.0	86.5
Point Trans. (Zhao et al. 2021)	83.7	86.6
PointMLP (Ma et al. 2022)	84.6	86.1
PointNeXt (Qian et al. 2022)	85.2 $\pm$ 0.1	87.0 $\pm$ 0.1
GAM (Hu et al. 2023)	-	87.0
Point-PN (Zhang et al. 2023)	-	86.6
PointMetaBase (Lin et al. 2023)	85.1 $\pm$ 0.3	87.1 $\pm$ 0.0
PDNet (ours)	85.4	87.2

Table 3: Part segmentation results on ShapeNetPart.

Part Segmentation.

ShapeNetPart (Yi et al. 2016) is an object-level dataset for part segmentation. It contains 16,880 models with 16 different shape categories. Each category has 2-6 parts, and there are 50 part labels in total. We evaluate the performance with the mean of class IoU (c. mIoU) and the mean of instance IoU (i. mIoU) in Table 3. PDNet also achieves the best performance of 85.4% in c. mIoU and 87.2% in i. mIoU. Visualizations of part segmentation results are presented in Figure 5.

PLAM	PDAM	NE	mIoU (%)	Params (M)	FLOPs (G)
✓			69.5 $\pm$ 0.3	2.7	2.0
	✓		70.1 $\pm$ 0.2	3.6	2.1
✓		✓	69.7 $\pm$ 0.3	2.7	2.0
✓	✓		70.4 $\pm$ 0.3	4.9	2.4
✓	✓	✓	70.6 $\pm$ 0.2	4.9	2.4

Table 4: Evaluation of proposed components on S3DIS Area-5. mIoU is the mean of IoU (%).

Ablation Studies

Effectiveness of Proposed Components.

We evaluate the proposed components of PDNet-L in Table 4 and report mean±std over three random runs. The PLAM and PDAM columns of Table 4 indicate whether each module is used in the third and fourth stages of PDNet-L. The results show that using PDAM to aggregate information from the deformable reference regions is better than adopting PLAM to aggregate information within local point groups (70.1% vs. 69.5% mIoU). This suggests that the network prefers to learn global relations in the deep stages. Using both PLAM and PDAM obtains a better result than using only PLAM or only PDAM. Our PDNet-L with all the proposed components achieves the best performance of 70.6±0.2% in mIoU.

Stage2	Stage3	Stage4	mIoU (%)
			69.7
		✓	70.1
	✓	✓	70.8
✓	✓	✓	69.9

Table 5: Ablation study on applying PDAM in different stages on S3DIS Area-5.

method	mIoU (%)
random	70.1
center	70.6
FPS	70.8
successive	70.4
parallel	70.8

Table 6: Ablation study on different types of initial reference points and ablation study on combining strategy of PDAM and PLAM on S3DIS Area-5.

PDAM.

We first explore adopting Point Deformable Aggregation Module (PDAM) at different stages. As shown in Table 5, only adopting PDAM in the last stage improves mIoU by 0.4%, and applying it in the last two stages leads to the best performance of 70.8% in mIoU. However, additionally using PDAM in the early stage decreases mIoU to 69.9%, which suggests that our network performs better when adopting Point Local Aggregation Module (PLAM) in the early stages to capture local geometries and PDAM in the deeper stages to model long-range dependencies.

We also investigate several initialization methods for the reference points on S3DIS Area-5. The results are presented in Table 6, which suggests that using FPS to acquire the initial reference points based on the input point positions is superior to random initialization. Moreover, using the center of the input points as the initial status for all the initial reference points performs worse than considering each reference point independently to be uniformly distributed in space. We further conduct an ablation study on the combining strategy of PDAM and PLAM in Table 6. For 3D point clouds, applying PLAM and PDAM in parallel to aggregate information from local and distant regions at the same time is better than the successive design widely used for 2D images (Chu et al. 2021; Yang et al. 2022; Xia et al. 2022), which here corresponds to adopting PDAM to model long-range relations after capturing local features with PLAM.

Conclusion

In this paper, we propose PDNet, a concise MLP-based network for point cloud processing. Equipped with Point Deformable Aggregation Module (PDAM), our model achieves both long-range dependency and adaptive spatial aggregation in a data-dependent way. For each query point, PDAM aggregates information from deformable reference points, which are initialized according to the point positions and then shifted via additional offsets and modulation scalars conditioned on the input point features. Enhanced Normal Embedding further helps improve the representation ability of each point itself. Extensive experiments and ablation studies illustrate the effectiveness of PDNet over various tasks. We hope our work can inspire further exploration of suitable deformable mechanisms for point clouds.

References

Armeni et al. (2016) Armeni, I.; Sener, O.; Zamir, A. R.; Jiang, H.; Brilakis, I.; Fischer, M.; and Savarese, S. 2016. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1534–1543.
Cheng et al. (2021) Cheng, S.; Chen, X.; He, X.; Liu, Z.; and Bai, X. 2021. Pra-net: Point relation-aware network for 3d point cloud analysis. IEEE Transactions on Image Processing, 30: 4436–4448.
Chu et al. (2021) Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; and Shen, C. 2021. Twins: Revisiting the design of spatial attention in vision transformers. Advances in Neural Information Processing Systems, 34: 9355–9366.
Dai et al. (2017) Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; and Wei, Y. 2017. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 764–773.
Goyal et al. (2021) Goyal, A.; Law, H.; Liu, B.; Newell, A.; and Deng, J. 2021. Revisiting point cloud shape classification with a simple and effective baseline. In International Conference on Machine Learning, 3809–3820. PMLR.
Guo et al. (2021) Guo, M.-H.; Cai, J.-X.; Liu, Z.-N.; Mu, T.-J.; Martin, R. R.; and Hu, S.-M. 2021. Pct: Point cloud transformer. Computational Visual Media, 7: 187–199.
Hu et al. (2023) Hu, H.; Fanyi, W.; Jingwen, S.; Hongtao, Z.; Yaonong, W.; Laifeng, H.; Yanhao, Z.; and Zhiwang, Z. 2023. GAM: Gradient Attention Module of Optimization for Point Clouds Analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 835–843.
Lai et al. (2022) Lai, X.; Liu, J.; Jiang, L.; Wang, L.; Zhao, H.; Liu, S.; Qi, X.; and Jia, J. 2022. Stratified transformer for 3d point cloud segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8500–8509.
Li, Chen, and Lee (2018) Li, J.; Chen, B. M.; and Lee, G. H. 2018. So-net: Self-organizing network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9397–9406.
Li et al. (2018) Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; and Chen, B. 2018. Pointcnn: Convolution on x-transformed points. Advances in Neural Information Processing Systems, 31.
Lin et al. (2023) Lin, H.; Zheng, X.; Li, L.; Chao, F.; Wang, S.; Wang, Y.; Tian, Y.; and Ji, R. 2023. Meta Architecture for Point Cloud Analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 17682–17691.
Liu et al. (2019) Liu, Y.; Fan, B.; Xiang, S.; and Pan, C. 2019. Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8895–8904.
Loshchilov and Hutter (2018) Loshchilov, I.; and Hutter, F. 2018. Decoupled Weight Decay Regularization. In International Conference on Learning Representations.
Ma et al. (2022) Ma, X.; Qin, C.; You, H.; Ran, H.; and Fu, Y. 2022. Rethinking network design and local geometry in point cloud: A simple residual MLP framework. arXiv preprint arXiv:2202.07123.
Maturana and Scherer (2015) Maturana, D.; and Scherer, S. 2015. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 922–928. IEEE.
Mitra and Nguyen (2003) Mitra, N. J.; and Nguyen, A. 2003. Estimating surface normals in noisy point cloud data. In Proceedings of the Nineteenth Annual Symposium on Computational Geometry, 322–328.
Qi et al. (2017a) Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017a. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 652–660.
Qi et al. (2017b) Qi, C. R.; Yi, L.; Su, H.; and Guibas, L. J. 2017b. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems, 30.
Qian et al. (2021) Qian, G.; Hammoud, H.; Li, G.; Thabet, A.; and Ghanem, B. 2021. Assanet: An anisotropic separable set abstraction for efficient point cloud representation learning. Advances in Neural Information Processing Systems, 34: 28119–28130.
Qian et al. (2022) Qian, G.; Li, Y.; Peng, H.; Mai, J.; Hammoud, H.; Elhoseiny, M.; and Ghanem, B. 2022. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. Advances in Neural Information Processing Systems, 35: 23192–23204.
Ran, Liu, and Wang (2022) Ran, H.; Liu, J.; and Wang, C. 2022. Surface representation for point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18942–18952.
Su et al. (2015) Su, H.; Maji, S.; Kalogerakis, E.; and Learned-Miller, E. 2015. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, 945–953.
Szegedy et al. (2016) Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826.
Tang et al. (2022a) Tang, L.; Zhan, Y.; Chen, Z.; Yu, B.; and Tao, D. 2022a. Contrastive boundary learning for point cloud segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8489–8499.
Tang et al. (2022b) Tang, Y.; Qian, Y.; Zhang, Q.; Zeng, Y.; Hou, J.; and Zhe, X. 2022b. WarpingGAN: Warping multiple uniform priors for adversarial 3D point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6397–6405.
Thomas et al. (2019) Thomas, H.; Qi, C. R.; Deschaud, J.-E.; Marcotegui, B.; Goulette, F.; and Guibas, L. J. 2019. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 6411–6420.
Uy et al. (2019) Uy, M. A.; Pham, Q.-H.; Hua, B.-S.; Nguyen, T.; and Yeung, S.-K. 2019. Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1588–1597.
Wang, Samari, and Siddiqi (2018) Wang, C.; Samari, B.; and Siddiqi, K. 2018. Local spectral graph convolution for point set feature learning. In Proceedings of the European Conference on Computer Vision (ECCV), 52–66.
Wang et al. (2023) Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H.; et al. 2023. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14408–14419.
Wang et al. (2018) Wang, X.; Girshick, R.; Gupta, A.; and He, K. 2018. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7794–7803.
Wang et al. (2019) Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S. E.; Bronstein, M. M.; and Solomon, J. M. 2019. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (ToG), 38(5): 1–12.
Wu, Qi, and Fuxin (2019) Wu, W.; Qi, Z.; and Fuxin, L. 2019. Pointconv: Deep convolutional networks on 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9621–9630.
Wu et al. (2022) Wu, X.; Lao, Y.; Jiang, L.; Liu, X.; and Zhao, H. 2022. Point transformer v2: Grouped vector attention and partition-based pooling. Advances in Neural Information Processing Systems, 35: 33330–33342.
Wu et al. (2015) Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; and Xiao, J. 2015. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1912–1920.
Xia et al. (2022) Xia, Z.; Pan, X.; Song, S.; Li, L. E.; and Huang, G. 2022. Vision transformer with deformable attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4794–4803.
Xu et al. (2021a) Xu, M.; Ding, R.; Zhao, H.; and Qi, X. 2021a. Paconv: Position adaptive convolution with dynamic kernel assembling on point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3173–3182.
Xu et al. (2021b) Xu, M.; Zhang, J.; Zhou, Z.; Xu, M.; Qi, X.; and Qiao, Y. 2021b. Learning geometry-disentangled representation for complementary understanding of 3d object point cloud. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 3056–3064.
Yang et al. (2022) Yang, C.; Qiao, S.; Yu, Q.; Yuan, X.; Zhu, Y.; Yuille, A.; Adam, H.; and Chen, L.-C. 2022. MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models. In International Conference on Learning Representations.
Yang et al. (2020) Yang, Z.; Sun, Y.; Liu, S.; Qi, X.; and Jia, J. 2020. Cn: Channel normalization for point cloud recognition. In Proceedings of the European Conference on Computer Vision (ECCV), 600–616.
Yi et al. (2016) Yi, L.; Kim, V. G.; Ceylan, D.; Shen, I.-C.; Yan, M.; Su, H.; Lu, C.; Huang, Q.; Sheffer, A.; and Guibas, L. 2016. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (ToG), 35(6): 1–12.
Yue et al. (2021) Yue, X.; Sun, S.; Kuang, Z.; Wei, M.; Torr, P. H.; Zhang, W.; and Lin, D. 2021. Vision transformer with progressive sampling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 387–396.
Zhang et al. (2023) Zhang, R.; Wang, L.; Wang, Y.; Gao, P.; Li, H.; and Shi, J. 2023. Parameter is not all you need: Starting from non-parametric networks for 3d point cloud analysis. arXiv preprint arXiv:2303.08134.
Zhao et al. (2021) Zhao, H.; Jiang, L.; Jia, J.; Torr, P. H.; and Koltun, V. 2021. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 16259–16268.
Zhou et al. (2021) Zhou, H.; Feng, Y.; Fang, M.; Wei, M.; Qin, J.; and Lu, T. 2021. Adaptive graph convolution for point cloud analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4965–4974.
Zhu et al. (2019) Zhu, X.; Hu, H.; Lin, S.; and Dai, J. 2019. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9308–9316.
Zhu et al. (2020) Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; and Dai, J. 2020. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In International Conference on Learning Representations.
