Soft Masked Transformer for Point Cloud Processing with Skip Attention-Based Upsampling

Yong He, Hongshan Yu, Muhammad Ibrahim, Xiaoyan Liu, Tongjia Chen, Anwaar Ulhaq, Ajmal Mian

Yong He, Hongshan Yu, Xiaoyan Liu, and Tongjia Chen are with the National Engineering Laboratory for Robot Visual Perception and Control Technology, College of Electrical and Information Engineering, Hunan University, Lushan South Rd., Yuelu Dist., 410082, Changsha, China. This work was partially supported by the National Natural Science Foundation of China (Grants U2013203, 61973106).

Muhammad Ibrahim and Ajmal Mian are with the Department of Computer Science, The University of Western Australia, WA 6009, Australia. Ajmal Mian is the recipient of an Australian Research Council Future Fellowship Award (project number FT210100268) funded by the Australian Government.

Anwaar Ulhaq is with Central Queensland University, Sydney Campus, Australia.

Abstract

Point cloud processing methods leverage local and global point features to cater to downstream tasks, yet they often overlook the task-level context inherent in point clouds during the encoding stage. We argue that integrating task-level information into the encoding stage significantly enhances performance. To that end, we propose SMTransformer, which incorporates task-level information into a vector-based transformer by utilizing a soft mask generated from task-level queries and keys to learn the attention weights. Additionally, to facilitate effective communication between features from the encoding and decoding layers in high-level tasks such as segmentation, we introduce a skip-attention-based up-sampling block. This block dynamically fuses features from points at various resolutions across the encoding and decoding layers. To mitigate the increase in network parameters and training time resulting from the complexity of the aforementioned blocks, we propose a novel shared position encoding strategy. This strategy allows various transformer blocks to share the same position information over points at the same resolution, thereby reducing network parameters and training time without compromising accuracy. Experimental comparisons with existing methods on multiple datasets demonstrate the efficacy of SMTransformer and skip-attention-based up-sampling for point cloud processing tasks, including semantic segmentation and classification. In particular, we achieve state-of-the-art semantic segmentation results of 73.4% mIoU on S3DIS Area 5 and 62.4% mIoU on the SWAN dataset.

Index Terms:

Deep learning, 3D point clouds, Soft mask, Transformer, Attention-based up-sampling.

I Introduction

With the rapid development of 3D sensors such as LiDARs and depth cameras, the accessibility of 3D point cloud data has dramatically increased. Consequently, its importance is growing across various applications, such as autonomous driving [1], [2], robotics [3], and industrial automation [4]. Techniques for processing 3D point clouds have attracted the interest of many researchers. Effectively learning features from 3D point clouds is challenging due to their irregular nature, as opposed to the regular grid-like structure of images. Inspired by the success of grid convolutions, several methods have been proposed to transform irregular point clouds into regular representations, including projected images [5], [6] or voxels [7], [8]. However, the discretisation process inevitably sacrifices significant geometric information.

Early approaches learn features from raw point clouds by employing a shared multi-layer perceptron (MLP) on each point and use symmetric functions (e.g., max-pooling) to aggregate the most prominent features over the receptive field. However, this design ignores local structures crucial for shape representation. Inspired by 2D grid convolutions, point convolutions utilize correlation functions to quantify connections among points, enabling the model to leverage local features (e.g., edges, corners, surfaces) and contextual information (e.g., scene layout). Some point convolutions use weight functions to learn weights from various local point geometric connections, such as point coordinates, coordinate differences, distances, etc., [9], [10], [11], [12]. Others associate coefficients (derived from point coordinates) [13], [14], [15] with weight functions to adjust the learned weights.

In contrast to point convolution, the point transformer focuses on learning feature connections and attention maps between point features through a scalar or vector-based attention mechanism, either globally [16], [17], [18], [19], [20], [21] or over a local receptive field [22], [23]. Recently, with the help of position encoding, the aggregation ability of point transformers has seen significant improvement. While point transformers are powerful learners, they generally fall short of simultaneously exploiting local features and global context, failing to establish communication between the two. Current point convolution and transformer based approaches strive to enhance the encoder and decoder designs to facilitate proficient point connection learning for improved performance on downstream tasks such as segmentation, detection and classification. However, these methods often overlook the task-level contextual information of the entire point cloud, which results in sub-optimal performance on the target task, such as inaccurate predictions when segmenting small objects with intricate boundaries.

To address the aforementioned issues, we propose a Soft Masked Transformer (SMTransformer) that incorporates task-level contextual information into attention mechanisms and utilises this information to guide local feature learning softly. Specifically, the SMTransformer predicts task score keys and queries over the global receptive field and then obtains a soft mask from these keys and queries to re-weight the local attention map.

To establish communication between encoder and decoder layers, conventional point transformer networks typically incorporate skip connections and fusion within the up-sampling block of the decoder layer. However, this straightforward fusion approach may not dynamically blend features from the encoder and decoder layers. Furthermore, these operations are limited to the same resolution point cloud i.e. there is no communication between points at different resolutions. To address this challenge, we introduce a novel Skip-Attention-based up-sampling Block (SAUB). This block initially augments the resolution of low-resolution points at the position level, while preserving their original features. Subsequently, it learns attention maps to establish meaningful connections by skip attention between low-resolution and high-resolution point features.

Moreover, in conventional networks, different point transformer blocks often employ distinct position encoding information for the same resolution. This increases the network parameters and the training time. Intuitively, the same resolution points in the network have the same position information, a characteristic independent of the number and location of point transformers.

Motivation and Contributions: The motivation for this paper lies in addressing the limitations of existing point cloud processing methods. While existing methods effectively utilize local and global point features, they often neglect the inherent task-level context in point clouds during the encoding stage. This oversight can lead to sub-optimal performance in downstream tasks. To overcome this challenge, the paper proposes the SMTransformer, which integrates task-level information into the encoding stage by introducing a soft mask generated from task-level queries and keys to adjust the attention weights. Additionally, this paper introduces a skip-attention-based up-sampling block to enhance communication between features from encoding layers over different resolutions, particularly in high-level tasks like segmentation. To further improve efficiency, a novel shared position encoding strategy is proposed to reduce network parameters and training time without sacrificing accuracy. These innovations aim to enhance the performance, effectiveness and efficiency of point cloud processing methods in general. To summarize, our contributions are threefold:

•

We propose a Soft Masked Transformer block, which integrates task-level information into the attention mechanism, enhancing its effectiveness for downstream tasks such as semantic segmentation and classification.
•

We introduce a Skip Attention-based Up-sampling block, which dynamically combines features from different resolution points across the encoding layers, improving the model’s ability to capture contextual information.
•

We present a shared position encoding strategy, which reduces network parameters by 24.3% and training time by 33.3%. This strategy enhances the efficiency of the network without sacrificing performance.

We conduct extensive experiments on benchmark datasets to showcase the effectiveness of our proposed method and its robust generalization across various tasks, including indoor semantic segmentation, outdoor semantic segmentation, and object classification. Our method consistently achieves competitive results compared to existing point transformer-based approaches. In particular, it achieves state-of-the-art semantic segmentation performance, with an mIoU of 73.4% on S3DIS Area 5 and 62.4% on the SWAN dataset without any pre-training.

II Related Work

II-A Point-based Methods

Aiming to maximize the preservation of geometric information in point clouds, state-of-the-art methods prefer to directly process the raw point clouds. The development of the most important unit (i.e., local aggregation) in the point cloud processing network can be broadly divided into three categories explained below.

1) MLP-Based Approaches: PointNet [24] is considered a milestone in point cloud based deep learning. It employs shared MLPs to leverage point-wise features and utilizes a symmetric function such as max-pooling to aggregate these features into global representations. However, the network’s performance is limited as it does not account for the spatial relationships among local points, which are crucial for vision tasks. To address this issue, hierarchical architectures have been proposed to aggregate local features with MLPs [25, 26] such that the model can benefit from efficient sampling and grouping of the point set. Recent works [27, 28, 29] have focused on enhancing point-wise features by hand-crafting geometric connections such as curves, triangles, umbrella orientation, or affine transformations. Additionally, graphs have been introduced to points [30, 20, 31, 32, 33, 34, 35] and subsequent geometric representations, such as edges, contours, curvature, and connectivity. However, these strategies might lack generality due to the need to optimize hyperparameters for hand-crafted representations or graphs across datasets with varying densities or shape styles.

2) Convolution-Based Approaches: Inspired by the success of 2D convolution, various works have successively proposed novel point convolutions on points or point graphs. These methods dynamically learn convolutional weights through functions derived from local geometric connections. A popular category of methods in this domain focuses on designing weight functions. Among these, some approximate the weight function using MLPs [36, 37, 38, 13, 14, 33, 39], spline functions [11], a family of polynomial functions [12], or standard unparameterized Fourier functions [40]. Unlike these methods with dynamic convolution kernels, KCNet [9] and KPConv [10] predefine a set of fixed kernels (i.e., template points) in the local receptive field and then learn the weights on these kernels from the geometric connections between local points and template points using Gaussian and linear correlation functions, respectively. However, the number and position of kernel points need to be optimized for different datasets. Another approach [13, 14, 15] associates coefficients with kernels to further adjust the learned weights, where the coefficients are obtained through kernel density estimation, inverse density functions, and fuzzy functions of point coordinates.

3) Transformer-Based Approaches: Unlike convolution-based methods, which learn convolutional weights from low-level point coordinates, attention mechanisms learn attention weights from the connections between point features, thereby exploiting high-level contextual information. Motivated by the success of attention mechanisms in natural language processing and image processing tasks, early methods applied self-attention to global points through scalar dot-product [16, 17, 18, 19, 20, 21], but suffered from high computational costs. These early attention-based methods did not demonstrate superior performance due to the lack of employing position encoding.

The Point Transformer [23] introduces local vector attention to the local points. Additionally, it emphasizes the significance of point position encoding. Subsequent work, Point Transformer V2 [22], incorporates multi-grouping into vector attention inspired by the multi-head strategy. It also enhances position encoding by introducing an additional multiplier to the relation vector, which facilitates learning complex positional relations.

To exploit long-range contextual information, the Stratified Transformer [41] densely selects nearby points over a cubic window and sparsely selects distant points. This stratified strategy enlarges the effective receptive field without incurring too much computational overhead. However, by using a window-based approach, the Stratified Transformer focuses on an expanded local region rather than the global region. Moreover, hyper-parameters such as the window size and number of distant points must be optimized for different datasets with varying densities.

II-B Up-sampling in 3D Point Clouds

The hierarchical architecture of a network is instrumental in learning long-range contextual information through down-sampling and grouping operations. Up-sampling involves interpolating new points between known points and adjusting the features of these new points based on their mutual distance to propagate the learned context features to each point. Compared to the extensive research on down-sampling [14, 42, 21, 18, 43] and grouping operations [25, 26, 20, 44, 45, 46, 34, 33, 37], few works specifically emphasize the up-sampling operation. In the Point Transformer [23], the transition up module uses an interpolation operation to recover new point features from known point features through indexing and then integrates these features with those from the encoding layer via a skip connection. Similarly, in Point Transformer V2 [22], the fusion of encoding and decoding layer features is achieved through a skip connection. The distinguishing factor here is that the new point features are unpooled by grid unpooling instead of interpolation. While these methods are simple, they lack semantic awareness and ignore the contextual connection between the encoding and decoding layers.

Our survey highlights several gaps in current point cloud processing methods. Their fundamental units, such as point MLPs, convolutions, and transformers, ignore task-level information, which leads to sub-optimal performance in downstream tasks. Moreover, the up-sampling methods do not effectively communicate between the encoding layers over points at different resolutions to refine the context information for high-level tasks. Motivated by these gaps, we propose a soft-masked transformer and skip-attention-based up-sampling. Moreover, we propose shared position encoding to reduce the network parameters and training time.

III Method

We revisit the two classical local vector attention-based point transformers, namely Point Transformer and Point Transformer V2, in Section III-A. Then, we present the SMTransformer block in Section III-B, followed by our skip attention-based up-sampling block in Section III-C. We introduce the shared position encoding strategy in Section III-D, and finally, in Section III-E, we provide details about our network.

III-A Rethinking Vector Attention based Transformer

Denote a point cloud $p_{i}\in\mathbb{R}^{{N\times 3}}$ (where $p_{i}$ defines point positions) and its corresponding features $f_{i}\in\mathbb{R}^{{N\times C}}$ . $f_{i}$ is a feature vector that may contain attributes such as normal vectors and colour of the surface. $N$ and $C$ are the number of points and feature channels, respectively. We denote the K neighbors of $p_{i}$ as $p_{ij}\in\mathbb{R}^{{N\times K\times 3}}$ and their corresponding features as $f_{ij}\in\mathbb{R}^{{N\times K\times C}}$ . The position over the local receptive field can be expressed as $\Delta p_{ij}$ . Point Transformer[23] on point cloud $p_{i}$ can be expressed as,

\mathcal{G}_{i}=\sum_{j=1}^{K}\mathcal{A}\big((K_{ij}^{f}\ominus Q_{i}^{f})\oplus\delta_{b}\big)\odot\big(V_{ij}^{f}\oplus\delta_{b}\big),

(1)

where $Q_{i}^{f}\in\mathbb{R}^{{N\times C}}$ is the query matrix of $f_{i}$, and $K_{ij}^{f}$ and $V_{ij}^{f}\in\mathbb{R}^{{N\times K\times C}}$ are the key and value matrices of $f_{ij}$ . The subtraction operation $\ominus$ between $Q_{i}^{f}$ and $K_{ij}^{f}$ is performed via broadcasting to ensure dimension compatibility. $\delta_{b}=\delta_{b}(\Delta p_{ij})$ is the position encoding bias function. $\mathcal{A}(\cdot)$ denotes the local vector attention function, implemented by an MLP followed by softmax. $\sum$ stands for the symmetric operation (SOP) (e.g. summation) and $\odot$ is the element-wise multiplication operation. $\oplus$ is the element-wise addition operation and $\mathcal{G}_{i}$ is the output feature of the transformer.
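As a rough illustration of how the terms of Eq. (1) fit together, the following NumPy sketch evaluates the formula on dense arrays. It is not the reference implementation: the function names are hypothetical, single weight matrices stand in for the learned projections and for the MLP inside $\mathcal{A}(\cdot)$, the position encoding bias is assumed to be precomputed, and the softmax is assumed to run over the $K$ neighbours.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def vector_attention(f, knn_idx, delta_b, w_q, w_k, w_v, w_a):
    """Eq. (1) for N points, K neighbours and C channels.

    f:       (N, C) point features f_i
    knn_idx: (N, K) indices of each point's neighbours (grouping)
    delta_b: (N, K, C) position encoding bias, assumed precomputed
    w_q, w_k, w_v, w_a: (C, C) matrices standing in for learned layers
    """
    q = f @ w_q                                 # Q_i^f,  (N, C)
    k = (f @ w_k)[knn_idx]                      # K_ij^f, (N, K, C)
    v = (f @ w_v)[knn_idx]                      # V_ij^f, (N, K, C)
    relation = (k - q[:, None, :]) + delta_b    # broadcast subtraction, then bias
    # Assumption: a single linear map replaces the MLP; softmax over neighbours.
    weights = softmax(relation @ w_a, axis=1)   # (N, K, C), one weight per channel
    return (weights * (v + delta_b)).sum(axis=1)  # G_i, (N, C)
```

Because the weights are vectors of length $C$ rather than scalars, each feature channel of a neighbour can be attended to independently, which is what distinguishes vector attention from scalar dot-product attention.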

To exploit the more complex geometric relationships between points, Point Transformer V2[22] strengthens the position encoding with an additional multiplier to the relation feature vector (i.e. $K_{ij}^{f}-Q_{i}^{f}$ ), which can be formulated as,

\mathcal{G}_{i}=\sum_{j=1}^{K}\mathcal{A}\big(\delta_{m}(K_{ij}^{f}\ominus Q_{i}^{f})\oplus\delta_{b}\big)\odot\big(V_{ij}^{f}\oplus\delta_{b}\big),

(2)

where $\delta_{m}=\delta_{m}(\Delta p_{ij})$ is the position encoding multiplier function. The two classical point transformers above are illustrated in Fig. 1(a) and (b). The attention function $\mathcal{A}(\cdot)$ learns robust weights from the rich relationships, including the low-level geometric relationship (i.e. relation position) and the high-level contextual relationship (i.e. relation feature). Although the task influences the entire network, the fundamental transformer unit overlooks task-level information, resulting in the loss of crucial task-related details (e.g. semantic information) during the encoding and decoding stages.

III-B Soft Masked Transformer Block (SMTB)

We propose a novel soft-masked transformer for point cloud processing. The soft-masked transformer can be expressed as,

\mathcal{G}_{i}={\rm SMTransformer}(f_{i})=\sum_{j=1}^{K}\mathcal{S}(\cdot)\odot\mathcal{A}\big((K_{ij}^{f}\ominus Q_{i}^{f})\oplus\delta(\cdot)\big)\odot\big(V_{ij}^{f}\oplus\delta(\cdot)\big),

(3)

where $\mathcal{S}(\cdot)$ is the soft mask function, which re-weights the attention weights. $\delta(\cdot)$ is the enhanced position encoding function, learning more complex relationships between points. We introduce them in detail in the following text.

III-B1 Soft Mask

The soft mask can be interpreted as a learnable coefficient of the attention function. It models the semantic context, which serves as a prior for calculating a task score difference. This difference is then used to softly mask the attention weights at the point level rather than the channel level. Inspired by vector attention, SMTransformer derives two entities from the features $f_{i}$ : the score query $Q_{i}^{s}$ and the score key $K_{ij}^{s}$ .

Q_{i}^{s}=W_{q}^{s}(f_{i}),\quad K_{ij}^{s}=G[W_{k}^{s}(f_{i})],

(4)

where $W_{q}^{s}$ and $W_{k}^{s}$ ( $\mathbb{R}^{N\times C}\rightarrow\mathbb{R}^{N\times T}$ ) are prediction functions implemented as linear layers followed by softmax, and $T$ is the number of task classes. $G[\cdot]$ is the grouping operation that gathers the task scores of the neighbour points of $p_{i}$ . The soft mask is generated from the score difference $(K_{ij}^{s}-Q_{i}^{s})\in\mathbb{R}^{{N\times K\times T}}$ as,

\mathcal{S}(\cdot)=\mathcal{S}(Q_{i}^{s},K_{ij}^{s})=\|{\rm Max}({\rm Norm}(K_{ij}^{s}\ominus Q_{i}^{s}))\|_{2},

(5)

where the ${\rm Max}(\cdot)$ is the maximum, ${\rm Norm}(\cdot)$ is Min-Max Normalization, and $||\cdot||_{2}$ is the Euclidean Norm. The soft mask consists of real numbers ranging between 0 and 1, typically representing the probability values of identical labels among different points. This allows the model to assign higher importance to neighbouring points with distinct predicted labels. Taking segmentation as an example, the soft mask enhances the robustness of attention weights around class boundaries. Unlike traditional hard masks (i.e., binary masks), soft masks are more flexible and efficient as they do not require explicit rules or conditions for determination.
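The following NumPy sketch shows one possible reading of Eqs. (4) and (5). Eq. (5) does not state the axes over which the normalization, the maximum, and the norm are taken, so the choices here are assumptions made only for illustration: Min-Max normalization runs over each point's neighbourhood (the $K$ and $T$ axes jointly), the maximum runs over the $T$ class scores, and the Euclidean norm of the resulting per-neighbour scalar reduces to its absolute value. The function and argument names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def soft_mask(f, knn_idx, w_q, w_k, eps=1e-8):
    """Point-level soft mask of shape (N, K, 1) with values in [0, 1].

    f:        (N, C) point features
    knn_idx:  (N, K) neighbour indices (the grouping operation G)
    w_q, w_k: (C, T) matrices standing in for the linear prediction layers
    """
    q_s = softmax(f @ w_q)                 # Q_i^s,  (N, T) task scores
    k_s = softmax(f @ w_k)[knn_idx]        # K_ij^s, (N, K, T) neighbour scores
    diff = k_s - q_s[:, None, :]           # score difference, (N, K, T)
    # Assumption: Min-Max normalisation over each neighbourhood (K and T axes).
    lo = diff.min(axis=(1, 2), keepdims=True)
    hi = diff.max(axis=(1, 2), keepdims=True)
    norm = (diff - lo) / (hi - lo + eps)
    peak = norm.max(axis=-1)               # assumption: maximum over the T classes
    # The Euclidean norm of a scalar is its absolute value.
    return np.abs(peak)[..., None]         # (N, K, 1)
```

Since the mask has a trailing dimension of 1, multiplying it with attention weights of shape $N\times K\times C$ broadcasts one coefficient per neighbour over all channels, which matches the point-level (rather than channel-level) masking described above.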

III-B2 Enhanced Position Encoding

Most existing position encoding methods in local point transformers focus solely on local positions. While this approach greatly assists the transformer in understanding local shapes, it struggles to capture long-range shapes beyond the limited local receptive field. Therefore, global position encoding is equally important as local position encoding. Similarly, SMTransformer encodes the global point position into two entities: position query $Q_{i}^{p}$ and position key $K_{ij}^{p}$ .

Q_{i}^{p}=W_{q}^{p}(p_{i}),\quad K_{ij}^{p}=G[Q_{i}^{p}],

(6)

where $W_{q}^{p}$ ( $\mathbb{R}^{N\times 3}\rightarrow\mathbb{R}^{N\times C}$ ) is the global position encoding function, implemented as an MLP. The local relative position information can be expressed as $(K_{ij}^{p}-Q_{i}^{p})\oslash\Delta p_{ij}$ , where $\oslash$ denotes the concatenation operation. The enhanced position encoding can be expressed as,

\delta(\cdot)=\delta\big((K_{ij}^{p}-Q_{i}^{p})\oslash\Delta p_{ij}\big),

(7)

where the $\delta$ is the local position encoding function implemented by MLPs. By constructing the Query and Key matrices of global point position, SMTransformer can bypass the local receptive field limitation to learn the global geometric information.
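Eqs. (6) and (7) can be sketched in NumPy as follows. This is an illustrative approximation only: the two-layer bias-free MLPs, the function names, and the parameter layout are assumptions, and $\Delta p_{ij}$ is taken to be the neighbour coordinates minus the centre coordinates.

```python
import numpy as np


def mlp(x, w1, w2):
    """Two-layer perceptron with ReLU, a stand-in for the learned MLPs."""
    return np.maximum(x @ w1, 0.0) @ w2


def enhanced_position_encoding(p, knn_idx, global_params, local_params):
    """Enhanced position encoding delta(.) of shape (N, K, C).

    p:             (N, 3) point coordinates
    knn_idx:       (N, K) neighbour indices (the grouping operation G)
    global_params: (w1, w2) with shapes (3, H) and (H, C)
    local_params:  (w1, w2) with shapes (C + 3, H) and (H, C)
    """
    q_p = mlp(p, *global_params)                  # Q_i^p,  (N, C), Eq. (6)
    k_p = q_p[knn_idx]                            # K_ij^p, (N, K, C)
    delta_p = p[knn_idx] - p[:, None, :]          # assumed form of Delta p_ij, (N, K, 3)
    # Concatenate the global relation with the local relative position.
    relation = np.concatenate([k_p - q_p[:, None, :], delta_p], axis=-1)
    return mlp(relation, *local_params)           # Eq. (7), (N, K, C)
```

The position query is computed from absolute coordinates of every point, so the relation $K_{ij}^{p}-Q_{i}^{p}$ carries information about where a neighbourhood sits in the whole cloud, while $\Delta p_{ij}$ keeps the fine-grained local offset.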

Figure 1: Comparison of the attention and position encoding in Transformers. (a) The vector attention with position encoding bias in Point Transformer, see Eq.(1). (b) The vector attention with position encoding multiplier in Point Transformer V2, see Eq.(2). (c) The vector attention with soft mask and enhanced position encoding bias in our proposed SMTransformer, see Eq.(3).

To compare the differences between SMTransformer and the classical vector attention-based point transformer, we illustrate their architectures in Fig. 1. There are two key distinctions:

i) Both Point Transformer and Point Transformer V2 emphasize learning contextual relationships. In contrast, SMTransformer not only grasps contextual relationships through vector attention but also integrates a soft mask, derived from the task at hand, as a coefficient of the attention function.

ii) Point Transformer and Point Transformer V2 both effectively capture local fine-grained details through local position encoding. In contrast, SMTransformer introduces an innovative enhanced position encoding that represents positions across the global point cloud, enabling modelling of the global shape without being confined to local receptive fields. Additionally, it encodes positions across local points, allowing for learning fine-grained details.

Residual connections are instrumental in training deep neural networks, facilitating gradient flow during backpropagation. Therefore, we combine the Soft Masked Transformer with residual connections to construct a transformer block. As illustrated in Fig. 4.(b), the Soft Masked Transformer Block (SMTB) can be expressed as,

\begin{aligned}f_{i}&={\rm Linear}(f_{in1}),\\ \mathcal{G}_{i}&={\rm SMTransformer}(f_{i}),\\ f_{out1}&={\rm Linear}(\mathcal{G}_{i}\oplus f_{i}),\end{aligned}

(8)

where $f_{in1}$ is the input feature and $f_{out1}$ is the output feature of SMTB. The projection layer ${\rm Linear}(\cdot)$ consists of one linear layer, one batch normalization layer, and one ReLU layer.
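The residual structure of Eq. (8) can be sketched as below. The attention step is passed in as a callable so that the sketch stays independent of any particular attention implementation. The batch normalization here omits the learned scale and shift and uses the statistics of the current batch, which is a simplification, and all names are hypothetical.

```python
import numpy as np


def linear_bn_relu(x, w, eps=1e-5):
    """Projection layer: linear map, batch normalisation, then ReLU.

    Simplification: no learned scale/shift and no running statistics.
    """
    y = x @ w
    y = (y - y.mean(axis=0)) / np.sqrt(y.var(axis=0) + eps)
    return np.maximum(y, 0.0)


def smt_block(f_in, w_in, w_out, attention):
    """Eq. (8): project, attend, add the residual, project again.

    f_in:      (N, C_in) input features f_in1
    w_in:      (C_in, C) weights of the first projection
    w_out:     (C, C) weights of the second projection
    attention: callable mapping (N, C) features to (N, C) features,
               standing in for SMTransformer(.)
    """
    f = linear_bn_relu(f_in, w_in)          # f_i
    g = attention(f)                        # G_i
    return linear_bn_relu(g + f, w_out)     # f_out1, residual added before projection
```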

III-C Skip Attention-based Up-sampling Block (SAUB)

To facilitate deep communication between features over various resolution points, we introduce a skip attention-based up-sampling block that combines conventional unpooling with a learnable unit to learn and refine contextual information between features from the encoding and decoding layers across different resolutions.

As illustrated in Fig. 4.(c), the block takes skip feature I $f_{h}\in\mathbb{R}^{{M\times C_{h}}}$ and skip feature II $f_{l}\in\mathbb{R}^{{m\times C_{l}}}$ from two adjacent encoding layers over the high- and low-resolution points, respectively. To build the communication between the point features at different resolutions, we first balance their feature dimension and point resolution,

f_{mid}={\rm Gridup}\big({\rm Linear}(f_{in2}\oslash f_{l})\big),

(9)

where the ${\rm Linear()}$ serves as the projection layer; its primary role involves integrating both the input features $f_{in2}$ and skip features II $f_{l}$ , thereby augmenting the dimension of low-resolution point features to match that of high-resolution point features ( $\mathbb{R}^{{m\times C_{l}}}\rightarrow\mathbb{R}^{{m\times C_{h}}}$ ), where $m$ is the number of low-resolution points. The ${\rm Gridup}$ is the common practice of unpooling, implemented by grid-based unpooling, to augment the resolution ( $\mathbb{R}^{{m\times C_{h}}}\rightarrow\mathbb{R}^{{M\times C_{h}}}$ ), where $M$ is the number of high-resolution points. By the above two steps, the low-resolution features are expanded in terms of feature dimension and point resolution.
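The two steps of Eq. (9) can be illustrated with a short NumPy sketch. It assumes that grid-based unpooling can be expressed as an index lookup: `parent_idx` (a hypothetical name) stores, for each of the $M$ high-resolution points, the index of the low-resolution point that its grid cell was pooled into. The projection is reduced to a linear map with ReLU, and the input feature width is an arbitrary illustrative value.

```python
import numpy as np


def expand_low_res(f_in2, f_l, w_proj, parent_idx):
    """Eq. (9): fuse, project to C_h channels, then unpool to M points.

    f_in2:      (m, C_in) input features of the up-sampling block
    f_l:        (m, C_l) skip feature II at low resolution
    w_proj:     (C_in + C_l, C_h) projection weights (batch norm omitted)
    parent_idx: (M,) index of the low-resolution parent of each
                high-resolution point, assumed to be stored during pooling
    """
    fused = np.concatenate([f_in2, f_l], axis=-1)   # concatenation, (m, C_in + C_l)
    projected = np.maximum(fused @ w_proj, 0.0)     # (m, C_h)
    return projected[parent_idx]                    # f_mid, (M, C_h)
```

Under this reading, all high-resolution points that fall in the same grid cell receive identical expanded features, so unpooling restores the point resolution but adds no new information by itself.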

To build deep communication between the expanded low-resolution features and skip features I, we propose the skip attention,

Q_{i}=w_{q}(f_{h}),\quad K_{ij}=G[w_{k}(f_{mid})],\quad V_{ij}=G[w_{v}(f_{mid})],

(10)

\mathcal{G}_{i}^{sa}=\sum_{j=1}^{K}\mathcal{A}\big((K_{ij}\ominus Q_{i})\oplus\delta(\cdot)\big)\odot\big(V_{ij}\oplus\delta(\cdot)\big),

(11)

where $\delta(\cdot)$ is the proposed enhanced position encoding and $\mathcal{G}_{i}^{sa}$ is the output of skip attention. The skip attention uses $Q_{i}$ , derived from the high-resolution skip features $f_{h}$ , as the query and $K_{ij}$ , derived from the expanded low-resolution features $f_{mid}$ , as the key to learn the contextual connection (i.e. attention map) between points at different resolutions. Since $f_{h}$ and $f_{mid}$ both lie in $\mathbb{R}^{{M\times C_{h}}}$ , the query, key, and value are dimensionally compatible.

To prevent the loss of important information, we utilize a residual connection that adds the output of skip attention to its inputs, namely the skip features and the expanded features,

f_{out2}={\rm Linear}(\mathcal{G}_{i}^{sa}\oplus f_{h}\oplus f_{mid}).

(12)

To compare SAUB with the classical up-sampling block (i.e. Transition up [22]), we illustrate their diagrams in Fig. 2. The key differences lie in two aspects:

i) The classical upsampling block only learns the connection between features from the encoding and decoding layers over the same resolution points. In contrast, our proposed SAUB can learn the connection between features from the encoding and decoding layers of different resolution points.

ii) The classical upsampling block uses a simple skip connection to learn the contextual information between the features from the encoding and decoding layers. However, we propose skip attention to refine the contextual information between the features.

III-D Shared Position Encoding

In conventional point transformer networks, the transformer blocks within the same encoding or decoding layer typically compute separate position encodings, even though they operate on the same set of points. We refer to this practice as unshared position encoding. Intuitively, points at the same position in the point cloud should have the same position information. Building on this intuition, we propose the shared position encoding strategy. Under this strategy, transformer blocks within the same encoding or decoding layer share identical position encoding information over the points at that resolution. This approach enhances network robustness and efficiency, particularly in large-scale scene processing. The two strategies are compared in Fig. 3. The robustness and efficiency experiments are provided in Section IV-D.
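To make the source of the savings concrete, the sketch below counts position-encoding parameters under both strategies. The numbers are illustrative only: it assumes one two-layer MLP ($3\rightarrow C\rightarrow C$) per position encoder, which is a hypothetical choice, and it uses the block counts and feature dimensions given in Section III-E. It does not attempt to reproduce the network-wide reductions reported in this paper.

```python
def mlp_params(c_in, c_hidden, c_out):
    """Weights plus biases of a two-layer MLP."""
    return (c_in * c_hidden + c_hidden) + (c_hidden * c_out + c_out)


def position_encoding_params(depths, channels, shared):
    """Total position-encoding parameters over all layers.

    Assumes, for illustration only, one 3 -> C -> C MLP per encoder:
    one encoder per block when unshared, one per layer when shared.
    """
    total = 0
    for num_blocks, c in zip(depths, channels):
        encoders = 1 if shared else num_blocks
        total += encoders * mlp_params(3, c, c)
    return total


depths = (1, 2, 2, 6, 2)            # transformer blocks per encoding layer
channels = (32, 64, 128, 256, 512)  # feature dimension per layer
unshared = position_encoding_params(depths, channels, shared=False)
shared = position_encoding_params(depths, channels, shared=True)
print(f"unshared: {unshared}, shared: {shared}")
```

Beyond the parameter count, sharing also means the position encoding is evaluated once per layer rather than once per block, which is where a reduction in training time would come from.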

III-E Network Architecture

We employ a U-Net-like architecture comprising five encoding and decoding layers with skip connections for the semantic segmentation task. The first encoding and decoding layers consist of one MLP and SMTransformer block. The subsequent encoding layers incorporate one Grid Pooling layer [22] followed by several SMTransformer blocks. The number of SMTransformer blocks in the five encoding layers is [1, 2, 2, 6, 2]. The subsequent decoding layers consist of one skip attention-based up-sampling block and one SMTransformer block. We set the feature dimensions $C$ as [32, 64, 128, 256, 512] for the 5 encoding and decoding layers. At the network’s end, we append an MLP to predict the final point-wise labels. The network architecture for semantic segmentation is illustrated in Fig. 4.(a).
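The layer layout described above can be written down as a small configuration object. This is only a convenience sketch for reading the architecture: the class and function names are hypothetical, and only the block counts, feature dimensions, and layer entry modules stated in the text are encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentationConfig:
    encoder_depths: tuple = (1, 2, 2, 6, 2)      # SMTransformer blocks per encoding layer
    channels: tuple = (32, 64, 128, 256, 512)    # feature dimension C per layer
    decoder_depth: int = 1                       # SMTransformer blocks per decoding layer


def encoder_plan(cfg):
    """List (entry module, number of SMTransformer blocks, channels) per layer."""
    plan = []
    for i, (depth, c) in enumerate(zip(cfg.encoder_depths, cfg.channels)):
        entry = "MLP" if i == 0 else "GridPool"   # first layer starts with an MLP
        plan.append((entry, depth, c))
    return plan


def decoder_plan(cfg):
    """List (entry module, number of SMTransformer blocks) per decoding layer."""
    return [("MLP" if i == 0 else "SAUB", cfg.decoder_depth)
            for i in range(len(cfg.channels))]
```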

For the classification task, we utilize the basic PointNeXt [47] as the backbone and replace one of its MLP blocks with an SMTransformer block to form the new network architecture for classification. Further details regarding the configurations of the segmentation and classification networks are in Section IV.

IV Experimental Results

We evaluate our network on three tasks, namely semantic segmentation of indoor scenes, semantic segmentation of outdoor scenes, and shape classification. We also perform detailed ablation studies to demonstrate the effectiveness and robustness of the proposed Soft Masked Transformer Block, Skip Attention-based Up-sampling Block, and Shared Position Encoding strategy.

TABLE I: Semantic segmentation results on the S3DIS dataset Area-5. We report the mean class-wise Intersection over Union (mIoU), mean class-wise accuracy (mAcc), and overall accuracy (OA). The best result is highlighted in bold, and the second best is underlined.

Year	Methods	mIoU	mAcc	OA	ceil.	floor	wall	beam	column	window	door	chair	table	bookcase	sofa	board	clut.
2017 CVPR	PointNet[24]	41.09	48.98	–	88.80	97.33	69.80	0.05	3.92	46.26	10.76	52.61	58.93	40.28	5.85	26.38	33.22
2018 NIPS	PointCNN[36]	57.26	63.86	85.9	92.3	98.2	79.4	0.0	17.6	22.8	62.1	74.4	80.6	31.7	66.7	62.1	56.7
2019 ICCV	KPConv[10]	67.1	72.8	–	92.8	97.3	82.4	0.0	23.9	58.0	69.0	91.0	81.5	75.3	75.4	66.7	58.9
2020 PAMI	SPH3D-GCN[48]	59.5	65.9	–	93.3	97.1	81.1	0.0	33.2	45.8	43.8	79.7	86.9	33.2	71.5	54.1	53.7
2020 CVPR	PointASNL[21]	62.6	68.5	87.7	94.3	98.4	79.1	0.0	26.7	55.2	66.2	86.8	83.3	68.3	47.6	56.4	52.1
2020 CVPR	SegGCN[15]	63.6	70.4	–	93.7	98.6	80.6	0.0	28.5	42.6	74.5	80.9	88.7	69.0	71.3	44.4	54.3
2021 CVPR	PAConv[49]	66.6	73.0	–	94.5	98.6	82.4	0.0	26.4	58.0	60.0	89.7	80.4	74.3	69.8	73.5	57.7
2021 CVPR	BAAF-Net[50]	65.4	73.1	88.9	92.9	97.9	82.3	0.0	23.1	65.5	64.9	87.5	78.5	70.7	61.4	68.7	57.2
2021 ICCV	Point Transformer[23]	70.4	76.5	90.8	94.0	98.5	86.3	0.0	38.0	63.4	74.3	82.4	89.1	80.2	74.3	76.0	59.3
2022 CVPR	CBL[51]	69.4	75.2	90.6	93.9	98.4	84.2	0.0	37.0	57.7	71.9	81.8	91.7	75.6	77.8	69.1	62.9
2022 CVPR	RepSurf-U[27]	68.9	76.0	90.2	–	–	–	–	–	–	–	–	–	–	–	–	–
2022 CVPR	Stratified Transformer[41]	72.0	78.1	91.5	–	–	–	–	–	–	–	–	–	–	–	–	–
2022 ECCV	PointMixer[52]	71.4	77.4	–	94.2	98.2	86.0	0.0	43.8	62.1	78.5	82.2	90.8	79.8	73.9	78.5	59.4
2022 NIPS	PointNeXt[47]	71.1	77.2	91.0	94.2	98.5	84.4	0.0	37.7	59.3	74.0	91.6	83.1	77.2	77.4	78.8	60.6
2022 NIPS	PointTransformerV2[22]	71.6	77.9	91.1	–	–	–	–	–	–	–	–	–	–	–	–	–
2023 TCSVT	LCPFormer[53]	70.2	76.8	90.8	–	–	–	–	–	–	–	–	–	–	–	–	–
2023 TCSVT	SAKS[54]	68.8	74.0	90.8	95.2	98.6	84.1	0.0	27.5	58.5	75.1	80.4	90.8	69.0	77.0	73.5	62.1
2023 TNNLS	PicassoNet++[55]	71.0	77.2	91.3	94.4	98.4	87.5	0.0	46.9	63.7	75.5	81.4	90.3	71.3	76.2	76.7	61.1
2023 CVPR	Point Vector[29]	72.3	78.1	91.0	95.1	98.6	85.1	0.0	41.4	60.8	76.7	92.1	84.4	77.2	82.0	85.1	61.4
	SMTransformer(ours)	73.4	78.9	91.8	95.2	98.7	87.7	0.0	45.8	64.8	75.2	85.2	92.7	86.7	76.8	83.5	62.4

TABLE II: Semantic segmentation results on S3DIS with 6-fold cross validation.

Year	Methods	mIoU	mAcc	OA	ceil.	floor	wall	beam	column	window	door	chair	table	bookcase	sofa	board	clut.
2017 CVPR	PointNet[24]	47.6	66.2	78.6	88.0	88.7	69.3	42.4	23.1	47.5	51.6	42.0	54.1	38.2	9.6	29.4	35.2
2018 NIPS	PointCNN[36]	65.4	75.6	88.1	94.8	97.3	75.8	63.3	51.7	58.4	57.2	69.1	71.6	61.2	39.1	52.2	58.6
2019 CVPR	PointWeb[30]	66.7	76.2	87.3	93.5	94.2	80.8	52.4	41.3	64.9	68.1	67.1	71.4	62.7	50.3	62.2	58.5
2019 ICCV	KPConv[10]	70.6	79.1	–	93.6	92.4	83.1	63.9	54.3	66.1	76.6	64.0	57.8	74.9	69.3	61.3	60.3
2020 PAMI	SPH3D-GCN[48]	68.9	77.9	88.6	93.3	96.2	81.9	58.6	55.9	55.9	71.7	82.4	72.1	64.5	48.5	54.8	60.4
2020 CVPR	PointASNL[21]	68.7	79.0	88.8	95.3	97.9	81.9	47.0	48.0	67.3	70.5	77.8	71.3	60.4	50.7	63.0	62.8
2020 CVPR	RandLA-Net[42]	70.0	82.0	88.0	93.1	96.1	80.6	62.4	48.0	64.4	69.4	76.4	69.4	64.2	60.0	65.9	60.1
2021 CVPR	PAConv[49]	69.3	78.7	–	94.3	93.5	82.8	56.9	45.7	65.2	74.9	59.7	74.6	67.4	61.8	65.8	58.4
2021 CVPR	SCF-Net[56]	71.6	82.7	88.4	93.3	96.4	80.9	64.9	47.4	64.5	70.1	81.6	71.4	64.4	67.2	67.5	60.9
2021 CVPR	BAAF-Net[50]	72.2	83.1	88.9	93.3	96.8	81.6	61.9	49.5	65.4	73.3	83.7	72.0	64.3	67.5	67.0	62.4
2021 ICCV	Point Transformer [23]	73.5	81.9	90.2	94.3	97.5	84.7	55.6	58.1	66.1	78.2	74.1	77.6	71.2	67.3	65.7	64.8
2022 NIPS	PointNeXt[47]	74.9	83.0	90.3	–	–	–	–	–	–	–	–	–	–	–	–	–
2022 CVPR	RepSurf-U[27]	74.3	82.6	90.8	–	–	–	–	–	–	–	–	–	–	–	–	–
2022 CVPR	CBL[51]	73.1	79.4	89.6	94.1	94.2	85.5	50.4	58.8	70.3	78.3	75.0	75.7	74.0	71.8	60.0	62.4
2023 CVPR	Point Vector[29]	78.4	86.1	91.9	–	–	–	–	–	–	–	–	–	–	–	–	–
	SMTransformer(ours)	79.0	86.9	91.9	97.4	98.3	89.4	68.0	66.1	70.4	78.4	82.6	84.0	78.5	72.2	73.2	68.5

TABLE III: Semantic segmentation results (mIoU) on ScanNetV2 validation and test set.

Year	Methods	Input	Val(%)	Test(%)
2018 NIPS	PointNet++[25]	point	55.7	53.5
2018 CVPR	SparseConvNet[57]	voxel	72.5	69.3
2019 CVPR	PointConv[13]	point	66.6	61.0
2020 CVPR	PointASNL[21]	point	63.5	66.6
2019 ICCV	MVPNet[58]	point	66.4	–
2019 ICCV	KPConv[10]	point	69.2	68.6
2019 3DV	JointPointBased[59]	point	69.2	63.4
2019 CVPR	MinkowskiNet[60]	voxel	72.2	73.6
2022 CVPR	RepSurf-U[27]	point	70.0	–
2022 CVPR	Stratified Transformer[41]	point	74.3	73.7
2021 CVPR	PointTransformer[23]	point	70.6	–
2022 CVPR	FastPointTransformer[61]	voxel	72.0	–
2022 NIPS	PointTransformerV2[22]	point	75.4	75.2
2023 TNNLS	PicassoNet++[55]	mesh	–	69.2
	SMTransformer (ours)	point	75.9	75.7

IV-A Indoor Semantic Segmentation

Datasets: We evaluate our network on two large-scale indoor scene datasets, namely S3DIS [62] and ScanNetV2 [63]. The S3DIS dataset consists of RGB-D point clouds annotated point-wise with 13 classes. It encompasses 271 rooms from 6 large-scale indoor scenes, totalling 6020 square meters. We utilize 6-dimensional point features, including 3-dimensional normalized colour and 3-dimensional normalized location. For evaluation, we conduct a 6-fold cross-validation on S3DIS and focus more extensively on comparisons using Area 5 as the test set, which is distinct from the other areas and not within the same building.

The ScanNetV2 dataset comprises coloured point clouds of indoor scenes with point-wise semantic labels for 20 object categories. It is divided into 1201 scenes for training and 312 for validation. Our approach utilizes 9-dimensional point features corresponding to 3-dimensional normalized colour, 3-dimensional normalized location, and 3-dimensional normals.

Network Configurations: For semantic segmentation on S3DIS, we set the voxel size to 4cm and the maximum number of voxels to 60,000. We adopt the SGD optimizer with a weight decay of 0.0001. The base learning rate is set to 0.4 and is scheduled by MultiStepLR at the 40th and 80th epochs. We train and test the model with batch sizes of 16 and 8, respectively, on 4 GPUs. We adopt random scaling, random flip, chromatic contrast, chromatic translation, chromatic jitter and hue saturation translation to augment the training data. We set the grid size in grid pooling to [0.08, 0.1, 0.2, 0.4]cm and the number of neighbour points in SMTransformer to 16.
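As a rough illustration of the step schedule used on S3DIS, the sketch below computes the learning rate per epoch for milestones at epochs 40 and 80. The decay factor `gamma=0.1` is an assumption (the text does not state it), and the function name `multistep_lr` is hypothetical.

```python
def multistep_lr(base_lr, epoch, milestones=(40, 80), gamma=0.1):
    """Learning rate at a given 0-based epoch under a step schedule.

    The rate is multiplied by `gamma` once for every milestone already reached.
    `gamma` is an assumed value; the paper only gives the milestones.
    """
    reached = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** reached


# Base learning rate of 0.4, as stated for S3DIS.
lr_start = multistep_lr(0.4, 0)
lr_after_first = multistep_lr(0.4, 40)
lr_after_second = multistep_lr(0.4, 80)
print(lr_start, lr_after_first, lr_after_second)
```

With the assumed factor, the rate drops by one order of magnitude at each milestone.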

On ScanNet, we set the voxel size to 2cm and the maximum number of voxels to 100,000. We use the Adam optimizer with a weight decay of 0.02. The base learning rate is set to 0.02 and is scheduled by MultiStepLR every 40 epochs. We train and test the model with a batch size of 24 on 4 GPUs. We adopt random rotation, random scaling, random flip, elastic distortion, chromatic contrast, chromatic translation, chromatic jitter and hue saturation translation to augment the training data. The grid size is set to [0.04, 0.12, 0.36, 1.08]cm, and the number of neighbour points is set to 16. At test time, the network uses test-time augmentation, following Point Transformer V2[22] and Stratified Transformer[41].
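The voxel size and voxel cap mentioned in the configurations can be illustrated with a minimal voxelization sketch. It assumes that one point is kept per occupied voxel (here the first one encountered) and that the cap is enforced by random subsampling; both choices, and the name `voxel_downsample`, are assumptions rather than the authors' preprocessing.

```python
import math
import random


def voxel_downsample(points, voxel_size, max_voxels=None, seed=0):
    """Keep one point per occupied voxel, then cap the number of voxels.

    points: iterable of (x, y, z) tuples in metres.
    The first point falling into a voxel represents it (an assumed rule).
    """
    voxels = {}
    for p in points:
        key = tuple(math.floor(c / voxel_size) for c in p[:3])
        voxels.setdefault(key, p)
    kept = list(voxels.values())
    if max_voxels is not None and len(kept) > max_voxels:
        # Random cropping is an assumption; spatial cropping is also common.
        kept = random.Random(seed).sample(kept, max_voxels)
    return kept


# Three points with a 4 cm voxel: the first two share a voxel.
cloud = [(0.01, 0.01, 0.01), (0.02, 0.03, 0.01), (0.05, 0.0, 0.0)]
downsampled = voxel_downsample(cloud, voxel_size=0.04, max_voxels=60000)
print(len(downsampled))
```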

Results: We compare our method with recent state-of-the-art methods on the S3DIS dataset using three metrics: mean class-wise intersection over union (mIoU), mean class-wise accuracy (mAcc) and overall accuracy (OA). Results are reported in Table I. Our network achieves the best performance on all three metrics, i.e. 73.4% mIoU, 78.9% mAcc and 91.8% OA. It ranks in the top two on 9 out of 13 classes, namely ceiling, floor, wall, column, window, table, bookcase, board and clutter. Notably, its IoU on the bookcase class exceeds that of the second-best method by 6.5%. Compared to the previous state-of-the-art point transformers (e.g. Point Transformer V2 and Stratified Transformer), our network achieves 1.8% and 1.4% higher mIoU, respectively. Compared to the MLP-based Point Vector, our method achieves 1.1% higher mIoU. Compared to LCPFormer and SAKS, our method achieves 3.2% and 4.6% higher mIoU, respectively, and also leads on mAcc and OA. Fig. 5 shows visualizations of our results on S3DIS Area 5 in comparison to Point Transformer V2. Our method produces more accurate predictions near object boundaries.
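For reference, the three metrics can be computed from a class confusion matrix as in the sketch below. The convention of skipping classes that never occur is an assumption, and the function name `segmentation_metrics` is hypothetical.

```python
def segmentation_metrics(conf):
    """Return (mIoU, mAcc, OA) from a square confusion matrix.

    conf[t][p] is the number of points with ground-truth class t
    that were predicted as class p.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    ious, accs = [], []
    for c in range(n):
        tp = conf[c][c]
        gt = sum(conf[c])                         # points whose true class is c
        pred = sum(conf[r][c] for r in range(n))  # points predicted as c
        union = gt + pred - tp
        if union > 0:  # skip classes that never appear (assumed convention)
            ious.append(tp / union)
        if gt > 0:
            accs.append(tp / gt)
    miou = sum(ious) / len(ious)
    macc = sum(accs) / len(accs)
    oa = sum(conf[c][c] for c in range(n)) / total
    return miou, macc, oa


# Toy two-class example.
example_conf = [[8, 2],
                [1, 9]]
miou, macc, oa = segmentation_metrics(example_conf)
print(round(miou, 4), round(macc, 4), round(oa, 4))
```

mIoU penalizes both missed and spurious predictions per class, which is why it is usually lower than mAcc and OA in the tables.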

Table II shows results with the 6-fold validation setting on the S3DIS dataset. Our method again achieves state-of-the-art results of 79.0% mIoU, 86.9% mAcc and 91.9% OA. It achieves the best results on 12 out of 13 classes including ceiling, floor, wall, beam, column, window, door, table, bookcase, sofa, board and clutter.

The ScanNetV2 validation and test set results are reported in Table III. Compared to the Stratified Transformer, our method improves mIoU by 1.6% and 2.0% on the validation and test sets, respectively. Compared to Point Transformer V2, our method improves mIoU by 0.5% on both the validation and test sets.

TABLE IV: Semantic segmentation results on the SemanticKITTI test set. ‘*’ means the network is pre-trained on other datasets.

Year	Methods	mIoU(%)	car	bicycle	motorcycle	truck	other-vehicle	person	bicyclist	motorcyclist	road	parking	sidewalk	other-ground	building	fence	vegetation	trunk	terrain	pole	traffic-sign
2017 NIPS	PointNet++[25]	20.1	53.7	1.9	0.2	0.9	0.2	0.9	1.0	0.0	72.0	18.7	41.8	5.6	62.3	16.9	46.5	13.8	30.0	6.0	8.9
2018 ICRA	SqueezeSeg[64]	30.8	68.3	18.1	5.1	4.1	4.8	16.5	17.3	1.2	84.9	28.4	54.7	4.6	61.5	29.2	59.6	25.5	54.7	11.2	36.3
2019 ICRA	SqueezeSegV2[65]	39.6	82.7	21.0	22.6	14.5	15.9	20.2	24.3	2.9	88.5	42.4	65.5	18.7	73.8	41.0	68.5	36.9	58.9	12.9	41.0
2019 IROS	RangeNet++[66]	52.2	91.4	25.7	34.4	25.7	23.0	38.3	38.8	4.8	91.8	65.0	75.2	27.8	87.4	58.6	80.5	55.1	64.6	47.9	55.9
2020 CVPR	PolarNet[67]	54.3	93.8	40.3	30.1	22.9	28.5	43.2	40.2	5.6	90.8	61.7	74.4	21.7	90.0	61.3	84.0	65.5	67.8	51.8	57.5
2021 CVPR	Cylinder3D[68]	68.9	97.1	67.6	63.8	50.8	58.5	73.7	69.2	48.0	92.2	65.0	77.0	32.3	90.7	66.5	85.6	72.5	69.8	62.4	66.2
2021 CVPR	(AF) ${}^{2}$ -S3Net[69]	69.7	94.5	65.4	86.8	39.2	41.1	80.7	80.4	74.3	91.3	68.8	72.5	53.5	87.9	63.2	70.2	68.5	53.7	61.5	71.0
2022 CVPR	PVKD[70]	71.2	97.0	67.9	69.3	53.5	60.2	75.1	73.5	50.5	91.8	70.9	77.5	41.0	92.4	69.4	86.5	73.8	71.9	64.9	65.8
2022 ECCV	2DPASS[71]	72.9	97.0	63.6	63.4	61.1	61.5	77.9	81.3	74.1	89.7	67.4	74.7	40.0	93.5	72.9	86.2	73.9	71.0	65.0	70.4
2023 TITS	SAT3D[72]	61.3	94.5	42.1	45.6	21.6	39.4	63.4	61.2	18.6	91.8	68.6	77.3	27.2	91.8	67.8	85.8	70.3	71.5	60.3	64.9
2023 ICCV	RangeFormer*[73]	73.3	96.7	69.4	73.7	59.9	66.2	78.1	75.9	58.1	92.4	73.0	78.8	42.4	92.3	70.1	86.6	73.3	72.8	66.4	66.6
	SMTransformer(ours)	74.9	97.3	66.4	65.8	67.2	68.2	80.3	82.7	76.5	92.8	71.4	82.3	38.2	92.8	70.6	86.1	74.5	70.5	67.2	72.3

IV-B Outdoor Semantic Segmentation

Datasets: We conduct experiments on two popular datasets: SemanticKITTI[74] and SWAN [72]. SemanticKITTI provides 22 point cloud sequences consisting of 43,552 frames. Adhering to standard practice, we employ sequences 0 to 10 (excluding 8) for training, sequence 8 for validation and sequences 11 to 21 for testing. The labels for the test set are available only on the online server, so results must be submitted for remote evaluation. The demanding SWAN dataset comprises 32 sequences of point clouds totalling 10,000 frames and containing approximately 0.9 billion points. Sequences 0 to 23 are allocated for training, while sequences 24 to 31 are designated for testing.

Network Configurations: For semantic segmentation on SemanticKITTI, we set the voxel size to 5cm and the maximum number of voxels to 100,000. We use the AdamW optimizer with a weight decay of 0.02. The base learning rate is set to 0.004 and follows a cosine schedule. We adopt rotation, flip, scaling, and transformation to augment the training data. On SWAN, we do not voxelize the input to reduce point resolution; instead, we directly process the raw point cloud data. We set the maximum number of points to 80,000. We use the AdamW optimizer with a weight decay of 0.04. The base learning rate is set to 0.004 and is scheduled by MultiStepLR. We use the same data augmentation as on SemanticKITTI. We train and test the model with batch sizes of 16 and 8, respectively, distributed across 4 GPUs.

TABLE V: Semantic segmentation results on the SWAN test set.

Methods	mIoU(%)	car	truck	pedestrian	bicycle	motorcycle	bus	bridge	tree	bushes	building	road	r-driver	rub-bin	bus-stop	pole	wall	Traffic sign	rs-board	sidewalk	adv-board
PointNet++[25]	14.5	31.2	7.3	4.7	9.0	0.0	4.8	0.0	33.9	12.7	59.5	68.4	0.0	13.7	9.0	6.9	15.9	1.0	2.1	11.0	0.0
PointConv[13]	37.3	53.7	20.5	36.5	19.6	5.2	68.7	7.7	61.0	52.5	74.2	77.1	42.3	19.6	38.9	22.1	45.1	26.7	27.3	40.4	6.4
$\psi$ -CNN[75]	39.8	48.5	25.2	31.1	22.4	4.2	77.6	7.3	69.0	56.2	73.9	75.6	47.1	23.0	46.1	30.1	57.0	29.1	25.8	38.2	9.4
PolarNet[67]	40.5	78.1	20.3	21.6	4.6	15.3	18.0	8.3	84.0	30.9	91.9	92.7	54.7	33.4	29.5	48.0	60.3	42.7	22.2	42.2	11.8
Cylinder3D[68]	54.9	80.8	30.4	48.7	28.0	6.8	91.7	13.7	85.3	69.0	92.7	92.3	75.2	37.5	72.1	48.8	71.1	46.5	32.5	56.7	18.9
SAT3D[72]	58.2	83.7	45.7	38.7	42.4	11.3	89.5	49.3	85.5	68.2	93.2	92.7	74.9	40.6	77.1	43.3	74.8	42.9	26.7	65.6	17.9
SMTransformer(ours)	62.4	86.4	50.2	47.2	44.6	19.6	87.0	56.0	88.4	70.4	94.0	92.2	83.3	56.8	75.1	46.7	74.8	54.9	35.6	67.0	21.0

Results: Table IV presents the results of our network alongside well-established methods on the SemanticKITTI dataset. Our approach achieves the highest mIoU of 74.9%. Notably, compared to the state-of-the-art projection-based method RangeFormer, which is pre-trained on other datasets, our method achieves 1.6% higher mIoU without any pre-training. Furthermore, our method surpasses voxel partitioning and 3D convolution-based techniques, such as Cylinder3D, 2DPASS, and PVKD. Importantly, our model performs particularly well on certain small object categories, including poles, traffic signs and motorcyclists. Fig. 6 shows visualizations of our results on the SemanticKITTI validation set.

We present the results for 20 classes of interest in the SWAN test frames in Table V. We compare our results to PointNet++ [25], PointConv [13], Cylinder3D [68], PolarNet [67], $\psi$ -CNN [75] and SAT3D [72]. As shown in Table V, the mIoU values of the compared methods on the SWAN dataset are lower than those on the SemanticKITTI dataset. This discrepancy can be attributed to the greater complexity of the scenes in the SWAN dataset, which was captured in the dense central business district area of the city of Perth, Australia. On this dataset, our method achieves the best performance, improving mIoU by 4.2% over the nearest competitor SAT3D. Our method also achieves competitive accuracy on some small objects, including light poles, traffic signs and pedestrians.

IV-C Object Classification

Datasets: We evaluate SMTransformer on the synthetic dataset ModelNet40[76] and the real-world dataset ScanObjectNN[77]. ModelNet40 comprises 12,311 CAD models from 40 categories and is divided into 9,843 training and 2,468 test models. Each sample has about 10,000 points and the features contain coordinates and normals. The ScanObjectNN dataset comprises 15,000 objects categorized into 15 classes, selected from ScanNetv2. In contrast to the synthetic ModelNet40 objects, these objects exhibit occlusion, background noise, deformed geometric shapes, and non-uniform surface density, presenting a more challenging scenario. Our experiments are conducted on its most challenging perturbed variant, denoted as PB_T50_RS. Here, we uniformly sample 1024 points from each model and only use their $(X,Y,Z)$ coordinates as input. We follow the data augmentation used in PointNext[47].

Network Configurations: We employ identical network configurations for both ModelNet40 and ScanObjectNN. We train the model for 350 epochs with a batch size of 32, using the SGD optimizer with a momentum of 0.9 and an initial learning rate of 0.1. We adapt the learning rate with cosine annealing, decaying it to 0.001, and apply a dropout ratio of 0.4.
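A minimal sketch of the cosine annealing schedule for the classification setup is given below. It reads 0.001 as the minimum learning rate reached at the end of the 350 epochs, which is an interpretation of the text rather than a confirmed detail; the function name `cosine_annealing_lr` is hypothetical.

```python
import math


def cosine_annealing_lr(epoch, total_epochs=350, base_lr=0.1, min_lr=0.001):
    """Cosine-annealed learning rate, decaying from base_lr to min_lr.

    min_lr=0.001 is an assumed reading of the paper's description.
    """
    progress = min(max(epoch, 0), total_epochs) / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


lr_first = cosine_annealing_lr(0)
lr_middle = cosine_annealing_lr(175)
lr_last = cosine_annealing_lr(350)
print(lr_first, lr_middle, lr_last)
```

The schedule starts at the initial rate, passes through the midpoint of the two rates halfway through training, and ends at the minimum rate.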

TABLE VI: Classification results on ModelNet40 and ScanObjectNN dataset. ‘xyz’ and ‘n’ represent coordinates and normal vector. ‘K’ stands for one thousand and ‘PN++’ for PointNet++. Our network achieves the best overall accuracy.

Methods	Input	#Points	ModelNet40 OA(%)	ScanObjectNN OA(%)
PointWeb[30]	xyz, n	1K	92.3	-
PointConv[13]	xyz, n	1K	92.5	-
SpiderCNN[12]	xyz, n	5K	92.4	-
KPConv[10]	xyz	7K	92.9	-
PointASNL[21]	xyz, n	1K	93.2	-
PRANet[78]	xyz	2K	93.7	82.1
RS-CNN[79]	xyz	1K	93.6	-
PointNet[24]	xyz	1K	89.2	68.2
PointNet++[25]	xyz	1K	90.7	77.9
DGCNN[33]	xyz	1K	92.9	78.2
PointCNN[36]	xyz	1K	92.2	78.5
BGA-DGCNN[77]	xyz	1K	-	79.9
BGA-PN++[77]	xyz	1K	-	80.2
PointASNL[21]	xyz	1K	92.9	-
PRANet[78]	xyz	1K	93.2	81.0
PointTransformer[23]	xyz	1K	93.7	–
PointMLP[28]	xyz	1K	94.1	85.4
PointTransformerV2[22]	xyz	1K	94.2	–
DANet[80]	xyz	1K	93.6	–
LCPFormer[53]	xyz	1K	93.6	–
PointNext[47]	xyz	1K	93.2	87.7
PointVector[29]	xyz	1K	–	87.8
SMTransformer(ours)	xyz	1K	94.2	88.0

Results: We compare our method with representative state-of-the-art methods in Table VI using the overall accuracy (OA) metric. For better comparison, we also show the input data type and the number of input points for each method. Our network achieves the best performance of 94.2% OA on ModelNet40 and the best performance of 88.0% OA on ScanObjectNN.

On ModelNet40, our network surpasses the classical local point convolution KPConv by 1.3%, even though KPConv uses 7,000 input points while our network uses only 1,024 points. Compared to the previous state-of-the-art MLP-based method PointNext, our network is 1% better. Furthermore, compared to the classical vector attention-based transformers, our network outperforms the Point Transformer by 0.5% and matches the 94.2% OA of Point Transformer V2. Compared to the scalar attention-based methods, our network outperforms PointASNL and LCPFormer by 1% and 0.6%, respectively.

On ScanObjectNN, our network achieves the state-of-the-art overall accuracy (OA) of 88.0%. Specifically, when compared to MLP-based methods, our network outperforms PointMLP, PointNext, and PointVector by 2.6%, 0.3%, and 0.2% in terms of OA. This superior performance on a real-world dataset highlights the suitability of our method for practical applications.

IV-D Ablation Studies

We conduct ablation studies on the S3DIS dataset to demonstrate the effectiveness of SMTransformer, SAUB and Shared Position Encoding.

1) Effect of Various Components: Table VII shows the influence of our introduced modules (i.e. SMTransformer and SAUB) and strategy (i.e. shared position encoding). Case I is the baseline Point Transformer, which does not include any of our modules. Cases II to IV incorporate our proposed components one at a time, progressively raising the baseline mIoU from 70.6% to 73.3%. Case V additionally employs the shared position encoding strategy and reaches 73.4% mIoU, similar to the unshared position encoding strategy, while reducing the parameters from 10.3M to 7.8M. This indicates that point clouds with the same resolution can share the same position encoding information without sacrificing accuracy.

TABLE VII: Effect of various components on semantic segmentation (S3DIS Area-5). ‘SMTB ${}^{\dagger}$’ denotes the Soft Masked Transformer with the position encoding used in the Point Transformer[23]. ‘EPE’ denotes the proposed enhanced position encoding. ‘SAUB’ denotes the proposed skip attention-based up-sampling block. ‘SPE’ is the shared position encoding strategy. ‘Para.’ denotes the number of network parameters.

Case	SMTB ${}^{\dagger}$	EPE	SAUB	SPE	mIoU(%)	mAcc(%)	Para.(M)
I					70.6	76.5	7.8
II	✓				72.0	77.8	8.4
III	✓	✓			72.9	78.2	9.0
IV	✓	✓	✓		73.3	78.8	10.3
V	✓	✓	✓	✓	73.4	78.9	7.8

2) SPE versus Unshared Position Encoding (USPE): We compare a network employing shared position encoding with one using unshared position encoding, keeping the remaining network configuration identical. The results in Table VIII highlight the advantages of the shared position encoding strategy. The network with shared position encoding delivers a slightly higher mIoU (73.4% versus 73.3%) with fewer parameters (7.8 million versus 10.3 million) and a shorter training time (16 hours versus 24 hours).

TABLE VIII: Comparative Analysis of Network Performance with Shared and Unshared Position Encoding.

Case	mIoU(%) $\uparrow$	Para.(M) $\downarrow$	Training time(h) $\downarrow$
SMTransformer + USPE	73.3	10.3	24
SMTransformer + SPE	73.4	7.8	16
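The idea behind sharing can be sketched by counting how many position-encoding modules a network would instantiate. The sketch assumes one position-encoding module per transformer block in the unshared case and one per resolution level in the shared case, with the encoder and decoder blocks of a level operating on the same points; these assumptions and the names `PositionEncodingRegistry` and `count_position_encoders` are illustrative and not the authors' implementation.

```python
class PositionEncodingRegistry:
    """Hands out position-encoding modules, optionally shared per level."""

    def __init__(self, shared):
        self.shared = shared
        self._by_level = {}
        self.created = 0

    def get(self, level):
        if self.shared and level in self._by_level:
            return self._by_level[level]  # reuse the module for this resolution
        module = object()  # stand-in for a learnable position-encoding MLP
        self.created += 1
        if self.shared:
            self._by_level[level] = module
        return module


def count_position_encoders(shared, encoder_depths=(1, 2, 2, 6, 2),
                            decoder_blocks_per_level=1):
    registry = PositionEncodingRegistry(shared)
    for level, depth in enumerate(encoder_depths):
        for _ in range(depth + decoder_blocks_per_level):
            registry.get(level)
    return registry.created


unshared_count = count_position_encoders(shared=False)
shared_count = count_position_encoders(shared=True)
print(unshared_count, shared_count)
```

Under these assumptions the number of position-encoding modules drops from one per block to one per resolution level, which is consistent in spirit with the parameter and training-time reductions reported in Table VIII.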

3) Soft Mask versus Hard Mask: To demonstrate the effectiveness and versatility of the soft mask, we compare it with a hard mask. The hard mask is a binary mask and can be expressed as,

$$S(\cdot)=\begin{cases}0, & K_{ij}^{s}-Q_{i}^{s}<\tau\\ 1, & K_{ij}^{s}-Q_{i}^{s}\geq\tau\end{cases}$$

where $\tau$ represents the threshold of the mask. When the difference between the task score key $K_{ij}^{s}$ and the task score query $Q_{i}^{s}$ is greater than or equal to the threshold, the corresponding position in the hard mask is set to 1. Otherwise, it is set to 0. Typically, the threshold needs to be tuned for each dataset. Here, we set $\tau$ to 0.5 on S3DIS through iterative experimentation. Fig. 7 compares the attention weights on the point cloud with the soft mask and the hard mask. The point transformer with either mask is robust at object boundaries. In particular, the transformer with a hard mask is highly sensitive to certain classes, such as tables and chairs. The transformer with a soft mask is similarly sensitive to these classes and is also highly robust on challenging classes, such as boards.
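The hard mask defined above can be written directly as a thresholding function. In this sketch the task scores are plain scalars and the function name `hard_mask` is hypothetical; the example values are made up for illustration.

```python
def hard_mask(key_scores, query_score, tau=0.5):
    """Binary mask over the neighbours j of a query point i.

    Returns 1 where K_ij^s - Q_i^s >= tau and 0 otherwise,
    following the piecewise definition of S(.) in the text.
    """
    return [1 if k - query_score >= tau else 0 for k in key_scores]


# One query with four neighbours (illustrative scores).
query = 0.2
keys = [0.9, 0.6, 0.8, 0.1]
mask = hard_mask(keys, query, tau=0.5)
print(mask)
```

Because the output is binary, a neighbour just below the threshold is discarded entirely, which illustrates why the threshold has to be chosen per dataset.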

4) SAUB versus Classic Up-sampling: To demonstrate the effectiveness of our proposed skip attention-based up-sampling block (SAUB), we compare it with two types of up-sampling blocks: the transition up-sampling block (TUB)[23] and the grid unpooling block (GUB)[22]. TUB consists of one projection layer, one interpolation, and one addition operation. GUB consists of one projection layer, one grid unpooling operation and one addition operation. The addition operation connects the features from the encoding and decoding layers. Table IX presents the performance of our network with various up-sampling blocks. The network employing the SAUB achieves the best performance with mIoU, mAcc, and OA values of 73.4%, 78.9%, and 91.8%, respectively, exceeding the network using the TUB by 2.4% mIoU and the network using the GUB by 1.1% mIoU.

TABLE IX: Segmentation performance of our model on S3DIS area 5 with different up-sampling blocks. TUB: transition up-sampling block, GUB: grid unpooling block, SAUB: skip-attention-based up-sampling block.

Up-sampling block	mIoU(%)	mAcc(%)	OA(%)
TUB	71.0	76.8	90.6
GUB	72.3	78.0	91.2
SAUB(ours)	73.4	78.9	91.8

IV-E Robustness Analysis

1) Robustness to Density: We compare the robustness of our model to inter- and intra-point cloud density with several typical baselines such as PointNet [24], PointNet++ [25] and DGCNN[33], classical convolutional networks such as PointConv [13], RS-CNN[79] and DANet[80], and the attention network PointASNL[21]. For a fair comparison, all the networks are trained on the modelnet40_normal_resampled dataset[76] with 1024 points using only coordinates as the input. To evaluate robustness across point clouds with varying densities (inter-point cloud), we feed downsampled inputs of 512, 256, 128, and 64 points to the trained model. To evaluate robustness to varying density within a point cloud (intra-point cloud), we divide the 1024 points into four equal parts along the $X$ coordinate by point count, and then randomly sample 128 points from each part in sequence. This generates test samples with 896, 768, 640 and 512 points, respectively. The results are shown in Fig. 8. Our SMTransformer is more robust than existing approaches to both inter- and intra-point cloud density variations.
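One way to read the intra-point cloud protocol is sketched below: the points are sorted along X, split into four parts of equal size, and the parts are thinned to 128 points one after another, so each step removes 128 points. This is an interpretation of the description, and the function name `intra_density_samples` is hypothetical.

```python
import random


def intra_density_samples(points, parts=4, keep_per_part=128, seed=0):
    """Build test samples with non-uniform density along the X axis.

    points: list of (x, y, z) tuples whose length is divisible by `parts`.
    Returns one point list per step; step i has thinned parts 0..i.
    """
    rng = random.Random(seed)
    ordered = sorted(points, key=lambda p: p[0])
    size = len(ordered) // parts
    chunks = [ordered[i * size:(i + 1) * size] for i in range(parts)]
    samples = []
    for i in range(parts):
        chunks[i] = rng.sample(chunks[i], keep_per_part)  # thin part i
        samples.append([p for chunk in chunks for p in chunk])
    return samples


# Synthetic cloud of 1024 random points.
gen = random.Random(42)
cloud = [(gen.random(), gen.random(), gen.random()) for _ in range(1024)]
sample_sizes = [len(s) for s in intra_density_samples(cloud)]
print(sample_sizes)
```

With 1024 input points this yields samples of 896, 768, 640 and 512 points, matching the sizes stated in the text.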

2) Robustness to Transformation: To demonstrate the robustness of our SMTransformer, we evaluate its performance on S3DIS and ModelNet40 under a variety of perturbations of the test data, including permutation, translation, scaling and jitter. As shown in Table X, on S3DIS, Point Transformer and Point Vector suffer a large performance drop under scaling. Our method remains stable across all transformations. In particular, its performance is stable under a translation of 0.2 along the $X$ , $Y$ , and $Z$ axes and under jitter. All methods are invariant to permutations. In terms of sensitivity to point scaling, SMTransformer performs relatively better when the point cloud is scaled down. Our method achieves the best accuracy under all transformations on both the segmentation and classification datasets.
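The test-time perturbations can be sketched as simple functions on lists of points. The translation offset (0.2) and the scaling factors (0.8, 1.2) come from the text; the jitter noise level (`sigma`, `clip`) is not given in the paper and is an assumption here, as are all function names.

```python
import random


def permute(points, rng):
    """Random reordering of the points."""
    shuffled = list(points)
    rng.shuffle(shuffled)
    return shuffled


def translate(points, offset):
    """Shift every point by the same offset along X, Y and Z."""
    return [(x + offset, y + offset, z + offset) for x, y, z in points]


def scale(points, factor):
    """Uniformly scale all coordinates."""
    return [(x * factor, y * factor, z * factor) for x, y, z in points]


def jitter(points, rng, sigma=0.01, clip=0.05):
    """Add clipped Gaussian noise per coordinate (sigma and clip are assumed)."""
    def noise():
        return max(-clip, min(clip, rng.gauss(0.0, sigma)))
    return [(x + noise(), y + noise(), z + noise()) for x, y, z in points]


rng = random.Random(0)
cloud = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (-1.0, 0.5, 2.0)]
perturbed = {
    "perm": permute(cloud, rng),
    "+0.2": translate(cloud, 0.2),
    "-0.2": translate(cloud, -0.2),
    "x0.8": scale(cloud, 0.8),
    "x1.2": scale(cloud, 1.2),
    "jitter": jitter(cloud, rng),
}
print(perturbed["+0.2"][0], perturbed["x0.8"][1])
```

Each perturbed copy would be passed to the trained model without retraining, and the resulting metric compared against the unperturbed baseline.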

TABLE X: Robustness study for random point permutations, translation of $\pm$ 0.2 in $X,Y,Z$ axis, scaling ( $\times$ 0.8, $\times$ 1.2) and jittering. Note that this ablation study is without test time augmentation.

Methods	None	Perm.	Translation + 0.2	Translation - 0.2	Scaling $\times$ 0.8	Scaling $\times$ 1.2	Jitter
S3DIS Dataset mIoU(%)
PointNet[24]	57.75	59.71	22.33	29.85	56.24	59.74	59.04
MinkowskiNet[60]	64.68	64.56	64.59	64.96	59.60	61.93	58.96
PAConv[49]	65.63	65.64	55.81	57.42	64.20	63.94	65.12
Point Transformer[23]	70.36	70.45	70.44	70.43	65.73	66.15	59.67
Stratified Transformer[41]	71.96	72.02	71.99	71.93	70.42	71.21	72.02
Point Vector[29]	72.29	72.29	72.29	72.29	69.34	69.26	72.16
SMTransformer(ours)	72.62	72.62	72.83	72.96	72.30	71.94	72.75
ModelNet40 Dataset OA(%)
PointNet++[25]	92.1	92.1	90.7	90.8	91.2	91.0	91.0
DGCNN[33]	92.5	92.5	92.3	92.3	92.1	92.3	91.5
PointConv[13]	91.8	91.8	91.8	91.8	89.9	90.6	90.6
SMTransformer(ours)	94.2	94.2	94.1	94.2	93.5	93.9	92.3

3) Robustness to Noise: To assess the robustness of SMTransformer to noise, we conducted experiments using the PB_T50_RS variant of ScanObjectNN dataset, measuring the performance with and without background noise (denoted as ‘obj_bg’ and ‘obj_nobg’ respectively). Table XI presents a comparative analysis between our model and several baselines from [77]. We observe that the overall accuracy of all networks diminishes when trained and tested under conditions involving background noise. However, our model achieves the highest accuracy, exhibiting the smallest performance drop of 1.4% OA from the ‘obj_nobg’ variant to the ‘obj_bg’ variant, surpassing all other networks in comparison.

TABLE XI: Robustness to background noise on ScanObjectNN. ‘obj_bg’ and ‘obj_nobg’ stand for objects with and without background noise, respectively.

Method	obj_nobg	obj_bg	OA drop(%)
3DmFV[81]	69.8	63.0	6.8 $\downarrow$
PointNet[24]	74.4	68.2	6.2 $\downarrow$
PointNet++[25]	80.2	77.9	2.3 $\downarrow$
SpiderCNN[12]	76.9	73.7	3.2 $\downarrow$
DGCNN[33]	81.5	78.1	3.4 $\downarrow$
PointCNN[36]	80.8	78.5	2.3 $\downarrow$
SMTransformer(ours)	88.0	86.6	1.4 $\downarrow$

V Conclusion

In this paper, we introduced a novel Soft Masked Transformer to effectively capture contextual and task-specific information from point clouds. Additionally, we proposed a Skip Attention-based up-sampling block to integrate features from different resolution points across the encoding and decoding layers. Furthermore, we presented a Shared Position Encoding strategy. By incorporating these modules, we constructed the SMTransformer network. Our method was evaluated across various tasks, including indoor and outdoor semantic segmentation and classification. Through extensive experiments on challenging benchmarks, thorough ablation studies and robustness analysis, we demonstrated the robustness and effectiveness of our approach on real-world datasets. Our contributions advance the state of the art in point cloud processing. The introduced techniques, including the Soft Masked Transformer, Skip Attention-based Up-sampling block, and Shared Position Encoding strategy, provide notable improvements in capturing intricate details and enhancing the performance of point cloud processing tasks. As future work, exploring additional priors or refining the network architecture could offer promising avenues to further improve point cloud processing.

References

[1] Z. Ma, Z. Zheng, J. Wei, Y. Yang, and H. T. Shen, “Instance-dictionary learning for open-world object detection in autonomous driving scenarios,” IEEE Trans. Circuits Syst. Video Technol., 2023.
[2] D. W. Shu and J. Kwon, “Hierarchical bidirected graph convolutions for large-scale 3-d point cloud place recognition,” IEEE Trans. Neural Netw. Learn. Syst., 2023.
[3] Z. Wang, W. Li, and D. Xu, “Domain adaptive sampling for cross-domain point cloud recognition,” IEEE Trans. Circuits Syst. Video Technol., 2023.
[4] Y. Ren, Y. Cong, J. Dong, and G. Sun, “Uni3da: Universal 3d domain adaptation for object recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 1, pp. 379–392, 2022.
[5] F. J. Lawin, M. Danelljan, P. Tosteberg, G. Bhat, F. S. Khan, and M. Felsberg, “Deep projective 3d semantic segmentation,” in Proc. Int. Conf. Pattern Recognit. Image Anal. Springer, 2017, pp. 95–107.
[6] A. Boulch, J. Guerry, B. Le Saux, and N. Audebert, “Snapnet: 3d point cloud semantic labeling with 2d deep segmentation networks,” Comput. Graph., vol. 71, pp. 189–198, 2018.
[7] L. Tchapmi, C. Choy, I. Armeni, J. Gwak, and S. Savarese, “Segcloud: Semantic segmentation of 3d point clouds,” in Proc. Int. Conf. 3D Vis. IEEE, 2017, pp. 537–547.
[8] D. Maturana and S. Scherer, “Voxnet: A 3d convolutional neural network for real-time object recognition,” in Proc. IEEE Int. Conf. Intell. Rob. Syst. IEEE, 2015, pp. 922–928.
[9] Y. Shen, C. Feng, Y. Yang, and D. Tian, “Mining point cloud local structures by kernel correlation and graph pooling,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 4548–4557.
[10] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas, “Kpconv: Flexible and deformable convolution for point clouds,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 6411–6420.
[11] M. Fey, J. E. Lenssen, F. Weichert, and H. Müller, “Splinecnn: Fast geometric deep learning with continuous b-spline kernels,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 869–877.
[12] Y. Xu, T. Fan, M. Xu, L. Zeng, and Y. Qiao, “Spidercnn: Deep learning on point sets with parameterized convolutional filters,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 87–102.
[13] W. Wu, Z. Qi, and L. Fuxin, “Pointconv: Deep convolutional networks on 3d point clouds,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 9621–9630.
[14] P. Hermosilla, T. Ritschel, P.-P. Vázquez, À. Vinacua, and T. Ropinski, “Monte carlo convolution for learning on non-uniformly sampled point clouds,” ACM Trans. Graph., vol. 37, no. 6, pp. 1–12, 2018.
[15] H. Lei, N. Akhtar, and A. Mian, “Seggcn: Efficient 3d point cloud segmentation with fuzzy spherical kernel,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2020.
[16] S. Xie, S. Liu, Z. Chen, and Z. Tu, “Attentional shapecontextnet for point cloud recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 4606–4615.
[17] X. Liu, Z. Han, Y.-S. Liu, and M. Zwicker, “Point2sequence: Learning the shape representation of 3d point clouds with an attention-based sequence to sequence network,” in Proc. AAAI Conf. Artif. Intell., vol. 33, no. 01, 2019, pp. 8778–8785.
[18] J. Yang, Q. Zhang, B. Ni, L. Li, J. Liu, M. Zhou, and Q. Tian, “Modeling point clouds with self-attention and gumbel subset sampling,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3323–3332.
[19] J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi, and Y. W. Teh, “Set transformer: A framework for attention-based permutation-invariant neural networks,” in Proc. Int. Conf. Mach. Learn. PMLR, 2019, pp. 3744–3753.
[20] M. Feng, L. Zhang, X. Lin, S. Z. Gilani, and A. Mian, “Point attention network for semantic segmentation of 3d point clouds,” Pattern Recognit., vol. 107, p. 107446, 2020.
[21] X. Yan, C. Zheng, Z. Li, S. Wang, and S. Cui, “Pointasnl: Robust point clouds processing using nonlocal neural networks with adaptive sampling,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 5589–5598.
[22] X. Wu, Y. Lao, L. Jiang, X. Liu, and H. Zhao, “Point transformer v2: Grouped vector attention and partition-based pooling,” Proc. Adv. Neural Inf. Process. Syst., vol. 35, pp. 33 330–33 342, 2022.
[23] H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V. Koltun, “Point transformer,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 16 259–16 268.
[24] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 652–660.
[25] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” arXiv preprint arXiv:1706.02413, 2017.
[26] J. Li, B. M. Chen, and G. H. Lee, “So-net: Self-organizing network for point cloud analysis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 9397–9406.
[27] H. Ran, J. Liu, and C. Wang, “Surface representation for point clouds,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022.
[28] X. Ma, C. Qin, H. You, H. Ran, and Y. Fu, “Rethinking network design and local geometry in point cloud: A simple residual mlp framework,” in Proc. Int. Conf. Learn. Represent., 2021.
[29] X. Deng, W. Zhang, Q. Ding, and X. Zhang, “Pointvector: A vector representation in point cloud analysis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 9455–9465.
[30] H. Zhao, L. Jiang, C.-W. Fu, and J. Jia, “Pointweb: Enhancing local neighborhood features for point cloud processing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5565–5573.
[31] L. Jiang, H. Zhao, S. Liu, X. Shen, C.-W. Fu, and J. Jia, “Hierarchical point-edge interaction network for point cloud semantic segmentation,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 10 433–10 441.
[32] M. Xu, J. Zhang, Z. Zhou, M. Xu, X. Qi, and Y. Qiao, “Learning geometry-disentangled representation for complementary understanding of 3d object point cloud,” in Proc. AAAI Conf. Artif. Intell., vol. 35, 2021, pp. 3056–3064.
[33] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph cnn for learning on point clouds,” ACM Trans. Graph., vol. 38, no. 5, pp. 1–12, 2019.
[34] R. Klokov and V. Lempitsky, “Escape from cells: Deep kd-networks for the recognition of 3d point cloud models,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 863–872.
[35] M. Xu, Z. Zhou, and Y. Qiao, “Geometry sharing network for 3d point cloud classification and segmentation,” in Proc. AAAI Conf. Artif. Intell., vol. 34, 2020, pp. 12 500–12 507.
[36] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, “Pointcnn: Convolution on x-transformed points,” Proc. Adv. Neural Inf. Process. Syst., vol. 31, pp. 820–830, 2018.
[37] M. Simonovsky and N. Komodakis, “Dynamic edge-conditioned filters in convolutional neural networks on graphs,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3693–3702.
[38] S. Wang, S. Suo, W.-C. Ma, A. Pokrovsky, and R. Urtasun, “Deep parametric continuous convolutional neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 2589–2597.
[39] L. Wang, Y. Huang, Y. Hou, S. Zhang, and J. Shan, “Graph attention convolution for point cloud semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 10 296–10 305.
[40] C. Wang, B. Samari, and K. Siddiqi, “Local spectral graph convolution for point set feature learning,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 52–66.
[41] X. Lai, J. Liu, L. Jiang, L. Wang, H. Zhao, S. Liu, X. Qi, and J. Jia, “Stratified transformer for 3d point cloud segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 8500–8509.
[42] Q. Hu, B. Yang, L. Xie, S. Rosa, Y. Guo, Z. Wang, N. Trigoni, and A. Markham, “Randla-net: Efficient semantic segmentation of large-scale point clouds,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 11 108–11 117.
[43] F. Groh, P. Wieschollek, and H. P. Lensch, “Flex-convolution,” in Proc. Asian Conf. Comput. Vis. Springer, 2018, pp. 105–122.
[44] F. Engelmann, T. Kontogianni, J. Schult, and B. Leibe, “Know what your neighbors do: 3d semantic segmentation of point clouds,” in Proc. Eur. Conf. Comput. Vis. Worksh., 2018.
[45] F. Engelmann, T. Kontogianni, A. Hermans, and B. Leibe, “Exploring spatial context for 3d semantic segmentation of point clouds,” in Proc. IEEE Int. Conf. Comput. Vis. Worksh., 2017, pp. 716–724.
[46] Z. Zhang, B.-S. Hua, and S.-K. Yeung, “Shellnet: Efficient point cloud convolutional neural networks using concentric shells statistics,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 1607–1616.
[47] G. Qian, Y. Li, H. Peng, J. Mai, H. Hammoud, M. Elhoseiny, and B. Ghanem, “Pointnext: Revisiting pointnet++ with improved training and scaling strategies,” in Proc. Adv. Neural Inf. Process. Syst., 2022.
[48] H. Lei, N. Akhtar, and A. Mian, “Spherical kernel for efficient graph convolution on 3d point clouds,” IEEE Trans. Pattern Anal. Mach. Intell., 2020.
[49] M. Xu, R. Ding, H. Zhao, and X. Qi, “Paconv: Position adaptive convolution with dynamic kernel assembling on point clouds,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 3173–3182.
[50] S. Qiu, S. Anwar, and N. Barnes, “Semantic segmentation for real point cloud scenes via bilateral augmentation and adaptive fusion,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 1757–1767.
[51] L. Tang, Y. Zhan, Z. Chen, B. Yu, and D. Tao, “Contrastive boundary learning for point cloud segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 8489–8499.
[52] J. Choe, C. Park, F. Rameau, J. Park, and I. S. Kweon, “Pointmixer: Mlp-mixer for point cloud understanding,” in Proc. Eur. Conf. Comput. Vis. Springer, 2022, pp. 620–640.
[53] Z. Huang, Z. Zhao, B. Li, and J. Han, “Lcpformer: Towards effective 3d point cloud analysis via local context propagation in transformers,” IEEE Trans. Circuits Syst. Video Technol., 2023.
[54] C. Chen, D. Liu, C. Xu, and T.-K. Truong, “Saks: Sampling adaptive kernels from subspace for point cloud graph convolution,” IEEE Trans. Circuits Syst. Video Technol., 2023.
[55] H. Lei, N. Akhtar, M. Shah, and A. Mian, “Mesh convolution with continuous filters for 3-d surface parsing,” IEEE Trans. Neural Netw. Learn. Syst., 2023.
[56] S. Fan, Q. Dong, F. Zhu, Y. Lv, P. Ye, and F.-Y. Wang, “Scf-net: Learning spatial contextual features for large-scale point cloud segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 14 504–14 513.
[57] B. Graham, M. Engelcke, and L. Van Der Maaten, “3d semantic segmentation with submanifold sparse convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 9224–9232.
[58] M. Jaritz, J. Gu, and H. Su, “Multi-view pointnet for 3d scene understanding,” in Proc. IEEE Int. Conf. Comput. Vis. Worksh., 2019.
[59] H.-Y. Chiang, Y.-L. Lin, Y.-C. Liu, and W. H. Hsu, “A unified point-based framework for 3d segmentation,” in Proc. Int. Conf. 3D Vis., 2019, pp. 155–163.
[60] C. Choy, J. Gwak, and S. Savarese, “4d spatio-temporal convnets: Minkowski convolutional neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3075–3084.
[61] C. Park, Y. Jeong, M. Cho, and J. Park, “Fast point transformer,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 16 949–16 958.
[62] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese, “3d semantic parsing of large-scale indoor spaces,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1534–1543.
[63] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5828–5839.
[64] B. Wu, A. Wan, X. Yue, and K. Keutzer, “Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud,” in Proc. IEEE Int. Conf. Robot. Autom. IEEE, 2018, pp. 1887–1893.
[65] B. Wu, X. Zhou, S. Zhao, X. Yue, and K. Keutzer, “Squeezesegv2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud,” in Proc. IEEE Int. Conf. Robot. Autom. IEEE, 2019, pp. 4376–4382.
[66] A. Milioto, I. Vizzo, J. Behley, and C. Stachniss, “Rangenet++: Fast and accurate lidar semantic segmentation,” in Proc. IEEE Int. Conf. Intell. Rob. Syst. IEEE, 2019, pp. 4213–4220.
[67] Y. Zhang, Z. Zhou, P. David, X. Yue, Z. Xi, B. Gong, and H. Foroosh, “Polarnet: An improved grid representation for online lidar point clouds semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 9601–9610.
[68] X. Zhu, H. Zhou, T. Wang, F. Hong, Y. Ma, W. Li, H. Li, and D. Lin, “Cylindrical and asymmetrical 3d convolution networks for lidar segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 9939–9948.
[69] R. Cheng, R. Razani, E. Taghavi, E. Li, and B. Liu, “(af)2-s3net: Attentive feature fusion with adaptive feature selection for sparse semantic segmentation network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 12 547–12 556.
[70] Y. Hou, X. Zhu, Y. Ma, C. C. Loy, and Y. Li, “Point-to-voxel knowledge distillation for lidar semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 8479–8488.
[71] X. Yan, J. Gao, C. Zheng, C. Zheng, R. Zhang, S. Cui, and Z. Li, “2dpass: 2d priors assisted semantic segmentation on lidar point clouds,” in Proc. Eur. Conf. Comput. Vis. Springer, 2022, pp. 677–695.
[72] M. Ibrahim, N. Akhtar, S. Anwar, and A. Mian, “Sat3d: Slot attention transformer for 3d point cloud semantic segmentation,” IEEE Trans. Intell. Transp. Syst., 2023.
[73] L. Kong, Y. Liu, R. Chen, Y. Ma, X. Zhu, Y. Li, Y. Hou, Y. Qiao, and Z. Liu, “Rethinking range view representation for lidar segmentation,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 228–240.
[74] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, “Semantickitti: A dataset for semantic scene understanding of lidar sequences,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 9297–9307.
[75] H. Lei, N. Akhtar, and A. Mian, “Octree guided cnn with spherical kernels for 3d point clouds,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 9631–9640.
[76] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d shapenets: A deep representation for volumetric shapes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1912–1920.
[77] M. A. Uy, Q.-H. Pham, B.-S. Hua, T. Nguyen, and S.-K. Yeung, “Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 1588–1597.
[78] S. Cheng, X. Chen, X. He, Z. Liu, and X. Bai, “Pra-net: Point relation-aware network for 3d point cloud analysis,” IEEE Trans. Image Process., vol. 30, pp. 4436–4448, 2021.
[79] Y. Liu, B. Fan, S. Xiang, and C. Pan, “Relation-shape convolutional neural network for point cloud analysis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 8895–8904.
[80] Y. He, H. Yu, Z. Yang, W. Sun, M. Feng, and A. Mian, “Danet: Density adaptive convolutional network with interactive attention for 3d point clouds,” IEEE Robot. Autom. Lett., 2023.
[81] Y. Ben-Shabat, M. Lindenbaum, and A. Fischer, “3dmfv: Three-dimensional point cloud classification in real-time using convolutional neural networks,” IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 3145–3152, 2018.