License: arXiv.org perpetual non-exclusive license
arXiv:2403.14124v1 [cs.CV] 21 Mar 2024

Soft Masked Transformer for Point Cloud Processing with Skip Attention-Based Upsampling

Yong He, Hongshan Yu, Muhammad Ibrahim,
Xiaoyan Liu, Tongjia Chen, Anwaar Ulhaq, Ajmal Mian
Yong He, Hongshan Yu, Xiaoyan Liu, and Tongjia Chen are with the National Engineering Laboratory for Robot Visual Perception and Control Technology, College of Electrical and Information Engineering, Hunan University, Lushan South Rd., Yuelu Dist., 410082, Changsha, China. This work was partially supported by the National Natural Science Foundation of China (Grants U2013203, 61973106). Muhammad Ibrahim and Ajmal Mian are with the Department of Computer Science, The University of Western Australia, WA 6009, Australia. Ajmal Mian is the recipient of an Australian Research Council Future Fellowship Award (project number FT210100268) funded by the Australian Government. Anwaar Ulhaq is with Central Queensland University, Sydney Campus, Australia.
Abstract

Point cloud processing methods leverage local and global point features to cater to downstream tasks, yet they often overlook the task-level context inherent in point clouds during the encoding stage. We argue that integrating task-level information into the encoding stage significantly enhances performance. To that end, we propose SMTransformer, which incorporates task-level information into a vector-based transformer by utilizing a soft mask generated from task-level queries and keys to learn the attention weights. Additionally, to facilitate effective communication between features from the encoding and decoding layers in high-level tasks such as segmentation, we introduce a skip-attention-based up-sampling block. This block dynamically fuses features from various resolution points across the encoding and decoding layers. To mitigate the increase in network parameters and training time resulting from the complexity of the aforementioned blocks, we propose a novel shared position encoding strategy. This strategy allows various transformer blocks to share the same position information over the same resolution points, thereby reducing network parameters and training time without compromising accuracy. Experimental comparisons with existing methods on multiple datasets demonstrate the efficacy of SMTransformer and skip-attention-based up-sampling for point cloud processing tasks, including semantic segmentation and classification. In particular, we achieve state-of-the-art semantic segmentation results of 73.4% mIoU on S3DIS Area 5 and 62.4% mIoU on the SWAN dataset.

Index Terms:
Deep learning, 3D point clouds, Soft mask, Transformer, Attention-based up-sampling.

I Introduction

With the rapid development of 3D sensors such as LiDARs and depth cameras, the accessibility of 3D point cloud data has dramatically increased. Consequently, its importance is growing across various applications, such as autonomous driving [1], [2], robotics [3], and industrial automation [4]. Techniques for processing 3D point clouds have attracted the interest of many researchers. Effectively learning features from 3D point clouds is challenging due to their irregular nature, as opposed to the regular grid-like structure of images. Inspired by the success of grid convolutions, several methods have been proposed to transform irregular point clouds into regular representations, including projected images [5], [6], or voxels [7], [8]. However, the discretisation process inevitably sacrifices significant geometric information.

Early approaches learn features from raw point clouds by employing a shared multi-layer perceptron (MLP) on each point and use symmetric functions (e.g., max-pooling) to aggregate the most prominent features over the receptive field. However, this design ignores local structures crucial for shape representation. Inspired by 2D grid convolutions, point convolutions utilize correlation functions to quantify connections among points, enabling the model to leverage local features (e.g., edges, corners, surfaces) and contextual information (e.g., scene layout). Some point convolutions use weight functions to learn weights from various local point geometric connections, such as point coordinates, coordinate differences, distances, etc., [9], [10], [11], [12]. Others associate coefficients (derived from point coordinates) [13], [14], [15] with weight functions to adjust the learned weights.

In contrast to point convolution, the point transformer focuses on learning feature connections and attention maps between point features through a scalar or vector-based attention mechanism globally [16], [17], [18], [19], [20], [21] or over a local receptive field [22], [23]. Recently, with the help of position encoding, the aggregation ability of point transformers has seen significant improvement. While point transformers are powerful learners, they generally fall short of simultaneously exploiting local features and global context, failing to establish communication between the two. Current point convolution and transformer based approaches strive to enhance the encoder and decoder designs to facilitate proficient point connection learning for improved performance on downstream tasks such as segmentation, detection and classification. However, these methods often overlook the task-level contextual information of the entire point cloud, which results in sub-optimal performance on the target task, such as inaccurate predictions when segmenting small objects with intricate boundaries.

To address the aforementioned issues, we propose a Soft Masked Transformer (SMTransformer) that incorporates task-level contextual information into attention mechanisms and utilises this information to guide local feature learning softly. Specifically, the SMTransformer predicts task score keys and queries over the global receptive field and then obtains a soft mask from these keys and queries to re-weight the local attention map.

To establish communication between encoder and decoder layers, conventional point transformer networks typically incorporate skip connections and fusion within the up-sampling block of the decoder layer. However, this straightforward fusion approach may not dynamically blend features from the encoder and decoder layers. Furthermore, these operations are limited to the same resolution point cloud i.e. there is no communication between points at different resolutions. To address this challenge, we introduce a novel Skip-Attention-based up-sampling Block (SAUB). This block initially augments the resolution of low-resolution points at the position level, while preserving their original features. Subsequently, it learns attention maps to establish meaningful connections by skip attention between low-resolution and high-resolution point features.

Moreover, in conventional networks, different point transformer blocks often employ distinct position encoding information for the same resolution. This increases the network parameters and the training time. Intuitively, the same resolution points in the network have the same position information, a characteristic independent of the number and location of point transformers.

Motivation and Contributions: The motivation for this paper lies in addressing the limitations of existing point cloud processing methods. While existing methods effectively utilize local and global point features, they often neglect the inherent task-level context in point clouds during the encoding stage. This oversight can lead to sub-optimal performance in downstream tasks. To overcome this challenge, the paper proposes the SMTransformer, which integrates task-level information into the encoding stage by introducing a soft mask generated from task-level queries and keys to adjust the attention weights. Additionally, this paper introduces a skip-attention-based up-sampling block to enhance communication between features from encoding layers over different resolutions, particularly in high-level tasks like segmentation. To further improve efficiency, a novel shared position encoding strategy is proposed to reduce network parameters and training time without sacrificing accuracy. These innovations aim to enhance the performance, effectiveness and efficiency of point cloud processing methods in general. To summarize, our contributions are threefold:

  • We propose a Soft Masked Transformer block, which integrates task-level information into the attention mechanism, enhancing its effectiveness for downstream tasks such as semantic segmentation and classification.

  • We introduce a Skip Attention-based Up-sampling block, which dynamically combines features from different resolution points across the encoding layers, improving the model’s ability to capture contextual information.

  • We present a shared position encoding strategy, which reduces network parameters by 24.3% and training time by 33.3%. This strategy enhances the efficiency of the network without sacrificing performance.

We conduct extensive experiments on benchmark datasets to showcase the effectiveness of our proposed method and its robust generalization across various tasks, including indoor semantic segmentation, outdoor semantic segmentation, and object classification. Our method consistently achieves competitive results compared to existing point transformer-based approaches. Particularly noteworthy is that our method achieves state-of-the-art semantic segmentation performance, attaining an mIoU of 73.4% on S3DIS Area 5 and 62.4% on the SWAN dataset without any pre-training.

II Related Work

II-A Point-based Methods

Aiming to maximize the preservation of geometric information in point clouds, state-of-the-art methods prefer to directly process the raw point clouds. The development of the most important unit (i.e., local aggregation) in the point cloud processing network can be broadly divided into three categories explained below.

1) MLP-Based Approaches: PointNet [24] is considered a milestone in point cloud based deep learning. It employs shared MLPs to leverage point-wise features and utilizes a symmetric function such as max-pooling to aggregate these features into global representations. However, the network’s performance is limited as it does not account for the spatial relationships among local points, which are crucial for vision tasks. To address this issue, hierarchical architectures have been proposed to aggregate local features with MLPs [25, 26] such that the model can benefit from efficient sampling and grouping of the point set. Recent works [27, 28, 29] have focused on enhancing point-wise features by hand-crafting geometric connections such as curves, triangles, umbrella orientation, or affine transformations. Additionally, graphs have been introduced to points [30, 20, 31, 32, 33, 34, 35] and subsequent geometric representations, such as edges, contours, curvature, and connectivity. However, these strategies might lack generality due to the need to optimize hyperparameters for hand-crafted representations or graphs across datasets with varying densities or shape styles.

2) Convolution-Based Approaches: Inspired by the success of 2D convolution, various works have successively proposed novel point convolutions on points or point graphs. These methods dynamically learn convolutional weights through functions derived from local geometric connections. A popular category of methods in this domain focuses on designing weight functions. Among these, some approximate the weight function using MLPs [36, 37, 38, 13, 14, 33, 39], spline functions [11], a family of polynomial functions [12], or standard unparameterized Fourier functions [40]. Unlike these methods with dynamic convolution kernels, KCNet [9] and KPConv [10] predefine a set of fixed kernels (i.e., template points) in the local receptive field and then learn the weights on these kernels from the geometric connections between local points and template points using Gaussian and linear correlation functions, respectively. However, the number and position of kernel points need to be optimized for different datasets. Another approach [13, 14, 15] associates coefficients with kernels to further adjust the learned weights, where the coefficients are obtained through kernel density estimation, inverse density functions, and fuzzy functions of point coordinates.

3) Transformer-Based Approaches: Unlike convolution-based methods, which learn convolutional weights from low-level point coordinates, attention mechanisms learn attention weights from the connections between point features, thereby exploiting high-level contextual information. Motivated by the success of attention mechanisms in natural language processing and image processing tasks, early methods applied self-attention to global points through scalar dot-product [16, 17, 18, 19, 20, 21], but suffered from high computational costs. These early attention-based methods did not demonstrate superior performance due to the lack of employing position encoding.

The Point Transformer [23] introduces local vector attention to the local points. Additionally, it emphasizes the significance of point position encoding. Subsequent work, Point Transformer V2 [22], incorporates multi-grouping into vector attention inspired by the multi-head strategy. It also enhances position encoding by introducing an additional multiplier to the relation vector, which facilitates learning complex positional relations.

To exploit long-range contextual information, the Stratified Transformer [41] densely selects nearby points over a cubic window and sparsely selects distant points. This stratified strategy enlarges the effective receptive field without incurring too much computational overhead. However, by using a window-based approach, the Stratified Transformer focuses on an expanded local region rather than the global region. Moreover, hyper-parameters such as the window size and number of distant points must be optimized for different datasets with varying densities.

II-B Up-sampling in 3D Point Clouds

The hierarchical architecture of a network is instrumental in learning long-range contextual information through down-sampling and grouping operations. Up-sampling involves interpolating new points between known points and adjusting the features of these new points based on their mutual distance to propagate the learned context features to each point. Compared to the extensive research on down-sampling [14, 42, 21, 18, 43] and grouping operations [25, 26, 20, 44, 45, 46, 34, 33, 37], few works specifically emphasize the up-sampling operation. In the Point Transformer [23], the transition up module uses an interpolation operation to recover new point features from known point features through indexing and then integrates these features with those from the encoding layer via a skip connection. Similarly, in Point Transformer V2 [22], the fusion of encoding and decoding layer features is achieved through a skip connection. The distinguishing factor here is that the new point features are unpooled by grid unpooling instead of interpolation. While these methods are simple, they lack semantic awareness and ignore the contextual connection between the encoding and decoding layers.
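To make the interpolation-plus-skip-connection pattern described above concrete, the following is a minimal NumPy sketch. It assumes inverse-distance weighting over the three nearest coarse points and an additive skip connection; both choices, and all variable names, are illustrative assumptions rather than the exact modules of the cited networks.

```python
import numpy as np

rng = np.random.default_rng(4)
M, N, C, k = 4, 10, 3, 3                  # coarse points, dense points, channels, neighbours

p_coarse = rng.normal(size=(M, 3))        # low-resolution positions (decoder side)
f_coarse = rng.normal(size=(M, C))        # low-resolution features
p_dense = rng.normal(size=(N, 3))         # high-resolution positions (encoder side)
f_skip = rng.normal(size=(N, C))          # encoder features at the same resolution

# Distance from every dense point to every coarse point.
d = np.linalg.norm(p_dense[:, None] - p_coarse[None], axis=-1)   # (N, M)

# k nearest coarse points and their inverse-distance weights (assumed scheme).
idx = np.argsort(d, axis=1)[:, :k]                               # (N, k)
w = 1.0 / (np.take_along_axis(d, idx, axis=1) + 1e-8)            # (N, k)
w = w / w.sum(axis=1, keepdims=True)                             # weights sum to 1

# Interpolated features, then a plain additive skip connection.
f_up = (w[:, :, None] * f_coarse[idx]).sum(axis=1)               # (N, C)
f_fused = f_up + f_skip                                          # (N, C)
```

Note that the weights depend only on geometry, never on feature content, which is the lack of semantic awareness the paragraph above points out.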

Our survey highlights several gaps in current point cloud processing methods. Their fundamental units, such as point MLPs, convolutions, and transformers, completely ignore task-level information, which leads to sub-optimal performance in downstream tasks. Moreover, the upsampling methods do not effectively communicate between the encoding layers over different resolution points to refine the context information for high-level tasks. Motivated by these gaps, we propose a soft-masked transformer and skip-attention-based upsampling. In addition, we propose shared position encoding to reduce the network parameters and training time.

III Method

We revisit the two classical local vector attention-based point transformers, namely Point Transformer and Point Transformer V2, in Section III-A. Then, we present the SMTransformer block in Section III-B, followed by our skip attention-based up-sampling block in Section III-C. We introduce the shared position encoding strategy in Section III-D, and finally, in Section III-E, we provide details about our network.

III-A Rethinking Vector Attention based Transformer

Denote a point cloud $p_i \in \mathbb{R}^{N\times 3}$ (where $p_i$ defines point positions) and its corresponding features $f_i \in \mathbb{R}^{N\times C}$. $f_i$ is a feature vector that may contain attributes such as normal vectors and colour of the surface. $N$ and $C$ are the number of points and feature channels, respectively. We denote the $K$ neighbors of $p_i$ as $p_{ij} \in \mathbb{R}^{N\times K\times 3}$ and their corresponding features as $f_{ij} \in \mathbb{R}^{N\times K\times C}$. The position over the local receptive field can be expressed as $\Delta p_{ij}$. Point Transformer [23] on point cloud $p_i$ can be expressed as,

$$\mathcal{G}_{i}=\sum_{j=1}^{K}\mathcal{A}\big((K_{ij}^{f}\ominus Q_{i}^{f})\oplus\delta_{b}\big)\odot\big(V_{ij}^{f}\oplus\delta_{b}\big), \tag{1}$$

where $Q_{i}^{f}\in\mathbb{R}^{N\times C}$ is the query matrix of $f_i$. $K_{ij}^{f}$ and $V_{ij}^{f}\in\mathbb{R}^{N\times K\times C}$ are the key and value matrices of $f_{ij}$. The subtraction operation $\ominus$ between $Q_{i}^{f}$ and $K_{ij}^{f}$ is performed via broadcasting to ensure dimension compatibility. $\delta_{b}=\delta_{b}(\Delta p_{ij})$ is the position encoding bias function. $\mathcal{A}(\cdot)$ denotes the local vector attention function, implemented by an MLP followed by softmax. $\sum$ stands for the symmetric operation (SOP) (e.g. summation) and $\odot$ is the element-wise multiplication operation. $\oplus$ is the element-wise addition operation and $\mathcal{G}_{i}$ is the output feature of the transformer.
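The vector attention of Eq. (1) can be traced shape by shape in a short NumPy sketch. This is only an illustration under stated assumptions: the learned query, key, value, position-bias and attention maps are each replaced by a single random linear layer, brute-force nearest neighbours stand in for the grouping step, and all names (`Wq`, `Wb`, `Wa`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, C = 6, 3, 4                         # points, neighbours, channels (toy sizes)

p = rng.normal(size=(N, 3))               # point positions p_i
f = rng.normal(size=(N, C))               # point features f_i

# Brute-force K nearest neighbours; each point is its own first neighbour.
d2 = ((p[:, None, :] - p[None, :, :]) ** 2).sum(-1)    # (N, N) squared distances
idx = np.argsort(d2, axis=1)[:, :K]                    # (N, K) neighbour indices

# Hypothetical linear projections standing in for the learned maps.
Wq, Wk, Wv = (rng.normal(size=(C, C)) for _ in range(3))
Q = f @ Wq                                # (N, C)    queries Q_i^f
Kf = (f @ Wk)[idx]                        # (N, K, C) grouped keys K_ij^f
Vf = (f @ Wv)[idx]                        # (N, K, C) grouped values V_ij^f

# Position encoding bias delta_b(dp_ij), reduced to one linear layer (assumption).
dp = p[idx] - p[:, None, :]               # (N, K, 3) relative positions
Wb = rng.normal(size=(3, C))
delta_b = dp @ Wb                         # (N, K, C)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# A(.): MLP (one linear map here) followed by softmax over the K neighbours.
Wa = rng.normal(size=(C, C))
relation = (Kf - Q[:, None, :]) + delta_b              # broadcast subtraction, then bias
attn = softmax(relation @ Wa, axis=1)                  # (N, K, C): one weight per channel

# Element-wise weighting of the biased values, summed over neighbours.
G = (attn * (Vf + delta_b)).sum(axis=1)                # (N, C) output features
```

The attention tensor keeps the channel axis, so every channel receives its own weights over the neighbours; this is what distinguishes vector attention from scalar dot-product attention.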

To exploit the more complex geometric relationships between points, Point Transformer V2 [22] strengthens the position encoding with an additional multiplier to the relation feature vector (i.e. $K_{ij}^{f}-Q_{i}^{f}$), which can be formulated as,

$$\mathcal{G}_{i}=\sum_{j=1}^{K}\mathcal{A}\big(\delta_{m}(K_{ij}^{f}\ominus Q_{i}^{f})\oplus\delta_{b}\big)\odot\big(V_{ij}^{f}\oplus\delta_{b}\big), \tag{2}$$

where $\delta_{m}=\delta_{m}(\Delta p_{ij})$ is the position encoding multiplier function. The two classical point transformers above are illustrated in Fig. 1(a) and (b). The attention function $\mathcal{A}(\cdot)$ learns robust weights from the rich relationships, including the low-level geometric relationship (i.e. relation position) and the high-level contextual relationship (i.e. relation feature). Although the task influences the entire network, the transformer, as its fundamental unit, overlooks task-level information, resulting in the loss of crucial task-related details (e.g. semantic information) during the encoding and decoding stages.
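The difference between Eq. (1) and Eq. (2) lies only in the argument passed to the attention function, which a few lines of NumPy can make explicit. The sketch assumes that the multiplier in Eq. (2) acts element-wise on the relation vector and that both position encodings are single linear layers; the random tensors and the names `Wb` and `Wm` are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, C = 5, 4, 3                          # points, neighbours, channels (toy sizes)

relation = rng.normal(size=(N, K, C))      # stands in for K_ij^f (-) Q_i^f
dp = rng.normal(size=(N, K, 3))            # stands in for relative positions dp_ij

# Hypothetical one-layer position encodings.
Wb = rng.normal(size=(3, C))
Wm = rng.normal(size=(3, C))
delta_b = dp @ Wb                          # bias term, (N, K, C)
delta_m = dp @ Wm                          # multiplier term, (N, K, C)

logits_v1 = relation + delta_b             # argument of A(.) in Eq. (1)
logits_v2 = delta_m * relation + delta_b   # argument of A(.) in Eq. (2), element-wise reading
```

Under this reading, fixing the multiplier to all ones recovers the argument of Eq. (1), so the multiplier strictly extends the bias-only encoding.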

III-B Soft Masked Transformer Block (SMTB)

We propose a novel soft-masked transformer for point cloud processing. The soft-masked transformer can be expressed as,

$$\begin{aligned}\mathcal{G}_{i}&={\rm SMTransformer}(f_{i}),\\&=\sum_{j=1}^{K}\mathcal{S}(\cdot)\odot\mathcal{A}\big((K_{ij}^{f}\ominus Q_{i}^{f})\oplus\delta(\cdot)\big)\odot\big(V_{ij}^{f}\oplus\delta(\cdot)\big),\end{aligned} \tag{3}$$

where $\mathcal{S}(\cdot)$ is the soft mask function, which re-weights the attention weights. $\delta(\cdot)$ is the enhanced position encoding function, learning more complex relationships between points. We introduce them in detail in the following text.

III-B1 Soft Mask

The soft mask can be interpreted as the learnable coefficient of the attention function. Its significance lies in modelling the semantic context, a prior for calculating a task score difference. This difference is then used to softly mask the attention weights at the point level rather than the channel level. Inspired by the vector attention, SMTransformer divides the features $f_i$ into two entities: the score query $Q_{i}^{s}$ and the score key $K_{ij}^{s}$.

$$Q_{i}^{s}=W_{q}^{s}(f_{i}),\quad K_{ij}^{s}=G[W_{k}^{s}(f_{i})], \tag{4}$$

where $W_{q}^{s}$ and $W_{k}^{s}$ ($\mathbb{R}^{N\times C}\rightarrow\mathbb{R}^{N\times T}$) are prediction functions implemented as linear layers followed by softmax, where $T$ is the number of the task classes. $G[\cdot]$ is the grouping operation to obtain the task scores of neighbour points at point $p_i$. The soft mask is generated from the score difference $(K_{ij}^{s}-Q_{i}^{s})\in\mathbb{R}^{N\times K\times T}$ as,

$$S(\cdot)=S(Q_{i}^{s},K_{ij}^{s})=\|{\rm Max}({\rm Norm}(K_{ij}^{s}\ominus Q_{i}^{s}))\|_{2}, \tag{5}$$

where ${\rm Max}(\cdot)$ is the maximum, ${\rm Norm}(\cdot)$ is Min-Max Normalization, and $\|\cdot\|_{2}$ is the Euclidean Norm. The soft mask consists of real numbers ranging between 0 and 1, typically representing the probability values of identical labels among different points. This allows the model to assign higher importance to neighbouring points with distinct predicted labels. Taking segmentation as an example, the soft mask enhances the robustness of attention weights around class boundaries. Unlike traditional hard masks (i.e., binary masks), soft masks are more flexible and efficient as they do not require explicit rules or conditions for determination.
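As a rough illustration of Eqs. (4) and (5), the NumPy sketch below computes a soft mask and applies it to a stand-in attention map. The text does not state the axes over which the normalization, maximum and norm operate, so the choices here are assumptions: min-max normalization per centre point over neighbours and classes, maximum over the class axis, and the Euclidean norm taken of each resulting scalar (its absolute value). The neighbour indices and weight matrices are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, C, T = 6, 3, 4, 5                    # points, neighbours, channels, task classes

f = rng.normal(size=(N, C))                # point features f_i
idx = rng.integers(0, N, size=(N, K))      # placeholder neighbour indices (a model would use kNN)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Eq. (4): linear layer + softmax gives per-point task scores; G[.] gathers neighbours.
Wq_s = rng.normal(size=(C, T))
Wk_s = rng.normal(size=(C, T))
Q_s = softmax(f @ Wq_s)                    # (N, T)    score query
K_s = softmax(f @ Wk_s)[idx]               # (N, K, T) score key

diff = K_s - Q_s[:, None, :]               # (N, K, T) score difference

# Eq. (5) under ASSUMED axes (see lead-in).
lo = diff.min(axis=(1, 2), keepdims=True)
hi = diff.max(axis=(1, 2), keepdims=True)
normed = (diff - lo) / (hi - lo + 1e-12)   # min-max normalisation into [0, 1]
mask = np.abs(normed.max(axis=-1))         # (N, K): one scalar per neighbour

# Point-level re-weighting: the same scalar multiplies every channel of a neighbour.
attn = softmax(rng.normal(size=(N, K, C)), axis=1)   # stand-in for A(.)
masked_attn = mask[:, :, None] * attn                # (N, K, C)
```

Because the mask has one value per neighbour rather than per channel, it rescales a neighbour's whole attention vector, matching the point-level masking described above.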

III-B2 Enhanced Position Encoding

Most existing position encoding methods in local point transformers focus solely on local positions. While this approach greatly assists the transformer in understanding local shapes, it struggles to capture long-range shapes beyond the limited local receptive field. Therefore, global position encoding is equally important as local position encoding. Similarly, SMTransformer encodes the global point position into two entities: position query $Q_{i}^{p}$ and position key $K_{ij}^{p}$.

$$Q_{i}^{p}=W_{q}^{p}(p_{i}),\quad K_{ij}^{p}=G[Q_{i}^{p}], \tag{6}$$

where $W_{q}^{p}$ ($\mathbb{R}^{N\times 3}\rightarrow\mathbb{R}^{N\times C}$) is the global position encoding function, implemented as an MLP. The local relative position information can be expressed as $(K_{ij}^{p}-Q_{i}^{p})\oslash\Delta p_{ij}$, where $\oslash$ denotes the concatenation operation. The enhanced position encoding can be expressed as,

$$\delta(\cdot)=\delta\big((K_{ij}^{p}-Q_{i}^{p})\oslash\Delta p_{ij}\big), \tag{7}$$

where $\delta$ is the local position encoding function implemented by MLPs. By constructing the query and key matrices of the global point positions, SMTransformer can bypass the local receptive field limitation and learn global geometric information.
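The computation in Eqs. (6) and (7) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the single-layer `mlp` helper, the brute-force `knn_indices` grouping, the random weights, and the reading $\Delta p_{ij}=p_{j}-p_{i}$ are all chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w, b):
    """One linear layer followed by ReLU (stand-in for the paper's MLPs)."""
    return np.maximum(x @ w + b, 0.0)

def knn_indices(points, k):
    """Brute-force k nearest neighbours; returns an (N, k) index array."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return np.argsort(dists, axis=1)[:, :k]

def enhanced_position_encoding(points, k, w_q, b_q, w_d, b_d):
    idx = knn_indices(points, k)                    # grouping G: (N, k)
    q_pos = mlp(points, w_q, b_q)                   # Q^p, Eq. (6): (N, C)
    k_pos = q_pos[idx]                              # K^p = G[Q^p]: (N, k, C)
    delta_p = points[idx] - points[:, None, :]      # assumed p_j - p_i: (N, k, 3)
    rel = k_pos - q_pos[:, None, :]                 # K^p - Q^p: (N, k, C)
    feat = np.concatenate([rel, delta_p], axis=-1)  # concatenation: (N, k, C + 3)
    return mlp(feat, w_d, b_d)                      # delta(.), Eq. (7): (N, k, C)

N, K, C = 64, 8, 16
points = rng.normal(size=(N, 3))
w_q, b_q = rng.normal(size=(3, C)) * 0.1, np.zeros(C)
w_d, b_d = rng.normal(size=(C + 3, C)) * 0.1, np.zeros(C)
pe = enhanced_position_encoding(points, K, w_q, b_q, w_d, b_d)
print(pe.shape)  # (64, 8, 16)
```

The global term enters only through `q_pos`, which is computed once for all $N$ points before grouping, while `delta_p` carries the purely local geometry.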

Figure 1: Comparison of the attention and position encoding in Transformers. (a) The vector attention with position encoding bias in Point Transformer, see Eq.(1). (b) The vector attention with position encoding multiplier in Point Transformer V2, see Eq.(2). (c) The vector attention with soft mask and enhanced position encoding bias in our proposed SMTransformer, see Eq.(3).

To compare the differences between SMTransformer and the classical vector attention-based point transformer, we illustrate their architectures in Fig. 1. There are two key distinctions:

i) Both Point Transformer and Point Transformer V2 emphasize learning contextual relationships. In contrast, SMTransformer not only grasps contextual relationships through vector attention but also integrates a soft mask, derived from the task at hand, as a coefficient of the attention function.

ii) Point Transformer and Point Transformer V2 both effectively capture local fine-grained details through local position encoding. In contrast, SMTransformer introduces an innovative enhanced position encoding that represents positions across the global point cloud, enabling modelling of the global shape without being confined to local receptive fields. Additionally, it encodes positions across local points, allowing for learning fine-grained details.

Residual connections are instrumental in training deep neural networks, facilitating gradient flow during backpropagation. Therefore, we combine the Soft Masked Transformer with residual connections to construct a transformer block. As illustrated in Fig. 4.(b), the Soft Masked Transformer Block (SMTB) can be expressed as,

$$\begin{aligned} f_{i} &= {\rm Linear}(f_{in1}), \\ \mathcal{G}_{i} &= {\rm SMTransformer}(f_{i}), \\ f_{out1} &= {\rm Linear}(\mathcal{G}_{i}\oplus f_{i}), \end{aligned} \tag{8}$$

where $f_{in1}$ is the input feature and $f_{out1}$ is the output feature of the SMTB. The projection layer ${\rm Linear}(\cdot)$ consists of one linear layer, one batch normalization layer, and one ReLU layer.
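The three steps of Eq. (8) can be sketched as below. This is an illustrative NumPy sketch only: the batch normalization uses batch statistics without learnable scale and shift, $\oplus$ is read as element-wise addition, and `attention_fn` is a hypothetical placeholder for the SMTransformer attention, which is not reproduced here.

```python
import numpy as np

def projection(x, w, b, eps=1e-5):
    """Linear -> batch normalization (batch statistics only) -> ReLU."""
    y = x @ w + b
    y = (y - y.mean(axis=0)) / np.sqrt(y.var(axis=0) + eps)
    return np.maximum(y, 0.0)

def smtb(f_in, params, attention_fn):
    """Soft Masked Transformer Block following the three steps of Eq. (8)."""
    f = projection(f_in, params["w1"], params["b1"])      # f_i = Linear(f_in1)
    g = attention_fn(f)                                   # G_i = SMTransformer(f_i)
    return projection(g + f, params["w2"], params["b2"])  # f_out1 = Linear(G_i + f_i)

rng = np.random.default_rng(0)
N, C = 32, 8
params = {
    "w1": rng.normal(size=(C, C)), "b1": np.zeros(C),
    "w2": rng.normal(size=(C, C)), "b2": np.zeros(C),
}

def identity_attention(f):
    """Hypothetical placeholder; the real block applies soft-masked vector attention."""
    return f

f_out = smtb(rng.normal(size=(N, C)), params, identity_attention)
print(f_out.shape)  # (32, 8)
```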

III-C Skip Attention-based Up-sampling Block (SAUB)

To facilitate deep communication between features over various resolution points, we introduce a skip attention-based up-sampling block that combines conventional unpooling with a learnable unit to learn and refine contextual information between features from the encoding and decoding layers across different resolutions.

As illustrated in Fig. 4.(c), the block takes skip feature I $f_{h}\in\mathbb{R}^{M\times C_{h}}$ and skip feature II $f_{l}\in\mathbb{R}^{m\times C_{l}}$ from two adjacent encoding layers over the high- and low-resolution points, respectively. To build the communication between the different resolution point features, we first balance their feature dimension and point resolution,

$$f_{mid}={\rm Gridup}\big({\rm Linear}(f_{in2}\oslash f_{l})\big), \tag{9}$$

where ${\rm Linear}(\cdot)$ serves as the projection layer; its primary role is to integrate the input features $f_{in2}$ and skip features II $f_{l}$, thereby augmenting the dimension of the low-resolution point features to match that of the high-resolution point features ($\mathbb{R}^{m\times C_{l}}\rightarrow\mathbb{R}^{m\times C_{h}}$), where $m$ is the number of low-resolution points. ${\rm Gridup}$ is the common practice of unpooling, implemented by grid-based unpooling, to augment the resolution ($\mathbb{R}^{m\times C_{h}}\rightarrow\mathbb{R}^{M\times C_{h}}$), where $M$ is the number of high-resolution points. With these two steps, the low-resolution features are expanded in terms of both feature dimension and point resolution.
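A minimal sketch of Eq. (9) is given below, under stated assumptions: the projection is reduced to a plain linear map, the channel widths are arbitrary, and `cluster` is a hypothetical index array that stores, for each of the $M$ high-resolution points, the grid cell (low-resolution point) it was pooled into.

```python
import numpy as np

def grid_unpool(f_low, cluster):
    """Copy each low-resolution feature to every high-resolution point in its cell."""
    return f_low[cluster]                          # (m, C) -> (M, C)

def expand_low_res(f_in2, f_l, w, b, cluster):
    fused = np.concatenate([f_in2, f_l], axis=1)   # concatenation: (m, C_in + C_l)
    projected = fused @ w + b                      # Linear: (m, C_h)
    return grid_unpool(projected, cluster)         # Gridup: (M, C_h)

rng = np.random.default_rng(0)
m, M, C_in, C_l, C_h = 4, 10, 6, 6, 3
cluster = np.array([0, 0, 1, 1, 1, 2, 3, 3, 3, 3])  # cell id of each high-res point
f_mid = expand_low_res(
    rng.normal(size=(m, C_in)), rng.normal(size=(m, C_l)),
    rng.normal(size=(C_in + C_l, C_h)), np.zeros(C_h), cluster,
)
print(f_mid.shape)  # (10, 3)
```

Points that share a grid cell receive identical features after unpooling, which is why a learnable refinement step on top of the unpooled features is useful.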

To build deep communication between the expanded low-resolution features and skip features I, we propose the skip attention,

$$Q_{i}=w_{q}(f_{l}),\quad K_{ij}=G[w_{k}(f_{mid})],\quad V_{ij}=G[w_{v}(f_{mid})], \tag{10}$$
$$\mathcal{G}_{i}^{sa}=\sum_{j=1}^{K}\mathcal{A}\big((K_{ij}\ominus Q_{i})\oplus\delta(\cdot)\big)\odot\big(V_{ij}\oplus\delta(\cdot)\big), \tag{11}$$

where $\delta(\cdot)$ is the proposed enhanced position encoding and $\mathcal{G}_{i}^{sa}$ is the output of skip attention. The skip attention uses $Q_{i}$, derived from high-resolution features $f_{l}$, as the query and $K_{ij}$, derived from the expanded low-resolution features $f_{mid}$, as the key to learn the contextual connection (i.e., the attention map) between points of different resolutions.

To prevent the loss of important information, we utilize a residual connection that combines the output of the skip attention with its input and the skip features,

$$f_{out2}={\rm Linear}(\mathcal{G}_{i}^{sa}\oplus f_{l}\oplus f_{mid}). \tag{12}$$
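The skip attention of Eqs. (10) to (12) can be sketched as follows. This is a hedged illustration, not the authors' code: $\mathcal{A}$ is assumed to be a linear map followed by a softmax over the $K$ neighbours, $\ominus$, $\oplus$ and $\odot$ are read as element-wise subtraction, addition and multiplication, the neighbour indices and the position encoding `pe` are random stand-ins, and the final projection is a plain linear map. Because the grouped keys live on the $M$ high-resolution points, the query-side features (`f_query`, a hypothetical name) are assumed to be defined on the same $M$ points with matching channel width.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def skip_attention(f_query, f_mid, idx, pe, w_q, w_k, w_v, w_a):
    q = f_query @ w_q                            # Q_i: (M, C)
    k = (f_mid @ w_k)[idx]                       # K_ij = G[w_k(f_mid)]: (M, K, C)
    v = (f_mid @ w_v)[idx]                       # V_ij = G[w_v(f_mid)]: (M, K, C)
    logits = ((k - q[:, None, :]) + pe) @ w_a    # linear part of the assumed A
    weights = softmax(logits, axis=1)            # normalise over the K neighbours
    return (weights * (v + pe)).sum(axis=1)      # Eq. (11): (M, C)

rng = np.random.default_rng(0)
M, K, C = 12, 4, 8
f_query = rng.normal(size=(M, C))       # assumed query-side features on M points
f_mid = rng.normal(size=(M, C))         # expanded low-resolution features, Eq. (9)
idx = rng.integers(0, M, size=(M, K))   # hypothetical neighbour indices
pe = rng.normal(size=(M, K, C))         # stand-in for the position encoding
w_q, w_k, w_v, w_a, w_o = (rng.normal(size=(C, C)) * 0.1 for _ in range(5))

g_sa = skip_attention(f_query, f_mid, idx, pe, w_q, w_k, w_v, w_a)
f_out = (g_sa + f_query + f_mid) @ w_o  # Eq. (12) with the residual read as addition
print(g_sa.shape, f_out.shape)  # (12, 8) (12, 8)
```

Since the attention weights are computed per channel, each output channel can attend to a different neighbour, which is the defining property of vector attention.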

To compare the differences between SAUB and the classical upsampling block (i.e., Transition Up [22]), we illustrate their diagrams in Fig. 2. The key differences lie in two aspects:

i) The classical upsampling block only learns the connection between features from the encoding and decoding layers over the same resolution points. In contrast, our proposed SAUB can learn the connection between features from the encoding and decoding layers of different resolution points.

ii) The classical upsampling block uses a simple skip connection to learn the contextual information between the features from the encoding and decoding layers. However, we propose skip attention to refine the contextual information between the features.

Figure 2: Comparison of (a) Transition up and (b) Skip attention-based up-sampling. ‘interpo.’ stands for the interpolation operation. ‘grid up’ stands for the grid-based unpooling. ‘SA’ denotes the skip attention.

III-D Shared Position Encoding

In conventional point transformer networks, the transformer blocks within the same encoding or decoding layer, which operate over points of the same resolution, typically each use their own position encoding information. We refer to this practice as unshared position encoding. Intuitively, points at the same position in the point cloud should have the same position information. Building on this intuition, we propose the shared position encoding strategy. Under this strategy, transformer blocks within the same encoding or decoding layer share identical position encoding information over the same resolution points. This approach enhances network robustness and efficiency, particularly in large-scale scene processing. The two strategies are compared in Fig. 3. The robustness and efficiency experiments are provided in Section IV-D.
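The difference between the two strategies can be illustrated with a toy sketch. The `PositionEncoder` class, the single linear map it uses, and the toy block body are hypothetical and serve only to count parameters and encoder evaluations; they are not the network's actual layers.

```python
import numpy as np

class PositionEncoder:
    """Hypothetical stand-in for the position encoding: one linear map."""
    def __init__(self, rng, c_out):
        self.w = rng.normal(size=(3, c_out))
        self.calls = 0

    def __call__(self, rel_pos):
        self.calls += 1
        return rel_pos @ self.w

    @property
    def num_params(self):
        return self.w.size

def run_layer(rel_pos, feats, encoders):
    """Run one block per entry of `encoders`; repeated entries reuse the cached encoding."""
    cache = {}
    for enc in encoders:
        if id(enc) not in cache:
            cache[id(enc)] = enc(rel_pos)
        feats = feats + cache[id(enc)].mean(axis=1)  # toy block body
    return feats

rng = np.random.default_rng(0)
N, K, C, num_blocks = 16, 4, 8, 6
rel_pos = rng.normal(size=(N, K, 3))
feats = np.zeros((N, C))

unshared = [PositionEncoder(rng, C) for _ in range(num_blocks)]
shared_enc = PositionEncoder(rng, C)
shared = [shared_enc] * num_blocks

run_layer(rel_pos, feats, unshared)
run_layer(rel_pos, feats, shared)
unshared_params = sum(e.num_params for e in unshared)
shared_params = shared_enc.num_params
print(unshared_params, shared_params)                    # 144 24
print(sum(e.calls for e in unshared), shared_enc.calls)  # 6 1
```

In this toy setting, a layer with six blocks needs one encoder and one evaluation instead of six of each, which is the source of the parameter and training-time savings described above.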

Figure 3: (a) Unshared position encoding: Various transformer blocks (coloured in grey) within the same encoding or decoding layer (coloured in yellow), operating over the same resolution point cloud, require different position information (coloured in blue). (b) Shared position encoding: Various transformer blocks within the encoding and decoding layers share position information over the same resolution point cloud.
Figure 4: (a) Network architecture for semantic segmentation. (b) Soft masked transformer block and (c) skip attention-based up-sampling block.

III-E Network Architecture

We employ a U-Net-like architecture comprising five encoding and decoding layers with skip connections for the semantic segmentation task. The first encoding and decoding layers consist of one MLP and SMTransformer block. The subsequent encoding layers incorporate one Grid Pooling layer [22] followed by several SMTransformer blocks. The number of SMTransformer blocks in the five encoding layers is [1, 2, 2, 6, 2]. The subsequent decoding layers consist of one skip attention-based up-sampling block and one SMTransformer block. We set the feature dimensions C𝐶Citalic_C as [32, 64, 128, 256, 512] for the 5 encoding and decoding layers. At the network’s end, we append an MLP to predict the final point-wise labels. The network architecture for semantic segmentation is illustrated in Fig. 4.(a).
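The layer layout described above can be written down as a small configuration sketch. The dictionary structure, the key names, and the convention that stage 0 is the highest resolution are assumptions made for illustration; only the block counts and feature dimensions come from the text.

```python
ENCODER_DEPTHS = [1, 2, 2, 6, 2]        # SMTransformer blocks per encoding layer
CHANNELS = [32, 64, 128, 256, 512]      # feature dimension C per layer

def build_layout(depths, channels):
    """Return a hypothetical description of the segmentation network."""
    encoder, decoder = [], []
    for stage, (depth, c) in enumerate(zip(depths, channels)):
        head = "MLP" if stage == 0 else "GridPooling"
        encoder.append({"stage": stage, "channels": c, "ops": [head] + ["SMTB"] * depth})
    for stage, c in enumerate(channels):
        head = "MLP" if stage == 0 else "SAUB"
        decoder.append({"stage": stage, "channels": c, "ops": [head, "SMTB"]})
    return {"encoder": encoder, "decoder": decoder, "head": "MLP"}

layout = build_layout(ENCODER_DEPTHS, CHANNELS)
total_encoder_blocks = sum(op == "SMTB" for s in layout["encoder"] for op in s["ops"])
print(total_encoder_blocks)  # 13
```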

For the classification task, we utilize the basic PointNeXt [47] as the backbone and replace one of its MLP blocks with an SMTransformer block to form the new network architecture for classification. Further details regarding the configurations of the segmentation and classification networks are in Section IV.

IV Experimental Results

We evaluate our network on three tasks namely, semantic segmentation of indoor scenes, semantic segmentation of outdoor scenes, and shape classification. We also perform detailed ablation studies to demonstrate the effectiveness and robustness of the proposed Soft Mask Transformer Block, Skip Attention-based up-sampling Block and the Shared Position Encoding strategy.

TABLE I: Semantic segmentation results on the S3DIS dataset Area-5. We report the mean class-wise Intersection over Union (mIoU), mean class-wise accuracy (mAcc), and overall accuracy (OA). The best result is highlighted in bold, and the second best is underlined. ‘-’ marks values that are not reported.

| Year | Methods | mIoU | mAcc | OA | ceil. | floor | wall | beam | column | window | door | chair | table | bookcase | sofa | board | clut. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017 CVPR | PointNet[24] | 41.09 | 48.98 | - | 88.80 | 97.33 | 69.80 | 0.05 | 3.92 | 46.26 | 10.76 | 52.61 | 58.93 | 40.28 | 5.85 | 26.38 | 33.22 |
| 2018 NIPS | PointCNN[36] | 57.26 | 63.86 | 85.9 | 92.3 | 98.2 | 79.4 | 0.0 | 17.6 | 22.8 | 62.1 | 74.4 | 80.6 | 31.7 | 66.7 | 62.1 | 56.7 |
| 2019 ICCV | KPConv[10] | 67.1 | 72.8 | - | 92.8 | 97.3 | 82.4 | 0.0 | 23.9 | 58.0 | 69.0 | 91.0 | 81.5 | 75.3 | 75.4 | 66.7 | 58.9 |
| 2020 PAMI | SPH3D-GCN[48] | 59.5 | 65.9 | - | 93.3 | 97.1 | 81.1 | 0.0 | 33.2 | 45.8 | 43.8 | 79.7 | 86.9 | 33.2 | 71.5 | 54.1 | 53.7 |
| 2020 CVPR | PointASNL[21] | 62.6 | 68.5 | 87.7 | 94.3 | 98.4 | 79.1 | 0.0 | 26.7 | 55.2 | 66.2 | 86.8 | 83.3 | 68.3 | 47.6 | 56.4 | 52.1 |
| 2020 CVPR | SegGCN[15] | 63.6 | 70.4 | - | 93.7 | 98.6 | 80.6 | 0.0 | 28.5 | 42.6 | 74.5 | 80.9 | 88.7 | 69.0 | 71.3 | 44.4 | 54.3 |
| 2021 CVPR | PAConv[49] | 66.6 | 73.0 | - | 94.5 | 98.6 | 82.4 | 0.0 | 26.4 | 58.0 | 60.0 | 89.7 | 80.4 | 74.3 | 69.8 | 73.5 | 57.7 |
| 2021 CVPR | BAAF-Net[50] | 65.4 | 73.1 | 88.9 | 92.9 | 97.9 | 82.3 | 0.0 | 23.1 | 65.5 | 64.9 | 87.5 | 78.5 | 70.7 | 61.4 | 68.7 | 57.2 |
| 2021 ICCV | Point Transformer[23] | 70.4 | 76.5 | 90.8 | 94.0 | 98.5 | 86.3 | 0.0 | 38.0 | 63.4 | 74.3 | 82.4 | 89.1 | 80.2 | 74.3 | 76.0 | 59.3 |
| 2022 CVPR | CBL[51] | 69.4 | 75.2 | 90.6 | 93.9 | 98.4 | 84.2 | 0.0 | 37.0 | 57.7 | 71.9 | 81.8 | 91.7 | 75.6 | 77.8 | 69.1 | 62.9 |
| 2022 CVPR | RepSurf-U[27] | 68.9 | 76.0 | 90.2 | - | - | - | - | - | - | - | - | - | - | - | - | - |
| 2022 CVPR | Stratified Transformer[41] | 72.0 | 78.1 | 91.5 | - | - | - | - | - | - | - | - | - | - | - | - | - |
| 2022 ECCV | PointMixer[52] | 71.4 | 77.4 | - | 94.2 | 98.2 | 86.0 | 0.0 | 43.8 | 62.1 | 78.5 | 82.2 | 90.8 | 79.8 | 73.9 | 78.5 | 59.4 |
| 2022 NIPS | PointNeXt[47] | 71.1 | 77.2 | 91.0 | 94.2 | 98.5 | 84.4 | 0.0 | 37.7 | 59.3 | 74.0 | 91.6 | 83.1 | 77.2 | 77.4 | 78.8 | 60.6 |
| 2022 NIPS | PointTransformerV2[22] | 71.6 | 77.9 | 91.1 | - | - | - | - | - | - | - | - | - | - | - | - | - |
| 2023 TCSVT | LCPFormer[53] | 70.2 | 76.8 | 90.8 | - | - | - | - | - | - | - | - | - | - | - | - | - |
| 2023 TCSVT | SAKS[54] | 68.8 | 74.0 | 90.8 | 95.2 | 98.6 | 84.1 | 0.0 | 27.5 | 58.5 | 75.1 | 80.4 | 90.8 | 69.0 | 77.0 | 73.5 | 62.1 |
| 2023 TNNLS | PicassoNet++[55] | 71.0 | 77.2 | 91.3 | 94.4 | 98.4 | 87.5 | 0.0 | 46.9 | 63.7 | 75.5 | 81.4 | 90.3 | 71.3 | 76.2 | 76.7 | 61.1 |
| 2023 CVPR | Point Vector[29] | 72.3 | 78.1 | 91.0 | 95.1 | 98.6 | 85.1 | 0.0 | 41.4 | 60.8 | 76.7 | 92.1 | 84.4 | 77.2 | 82.0 | 85.1 | 61.4 |
| - | SMTransformer(ours) | 73.4 | 78.9 | 91.8 | 95.2 | 98.7 | 87.7 | 0.0 | 45.8 | 64.8 | 75.2 | 85.2 | 92.7 | 86.7 | 76.8 | 83.5 | 62.4 |
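TABLE II: Semantic segmentation results on S3DIS with 6-fold cross validation. ‘-’ marks values that are not reported.

| Year | Methods | mIoU | mAcc | OA | ceil. | floor | wall | beam | column | window | door | chair | table | bookcase | sofa | board | clut. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017 CVPR | PointNet[24] | 47.6 | 66.2 | 78.6 | 88.0 | 88.7 | 69.3 | 42.4 | 23.1 | 47.5 | 51.6 | 42.0 | 54.1 | 38.2 | 9.6 | 29.4 | 35.2 |
| 2018 NIPS | PointCNN[36] | 65.4 | 75.6 | 88.1 | 94.8 | 97.3 | 75.8 | 63.3 | 51.7 | 58.4 | 57.2 | 69.1 | 71.6 | 61.2 | 39.1 | 52.2 | 58.6 |
| 2019 CVPR | PointWeb[30] | 66.7 | 76.2 | 87.3 | 93.5 | 94.2 | 80.8 | 52.4 | 41.3 | 64.9 | 68.1 | 67.1 | 71.4 | 62.7 | 50.3 | 62.2 | 58.5 |
| 2019 ICCV | KPConv[10] | 70.6 | 79.1 | - | 93.6 | 92.4 | 83.1 | 63.9 | 54.3 | 66.1 | 76.6 | 64.0 | 57.8 | 74.9 | 69.3 | 61.3 | 60.3 |
| 2020 PAMI | SPH3D-GCN[48] | 68.9 | 77.9 | 88.6 | 93.3 | 96.2 | 81.9 | 58.6 | 55.9 | 55.9 | 71.7 | 82.4 | 72.1 | 64.5 | 48.5 | 54.8 | 60.4 |
| 2020 CVPR | PointASNL[21] | 68.7 | 79.0 | 88.8 | 95.3 | 97.9 | 81.9 | 47.0 | 48.0 | 67.3 | 70.5 | 77.8 | 71.3 | 60.4 | 50.7 | 63.0 | 62.8 |
| 2020 CVPR | RandLA-Net[42] | 70.0 | 82.0 | 88.0 | 93.1 | 96.1 | 80.6 | 62.4 | 48.0 | 64.4 | 69.4 | 76.4 | 69.4 | 64.2 | 60.0 | 65.9 | 60.1 |
| 2021 CVPR | PAConv[49] | 69.3 | 78.7 | - | 94.3 | 93.5 | 82.8 | 56.9 | 45.7 | 65.2 | 74.9 | 59.7 | 74.6 | 67.4 | 61.8 | 65.8 | 58.4 |
| 2021 CVPR | SCF-Net[56] | 71.6 | 82.7 | 88.4 | 93.3 | 96.4 | 80.9 | 64.9 | 47.4 | 64.5 | 70.1 | 81.6 | 71.4 | 64.4 | 67.2 | 67.5 | 60.9 |
| 2021 CVPR | BAAF-Net[50] | 72.2 | 83.1 | 88.9 | 93.3 | 96.8 | 81.6 | 61.9 | 49.5 | 65.4 | 73.3 | 83.7 | 72.0 | 64.3 | 67.5 | 67.0 | 62.4 |
| 2021 ICCV | Point Transformer[23] | 73.5 | 81.9 | 90.2 | 94.3 | 97.5 | 84.7 | 55.6 | 58.1 | 66.1 | 78.2 | 74.1 | 77.6 | 71.2 | 67.3 | 65.7 | 64.8 |
| 2022 NIPS | PointNeXt[47] | 74.9 | 83.0 | 90.3 | - | - | - | - | - | - | - | - | - | - | - | - | - |
| 2022 CVPR | RepSurf-U[27] | 74.3 | 82.6 | 90.8 | - | - | - | - | - | - | - | - | - | - | - | - | - |
| 2022 CVPR | CBL[51] | 73.1 | 79.4 | 89.6 | 94.1 | 94.2 | 85.5 | 50.4 | 58.8 | 70.3 | 78.3 | 75.0 | 75.7 | 74.0 | 71.8 | 60.0 | 62.4 |
| 2023 CVPR | Point Vector[29] | 78.4 | 86.1 | 91.9 | - | - | - | - | - | - | - | - | - | - | - | - | - |
| - | SMTransformer(ours) | 79.0 | 86.9 | 91.9 | 97.4 | 98.3 | 89.4 | 68.0 | 66.1 | 70.4 | 78.4 | 82.6 | 84.0 | 78.5 | 72.2 | 73.2 | 68.5 |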
Figure 5: Visualization of semantic segmentation results on S3DIS Area-5. The red boxes highlight the object boundaries in the scenes where our proposed SMTransformer performs particularly better than the Point Transformer V2 (PTv2).
TABLE III: Semantic segmentation results (mIoU) on ScanNetV2 validation and test set.
Year Methods Input Val(%) Test(%)
2018 NIPS PointNet++[25] point 55.7 53.5
2018 CVPR SparseConvNet[57] voxel 72.5 69.3
2019 CVPR PointConv[13] point 66.6 61.0
2020 CVPR PointASNL[21] point 63.5 66.6
2019 ICCV MVPNet[58] point 66.4
2019 ICCV KPConv[10] point 69.2 68.6
2019 3DV JointPointBased[59] point 69.2 63.4
2019 CVPR MinkowskiNet[60] voxel 72.2 73.6
2022 CVPR RepSurf-U[27] point 70.0
2022 CVPR Stratified Transformer[41] point 74.3 73.7
2021 CVPR PointTransformer[23] point 70.6
2022 CVPR FastPointTransformer[61] voxel 72.0
2022 NIPS PointTransformerV2[22] point 75.4 75.2
2023 TNNLS PicassoNet++[55] mesh 69.2
SMTransformer (ours) point 75.9 75.7

IV-A Indoor Semantic Segmentation

Datasets: We evaluate our network on two large-scale indoor scene datasets, namely S3DIS [62] and ScanNetV2 [63]. The S3DIS dataset consists of RGB-D point clouds annotated point-wise with 13 classes. It encompasses 271 rooms from 6 large-scale indoor scenes, totalling 6020 square meters. We utilize 6-dimensional point features, including 3-dimensional normalized colour and 3-dimensional normalized location. For evaluation, we conduct a 6-fold cross-validation on S3DIS and focus more extensively on comparisons using Area 5 as the test set, which is distinct from the other areas and not within the same building.

The ScanNetV2 dataset comprises coloured point clouds of indoor scenes with point-wise semantic labels for 20 object categories. It is divided into 1201 scenes for training and 312 for validation. Our approach utilizes 9-dimensional point features corresponding to 3-dimensional normalized colour, 3-dimensional normalized location, and 3-dimensional normals.

Network Configurations: For semantic segmentation on S3DIS, we set the voxel size to 4cm and the maximum number of voxels to 60,000. We adopt the SGD optimizer with a weight decay of 0.0001. The base learning rate is set to 0.4 and is decayed by MultiStepLR at the 40th and 80th epochs. We train and test the model with batch sizes of 16 and 8, respectively, on 4 GPUs. We adopt random scaling, random flip, chromatic contrast, chromatic translation, chromatic jitter and hue saturation translation to augment training data. We set the grid size in grid pooling as [0.08, 0.1, 0.2, 0.4]cm and the number of neighbour points in SMTransformer as 16.
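The voxel-based preprocessing mentioned above can be sketched as a simple voxel-grid subsampling step. This is an assumption-laden illustration rather than the pipeline actually used: coordinates are taken to be in metres (so 4cm becomes 0.04), one arbitrary point is kept per occupied voxel, and the cap on the number of voxels is enforced by random dropping.

```python
import numpy as np

def voxel_subsample(points, voxel_size, max_voxels, rng):
    """Keep one point per occupied voxel, then cap the number of kept voxels."""
    coords = np.floor(points / voxel_size).astype(np.int64)
    _, first = np.unique(coords, axis=0, return_index=True)  # one point per voxel
    keep = np.sort(first)
    if keep.size > max_voxels:
        # Assumed capping rule: drop voxels uniformly at random.
        keep = np.sort(rng.choice(keep, size=max_voxels, replace=False))
    return points[keep]

rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(5000, 3))   # toy scene in a 1 m cube
sub = voxel_subsample(points, voxel_size=0.04, max_voxels=60_000, rng=rng)
print(sub.shape[0] <= points.shape[0])  # True
```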

On ScanNet, we set the voxel size to 2cm and the maximum number of voxels to 100,000. We use the Adam optimizer with a weight decay of 0.02. The base learning rate is set to 0.02 and is decayed by MultiStepLR every 40 epochs. We train and test the model with batch size 24 on 4 GPUs. We adopt random rotation, random scaling, random flip, elastic distortion, chromatic contrast, chromatic translation, chromatic jitter and hue saturation translation to augment training data. The grid size is set as [0.04, 0.12, 0.36, 1.08]cm, and the number of neighbour points is set as 16. At test time, the network uses test-time augmentation, following Point Transformer V2 [22] and Stratified Transformer [41].

Results: We compare our method with recent state-of-the-art methods on the S3DIS dataset using three metrics, i.e., mean class-wise intersection over union (mIoU), mean class-wise accuracy (mAcc) and overall accuracy (OA). Results are reported in Table I. Our network demonstrates superior performance on all three metrics, i.e., 73.4% mIoU, 78.9% mAcc and 91.8% OA. It achieves the top 2 results on 9 out of 13 classes including ceiling, floor, wall, column, window, table, bookcase, board, clutter. Notably, the segmentation performance on the bookcase class exceeds the second-best method by 6.5% IoU. Compared to the previous state-of-the-art point transformers (e.g., Point Transformer V2 and Stratified Transformer), our network outperforms them by 1.8% and 1.4% in terms of mIoU, respectively. Compared to the MLP-based method (e.g., Point Vector), the performance of our method exceeds it by 1.1% mIoU. Compared to LCPFormer and SAKS, our method outperforms them by a large margin on all metrics. Fig. 5 shows visualizations of our results on S3DIS Area 5 in comparison to Point Transformer V2. We can see that our method is more robust at object boundaries.
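For reference, the three metrics can be computed from a confusion matrix as sketched below. The function name and the toy 3-class matrix are illustrative assumptions; classes that appear in neither the ground truth nor the predictions (which would cause a division by zero) are not handled in this sketch.

```python
import numpy as np

def segmentation_metrics(conf):
    """conf[i, j] = number of points with ground-truth class i predicted as class j."""
    conf = np.asarray(conf, dtype=np.float64)
    tp = np.diag(conf)                 # correctly labelled points per class
    gt = conf.sum(axis=1)              # ground-truth points per class
    pred = conf.sum(axis=0)            # predicted points per class
    iou = tp / (gt + pred - tp)        # per-class intersection over union
    acc = tp / gt                      # per-class accuracy
    return {"mIoU": iou.mean(), "mAcc": acc.mean(), "OA": tp.sum() / conf.sum()}

conf = [[8, 2, 0],
        [1, 6, 3],
        [0, 1, 9]]
m = segmentation_metrics(conf)
print({name: round(float(value), 4) for name, value in m.items()})
```

mIoU and mAcc average over classes, so rare classes weigh as much as frequent ones, whereas OA is dominated by the classes with the most points.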

Table II shows results with the 6-fold cross-validation setting on the S3DIS dataset. Our method again achieves state-of-the-art results of 79.0% mIoU, 86.9% mAcc and 91.9% OA. It achieves the best results on 12 out of 13 classes including ceiling, floor, wall, beam, column, window, door, table, bookcase, sofa, board and clutter.

The ScanNetV2 validation and test set results are reported in Table III. Compared to the Stratified Transformer, our method exhibits a substantial improvement of +1.6% mIoU and +2.0% mIoU on the validation and test sets, respectively. Against the Point Transformer V2, our method improves mIoU by +0.5% on both the validation and test sets.

TABLE IV: Semantic segmentation results on the SemanticKITTI test set. ‘*’ means the network is pre-trained on other datasets.

| Year | Methods | mIoU(%) | car | bicycle | motorcycle | truck | other-vehicle | person | bicyclist | motorcyclist | road | parking | sidewalk | other-ground | building | fence | vegetation | trunk | terrain | pole | traffic-sign |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017 NIPS | PointNet++[25] | 20.1 | 53.7 | 1.9 | 0.2 | 0.9 | 0.2 | 0.9 | 1.0 | 0.0 | 72.0 | 18.7 | 41.8 | 5.6 | 62.3 | 16.9 | 46.5 | 13.8 | 30.0 | 6.0 | 8.9 |
| 2018 ICRA | SqueezeSeg[64] | 30.8 | 68.3 | 18.1 | 5.1 | 4.1 | 4.8 | 16.5 | 17.3 | 1.2 | 84.9 | 28.4 | 54.7 | 4.6 | 61.5 | 29.2 | 59.6 | 25.5 | 54.7 | 11.2 | 36.3 |
| 2019 ICRA | SqueezeSegV2[65] | 39.6 | 82.7 | 21.0 | 22.6 | 14.5 | 15.9 | 20.2 | 24.3 | 2.9 | 88.5 | 42.4 | 65.5 | 18.7 | 73.8 | 41.0 | 68.5 | 36.9 | 58.9 | 12.9 | 41.0 |
| 2019 IROS | RangeNet++[66] | 52.2 | 91.4 | 25.7 | 34.4 | 25.7 | 23.0 | 38.3 | 38.8 | 4.8 | 91.8 | 65.0 | 75.2 | 27.8 | 87.4 | 58.6 | 80.5 | 55.1 | 64.6 | 47.9 | 55.9 |
| 2020 CVPR | PolarNet[67] | 54.3 | 93.8 | 40.3 | 30.1 | 22.9 | 28.5 | 43.2 | 40.2 | 5.6 | 90.8 | 61.7 | 74.4 | 21.7 | 90.0 | 61.3 | 84.0 | 65.5 | 67.8 | 51.8 | 57.5 |
| 2021 CVPR | Cylinder3D[68] | 68.9 | 97.1 | 67.6 | 63.8 | 50.8 | 58.5 | 73.7 | 69.2 | 48.0 | 92.2 | 65.0 | 77.0 | 32.3 | 90.7 | 66.5 | 85.6 | 72.5 | 69.8 | 62.4 | 66.2 |
| 2021 CVPR | (AF)$^{2}$-S3Net[69] | 69.7 | 94.5 | 65.4 | 86.8 | 39.2 | 41.1 | 80.7 | 80.4 | 74.3 | 91.3 | 68.8 | 72.5 | 53.5 | 87.9 | 63.2 | 70.2 | 68.5 | 53.7 | 61.5 | 71.0 |
| 2022 CVPR | PVKD[70] | 71.2 | 97.0 | 67.9 | 69.3 | 53.5 | 60.2 | 75.1 | 73.5 | 50.5 | 91.8 | 70.9 | 77.5 | 41.0 | 92.4 | 69.4 | 86.5 | 73.8 | 71.9 | 64.9 | 65.8 |
| 2022 ECCV | 2DPASS[71] | 72.9 | 97.0 | 63.6 | 63.4 | 61.1 | 61.5 | 77.9 | 81.3 | 74.1 | 89.7 | 67.4 | 74.7 | 40.0 | 93.5 | 72.9 | 86.2 | 73.9 | 71.0 | 65.0 | 70.4 |
| 2023 TITS | SAT3D[72] | 61.3 | 94.5 | 42.1 | 45.6 | 21.6 | 39.4 | 63.4 | 61.2 | 18.6 | 91.8 | 68.6 | 77.3 | 27.2 | 91.8 | 67.8 | 85.8 | 70.3 | 71.5 | 60.3 | 64.9 |
| 2023 ICCV | RangeFormer*[73] | 73.3 | 96.7 | 69.4 | 73.7 | 59.9 | 66.2 | 78.1 | 75.9 | 58.1 | 92.4 | 73.0 | 78.8 | 42.4 | 92.3 | 70.1 | 86.6 | 73.3 | 72.8 | 66.4 | 66.6 |
| - | SMTransformer(ours) | 74.9 | 97.3 | 66.4 | 65.8 | 67.2 | 68.2 | 80.3 | 82.7 | 76.5 | 92.8 | 71.4 | 82.3 | 38.2 | 92.8 | 70.6 | 86.1 | 74.5 | 70.5 | 67.2 | 72.3 |

IV-B Outdoor Semantic Segmentation

Datasets: We conduct experiments on two popular datasets: SemanticKITTI [74] and SWAN [72]. SemanticKITTI provides 22 point cloud sequences consisting of 43,552 frames. Adhering to standard practice, we employ sequences 0 to 10 (excluding 8) for training, use sequence 8 for validation and sequences 11 to 21 for testing. The labels for the test set are exclusively available to the online server, necessitating result submissions for remote evaluation. The demanding SWAN dataset comprises 32 sequences of point clouds totalling 10,000 frames and containing approximately 0.9 billion points. Sequences 0 to 23 are allocated for training, while sequences 24 to 31 are designated for testing.
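The sequence splits described above can be written out explicitly. The dictionary names are hypothetical; the sequence ranges are taken directly from the text.

```python
SEMANTIC_KITTI_SPLITS = {
    "train": [s for s in range(0, 11) if s != 8],  # sequences 0-10 without 8
    "val": [8],
    "test": list(range(11, 22)),                   # sequences 11-21
}
SWAN_SPLITS = {
    "train": list(range(0, 24)),                   # sequences 0-23
    "test": list(range(24, 32)),                   # sequences 24-31
}

kitti_total = sum(len(v) for v in SEMANTIC_KITTI_SPLITS.values())
swan_total = sum(len(v) for v in SWAN_SPLITS.values())
print(kitti_total, swan_total)  # 22 32
```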

Network Configurations: For semantic segmentation on SemanticKITTI, we set the voxel size to 5cm and the maximum number of voxels to 100,000. We use the AdamW optimizer with a weight decay of 0.02. The base learning rate is set to 0.004 and follows a cosine schedule. We adopt rotation, flip, scaling, and transformation to augment training data. On SWAN, we opt not to employ voxelization to reduce point resolution; instead, we directly process the raw point cloud data. We set the maximum number of points to 80,000. We use the AdamW optimizer with a weight decay of 0.04. The base learning rate is set to 0.004 and is scheduled by MultiStepLR. We use the same data augmentation as on SemanticKITTI. The model is trained and tested with batch sizes of 16 and 8, respectively, distributed across 4 GPUs.

TABLE V: Semantic segmentation results on the SWAN test set.

| Methods | mIoU(%) | car | truck | pedestrian | bicycle | motorcycle | bus | bridge | tree | bushnes | building | road | r-driver | rub-bin | bus-stop | pole | wall | Traffic sign | rs-board | sidewalk | adv-board |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PointNet++[25] | 14.5 | 31.2 | 7.3 | 4.7 | 9.0 | 0.0 | 4.8 | 0.0 | 33.9 | 12.7 | 59.5 | 68.4 | 0.0 | 13.7 | 9.0 | 6.9 | 15.9 | 1.0 | 2.1 | 11.0 | 0.0 |
| PointConv[13] | 37.3 | 53.7 | 20.5 | 36.5 | 19.6 | 5.2 | 68.7 | 7.7 | 61.0 | 52.5 | 74.2 | 77.1 | 42.3 | 19.6 | 38.9 | 22.1 | 45.1 | 26.7 | 27.3 | 40.4 | 6.4 |
| $\psi$-CNN[75] | 39.8 | 48.5 | 25.2 | 31.1 | 22.4 | 4.2 | 77.6 | 7.3 | 69.0 | 56.2 | 73.9 | 75.6 | 47.1 | 23.0 | 46.1 | 30.1 | 57.0 | 29.1 | 25.8 | 38.2 | 9.4 |
| PolarNet[67] | 40.5 | 78.1 | 20.3 | 21.6 | 4.6 | 15.3 | 18.0 | 8.3 | 84.0 | 30.9 | 91.9 | 92.7 | 54.7 | 33.4 | 29.5 | 48.0 | 60.3 | 42.7 | 22.2 | 42.2 | 11.8 |
| Cylinder3D[68] | 54.9 | 80.8 | 30.4 | 48.7 | 28.0 | 6.8 | 91.7 | 13.7 | 85.3 | 69.0 | 92.7 | 92.3 | 75.2 | 37.5 | 72.1 | 48.8 | 71.1 | 46.5 | 32.5 | 56.7 | 18.9 |
| SAT3D[72] | 58.2 | 83.7 | 45.7 | 38.7 | 42.4 | 11.3 | 89.5 | 49.3 | 85.5 | 68.2 | 93.2 | 92.7 | 74.9 | 40.6 | 77.1 | 43.3 | 74.8 | 42.9 | 26.7 | 65.6 | 17.9 |
| SMTransformer(ours) | 62.4 | 86.4 | 50.2 | 47.2 | 44.6 | 19.6 | 87.0 | 56.0 | 88.4 | 70.4 | 94.0 | 92.2 | 83.3 | 56.8 | 75.1 | 46.7 | 74.8 | 54.9 | 35.6 | 67.0 | 21.0 |
Figure 6: Visualization of semantic segmentation results on SemanticKITTI (each scene is 50m by 50m, centered around the LiDAR). To accentuate the disparity between predictions and ground truth, we colour right/wrong predictions in gray/red colour, respectively. Notice that there are very few red points i.e. very few wrong predictions.

Results: Table IV presents the results of our network alongside well-established methods on the SemanticKITTI dataset. Our approach achieves 74.9% mIoU. Notably, compared to the cutting-edge projection-based method RangeFormer, which is pre-trained on other datasets, our method exhibits superior performance (+1.6%) without any pre-training. Furthermore, our proposed method surpasses voxel partitioning and 3D convolution-based techniques, such as Cylinder3D, 2DPASS, and PVKD. Importantly, our model shows a strong understanding of certain small object categories, including poles, traffic signs and motorcyclists. Fig. 6 shows visualizations of our results on the SemanticKITTI validation set.

We present the results for 20 classes of interest in the SWAN test frames in Table V. We compare our results to PointNet++ [25], PointConv [13], Cylinder3D [68], PolarNet [67], $\psi$-CNN [75] and SAT3D [72]. As shown in Table V, the mIoU values of the compared methods on the SWAN dataset are lower than those on the SemanticKITTI dataset. This discrepancy can be attributed to the heightened complexity of the scenes in the SWAN dataset, which was captured in the dense central business district area of the city of Perth, Australia. Notably, on this dataset, our method demonstrates the best performance with an improvement of +4.2% in mIoU compared to the nearest competitor SAT3D. Our method also shows high prediction accuracy on some small objects, including light poles, traffic signs and pedestrians.

IV-C Object Classification

Datasets: We evaluate SMTransformer on the synthetic ModelNet40 [76] dataset and the real-world ScanObjectNN [77] dataset. ModelNet40 comprises 12,311 CAD models from 40 categories and is divided into 9,843 training and 2,468 test models. Each sample has about 10,000 points and the features contain coordinates and normals. The ScanObjectNN dataset comprises 15,000 objects categorized into 15 classes, selected from ScanNetV2. In contrast to the synthetic ModelNet40 objects, these objects exhibit occlusion, background noise, deformed geometric shapes, and non-uniform surface density, presenting a more challenging scenario. Our experiments are conducted on its most challenging perturbed variant, denoted as PB_T50_RS. Here, we uniformly sample 1024 points from each model and only use their $(X,Y,Z)$ coordinates as input. We follow the data augmentation used in PointNeXt [47].

Network Configurations: We employ identical network configurations for both ModelNet40 and ScanObjectNN. We train the model for 350 epochs with a batch size of 32, using the SGD optimizer with a momentum of 0.9 and an initial learning rate of 0.1. The learning rate follows a cosine annealing schedule that decays it to 0.001, and we apply a dropout ratio of 0.4.
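As a rough illustration of the schedule described above, the sketch below evaluates a standard cosine annealing curve using the stated initial learning rate (0.1) and epoch count (350). The function name `cosine_annealing_lr` is hypothetical, and reading 0.001 as the minimum learning rate reached at the final epoch is an assumption rather than a detail confirmed by the text.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=350, lr_init=0.1, lr_min=0.001):
    """Standard cosine annealing from lr_init down to lr_min.

    Assumption: lr_min = 0.001 is the floor reached at the last epoch.
    """
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))

# Learning rate at the start, the midpoint and the end of training.
schedule = [cosine_annealing_lr(e) for e in (0, 175, 350)]
for epoch, lr in zip((0, 175, 350), schedule):
    print(f"epoch {epoch:3d}: lr = {lr:.4f}")
```

Under these assumptions the rate starts at 0.1, passes through the midpoint of the range halfway through training, and ends at 0.001.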

Figure 7: (a) Input point cloud. (b) Attention weights of the hard-masked point transformer. (c) Attention weights of the soft-masked point transformer. The red boxes highlight the boundaries of the challenging board class, where our proposed SMTransformer outperforms the hard-masked point transformer.
TABLE VI: Classification results on the ModelNet40 and ScanObjectNN datasets. ‘xyz’ and ‘n’ represent coordinates and normal vectors. ‘K’ stands for one thousand and ‘PN++’ for PointNet++. ‘-’ indicates that no result is reported. Our network achieves the best overall accuracy.
Methods Input #Points ModelNet40 OA(%) ScanObjectNN OA(%)
PointWeb[30] xyz, n 1K 92.3 -
PointConv[13] xyz, n 1K 92.5 -
SpiderCNN[12] xyz, n 5K 92.4 -
KPConv[10] xyz 7K 92.9 -
PointASNL[21] xyz, n 1K 93.2 -
PRANet[78] xyz 2K 93.7 82.1
RS-CNN[79] xyz 1K 93.6 -
PointNet[24] xyz 1K 89.2 68.2
PointNet++[25] xyz 1K 90.7 77.9
DGCNN[33] xyz 1K 92.9 78.2
PointCNN[36] xyz 1K 92.2 78.5
BGA-DGCNN[77] xyz 1K - 79.9
BGA-PN++[77] xyz 1K - 80.2
PointASNL[21] xyz 1K 92.9 -
PRANet[78] xyz 1K 93.2 81.0
PointTransformer[23] xyz 1K 93.7 -
PointMLP[28] xyz 1K 94.1 85.4
PointTransformerV2[22] xyz 1K 94.2 -
DANet[80] xyz 1K 93.6 -
LCPformer[53] xyz 1K 93.6 -
PointNext[47] xyz 1K 93.2 87.7
PointVector[29] xyz 1K - 87.8
SMTransformer(ours) xyz 1K 94.2 88.0

Results: We compare our method with representative state-of-the-art methods in Table VI using the overall accuracy (OA) metric. For a fairer comparison, we also list the input data type and the number of input points for each method. Our network attains 94.2% OA on ModelNet40, matching the best reported result (Point Transformer V2), and achieves the best OA of 88.0% on ScanObjectNN.

On ModelNet40, our network surpasses the classical local point convolution KPConv by 1.3%, even though KPConv uses about 7,000 input points while our network uses only 1,024. Compared with the MLP-based PointNext, our network is 1.0% higher. Compared with the classical vector attention-based transformers, our network outperforms Point Transformer by 0.5% (93.7% in Table VI) and matches the 94.2% OA of Point Transformer V2. Compared with the scalar attention-based methods, our network outperforms PointASNL and LCPFormer by 1.0% and 0.6%, respectively.

On ScanObjectNN, our network achieves a state-of-the-art overall accuracy (OA) of 88.0%. Specifically, compared with MLP-based methods, our network outperforms PointMLP, PointNext, and PointVector by 2.6%, 0.3%, and 0.2% OA, respectively. This performance on a real-world dataset highlights the suitability of our method for practical applications.

IV-D Ablation Studies

We conduct ablation studies on the S3DIS dataset to demonstrate the effectiveness of SMTransformer, SAUB and Shared Position Encoding.

1) Effect of Various Components: Table VII displays the influence of our introduced modules (i.e. SMTransformer and SAUB) and strategy (i.e. shared position encoding). Case I is the baseline Point Transformer which does not include any of our modules. Cases II to IV systematically incorporate each of our proposed components, progressively enhancing the baseline result to reach 73.4%. Case V employs the shared position encoding strategy while maintaining performance similar to the unshared position encoding strategy. This indicates that point clouds with the same resolution could share the same position encoding information without sacrificing accuracy.

TABLE VII: Effect of various components on semantic segmentation (S3DIS Area 5). ‘SMTB†’ denotes the Soft Masked Transformer with the position encoding used in the Point Transformer [23]. ‘EPE’ denotes the proposed enhanced position encoding. ‘SAUB’ denotes the proposed skip attention-based up-sampling block. ‘SPE’ is the shared position encoding strategy. ‘Para.’ denotes the network parameters.
Case SMTB† EPE SAUB SPE mIoU(%) mAcc(%) Para.(M)
I 70.6 76.5 7.8
II 72.0 77.8 8.4
III 72.9 78.2 9.0
IV 73.3 78.8 10.3
V 73.4 78.9 7.8

2) SPE versus Unshared Position Encoding (USPE): We compare a network using shared position encoding with one using unshared position encoding, keeping the rest of the network configuration identical. As shown in Table VIII, the network with shared position encoding matches the accuracy of the unshared variant (73.4% versus 73.3% mIoU) while reducing the parameter count from 10.3 million to 7.8 million and the training time from 24 hours to 16 hours.

TABLE VIII: Comparative Analysis of Network Performance with Shared and Unshared Position Encoding.
Case mIoU(%)↑ Para.(M)↓ Training time(h)↓
SMTransformer + USPE 73.3 10.3 24
SMTransformer + SPE 73.4 7.8 16
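
To put the Table VIII figures in relative terms, the short sketch below computes the reduction in parameters and training time when moving from USPE to SPE. The dictionary layout and the helper name `relative_reduction` are illustrative only; the numbers are copied from Table VIII.

```python
# Values copied from Table VIII.
uspe = {"miou": 73.3, "params_m": 10.3, "train_h": 24}
spe = {"miou": 73.4, "params_m": 7.8, "train_h": 16}

def relative_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before

param_saving = relative_reduction(uspe["params_m"], spe["params_m"])
time_saving = relative_reduction(uspe["train_h"], spe["train_h"])
miou_change = spe["miou"] - uspe["miou"]

print(f"parameters: -{param_saving:.1f}%")
print(f"training time: -{time_saving:.1f}%")
print(f"mIoU change: {miou_change:+.1f} points")
```

This works out to roughly a quarter fewer parameters and a third less training time, with mIoU essentially unchanged.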

3) Soft Mask versus Hard Mask: To demonstrate the effectiveness and versatility of the soft mask, we compare it with a hard mask. The hard mask is a binary mask that can be expressed as

$$S(\cdot)=\begin{cases}0, & K_{ij}^{s}-Q_{i}^{s}<\tau\\ 1, & K_{ij}^{s}-Q_{i}^{s}\geq\tau\end{cases},$$

where $\tau$ represents the threshold of the mask. When the difference between the task score key $K_{ij}^{s}$ and the task score query $Q_{i}^{s}$ is greater than or equal to the threshold, the corresponding position in the hard mask is set to 1. Otherwise, it is set to 0. Typically, the threshold needs to be tuned for each dataset. Here, we set $\tau$ to 0.5 on S3DIS through iterative experimentation. Fig. 7 compares the attention weights on a point cloud obtained with the soft mask and with the hard mask. The point transformer with a mask is robust to object boundaries. In particular, the transformer with a hard mask is highly sensitive to certain classes, such as tables and chairs. The transformer with a soft mask is similarly sensitive to these classes and is also highly robust on challenging classes, such as boards.
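
The thresholding rule above can be sketched in a few lines. In this minimal example, `hard_mask` follows the piecewise definition with scalar task scores, which is a simplification. The function `sigmoid_soft_mask` is only a hypothetical smooth stand-in, included to show how a continuous mask avoids the hard cut-off; it is not the soft mask formulation defined in this paper.

```python
import math

def hard_mask(key_scores, query_score, tau=0.5):
    """Binary mask: 1 where K_ij^s - Q_i^s >= tau, else 0."""
    return [1 if (k - query_score) >= tau else 0 for k in key_scores]

def sigmoid_soft_mask(key_scores, query_score):
    """Hypothetical smooth alternative (an assumption, not the paper's definition)."""
    return [1.0 / (1.0 + math.exp(-(k - query_score))) for k in key_scores]

# Task scores of three neighbours j of point i (illustrative values).
keys = [1.0, 0.75, 0.25]
query = 0.25

print(hard_mask(keys, query))          # differences 0.75, 0.5, 0.0 against tau = 0.5
print(sigmoid_soft_mask(keys, query))  # graded weights instead of a 0/1 decision
```

The hard mask discards every neighbour whose score difference falls below $\tau$, so its behaviour depends on a per-dataset threshold, whereas a continuous mask needs no such threshold.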

4) SAUB versus Classic Up-sampling: To evaluate the effectiveness of our proposed skip attention-based up-sampling block (SAUB), we compare it with two other up-sampling blocks: the transition up-sampling block (TUB) [23] and the grid unpooling block (GUB) [22]. TUB consists of one projection layer, one interpolation, and one addition operation. GUB consists of one projection layer, one grid unpooling operation and one addition operation. In both blocks, the addition operation connects the features from the encoding and decoding layers. Table IX presents the performance of our network with the different up-sampling blocks. The network employing SAUB achieves the best performance, with mIoU, mAcc, and OA values of 73.4%, 78.9%, and 91.8%, respectively. This is 2.4%, 2.1%, and 1.2% higher than the network using TUB, and 1.1%, 0.9%, and 0.6% higher than the network using GUB.

TABLE IX: Segmentation performance of our model on S3DIS Area 5 with different up-sampling blocks. TUB: transition up-sampling block, GUB: grid unpooling block, SAUB: skip-attention-based up-sampling block.
Up-sampling block mIoU(%) mAcc(%) OA(%)
TUB 71.0 76.8 90.6
GUB 72.3 78.0 91.2
SAUB(ours) 73.4 78.9 91.8
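
For contrast with SAUB, the sketch below illustrates the projection, interpolation and addition pipeline that the text attributes to TUB. It is a simplified stand-in rather than the implementation from [23]: the inverse-distance weighting over the `k` nearest coarse points, the bias-free projection, and all function names are assumptions made for illustration.

```python
def project(feature, weight):
    """Linear projection without bias; `weight` is a list of output rows."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weight]

def interpolate(fine_xyz, coarse_xyz, coarse_feats, k=3, eps=1e-8):
    """Inverse-distance-weighted interpolation from coarse to fine points (assumed scheme)."""
    dim = len(coarse_feats[0])
    out = []
    for p in fine_xyz:
        # k nearest coarse points as (distance, index) pairs.
        nearest = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5, idx)
            for idx, q in enumerate(coarse_xyz)
        )[:k]
        weights = [1.0 / (d + eps) for d, _ in nearest]
        total = sum(weights)
        out.append([
            sum(w * coarse_feats[idx][c] for w, (_, idx) in zip(weights, nearest)) / total
            for c in range(dim)
        ])
    return out

def tub_like_upsample(fine_xyz, skip_feats, coarse_xyz, coarse_feats, weight):
    """Project decoder features, interpolate them to the fine points, add encoder skip features."""
    projected = [project(f, weight) for f in coarse_feats]
    upsampled = interpolate(fine_xyz, coarse_xyz, projected)
    return [[u + s for u, s in zip(up, skip)] for up, skip in zip(upsampled, skip_feats)]

# Two coarse points, three fine points on the segment between them.
coarse_xyz = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
coarse_feats = [[1.0, 0.0], [0.0, 1.0]]
fine_xyz = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]
skip_feats = [[10.0, 20.0]] * 3
identity = [[1.0, 0.0], [0.0, 1.0]]

fused = tub_like_upsample(fine_xyz, skip_feats, coarse_xyz, coarse_feats, identity)
```

In this scheme the encoder and decoder features are merged by a fixed element-wise addition, which is the step SAUB replaces with attention-based fusion.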

IV-E Robustness Analysis

1) Robustness to Density: We compare the robustness of our model to inter- and intra-point cloud density with typical baselines such as PointNet [24], PointNet++ [25] and DGCNN [33], classical convolutional networks such as PointConv [13], RS-CNN [79] and DANet [80], and the attention network PointASNL [21]. For a fair comparison, all the networks are trained on the modelnet40_normal_resampled dataset [76] with 1024 points, using only coordinates as input. To evaluate robustness to inter-point cloud density, we feed the trained model with inputs downsampled to 512, 256, 128, and 64 points. To evaluate robustness to intra-point cloud density, we divide the 1024 points into four equal parts along the $X$ coordinate according to the point number, and then randomly sample 128 points from each part in sequence. This generates test samples with 896, 768, 640 and 512 points, respectively. The results are shown in Fig. 8. Our SMTransformer is more robust than existing approaches to both inter- and intra-point cloud density variations.

Figure 8: Comparison of classification results on ModelNet40 when points are downsampled to generate (a) inter-point cloud density robustness and (b) intra-point cloud density robustness.
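
One possible reading of the density protocol above is sketched below. Inter-cloud density is varied by uniformly subsampling the whole cloud. Intra-cloud density is varied by sorting the points along X, splitting them into four equal parts, and thinning one more part to half its size (128 of 256 points) at each step. The function names and the use of uniform random sampling are assumptions; the text does not specify the sampler.

```python
import random

def inter_density_subsample(points, n_keep, rng):
    """Uniformly subsample the whole cloud (inter-cloud density change)."""
    return rng.sample(points, n_keep)

def intra_density_variants(points, n_parts=4, rng=None):
    """Thin one X-ordered part at a time, yielding progressively sparser clouds.

    Assumes len(points) is divisible by n_parts (1024 / 4 = 256 here).
    """
    rng = rng or random.Random(0)
    ordered = sorted(points, key=lambda p: p[0])
    part_size = len(ordered) // n_parts
    parts = [ordered[i * part_size:(i + 1) * part_size] for i in range(n_parts)]
    variants = []
    for i in range(n_parts):
        # Keep half of the points in part i; earlier parts stay thinned.
        parts[i] = rng.sample(parts[i], part_size // 2)
        variants.append([p for part in parts for p in part])
    return variants

rng = random.Random(1)
cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(1024)]

inter_sizes = [len(inter_density_subsample(cloud, n, rng)) for n in (512, 256, 128, 64)]
intra_sizes = [len(v) for v in intra_density_variants(cloud, rng=rng)]
print(inter_sizes, intra_sizes)
```

With 1024 input points this reproduces the 896, 768, 640 and 512 point test samples mentioned in the text, each with a different density imbalance along X.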

2) Robustness to Transformation: To demonstrate the robustness of our SMTransformer, we evaluate its performance on S3DIS and ModelNet40 under a variety of perturbations of the test data, including permutation, translation, scaling and jitter. As shown in Table X, on S3DIS, Point Transformer and PointVector suffer a clear performance drop under scaling (from 70.36% to 65.73% and 66.15% mIoU, and from 72.29% to 69.34% and 69.26% mIoU, respectively). In contrast, our method remains stable across the transformations, including translations of ±0.2 along the $X$, $Y$, and $Z$ axes and jitter. All methods are largely insensitive to permutations. Under scaling on S3DIS, SMTransformer performs slightly better at ×0.8 (72.30%) than at ×1.2 (71.94%). Our method achieves the best accuracy under all transformations on both the segmentation and classification datasets.

TABLE X: Robustness study for random point permutations, translation of ±0.2 along the $X$, $Y$, $Z$ axes, scaling (×0.8, ×1.2) and jittering. Note that this ablation study is without test time augmentation.
Methods None Perm. Translation(+0.2) Translation(-0.2) Scaling(×0.8) Scaling(×1.2) Jitter
S3DIS Dataset mIoU(%)
PointNet[24] 57.75 59.71 22.33 29.85 56.24 59.74 59.04
MinkowskiNet[60] 64.68 64.56 64.59 64.96 59.60 61.93 58.96
PAConv[49] 65.63 65.64 55.81 57.42 64.20 63.94 65.12
Point Transformer[23] 70.36 70.45 70.44 70.43 65.73 66.15 59.67
Stratified Transformer[41] 71.96 72.02 71.99 71.93 70.42 71.21 72.02
Point Vector[29] 72.29 72.29 72.29 72.29 69.34 69.26 72.16
SMTransformer(ours) 72.62 72.62 72.83 72.96 72.30 71.94 72.75
ModelNet40 Dataset OA(%)
PointNet++[25] 92.1 92.1 90.7 90.8 91.2 91.0 91.0
DGCNN[33] 92.5 92.5 92.3 92.3 92.1 92.3 91.5
PointConv[13] 91.8 91.8 91.8 91.8 89.9 90.6 90.6
SMTransformer(ours) 94.2 94.2 94.1 94.2 93.5 93.9 92.3
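
The test-time perturbations in Table X can be sketched as simple functions on a list of (x, y, z) tuples. The translation offsets and scaling factors come from the table caption. The jitter parameters (`sigma`, `clip`) are hypothetical, since the text does not state them, and all function names are illustrative.

```python
import random

def permute(points, rng):
    """Random reordering of the points; coordinates are unchanged."""
    shuffled = list(points)
    rng.shuffle(shuffled)
    return shuffled

def translate(points, offset):
    """Shift every point by `offset` along the X, Y and Z axes."""
    return [(x + offset, y + offset, z + offset) for x, y, z in points]

def scale(points, factor):
    """Scale every coordinate by `factor`."""
    return [(x * factor, y * factor, z * factor) for x, y, z in points]

def jitter(points, rng, sigma=0.01, clip=0.05):
    """Add clipped Gaussian noise per coordinate (sigma and clip are assumed values)."""
    def noise():
        return max(-clip, min(clip, rng.gauss(0.0, sigma)))
    return [(x + noise(), y + noise(), z + noise()) for x, y, z in points]

rng = random.Random(0)
cloud = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(8)]

perturbed = {
    "perm": permute(cloud, rng),
    "translate_plus": translate(cloud, 0.2),
    "translate_minus": translate(cloud, -0.2),
    "scale_down": scale(cloud, 0.8),
    "scale_up": scale(cloud, 1.2),
    "jitter": jitter(cloud, rng),
}
```

Each entry of `perturbed` would be passed to the trained model in place of the clean test cloud, without test time augmentation, to fill one column of the table.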

3) Robustness to Noise: To assess the robustness of SMTransformer to noise, we conduct experiments on the PB_T50_RS variant of the ScanObjectNN dataset, measuring performance with and without background noise (denoted as ‘obj_bg’ and ‘obj_nobg’, respectively). Table XI compares our model with several baselines from [77]. The overall accuracy of all networks decreases when they are trained and tested with background noise. Our model nevertheless achieves the highest accuracy in both settings and the smallest performance drop among the compared networks, losing only 1.4% OA from the ‘obj_nobg’ variant to the ‘obj_bg’ variant.

TABLE XI: Robustness to background noise on ScanObjectNN. ‘obj_bg’ and ‘obj_nobg’ stand for objects with and without background noise, respectively.
Method obj_nobg obj_bg OA drop(%)
3DmFV[81] 69.8 63.0 6.8↓
PointNet[24] 74.4 68.2 6.2↓
PointNet++[25] 80.2 77.9 2.3↓
SpiderCNN[12] 76.9 73.7 3.2↓
DGCNN[33] 81.5 78.1 3.4↓
PointCNN[36] 80.8 78.5 2.3↓
SMTransformer(ours) 88.0 86.6 1.4↓

V Conclusion

In this paper, we introduced a novel Soft Masked Transformer to effectively capture contextual and task-specific information from point clouds. Additionally, we proposed a Skip Attention-based up-sampling block to integrate features from different resolution points across the encoding and decoding layers. Furthermore, we presented a Shared Position Encoding strategy. By incorporating these modules, we constructed the SMTransformer network. Our method was evaluated across various tasks, including indoor and outdoor semantic segmentation and classification. Through extensive experiments on challenging benchmarks, thorough ablation studies and theoretical analysis, we demonstrated the robustness and effectiveness of our approach on real-world datasets. Our contributions advance the state of the art in point cloud processing. The introduced techniques, including the Soft Masked Transformer, the Skip Attention-based up-sampling block, and the Shared Position Encoding strategy, improve the capture of intricate details and the performance of point cloud processing tasks. As future work, exploring additional priors or refining the network architecture could offer promising avenues to further improve point cloud processing.

References

  • [1] Z. Ma, Z. Zheng, J. Wei, Y. Yang, and H. T. Shen, “Instance-dictionary learning for open-world object detection in autonomous driving scenarios,” IEEE Trans. Circuits Syst. Video Technol., 2023.
  • [2] D. W. Shu and J. Kwon, “Hierarchical bidirected graph convolutions for large-scale 3-d point cloud place recognition,” IEEE Trans. Neural Netw. Learn. Syst., 2023.
  • [3] Z. Wang, W. Li, and D. Xu, “Domain adaptive sampling for cross-domain point cloud recognition,” IEEE Trans. Circuits Syst. Video Technol., 2023.
  • [4] Y. Ren, Y. Cong, J. Dong, and G. Sun, “Uni3da: Universal 3d domain adaptation for object recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 1, pp. 379–392, 2022.
  • [5] F. J. Lawin, M. Danelljan, P. Tosteberg, G. Bhat, F. S. Khan, and M. Felsberg, “Deep projective 3d semantic segmentation,” in Proc. Int. Conf. Pattern Recognit. Image Anal.   Springer, 2017, pp. 95–107.
  • [6] A. Boulch, J. Guerry, B. Le Saux, and N. Audebert, “Snapnet: 3d point cloud semantic labeling with 2d deep segmentation networks,” Comput. Graph., vol. 71, pp. 189–198, 2018.
  • [7] L. Tchapmi, C. Choy, I. Armeni, J. Gwak, and S. Savarese, “Segcloud: Semantic segmentation of 3d point clouds,” in Proc. Int. Conf. 3D Vis.   IEEE, 2017, pp. 537–547.
  • [8] D. Maturana and S. Scherer, “Voxnet: A 3d convolutional neural network for real-time object recognition,” in Proc. IEEE Int. Conf. Intell. Rob. Syst.   IEEE, 2015, pp. 922–928.
  • [9] Y. Shen, C. Feng, Y. Yang, and D. Tian, “Mining point cloud local structures by kernel correlation and graph pooling,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4548–4557.
  • [10] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas, “Kpconv: Flexible and deformable convolution for point clouds,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 6411–6420.
  • [11] M. Fey, J. E. Lenssen, F. Weichert, and H. Müller, “Splinecnn: Fast geometric deep learning with continuous b-spline kernels,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 869–877.
  • [12] Y. Xu, T. Fan, M. Xu, L. Zeng, and Y. Qiao, “Spidercnn: Deep learning on point sets with parameterized convolutional filters,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 87–102.
  • [13] W. Wu, Z. Qi, and L. Fuxin, “Pointconv: Deep convolutional networks on 3d point clouds,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 9621–9630.
  • [14] P. Hermosilla, T. Ritschel, P.-P. Vázquez, À. Vinacua, and T. Ropinski, “Monte carlo convolution for learning on non-uniformly sampled point clouds,” ACM Trans. Graph., vol. 37, no. 6, pp. 1–12, 2018.
  • [15] H. Lei, N. Akhtar, and A. Mian, “Seggcn: Efficient 3d point cloud segmentation with fuzzy spherical kernel,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2020.
  • [16] S. Xie, S. Liu, Z. Chen, and Z. Tu, “Attentional shapecontextnet for point cloud recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4606–4615.
  • [17] X. Liu, Z. Han, Y.-S. Liu, and M. Zwicker, “Point2sequence: Learning the shape representation of 3d point clouds with an attention-based sequence to sequence network,” in Proc. AAAI Conf. Artif. Intell., vol. 33, no. 01, 2019, pp. 8778–8785.
  • [18] J. Yang, Q. Zhang, B. Ni, L. Li, J. Liu, M. Zhou, and Q. Tian, “Modeling point clouds with self-attention and gumbel subset sampling,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3323–3332.
  • [19] J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi, and Y. W. Teh, “Set transformer: A framework for attention-based permutation-invariant neural networks,” in Proc. Int. Conf. Mach. Learn.   PMLR, 2019, pp. 3744–3753.
  • [20] M. Feng, L. Zhang, X. Lin, S. Z. Gilani, and A. Mian, “Point attention network for semantic segmentation of 3d point clouds,” Pattern Recognit., vol. 107, p. 107446, 2020.
  • [21] X. Yan, C. Zheng, Z. Li, S. Wang, and S. Cui, “Pointasnl: Robust point clouds processing using nonlocal neural networks with adaptive sampling,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 5589–5598.
  • [22] X. Wu, Y. Lao, L. Jiang, X. Liu, and H. Zhao, “Point transformer v2: Grouped vector attention and partition-based pooling,” Proc. Adv. Neural Inf. Process. Syst., vol. 35, pp. 33 330–33 342, 2022.
  • [23] H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V. Koltun, “Point transformer,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 16 259–16 268.
  • [24] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 652–660.
  • [25] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” arXiv preprint arXiv:1706.02413, 2017.
  • [26] J. Li, B. M. Chen, and G. H. Lee, “So-net: Self-organizing network for point cloud analysis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 9397–9406.
  • [27] H. Ran, J. Liu, and C. Wang, “Surface representation for point clouds,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022.
  • [28] X. Ma, C. Qin, H. You, H. Ran, and Y. Fu, “Rethinking network design and local geometry in point cloud: A simple residual mlp framework,” in Proc. Int. Conf. Learn. Represent., 2021.
  • [29] X. Deng, W. Zhang, Q. Ding, and X. Zhang, “Pointvector: A vector representation in point cloud analysis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 9455–9465.
  • [30] H. Zhao, L. Jiang, C.-W. Fu, and J. Jia, “Pointweb: Enhancing local neighborhood features for point cloud processing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5565–5573.
  • [31] L. Jiang, H. Zhao, S. Liu, X. Shen, C.-W. Fu, and J. Jia, “Hierarchical point-edge interaction network for point cloud semantic segmentation,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 10 433–10 441.
  • [32] M. Xu, J. Zhang, Z. Zhou, M. Xu, X. Qi, and Y. Qiao, “Learning geometry-disentangled representation for complementary understanding of 3d object point cloud,” in Proc. AAAI Conf. Artif. Intell., vol. 35, 2021, pp. 3056–3064.
  • [33] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph cnn for learning on point clouds,” ACM Trans. Graph., vol. 38, no. 5, pp. 1–12, 2019.
  • [34] R. Klokov and V. Lempitsky, “Escape from cells: Deep kd-networks for the recognition of 3d point cloud models,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 863–872.
  • [35] M. Xu, Z. Zhou, and Y. Qiao, “Geometry sharing network for 3d point cloud classification and segmentation,” in Proc. AAAI Conf. Artif. Intell., vol. 34, 2020, pp. 12 500–12 507.
  • [36] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, “Pointcnn: Convolution on x-transformed points,” Proc. Adv. Neural Inf. Process. Syst., vol. 31, pp. 820–830, 2018.
  • [37] M. Simonovsky and N. Komodakis, “Dynamic edge-conditioned filters in convolutional neural networks on graphs,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3693–3702.
  • [38] S. Wang, S. Suo, W.-C. Ma, A. Pokrovsky, and R. Urtasun, “Deep parametric continuous convolutional neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 2589–2597.
  • [39] L. Wang, Y. Huang, Y. Hou, S. Zhang, and J. Shan, “Graph attention convolution for point cloud semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 10 296–10 305.
  • [40] C. Wang, B. Samari, and K. Siddiqi, “Local spectral graph convolution for point set feature learning,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 52–66.
  • [41] X. Lai, J. Liu, L. Jiang, L. Wang, H. Zhao, S. Liu, X. Qi, and J. Jia, “Stratified transformer for 3d point cloud segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 8500–8509.
  • [42] Q. Hu, B. Yang, L. Xie, S. Rosa, Y. Guo, Z. Wang, N. Trigoni, and A. Markham, “Randla-net: Efficient semantic segmentation of large-scale point clouds,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 11 108–11 117.
  • [43] F. Groh, P. Wieschollek, and H. P. Lensch, “Flex-convolution,” in Proc. Asian Conf. Comput. Vis.   Springer, 2018, pp. 105–122.
  • [44] F. Engelmann, T. Kontogianni, J. Schult, and B. Leibe, “Know what your neighbors do: 3d semantic segmentation of point clouds,” in Proc. Eur. Conf. Comput. Vis. Worksh., 2018, pp. 0–0.
  • [45] F. Engelmann, T. Kontogianni, A. Hermans, and B. Leibe, “Exploring spatial context for 3d semantic segmentation of point clouds,” in Proc. IEEE Int. Conf. Comput. Vis. Worksh., 2017, pp. 716–724.
  • [46] Z. Zhang, B.-S. Hua, and S.-K. Yeung, “Shellnet: Efficient point cloud convolutional neural networks using concentric shells statistics,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 1607–1616.
  • [47] G. Qian, Y. Li, H. Peng, J. Mai, H. Hammoud, M. Elhoseiny, and B. Ghanem, “Pointnext: Revisiting pointnet++ with improved training and scaling strategies,” in Proc. Adv. Neural Inf. Process. Syst., 2022.
  • [48] H. Lei, N. Akhtar, and A. Mian, “Spherical kernel for efficient graph convolution on 3d point clouds,” IEEE Trans. Pattern Anal. Mach. Intell., 2020.
  • [49] M. Xu, R. Ding, H. Zhao, and X. Qi, “Paconv: Position adaptive convolution with dynamic kernel assembling on point clouds,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 3173–3182.
  • [50] S. Qiu, S. Anwar, and N. Barnes, “Semantic segmentation for real point cloud scenes via bilateral augmentation and adaptive fusion,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 1757–1767.
  • [51] L. Tang, Y. Zhan, Z. Chen, B. Yu, and D. Tao, “Contrastive boundary learning for point cloud segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 8489–8499.
  • [52] J. Choe, C. Park, F. Rameau, J. Park, and I. S. Kweon, “Pointmixer: Mlp-mixer for point cloud understanding,” in Proc. Eur. Conf. Comput. Vis.   Springer, 2022, pp. 620–640.
  • [53] Z. Huang, Z. Zhao, B. Li, and J. Han, “Lcpformer: Towards effective 3d point cloud analysis via local context propagation in transformers,” IEEE Trans. Circuits Syst. Video Technol., 2023.
  • [54] C. Chen, D. Liu, C. Xu, and T.-K. Truong, “Saks: Sampling adaptive kernels from subspace for point cloud graph convolution,” IEEE Trans. Circuits Syst. Video Technol., 2023.
  • [55] H. Lei, N. Akhtar, M. Shah, and A. Mian, “Mesh convolution with continuous filters for 3-d surface parsing,” IEEE Trans. Neural Netw. Learn. Syst., 2023.
  • [56] S. Fan, Q. Dong, F. Zhu, Y. Lv, P. Ye, and F.-Y. Wang, “Scf-net: Learning spatial contextual features for large-scale point cloud segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 14 504–14 513.
  • [57] B. Graham, M. Engelcke, and L. Van Der Maaten, “3d semantic segmentation with submanifold sparse convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 9224–9232.
  • [58] M. Jaritz, J. Gu, and H. Su, “Multi-view pointnet for 3d scene understanding,” in Proc. IEEE Int. Conf. Comput. Vis. Worksh., 2019, pp. 0–0.
  • [59] H.-Y. Chiang, Y.-L. Lin, Y.-C. Liu, and W. H. Hsu, “A unified point-based framework for 3d segmentation,” in Proc. Int. Conf. 3D Vis., 2019, pp. 155–163.
  • [60] C. Choy, J. Gwak, and S. Savarese, “4d spatio-temporal convnets: Minkowski convolutional neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 3075–3084.
  • [61] C. Park, Y. Jeong, M. Cho, and J. Park, “Fast point transformer,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 16 949–16 958.
  • [62] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese, “3d semantic parsing of large-scale indoor spaces,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1534–1543.
  • [63] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5828–5839.
  • [64] B. Wu, A. Wan, X. Yue, and K. Keutzer, “Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud,” in Proc. IEEE Int. Conf. Robot. Autom.   IEEE, 2018, pp. 1887–1893.
  • [65] B. Wu, X. Zhou, S. Zhao, X. Yue, and K. Keutzer, “Squeezesegv2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud,” in Proc. IEEE Int. Conf. Robot. Autom.   IEEE, 2019, pp. 4376–4382.
  • [66] A. Milioto, I. Vizzo, J. Behley, and C. Stachniss, “Rangenet++: Fast and accurate lidar semantic segmentation,” in Proc. IEEE Int. Conf. Intell. Rob. Syst.   IEEE, 2019, pp. 4213–4220.
  • [67] Y. Zhang, Z. Zhou, P. David, X. Yue, Z. Xi, B. Gong, and H. Foroosh, “Polarnet: An improved grid representation for online lidar point clouds semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 9601–9610.
  • [68] X. Zhu, H. Zhou, T. Wang, F. Hong, Y. Ma, W. Li, H. Li, and D. Lin, “Cylindrical and asymmetrical 3d convolution networks for lidar segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 9939–9948.
  • [69] R. Cheng, R. Razani, E. Taghavi, E. Li, and B. Liu, “2-s3net: Attentive feature fusion with adaptive feature selection for sparse semantic segmentation network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 12 547–12 556.
  • [70] Y. Hou, X. Zhu, Y. Ma, C. C. Loy, and Y. Li, “Point-to-voxel knowledge distillation for lidar semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 8479–8488.
  • [71] X. Yan, J. Gao, C. Zheng, C. Zheng, R. Zhang, S. Cui, and Z. Li, “2dpass: 2d priors assisted semantic segmentation on lidar point clouds,” in Proc. Eur. Conf. Comput. Vis.   Springer, 2022, pp. 677–695.
  • [72] M. Ibrahim, N. Akhtar, S. Anwar, and A. Mian, “Sat3d: Slot attention transformer for 3d point cloud semantic segmentation,” IEEE Trans. Intell. Transp. Syst., 2023.
  • [73] L. Kong, Y. Liu, R. Chen, Y. Ma, X. Zhu, Y. Li, Y. Hou, Y. Qiao, and Z. Liu, “Rethinking range view representation for lidar segmentation,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 228–240.
  • [74] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, “Semantickitti: A dataset for semantic scene understanding of lidar sequences,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 9297–9307.
  • [75] H. Lei, N. Akhtar, and A. Mian, “Octree guided cnn with spherical kernels for 3d point clouds,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 9631–9640.
  • [76] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d shapenets: A deep representation for volumetric shapes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1912–1920.
  • [77] M. A. Uy, Q.-H. Pham, B.-S. Hua, T. Nguyen, and S.-K. Yeung, “Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 1588–1597.
  • [78] S. Cheng, X. Chen, X. He, Z. Liu, and X. Bai, “Pra-net: Point relation-aware network for 3d point cloud analysis,” IEEE Trans. Image Process., vol. 30, pp. 4436–4448, 2021.
  • [79] Y. Liu, B. Fan, S. Xiang, and C. Pan, “Relation-shape convolutional neural network for point cloud analysis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 8895–8904.
  • [80] Y. He, H. Yu, Z. Yang, W. Sun, M. Feng, and A. Mian, “Danet: Density adaptive convolutional network with interactive attention for 3d point clouds,” IEEE Robot. Autom. Lett., 2023.
  • [81] Y. Ben-Shabat, M. Lindenbaum, and A. Fischer, “3dmfv: Three-dimensional point cloud classification in real-time using convolutional neural networks,” IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 3145–3152, 2018.