PoinTramba: A Hybrid Transformer-Mamba Framework for Point Cloud Analysis

Zicheng Wang1,2, Zhenghao Chen2, Yiming Wu1, Zhen Zhao2, Lu** Zhou2, Dong Xu1
1The University of Hong Kong
2The University of Sydney
{edmond02, yimingwu, dongxu}@hku.hk
{zhenghao.chen, zhen.zhao, lu**.zhou}@sydney.edu.au
Corresponding author
Abstract

Point cloud analysis has seen substantial advancements due to deep learning. Although previous Transformer-based methods excel at modeling long-range dependencies on this task, their computational demands are substantial. Conversely, Mamba offers greater efficiency but shows limited potential compared with Transformer-based methods. In this study, we introduce PoinTramba, a hybrid framework that combines the analytical power of the Transformer with the computational efficiency of Mamba for enhanced point cloud analysis. Specifically, our approach first segments point clouds into groups, where the Transformer captures intra-group dependencies and produces group embeddings, whose inter-group relationships are then captured by an efficient Mamba architecture. Unlike previous Mamba approaches, we introduce a bi-directional importance-aware ordering (BIO) strategy to tackle the challenges of random ordering effects. This strategy reorders group embeddings based on their calculated importance scores, significantly enhancing Mamba’s performance. By integrating these techniques, our framework achieves a superior balance between computational efficiency and analytical performance. Extensive experiments on datasets such as ScanObjectNN, ModelNet40, and ShapeNetPart demonstrate the effectiveness of our approach, establishing a new state-of-the-art benchmark on point cloud recognition. This paradigm is the first to leverage the combined strengths of both Transformer and Mamba architectures. The code is available at https://github.com/xiaoyao3302/PoinTramba.

1 Introduction

Point clouds, which serve as crucial 3D visual signals containing essential geometric information, have garnered increasing research interest [1, 2]. Point cloud analysis tasks, such as classification and segmentation, have achieved remarkable success thanks to the advent of deep learning methods. Early approaches either leveraged auxiliary data structures, like voxels, processed with 3D convolutional operations [3, 4], or directly utilized multilayer perceptrons (MLPs) [5, 6, 7] to extract visual representations from raw point sets. However, these methods primarily rely on local dependencies and often neglect global features. To enhance global perception in point clouds, recent studies (e.g., PointBERT [8] and PointMAE [9]) have introduced the Transformer [10] with attention mechanisms to process long sequences. Although the powerful long-range modeling capacity of the Transformer intuitively enhances analytical performance, its attention mechanism inevitably results in high computational complexity and memory usage, rendering these methods less practical.

On the other hand, to avoid the large memory burden when modeling long-range relationships, a new architecture called Mamba [11] with a state space model [12, 13] module has been introduced. This architecture has achieved success in various Natural Language Processing [14, 15] and Computer Vision [16, 17] tasks. However, despite the efficiency of Mamba-based methods, their performance still lags behind that of Transformer-based methods of comparable size in point cloud analysis [18, 19]. Furthermore, the application of Mamba to point cloud analysis remains an area requiring further investigation. For example, one significant challenge is that point clouds are inherently unordered, whereas Mamba is primarily designed to process structured data. Consequently, the effectiveness of Mamba in handling point cloud data is still uncertain.

In this study, we introduce PoinTramba, a novel hybrid framework for point cloud analysis that harnesses the robust analytical capabilities of the Transformer along with the efficiency of Mamba. Specifically, we segment point clouds into distinct groups and utilize the Transformer and Mamba to model intra-group and inter-group relationships, respectively. First, capitalizing on the Transformer’s exceptional ability to model long-range dependencies, we employ it to capture intra-group dependencies and generate enhanced intra-group features, i.e., group embeddings. Using the Transformer within a group remains feasible because each group contains only a small number of points, but applying it to inter-group dependencies would significantly increase complexity because the number of groups is large. To efficiently model inter-group dependencies, we integrate Mamba, which reduces complexity to a linear scale. Specifically, we feed the group embeddings produced by the Transformer into a Mamba encoder. This encoder extracts inter-group global features from each point cloud sample, facilitating analytical tasks such as classification and segmentation.

In particular, instead of applying Mamba directly to unordered point cloud groups, we introduce a bi-directional importance-aware ordering (BIO) strategy. This approach reorders the groups to mitigate the negative effects of random point cloud ordering on our Mamba encoder. Unlike previous methods [18, 19], such as those using conventional z ordering [20] or Hilbert ordering [21], our proposed algorithm learns an “importance” score for each group embedding and reorders the groups accordingly. Specifically, we map the group embeddings and the global features from the Mamba encoder to the same feature space. Then, we calculate the cosine similarity between each group embedding and the global feature. Additionally, we utilize an importance score prediction module that enables each group embedding to predict an importance score closely aligned with the calculated cosine similarity. Based on these importance scores, we reorder the group embeddings in a bi-directional manner, combining both descending and ascending orders. This ensures that each group embedding aggregates additional information from all other group embeddings. This reordering strategy fully exploits Mamba’s potential for processing structured data, resulting in more refined global inter-group features and significantly enhancing analytical performance.

By incorporating both Transformer and Mamba modules into our end-to-end hybrid analytical framework, we achieve comparable performance results on benchmark datasets while maintaining efficient complexity. Our contributions can be summarized as follows: 1) We propose PoinTramba, a novel hybrid framework combining Transformer and Mamba for efficient and effective point cloud analysis. This framework leverages the powerful modeling capacity of Transformers to produce enhanced intra-group features and the linear complexity of Mamba to generate inter-group features from a large number of group embeddings. 2) We propose a new bi-directional importance-aware ordering strategy to reorder the group embeddings. This operation exploits the recurrent nature of Mamba to better process the structured group order and to aggregate additional information from all other group embeddings. 3) We conduct extensive analytical experiments on point cloud classification and segmentation using the ScanObjectNN, ModelNet40, and ShapeNetPart benchmark datasets. These experiments demonstrate that our method achieves comparable quantitative results and validate its effectiveness.

2 Related Work

2.1 Deep Learning on Point Cloud Recognition

Deep learning on point cloud recognition has attracted great attention as point clouds contain abundant depth information that can be used in various applications like autonomous driving and robotics [22]. Compared with 2D images, 3D point clouds are sparse and the points are unordered, making it difficult to directly process point clouds for recognition.

Earlier analytical studies [3, 4] have directly utilized auxiliary data structures like voxels, enabling the application of conventional deep learning methods such as 3D convolutions [23, 24]. However, using these additional data structures results in significant computational and memory costs. To address this issue, the pioneering PointNet [5] uses MLPs to directly process the point sets; it is simple yet achieves promising performance and has inspired a series of works such as PointNet++ [6] and DGCNN [7]. However, these methods rely only on local dependencies and often neglect global features, leading to limited performance. Recently, inspired by the success of the Transformer architecture in natural language processing and 2D vision on long-range context modeling [10, 25], various works have explored the effectiveness of the Transformer architecture in point clouds [26, 27, 8, 9] and have achieved great performance.

However, the computational complexity of the attention module in the Transformer is $\mathcal{O}(n^{2})$, where $n$ indicates the number of input tokens. Therefore, despite the remarkable performance of the Transformer on point cloud recognition, when the number of input tokens increases, such Transformer-based methods incur significant computational overhead, limiting their capacity.

2.2 State Space Models

State space models are widely used in processing sequences of information by connecting the inputs and the outputs using latent states to model a system [28]. Recently, inspired by the structured state space (S4) model [13], various works [29, 30] have been proposed to use state space models to model long-range dependencies. In particular, the newly proposed Mamba [11] has attracted great attention due to its great potential in global perception. Compared with the $\mathcal{O}(n^{2})$ computational complexity of the Transformer, the computational complexity of Mamba is $\mathcal{O}(n)$, leading to its linear scalability in sequence length. Inspired by the success of Mamba, various works have been proposed to examine the effectiveness of Mamba on 2D vision tasks, including image classification [16, 17] and medical image segmentation [31]. However, these Mamba-based methods still cannot outperform Transformer-based methods of comparable size [32]. Therefore, various works have been proposed to combine the advantages of Transformer and Mamba for better performance and lower computational complexity, which remains an open issue that deserves further exploration [32, 33].

2.3 Mamba on Point Clouds

Inspired by the success of Mamba on 2D vision tasks [16, 17, 31], some recent works have been proposed to examine the effectiveness of Mamba on 3D point clouds. In particular, PointMamba [18] combines the Mamba encoder with the standard point cloud group dividing operation while PCMamba [19] combines Mamba with PointMLP [34] to extract the potential of Mamba on point clouds.

Compared with sequential language or image data, point clouds are highly unordered. However, Mamba is a recurrent model, and the order of the hidden states will severely influence the performance of the long-range dependency modeling of Mamba. Therefore, the key issue in adapting Mamba to point clouds lies in the ordering strategy of point clouds. To tackle the issue, PointMamba reorders the point clouds along the axes while PCMamba reorders the point clouds according to multiple orders including z order [20], Hilbert order [21], etc. However, these ordering strategies are heuristic, and whether they are suitable for point cloud analysis with Mamba remains unclear.

3 Method

3.1 PoinTramba

The pipeline of our method is shown in Fig. 1 (a). Given an input point cloud $\mathcal{P}$, we aim to produce a global feature $\boldsymbol{f}$ that can be utilized for various downstream tasks. Our method begins by segmenting $\mathcal{P}$ into $G$ point groups. For each group, we employ an Intra-group Transformer encoder, consisting of $T$ Transformer layers, to generate $G$ group embeddings $\{\boldsymbol{e}_{g}\}_{g=1}^{G}$, where $g$ indicates the $g$-th point group, as shown in Fig. 1 (b). Subsequently, we introduce a bi-directional importance-aware ordering (BIO) algorithm to reorder the group embeddings $\{\boldsymbol{e}_{g}\}_{g=1}^{G}$ in a bi-directional manner, combining both descending and ascending orders into $\{\boldsymbol{e}^{0}_{g}\}_{g=1}^{2G}$. These reordered embeddings are then passed through an Inter-group Mamba encoder, as shown in Fig. 1 (c). After updating with $M$ Mamba layers, we obtain an updated set of features $\{\boldsymbol{e}^{M}_{g}\}_{g=1}^{2G}$, which is used to produce the global feature $\boldsymbol{f}$ via an importance-aware pooling (IAP) operation. Note that the superscript $m\in\{0,\cdots,M\}$ indicates the $m$-th Mamba layer, where $m=0$ indicates the input to the first Mamba layer.

Figure 1: The overview of our newly proposed PoinTramba framework (a) and its two main modules, the Intra-group Transformer Encoder (b) and the Inter-group Mamba Encoder (c). Initially, we segment the input point cloud into distinct point groups. Following this, we employ a Transformer encoder to model intra-group relationships and generate group embeddings. An importance-score prediction module is then utilized to predict the importance score for each group embedding. These predicted importance scores are used to reorder the group embeddings. Finally, a Mamba encoder extracts inter-group relationships from the reordered group embeddings, which are subsequently fed into an importance-aware pooling layer. This layer captures the global feature that can be further utilized for various downstream tasks such as classification and segmentation.

3.2 Intra-group Transformer and Inter-group Mamba Encoder

Intra-group Transformer Encoder. We begin by segmenting the point cloud $\mathcal{P}$ into $G$ groups, each containing $K$ points, formally denoted as $\mathcal{P}=\{\mathcal{P}_{g}\}_{g=1}^{G}$, s.t., $\mathcal{P}_{g}=\{\mathcal{P}_{g_{k}}\}_{k=1}^{K}$. Initially, we employ the Farthest Point Sampling (FPS) algorithm to select $G$ keypoints. Subsequently, we use the K-Nearest Neighbors (KNN) algorithm to find the $K$ nearest neighbors for each keypoint. In this notation, the subscript $g$ refers to the $g$-th group of points, and $g_{k}$ denotes the $k$-th point within the $g$-th group. After segmenting the point cloud into $G$ groups, we leverage the Transformer’s exceptional capability to model long-range dependencies to capture intra-group dependencies and generate enhanced intra-group features, referred to as group embeddings. Specifically, for each point $\mathcal{P}_{g_{k}}$, we first project the coordinates of the point, i.e., $(x_{g_{k}},y_{g_{k}},z_{g_{k}})$, into a point embedding and a point position embedding using a standard embedding layer of the Transformer [10, 25, 35]. The sums of these point embeddings and point position embeddings are then fed into a standard Transformer encoder, which consists of $T$ layers, to model intra-group relationships and generate the group embedding $\boldsymbol{e}_{g}$.
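The grouping step described above can be illustrated with a minimal sketch in plain Python. It is not the released implementation: the function names `farthest_point_sampling` and `knn_group`, the choice of index 0 as the first keypoint, and the toy sizes (64 points, 8 groups of 4) are all assumptions made for illustration.

```python
import math
import random


def farthest_point_sampling(points, num_groups, start_index=0):
    """Pick `num_groups` keypoint indices that are spread far apart."""
    # Distance from every point to its nearest already-chosen keypoint.
    min_dist = [math.inf] * len(points)
    chosen = [start_index]  # assumption: sampling starts from index 0
    for _ in range(num_groups - 1):
        last = points[chosen[-1]]
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, last))
        # The next keypoint is the point farthest from all chosen keypoints.
        chosen.append(max(range(len(points)), key=min_dist.__getitem__))
    return chosen


def knn_group(points, keypoint_indices, k):
    """For each keypoint, return the indices of its k nearest points."""
    groups = []
    for c in keypoint_indices:
        by_distance = sorted(
            range(len(points)), key=lambda i: math.dist(points[i], points[c])
        )
        groups.append(by_distance[:k])
    return groups


# Toy point cloud: 64 random 3D points split into G = 8 groups of K = 4 points.
random.seed(0)
cloud = [(random.random(), random.random(), random.random()) for _ in range(64)]
keypoints = farthest_point_sampling(cloud, num_groups=8)
groups = knn_group(cloud, keypoints, k=4)
```

Each entry of `groups` holds the point indices of one group $\mathcal{P}_{g}$; the keypoint is its own nearest neighbor, so it appears first in its group. Groups may overlap, since KNN is run independently for every keypoint.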

Inter-group Mamba Encoder. While using the Transformer to model intra-group point clouds is feasible due to the relatively small number of points within each group, modeling dependencies among a large number of groups (inter-group) would significantly increase computational complexity. To address this, we integrate the Mamba Encoder, which efficiently reduces the complexity to a linear scale when modeling inter-group dependencies. Specifically, after obtaining the group embeddings $\{\boldsymbol{e}_{g}\}_{g=1}^{G}$, we propose a BIO strategy to obtain the reordered group embeddings $\{\tilde{\boldsymbol{e}}_{g}\}_{g=1}^{2G}$. More details about the BIO strategy will be provided in Sec. 3.3. Similarly, we reorder the position embeddings of the coordinates of the keypoints using our BIO strategy. With both reordered group and positional embeddings, we add them to obtain the aggregation $\boldsymbol{E}^{m}$ of the collection $\{\boldsymbol{e}^{m}_{g}\}_{g=1}^{2G}$, which will be updated by an $M$-layer Mamba encoder. As shown in Fig. 1 (c), we adopt the Mamba layer from [18], which is a standard Mamba layer that can be presented as:

$$\begin{aligned}\boldsymbol{z}^{m}&=DW\_Conv\left(MLP\left(LN\left(\boldsymbol{E}^{m-1}\right)\right)\right),\\ \boldsymbol{E}^{m}&=MLP\left(LN\left(SSM\left(\sigma\left(\boldsymbol{z}^{m}\right)\right)\right)\cdot\sigma\left(LN\left(\boldsymbol{E}^{m-1}\right)\right)\right)+\boldsymbol{E}^{m-1},\end{aligned} \tag{1}$$

where $m$ indicates the $m$-th Mamba layer. Here, $DW\_Conv$ refers to depth-wise convolution, $LN$ denotes the LayerNorm operation, $SSM$ represents the state space model, which serves as a basic layer of our Mamba encoder, and $\cdot$ denotes the dot product. Additionally, $\sigma$ denotes the SiLU [36] operation. After updating the group embeddings through $M$ Mamba layers, we obtain the final updated group embeddings $\{\boldsymbol{e}_{g}^{M}\}_{g=1}^{2G}$. These updated group embeddings are subsequently passed into an importance-aware pooling (IAP) operation to extract the global feature $\boldsymbol{f}$. By leveraging long-range relationship modeling and the linear scalability in sequence length, we enable the Mamba encoder to perform global perception.
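To make the complexity argument concrete, the following back-of-the-envelope sketch counts token interactions only. It ignores feature dimensions and constant factors, and the values $G=512$ and $K=32$ are hypothetical, chosen purely for illustration rather than taken from our experimental settings.

```python
def attention_cost(n):
    """Pairwise token interactions of self-attention: grows as n^2."""
    return n * n


def linear_scan_cost(n):
    """Per-token updates of a recurrent scan such as an SSM: grows as n."""
    return n


G, K = 512, 32  # hypothetical: 512 groups of 32 points each

# Attention restricted to the K points inside each of the G groups.
intra_group_attention = G * attention_cost(K)
# Attention over the 2G reordered group embeddings, if it were used there.
inter_group_attention = attention_cost(2 * G)
# Linear-time scan over the same 2G group embeddings.
inter_group_scan = linear_scan_cost(2 * G)
# Attention over all G * K points at once, with no grouping.
flat_attention = attention_cost(G * K)

for name, cost in [
    ("intra-group attention", intra_group_attention),
    ("inter-group attention", inter_group_attention),
    ("inter-group scan", inter_group_scan),
    ("flat attention", flat_attention),
]:
    print(f"{name:>22}: {cost:>12,d}")
```

Under these assumed sizes, the ratio between inter-group attention and the inter-group scan equals the sequence length $2G$, so the gap widens as more groups are used.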

3.3 Importance-aware Ordering and Pooling

Bi-directional Importance-aware Ordering. Due to the recurrent nature of Mamba [11] and the unordered nature of point clouds, feeding randomly ordered group embeddings into the Mamba encoder significantly affects Mamba’s performance of long-range dependency modeling. To address this issue, we propose a novel bi-directional importance-aware ordering (BIO) strategy to reorder the group embeddings, mitigating the adverse effects of random point cloud ordering.

Figure 2: The detailed design of our importance score prediction module (a) and our importance-aware pooling layer (b). The importance score prediction module targets at calculating the similarity between the group embeddings and the global feature, thus predicting the importance scores for group embeddings. The importance-aware pooling layer targets at aggregating the updated group embeddings to obtain the global feature.

In particular, as illustrated in Fig. 2, given the group embeddings $\{\boldsymbol{e}_{g}\}_{g=1}^{G}$, we employ two non-linear projection layers, i.e., an embedding projection layer and a global projection layer. These two layers map the group embeddings and the global feature of the point cloud $\boldsymbol{f}$ into the same feature space, respectively, i.e., $\hat{\boldsymbol{e}}_{g}$ and $\hat{\boldsymbol{f}}$. Note that we will introduce the global feature of the point cloud later. Therefore, we can calculate the cosine similarity $S_{g}$ between $\hat{\boldsymbol{e}}_{g}$ and $\hat{\boldsymbol{f}}$ as the importance score that can be presented as:

$$S_{g}=\frac{\hat{\boldsymbol{e}}_{g}^{\top}\hat{\boldsymbol{f}}}{||\hat{\boldsymbol{e}}_{g}||\times||\hat{\boldsymbol{f}}||}. \tag{2}$$

However, calculating $S_{g}$ for each group embedding is not feasible as it requires a known global feature in the ordering stage, which is impractical. To overcome this limitation, we use a non-linear projection layer to predict the corresponding importance score of each group embedding, i.e., $I_{g}$. We encourage $I_{g}$ to approximate $S_{g}$ using an importance loss, $\mathcal{L}^{importance}$, which can be formulated as:

$$\mathcal{L}^{importance}=\frac{1}{N\times G}\sum_{n=1}^{N}\sum_{g=1}^{G}\mathcal{L}_{smooth}(S_{n,g},I_{n,g}), \tag{3}$$

where $\mathcal{L}_{smooth}$ is the smooth $L1$ loss. Note that we introduce a new subscript $n$ here, indicating the $n$-th input point cloud, where there are $N$ point clouds in total. Below, we may omit the subscript $n$ when it is unnecessary to distinguish the index $n$ of the point cloud.
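As a minimal sketch of Eqs. (2) and (3), the snippet below computes the cosine-similarity target and the smooth $L1$ importance loss in plain Python. The function names and the smooth $L1$ threshold `beta=1.0` are assumptions for illustration; the text above does not specify the threshold.

```python
import math


def cosine_similarity(u, v):
    """Eq. (2): inner product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def smooth_l1(target, pred, beta=1.0):
    """Smooth L1: quadratic for small errors, linear for large ones."""
    diff = abs(target - pred)
    # beta = 1.0 is an assumed threshold, not a value given in the paper.
    return 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta


def importance_loss(projected_groups, projected_globals, predicted_scores):
    """Eq. (3): mean smooth L1 between S_{n,g} and I_{n,g} over N clouds, G groups."""
    total, count = 0.0, 0
    for e_hats, f_hat, scores in zip(
        projected_groups, projected_globals, predicted_scores
    ):
        for e_hat, i_ng in zip(e_hats, scores):
            total += smooth_l1(cosine_similarity(e_hat, f_hat), i_ng)
            count += 1
    return total / count


# Toy case with N = 1 point cloud and G = 2 projected group embeddings.
e_hat = [[(1.0, 0.0), (0.0, 1.0)]]   # projected group embeddings
f_hat = [(1.0, 0.0)]                 # projected global feature
i_pred = [[0.5, 0.0]]                # predicted importance scores
loss = importance_loss(e_hat, f_hat, i_pred)
```

In this toy case the targets are $S_{1,1}=1$ and $S_{1,2}=0$, so only the first group contributes an error of 0.5 to the loss.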

In addition, to learn meaningful projection layers, following previous works [37, 38], we perform a group embedding-to-global feature alignment using an alignment loss, i.e., $\mathcal{L}^{alignment}$, which can be formulated as:

$$\mathcal{L}^{alignment}=\frac{1}{N\times G}\sum_{n=1}^{N}\sum_{g=1}^{G}\left(-\log\frac{\hat{\boldsymbol{e}}_{n,g}^{\top}\hat{\boldsymbol{f}}_{n}}{\sum_{m}\hat{\boldsymbol{e}}_{n,g}^{\top}\hat{\boldsymbol{f}}_{m}}\right). \tag{4}$$

After predicting the importance scores $I_{g}$ of the group embeddings, we reorder them in a bi-directional manner, i.e., $\boldsymbol{o}_{g}=\left[\boldsymbol{o}_{g}^{1},\boldsymbol{o}_{g}^{2}\right]$, where $\boldsymbol{o}_{g}^{1}$ is the descending order of $I_{g}$ and $\boldsymbol{o}_{g}^{2}$ is the ascending order of $I_{g}$. This process yields the reordered group embeddings $\{\tilde{\boldsymbol{e}}_{g}\}_{g=1}^{2G}$. This bi-directional ordering strategy ensures that each group embedding aggregates information from all other group embeddings.
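The reordering step can be sketched as follows. This is an illustrative plain-Python version rather than the released code: the helper names `bio_order` and `reorder` are hypothetical, and the tie-breaking rule (the ascending half is the exact mirror of the descending half) is an assumption.

```python
def bio_order(importance_scores):
    """Return 2G indices: descending by score, followed by ascending by score."""
    descending = sorted(
        range(len(importance_scores)),
        key=lambda g: importance_scores[g],
        reverse=True,
    )
    # Assumption: the ascending half mirrors the descending half exactly.
    ascending = descending[::-1]
    return descending + ascending


def reorder(group_embeddings, importance_scores):
    """Gather the G embeddings into the 2G-long bi-directional sequence."""
    return [group_embeddings[g] for g in bio_order(importance_scores)]


# Toy case with G = 4; strings stand in for embedding vectors.
scores = [0.2, -0.5, 0.9, 0.1]
embeddings = ["e1", "e2", "e3", "e4"]
sequence = reorder(embeddings, scores)
print(sequence)
```

Every embedding appears twice in the 2G-long sequence. Because a recurrent scan only sees earlier positions, the second (ascending) copy of each embedding is preceded by the entire descending half, which is one way to read the claim that every group embedding can aggregate information from all others.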

Importance-aware Pooling. Given the updated group embeddings $\{\boldsymbol{e}^{M}_{g}\}_{g=1}^{2G}$, we perform a pooling operation to obtain the global feature $\boldsymbol{f}$ from these embeddings. Recall that the predicted importance scores of the group embeddings represent the cosine similarities between the group embeddings and the global feature, highlighting the importance of each group embedding. A negative importance score suggests that the corresponding group embedding negatively impacts the global feature. Therefore, we propose to discard group embeddings with negative importance scores and reweight the remaining embeddings using their importance scores. This process is represented as:

$$\boldsymbol{f}=\sum_{g=1}^{2G}\boldsymbol{e}^{M}_{g}\cdot\mathbf{1}(I_{g}), \tag{5}$$

where $\mathbf{1}(\cdot)$ is the unit step function. The resulting global feature of the point cloud can be utilized for various downstream tasks such as classification or segmentation, using the corresponding loss $\mathcal{L}^{task}$.
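The pooling step can be illustrated with a small pure-Python sketch. It is not the authors' code: the names `unit_step` and `importance_aware_pool` are hypothetical, and so are the toy embeddings and scores. With `reweight=False` the function follows Eq. (5) as printed (a gated sum); the optional `reweight=True` branch additionally scales each kept embedding by its score, which is one possible reading of the reweighting described in the prose and should be treated as an assumption.

```python
def unit_step(score):
    # 1.0 for positive scores, 0.0 otherwise. Treating a score of exactly
    # zero as discarded is an assumption of this sketch.
    return 1.0 if score > 0 else 0.0


def importance_aware_pool(embeddings, scores, reweight=False):
    """Sum the embeddings whose importance score passes the unit step gate.

    reweight=False follows Eq. (5) as printed; reweight=True also scales each
    kept embedding by its score (an assumed reading of the prose).
    """
    dim = len(embeddings[0])
    pooled = [0.0] * dim
    for embedding, score in zip(embeddings, scores):
        weight = unit_step(score) * (score if reweight else 1.0)
        for d in range(dim):
            pooled[d] += weight * embedding[d]
    return pooled


embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # toy group embeddings
scores = [0.5, -0.2, 1.0]                          # toy importance scores
gated = importance_aware_pool(embeddings, scores)
reweighted = importance_aware_pool(embeddings, scores, reweight=True)
print(gated)       # [6.0, 8.0]
print(reweighted)  # [5.5, 7.0]
```

In both cases the second embedding, whose score is negative, contributes nothing to the global feature.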

3.4 Objective Function

We adopt a multi-faceted approach to loss computation for optimization. Specifically, we incorporate the following loss components: 1) Task loss $\mathcal{L}^{task}$ optimizes the specific downstream tasks, such as classification or segmentation, ensuring robust performance. 2) Importance loss $\mathcal{L}^{importance}$ optimizes the ordering of importance scores for different group embeddings. 3) Alignment loss $\mathcal{L}^{alignment}$ optimizes the prediction of importance scores, ensuring that the model learns semantically meaningful embedding projection layers. We use three trade-off parameters, $\alpha$, $\beta$, and $\gamma$, to balance each loss component. Consequently, we train the entire network by solving the following optimization problem in an end-to-end fashion:

$$\mathcal{L}=\alpha\mathcal{L}^{task}+\beta\mathcal{L}^{importance}+\gamma\mathcal{L}^{alignment}. \tag{6}$$
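For concreteness, Eq. (6) amounts to a weighted sum of three scalar losses. The sketch below is illustrative only: `total_loss` is a hypothetical helper and the per-batch loss values are made up. The trade-off weights are the ones reported in Sec. A.2 (all 1.0 for classification; 5.0, 1.0 and 1.0 for part segmentation).

```python
def total_loss(task, importance, alignment, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the three loss terms, as in Eq. (6)."""
    return alpha * task + beta * importance + gamma * alignment


# Hypothetical per-batch loss values, chosen only for illustration.
task, importance, alignment = 0.5, 0.25, 0.125

classification = total_loss(task, importance, alignment)           # weights 1, 1, 1
segmentation = total_loss(task, importance, alignment, alpha=5.0)  # weights 5, 1, 1
print(classification)  # 0.875
print(segmentation)    # 2.875
```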

4 Experiments

4.1 Experimental Protocols

Following previous works [18], we evaluate the effectiveness of our method on three downstream tasks, i.e., real-world object classification on ScanObjectNN [39], synthetic object classification on ModelNet40 [4], and part segmentation on ShapeNetPart [40]. The details of the datasets are provided in Sec. A.1 and the implementation details are outlined in Sec. A.2.

Table 1: Comparison of classification accuracies (in %) with the state-of-the-art methods on the ScanObjectNN on three variants, with PB-T50-RS being the most challenging one. The best performance is highlighted in bold. denotes using rotational augmentation for training. Hybrid denotes the hybrid Transformer and Mamba backbone. All of the methods are trained from scratch without pre-training.

| Methods | Backbone | Param. (M) ↓ | OBJ-BG ↑ | OBJ-ONLY ↑ | PB-T50-RS ↑ |
|---|---|---|---|---|---|
| PointNet [5] | MLP | 3.5 | 73.3 | 79.2 | 68.0 |
| PointNet++ [6] | MLP | 1.5 | 82.3 | 84.3 | 77.9 |
| PointCNN [41] | MLP | 0.6 | 86.1 | 85.5 | 78.5 |
| DGCNN [7] | MLP | 1.8 | 82.8 | 86.2 | 78.1 |
| MVTN [42] | MLP | 11.2 | - | - | 82.8 |
| PointNeXt [43] | MLP | 1.4 | - | - | 87.7 |
| PointMLP [34] | MLP | 13.2 | - | - | 85.4 |
| Point-BERT [8] | Transformer | 22.1 | 79.9 | 80.6 | 77.2 |
| PointMAE [9] | Transformer | 22.1 | 86.8 | 86.9 | 80.8 |
| PointMamba [18] | Mamba | 12.3 | 88.3 | 87.8 | 82.5 |
| PCM [19] | Mamba | 34.2 | - | - | 88.1 |
| PoinTramba (Ours) | Hybrid | 19.5 | 92.3 ± 0.4 | 90.9 ± 0.2 | 84.5 ± 0.1 |
| PoinTramba (Ours) | Hybrid | 19.5 | 92.3 ± 0.2 | 91.3 ± 0.4 | 89.1 ± 0.2 |

Table 2: Comparison of classification accuracies (in %) with the state-of-the-art methods on the ModelNet40. The best performance is highlighted in bold. denotes reproduced results. denotes using voting by averaging the results of 10 randomly scaled input point clouds. Hybrid denotes the hybrid Transformer and Mamba backbone. All of the methods are trained from scratch without pre-training.

| Methods | Backbone | Param. (M) ↓ | Accuracy ↑ |
|---|---|---|---|
| PointNet [5] | MLP | 3.5 | 89.2 |
| PointNet++ [6] | MLP | 1.5 | 90.7 |
| PointCNN [41] | MLP | 0.6 | 92.2 |
| DGCNN [7] | MLP | 1.8 | 92.9 |
| PointNeXt [43] | MLP | 1.4 | 92.9 |
| OctFormer [44] | Transformer | - | 92.7 |
| PointMAE [9] | Transformer | 22.1 | 92.3 |
| PointMamba [18] | Mamba | 12.3 | 92.4 |
| PCM [19] | Mamba | 34.2 | 92.6 |
| PoinTramba (Ours) | Hybrid | 19.5 | 92.7 ± 0.1 |
| PoinTramba (Ours) | Hybrid | 19.5 | 92.9 ± 0.1 |

4.2 Experimental Results

Real-world Object Classification on ScanObjectNN. We first compare our PoinTramba with other methods on the real-world object classification benchmark dataset ScanObjectNN in Table 1. It can be inferred from the table that our method surpasses the current methods, including MLP-based methods like PointNet++ and Transformer-based methods like PointMAE, by a significant margin. Specifically, our method exceeds the previous state-of-the-art (SOTA) by 4.0%, 3.5% and 1.0% on the three variants of ScanObjectNN, respectively. Notably, PoinTramba has fewer parameters than standard Transformer-based methods such as Point-BERT and PointMAE (19.5M vs. 22.1M). Although some MLP-based methods, like PointCNN, introduce fewer parameters, their performance lags significantly behind ours. The primary reason is that the intra-group Transformer encoder in our method enhances local feature extraction, outperforming other methods that rely on MLPs for this task. Additionally, the intra-group Transformer encoder is lightweight, ensuring that PoinTramba does not introduce significantly more parameters compared to PointMamba. It is also worth noting that PCM uses roughly 1.75 times the parameters of PoinTramba (34.2M vs. 19.5M), yet our method still outperforms it, highlighting the effectiveness of our approach.

Synthetic Object Classification on ModelNet40. We then compare our PoinTramba with other methods on the synthetic object classification benchmark dataset ModelNet40 in Table 2. Note that we reproduce the result of PCM as it follows a different setting from ours. It can be inferred from the table that our method achieves SOTA performance: its best accuracy of 92.9% matches the strongest MLP-based results and exceeds the listed Transformer-based and Mamba-based methods by 0.2% to 0.6%.

Part Segmentation on ShapeNetPart. Finally, we compare our PoinTramba with other methods on the part segmentation benchmark dataset ShapeNetPart in Table 3. Note that we reproduce the results of PCM since it requires not only coordinates but also additional four-dimensional information as input. It can be inferred from the table that, despite ShapeNetPart being a highly competitive benchmark, our method achieves performance comparable to previous SOTA methods, validating the effectiveness of our approach.

Table 3: Comparison of part segmentation accuracies (in %) with the state-of-the-art methods on the ShapeNetPart. The mIoU for all instances (Inst.) is reported. The best performance is highlighted in bold. denotes reproduced results. Hybrid denotes the hybrid Transformer and Mamba backbone. All of the methods are trained from scratch without pre-training. The inputs are 3D coordinates.

| Methods | Backbone | Param. (M) ↓ | Inst. mIoU ↑ |
|---|---|---|---|
| PointNet [5] | MLP | - | 83.7 |
| PointNet++ [6] | MLP | - | 85.1 |
| DGCNN [7] | MLP | - | 85.2 |
| APES [45] | MLP | - | 85.8 |
| PointMAE [9] | Transformer | 27.1 | 85.7 |
| PointMamba [18] | Mamba | 17.4 | 85.8 |
| PCM [19] | Mamba | 40.6 | 84.3 |
| PoinTramba (Ours) | Hybrid | 25.4 | 85.7 ± 0.1 |

Table 4: Ablation study on the effectiveness of different components in our method, including the intra-group Transformer encoder (Transformer), the inter-group Mamba encoder (Mamba), the alignment loss (Align), the bi-directional importance-aware ordering strategy (BIO) and the importance-aware pooling (IAP). Experiments are conducted on the PB-T50-RS variant of the ScanObjectNN dataset. The baseline method for comparison (Variant No. 0) is PointNet++.

| Variant No. | Mamba | Transformer | Align | BIO | IAP | Acc. ↑ |
|---|---|---|---|---|---|---|
| 0 | | | | | | 77.9 |
| 1 | ✓ | | | | | 86.1 |
| 2 | ✓ | | ✓ | | | 87.5 |
| 3 | ✓ | ✓ | | | | 86.5 |
| 4 | ✓ | ✓ | ✓ | | | 88.2 |
| 5 | ✓ | | ✓ | ✓ | ✓ | 87.9 |
| 6 | ✓ | ✓ | ✓ | ✓ | | 88.6 |
| 7 | ✓ | ✓ | ✓ | ✓ | ✓ | 89.1 |

4.3 Ablation Study

In this section, we highlight the contributions of our module designs on the PB-T50-RS variant of the ScanObjectNN dataset, as shown in Table 4. From the table, we observe that deploying an inter-group Mamba encoder alone improves recognition performance by 8.2% over PointNet++ (Variant No. 1 vs. Variant No. 0), primarily due to Mamba’s global perception capability. Adding an intra-group Transformer encoder to the inter-group Mamba encoder further enhances performance by 0.4% (Variant No. 3 vs. Variant No. 1), mainly owing to the Transformer’s long-range modeling ability to capture intra-group dependencies. Incorporating our BIO strategy with the PoinTramba model results in a substantial performance improvement of 2.1% (Variant No. 6 vs. Variant No. 3), underscoring the effectiveness of our ordering algorithm. Additionally, the importance-aware pooling operation helps the model focus on useful group embeddings while ignoring irrelevant ones, leading to a further 0.5% performance increase (Variant No. 7 vs. Variant No. 6). Combining BIO and IAP with the Mamba encoder only also achieves a recognition accuracy of 87.9%, surpassing the Mamba encoder alone by 1.8% (Variant No. 5 vs. Variant No. 1). Moreover, considering that the BIO strategy includes an alignment loss, which enhances local feature extraction, we further examine the effectiveness of $\mathcal{L}^{alignment}$. As shown in the table, adopting $\mathcal{L}^{alignment}$ alone improves the Mamba model’s performance by 1.4% (Variant No. 2 vs. Variant No. 1) and the PoinTramba model’s performance by 1.7% (Variant No. 4 vs. Variant No. 3). Additionally, the BIO ordering itself brings a further 0.4% performance improvement on top of the alignment loss (Variant No. 6 vs. Variant No. 4). These ablation studies demonstrate the effectiveness of each component of our method.
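The gains quoted above follow directly from the accuracies in Table 4. The short script below recomputes them as a sanity check; the dictionary simply transcribes the table, and the labels and the helper `gain` are hypothetical names introduced for this illustration.

```python
# Accuracy (%) on PB-T50-RS for each variant, transcribed from Table 4.
accuracy = {0: 77.9, 1: 86.1, 2: 87.5, 3: 86.5, 4: 88.2, 5: 87.9, 6: 88.6, 7: 89.1}

# (label, variant with the component, reference variant)
comparisons = [
    ("Mamba encoder vs. PointNet++", 1, 0),
    ("adding the intra-group Transformer", 3, 1),
    ("alignment loss on Mamba only", 2, 1),
    ("alignment loss on PoinTramba", 4, 3),
    ("BIO ordering on top of alignment", 6, 4),
    ("BIO including alignment loss", 6, 3),
    ("adding IAP", 7, 6),
    ("BIO and IAP on Mamba only", 5, 1),
]


def gain(new, base):
    # Round to one decimal to match the precision reported in the table.
    return round(accuracy[new] - accuracy[base], 1)


gains = {label: gain(new, base) for label, new, base in comparisons}
for label, value in gains.items():
    print(f"{label}: +{value}")
```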

Figure 3: Ablation studies on different ordering strategies and pooling methods, with panels (a) Ordering Strategies and (b) Pooling Methods. Experiments are conducted on the PB-T50-RS variant of the ScanObjectNN dataset. PoinTramba is adopted as the backbone. (a) shows the comparison of different ordering strategies, i.e., random ordering strategy, XYZ ordering strategy, z ordering strategy, Hilbert ordering strategy, our single-directional importance-aware ordering strategy (SIO) and bi-directional importance-aware ordering strategy (BIO). (b) illustrates the comparison of different pooling methods, i.e., average pooling, max pooling, weighted sum and our importance-aware pooling (IAP).

We then verify the importance of the ordering strategy for the Mamba model by adopting PoinTramba as the backbone and comparing six different ordering strategies: the random ordering strategy, the coordinate-based ordering strategy (the XYZ ordering strategy [18]), the z ordering strategy [20], the Hilbert ordering strategy [21], the single-directional importance-aware ordering strategy in descending order (SIO) and the bi-directional importance-aware ordering strategy (BIO). The results are shown in Fig. 3 (a). As can be seen from the figure, the random ordering strategy outperforms the coordinate-based ordering strategy by 0.8%. The main reason, based on our analysis, is likely that the random ordering strategy places embeddings that are both far from and close to a certain embedding into the neighboring region of that embedding. This way, the embedding aggregates information from both long-range and short-range distances, leading to competitive global perception. Notably, our BIO strategy surpasses the SIO strategy by 0.5%, as the BIO strategy ensures that each embedding aggregates information from all other embeddings, resulting in better global perception performance. Additionally, our BIO strategy outperforms the other ordering strategies by a significant margin, indicating its effectiveness.
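To make the geometric baselines concrete, the following sketch shows one common way to obtain a z ordering (Morton order) of group centers: quantize each coordinate to an integer grid, interleave the bits, and sort by the resulting key. It is a generic illustration, not the code used for Fig. 3 (a). The function names, the 10-bit grid resolution, and the choice of which axis occupies the lowest bit are all assumptions.

```python
def morton_code(ix, iy, iz, bits=10):
    """Interleave the bits of three non-negative integers into one key."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)        # x occupies bit 3b
        code |= ((iy >> b) & 1) << (3 * b + 1)    # y occupies bit 3b + 1
        code |= ((iz >> b) & 1) << (3 * b + 2)    # z occupies bit 3b + 2
    return code


def z_order(centers, bits=10):
    """Return group indices sorted along the z-order curve of their centers."""
    lows = [min(c[d] for c in centers) for d in range(3)]
    highs = [max(c[d] for c in centers) for d in range(3)]
    scale = (1 << bits) - 1

    def quantize(center):
        # Map each coordinate to an integer grid cell in [0, 2^bits - 1].
        cells = []
        for d in range(3):
            span = highs[d] - lows[d]
            cells.append(int((center[d] - lows[d]) / span * scale) if span > 0 else 0)
        return cells

    keys = [morton_code(*quantize(c), bits=bits) for c in centers]
    return sorted(range(len(centers)), key=lambda i: keys[i])


# Hypothetical group centers at four corners of the unit cube.
centers = [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
print(z_order(centers))  # [1, 2, 3, 0]
```

Such an ordering depends only on geometry, whereas SIO and BIO sort by the learned importance scores.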

Finally, we verify the importance of our importance-aware pooling operation (IAP), as shown in Fig. 3 (b). We compare our IAP with several different pooling strategies, including the average pooling strategy, the max pooling strategy, and the weighted sum strategy. It can be inferred from the figure that preventing the model from being influenced by useless group embeddings introduces a performance improvement of 0.6%, compared to the direct weighted sum strategy, indicating the effectiveness of our IAP strategy.

5 Conclusion

In this work, we introduce PoinTramba, a pioneering hybrid framework that combines the powerful modeling capacity of Transformers with the computational efficiency of Mamba for point cloud analysis. By integrating these two architectures, PoinTramba achieves a superior balance between computational complexity and analytical performance. Additionally, our innovative BIO strategy significantly improves Mamba’s performance. Extensive experiments demonstrate the effectiveness of PoinTramba, establishing a new state-of-the-art benchmark in point cloud analysis. This novel approach leverages the combined strengths of Transformer and Mamba architectures, marking a significant advancement in the field. For future work, we plan to explore further optimization techniques to enhance the scalability of PoinTramba, investigate its applicability to a broader range of point cloud tasks, and refine our ordering strategy to improve efficiency and performance.

Limitation. In this study, we focused solely on an importance-aware ordering strategy. However, it is not clear whether this approach is the optimal sorting algorithm for fully harnessing the potential of Mamba. Additionally, further experiments across a broader range of tasks are necessary to comprehensively evaluate the capabilities of our PoinTramba.

References

  • [1] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jian** Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10529–10538, 2020.
  • [2] Junyu Luo, Jiahui Fu, Xianghao Kong, Chen Gao, Haibing Ren, Hao Shen, Huaxia Xia, and Si Liu. 3d-sps: Single-stage 3d visual grounding via referred point progressive selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16454–16463, 2022.
  • [3] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922–928. IEEE, 2015.
  • [4] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
  • [5] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017.
  • [6] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30, 2017.
  • [7] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog), 38(5):1–12, 2019.
  • [8] Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19313–19322, 2022.
  • [9] Yatian Pang, Wenxiao Wang, Francis EH Tay, Wei Liu, Yonghong Tian, and Li Yuan. Masked autoencoders for point cloud self-supervised learning. In European conference on computer vision, pages 604–621. Springer, 2022.
  • [10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [11] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
  • [12] Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34:572–585, 2021.
  • [13] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
  • [14] Jongho Park, Jaeseung Park, Zheyang Xiong, Nayoung Lee, Jaewoong Cho, Samet Oymak, Kangwook Lee, and Dimitris Papailiopoulos. Can mamba learn how to learn? a comparative study on in-context learning tasks. arXiv preprint arXiv:2402.04248, 2024.
  • [15] Wei He, Kai Han, Yehui Tang, Chengcheng Wang, Yujie Yang, Tianyu Guo, and Yunhe Wang. Densemamba: State space models with dense hidden connection for efficient large language models. arXiv preprint arXiv:2403.00818, 2024.
  • [16] Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, and Yunfan Liu. Vmamba: Visual state space model. arXiv preprint arXiv:2401.10166, 2024.
  • [17] Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417, 2024.
  • [18] Dingkang Liang, Xin Zhou, Xinyu Wang, Xingkui Zhu, Wei Xu, Zhikang Zou, Xiaoqing Ye, and Xiang Bai. Pointmamba: A simple state space model for point cloud analysis. arXiv preprint arXiv:2402.10739, 2024.
  • [19] Tao Zhang, ** Ji, and Shuicheng Yan. Point could mamba: Point cloud learning via state space model. arXiv preprint arXiv:2403.00762, 2024.
  • [20] Guy M Morton. A computer oriented geodetic data base and a new technique in file sequencing. 1966.
  • [21] David Hilbert. Über die stetige abbildung einer linie auf ein flächenstück. Dritter Band: Analysis· Grundlagen der Mathematik· Physik Verschiedenes: Nebst Einer Lebensgeschichte, pages 1–2, 1935.
  • [22] Lichen Zhao, **yang Guo, Dong Xu, and Lu Sheng. Transformer3d-det: Improving 3d object detection by vote refinement. IEEE Transactions on Circuits and Systems for Video Technology, 31(12):4735–4746, 2021.
  • [23] Daniel Maturana and Sebastian Scherer. 3d convolutional neural networks for landing zone detection from lidar. In 2015 IEEE international conference on robotics and automation (ICRA), pages 3471–3478. IEEE, 2015.
  • [24] Danil Prokhorov. A convolutional learning system for object classification in 3-d lidar data. IEEE Transactions on neural networks, 21(5):858–863, 2010.
  • [25] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [26] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16259–16268, 2021.
  • [27] Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R Martin, and Shi-Min Hu. Pct: Point cloud transformer. Computational Visual Media, 7:187–199, 2021.
  • [28] Rui Xu, Shu Yang, Yihui Wang, Bo Du, and Hao Chen. A survey on vision mamba: Models, applications and challenges. arXiv preprint arXiv:2404.18861, 2024.
  • [29] Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933, 2022.
  • [30] Daniel Y Fu, Tri Dao, Khaled K Saab, Armin W Thomas, Atri Rudra, and Christopher Ré. Hungry hungry hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052, 2022.
  • [31] Jun Ma, Feifei Li, and Bo Wang. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722, 2024.
  • [32] Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887, 2024.
  • [33] Jongho Park, Jaeseung Park, Zheyang Xiong, Nayoung Lee, Jaewoong Cho, Samet Oymak, Kangwook Lee, and Dimitris Papailiopoulos. Can mamba learn how to learn? a comparative study on in-context learning tasks. arXiv preprint arXiv:2402.04248, 2024.
  • [34] Xu Ma, Can Qin, Haoxuan You, Haoxi Ran, and Yun Fu. Rethinking network design and local geometry in point cloud: A simple residual mlp framework. arXiv preprint arXiv:2202.07123, 2022.
  • [35] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.
  • [36] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
  • [37] Zicheng Wang, Wen Li, and Dong Xu. Domain adaptive sampling for cross-domain point cloud recognition. IEEE Transactions on Circuits and Systems for Video Technology, 2023.
  • [38] Yongming Rao, Jiwen Lu, and Jie Zhou. Global-local bidirectional reasoning for unsupervised representation learning of 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5376–5385, 2020.
  • [39] Mikaela Angelina Uy, Quang-Hieu Pham, Binh-Son Hua, Thanh Nguyen, and Sai-Kit Yeung. Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1588–1597, 2019.
  • [40] Li Yi, Vladimir G Kim, Duygu Ceylan, I-Chao Shen, Mengyan Yan, Hao Su, Cewu Lu, Qixing Huang, Alla Sheffer, and Leonidas Guibas. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (ToG), 35(6):1–12, 2016.
  • [41] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on x-transformed points. Advances in neural information processing systems, 31, 2018.
  • [42] Abdullah Hamdi, Silvio Giancola, and Bernard Ghanem. Mvtn: Multi-view transformation network for 3d shape recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1–11, 2021.
  • [43] Guocheng Qian, Yuchen Li, Houwen Peng, **jie Mai, Hasan Hammoud, Mohamed Elhoseiny, and Bernard Ghanem. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. Advances in Neural Information Processing Systems, 35:23192–23204, 2022.
  • [44] Peng-Shuai Wang. Octformer: Octree-based transformers for 3d point clouds. ACM Transactions on Graphics (TOG), 42(4):1–11, 2023.
  • [45] Chengzhi Wu, Junwei Zheng, Julius Pfrommer, and Jürgen Beyerer. Attention-based point cloud edge sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5333–5343, 2023.
  • [46] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Appendix A More Details of Experimental Protocols

A.1 Datasets

Following previous works [18], we evaluate the effectiveness of our method on three downstream tasks, i.e., real-world object classification on ScanObjectNN [39], synthetic object classification on ModelNet40 [4], and part segmentation on ShapeNetPart [40]. The ScanObjectNN dataset [39] is a challenging point cloud object classification dataset consisting of 15,000 point cloud objects from 15 categories scanned from the real world, which includes three variants: OBJ-BG, OBJ-ONLY, and PB-T50-RS. Following previous works [18, 19], we set the number of points for each sample as 2048. The ModelNet40 dataset [4] is a widely-used point cloud object classification dataset covering 40 categories where each category includes 100 synthetic samples. Following previous works [18, 19], we set the number of points for each sample as 1024. The ShapeNetPart [40] dataset is a challenging part segmentation dataset for point clouds, consisting of 16,880 samples from 16 different shape categories with 50 part labels. Following previous works [18, 19], we set the number of points for each sample as 2048.

A.2 Implementation Details

For all of our experiments, we adopt the AdamW [46] optimizer with a cosine scheduler. We train our model for 300 epochs, including a warm-up stage of 10 epochs. The initial learning rate is set to 3e-4 for all classification models and 2e-4 for part segmentation models, with a weight decay of 5e-2 for all experiments. The batch size is set to 32 for all tasks. Our intra-group Transformer encoder includes 4 Transformer layers with a channel size of 384, and no classification token is utilized. Our inter-group Mamba encoder includes 12 Mamba layers with a channel size of 384, and no classification token is utilized. For classification tasks on both the ScanObjectNN and ModelNet40 datasets, the point cloud samples are divided into 256 groups, with each group consisting of 16 points. For part segmentation tasks on the ShapeNetPart dataset, the point cloud samples are divided into 128 groups, with each group consisting of 32 points. The projection layers, including the embedding projection layer and the global projection layer, are two-layer convolutional layers with a hidden dimension of 128 and an output dimension of 256, followed by a BatchNorm layer, a ReLU function, and a normalization operation. The importance score prediction head is a two-layer convolutional layer with a hidden dimension of 128 and an output dimension of 1, followed by a BatchNorm layer and a ReLU function. The hyper-parameters $\alpha$, $\beta$ and $\gamma$ are all set to 1.0 for classification tasks and set to 5.0, 1.0, and 1.0 for part segmentation tasks. Most of our experimental settings are adopted from PointMamba [18], including the classification head and segmentation head. All of our classification models are trained on a server with ten NVIDIA RTX 3090 GPUs, and all of our segmentation models are trained on a server with eight NVIDIA A100 GPUs with 80GB memory.
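The learning-rate schedule described above can be sketched as a plain function of the epoch index. This is a hedged illustration rather than the training script: the function name `learning_rate` is hypothetical, and the linear shape of the warm-up, the per-epoch (rather than per-step) update, and the final learning rate of zero are assumptions that the text does not specify.

```python
import math


def learning_rate(epoch, base_lr=3e-4, warmup_epochs=10, total_epochs=300, min_lr=0.0):
    """Linear warm-up followed by cosine decay, evaluated once per epoch."""
    if epoch < warmup_epochs:
        # Assumed linear ramp that reaches base_lr at the end of the warm-up.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warm-up schedule that has elapsed, in [0, 1).
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Classification setting (3e-4) at a few epochs, and the segmentation setting (2e-4).
for epoch in (0, 9, 10, 155, 299):
    print(epoch, f"{learning_rate(epoch):.2e}")
print(f"{learning_rate(10, base_lr=2e-4):.2e}")
```

In a PyTorch setup, the same values could be passed to the optimizer each epoch; the optimizer itself would use the reported weight decay of 5e-2.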

Appendix B Additional Ablation Study

Table 5: Ablation on the sensitivity of our PoinTramba to the number of groups (Group Number) and the number of points within a point group (Group Size). Experiments are conducted on the PB-T50-RS variant of the ScanObjectNN dataset.

| Group Number | Group Size | Acc. (%) ↑ | Group Number | Group Size | Acc. (%) ↑ |
|---|---|---|---|---|---|
| 8 | 1024 | 87.8 | 32 | 256 | 88.3 |
| 8 | 512 | 87.8 | 32 | 128 | 88.5 |
| 16 | 512 | 89.0 | 64 | 64 | 88.2 |
| 16 | 256 | 89.1 | 128 | 32 | 87.3 |

In this section, we perform additional ablation studies to verify the effectiveness of our method. First, we investigate the sensitivity of our method to the number of groups $G$ and the number of points within a point group $K$. In particular, we divide the point clouds into several groups ranging from 32 to 1024, with the number of points within each point group varying from 128 to 8. The results are reported in Table 5. As shown in the table, a group size that is too small results in limited semantic information for each point group. Consequently, the intra-group Transformer encoder may fail to encode the potential semantic features of each point group, leading to limited recognition performance of PoinTramba. Conversely, when the group size is too large, a lightweight intra-group Transformer encoder, i.e., a 4-layer Transformer encoder, may fail to encode abundant features for each point group, also leading to limited recognition performance. Notably, due to the extraordinary long-range modeling capacity of Mamba, varying the number of point groups does not significantly affect the performance of our PoinTramba model.

Table 6: Ablation study on the sensitivity of our PoinTramba to the number of layers $T$ in the Transformer encoder. Experiments are conducted on the PB-T50-RS variant of the ScanObjectNN dataset.

| Method | Layer Num. | Acc. (%) ↑ | Param. (M) ↓ |
|---|---|---|---|
| PoinTramba (Ours) | 0 | 87.9 | 12.5 |
| PoinTramba (Ours) | 1 | 87.9 | 14.2 |
| PoinTramba (Ours) | 2 | 88.2 | 16.0 |
| PoinTramba (Ours) | 4 | 89.1 | 19.5 |

Finally, we examine the sensitivity of our PoinTramba to the number of layers $T$ in the Transformer encoder. The results are listed in Table 6. As inferred from the table, when the number of layers is small, e.g., 1 or 2, the intra-group Transformer encoder is too lightweight to capture abundant semantic features for each point group. Following this configuration, PoinTramba does not significantly outperform the PoinTramba model that uses MLPs as the intra-group encoder, i.e., when $T=0$. However, when a four-layer Transformer encoder is used, the intra-group encoder effectively models long-range dependencies within each point group, resulting in outstanding recognition performance.
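The parameter growth in Table 6 is consistent with a back-of-the-envelope estimate for a Transformer layer of width 384. The sketch below assumes a standard layer (four attention projections plus a two-layer MLP with expansion ratio 4) and ignores biases and normalization parameters; the MLP ratio is an assumption, since it is not stated in the text, and `transformer_layer_params` is a hypothetical helper.

```python
def transformer_layer_params(dim, mlp_ratio=4):
    """Weight count of one standard Transformer layer, ignoring biases and norms."""
    attention = 4 * dim * dim           # query, key, value and output projections
    mlp = 2 * mlp_ratio * dim * dim     # two linear layers with hidden size mlp_ratio * dim
    return attention + mlp


# Parameter counts (M) from Table 6, indexed by the number of Transformer layers T.
reported = {0: 12.5, 1: 14.2, 2: 16.0, 4: 19.5}

estimated_per_layer = transformer_layer_params(384) / 1e6
observed_per_layer = (reported[4] - reported[0]) / 4

print(f"estimated: {estimated_per_layer:.2f} M per layer")  # estimated: 1.77 M per layer
print(f"observed:  {observed_per_layer:.2f} M per layer")   # observed:  1.75 M per layer
```

Under these assumptions each added layer costs roughly 1.8M parameters, so the four-layer encoder accounts for about 7M of the 19.5M total.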

Appendix C Visualization

Figure 4: Visualization of the importance scores for the point groups predicted by our PoinTramba model. Samples from various categories in ModelNet40 are used as examples. Red regions indicate higher importance scores, while yellow regions indicate lower importance scores.

In this section, we present visualizations of the importance scores predicted by our PoinTramba model in Fig. 4. As illustrated in the figure, points associated with the fuselages of airplanes have higher importance scores than those on the wings. Similarly, points located on the legs of chairs or tables, lampshades, bathtub water tanks, plant stems, stair handrails, and radio antennas also exhibit higher importance scores. Unlike previous conventional ordering strategies [18, 19], our BIO strategy reorders the group embeddings in a bi-directional manner based on the learned importance scores. This approach allows the Mamba model to better process structured data, resulting in more refined global inter-group features and significantly enhancing analytical performance.