PoinTramba:
A Hybrid Transformer-Mamba Framework
for Point Cloud Analysis

Zicheng Wang^1,2, Zhenghao Chen², Yiming Wu¹, Zhen Zhao², Lu** Zhou², Dong Xu¹
¹The University of Hong Kong
²The University of Sydney
{edmond02, yimingwu, dongxu}@hku.hk
{zhenghao.chen, zhen.zhao, lu**.zhou}@sydney.edu.au
Corresponding author

Abstract

Point cloud analysis has seen substantial advancements due to deep learning. Although previous Transformer-based methods excel at modeling long-range dependencies on this task, their computational demands are substantial. Conversely, Mamba offers greater efficiency but shows limited potential compared with Transformer-based methods. In this study, we introduce PoinTramba, a pioneering hybrid framework that synergizes the analytical power of Transformer with the remarkable computational efficiency of Mamba for enhanced point cloud analysis. Specifically, our approach first segments point clouds into groups, where the Transformer captures intricate intra-group dependencies and produces group embeddings, whose inter-group relationships are then captured by an efficient Mamba architecture, ensuring comprehensive analysis. Unlike previous Mamba approaches, we introduce a bi-directional importance-aware ordering (BIO) strategy to tackle the challenges of random ordering effects. This strategy reorders group embeddings based on their calculated importance scores, significantly enhancing Mamba’s performance and optimizing the overall analytical process. Our framework achieves a superior balance between computational efficiency and analytical performance by integrating these techniques, marking a substantial leap forward in point cloud analysis. Extensive experiments on datasets such as ScanObjectNN, ModelNet40, and ShapeNetPart demonstrate the effectiveness of our approach, establishing a new state-of-the-art analysis benchmark on point cloud recognition. For the first time, this paradigm leverages the combined strengths of both Transformer and Mamba architectures, facilitating a new standard in the field. The code is available at https://github.com/xiaoyao3302/PoinTramba.

1 Introduction

Point clouds, which serve as crucial 3D visual signals containing essential geometric information, have garnered increasing research interest [1, 2]. Point cloud analysis tasks, such as classification and segmentation, have achieved remarkable success thanks to the advent of deep learning methods. Early approaches either leveraged auxiliary data structures, like voxels, processed with 3D convolutional operations [3, 4], or directly utilized multilayer perceptrons (MLPs) [5, 6, 7] to extract visual representations from raw point sets. However, these methods primarily rely on local dependencies and often neglect global features. To enhance global perception in point clouds, recent studies (e.g., PointBERT [8] and PointMAE [9]) have introduced Transformer [10] with attention mechanisms to process long sequences. Although the powerful long-range modeling capacity of the Transformer intuitively enhances analytical performance, the comprehensive mechanism inevitably results in high computational complexity and memory usage, rendering these methods less practical.

On the other hand, to avoid the large memory burden when modeling long-range relationships, a new architecture called Mamba [11] with a state space model [12, 13] module has been introduced. This architecture has achieved success in various Natural Language Processing [14, 15] and Computer Vision [16, 17] tasks. However, despite the efficiency of Mamba-based methods, their performance still lags behind that of Transformer-based methods of comparable size in point cloud analysis [18, 19]. Furthermore, the application of Mamba to point cloud analysis remains an area requiring further investigation. For example, one significant challenge is that point clouds are inherently unordered, whereas Mamba is primarily designed to process structured data. Consequently, the effectiveness of Mamba in handling point cloud data is still uncertain.

In this study, we introduce PoinTramba, a novel hybrid framework for point cloud analysis that harnesses the robust analytical capabilities of Transformer along with the efficiency of Mamba. Specifically, we segment point clouds into distinct groups and utilize the Transformer and Mamba to model intra-group and inter-group relationships, respectively. Initially, capitalizing on the Transformer’s exceptional ability to model long-range dependencies, we employ it to capture intra-group dependencies and generate enhanced intra-group features, i.e., group embeddings. While using the Transformer to model intra-group point clouds remains feasible due to the small number of points within the group, this approach would lead to a significant increase in complexity when modeling inter-group dependencies, due to the large number of groups. To efficiently model inter-group dependencies, we integrate Mamba, which reduces complexity to a linear scale. Specifically, we feed the group embeddings produced by the Transformer into a Mamba encoder. This encoder extracts inter-group global features from each point cloud sample, facilitating analytical tasks such as classification and segmentation.

Particularly, instead of applying Mamba directly to unordered point cloud groups, we introduce a bi-directional importance-aware ordering (BIO) strategy. This approach reorders the groups to mitigate the negative effects of random point cloud ordering to our Mamba encoder. Unlike previous methods [18, 19] such as those using conventional z ordering [20] or Hilbert ordering [21], our proposed algorithm learns an “importance” score for each group embedding and reorders the groups accordingly. Specifically, we map the group embeddings and the global features from the Mamba encoder to the same feature space. Then, we calculate the cosine similarity between each group embedding and the global feature. Additionally, we utilize an importance score prediction module that enables each group embedding to predict an importance score closely aligned with the calculated cosine similarity. Based on these importance scores, we reorder the group embeddings in a bi-directional manner, combining both descending and ascending orders. This ensures that each group embedding aggregates additional information from all other group embeddings. This reordering strategy fully exploits Mamba’s potential for processing structured data, resulting in more refined global inter-group features and significantly enhancing analytical performance.

By incorporating both Transformer and Mamba modules into our end-to-end hybrid analytical framework, we achieve comparable performance results on benchmark datasets while maintaining efficient complexity. Our contributions can be summarized as follows: 1) We propose PoinTramba, a novel hybrid framework combining Transformer and Mamba for efficient and effective point cloud analysis. This framework leverages the powerful modeling capacity of Transformers to produce enhanced intra-group features and the linear complexity of Mamba to generate inter-group features from a large number of group embeddings. 2) We propose a new bi-directional importance-aware ordering strategy to reorder the group embeddings. Such an operation exploits the recurrent nature of Mamba to better process the structured group order and to aggregate additional information from all other group embeddings. 3) We conduct extensive analytical experiments on point cloud classification and segmentation using the ScanObjectNN, ModelNet40, and ShapeNetPart benchmark datasets. These experiments demonstrate that our method achieves comparable quantitative results and validate its effectiveness.

2 Related Work

2.1 Deep Learning on Point Cloud Recognition

Deep learning on point cloud recognition has attracted great attention as point clouds contain abundant depth information that can be used in various applications like autonomous driving and robotics [22]. Compared with 2D images, 3D point clouds are sparse and the points are unordered, making it difficult to directly process point clouds for recognition.

Earlier analytical studies [3, 4] have directly utilized auxiliary data structures like voxels, enabling the application of conventional deep learning methods such as 3D convolutions [23, 24]. However, using these additional data structures results in significant computational and memory costs. To address this issue, the pioneering PointNet [5] proposes to use MLPs to directly process the point sets, which is simple but achieves promising performance and has inspired a series of works such as PointNet++ [6] and DGCNN [7]. However, these methods rely only on local dependencies and often neglect global features, thus leading to limited performance. Recently, inspired by the success of the Transformer architecture in natural language processing and 2D vision on long-range context modeling [10, 25], various works have been proposed to explore the effectiveness of the Transformer architecture in point clouds [26, 27, 8, 9] and have achieved great performance.

However, the computational complexity of the attention module in Transformer is $\mathcal{O}(n^{2})$ , where $n$ indicates the number of input tokens. Therefore, despite the remarkable performance of the Transformer on point cloud recognition, when the number of input tokens increases, such Transformer-based methods will result in significant computational overhead, limiting the capacity of such methods.

2.2 State Space Models

The state space models are widely used in processing sequences of information by connecting the inputs and the outputs using latent states to model a system [28]. Recently, inspired by the structured state space (S4) model [13], various works [29, 30] have been proposed to use state space models to model long-range dependencies. In particular, the newly proposed Mamba [11] has attracted great attention due to its great potential in global perception. Compared with the computational complexity of Transformer as $\mathcal{O}(n^{2})$ , the computational complexity of Mamba is $\mathcal{O}(n)$ , leading to its linear scalability in sequence length. Inspired by the success of Mamba, various works have been proposed to examine the effectiveness of Mamba on 2D vision tasks, including image classification [16, 17] and medical image segmentation [31], etc. However, these Mamba-based methods still cannot outperform Transformer-based methods with comparable size [32]. Therefore, various works have been proposed to combine the advantages of Transformer and Mamba for better performance and lower computational complexity, which is still an open issue and deserves exploring [32, 33].
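To make the linear scaling concrete, the following minimal sketch runs a discretized linear state space recurrence, $h_t = a \odot h_{t-1} + b\,x_t$ and $y_t = c^{\top} h_t$, over a scalar input sequence. It is a toy with a diagonal, input-independent state matrix and hypothetical parameter values, not Mamba's selective scan; it only illustrates that a single pass over the sequence costs $\mathcal{O}(n)$.

```python
from typing import List


def ssm_scan(x: List[float], a: List[float], b: List[float], c: List[float]) -> List[float]:
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c . h_t with a diagonal state matrix."""
    h = [0.0] * len(a)  # latent state, one entry per state dimension
    ys = []
    for x_t in x:  # one pass over the sequence: cost grows linearly with len(x)
        h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


# Hypothetical 2-dimensional state with decay factors 0.5 and 0.0.
outputs = ssm_scan([1.0, 0.0, 0.0], a=[0.5, 0.0], b=[1.0, 1.0], c=[1.0, 1.0])
print(outputs)  # [2.0, 0.5, 0.25]
```

The impulse at the first step decays through the state over later steps, which is how information from earlier tokens reaches later ones without any pairwise comparison.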

2.3 Mamba on Point Clouds

Inspired by the success of Mamba on 2D vision tasks [16, 17, 31], some recent works have been proposed to examine the effectiveness of Mamba on 3D point clouds. In particular, PointMamba [18] combines the Mamba encoder with the standard point cloud group dividing operation while PCMamba [19] combines Mamba with PointMLP [34] to extract the potential of Mamba on point clouds.

Compared with sequential language or image data, point clouds are highly unordered. However, Mamba is a recurrent model, and the order of the hidden states will severely influence the performance of the long-range dependency modeling of Mamba. Therefore, the key issue in adapting Mamba to point clouds lies in the ordering strategy of point clouds. To tackle the issue, PointMamba reorders the point clouds along the axes while PCMamba reorders the point clouds according to multiple orders including z order [20], Hilbert order [21], etc. However, these ordering strategies are heuristic, and whether they are suitable for point cloud analysis with Mamba remains unclear.

3 Method

3.1 PoinTramba

The pipeline of our method is shown in Fig. 1 (a). Given an input point cloud $\mathcal{P}$, we aim to produce a global feature $\boldsymbol{f}$ that can be utilized for various downstream tasks. Our method begins by segmenting $\mathcal{P}$ into $G$ point groups. For each group, we employ an Intra-group Transformer encoder, consisting of $T$ Transformer layers, to generate $G$ group embeddings $\{\boldsymbol{e}_{g}\}_{g=1}^{G}$, where $g$ indicates the $g$-th point group, as shown in Fig. 1 (b). Subsequently, we introduce a bi-directional importance-aware ordering (BIO) algorithm to reorder the group embeddings $\{\boldsymbol{e}_{g}\}_{g=1}^{G}$ in a bi-directional manner, combining both descending and ascending orders into $\{\boldsymbol{e}^{0}_{g}\}_{g=1}^{2G}$. These reordered embeddings are then passed through an Inter-group Mamba encoder, as shown in Fig. 1 (c). After updating with $M$ Mamba layers, we obtain an updated set of features $\{\boldsymbol{e}^{M}_{g}\}_{g=1}^{2G}$, which is used to produce the global feature $\boldsymbol{f}$ via an importance-aware pooling (IAP) operation. Note that the superscript $m\in\{0,\cdots,M\}$ indicates the $m$-th Mamba layer, where $m=0$ indicates the input of the first Mamba layer.

Figure 1: The overview of our newly proposed PoinTramba framework (a) and its two main modules, the Intra-group Transformer Encoder (b) and the Inter-group Mamba Encoder (c). Initially, we segment the input point cloud into distinct point groups. Following this, we employ a Transformer encoder to model intra-group relationships and generate group embeddings. An importance-score prediction module is then utilized to predict the importance score for each group embedding. These predicted importance scores are used to reorder the group embeddings. Finally, a Mamba encoder extracts inter-group relationships from the reordered group embeddings, which are subsequently fed into an importance-aware pooling layer. This layer captures the global feature that can be further utilized for various downstream tasks such as classification and segmentation.

3.2 Intra-group Transformer and Inter-group Mamba Encoder

Intra-group Transformer Encoder. We begin by segmenting the point cloud $\mathcal{P}$ into $G$ groups, each containing $K$ points, formally denoted as $\mathcal{P}=\{\mathcal{P}_{g}\}_{g=1}^{G}$ , s.t., $\mathcal{P}_{g}=\{\mathcal{P}_{g_{k}}\}_{k=1}^{K}$ . Initially, we employ the Farthest Point Sampling (FPS) algorithm to select $G$ keypoints. Subsequently, we use the K-Nearest Neighbors (KNN) algorithm to find the $K$ nearest neighbors for each keypoint. In this notation, the subscript $g$ refers to the $g$ -th group of points, and $g_{k}$ denotes the $k$ -th point within the $g$ -th group. After segmenting the point cloud into $G$ groups, we leverage the Transformer’s exceptional capability to model long-range dependencies to capture intra-group dependencies and generate enhanced intra-group features, referred to as group embeddings. Specifically, for each point $\mathcal{P}_{g_{k}}$ , we first project the coordinates of the point, i.e., $(x_{g_{k}},y_{g_{k}},z_{g_{k}})$ , into a point embedding and a point position embedding using a standard embedding layer of the Transformer [10, 25, 35]. The sums of these point embeddings and point position embeddings are then fed into a standard Transformer encoder, which consists of $T$ layers, to model intra-group relationships and generate the group embedding $\boldsymbol{e}_{g}$ .
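The grouping step described above can be sketched in a few lines. This is a minimal, unoptimized illustration of Farthest Point Sampling followed by K-Nearest Neighbors on a tiny point set; the function names, the choice of the first point as the FPS seed, and the example coordinates are assumptions made for illustration and are not taken from the released code.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def farthest_point_sampling(points: Sequence[Point], num_groups: int) -> List[int]:
    """Greedily pick the indices of `num_groups` keypoints that are far apart."""
    chosen = [0]  # assumption: seed with the first point (a random seed is also common)
    # Distance from every point to its nearest already-chosen keypoint.
    min_dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < num_groups:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)]
    return chosen


def knn_group(points: Sequence[Point], center: Point, k: int) -> List[int]:
    """Return the indices of the k points nearest to `center`."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], center))
    return order[:k]


def group_points(points: Sequence[Point], num_groups: int, k: int) -> List[List[int]]:
    """Split a point cloud into G groups of K point indices each."""
    keypoints = farthest_point_sampling(points, num_groups)
    return [knn_group(points, points[c], k) for c in keypoints]


# Two well-separated clusters; G = 2 groups of K = 2 points.
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 0.0, 0.0), (5.1, 0.0, 0.0)]
groups = group_points(cloud, num_groups=2, k=2)
print(groups)  # [[0, 1], [3, 2]]
```

Each inner list holds the point indices of one group $\mathcal{P}_{g}$, ordered by distance to its keypoint; these are the sets that the intra-group Transformer encoder would consume.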

Inter-group Mamba Encoder. While using the Transformer to model intra-group point clouds is feasible due to the relatively small number of points within each group, modeling dependencies among a large number of groups (inter-group) would significantly increase computational complexity. To address this, we integrate the Mamba Encoder, which efficiently reduces the complexity to a linear scale when modeling inter-group dependencies. Specifically, after obtaining the group embeddings $\{\boldsymbol{e}_{g}\}_{g=1}^{G}$ , we propose a BIO strategy to obtain the reordered group embeddings $\{\tilde{\boldsymbol{e}}_{g}\}_{g=1}^{2G}$ . More details about the BIO strategy will be provided in Sec. 3.3. Similarly, we reorder the position embeddings of the coordinates of the keypoints using our BIO strategy. With both reordered group and positional embeddings, we add them to obtain the aggregation $\boldsymbol{E}^{m}$ of the collection $\{\boldsymbol{e}^{m}_{g}\}_{g=1}^{2G}$ , which will be updated by an $M$ -layer Mamba encoder. As shown in Fig. 1 (c), we adopt the Mamba layer from [18], which is a standard Mamba layer that can be presented as:

$$\begin{split}\boldsymbol{z}^{m}&=DW\_Conv\left(MLP\left(LN\left(\boldsymbol{E}^{m-1}\right)\right)\right),\\ \boldsymbol{E}^{m}&=MLP\left(LN\left(SSM\left(\sigma\left(\boldsymbol{z}^{m}\right)\right)\right)\cdot\sigma\left(LN\left(\boldsymbol{E}^{m-1}\right)\right)\right)+\boldsymbol{E}^{m-1},\end{split}\tag{1}$$

where $m$ indicates the $m$-th Mamba layer. Here, $DW\_Conv$ refers to depth-wise convolution, $LN$ denotes the LayerNorm operation, $SSM$ represents the state space model, which serves as a basic layer of our Mamba encoder, and $\cdot$ denotes the element-wise product. Additionally, $\sigma$ denotes the SiLU [36] operation. After updating the group embeddings through $M$ Mamba layers, we obtain the final updated group embeddings $\{\boldsymbol{e}_{g}^{M}\}_{g=1}^{2G}$. These updated group embeddings are subsequently passed into an importance-aware pooling (IAP) operation to extract the global feature $\boldsymbol{f}$. By leveraging long-range relationship modeling and the linear scalability in sequence length, we enable the Mamba encoder to perform global perception.
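As a rough illustration of why the intra-group/inter-group split is economical, the sketch below counts pairwise attention scores and sequential scan steps. The values of $G$ and $K$ are hypothetical rather than the paper's settings, constant factors and feature dimensions are ignored, and the token count of the ungrouped cloud is approximated as $G\times K$.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of pairwise scores computed by self-attention over `num_tokens` tokens."""
    return num_tokens * num_tokens


G, K = 128, 32  # hypothetical number of groups and points per group

full_attention = attention_pairs(G * K)         # attention over all points at once
intra_group = G * attention_pairs(K)            # attention inside each group only
inter_group_attention = attention_pairs(2 * G)  # attention over the 2G reordered embeddings
inter_group_scan_steps = 2 * G                  # a linear scan visits each embedding once

print(full_attention, intra_group, inter_group_attention, inter_group_scan_steps)
# 16777216 131072 65536 256
```

Under these assumptions, restricting attention to groups reduces the pairwise count by a factor of $G$, and replacing inter-group attention with a linear scan removes the remaining quadratic term.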

3.3 Importance-aware Ordering and Pooling

Bi-directional Importance-aware Ordering. Due to the recurrent nature of Mamba [11] and the unordered nature of point clouds, feeding randomly ordered group embeddings into the Mamba encoder significantly affects Mamba’s performance of long-range dependency modeling. To address this issue, we propose a novel bi-directional importance-aware ordering (BIO) strategy to reorder the group embeddings, mitigating the adverse effects of random point cloud ordering.

In particular, as illustrated in Fig. 2, given the group embeddings $\{\boldsymbol{e}_{g}\}_{g=1}^{G}$ , we employ two non-linear projection layers, i.e., an embedding projection layer and a global projection layer. These two layers map the group embeddings and the global feature of the point cloud $\boldsymbol{f}$ into the same feature space, respectively, i.e., $\hat{\boldsymbol{e}}_{g}$ and $\hat{\boldsymbol{f}}$ . Note we will introduce the global feature of the point cloud later. Therefore, we can calculate the cosine similarity $S_{g}$ between $\hat{\boldsymbol{e}}_{g}$ and $\hat{\boldsymbol{f}}$ as the importance score that can be presented as:

$$S_{g}=\frac{\hat{\boldsymbol{e}}_{g}^{\top}\hat{\boldsymbol{f}}}{||\hat{\boldsymbol{e}}_{g}||\times||\hat{\boldsymbol{f}}||}.\tag{2}$$
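A minimal sketch of the importance score in Eq. (2) follows. The projected vectors are passed in directly; the projection layers themselves are omitted, and the example values are made up for illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def importance_scores(projected_groups: Sequence[Sequence[float]],
                      projected_global: Sequence[float]) -> List[float]:
    """S_g for every projected group embedding against the projected global feature."""
    return [cosine_similarity(e, projected_global) for e in projected_groups]


# Hypothetical projected embeddings: aligned, orthogonal, and opposed to the global feature.
scores = importance_scores([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]], [1.0, 0.0])
print(scores)  # [1.0, 0.0, -1.0]
```

Scores lie in $[-1, 1]$ and are independent of the embedding norms, so a group is ranked by its direction relative to the global feature rather than by its magnitude.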

However, calculating $S_{g}$ for each group embedding is not feasible as it requires a known global feature in the ordering stage, which is impractical. To overcome this limitation, we use a non-linear projection layer to predict the corresponding importance score of each group embedding, i.e., $I_{g}$ . We encourage $I_{g}$ to approximate $S_{g}$ using an importance loss, $\mathcal{L}^{importance}$ , which can be formulated as:

$$\mathcal{L}^{importance}=\frac{1}{N\times G}\sum_{n=1}^{N}\sum_{g=1}^{G}\mathcal{L}_{smooth}(S_{n,g},I_{n,g}),\tag{3}$$

where $\mathcal{L}_{smooth}$ is smooth $L1$ loss. Note that we introduce a new subscript $n$ here, indicating the $n$ -th input point cloud, where there are $N$ point clouds in total. Note that below we may ignore the subscript $n$ when it is unnecessary to distinguish the index $n$ of the point cloud.
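The importance loss of Eq. (3) can be sketched as below. The smooth $L1$ threshold `beta = 1.0` is an assumption (it matches the common default in deep learning libraries); the paper does not state the value used, and the example scores are made up.

```python
from typing import Sequence


def smooth_l1(pred: float, target: float, beta: float = 1.0) -> float:
    """Quadratic for small errors, linear for large ones (beta is an assumed threshold)."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def importance_loss(S: Sequence[Sequence[float]], I: Sequence[Sequence[float]]) -> float:
    """Mean smooth L1 between cosine similarities S[n][g] and predicted scores I[n][g]."""
    total, count = 0.0, 0
    for s_row, i_row in zip(S, I):  # loop over the N point clouds
        for s, i in zip(s_row, i_row):  # loop over the G groups
            total += smooth_l1(i, s)
            count += 1
    return total / count


# One point cloud (N = 1) with two groups (G = 2): errors of 0.5 and 2.0.
loss = importance_loss(S=[[1.0, 0.0]], I=[[0.5, 2.0]])
print(loss)  # 0.8125
```

The small error contributes $0.5 \times 0.5^{2} = 0.125$ and the large error contributes $2.0 - 0.5 = 1.5$, giving a mean of $0.8125$.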

In addition, to learn meaningful projection layers, following previous works [37, 38], we perform a group embedding-to-global feature alignment using an alignment loss, i.e., $\mathcal{L}^{alignment}$ , which can be formulated as:

$$\mathcal{L}^{alignment}=\frac{1}{N\times G}\sum_{n=1}^{N}\sum_{g=1}^{G}\left(-\mathrm{log}\frac{\hat{\boldsymbol{e}}_{n,g}^{\top}\hat{\boldsymbol{f}}_{n}}{\sum_{m}\hat{\boldsymbol{e}}_{n,g}^{\top}\hat{\boldsymbol{f}}_{m}}\right).\tag{4}$$

After predicting the importance scores $I_{g}$ of the group embeddings, we reorder them in a bi-directional manner, i.e., $\boldsymbol{o}_{g}=\left[\boldsymbol{o}_{g}^{1},\boldsymbol{o}_{g}^{2}\right]$ , where $\boldsymbol{o}_{g}^{1}$ is the descending order of $I_{g}$ and $\boldsymbol{o}_{g}^{2}$ is the ascending order of $I_{g}$ . This process yields the reordered group embeddings $\{\tilde{\boldsymbol{e}}_{g}\}_{g=1}^{2G}$ . This bi-directional ordering strategy ensures that each group embedding aggregates information from all other group embeddings.
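The bi-directional reordering can be sketched as follows. The function name and the tie-breaking rule (the ascending half is the exact mirror of the descending half) are assumptions for illustration; the scores and embeddings are made up.

```python
from typing import List, Sequence


def bidirectional_order(scores: Sequence[float]) -> List[int]:
    """Indices sorted by descending score, followed by the same indices in ascending order."""
    descending = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    ascending = descending[::-1]  # assumption: mirror the descending half exactly
    return descending + ascending


predicted_scores = [0.2, 0.9, -0.3]               # I_g for G = 3 groups
group_embeddings = [[1.0], [2.0], [3.0]]          # toy 1-dimensional embeddings

order = bidirectional_order(predicted_scores)
reordered = [group_embeddings[g] for g in order]  # sequence of length 2G

print(order)      # [1, 0, 2, 2, 0, 1]
print(reordered)  # [[2.0], [1.0], [3.0], [3.0], [1.0], [2.0]]
```

Because a recurrent scan lets each position see only the positions before it, a group in the descending half sees only higher-scored groups; its copy in the ascending half is preceded by every other group, which is one way to read the claim that each embedding aggregates information from all others.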

Importance-aware Pooling. Given the updated group embeddings $\{\boldsymbol{e}^{M}_{g}\}_{g=1}^{2G}$, we perform a pooling operation to obtain the global feature $\boldsymbol{f}$ from these embeddings. Recall that the predicted importance scores of the group embeddings approximate the cosine similarities between the group embeddings and the global feature, highlighting the importance of each group embedding. A negative importance score suggests that the corresponding group embedding negatively impacts the global feature. Therefore, we propose to discard group embeddings with negative importance scores and reweight the remaining embeddings using their importance scores. This process is represented as:

$$\boldsymbol{f}=\sum_{g=1}^{2G}\boldsymbol{e}^{M}_{g}\cdot\mathbf{1}(I_{g}),\tag{5}$$

where $\mathbf{1}(\cdot)$ is the unit step function. The resulting global feature of the point cloud can be utilized for various downstream tasks such as classification or segmentation, using the corresponding loss $\mathcal{L}^{task}$ .
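A minimal sketch of the pooling in Eq. (5) follows. The default path applies only the unit step mask, exactly as the equation is printed. The optional `reweight` flag additionally scales each kept embedding by its score, which is one possible reading of the reweighting mentioned in the text and is an assumption, as is treating a score of exactly zero as kept.

```python
from typing import List, Sequence


def importance_aware_pooling(embeddings: Sequence[Sequence[float]],
                             scores: Sequence[float],
                             reweight: bool = False) -> List[float]:
    """Sum the embeddings whose importance score is non-negative."""
    pooled = [0.0] * len(embeddings[0])
    for e, s in zip(embeddings, scores):
        if s < 0:  # unit step: discard embeddings with negative importance
            continue
        weight = s if reweight else 1.0  # reweighting is an assumed variant
        for d, value in enumerate(e):
            pooled[d] += weight * value
    return pooled


updated = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # toy updated embeddings
scores = [0.5, -0.2, 1.0]                       # their (reordered) importance scores

print(importance_aware_pooling(updated, scores))                 # [6.0, 8.0]
print(importance_aware_pooling(updated, scores, reweight=True))  # [5.5, 7.0]
```

In the full pipeline the scores would be reordered with the same bi-directional index list as the embeddings, so that each of the $2G$ updated embeddings is paired with its own $I_{g}$.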

3.4 Objective Function

We adopt a multi-faceted approach to loss computation for optimization. Specifically, we incorporate the following loss components: 1) Task loss $\mathcal{L}^{task}$ optimizes the specific downstream tasks, such as classification or segmentation, ensuring robust performance. 2) Importance loss $\mathcal{L}^{importance}$ trains the predicted importance scores $I_{g}$ to approximate the cosine similarities $S_{g}$, which determines the ordering of the group embeddings. 3) Alignment loss $\mathcal{L}^{alignment}$ aligns the projected group embeddings with the projected global feature, ensuring that the model learns semantically meaningful projection layers. We use three trade-off parameters, $\alpha$, $\beta$, and $\gamma$, to balance each loss component. Consequently, we train the entire network by solving the following optimization problem in an end-to-end fashion:

$$\mathcal{L}=\alpha\mathcal{L}^{task}+\beta\mathcal{L}^{importance}+\gamma\mathcal{L}^{alignment}.\tag{6}$$

4 Experiments

4.1 Experimental Protocols

Table 1: Comparison of classification accuracies (in %) with the state-of-the-art methods on three variants of ScanObjectNN, with PB-T50-RS being the most challenging one. The best performance is highlighted in bold. ^† denotes using rotational augmentation for training. Hybrid denotes the hybrid Transformer and Mamba backbone. All of the methods are trained from scratch without pre-training.

Methods	Backbone	Param. (M) $\downarrow$	OBJ-BG $\uparrow$	OBJ-ONLY $\uparrow$	PB-T50-RS $\uparrow$
PointNet [5]	MLP	3.5	73.3	79.2	68.0
PointNet++ [6]	MLP	1.5	82.3	84.3	77.9
PointCNN [41]	MLP	0.6	86.1	85.5	78.5
DGCNN [7]	MLP	1.8	82.8	86.2	78.1
MVTN [42]	MLP	11.2	-	-	82.8
PointNeXt [43]	MLP	1.4	-	-	87.7
PointMLP [34]	MLP	13.2	-	-	85.4
Point-BERT [8]	Transformer	22.1	79.9	80.6	77.2
PointMAE [9]	Transformer	22.1	86.8	86.9	80.8
PointMamba [18]	Mamba	12.3	88.3	87.8	82.5
PCM^† [19]	Mamba	34.2	-	-	88.1
PoinTramba (Ours)	Hybrid	19.5	92.3 $\pm$ 0.4	90.9 $\pm$ 0.2	84.5 $\pm$ 0.1
PoinTramba^† (Ours)	Hybrid	19.5	92.3 $\pm$ 0.2	91.3 $\pm$ 0.4	89.1 $\pm$ 0.2

Table 2: Comparison of classification accuracies (in %) with the state-of-the-art methods on the ModelNet40. The best performance is highlighted in bold. ^∗ denotes reproduced results. ^‡ denotes using voting by averaging the results of 10 randomly scaled input point clouds. Hybrid denotes the hybrid Transformer and Mamba backbone. All of the methods are trained from scratch without pre-training.

Methods	Backbone	Param. (M) $\downarrow$	Accuracy $\uparrow$
PointNet [5]	MLP	3.5	89.2
PointNet++ [6]	MLP	1.5	90.7
PointCNN [41]	MLP	0.6	92.2
DGCNN [7]	MLP	1.8	92.9
PointNeXt [43]	MLP	1.4	92.9
OctFormer [44]	Transformer	-	92.7
PointMAE [9]	Transformer	22.1	92.3
PointMamba [18]	Mamba	12.3	92.4
PCM^∗ [19]	Mamba	34.2	92.6
PoinTramba (Ours)	Hybrid	19.5	92.7 $\pm$ 0.1
PoinTramba^‡ (Ours)	Hybrid	19.5	92.9 $\pm$ 0.1

4.2 Experimental results

Real-world Object Classification on ScanObjectNN. We first compare our PoinTramba with other methods on the real-world object classification benchmark dataset ScanObjectNN in Table 1. It can be inferred from the table that our method surpasses the current methods, including MLP-based methods like PointNet++ and Transformer-based methods like PointMAE, by a significant margin. Specifically, our method exceeds the previous state-of-the-art (SOTA) by 4.0%, 3.5% and 1.0% on the three variants of ScanObjectNN, respectively. Notably, the number of parameters in PoinTramba is even smaller than that of standard Transformer-based methods, such as PointBERT and PointMAE. Although some MLP-based methods, like PointCNN, introduce fewer parameters, their performance lags significantly behind ours. The primary reason is that the intra-group Transformer encoder in our method enhances local feature extraction, outperforming other methods that rely on MLPs for this task. Additionally, the intra-group Transformer encoder is lightweight, ensuring that PoinTramba does not introduce significantly more parameters compared to PointMamba. It is also worth noting that PCM utilizes nearly double the parameters of PoinTramba, yet our method still outperforms it, highlighting the effectiveness of our approach.

Synthetic Object Classification on ModelNet40. We then compare our PoinTramba with other methods on the synthetic object classification benchmark dataset ModelNet40 in Table 2. Note that we reproduce the result of PCM as it follows a different setting from ours. It can be inferred from the table that our method achieves competitive performance: with voting, it reaches 92.9%, matching the best listed results and surpassing the listed Transformer-based and Mamba-based methods by 0.2% to 0.6%.

Part Segmentation on ShapeNetPart. Finally, we compare our PoinTramba with other methods on the part segmentation benchmark dataset ShapeNetPart in Table 3. Note that we reproduce the results of PCM since it requires not only coordinates but also additional four-dimensional information as input. It can be inferred from the table that, although ShapeNetPart is a highly competitive benchmark, our method achieves comparable performance to previous SOTA methods, validating the effectiveness of our approach.

Table 3: Comparison of part segmentation accuracies (in %) with the state-of-the-art methods on the ShapeNetPart. The mIoU for all instances (Inst.) is reported. The best performance is highlighted in bold. ^∗ denotes reproduced results. Hybrid denotes the hybrid Transformer and Mamba backbone. All of the methods are trained from scratch without pre-training. The inputs are 3D coordinates.

Methods	Backbone	Param. (M) $\downarrow$	Inst. mIoU $\uparrow$
PointNet [5]	MLP	-	83.7
PointNet++ [6]	MLP	-	85.1
DGCNN [7]	MLP	-	85.2
APES [45]	MLP	-	85.8
PointMAE [9]	Transformer	27.1	85.7
PointMamba [18]	Mamba	17.4	85.8
PCM^∗ [19]	Mamba	40.6	84.3
PoinTramba (Ours)	Hybrid	25.4	85.7 $\pm$ 0.1

Table 4: Ablation study on the effectiveness of different components in our method, including the intra-group Transformer encoder (Transformer), the inter-group Mamba encoder (Mamba), the alignment loss (Align), the bi-directional importance-aware ordering strategy (BIO) and the importance-aware pooling (IAP). Experiments are conducted on the PB-T50-RS variant of the ScanObjectNN dataset. The baseline method for comparison (Variant No. 0) is PointNet++.

Variant No.	Mamba	Transformer	Align	BIO	IAP	Acc. $\uparrow$
0						77.9
1	$\checkmark$					86.1
2	$\checkmark$		$\checkmark$			87.5
3	$\checkmark$	$\checkmark$				86.5
4	$\checkmark$	$\checkmark$	$\checkmark$			88.2
5	$\checkmark$		$\checkmark$	$\checkmark$	$\checkmark$	87.9
6	$\checkmark$	$\checkmark$	$\checkmark$	$\checkmark$		88.6
7	$\checkmark$	$\checkmark$	$\checkmark$	$\checkmark$	$\checkmark$	89.1

4.3 Ablation Study

In this section, we highlight the contributions of our module designs on the PB-T50-RS variant of the ScanObjectNN dataset, as shown in Table 4. From the table, we observe that deploying an inter-group Mamba encoder alone improves recognition performance by 8.2% over PointNet++ (Variant No. 1 vs. Variant No. 0), primarily due to Mamba’s global perception capability. Adding an intra-group Transformer encoder to the inter-group Mamba encoder further enhances performance by 0.4% (Variant No. 3 vs. Variant No. 1), mainly owing to the Transformer’s long-range modeling ability to capture intra-group dependencies. Incorporating our BIO strategy with the PoinTramba model results in a substantial performance improvement of 2.1% (Variant No. 6 vs. Variant No. 3), underscoring the effectiveness of our ordering algorithm. Additionally, the importance-aware pooling operation helps the model focus on useful group embeddings while ignoring irrelevant ones, leading to a further 0.5% performance increase over the same model without IAP (Variant No. 7 vs. Variant No. 6). Combining BIO and IAP with the Mamba encoder only can also achieve a recognition accuracy of 87.9%, surpassing the Mamba encoder by 1.8% (Variant No. 5 vs. Variant No. 1). Moreover, considering that the BIO strategy includes an alignment loss, which enhances local feature extraction, we further examine the effectiveness of $\mathcal{L}^{alignment}$. As shown in the table, adopting $\mathcal{L}^{alignment}$ alone improves the Mamba model’s performance by 1.4% (Variant No. 2 vs. Variant No. 1) and the PoinTramba model’s performance by 1.7% (Variant No. 4 vs. Variant No. 3). Additionally, the BIO reordering itself, on top of the alignment loss, brings a 0.4% performance improvement (Variant No. 6 vs. Variant No. 4). These ablation studies demonstrate the effectiveness of each component of our method.

We then verify the importance of the ordering strategy for the Mamba model by adopting PoinTramba as the backbone and comparing six different ordering strategies: the random ordering strategy, the coordinate-based (XYZ) ordering strategy [18], the z ordering strategy [20], the Hilbert ordering strategy [21], the single-directional importance-aware ordering strategy in descending order (SIO), and the bi-directional importance-aware ordering strategy (BIO). The results are shown in Fig. 3 (a). As can be seen from the figure, the random ordering strategy outperforms the coordinate-based ordering strategy by 0.8%. The main reason, based on our analysis, is likely that the random ordering strategy places embeddings that are both far from and close to a certain embedding into the neighboring region of that embedding. This way, the embedding aggregates information from both long-range and short-range distances, leading to competitive global perception. Notably, our BIO strategy surpasses the SIO strategy by 0.5%, as the BIO strategy ensures that each embedding aggregates information from all other embeddings, resulting in better global perception performance. Additionally, our BIO strategy outperforms the other ordering strategies by a significant margin, indicating its effectiveness.

Finally, we verify the importance of our importance-aware pooling operation (IAP), as shown in Fig. 3 (b). We compare our IAP with several different pooling strategies, including the average pooling strategy, the max pooling strategy, and the weighted sum strategy. It can be inferred from the figure that preventing the model from being influenced by useless group embeddings introduces a performance improvement of 0.6%, compared to the direct weighted sum strategy, indicating the effectiveness of our IAP strategy.

5 Conclusion

In this work, we introduce PoinTramba, a pioneering hybrid framework that combines the powerful modeling capacity of Transformers with the computational efficiency of Mamba for point cloud analysis. By integrating these two architectures, PoinTramba achieves a superior balance between computational complexity and analytical performance. Additionally, our BIO strategy significantly improves Mamba's performance. Extensive experiments demonstrate the effectiveness of PoinTramba, establishing a new state-of-the-art benchmark in point cloud analysis. This approach leverages the combined strengths of the Transformer and Mamba architectures, marking a significant advancement in the field. For future work, we plan to explore optimization techniques that enhance the scalability of PoinTramba, investigate its applicability to a broader range of point cloud tasks, and refine our ordering strategy to further improve efficiency and performance.

Limitation. In this study, we focused solely on an importance-aware ordering strategy. However, it is not clear whether this approach is the optimal ordering algorithm for fully harnessing the potential of Mamba. Additionally, further experiments across a broader range of tasks are necessary to comprehensively evaluate the capabilities of PoinTramba.

References

[1] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10529–10538, 2020.
[2] Junyu Luo, Jiahui Fu, Xianghao Kong, Chen Gao, Haibing Ren, Hao Shen, Huaxia Xia, and Si Liu. 3d-sps: Single-stage 3d visual grounding via referred point progressive selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16454–16463, 2022.
[3] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922–928. IEEE, 2015.
[4] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
[5] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017.
[6] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30, 2017.
[7] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG), 38(5):1–12, 2019.
[8] Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19313–19322, 2022.
[9] Yatian Pang, Wenxiao Wang, Francis EH Tay, Wei Liu, Yonghong Tian, and Li Yuan. Masked autoencoders for point cloud self-supervised learning. In European conference on computer vision, pages 604–621. Springer, 2022.
[10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[11] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
[12] Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34:572–585, 2021.
[13] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
[14] Jongho Park, Jaeseung Park, Zheyang Xiong, Nayoung Lee, Jaewoong Cho, Samet Oymak, Kangwook Lee, and Dimitris Papailiopoulos. Can mamba learn how to learn? a comparative study on in-context learning tasks. arXiv preprint arXiv:2402.04248, 2024.
[15] Wei He, Kai Han, Yehui Tang, Chengcheng Wang, Yujie Yang, Tianyu Guo, and Yunhe Wang. Densemamba: State space models with dense hidden connection for efficient large language models. arXiv preprint arXiv:2403.00818, 2024.
[16] Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, and Yunfan Liu. Vmamba: Visual state space model. arXiv preprint arXiv:2401.10166, 2024.
[17] Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417, 2024.
[18] Dingkang Liang, Xin Zhou, Xinyu Wang, Xingkui Zhu, Wei Xu, Zhikang Zou, Xiaoqing Ye, and Xiang Bai. Pointmamba: A simple state space model for point cloud analysis. arXiv preprint arXiv:2402.10739, 2024.
[19] Tao Zhang, ** Ji, and Shuicheng Yan. Point could mamba: Point cloud learning via state space model. arXiv preprint arXiv:2403.00762, 2024.
[20] Guy M Morton. A computer oriented geodetic data base and a new technique in file sequencing. 1966.
[21] David Hilbert. Über die stetige abbildung einer linie auf ein flächenstück. Dritter Band: Analysis· Grundlagen der Mathematik· Physik Verschiedenes: Nebst Einer Lebensgeschichte, pages 1–2, 1935.
[22] Lichen Zhao, Jinyang Guo, Dong Xu, and Lu Sheng. Transformer3d-det: Improving 3d object detection by vote refinement. IEEE Transactions on Circuits and Systems for Video Technology, 31(12):4735–4746, 2021.
[23] Daniel Maturana and Sebastian Scherer. 3d convolutional neural networks for landing zone detection from lidar. In 2015 IEEE international conference on robotics and automation (ICRA), pages 3471–3478. IEEE, 2015.
[24] Danil Prokhorov. A convolutional learning system for object classification in 3-d lidar data. IEEE Transactions on neural networks, 21(5):858–863, 2010.
[25] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[26] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16259–16268, 2021.
[27] Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang Mu, Ralph R Martin, and Shi-Min Hu. Pct: Point cloud transformer. Computational Visual Media, 7:187–199, 2021.
[28] Rui Xu, Shu Yang, Yihui Wang, Bo Du, and Hao Chen. A survey on vision mamba: Models, applications and challenges. arXiv preprint arXiv:2404.18861, 2024.
[29] Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933, 2022.
[30] Daniel Y Fu, Tri Dao, Khaled K Saab, Armin W Thomas, Atri Rudra, and Christopher Ré. Hungry hungry hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052, 2022.
[31] Jun Ma, Feifei Li, and Bo Wang. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722, 2024.
[32] Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887, 2024.
[33] Jongho Park, Jaeseung Park, Zheyang Xiong, Nayoung Lee, Jaewoong Cho, Samet Oymak, Kangwook Lee, and Dimitris Papailiopoulos. Can mamba learn how to learn? a comparative study on in-context learning tasks. arXiv preprint arXiv:2402.04248, 2024.
[34] Xu Ma, Can Qin, Haoxuan You, Haoxi Ran, and Yun Fu. Rethinking network design and local geometry in point cloud: A simple residual mlp framework. arXiv preprint arXiv:2202.07123, 2022.
[35] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.
[36] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
[37] Zicheng Wang, Wen Li, and Dong Xu. Domain adaptive sampling for cross-domain point cloud recognition. IEEE Transactions on Circuits and Systems for Video Technology, 2023.
[38] Yongming Rao, Jiwen Lu, and Jie Zhou. Global-local bidirectional reasoning for unsupervised representation learning of 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5376–5385, 2020.
[39] Mikaela Angelina Uy, Quang-Hieu Pham, Binh-Son Hua, Thanh Nguyen, and Sai-Kit Yeung. Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1588–1597, 2019.
[40] Li Yi, Vladimir G Kim, Duygu Ceylan, I-Chao Shen, Mengyan Yan, Hao Su, Cewu Lu, Qixing Huang, Alla Sheffer, and Leonidas Guibas. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (ToG), 35(6):1–12, 2016.
[41] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on x-transformed points. Advances in neural information processing systems, 31, 2018.
[42] Abdullah Hamdi, Silvio Giancola, and Bernard Ghanem. Mvtn: Multi-view transformation network for 3d shape recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1–11, 2021.
[43] Guocheng Qian, Yuchen Li, Houwen Peng, Jinjie Mai, Hasan Hammoud, Mohamed Elhoseiny, and Bernard Ghanem. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. Advances in Neural Information Processing Systems, 35:23192–23204, 2022.
[44] Peng-Shuai Wang. Octformer: Octree-based transformers for 3d point clouds. ACM Transactions on Graphics (TOG), 42(4):1–11, 2023.
[45] Chengzhi Wu, Junwei Zheng, Julius Pfrommer, and Jürgen Beyerer. Attention-based point cloud edge sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5333–5343, 2023.
[46] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Appendix A More Details of Experimental Protocols

A.1 Datasets

Following previous works [18], we evaluate the effectiveness of our method on three downstream tasks, i.e., real-world object classification on ScanObjectNN [39], synthetic object classification on ModelNet40 [4], and part segmentation on ShapeNetPart [40]. The ScanObjectNN dataset [39] is a challenging point cloud object classification dataset consisting of 15,000 point cloud objects from 15 categories scanned from the real world, which includes three variants: OBJ_BG, OBJ_ONLY, and PB_T50_RS. Following previous works [18, 19], we set the number of points for each sample to 2048. The ModelNet40 dataset [4] is a widely used point cloud object classification dataset covering 40 categories where each category includes 100 synthetic samples. Following previous works [18, 19], we set the number of points for each sample to 1024. The ShapeNetPart dataset [40] is a challenging part segmentation dataset for point clouds, consisting of 16,880 samples from 16 different shape categories with 50 part labels. Following previous works [18, 19], we set the number of points for each sample to 2048.

A.2 Implementation Details

For all of our experiments, we adopt the AdamW [46] optimizer with a cosine scheduler. We train our model for 300 epochs, including a warm-up stage of 10 epochs. The initial learning rate is set to 3e-4 for all classification models and 2e-4 for part segmentation models, with a weight decay of 5e-2 for all experiments. The batch size is set to 32 for all tasks. Our intra-group Transformer encoder includes 4 Transformer layers with a channel size of 384, and no classification token is utilized. Our inter-group Mamba encoder includes 12 Mamba layers with a channel size of 384, and no classification token is utilized. For classification tasks on both the ScanObjectNN and ModelNet40 datasets, the point cloud samples are divided into 256 groups, with each group consisting of 16 points. For part segmentation tasks on the ShapeNetPart dataset, the point cloud samples are divided into 128 groups, with each group consisting of 32 points. The projection layers, including the embedding projection layer and the global projection layer, each consist of two convolutional layers with a hidden dimension of 128 and an output dimension of 256, followed by a BatchNorm layer, a ReLU function, and a normalization operation. The importance score prediction head consists of two convolutional layers with a hidden dimension of 128 and an output dimension of 1, followed by a BatchNorm layer and a ReLU function. The hyper-parameters $\alpha$, $\beta$, and $\gamma$ are all set to 1.0 for classification tasks and to 5.0, 1.0, and 1.0, respectively, for part segmentation tasks. Most of our experimental settings, including the classification head and segmentation head, are adopted from PointMamba [18]. All of our classification models are trained on a server with ten NVIDIA RTX 3090 GPUs, and all of our segmentation models are trained on a server with eight NVIDIA A100 GPUs with 80GB memory.
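The learning-rate schedule described above can be sketched in a few lines. The text specifies only a cosine scheduler with a 10-epoch warm-up over 300 epochs, so this sketch makes three assumptions that are not stated in the paper: the warm-up is linear, the rate is updated once per epoch, and the cosine phase decays to a minimum rate of zero. The function name `learning_rate` is illustrative.

```python
import math


def learning_rate(epoch: int, base_lr: float = 3e-4, warmup_epochs: int = 10,
                  total_epochs: int = 300, min_lr: float = 0.0) -> float:
    """Learning rate for a zero-indexed epoch under warm-up plus cosine decay."""
    if epoch < warmup_epochs:
        # Assumed linear warm-up that reaches base_lr at the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr toward min_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Classification setting from the text: base learning rate 3e-4.
for epoch in (0, 9, 10, 155, 299):
    print(epoch, f"{learning_rate(epoch):.2e}")

# Part segmentation setting from the text: base learning rate 2e-4.
print(f"{learning_rate(10, base_lr=2e-4):.2e}")
```

In a training loop, the returned value would be written into the optimizer's learning rate at the start of each epoch; a framework scheduler with an equivalent warm-up and cosine shape would serve the same purpose.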

Appendix B Additional Ablation Study

Table 5: Ablation on the sensitivity of our PoinTramba to the number of groups (Group Number) and the number of points within a point group (Group Size). Experiments are conducted on the PB-T50-RS variant of the ScanObjectNN dataset.

Group Size	Group Number	Acc. (%) $\uparrow$	Group Size	Group Number	Acc. (%) $\uparrow$
8	1024	87.8	32	256	88.3
8	512	87.8	32	128	88.5
16	512	89.0	64	64	88.2
16	256	89.1	128	32	87.3

In this section, we perform additional ablation studies to verify the effectiveness of our method. First, we investigate the sensitivity of our method to the number of groups $G$ and the number of points within a point group $K$. In particular, we vary the number of groups from 32 to 1024 and the number of points within each group from 128 to 8. The results are reported in Table 5. As shown in the table, a group size that is too small results in limited semantic information for each point group. Consequently, the intra-group Transformer encoder may fail to encode the potential semantic features of each point group, leading to limited recognition performance of PoinTramba. Conversely, when the group size is too large, a lightweight intra-group Transformer encoder, i.e., a 4-layer Transformer encoder, may fail to encode the abundant features of each point group, also leading to limited recognition performance. Notably, due to the strong long-range modeling capacity of Mamba, varying the number of point groups does not significantly affect the performance of our PoinTramba model.
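To make the roles of $G$ and $K$ concrete, the sketch below groups a toy point cloud. This section does not state how groups are formed, so the sketch assumes the farthest point sampling plus k-nearest-neighbor scheme that is common in prior point cloud work; the function names are illustrative, and the tiny sizes are chosen only to keep the pure-Python code fast.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float, float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_sampling(points: List[Point], num_groups: int) -> List[int]:
    """Pick num_groups center indices, each as far as possible from earlier ones."""
    centers = [0]  # start from the first point (an arbitrary choice)
    min_d = [sq_dist(p, points[0]) for p in points]
    while len(centers) < num_groups:
        nxt = max(range(len(points)), key=lambda i: min_d[i])
        centers.append(nxt)
        # Track each point's distance to its nearest chosen center.
        for i, p in enumerate(points):
            min_d[i] = min(min_d[i], sq_dist(p, points[nxt]))
    return centers


def knn_groups(points: List[Point], centers: List[int],
               group_size: int) -> List[List[int]]:
    """For each center, collect the indices of its group_size nearest points."""
    groups = []
    for c in centers:
        nearest = sorted(range(len(points)),
                         key=lambda i: sq_dist(points[i], points[c]))
        groups.append(nearest[:group_size])
    return groups


random.seed(0)
cloud = [(random.random(), random.random(), random.random()) for _ in range(64)]
centers = farthest_point_sampling(cloud, num_groups=8)  # G = 8
groups = knn_groups(cloud, centers, group_size=16)      # K = 16
print(len(groups), len(groups[0]))  # 8 16
```

Under this assumed scheme, $G \times K$ need not equal the number of input points because neighboring groups may overlap: here 8 groups of 16 points cover a 64-point cloud, just as 256 groups of 16 points are drawn from 2048-point samples in the classification setting.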

Table 6: Ablation study on the sensitivity of our PoinTramba to the number of layers $T$ in the Transformer encoder. Experiments are conducted on the PB-T50-RS variant of the ScanObjectNN dataset.

Method	Layer Num.	Acc. (%) $\uparrow$	Param. (M) $\downarrow$
PoinTramba (Ours)	0	87.9	12.5
PoinTramba (Ours)	1	87.9	14.2
PoinTramba (Ours)	2	88.2	16.0
PoinTramba (Ours)	4	89.1	19.5

Finally, we examine the sensitivity of our PoinTramba to the number of layers $T$ in the Transformer encoder. The results are listed in Table 6. As can be inferred from the table, when the number of layers is small, e.g., 1 or 2, the intra-group Transformer encoder is too lightweight to capture abundant semantic features for each point group. In this configuration, PoinTramba does not significantly outperform the variant that uses MLPs as the intra-group encoder, i.e., $T=0$. However, when a four-layer Transformer encoder is used, the intra-group encoder effectively models long-range dependencies within each point group, resulting in outstanding recognition performance.
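The parameter counts in Table 6 grow by roughly 1.75M per Transformer layer, and a short calculation shows this is about what a standard layer with a channel size of 384 would cost. The sketch assumes a conventional block (four attention projections with biases, a two-layer MLP with expansion ratio 4, and two LayerNorms); the expansion ratio and block layout are assumptions, since the paper states only the channel size.

```python
def transformer_layer_params(dim: int = 384, mlp_ratio: int = 4) -> int:
    """Parameter count of one assumed standard Transformer encoder layer."""
    attn = 4 * (dim * dim + dim)  # Q, K, V, and output projections with biases
    hidden = mlp_ratio * dim
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)  # two linear layers
    norms = 2 * 2 * dim  # two LayerNorms, each with a scale and a shift
    return attn + mlp + norms


# Values reported in Table 6: number of layers T -> parameters in millions.
reported = {0: 12.5, 1: 14.2, 2: 16.0, 4: 19.5}

per_layer_m = transformer_layer_params() / 1e6
print(f"per layer: {per_layer_m:.2f}M")  # per layer: 1.77M

for layers, params_m in reported.items():
    estimate = reported[0] + layers * per_layer_m
    print(layers, f"reported {params_m:.1f}M, estimated {estimate:.1f}M")
```

The estimates agree with the reported values to within about 0.1M, which is a rough consistency check on the table rather than a confirmation of the exact layer design.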

Appendix C Visualization

In this section, we present visualizations of the importance scores predicted by our PoinTramba model in Fig. 4. As illustrated in the figure, points associated with the fuselages have higher importance scores than those on the wings. Similarly, points located on the legs of chairs or tables, lampshades, bathtub water tanks, plant stems, stair handrails, and radio antennas also exhibit higher importance scores. Unlike previous conventional ordering strategies [18, 19], our BIO strategy reorders the group embeddings in a bi-directional manner based on the learned importance scores. This approach allows the Mamba model to better process structured data, resulting in more refined global inter-group features and significantly enhancing analytical performance.
