PointViG: A Lightweight GNN-based Model for Efficient Point Cloud Analysis

Qiang Zheng Yafei Qi Chen Wang Chao Zhang Jian Sun
Abstract

In the domain of point cloud analysis, despite the significant capabilities of Graph Neural Networks (GNNs) in managing complex 3D datasets, existing approaches encounter challenges such as high computational costs and poor scalability to large-scale scenes. These limitations restrict the practical deployment of GNNs, notably in resource-constrained environments. To address these issues, this study introduces Point Vision GNN (PointViG), an efficient framework for point cloud analysis. PointViG incorporates a lightweight graph convolutional module to efficiently aggregate local features and mitigate over-smoothing. For large-scale point cloud scenes, we propose an adaptive dilated graph convolution technique that searches for sparse neighboring nodes within a dilated neighborhood based on semantic correlation, thereby expanding the receptive field while keeping computation efficient. Experiments demonstrate that PointViG achieves performance comparable to state-of-the-art models while balancing performance and complexity. On the ModelNet40 classification task, PointViG achieved 94.3% accuracy with 1.5M parameters. For the S3DIS segmentation task, it achieved an mIoU of 71.7% with 5.3M parameters. These results underscore the potential and efficiency of PointViG in point cloud analysis.

Keywords: point cloud, classification, segmentation, graph convolution

1 Introduction

Figure 1: Comparison of classification accuracy among representative models, where bubble areas correspond to the number of floating-point operations (FLOPs). The specific FLOPs values (in billions) are provided in textual notation. PointViG surpasses other models, demonstrating superior performance with fewer parameters and FLOPs. This highlights PointViG’s optimal trade-off between performance and complexity.

In recent years, the advancement of 3D vision technology in domains such as robotics, autonomous driving, and 3D reconstruction has propelled point cloud analysis into the forefront of 3D understanding. This has garnered extensive attention from both academia and industry. In contrast to the structured pixel representation, a point cloud is an assemblage of unordered and dispersed points, introducing challenges such as disorder, irregularity, sparsity, and noise. These inherent characteristics render point cloud analysis a challenging endeavor.

Previous endeavors have employed techniques such as voxelization [1, 2, 3], or projection [4, 5]. However, the processes of voxelization or projection come at the cost of losing fine-grained or structured information. PointNet [6] stands out as the pioneering point-based method. Subsequent to its introduction, numerous methods have emerged, adopting the raw point clouds directly as inputs. Categorically, based on the local feature aggregation, point-based methods fall into distinct groups, including MLP-based [7], Convolutional Neural Network (CNN)-based [8], Graph Neural Network (GNN)-based [9], and Transformer-based methods [10]. Among these, graph-based approaches treat scattered point clouds as graph data, and GNNs have emerged as a promising solution for non-Euclidean data, such as point clouds.

GNNs are highly effective for point cloud analysis but come with significant computational costs. This complexity mainly stems from the extensive matrix operations and iterative message-passing mechanisms employed in high-dimensional feature spaces. These operations scale quadratically with the graph’s node count and are heavily influenced by feature dimensionality. Additionally, the large scale of neighbor node tensors, which often exceed the size of the graph nodes themselves, further increases the computational burden. For example, DGCNN [9] requires re-searching neighbor nodes based on updated features before each graph convolution without down-sampling, resulting in substantial computation costs. It also fails to effectively control the over-smoothing phenomenon, limiting potential performance improvements. MRGCN [11] attempts to enhance performance by constructing deeper GNN networks, but this approach significantly increases time and space complexity, while offering relatively limited performance gains.

Moreover, these limitations are exacerbated in large-scale scenarios, making the direct application of GNNs for analyzing large scenes impractical. To address this issue, DGCNN [9] and PointWavelet [12] divide the scene into multiple blocks, while SPG [13] divides the scene into homogeneous patches through unsupervised graph partitioning before semantic segmentation. Although these methods alleviate computational pressure to some extent, they also limit precise semantic segmentation capabilities. AdaptConv [14] combines local feature extraction with a pyramid structure to better preserve connections between nodes. Despite progress in expanding the receptive field, there remains an irreconcilable contradiction between expanding the receptive field and controlling computational cost.

This paper introduces a GNN-based framework that is computationally efficient and widely applicable to various point cloud analysis tasks. At the core of this framework is a lightweight, plug-and-play graph convolution module. Traditional graph convolution operations typically involve processing two tensors: one related to the node itself and the other to its neighboring nodes. The scale of the neighbor node tensor is often much larger than that of the node itself. The proposed module optimizes this computational process by effectively compressing operations on the neighbor node tensor, thereby significantly reducing the overall computational load. Furthermore, to address the pervasive issue of over-smoothing in GNNs, which affects the diversity and representational power of node features, the module incorporates strategies specifically designed to enhance feature diversity, effectively mitigating this problem.

Additionally, this paper proposes an adaptive dilated graph convolution strategy for efficiently handling large-scale scenarios. This strategy first searches for a dilated subgraph for each node in low-dimensional geometric space, and then adaptively searches for sparse neighboring nodes within the subgraph based on high-dimensional semantic associations. Unlike traditional dilated convolutions, this method dynamically determines sampling points based on semantic associations, effectively expanding the receptive field without incurring excessive computational costs and ensuring that key semantic information is preserved during the sampling process.

Overall, the GNN-based point cloud analysis framework proposed in this paper, through its lightweight graph convolutional module and adaptive dilated graph convolution strategy, not only improves computational efficiency but also enhances processing capabilities for large point clouds, providing an efficient solution for point cloud analysis. The primary contributions can be summarized as follows:

  1. We introduce an effective GNN-based network PointViG designed specifically for the analysis of point clouds.

  2. We present an efficient graph convolutional module. This module substantially reduces the computational complexity of the network while effectively mitigating the over-smoothing phenomenon.

  3. We propose an adaptive dilated graph convolution strategy to overcome the computational challenges associated with large point cloud scenes. This strategy expands the receptive field without imposing excessive computational overhead, thus enhancing scalability for larger scenes.

  4. We validate the performance of PointViG through extensive experiments across various tasks. The results demonstrate PointViG’s competitiveness with state-of-the-art (SOTA) approaches. Notably, PointViG achieves comparable performance with lower complexity, achieving an optimal balance between performance and computational cost.

2 Related works

2.1 Graph neural networks

Graph Neural Networks (GNNs) were initially proposed in the seminal works [15, 16]. GNNs can be categorized into spatial-based and spectral-based models, depending on the operators used. The framework introduced in [17] pioneered an early spatial-based Graph Convolutional Network (GCN) by integrating non-recursive layers, and various spatial-based GCN variants have since been introduced, exemplified by [18, 19, 20]. In contrast, the spectral-based GCN was initially introduced by [21], drawing inspiration from spectral graph theory. Subsequently, numerous spectral-based GCN models, such as [22, 23, 24], have emerged. GCNs find widespread application in handling non-Euclidean data across real-world domains, including social networks, biochemical graphs, and citation networks. In computer vision, GCNs play a pivotal role in diverse applications, such as action recognition, scene graph generation, and point cloud analysis. Numerous studies, such as [25, 26], construct graph data by establishing connections among human joints and then employ GCNs to recognize human actions. Scene graph generation, which automatically generates semantic graph structures representing objects and their relationships from images, integrates object detection with GCNs, as illustrated by [27, 28]. GCNs inherently lend themselves to the processing of unstructured data like point clouds, enabling tasks such as point cloud classification and segmentation, which correspond to graph classification and node classification, respectively, as exemplified by [13, 9]. Notably, [29] splits regular images into patches and treats them as nodes; processing these nodes with a GCN yields promising performance, showcasing the adaptability and efficacy of GCNs in image processing tasks.

2.2 Point-based methods

To address irregularities in point clouds, contemporary models take the raw point cloud directly as input. Based on their local feature extraction methods, these works are categorized into MLP-based approaches such as [6, 7, 30, 31, 32], CNN-based methods exemplified by [33, 34, 35, 36], attention-based techniques including [37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50], and graph-based approaches. In the graph-based paradigm, points are treated as graph nodes, and edges are established based on spatial and feature relationships, aligning with the natural representation of point clouds as graph data. The use of GCNs for point clouds dates back to [24], with subsequent works exploring local features from neighboring points [9, 51, 52, 53, 54]. DGCNN [9] dynamically constructs a graph in the updated feature space, introducing the EdgeConv operator to aggregate features. 3D-GCN [55] enhances local feature extraction through a learnable kernel, ensuring shift and scale invariance. AdaptConv [14] generates convolution kernels adaptively based on point features to capture diverse relationships. Point2Node [56] dynamically integrates relationships between nodes, employing a gating mechanism for adaptive feature aggregation at the channel level. DeepGCN [11] incorporates concepts such as residual connectivity and dilated convolution to train very deep GNNs. SpecGCN [57] uses spectral graph convolution for feature extraction and proposes a recursive pooling operation based on spectral clustering partitioning. Some approaches [13, 58] partition the scene into patches, treating them as super-points, and subsequently perform semantic prediction on each super-point.

3 Method

In this section, we first introduce the proposed PointViG Module in Sec. 3.1. Subsequently, Sec. 3.2 introduces the adaptive dilated graph convolution, specifically designed for use in large point cloud scenarios. Finally, Sec. 3.3 provides detailed insights into the network architecture.

3.1 PointViG Module

As depicted in Fig. 2, the PointViG Module is primarily composed of two main components: the graph convolutional kernel (referred to as "GraphConv Kernel" in Fig. 2) and several supplementary components (referred to as "Pos-Encoding" and "FFN" in Fig. 2). In this section, we first explain the mathematical notation involved in graph convolution operations, then describe the working mechanism of the GraphConv Kernel in detail, and conclude with an introduction to the supplementary components.

Figure 2: The PointViG Module, which acts as the basic module in the PointViG framework.

3.1.1 Mathematical notation

Given an input point cloud with $N$ points, where each point is represented by a position vector $p_i \in \mathbb{R}^{3}$ and a corresponding feature vector $f_i \in \mathbb{R}^{D}$, with $D$ denoting the feature dimension and $i = 1, 2, \ldots, N$, we treat these features as nodes within a graph denoted as $\mathcal{V} = \{v_1, v_2, \ldots, v_N\}$. For each node, we identify neighbors $\mathcal{N}(v_i)$ in the feature space and establish edges $e_{ij}$ connecting $v_i$ to its neighbors $v_j$. Consequently, the point cloud is transformed into a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{E}$ represents all edges.
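The graph construction described above can be sketched with a brute-force k-nearest-neighbor search in feature space. This is a minimal illustration, not the authors' implementation: the function name `knn_graph`, the use of Euclidean distance, and the choice to exclude a node from its own neighbor list are all assumptions.

```python
import numpy as np

def knn_graph(features: np.ndarray, k: int) -> np.ndarray:
    """Return an (N, k) array holding, for each node, the indices of its
    k nearest nodes in feature space (Euclidean distance, self excluded)."""
    # Pairwise squared distances: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(features ** 2, axis=1)
    dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    np.fill_diagonal(dist, np.inf)  # assumption: a node is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

# Four nodes with 2-D features forming two tight pairs that are far apart.
f = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.2]])
neighbors = knn_graph(f, k=1)

# Edge list e_ij: each node v_i is connected to its neighbors v_j.
edges = [(i, int(j)) for i in range(len(f)) for j in neighbors[i]]
```

Each node ends up linked to the other member of its pair, so `edges` contains one directed edge per node. The full distance matrix makes the quadratic cost in the number of nodes explicit, which is the cost Sec. 3.2 sets out to reduce.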

3.1.2 GraphConv Kernel

The GraphConv Kernel is tasked with performing the fundamental graph convolution operations. It comprises two fully connected layers, FC1 and FC2, positioned at the input and output, respectively. Considering the output feature of the FC1 layer, let $f_i$ represent the central node feature, and $f_j \in \mathcal{N}(f_i)$ denote a set of neighborhood features. To capture local characteristics in the feature domain, we compute the difference $\Delta f_{ij} = f_j - f_i$ for each neighborhood feature relative to the central node. The resulting vectors $\Delta f_{ij}$ undergo max-pooling and are then input to a multilayer perceptron (MLP). Each MLP layer is followed by batch normalization and nonlinear activation to facilitate nonlinear transformation. Finally, the transformed $\Delta f_{ij}$ and the central feature $f_i$ are concatenated to integrate information from central and neighboring nodes. This concatenated feature is further processed by FC2.

The GraphConv Kernel adopts an efficient design that ensures effective graph neighborhood feature aggregation. It integrates global information denoted by the central node $f_i$ with local information represented by $\Delta f_{ij}$ in a high-dimensional space, thereby enhancing feature characterization. Additionally, the design avoids premature involvement of the central node during the transformation of the local feature $\Delta f_{ij}$ by $\text{MLP}_2$, preserving the independence of these two features, contrary to DGCNN [9]. Moreover, the tensors of neighboring nodes often exceed the scale of the graph itself. In the GraphConv Kernel, operations related to these neighbor node tensors are simplified to subtraction and pooling. By applying the majority of operations within the GraphConv Kernel and the entire PointViG Module directly to the nodes themselves and compressing computations related to neighbors, the computational burden on the network is significantly reduced.

The operational mechanism of the GraphConv Kernel is illustrated in the following formula, where the notation within the formula is consistent with that of the preceding text.

$$f_i = \mathbf{FC_1}(f_1) \tag{1}$$
$$f_{max} = \mathbf{MAX}\{\, f_j - f_i \mid f_j \in \mathcal{N}(f_i) \in \mathcal{G} \,\} \tag{2}$$
$$f_2 = \mathbf{FC_2}\{\mathbf{Concat}[\mathbf{MLP_2}(f_{max}),\, f_i]\} + f_1 \tag{3}$$
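Eqs. (1) to (3) can be traced step by step in a short NumPy sketch. It is an illustration under stated assumptions rather than the authors' code: the helper `linear`, the random weights, the layer widths, and the single-layer ReLU stand-in for $\mathbf{MLP_2}$ (with batch normalization omitted) are all hypothetical. The output width of FC2 is set equal to the input width so that the residual term $f_1$ in Eq. (3) can be added.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(d_in: int, d_out: int):
    """Hypothetical helper: a random affine layer standing in for a trained FC."""
    w = rng.standard_normal((d_in, d_out)) * 0.1
    b = np.zeros(d_out)
    return lambda x: x @ w + b

def relu(x):
    return np.maximum(x, 0.0)

def graph_conv_kernel(f1, neighbors, fc1, mlp2, fc2):
    """Eqs. (1)-(3). f1: (N, D) features; neighbors: (N, k) integer indices."""
    f_i = fc1(f1)                                       # Eq. (1): (N, C)
    delta = f_i[neighbors] - f_i[:, None, :]            # f_j - f_i: (N, k, C)
    f_max = delta.max(axis=1)                           # Eq. (2): (N, C)
    fused = np.concatenate([mlp2(f_max), f_i], axis=1)  # (N, 2C)
    return fc2(fused) + f1                              # Eq. (3): residual on f1

N, D, C, k = 6, 8, 16, 3
f1 = rng.standard_normal((N, D))
neighbors = rng.integers(0, N, size=(N, k))  # placeholder neighbor indices
fc1 = linear(D, C)
hidden = linear(C, C)
mlp2 = lambda x: relu(hidden(x))  # one layer + ReLU; batch norm omitted
fc2 = linear(2 * C, D)            # maps back to D so the residual can be added
f2 = graph_conv_kernel(f1, neighbors, fc1, mlp2, fc2)
```

Only the subtraction and the max-pooling touch the `(N, k, C)` neighbor tensor; every learned layer operates on `(N, ·)` node tensors, which is the source of the savings described in the text.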

3.1.3 Supplementary components

The PointViG Module, which includes the aforementioned GraphConv Kernel along with the Pos-Encoding (Position Encoding) and the Feed-Forward Network (FFN), significantly enhances feature diversity. For a given input node $p_{\text{in}}$ with its corresponding feature $f_{\text{in}}$, the Pos-Encoding block uses a 3-layer MLP to encode positional information. The resulting position encoding feature, denoted $f_{\text{pos}}$, is merged with the input feature $f_{\text{in}}$. Akin to edge features, the position encoding effectively supplements the input information of the graph convolution. Subsequently, the updated feature $f_1$ is passed to the GraphConv Kernel and the FFN block for graph convolution and subsequent non-linear transformation. Together with the FFN block, the Pos-Encoding block helps map the inputs and outputs of the GraphConv Kernel to a different domain, playing a crucial role in enhancing feature diversity and mitigating the over-smoothing phenomenon. The statistical results regarding feature diversity validate this assertion. The overall sequence of operations within the PointViG Module is summarized as follows:

$$f_1 = f_{\rm in} + \mathbf{MLP_1}(p_{in}) \tag{4}$$
$$f_2 = \mathbf{GraphConv}(f_1) \tag{5}$$
$$f_{\rm out} = \mathbf{MLP_3}(f_2) + f_2 \tag{6}$$
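The wiring of Eqs. (4) to (6) is simple enough to express as a function that takes its three sub-blocks as arguments. The sketch below is only meant to show the order of operations and the two residual additions; the function name `pointvig_module` and the toy stand-ins passed to it (a fixed projection, an identity graph convolution, a ReLU) are hypothetical and carry no learned behavior.

```python
import numpy as np

def pointvig_module(p_in, f_in, pos_mlp, graph_conv, ffn):
    """Compose Eqs. (4)-(6); the three callables are supplied by the caller."""
    f1 = f_in + pos_mlp(p_in)   # Eq. (4): add the position encoding
    f2 = graph_conv(f1)         # Eq. (5): GraphConv Kernel
    return ffn(f2) + f2         # Eq. (6): FFN with a residual connection

# Toy stand-ins (hypothetical): project xyz to D channels with fixed weights,
# use an identity "graph convolution", and a plain ReLU as the "FFN".
N, D = 4, 5
p = np.arange(N * 3, dtype=float).reshape(N, 3)
f = np.ones((N, D))
proj = np.full((3, D), 0.01)
out = pointvig_module(
    p, f,
    pos_mlp=lambda xyz: xyz @ proj,
    graph_conv=lambda x: x,
    ffn=lambda x: np.maximum(x, 0.0),
)
```

With these stand-ins every intermediate value is positive, so the output is simply twice $f_1$, which makes the two residual paths easy to verify by hand.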

3.2 Adaptive dilated graph convolution

To enhance semantic perception without significantly increasing computational demand, we introduce the adaptive dilated graph convolution. The computational complexity of distance computation is $\mathcal{O}(N^{2}d)$ when seeking neighbors for all nodes in the graph, where $N$ and $d$ represent the number of nodes and the feature dimension, respectively. Consequently, for scenarios involving large point clouds, direct querying of neighboring nodes across the entire graph becomes impractical. Notably, in expansive scenes with multiple targets, there exists a substantial correlation in semantics between points within an individual target and its nearby region. Conversely, the semantic association between objects situated at a considerable distance is typically weak. We empirically assume that semantic associations exhibit non-uniformity within extensive scenes. Based on this empirical assumption, this study divides the neighbor node search in large point cloud scenes into two stages. Initially, in the low-dimensional geometric space, spatial neighbor points for each point are identified to form its subgraph; subsequently, within the subgraph, sparse neighbor nodes are adaptively searched based on the correlations in the high-dimensional semantic space. Owing to the adaptive sparse sampling strategy, it becomes feasible to define a broader subgraph, consequently extending the perceptual field of each node significantly.

The adaptive dilated graph convolution expands the receptive field while concurrently minimizing computational overhead. Analyzing the complexity, let $N$ represent the scene’s total node count, $d$ the feature dimension with the condition that $d \gg 3$, and $m$ the number of nodes within each subgraph, where $m \ll N$. The proposed sampling strategy is divided into a two-step process. Initially, subgraphs are formed based on Euclidean distances, which introduces a complexity of $\mathcal{O}(3N^{2})$. Subsequently, the process involves identifying neighbors for a central node within each subgraph among $m$ nodes, yielding a complexity of $\mathcal{O}(md)$ per subgraph. When aggregated across $N$ subgraphs, this results in a complexity of $\mathcal{O}(mdN)$. Thus, the overall computational complexity is encapsulated by $\mathcal{O}(3N^{2} + mdN)$. The ratio $\alpha$, which compares the complexities of the proposed method to the direct sampling approach, is expressed as follows:

$$\alpha = \frac{\mathcal{O}(3N^{2} + mdN)}{\mathcal{O}(N^{2}d)} = \left(\frac{3}{d} + \frac{m}{N}\right) \ll 1 \tag{7}$$
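To get a feel for the size of $\alpha$ in Eq. (7), the ratio can be evaluated for concrete values. The numbers below are illustrative assumptions only: $m = 64$ matches the S3DIS setting reported later in the paper, while $N = 100{,}000$ and $d = 64$ are hypothetical.

```python
def complexity_ratio(n: int, d: int, m: int) -> float:
    """alpha = (3 N^2 + m d N) / (N^2 d), which simplifies to 3/d + m/N."""
    proposed = 3 * n**2 + m * d * n   # geometric subgraph search + semantic search
    direct = n**2 * d                 # direct search in d-dimensional feature space
    return proposed / direct

alpha = complexity_ratio(n=100_000, d=64, m=64)
closed_form = 3 / 64 + 64 / 100_000
```

For these values $\alpha$ is roughly 0.0475, and the $3/d$ term dominates: once $m \ll N$, the remaining cost is almost entirely the 3-D geometric search.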

The adaptive dilated graph convolution differs from the conventional dilated convolution used in CNNs. The conventional dilated convolution typically specifies sparse neighbors within a local patch using pre-established sampling rules based on spatial relationships. In contrast, the adaptive dilated graph convolution selects neighborhood nodes within the subgraph adaptively, based on the strength of semantic associations.
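The two-stage search described in this section can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function name is hypothetical, the subgraph is formed here with a brute-force spatial k-nearest-neighbor search, and semantic correlation is approximated by Euclidean distance between feature vectors, which is an assumption.

```python
import numpy as np

def adaptive_dilated_neighbors(xyz, feats, m, k):
    """Two-stage neighbor search; returns (N, k) indices into the point set."""
    # Stage 1 (geometric, 3-D): the m spatially nearest nodes form the subgraph.
    d_geo = ((xyz[:, None, :] - xyz[None, :, :]) ** 2).sum(axis=-1)   # (N, N)
    sub = np.argsort(d_geo, axis=1)[:, :m]                            # (N, m)
    # Stage 2 (semantic): keep the k subgraph nodes nearest in feature space.
    d_sem = ((feats[sub] - feats[:, None, :]) ** 2).sum(axis=-1)      # (N, m)
    pick = np.argsort(d_sem, axis=1)[:, :k]                           # (N, k)
    return np.take_along_axis(sub, pick, axis=1)

# Six points on a line. Nodes 0, 3, 4, 5 have similar features; 1 and 2 differ.
xyz = np.array([[float(i), 0.0, 0.0] for i in range(6)])
feats = np.array([[0.0], [10.0], [10.0], [0.1], [0.0], [0.0]])
nbrs = adaptive_dilated_neighbors(xyz, feats, m=4, k=2)
```

For node 0 the subgraph is {0, 1, 2, 3}, and the selected neighbors are nodes 0 and 3: the spatially adjacent nodes 1 and 2 are skipped because their features differ, while node 4 is never considered because it lies outside the subgraph. In this sketch a node belongs to its own subgraph, so it is always among its selected neighbors.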

In experimental settings, where samples from ModelNet40 and ShapeNet denote individual objects characterized by strong semantic associations among points, we choose to directly query neighboring nodes throughout the entire graph. Conversely, for expansive S3DIS scenes with multiple targets, we employ the adaptive dilated graph convolution to efficiently search for neighborhood nodes.

3.3 Network architecture

The PointViG architectures for classification and semantic segmentation are shown in Fig. 3 and Fig. 4, respectively. We devised a pyramid encoder backbone for progressive feature extraction. For classification, each encoder stage globally searches for neighborhood nodes, and the resulting output undergoes mean-pooling before being fed to an MLP for classification predictions. In the segmentation task, an asymmetric structure is adopted between the encoder and decoder. Each decoder stage integrates an upsampling layer and an MLP, facilitating the continuous mapping of features from the sparse point set to the dense set. The output of each encoder stage is concatenated with the corresponding decoder stage’s upsampling layer output via a skip connection. All experiments were conducted using PyTorch [59] on a TITAN XP GPU.

Figure 3: The PointViG architecture designed for classification.
Figure 4: The PointViG architecture designed for semantic segmentation.

4 Experiments

In this section, we initially assess the model’s performance across tasks including classification, part segmentation, and scene semantic segmentation. Subsequently, we conduct comprehensive experiments to validate the network design, and showcase some visualization results.

4.1 Classification

The network is evaluated on the ModelNet40 classification task, which consists of 12,311 CAD models distributed among 40 classes. The training set encompasses 9,843 models, while the remaining 2,468 are designated for testing. Following PointNet [6], 1,024 points are uniformly sampled from each model, with their corresponding $(x, y, z)$ coordinates stored as input. Data augmentation is applied through a random scale transformation with parameters $[0.7, 1.0/0.7]$. For the classification task, the encoder comprises three stages, each housing a single block. The input channels for the three stages are 64, 128, and 256, with downsampling ratios of 1, 2, and 2, respectively. A batch size of 32 is employed. The Adam optimizer is used with an annealing learning-rate schedule that has a 25-epoch period. Within each cycle, the learning rate progressively decreases from 0.001 to $1.0 \times 10^{-5}$.
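The schedule and augmentation settings above can be written down in a few lines. The paper specifies only an annealing schedule with a 25-epoch period and the scale range, so two details here are assumptions: the annealing curve is taken to be a cosine, and the scale factor is taken to be a single value applied to all three axes of a sample. Both function names are hypothetical.

```python
import math
import random

def annealed_lr(epoch, period=25, lr_max=1e-3, lr_min=1e-5):
    """Cosine annealing with a restart every `period` epochs (curve shape assumed)."""
    t = (epoch % period) / (period - 1)   # 0 at cycle start, 1 at cycle end
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

def random_scale(points, low=0.7, high=1.0 / 0.7, rng=random):
    """Scale every (x, y, z) of one sample by a single random factor."""
    s = rng.uniform(low, high)
    return [(x * s, y * s, z * s) for x, y, z in points]

# Learning rate at the start and end of the first cycle, and after the restart.
schedule = [annealed_lr(e) for e in (0, 24, 25)]
scaled = random_scale([(1.0, 2.0, 3.0)], rng=random.Random(0))
```

In this sketch the rate starts each cycle at 0.001, reaches 1.0e-5 on the last epoch of the cycle, and jumps back to 0.001 at the restart.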

The classification results are presented in Tab. 1, with evaluation metrics including mean class accuracy (mAcc) and overall accuracy (OA). PointViG demonstrates superior performance compared to other models, even without employing auxiliary inputs and the voting strategy.
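The two metrics differ in how they weight classes: OA counts every sample equally, while mAcc averages the per-class accuracies so that small classes weigh as much as large ones. A minimal sketch on hypothetical labels (the function name is an assumption) makes the difference concrete.

```python
def overall_and_mean_class_accuracy(y_true, y_pred):
    """Return (OA, mAcc) for two equally long label sequences."""
    per_class = {}  # class label -> (correct predictions, samples)
    for t, p in zip(y_true, y_pred):
        hit, total = per_class.get(t, (0, 0))
        per_class[t] = (hit + (t == p), total + 1)
    oa = sum(h for h, _ in per_class.values()) / len(y_true)
    macc = sum(h / n for h, n in per_class.values()) / len(per_class)
    return oa, macc

# Class 0 has four samples (all correct); class 1 has two (one correct).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]
oa, macc = overall_and_mean_class_accuracy(y_true, y_pred)
```

Here OA is 5/6 while mAcc is 0.75, because the single error falls in the smaller class. This is why mAcc is typically lower than OA on the class-imbalanced ModelNet40 test set.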

| Method | Input | #points | mAcc (%) | OA (%) |
|---|---|---|---|---|
| PointNet [6] | xyz | 1k | 86.0 | 89.2 |
| PointNet++(MSG) [7] | xyz, nor | 5k | - | 91.9 |
| SpecGCN [57] | xyz | 1k | - | 92.1 |
| 3D-GCN [55] | xyz | 1k | - | 92.1 |
| PointCNN [33] | xyz | 1k | 88.1 | 92.2 |
| PCNN [60] | xyz | 1k | - | 92.3 |
| SpiderCNN [35] | xyz, nor | 5k | - | 92.4 |
| CSANet [61] | xyz | 1k | 89.9 | 92.8 |
| DGCNN [9] | xyz | 1k | 90.2 | 92.9 |
| RS-CNN [62] w/o vot. | xyz | 1k | - | 92.9 |
| KPConv [36] | xyz | 6.8k | - | 92.9 |
| Point2Node [56] | xyz | 1k | - | 93.0 |
| PCT [63] w/o vot. | xyz | 1k | - | 93.2 |
| PointNext [31] | xyz | 1k | 90.8 | 93.2 |
| DTO-Net [64] | xyz | 1k | 91.4 | 93.3 |
| SO-Net [65] | xyz, nor | 5k | - | 93.4 |
| AdaptConv [14] | xyz | 1k | 90.7 | 93.4 |
| PointConT [66] | xyz | 1k | - | 93.5 |
| DeepGCN [11] | xyz | 1k | 90.9 | 93.6 |
| PointMixer [67] | xyz | 1k | 91.4 | 93.6 |
| PT [68] | xyz | 1k | 90.6 | 93.7 |
| CurveNet [69] | xyz | 1k | - | 93.8 |
| PointMLP [30] | xyz | 1k | 90.9 | 94.1 |
| PointViG (Ours) | xyz | 1k | 91.2 | 94.3 |
Table 1: Classification results for the ModelNet40 dataset.

4.2 Part segmentation

PointViG is further evaluated on the ShapeNetPart dataset for the part segmentation task. This dataset comprises 16,880 samples across 16 classes. Each object has 2 to 6 parts, for a total of 50 parts in the dataset. The ShapeNet experiments in this paper use only the coordinates as input. The network adopts an asymmetric encoder-decoder structure. The encoder comprises 3 stages with downsampling ratios of 1, 4, and 4. The neighborhood size for each stage is set to 32, and the model is trained for 100 epochs. The results, reported as mean class IoU (mIoU) and mean instance IoU (IoU), are detailed in Tab. 2. PointViG exhibits performance comparable to advanced methods.

| Method | mIoU (class, %) | IoU (instance, %) |
|---|---|---|
| PointNet [6] | 80.4 | 83.7 |
| SO-Net [65] | 81.0 | 84.9 |
| PointNet++ [7] | 81.9 | 85.1 |
| 3D-GCN [55] | 82.1 | 85.1 |
| DGCNN [9] | 82.3 | 85.2 |
| PCNN [60] | - | 85.1 |
| PointASNL [70] | - | 86.1 |
| AdaptConv [14] | 83.4 | 86.4 |
| PT [68] | 83.7 | 86.6 |
| PointMLP [30] | 84.6 | 86.1 |
| KPConv [36] | 85.1 | 86.4 |
| PointViG (Ours) | 83.2 | 85.9 |
Table 2: Part segmentation results for the ShapeNetPart dataset.

4.3 Semantic segmentation

The semantic segmentation experiment is conducted on the S3DIS dataset, a large-scale indoor scene dataset encompassing point clouds from six areas, totaling 271 rooms. Each point is characterized by $xyz$ coordinates and RGB features, annotated with a semantic label from 13 categories. The challenging Area-5 is selected for testing, while the remaining areas serve as training data. Our data preprocessing employs the entire room as input.

Settings of the adaptive dilated graph convolution. In the S3DIS segmentation experiments, we add the adaptive dilated graph convolution to the PointViG Module. Because point density is highly uneven in large-scale scenes, we use Ball-Query to ensure a consistent receptive field for each subgraph, where $r$ denotes the radius of the sphere and $k$ the number of sparse neighbor nodes sampled within the subgraph. In practice, to keep tensors aligned, the Ball-Query algorithm requires an additional parameter $m$: if the number of points within the sphere is less than $m$, the algorithm pads the result by replication so that exactly $m$ points are returned. To prevent these invalid padding points from being selected during the adaptive sparse sampling phase, we modified the Ball-Query algorithm with a masking operation, which shields the padding points when neighbor nodes are adaptively sampled. We set $r$ to 0.2, $m$ to 64, and $k$ to 32.
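The masked Ball-Query can be illustrated for a single center point. The sketch below is a simplified reading of the description above, not the authors' implementation: both function names are hypothetical, padding repeats the first point found inside the sphere, and semantic closeness is again approximated by Euclidean feature distance. Small values of $m$ and $k$ are used so the result can be checked by hand.

```python
import numpy as np

def masked_ball_query(xyz, center, r, m):
    """Return (idx, valid): m indices within radius r of `center`, padded by
    repetition, plus a boolean mask that is False on the padded slots."""
    d2 = ((xyz - xyz[center]) ** 2).sum(axis=-1)
    inside = np.flatnonzero(d2 <= r * r)[:m]   # always contains the center itself
    valid = np.zeros(m, dtype=bool)
    valid[: len(inside)] = True
    idx = np.full(m, inside[0])                # pad with the first point found
    idx[: len(inside)] = inside
    return idx, valid

def select_sparse_neighbors(feats, center, idx, valid, k):
    """Pick the k subgraph nodes nearest to the center in feature space,
    ignoring padded slots by giving them infinite distance."""
    d = ((feats[idx] - feats[center]) ** 2).sum(axis=-1)
    d = np.where(valid, d, np.inf)
    return idx[np.argsort(d)[:k]]

xyz = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.15, 0.0],
                [0.5, 0.0, 0.0], [0.0, 0.0, 0.19]])
feats = np.array([[0.0], [5.0], [0.2], [0.0], [0.1]])
idx, valid = masked_ball_query(xyz, center=0, r=0.2, m=6)
nbrs = select_sparse_neighbors(feats, 0, idx, valid, k=3)
```

Four points fall inside the sphere, so two of the six slots are padding copies of point 0. Without the mask those copies would tie with the center at distance zero and crowd out real neighbors; with the mask the selection is points 0, 4, and 2. Point 3 has the same feature as the center but lies outside the sphere and is never a candidate.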

The pertinent configurations of the network architecture. The encoder comprises five stages, with the downsampling ratio and the number of blocks for each stage specified as $(1,4,4,4,4)$ and $(1,2,3,2,2)$, respectively, as depicted in Fig. 4. As the network deepens, features transition gradually from low-level geometric features to more intricate high-level semantic features. For the shallow stage (Stage 1), we employ the grouping-MLP-pooling inference schedule for efficiency. In the deeper stages (Stages 2 to 5), we introduce the proposed ADGC as the fundamental building block to aggregate local features within the feature space.
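To make the effect of the downsampling ratios concrete, the short sketch below computes how many points remain at each encoder stage. The input size of 24,576 points and the use of floor division are assumptions chosen for illustration; the helper name `points_per_stage` is hypothetical.

```python
def points_per_stage(n_input, ratios=(1, 4, 4, 4, 4)):
    """Number of points left after each stage's downsampling ratio is applied."""
    counts = []
    n = n_input
    for ratio in ratios:
        n = n // ratio  # assumption: the point count is floored at each stage
        counts.append(n)
    return counts

blocks = (1, 2, 3, 2, 2)  # number of blocks per stage, as stated in the text
for stage, (n, b) in enumerate(zip(points_per_stage(24576), blocks), start=1):
    print(f"Stage {stage}: {n} points, {b} block(s)")
```

Under these assumptions the deepest stage operates on 1/256 of the input points, which is why the deeper stages can afford the more expressive graph convolution blocks.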

Results. Evaluation metrics, namely mIoU, mAcc, and OA, are chosen for performance assessment. The experimental results presented in Tab. 3 show that PointViG attains the highest mIoU (71.7%) and mAcc (78.9%) among the listed methods and matches the best OA (90.8%), outperforming the graph-based approaches by a substantial margin. Considering both the number of parameters and performance, PointViG achieves a favorable trade-off between performance and model complexity.

Method Params mIoU OA mAcc
PointNet [6] 3.6 M 41.1 - 49.0
SegCloud [71] - 48.9 - 57.4
PointCNN [33] 0.6 M 57.3 85.9 63.9
SPG [13] - 58.0 86.4 -
PCCN [60] - 58.3 - 67.0
PointWeb [72] - 60.3 87.0 66.6
HPEIN [73] - 61.9 87.2 68.3
PointASNL [70] - 62.6 87.7 68.5
GACNet [74] - 62.8 87.7 -
KPConv [36] 15.0 M 67.1 - 72.8
PointNext-B [31] 3.8 M 67.3 89.4 73.7
AdaptConv [14] - 67.9 90.0 73.2
PointNext-L [31] 7.1 M 69.0 90.0 75.3
PT [68] - 70.4 90.8 76.5
PointNext-XL [31] 41.6 M 70.5 90.6 76.8
PointViG (Ours) 5.3 M 71.7 90.8 78.9
Table 3: Semantic segmentation results for the S3DIS dataset, evaluated on Area-5 (%).
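For reference, the three metrics reported in Tab. 3 can all be derived from a class confusion matrix. The following is a minimal sketch using the standard definitions; the function name `segmentation_metrics` and the toy two-class matrix are illustrative assumptions, and classes that never occur are assumed to be skipped when averaging.

```python
def segmentation_metrics(conf):
    """Compute OA, mAcc and mIoU from a confusion matrix.

    conf[i][j] is the number of points with ground-truth class i predicted as j.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(n))
    ious, accs = [], []
    for i in range(n):
        tp = conf[i][i]
        gt = sum(conf[i])                          # points labeled as class i
        pred = sum(conf[j][i] for j in range(n))   # points predicted as class i
        union = gt + pred - tp
        if union > 0:   # assumption: skip classes absent from labels and predictions
            ious.append(tp / union)
        if gt > 0:      # assumption: skip classes absent from the labels
            accs.append(tp / gt)
    return {
        "OA": correct / total,            # overall point accuracy
        "mAcc": sum(accs) / len(accs),    # mean per-class accuracy
        "mIoU": sum(ious) / len(ious),    # mean per-class intersection over union
    }

metrics = segmentation_metrics([[8, 2], [1, 9]])
print(metrics)
```

Because mIoU penalizes both false positives and false negatives per class, it is typically the strictest of the three, which is consistent with the mIoU column in Tab. 3 being the lowest for every method.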

4.4 Complexity analysis

Tab. 4 details the space complexity (number of parameters) and time complexity (floating-point operations per sample) for typical models on the ModelNet40 classification task. Specifically, DGCNN [9], DeepGCN [11], GAPointNet [75], and PointWavelet [12] are the graph-based approaches in this table. PointViG attains the highest mAcc (91.2%) and matches the best OA (94.3%, shared with PointWavelet-L). Regarding model size, PointViG has the second-fewest parameters, behind only PointCNN [33]. Moreover, PointViG requires the fewest FLOPs among the models for which FLOPs are reported, affirming its computational efficiency, as depicted in Fig. 1.

Method Params. (M) FLOPs (G) mAcc (%) OA (%)
PointNet [6] 3.5 0.9 86.2 89.2
PointNet++(MSG) [7] 1.7 4.1 - 91.9
PointCNN [33] 0.6 - 88.1 92.2
DGCNN [9] 1.8 4.8 90.2 92.9
GAPointNet [75] 22.9 - 89.7 92.4
DeepGCN [11] 2.2 3.9 90.9 93.6
PointNext-S [31] 4.5 6.5 90.9 93.7
PointMLP [30] 13.2 31.3 90.9 94.1
PointWavelet-L [12] 58.4 39.2 91.1 94.3
PointViG (Ours) 1.5 0.6 91.2 94.3
Table 4: The complexity of ModelNet40 classification. The optimal scores in each column are highlighted in bold (M: $10^{6}$, G: $10^{9}$).

4.5 Feature diversity and over-smoothing

In this section, we assess the diversity of output features at each layer within the PointViG Module (refer to Fig. 2). The metric for feature diversity is expressed as follows:

$$\text{Diver}(X)=\frac{\left\|X-\overline{X}\right\|_{2}}{M\times N\times d} \qquad (8)$$

In Eq. 8, $X$ and $\overline{X}$ represent the output features of all samples in the test set and their mean at a specific layer, and $M$, $N$, and $d$ denote the number of samples, the number of points, and the feature dimensionality, respectively. Consequently, $M\times N\times d$ is the total number of elements in $X$. Essentially, when the features in $X$ converge, the resulting high redundancy yields a low $\text{Diver}(X)$.
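A minimal sketch of Eq. 8 in plain Python is given below. Two details are assumptions, since the text does not fix them: $\overline{X}$ is taken as the per-channel mean over all samples and points, and the norm is taken over all elements of $X-\overline{X}$. The function name `feature_diversity` and the toy tensors are illustrative only.

```python
import math

def feature_diversity(X):
    """Diver(X) for a nested list X of shape [M][N][d] (samples, points, channels)."""
    M, N, d = len(X), len(X[0]), len(X[0][0])
    # Assumption: the mean is computed per feature channel over samples and points.
    mean = [sum(X[s][p][c] for s in range(M) for p in range(N)) / (M * N)
            for c in range(d)]
    # Squared L2 norm of the centered features, taken over every element.
    sq = sum((X[s][p][c] - mean[c]) ** 2
             for s in range(M) for p in range(N) for c in range(d))
    return math.sqrt(sq) / (M * N * d)

# Fully over-smoothed features: every node carries the same vector.
collapsed = [[[1.0, 2.0], [1.0, 2.0]], [[1.0, 2.0], [1.0, 2.0]]]
# Features that still differ from node to node.
spread = [[[0.0, 0.0], [2.0, 4.0]], [[0.0, 4.0], [2.0, 0.0]]]

print(feature_diversity(collapsed), feature_diversity(spread))
```

The collapsed example yields a diversity of zero, matching the statement that converged features produce a low value, while any variation across nodes gives a strictly positive score.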

The classification model in this study consists of three stages, each corresponding to a PointViG Module, and Fig. 5 depicts three curves corresponding to Module-0, Module-1, and Module-2, respectively. Horizontal axis labels in the graph indicate layers within the PointViG Module. In these annotations, “FC2 (+Res)” signifies the outcome after integrating the output of FC2 and the feature passed by the skip connection. The portion of the PointViG Module situated between FC1 and FC2, constituting the basic graph convolution operation, corresponds to the GraphConv Kernel in PointViG Module. Consequently, the variation in feature diversity from FC1 to the “Concat” layer depicted in Fig. 5 effectively mirrors the influence of basic graph convolution on feature diversity.

Refer to caption
Figure 5: Illustration of the feature diversity output by each layer within the three PointViG Modules utilized in classification models.

We conducted a comprehensive analysis of Fig. 5 from multiple perspectives, detailed below:

  • 1.

    The output of the GraphConv Kernel, specifically the “Concat” layer, shows the lowest feature diversity across the curves, except for the “Pos-En” layer in Module-0. This is attributed to over-smoothing in graph convolution, caused by aggregating features with high semantic relevance in a high-dimensional space, leading to the convergence of node features.

  • 2.

    For FC1, Modules 1 and 2 exhibit noteworthy input feature diversity because their inputs are the outputs of the preceding modules. However, the over-smoothing effect and back-propagation reduce feature diversity for Modules 1 and 2 through the FC1 layer. In contrast, Module-0 takes low-level geometric features as input, which show minimal feature diversity after position encoding. The FC1 layer in Module-0 significantly increases feature diversity, highlighting FC1’s pivotal role in the PointViG Module.

  • 3.

    After the “Concat” layer, a discernible improvement in feature diversity is observed, affirming the substantial contributions of FC2, skip connections, and FFN in enhancing feature diversity.

  • 4.

    A comparative analysis of the output from each FFN block indicates a gradual increase in feature diversity for each module with increasing network depth.

Synthesizing these findings with prior analyses underscores the crucial role of the components in the PointViG Module in alleviating over-smoothing in graph convolution. This establishes the PointViG Module as an effective graph convolution operator.

4.6 The effect of adaptive dilated graph convolution

The proposed adaptive dilated graph convolution adaptively searches for neighboring nodes within a dilated neighborhood based on semantic associations. In contrast, conventional dilated convolution typically selects neighbors uniformly or randomly. In the semantic segmentation model, we set $r=0.2$, $m=64$, and $k=32$. Tab. 5 compares semantic segmentation performance when dilated graph convolution samples neighbors in uniform, random, and adaptive manners. The segmentation model achieves the best performance (71.7% mIoU) with adaptive dilated graph convolution.

       Dilation        Uniform        Random        Adaptive
       mIoU (%)        69.0        67.4        71.7
Table 5: Comparative performance analysis of dilated graph convolution with different sparse sampling strategies for S3DIS (Area-5) segmentation.
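The three sparse sampling strategies compared in Tab. 5 can be contrasted with a small sketch. It assumes that the candidate list is already sorted by spatial distance, that uniform dilation takes every second candidate, and that the adaptive variant ranks candidates by Euclidean distance in feature space; all function names are hypothetical, and validity masking is omitted for brevity.

```python
import math
import random

def uniform_dilated(candidates, k):
    """Take evenly strided candidates from a spatially sorted list."""
    stride = max(len(candidates) // k, 1)
    return candidates[::stride][:k]

def random_dilated(candidates, k, seed=0):
    """Draw k distinct candidates at random (seeded for reproducibility)."""
    return random.Random(seed).sample(candidates, k)

def adaptive_dilated(candidates, feats, center_feat, k):
    """Keep the k candidates whose features are closest to the central node."""
    ranked = sorted(candidates, key=lambda i: math.dist(feats[i], center_feat))
    return ranked[:k]

# Eight candidates (m = 8) in the dilated neighborhood; keep k = 4 of them.
candidates = list(range(8))
feats = [(0.0,), (5.0,), (0.2,), (4.0,), (3.0,), (0.1,), (6.0,), (0.3,)]
center_feat = (0.0,)
k = 4

picked_uniform = uniform_dilated(candidates, k)
picked_random = random_dilated(candidates, k)
picked_adaptive = adaptive_dilated(candidates, feats, center_feat, k)
print(picked_uniform, picked_random, picked_adaptive)
```

All three strategies cover the same dilated neighborhood at the same cost per node, so only the adaptive variant's choice of which candidates to keep depends on feature similarity.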

4.7 Ablation studies

The design of PointViG Module. Ablation experiments are conducted to assess the impact of the individual components of the PointViG Module on the classification task. The experimental results are presented in Tab. 6. The abbreviations “Pos-En,” “FC1,” “FC2,” “FFN,” and “Concat” denote position encoding, the FC1 layer, the FC2 layer, the FFN block, and the concatenation operation between the central node and the local feature in Fig. 2, respectively. Models 1, 2, and 3 remove position encoding, the FC layers, and the FFN, respectively, and show varying degrees of performance degradation. In Model-4, both the FC layers and the FFN are removed, while in Model-5, the PointViG Module is reduced to a pure GraphConv Kernel. The results of Models 4 and 5 underscore that directly using the GraphConv Kernel for point cloud classification leads to substantial performance degradation. Model-6 examines the concatenation operation in the GraphConv Kernel, confirming its ability to supplement the graph convolution output with the central node’s features and strengthen feature characterization. The performance discrepancies between these ablation models and the PointViG model echo the analysis of feature diversity presented in Sec. 4.5.

Model Pos-En FC1 FC2 FFN Concat OA(%)
Model-1 92.1
Model-2 93.1
Model-3 93.3
Model-4 92.6
Model-5 90.8
Model-6 93.5
PointViG (Ours) 94.3
Table 6: Ablation study of the PointViG Module design for classification.

Comparative analysis of PointViG and other graph convolution operators. We evaluate prominent variants of graph convolution kernels in Tab. 7. Instead of directly employing these kernels as the fundamental module, we replace the GraphConv Kernel in the proposed PointViG Module with each of them for a fair comparison. The PointViG kernel outperforms the other kernels in both mAcc and OA, suggesting a stronger ability to aggregate point cloud features.

      kernel       mAcc(%)       OA(%)
      GIN [76]       88.9       92.5
      MRGCN [11]       90.1       93.1
      EdgeConv [9]       90.6       93.6
      GraphSAGE [77]       89.5       93.2
      PointViG (Ours)       91.2       94.3
Table 7: Embedding of various graph convolutional kernels within the PointViG Module for a fair performance comparison within the same framework on ModelNet40 classification.

The neighborhood size for ModelNet40 classification. In the design of the classification model, we simplify by setting the number of neighborhood nodes ($k$) to a consistent value at each stage. We assess the influence of different $k$ on performance. The experimental results shown in Tab. 8 indicate that the classification task reaches its peak performance when the parameter $k$ is set to 16.

$k$ 4 8 12 16
OA(%) 93.1 94.3 93.6 93.4
Table 8: ModelNet40 classification results with varying numbers ($k$) of nearest neighbors.

The effect of neighborhood size ($k$) on S3DIS segmentation. Tab. 9 compares the performance of S3DIS semantic segmentation with different $k$ values when $r$ is fixed at 0.2. The model achieves optimal performance when $k=32$, and performance declines for larger $k$. This observation indicates that treating all nodes in the subgraph as neighborhood nodes does not improve performance. The adaptive dilated graph convolution strategy demonstrates its effectiveness by adaptively selecting the nodes with the strongest semantic associations.

$k$ 8 16 24 32 40 48
mIoU (%) 69.0 70.5 71.4 71.7 70.4 70.3
Table 9: Segmentation results for S3DIS (Area-5) with varying numbers of neighboring nodes.

The effect of radius ($r$) on S3DIS segmentation. In S3DIS segmentation, we employ Ball-Query, where the parameters $r$ and $k$ define the subgraph range and the number of neighboring nodes, respectively. Increasing $r$ expands the perceptual field of the graph convolution without a corresponding increase in computational cost. We keep the number of neighboring nodes fixed at 32 and explore the impact of varying $r$ on segmentation performance. The results, reported in Tab. 10, indicate optimal performance when $r$ is set to 0.2.

In Tab. 10, when $r=0.1$, the subgraph is small, fostering strong semantic associations among nodes within the subgraph. Typically, $r$ is set to 0.1 in PointNext [31]. As $r$ increases to 0.12 and 0.14, the subgraph range expands, but the expansion of the perceptual field remains limited, resulting in weaker semantic associations between valid points than at $r=0.1$. As $r$ increases further (0.16 to 0.20), the perceptual field expands dramatically. The central node gains greater flexibility in searching for neighboring nodes within a broader range and establishes semantic associations with distant nodes, which contributes to further performance improvement.

$r$ 0.10 0.12 0.14 0.16 0.18 0.20
mIoU (%) 70.0 69.1 69.6 70.7 70.9 71.7
Table 10: Semantic segmentation results on S3DIS Dataset (Area-5) with varying radii of Ball-Query.

The effect of masking operation on S3DIS segmentation. The S3DIS segmentation model employs Ball-Query, whose padding operation introduces invalid points into the resulting subgraphs. These invalid points interfere with the adaptive search for neighborhood nodes. To mitigate this issue, we enhance the Ball-Query algorithm to output not only the subgraph but also the corresponding mask. During adaptive dilated graph convolution, we use the mask to suppress the duplicate points when searching for neighborhood nodes in the subgraph. To validate the effectiveness of the masking operation, we compare the performance of the PointViG framework with and without it across different radii $r$ (Fig. 6). Because the complex distribution of points causes both the perceptual fields and the interactions among adjacent nodes to vary considerably as $r$ changes, the exact trend of the curves in Fig. 6 is difficult to predict. Overall, however, adaptive dilated graph convolution performs better with the masking operation than without it, and the performance gap becomes more pronounced at larger values of $r$.

Refer to caption
Figure 6: Segmentation results of S3DIS (Area-5) with and without the masking operation in adaptive dilated graph convolution with varied radii of Ball-Query.

4.8 Visualization

4.8.1 Visualization of t-SNE feature clustering

In Fig. 7 and Fig. 8, t-Distributed Stochastic Neighbor Embedding (t-SNE) is applied for both dimensionality reduction and cluster analysis on the encoder features and logits of the ModelNet40 test set samples. In Fig. 7, a significant portion of the clusters is clearly distinguishable from others, while a small central cluster lacks distinct boundaries, leading to some overlap. The introduction of logits, obtained by transforming encoder features through an MLP in the classification head, markedly clarifies the boundaries between clusters. In Fig. 8, apart from a limited number of outliers with abnormal distributions, points corresponding to the same category exhibit close proximity within their respective clusters. Notably, clusters associated with distinct categories are visibly distant from one another, emphasizing a high level of discriminability.

Refer to caption
Figure 7: Visualization of encoder features. We employ t-SNE to visualize the encoder features of the samples within the ModelNet40 test set.
Refer to caption
Figure 8: Visualization of output logits. We utilize t-SNE to illustrate the logits of the samples within the ModelNet40 test set.

4.8.2 Visualization of neighboring nodes

In Fig. 9, we depict the graph node neighborhoods for samples in the ModelNet40 test set. The three columns in the illustration correspond to the stages (Stage-1 to Stage-3) of the classification model encoder, representing the network’s progressive deepening. The downsampling operation in each stage leads to a gradual sparsification of the point cloud. Initially constrained to local regions in the shallow stage, neighbor nodes progressively transcend spatial constraints, exploring globally for nodes with high semantic relevance.

Illustratively, in the first row depicting an airplane, the neighbor nodes extend from the local area of the right wing edges to encompass the entire wingspan. In the second row featuring a chair, a specific point on the legs expands from an initial distribution involving two legs to three and four, capturing the crucial structural information of the chair having four symmetric legs. The third-row stool exhibits a similar pattern. The fourth-row table illustrates a central point at one corner gradually expanding its neighboring nodes from one corner to the other along the table edge.

These examples affirm the effectiveness of the PointViG Module proposed in this paper in capturing neighboring nodes with semantic associations, facilitating feature aggregation for enhanced target identification.

Refer to caption
Figure 9: Visualization of graph node neighborhoods. Each row represents a sample from the ModelNet40 test set. The three columns correspond to the three stages of the classification model encoder. Green dots represent the central node, while its neighboring nodes are denoted by red dots.

4.8.3 Visualization of part segmentation

A visualization of the part segmentation results is presented in Fig. 10. Generally, the segmentation performance is better for targets with a simple structure. For targets with a complex structure, however, prediction errors appear at the points where different parts meet. This is evident in specific instances such as the junction of the tail and body of the rocket in the second row, the junction of the rear wheel and the car body, and the junction of the fuel tank and the motorbike body in the fourth row.

Refer to caption
Figure 10: Visualization of part segmentation results. For each sample, the left side displays the ground truth, while the right side illustrates the corresponding prediction.

4.8.4 Visualization of semantic segmentation

Fig. 11 displays the results of the semantic segmentation for S3DIS (Area-5). In general, the PointViG model exhibits a relatively accurate performance in the segmentation task. The segmentation inaccuracies observed in Fig. 11 can be primarily attributed to the following factors:

Refer to caption
Figure 11: Visualization of semantic segmentation results in S3DIS (Area-5). The four columns depict, from left to right, the input data, ground truth, prediction (Ours) and prediction (DGCNN [9]). To enhance the visualization, we excluded the ceiling and a portion of the walls.
  • 1.

    Small-sized targets in one or more dimensions. The elongated bookcase in the first-row sample scene poses a challenge in capturing features accurately and determining semantic boundaries, leading to the incorrect identification of some points as clutter. In contrast, the larger bookcase in the second-row sample scene yields more accurate identification.

  • 2.

    Semantically related and neighboring targets. In the sample scene of the fourth row, the model incorrectly labels the upper part of a column as a wall, influenced by the strong semantic correlation and frequent co-occurrence between the column and the wall. The visual similarity between the two complicates accurate differentiation for the model.

  • 3.

    Targets with obscure geometric and color features. The small rectangular clutter area between two bookcases in the third-row sample scene and the adjacent rectangular clutter areas on the wall in the fourth-row sample scene are misclassified as walls. These clutter regions closely resemble the background walls in both color and geometric features, presenting a challenge for the model in differentiation.

5 Conclusion

In this paper, we introduce a novel, computationally efficient graph convolutional operator, PointViG, complemented by an adaptive dilated graph convolution strategy tailored for large-scale point cloud scenarios. Leveraging these techniques, we construct an effective framework for point cloud analysis. This framework achieves performance comparable to SOTA models in tasks such as classification and segmentation while significantly reducing complexity, thereby striking a favorable balance between performance and complexity. The PointViG framework provides a viable solution for deploying point cloud analysis models in resource-constrained environments. Although PointViG focuses primarily on point cloud analysis, its potential for application in other fields involving graph data analysis remains largely untapped. In future research, we plan to incorporate model compression techniques to further reduce model complexity and explore the possibility of extending PointViG to other domains.

References

  • [1] D. Maturana, S. Scherer, Voxnet: A 3d convolutional neural network for real-time object recognition, in: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2015, pp. 922–928. doi:10.1109/IROS.2015.7353481.
  • [2] H. Su, V. Jampani, D. Sun, S. Maji, E. Kalogerakis, M.-H. Yang, J. Kautz, Splatnet: Sparse lattice networks for point cloud processing, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 2530–2539. doi:10.1109/CVPR.2018.00268.
  • [3] G. Riegler, A. O. Ulusoy, A. Geiger, Octnet: Learning deep 3d representations at high resolutions, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 6620–6629. doi:10.1109/CVPR.2017.701.
  • [4] H. Su, S. Maji, E. Kalogerakis, E. Learned-Miller, Multi-view convolutional neural networks for 3d shape recognition, in: 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 945–953. doi:10.1109/ICCV.2015.114.
  • [5] W. Wang, T. Wang, Y. Cai, Multi-view attention-convolution pooling network for 3d point cloud classification, Applied Intelligence (2021) 1–12. doi:10.1007/s10489-021-02840-2.
  • [6] R. Q. Charles, H. Su, M. Kaichun, L. J. Guibas, Pointnet: Deep learning on point sets for 3d classification and segmentation, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 77–85. doi:10.1109/CVPR.2017.16.
  • [7] C. R. Qi, L. Yi, H. Su, L. J. Guibas, Pointnet++: Deep hierarchical feature learning on point sets in a metric space, in: Advances in Neural Information Processing Systems, Vol. 30, 2017.
  • [8] W. Wu, Z. Qi, L. Fuxin, Pointconv: Deep convolutional networks on 3d point clouds, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9613–9622. doi:10.1109/CVPR.2019.00985.
  • [9] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, J. M. Solomon, Dynamic graph cnn for learning on point clouds, ACM Trans. Graph. 38 (5) (2019).
  • [10] N. Engel, V. Belagiannis, K. Dietmayer, Point transformer, IEEE Access 9 (2021) 134826–134840. doi:10.1109/ACCESS.2021.3116304.
  • [11] G. Li, M. Mueller, G. Qian, I. C. Delgadillo Perez, A. Abualshour, A. K. Thabet, B. Ghanem, Deepgcns: Making gcns go as deep as cnns, IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
  • [12] C. Wen, J.-L. Long, B. Yu, D. Tao, Pointwavelet: Learning in spectral domain for 3d point cloud analysis, ArXiv abs/2302.05201 (2023).
  • [13] L. Landrieu, M. Simonovsky, Large-scale point cloud semantic segmentation with superpoint graphs, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 4558–4567. doi:10.1109/CVPR.2018.00479.
  • [14] H. Zhou, Y. Feng, M. Fang, M. Wei, J. Qin, T. Lu, Adaptive graph convolution for point cloud analysis, in: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 4945–4954. doi:10.1109/ICCV48922.2021.00492.
  • [15] M. Gori, G. Monfardini, F. Scarselli, A new model for learning in graph domains, in: Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., Vol. 2, 2005, pp. 729–734 vol. 2. doi:10.1109/IJCNN.2005.1555942.
  • [16] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, G. Monfardini, The graph neural network model, IEEE Transactions on Neural Networks 20 (1) (2009) 61–80. doi:10.1109/TNN.2008.2005605.
  • [17] A. Micheli, Neural network for graphs: A contextual constructive approach, IEEE Transactions on Neural Networks 20 (3) (2009) 498–511. doi:10.1109/TNN.2008.2010350.
  • [18] J. Atwood, D. Towsley, Diffusion-convolutional neural networks, in: D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 29, 2016.
  • [19] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, G. E. Dahl, Neural message passing for quantum chemistry, in: Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, JMLR.org, 2017, p. 1263–1272.
  • [20] M. Niepert, M. Ahmed, K. Kutzkov, Learning convolutional neural networks for graphs, in: Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, JMLR.org, 2016, p. 2014–2023.
  • [21] J. Bruna, W. Zaremba, A. D. Szlam, Y. LeCun, Spectral networks and locally connected networks on graphs, CoRR abs/1312.6203 (2013).
  • [22] M. Defferrard, X. Bresson, P. Vandergheynst, Convolutional neural networks on graphs with fast localized spectral filtering, in: D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 29, 2016.
  • [23] X. Li, Y. Hu, Y. Sun, J. Hu, J. Zhang, M. Qu, A deep graph structured clustering network, IEEE Access 8 (2020) 161727–161738. doi:10.1109/ACCESS.2020.3020192.
  • [24] T. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, ArXiv abs/1609.02907 (2016).
  • [25] A. Jain, A. R. Zamir, S. Savarese, A. Saxena, Structural-rnn: Deep learning on spatio-temporal graphs, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5308–5317. doi:10.1109/CVPR.2016.573.
  • [26] S. Yan, Y. Xiong, D. Lin, Spatial temporal graph convolutional networks for skeleton-based action recognition, in: AAAI Conference on Artificial Intelligence, 2018.
  • [27] D. Xu, Y. Zhu, C. B. Choy, L. Fei-Fei, Scene graph generation by iterative message passing, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3097–3106. doi:10.1109/CVPR.2017.330.
  • [28] Y. Guo, J. Song, L. Gao, H. T. Shen, One-shot scene graph generation, in: Proceedings of the 28th ACM International Conference on Multimedia, MM ’20, Association for Computing Machinery, New York, NY, USA, 2020, p. 3090–3098. doi:10.1145/3394171.3414025.
  • [29] K. Han, Y. Wang, J. Guo, Y. Tang, E. Wu, Vision (gnn): An image is worth graph of nodes, arXiv preprint arXiv:2206.00272 (2022).
  • [30] X. Ma, C. Qin, H. You, H. Ran, Y. Fu, Rethinking network design and local geometry in point cloud: A simple residual mlp framework, in: International Conference on Learning Representations, 2022.
  • [31] G. Qian, Y. Li, H. Peng, J. Mai, H. Hammoud, M. Elhoseiny, B. Ghanem, Pointnext: Revisiting pointnet++ with improved training and scaling strategies, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, Vol. 35, 2022.
  • [32] L. Zhao, W. Tao, Jsnet++: Dynamic filters and pointwise correlation for 3d point cloud instance and semantic segmentation, IEEE Transactions on Circuits and Systems for Video Technology 33 (4) (2023) 1854–1867. doi:10.1109/TCSVT.2022.3218076.
  • [33] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, B. Chen, Pointcnn: Convolution on x-transformed points, in: S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 31, 2018.
  • [34] M. Xu, R. Ding, H. Zhao, X. Qi, Paconv: Position adaptive convolution with dynamic kernel assembling on point clouds, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 3172–3181. doi:10.1109/CVPR46437.2021.00319.
  • [35] Y. Xu, T. Fan, M. Xu, L. Zeng, Y. Qiao, Spidercnn: Deep learning on point sets with parameterized convolutional filters, in: European Conference on Computer Vision, 2018.
  • [36] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, L. Guibas, Kpconv: Flexible and deformable convolution for point clouds, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 6410–6419. doi:10.1109/ICCV.2019.00651.
  • [37] J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi, Y. W. Teh, Set transformer: A framework for attention-based permutation-invariant neural networks, in: Proceedings of the 36th International Conference on Machine Learning, 2019, pp. 3744–3753.
  • [38] X. Liu, Z. Han, Y.-S. Liu, M. Zwicker, Point2sequence: Learning the shape representation of 3d point clouds with an attention-based sequence to sequence network, in: AAAI Conference on Artificial Intelligence, 2019.
  • [39] S. Xie, S. Liu, Z. Chen, Z. Tu, Attentional shapecontextnet for point cloud recognition, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 4606–4615. doi:10.1109/CVPR.2018.00484.
  • [40] J. Yang, Q. Zhang, B. Ni, L. Li, J. Liu, M. Zhou, Q. Tian, Modeling point clouds with self-attention and gumbel subset sampling, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3318–3327. doi:10.1109/CVPR.2019.00344.
  • [41] C. Park, Y. Jeong, M. Cho, J. Park, Fast point transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 16949–16958.
  • [42] L. Hui, H. Yang, M. Cheng, J. Xie, J. Yang, Pyramid point cloud transformer for large-scale place recognition, 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021) 6078–6087.
  • [43] R. Zhang, Z. Guo, P. Gao, R. Fang, B. Zhao, D. L. Wang, Y. J. Qiao, H. Li, Point-m2ae: Multi-scale masked autoencoders for hierarchical point cloud pre-training, in: Advances in Neural Information Processing Systems, Vol. 35, 2022, pp. 27061–27074.
  • [44] X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, J. Lu, Point-bert: Pre-training 3d point cloud transformers with masked point modeling, in: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 19291–19300. doi:10.1109/CVPR52688.2022.01871.
  • [45] D. Wang, Z.-X. Yang, Self-supervised point cloud understanding via mask transformer and contrastive learning, IEEE Robotics and Automation Letters 8 (1) (2023) 184–191. doi:10.1109/LRA.2022.3224370.
  • [46] W. Zhou, Q. Wang, W. Jin, X. Shi, Y. He, Gtnet: Graph transformer network for 3d point cloud classification and semantic segmentation (2023). arXiv:2305.15213.
  • [47] R. Zhang, L. Wang, Y. Qiao, P. Gao, H. Li, Learning 3d representations from 2d pre-trained models via image-to-point masked autoencoders, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 21769–21780.
  • [48] N. Zhang, Z. Pan, T. H. Li, W. Gao, G. Li, Improving graph representation for point cloud segmentation via attentive filtering, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 1244–1254.
  • [49] F. Yin, Z. Huang, T. Chen, G. Luo, G. Yu, B. Fu, Dcnet: Large-scale point cloud semantic segmentation with discriminative and efficient feature aggregation, IEEE Transactions on Circuits and Systems for Video Technology 33 (8) (2023) 4083–4095. doi:10.1109/TCSVT.2023.3239541.
  • [50] Z. Huang, Z. Zhao, B. Li, J. Han, Lcpformer: Towards effective 3d point cloud analysis via local context propagation in transformers, IEEE Transactions on Circuits and Systems for Video Technology (2023) 1–13. doi:10.1109/TCSVT.2023.3247506.
  • [51] H. Lei, N. Akhtar, A. Mian, Spherical kernel for efficient graph convolution on 3d point clouds, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (10) (2021) 3664–3680. doi:10.1109/TPAMI.2020.2983410.
  • [52] B.-S. Hua, M.-K. Tran, S.-K. Yeung, Pointwise convolutional neural networks, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 984–993. doi:10.1109/CVPR.2018.00109.
  • [53] Y. Shen, C. Feng, Y. Yang, D. Tian, Mining point cloud local structures by kernel correlation and graph pooling, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 4548–4557. doi:10.1109/CVPR.2018.00478.
  • [54] Z. Liang, M. Yang, L. Deng, C. Wang, B. Wang, Hierarchical depthwise graph convolutional neural network for 3d semantic segmentation of point clouds, in: 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 8152–8158. doi:10.1109/ICRA.2019.8794052.
  • [55] Z.-H. Lin, S.-Y. Huang, Y.-C. F. Wang, Convolution in the cloud: Learning deformable kernels in 3d graph convolution networks for point cloud analysis, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1797–1806. doi:10.1109/CVPR42600.2020.00187.
  • [56] W. Han, C. Wen, C. Wang, X. Li, Q. Li, Point2node: Correlation learning of dynamic-node for point cloud feature modeling, in: AAAI Conference on Artificial Intelligence, 2019.
  • [57] C. Wang, B. Samari, K. Siddiqi, Local spectral graph convolution for point set feature learning, in: European Conference on Computer Vision, 2018.
  • [58] L. Landrieu, M. Boussaha, Point cloud oversegmentation with graph-structured deep metric learning, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7432–7441. doi:10.1109/CVPR.2019.00762.
  • [59] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, Pytorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems, Vol. 32, 2019.
  • [60] M. Atzmon, H. Maron, Y. Lipman, Point convolutional neural networks by extension operators, ACM Trans. Graph. 37 (4) (Jul. 2018). doi:10.1145/3197517.3201301.
  • [61] G. Wang, Q. Zhai, H. Liu, Cross self-attention network for 3d point cloud, Knowledge-Based Systems 247 (2022) 108769. doi:10.1016/j.knosys.2022.108769.
    URL https://www.sciencedirect.com/science/article/pii/S0950705122003616
  • [62] Y. Liu, B. Fan, S. Xiang, C. Pan, Relation-shape convolutional neural network for point cloud analysis, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 8887–8896. doi:10.1109/CVPR.2019.00910.
  • [63] M.-H. Guo, J.-X. Cai, Z.-N. Liu, T.-J. Mu, R. R. Martin, S.-M. Hu, Pct: Point cloud transformer, Computational Visual Media 7 (2) (2021) 187–199. doi:10.1007/s41095-021-0229-5.
  • [64] R. Hu, B. Yang, H. Ye, F. Cao, C. Wen, Q. Zhang, Decouple the object: Component-level semantic recognizer for point clouds classification, Knowledge-Based Systems 248 (2022) 108887. doi:10.1016/j.knosys.2022.108887.
    URL https://www.sciencedirect.com/science/article/pii/S0950705122004245
  • [65] J. Li, B. M. Chen, G. H. Lee, So-net: Self-organizing network for point cloud analysis, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 9397–9406. doi:10.1109/CVPR.2018.00979.
  • [66] Y. Liu, B. Tian, Y. Lv, L. Li, F.-Y. Wang, Point cloud classification using content-based transformer via clustering in feature space, IEEE/CAA Journal of Automatica Sinica 11 (1) (2024) 231–239. doi:10.1109/JAS.2023.123432.
  • [67] J. Choe, C. Park, F. Rameau, J. Park, I. S. Kweon, Pointmixer: Mlp-mixer for point cloud understanding, arXiv preprint arXiv:2111.11187 (2021).
  • [68] H. Zhao, L. Jiang, J. Jia, P. Torr, V. Koltun, Point transformer, in: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 16239–16248. doi:10.1109/ICCV48922.2021.01595.
  • [69] T. Xiang, C. Zhang, Y. Song, J. Yu, W. Cai, Walk in the cloud: Learning curves for point clouds shape analysis, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 915–924.
  • [70] X. Yan, C. Zheng, Z. Li, S. Wang, S. Cui, Pointasnl: Robust point clouds processing using nonlocal neural networks with adaptive sampling, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [71] L. P. Tchapmi, C. B. Choy, I. Armeni, J. Gwak, S. Savarese, Segcloud: Semantic segmentation of 3d point clouds, in: 2017 International Conference on 3D Vision (3DV), 2017, pp. 537–547.
  • [72] H. Zhao, L. Jiang, C.-W. Fu, J. Jia, Pointweb: Enhancing local neighborhood features for point cloud processing, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5560–5568.
  • [73] L. Jiang, H. Zhao, S. Liu, X. Shen, C.-W. Fu, J. Jia, Hierarchical point-edge interaction network for point cloud semantic segmentation, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 10432–10440.
  • [74] L. Wang, Y. Huang, Y. Hou, S. Zhang, J. Shan, Graph attention convolution for point cloud semantic segmentation, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 10288–10297. doi:10.1109/CVPR.2019.01054.
  • [75] C. Chen, L. Z. Fragonara, A. Tsourdos, Gapointnet: Graph attention based point neural network for exploiting local feature of point cloud, Neurocomputing 438 (2021) 122–132. doi:10.1016/j.neucom.2021.01.095.
  • [76] K. Xu, W. Hu, J. Leskovec, S. Jegelka, How powerful are graph neural networks?, in: International Conference on Learning Representations, 2019.
  • [77] W. L. Hamilton, R. Ying, J. Leskovec, Inductive representation learning on large graphs, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Red Hook, NY, USA, 2017, pp. 1025–1035.