SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention

Feng Xiao, Hongbin Xu, Qiuxia Wu, Wenxiong Kang

Feng Xiao, Hongbin Xu, and Wenxiong Kang are with the School of Automation Science and Engineering, South China University of Technology, Guangzhou, China, 510006. Qiuxia Wu is with the School of Software Engineering, South China University of Technology, Guangzhou, China, 510006. Wenxiong Kang is the corresponding author (email: [email protected]). The code is released at https://github.com/onmyoji-xiao/3dvg_SeCG.

Abstract

3D visual grounding aims to automatically locate the 3D region of a specified object given the corresponding textual description. Existing works fail to distinguish similar objects, especially when multiple referred objects are involved in the description. Experiments show that direct matching of the language and visual modalities has limited capacity to comprehend complex referential relationships in utterances, mainly because of the interference caused by redundant visual information in cross-modal alignment. To strengthen relation-oriented mapping between different modalities, we propose SeCG, a semantic-enhanced relational learning model based on a graph network with our designed memory graph attention layer. Our method replaces the original language-independent encoding with cross-modal encoding in visual analysis. More text-related feature expressions are obtained through the guidance of global semantics and implicit relationships. Experimental results on the ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms the existing state-of-the-art methods, particularly improving the localization performance on multi-relation challenges.

Index Terms:

3D visual grounding, visual-language learning, point cloud, graph attention, referential relationship

I Introduction

Figure 1: Comparison of results without (a) and with (b) our multi-relation improvement; (c) shows the ground truth and related objects. The green words in the utterances are target names and the blue words are references. The decomposed pairwise relationships are framed in the text and correspond to the dashed lines of the same color in the pictures above.

Vision and language are two critical modalities for computers to understand real 3D scenes and solve deep problems [1]. A plethora of studies on modalities of interaction between vision and language provide the substantial prerequisite for various applications such as autonomous driving, robotics, and remote sensing [2]. With the flourishing of multi-modal learning, 3D visual grounding is a novel challenging task aimed at empowering the computer to find the correspondence between human language and 3D visual information.

The core of the well-explored 3D visual grounding task is the perception of referential relationships, that is, the target object is found by describing its position relative to the referred objects. In terms of single modalities, 3D object detection and natural language processing have achieved great success in their respective domains. Existing models already support the recognition of fine-grained objects and the parsing of natural sentences [3, 4]. Consequently, the challenge of 3D visual grounding essentially lies in cross-modal alignment, where uncertain viewpoints and interference from other similar objects increase the difficulty. A grounding algorithm is required not only to distinguish the target and the referred objects at the semantic level but also to find the unique object based on the complex orientation relationships in the sentence.

Existing models, with their weak understanding of referential dependence, are prone to selecting distractors, especially when multiple referred objects are involved [5, 6, 7, 8, 9, 10]. Directly matching independently encoded features of the two modalities is insufficient for understanding complex relationships. As shown in Fig 1, each utterance contains multiple objects that are spatially related to the target directly or indirectly. In (a), the basic method of matching separately extracted text and visual features incorrectly locates a similar distractor, mainly because of insufficient perception of some referential relationships. It is worth noting that the definition of the target also depends on the multi-level relationships between other objects, such as “bed” and “bookshelf”, or “wall” and “whiteboard”. Taking the second scene as an example, three objects are involved in the definition of the target “desk”, and its specific position is determined by the progressive understanding of the three marked relationship phrases. In (b), the improved method pays more attention to the simultaneous perception of multiple references and locates the correct targets.

To handle the aforementioned issue, we aim to improve the cross-modal alignment for descriptions with multiple referred objects from two perspectives: (i) Relational learning. Modeling the object relationships before visual-language matching is conducive to the subsequent integration of orientation clues. Adding language information to the autonomous learning of relationships can guide the visual encoding toward the salient objects, whereas visual encoding in most existing models is language-independent. (ii) Semantic enhancement. When facing a long description, humans often first localize the salient objects of the mentioned categories. That is, the semantic category completes a preliminary screening before the inherent appearance properties and related connections are analyzed. This prior semantic knowledge can be utilized in early encoding to extract more associated features for relation-level perception.

In this paper, we propose SeCG, a semantic-enhanced relational learning model based on graph attention for 3D visual grounding. The overall pipeline is shown in Fig 2. For relational learning, encoded objects from point clouds are used as nodes of a novel graph attention network (GAT) to learn implicit relationships. Our graph network is capable of extracting the attention features relevant to referential descriptions through a multi-modal memory unit. To improve the adaptability of the model to various viewpoints, we embed relative position encoding of multiple views into the graph attention calculations. For semantic enhancement, each node in the graph is prompted to simultaneously aggregate deep features from two modes of expression, the original RGB point cloud and the semantic point cloud. The semantic point cloud is a high-level expression without color and texture, which makes the encoder focus more on object position and category and provides direct guidance for cross-modal alignment in relational scene comprehension. Finally, the visual encoding features and language encoding features are fed into a transformer decoder to find the corresponding target.

The main contributions are summarized as follows:

• We propose a semantic-enhanced visual grounding model with cross-modal graph attention, focusing on the challenging localization with multiple referred objects.
• We design a novel graph attention layer for implicit relational learning, which introduces language-guided memory units and multi-view geometrical assistance.
• We first exploit prior semantic knowledge in point cloud encoding by constructing a new expression, synchronously perceiving more directed information from referential languages.
• Our method is evaluated on ReferIt3D [5] and ScanRefer [6] and outperforms the existing state-of-the-art methods.

II Related Work

II-A 3D Visual Grounding

Visual grounding, a multidisciplinary task integrating computer vision and natural language processing, has been extended to the 3D field in recent years. The mainstream reasoning pipelines are generally divided into two camps: two-stage methods, in which objects are extracted by a front-end detector, and one-stage methods, which adjust the predicted regions jointly with multi-modal alignment [11]. The 3D visual grounding task is conducted on scene point clouds and is devoted to matching the sole target among many objects according to the referential description. Unlike 2D images, point clouds are sparse and noisy and lack dense texture and structured representation, which seriously limits the migration of outstanding 2D localization methods with pixel-level visual encoding [12, 13, 14]. There are also some methods that jointly train visual grounding and captioning models to achieve complementary improvement [15, 16].

ScanRefer [6] is the first work to construct a large-scale 3D localization dataset with free-form descriptions and to develop a two-stage network architecture. ReferIt3D [5] then provides two other large-scale and complementary datasets, Nr3D and Sr3D, which only focus on scenes with multiple instances of the same fine-grained class. Based on the above baselines, many two-stage grounding methods have been proposed. SAT [8] utilizes semantic features of 2D projections from a pre-trained detector in training to learn a better 3D object representation, while LAR [9] directly synthesizes 2D images of each object as auxiliary semantics. TransRefer3D [17] is the first to introduce the Transformer [18] architecture, establishing entity-and-relation aware attention to distinguish the referent corresponding to the same word. LanguageRefer [7] pays more attention to view-dependent utterances and verifies the positive significance of viewpoint correction. Also addressing the view rotation challenge, MVT [10] fuses multi-view position encoding with point cloud features and clearly improves the overall performance at a relatively low cost by rotating box coordinates. In contrast, ViewRefer [19] rotates the point clouds to multiple views to extract visual features and leverages a large-scale language model to expand the view-related text.

Different from the detection-then-matching pipelines, one-stage grounding models predict or adjust bounding boxes in the final multi-modal decoding. BUTD-DETR [20] outputs the target box through an extra prediction head of the decoder, which means text understanding can guide the detection results. EDA [21] similarly accomplishes the box prediction in the decoding stage, preventing imbalanced and ambiguous learning through text decoupling and component alignment. 3D-SPS [22] regards the visual grounding task as point selection, in which points are filtered from coarse to fine by MLP (multilayer perceptron) and Transformer layers. Single-stage methods indeed improve the defective outputs of independent detectors, but two-stage methods are still widely applied owing to their rich intermediate results for complex demands in scene understanding tasks. In this paper, we only focus on the matching of the vision and language modalities in the two-stage framework, as in ReferIt3D [5].

II-B Graph Attention for Visual-language

In order to learn relationship representations that deepen the understanding of complex scenes or events, graph neural networks (GNNs) have been used in many cross-modal tasks such as visual question answering (VQA), visual grounding, and image-text matching. The graph attention network (GAT) is one of the most popular GNN variants, in which every node computes the weights of its neighbors and aggregates relevant features to update its representation [23]. ReGAT [24] constructs two graphs for detected objects, a fully connected relation graph for implicit relations and a pruned graph with prior knowledge, both calculated by an attention mechanism. The work in [25] for VQA also introduces a graph attention convolution layer, but the graph is based on object differences with extra soft attention. For image-text matching tasks, researchers tend to build GNNs with image areas and noun phrases respectively to obtain global and local correspondence [26, 27, 28].

GAT is similarly employed in the 3D cross-modal field. FFL-3DOG [29] generates a context-aware object representation by matching a 3D visual graph with the nodes of the language scene graph, and the top K proposals with higher scores are chosen to predict the grounding target. In this paper, instead of enhancing the multi-modal fusion on graphs by node matching, we add language-guided memory into the graph attention layer to allow each node to learn relevant information automatically.

III Methods

III-A Overview

First, the 3D scene point clouds are segmented into independent objects for subsequent processing. The instance labels in ReferIt3D [5] are available from the ground truth of the ScanNet dataset [30]. In contrast, ScanRefer [6] needs a prerequisite network to output objects, and we employ a pre-trained PointGroup [31] for point cloud segmentation, as in previous works [32, 10]. Fig 2 shows the model architecture and the inference process of localization. The proposals are further processed by two important modules, semantic-enhanced visual encoding and relation learning with graph attention. Finally, a Transformer decoder outputs the localization results from the object features of the graph nodes and the text features of a pre-trained language model.

III-B Semantic-enhanced Visual Encoding

For a fair comparison with previous methods like [8, 7, 20, 17, 10], we use the same PointNet++ [33] as a basic backbone to encode the initial point cloud containing $N$ objects. We randomly sample 1024 points with coordinates and colors $(X,Y,Z,R,G,B)$ for each object and feed them into the network, getting $N$ 768-dimensional features.

Afterwards, we build a new representation $(X,Y,Z,C)$ for 3D points in the semantic point cloud. It is defined on high-level semantics, where $C$ represents the semantic category. The category information can be obtained from an extra MLP classification module or previous segmentation results. We use a smaller network to encode semantic point clouds without complex color or texture. It is guided to understand more relationships by simplified representation of objects, rather than limited to appearance learning. Subsequently, the encoding features $V_{rgb}$ and $V_{sem}$ from the two point clouds are fused as follows:

$$V_{F} = \mathrm{ReLU}([fc_{1}(V_{rgb}), fc_{2}(V_{sem})]) \tag{1}$$

where $[,]$ represents concatenation, and $fc_{1}$ and $fc_{2}$ are the fully connected layers for feature mapping, whose results are concatenated to form the semantic-enhanced feature $V_{F}$.
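As a rough illustration of Eq. (1), the fusion can be sketched in NumPy as follows. This is only a minimal sketch and not the released implementation: the function name `fuse_features`, the semantic feature width `d_sem`, the per-branch output width `d_half`, and the random weights are all assumptions made for the example.

```python
import numpy as np


def fuse_features(v_rgb, v_sem, w1, b1, w2, b2):
    """Eq. (1): V_F = ReLU([fc_1(V_rgb), fc_2(V_sem)])."""
    mapped_rgb = v_rgb @ w1 + b1  # fc_1 applied to the RGB point cloud features
    mapped_sem = v_sem @ w2 + b2  # fc_2 applied to the semantic point cloud features
    fused = np.concatenate([mapped_rgb, mapped_sem], axis=-1)  # [ , ]
    return np.maximum(fused, 0.0)  # ReLU


rng = np.random.default_rng(0)
# d_sem and d_half are assumed sizes; only the 768-d RGB feature comes from the text.
n_objects, d_rgb, d_sem, d_half = 5, 768, 128, 384
v_rgb = rng.standard_normal((n_objects, d_rgb))
v_sem = rng.standard_normal((n_objects, d_sem))
w1 = rng.standard_normal((d_rgb, d_half)) * 0.02
w2 = rng.standard_normal((d_sem, d_half)) * 0.02
b1 = np.zeros(d_half)
b2 = np.zeros(d_half)

v_f = fuse_features(v_rgb, v_sem, w1, b1, w2, b2)
```

With these assumed sizes, each object ends up with one fused feature whose first half comes from the RGB branch and second half from the semantic branch.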

In this cascade structure, the generation of a semantic point cloud is completely determined by the previous category information. For visual grounding tasks where segmentation categories cannot be employed directly (as in ReferIt3D), the classification results in the early training stage are inaccurate, which causes the semantic encoder to learn from noisy category inputs and affects the optimization direction of subsequent modules. Consequently, the shallow encoder for RGB point clouds is pre-trained on the same datasets before the holistic training to overcome this defect.
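The construction of the two point cloud expressions for one object can be sketched as below. The text only states that each point carries $(X,Y,Z,C)$; how $C$ is stored is not specified, so the normalized scalar channel used here, the helper names, and the class count of 40 are assumptions for illustration.

```python
import numpy as np


def sample_points(points, n_samples=1024, rng=None):
    """Randomly sample a fixed number of points from one object."""
    rng = np.random.default_rng() if rng is None else rng
    # Sample with replacement only when the object has too few points.
    replace = points.shape[0] < n_samples
    idx = rng.choice(points.shape[0], size=n_samples, replace=replace)
    return points[idx]


def to_semantic_points(xyz, class_id, num_classes):
    """Build (X, Y, Z, C); C is an assumed normalized class-index channel."""
    c = np.full((xyz.shape[0], 1), class_id / max(num_classes - 1, 1))
    return np.concatenate([xyz, c], axis=1)


rng = np.random.default_rng(0)
object_xyzrgb = rng.random((3000, 6))  # toy object with (X, Y, Z, R, G, B)
sampled = sample_points(object_xyzrgb, rng=rng)
semantic = to_semantic_points(sampled[:, :3], class_id=7, num_classes=40)
```

Every point of the object shares the same value of $C$, so the semantic encoder sees only geometry and category, while the RGB encoder keeps the color channels.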

III-C Relation Graph Learning

The core of understanding referential descriptions is to grasp the position relationships of the mentioned objects. We construct a fully connected graph to autonomously learn implicit relationships among objects based on a graph attention network. The object features extracted by the previous networks constitute the nodes of the scene graph. In our designed graph network, each node passes through two multi-head attention layers. The $i$-th node value is updated to $v_{i}'$ by aggregating information from its adjacent nodes $D_{i}$:

$$v_{i}' = \big\Vert_{k=1}^{K}\, \sigma\Big(\sum_{j\in D_{i}} \alpha_{ij}^{k} W^{k} v_{j}\Big) \tag{2}$$

where $v_{j}$ represents the node value of $j$ -th neighbor, $\alpha_{ij}^{k}$ is the attentive weight of $k$ -th head, $W^{k}$ is the projection matrix, and $\sigma()$ indicates a nonlinear activation layer. The results of K heads are weighted and added to obtain the final value. To enable the relational learning model to leverage textual information and better adapt to view transformations, we propose two sub-modules to improve the intrinsic attention algorithm. The node update process in one graph attention layer is shown in Fig 3.
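A minimal NumPy sketch of the node update in Eq. (2) on a fully connected graph is given below. It follows the $\Vert$ operator in the equation and concatenates the heads. The scaled dot-product form of $\alpha_{ij}^{k}$, the choice of ReLU for $\sigma$, and the weight shapes are assumptions, since the text does not fix them at this point.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def graph_attention_update(v, w_q, w_k, w_v):
    """v: (N, d) node features; w_q, w_k, w_v: (K, d, d_head) per-head weights."""
    heads = []
    for k in range(w_q.shape[0]):
        q, key, val = v @ w_q[k], v @ w_k[k], v @ w_v[k]  # w_v[k] plays the role of W^k
        # alpha[i, j] is the weight of neighbor j for node i in head k (rows sum to 1).
        alpha = softmax(q @ key.T / np.sqrt(q.shape[-1]))
        heads.append(np.maximum(alpha @ val, 0.0))  # sigma, assumed to be ReLU
    return np.concatenate(heads, axis=-1)  # the || operator over K heads


rng = np.random.default_rng(0)
n_nodes, d, n_heads = 6, 768, 8
d_head = d // n_heads
v = rng.standard_normal((n_nodes, d))
w_q, w_k, w_v = (rng.standard_normal((n_heads, d, d_head)) * 0.02 for _ in range(3))
v_new = graph_attention_update(v, w_q, w_k, w_v)
```

Because the graph is fully connected, $D_{i}$ covers all nodes and the sum over neighbors becomes a single matrix product per head.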

III-C1 Auxiliary Memory Unit

As discussed in the analysis of attention limitations in [34], the self-attention coefficient $\alpha$ in equation (2) is weak at inheriting prior knowledge about real scenes. We add a learnable matrix as a memory unit to the key and value of the attention operator to avoid this limitation. However, this attention structure is still a probabilistic model that relies purely on vision. For targets grounded by multiple reference relationships or composite relationships, it is necessary to introduce the text modality into relational learning to select the attention content and reduce redundant connections. In other words, during message passing and updating, graph nodes perform multi-directional, differentiated learning on the description-related information flow, not only from the target to the referred objects. Therefore, we design a novel memory unit that aligns text information to improve graph attention:

$$X_{m} = [[v_{i\in N}], F_{t}\times M] \tag{3}$$

where $[v_{i\in N}]$ is the feature matrix composed of $N$ object features; the text feature vector $F_{t}$ and a learnable memory matrix $M$ are fused into a text-enhanced memory unit and combined with the original features. The new feature matrix $X_{m}$ is then used for graph attention computation to obtain the updated node values $v_{i\in N}'$:

$$[v_{i\in N}'] = \mathrm{softmax}\left(\frac{W_{q}[v_{i\in N}]\cdot W_{k}X_{m}^{T}}{\sqrt{d_{m}}}\right)W_{v}X_{m} \tag{4}$$

where $W_{q}$ , $W_{k}$ and $W_{v}$ are weight coefficients, the dimension of $X_{m}$ is used as a scaling factor $d_{m}$ .
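Equations (3) and (4) can be sketched for a single head as follows. The text does not spell out how $F_{t}\times M$ is computed, so this example assumes an element-wise product that broadcasts the text vector over the rows of $M$. The function name and the random weights are also hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def memory_graph_attention(v, f_t, m, w_q, w_k, w_v):
    """v: (N, d) nodes, f_t: (d,) text feature, m: (H, d) learnable memory."""
    text_memory = f_t[None, :] * m  # F_t x M, assumed element-wise with broadcasting
    x_m = np.concatenate([v, text_memory], axis=0)  # Eq. (3), shape (N + H, d)
    q = v @ w_q    # queries come from the object nodes only
    k = x_m @ w_k  # keys and values also cover the memory slots
    val = x_m @ w_v
    d_m = x_m.shape[-1]  # scaling factor d_m
    attn = softmax(q @ k.T / np.sqrt(d_m))  # (N, N + H)
    return attn @ val, attn  # Eq. (4)


rng = np.random.default_rng(0)
n_nodes, d, mem_height = 6, 768, 10
v = rng.standard_normal((n_nodes, d))
f_t = rng.standard_normal(d)
m = rng.standard_normal((mem_height, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
v_new, attn = memory_graph_attention(v, f_t, m, w_q, w_k, w_v)
```

Each node attends over the $N$ object nodes plus the memory rows, so part of its attention mass can be spent on text-conditioned memory instead of on other objects.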

III-C2 Multi-view Position Embedding

The description of a 3D scene may depend on observations from any perspective, so deep features extracted from point clouds need to adapt to different views. For each object, the geometry, texture, and color attributes in point-form expression are inherently unaffected by the view. Because object location affects relational learning with referential languages, we encode multi-view information into the position embedding of graph attention. Specifically, multiple object coordinates are generated under the isometrically rotated scene views, and the point cloud features are decoupled from them as shared information. We encode the pairwise position relationship and add it to the calculation of the attention coefficient.

$$E_{r} = [\sin(\frac{M_{r}}{\mu_{a}}), \cos(\frac{M_{r}}{\mu_{a}})] \tag{5}$$

$$\alpha_{r}' = \max\{E_{r}, 0\}\,\alpha_{r} \tag{6}$$

where $M_{r}$ expresses the relative relationship of the object pair $\langle i,j\rangle$ from perspective $r$, calculated as ($\log\frac{x_{i}-x_{j}}{l_{i}^{X}}$, $\log\frac{y_{i}-y_{j}}{l_{i}^{Y}}$, $\log\frac{z_{i}-z_{j}}{l_{i}^{Z}}$), and $l_{i}$ is the length of the bounding box of the $i$-th object on each axis. $M_{r}$ is passed through sin and cos operators at wavelength $\mu_{a}$ to obtain the position encoding vector $E_{r}$, which is applied to the attention coefficient $\alpha_{r}$ as a position-related feature.
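The position embedding of Eqs. (5) and (6) can be sketched as below, under several assumptions that are not stated in the text: views are generated by rotating box centers about the vertical axis, box lengths are kept unchanged across views, the coordinate difference is taken in absolute value and clamped by `eps` so the logarithm is defined, $\mu_{a}$ is a single scalar, and a hypothetical learnable vector `w_g` projects $E_{r}$ to one scalar gate per object pair before the $\max\{\cdot,0\}$.

```python
import numpy as np


def rotate_z(centers, angle):
    """Rotate (N, 3) box centers about the vertical axis (assumed view change)."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return centers @ rot.T


def relative_position(centers, sizes, eps=1e-3):
    """M_r[i, j] = log(|p_i - p_j| / l_i) per axis; abs and eps are assumptions."""
    delta = np.abs(centers[:, None, :] - centers[None, :, :])
    return np.log(np.maximum(delta, eps) / sizes[:, None, :])


def position_encoding(m_r, mu=1.0):
    """Eq. (5): E_r = [sin(M_r / mu), cos(M_r / mu)], shape (N, N, 6)."""
    return np.concatenate([np.sin(m_r / mu), np.cos(m_r / mu)], axis=-1)


def reweight_attention(alpha, e_r, w_g):
    """Eq. (6) with an assumed projection of E_r to one scalar per pair."""
    gate = np.maximum(e_r @ w_g, 0.0)  # max{., 0}
    return gate * alpha


rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 0.5], [3.0, 1.0, 1.0], [2.0, 2.0, 2.0]])
sizes = np.full((4, 3), 0.5)
n_views = 4
alpha = np.full((4, 4), 0.25)  # toy attention coefficients
w_g = rng.standard_normal(6)

per_view = []
for r in range(n_views):
    view_centers = rotate_z(centers, 2.0 * np.pi * r / n_views)
    e_r = position_encoding(relative_position(view_centers, sizes))
    per_view.append(reweight_attention(alpha, e_r, w_g))
```

Only the box coordinates are rotated in this sketch, so the point cloud features are computed once and shared across all views.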

As shown in Fig 4, our proposed graph network consists of memory graph attention (MGA) layers and linear layers. Geometric features from $R$ views are embedded into the first-layer attention operator, and each node produces specific values under the guidance of the $r$-th view. The original node value is added to the updated value of the first layer as input to the next layer. Finally, the output features are aggregated by the average operator to obtain visual encoding features containing semantic and relational information.

III-D Visual-Language Training

The key to localization is the matching of visual content and text description. We have obtained the object features from point clouds as visual encoding results. A pre-trained language model with BERT [35] is employed for language encoding, fine-tuned with a lower learning rate in the training stage. The sentences in datasets are directly tokenized and encoded to generate 768-dimensional text features. In the localization stage, we use a standard Transformer decoder consisting of multi-head attention layers to match objects and descriptions. The expression of the decoding layer is as follows:

$$D(F_{v},F_{l}) = \big\Vert_{i=1}^{M}\, A_{n}(W_{i}^{q}A_{s}(F_{v}), W_{i}^{k}F_{l}, W_{i}^{v}F_{l}) \tag{7}$$

where $F_{v}$ and $F_{l}$ are visual coding and language coding features respectively, $A_{s}$ represents the self-attention calculation of $F_{v}$ , $W_{i}^{q}$ , $W_{i}^{k}$ and $W_{i}^{v}$ represent the three linear transformations on query, key and value of the attention operator $A_{n}$ in the $i$ -th head. After stacking several decoding layers, the probability of each object matching the described target is exported. In addition to computing the localization loss of the target, we also add classification layers regarding the object types and description sentences. The loss is defined as,

$$Loss = \lambda_{1}l_{sem} + \lambda_{2}l_{lan} + \lambda_{3}l_{ref} \tag{8}$$

where $l_{ref}$ is the localization loss, $l_{sem}$ is the classification loss of objects from the semantic point cloud, $l_{lan}$ is the loss for classifying the described target from the sentence encoding features, and $\lambda_{1}$ to $\lambda_{3}$ are weight coefficients. All three losses are computed using the cross-entropy loss function on linearly mapped features.
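The weighted sum in Eq. (8) can be sketched with plain NumPy cross-entropy terms. The default weights follow the values reported in the implementation details (0.5, 0.5, 1.0). Averaging $l_{sem}$ over the objects of a scene and the function names are assumptions for this example.

```python
import numpy as np


def cross_entropy(logits, target):
    """Cross-entropy of one logit vector against an integer class label."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]


def total_loss(sem_logits, sem_labels, lan_logits, lan_label,
               ref_logits, ref_target, lambdas=(0.5, 0.5, 1.0)):
    """Eq. (8): Loss = l1 * l_sem + l2 * l_lan + l3 * l_ref."""
    # l_sem: object classification, averaged over objects (assumed reduction).
    l_sem = np.mean([cross_entropy(l, t) for l, t in zip(sem_logits, sem_labels)])
    # l_lan: classify the described target class from the sentence features.
    l_lan = cross_entropy(lan_logits, lan_label)
    # l_ref: pick the referred object among the N candidates.
    l_ref = cross_entropy(ref_logits, ref_target)
    return lambdas[0] * l_sem + lambdas[1] * l_lan + lambdas[2] * l_ref


# Toy check with uniform logits: 5 objects, 40 classes.
loss = total_loss(
    sem_logits=np.zeros((5, 40)), sem_labels=[0, 1, 2, 3, 4],
    lan_logits=np.zeros(40), lan_label=3,
    ref_logits=np.zeros(5), ref_target=2,
)
```

With uniform logits each term reduces to the logarithm of the number of candidates, which makes the relative scale of the three weighted terms easy to inspect.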

TABLE I: Comparison with other recent methods on Nr3D and Sr3D

Methods	Nr3D Overall	Nr3D Easy	Nr3D Hard	Nr3D V-dep	Nr3D V-indep	Sr3D Overall	Sr3D Easy	Sr3D Hard	Sr3D V-dep	Sr3D V-indep
ReferIt3D [5]	35.6%	43.6%	27.9%	32.5%	37.1%	40.8%	44.7%	31.5%	39.2%	40.8%
TGNN [36]	37.3%	44.2%	30.6%	35.8%	38.0%	45.0%	48.5%	36.9%	45.8%	45.0%
InstanceRefer [32]	38.8%	46.0%	31.8%	34.5%	41.9%	48.0%	51.1%	40.5%	45.4%	48.1%
3DVG-Trans [37]	40.8%	48.5%	34.8%	34.8%	43.7%	51.4%	54.2%	44.9%	44.6%	51.7%
TransRefer3D [17]	42.1%	48.5%	36.0%	36.5%	44.9%	57.4%	60.5%	50.2%	49.9%	57.7%
LanguageRefer [7]	43.9%	51.0%	36.6%	41.7%	45.0%	56.0%	58.9%	49.3%	49.2%	56.3%
SAT [8]	49.2%	56.3%	42.4%	46.9%	50.4%	57.9%	61.2%	50.0%	49.2%	58.3%
BUTD-DETR [20]	54.6%	60.7%	48.4%	46.0%	58.0%	67.0%	68.6%	63.2%	53.0%	67.6%
MVT [10]	55.1%	61.3%	49.1%	54.3%	55.4%	64.5%	66.9%	58.8%	58.4%	64.7%
ViewRefer [19]	56.0%	63.0%	49.7%	55.1%	56.8%	67.0%	68.9%	62.1%	52.2%	67.7%
SeCG(ours)	57.9%	64.2%	51.9%	57.2%	58.3%	68.3%	71.1%	61.7%	57.0%	68.8%

• The first-ranked overall evaluation results are underlined, and the highest item for each indicator is bolded.

TABLE II: Comparison with other recent methods without extra data on ScanRefer.

Methods	Detector	2D	Unique Acc@0.25	Unique Acc@0.5	Multiple Acc@0.25	Multiple Acc@0.5	Overall Acc@0.25	Overall Acc@0.5
Validation Results
ScanRefer [6]	VoteNet		67.64%	46.19%	32.06%	21.26%	38.97%	26.10%
TGNN [36]	3D-UNet		68.61%	56.80%	29.84%	23.18%	37.37%	29.70%
SAT [8]	VoteNet	✓	73.21%	50.83%	37.64%	25.16%	44.54%	30.14%
InstanceRefer [32]	PointGroup		77.45%	66.83%	31.27%	24.77%	40.23%	32.93%
MVT [10]	PointGroup		77.67%	66.45%	31.92%	25.26%	40.80%	33.26%
ViewRefer [19]	PointGroup		-	-	33.08%	26.50%	41.30%	33.66%
3DVG-Transformer [37]	VoteNet		77.16%	58.47%	38.38%	28.70%	45.90%	34.47%
3DJCG* [15]	VoteNet	✓	-	64.50%	-	30.29%	-	36.93%
SeCG(ours)	PointGroup		77.72%	67.41%	38.53%	31.67%	46.13%	38.59%
Test Results
ScanRefer [6]	VoteNet	✓	68.59%	43.53%	34.88%	20.97%	42.44%	26.03%
TGNN [36]	3D-UNet		68.34%	58.94%	33.12%	25.26%	41.02%	32.81%
InstanceRefer [32]	PointGroup		77.82%	66.69%	34.57%	26.88%	44.27%	35.80%
D3Net* [38]	PointGroup	✓	76.59%	65.79%	36.19%	27.26%	45.25%	35.90%
SeCG(Ours)	PointGroup		72.88%	61.75%	36.96%	29.33%	45.01%	36.60%
SeCG+(Ours)	PointGroup	✓	77.99%	66.28%	36.36%	28.23%	45.69%	36.77%

• The first-ranked overall evaluation results are underlined, and the highest item for each indicator is bolded.

IV Experiment

IV-A Datasets

Nr3D and Sr3D are two large-scale and complementary visual grounding datasets from ReferIt3D [5]. Nr3D contains 41,503 natural utterances about object description generated by taggers on ScanNet [30] 3D scenes. Sr3D is a synthetic dataset where referential languages are automatically generated by templates of five established spatial relationships, containing 83,572 utterances. The above-described objects have no more than six distractors of the same type in the corresponding scene.

ScanRefer dataset [6] provides 51,583 human-written free-form descriptions of 11,046 objects in ScanNet. Compared with ReferIt3D, it prefers to use multiple short sentences to describe one object and usually requires the introduction of an object detector to generate proposals before localization due to the unknown object bounding boxes.

IV-B Evaluation Metrics

In the visual grounding task, each description corresponds to a unique target, and the number of correctly matched text-object pairs determines the accuracy of the localization models. As the datasets pose different challenges, the evaluation metrics are also diverse.

Nr3D is divided into easy and hard subsets based on whether there are more than 2 distractors to evaluate the model’s fine-grained discrimination ability of similar objects. On the other hand, some referential languages are defined according to a specific perspective, so Nr3D also evaluates the robustness of perspective changes from view-dependent and view-independent samples respectively. In addition, the major problem discussed in this paper is multi-relationship utterances with different references. We employ the number of non-target type objects mentioned in the sentence to evaluate it.

In ScanRefer, the object bounding boxes are also included as prediction items in the evaluation criteria. When calculating the localization accuracy, it is necessary to judge whether the prediction box correctly hits the target based on IoU with thresholds of 0.5 and 0.25. ScanRefer sets two subsets, unique and multiple samples, to evaluate model performance, where the unique means only a single object of the target class in the scene.
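The IoU-thresholded accuracy can be illustrated with the sketch below. It assumes axis-aligned boxes given as center and size, and counts a prediction as correct when its IoU is at least the threshold. Both conventions and the function names are assumptions of this example and do not reproduce the official benchmark script.

```python
import numpy as np


def box_iou_3d(box_a, box_b):
    """IoU of two axis-aligned boxes given as (cx, cy, cz, dx, dy, dz)."""
    a_min, a_max = box_a[:3] - box_a[3:] / 2, box_a[:3] + box_a[3:] / 2
    b_min, b_max = box_b[:3] - box_b[3:] / 2, box_b[:3] + box_b[3:] / 2
    # Per-axis overlap length, clipped at zero when the boxes do not intersect.
    overlap = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0.0, None)
    inter = overlap.prod()
    union = box_a[3:].prod() + box_b[3:].prod() - inter
    return inter / union


def accuracy_at_iou(pred_boxes, gt_boxes, threshold):
    """Fraction of descriptions whose predicted box reaches the IoU threshold."""
    ious = np.array([box_iou_3d(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return float((ious >= threshold).mean())


box_a = np.array([0.0, 0.0, 0.0, 2.0, 2.0, 2.0])
box_b = np.array([1.0, 0.0, 0.0, 2.0, 2.0, 2.0])  # shifted by half its width
preds = [box_a, box_a]
gts = [box_a, box_b]
acc_025 = accuracy_at_iou(preds, gts, 0.25)
acc_05 = accuracy_at_iou(preds, gts, 0.5)
```

In this toy case the second prediction overlaps its ground truth with an IoU of one third, so it counts as a hit at the 0.25 threshold but not at 0.5.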

IV-C Implementation Details

Both point cloud encoders are based on PointNet++, and the semantic point cloud encoder is smaller than the common structure. The relation graph network consists of two MGA layers, where the feature dimension of each node is set to 768, the number of heads $K$ is 8, the height of the memory matrix $M$ is 10, and the position embedding uses 4 views. The loss weight coefficients $\lambda_{1}$ and $\lambda_{2}$ are set to 0.5, and $\lambda_{3}$ is set to 1.0. All models are trained with a batch size of 36 for 120 epochs using the Adam [39] optimizer. The learning rate is initialized to $5\times 10^{-4}$ with a decay of 0.65 every 10 epochs from epoch 30 to epoch 80, and the learning rate of the pre-trained classification layers, language encoding layers, and transformer layers is $1/10$ of that.
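One possible reading of the learning rate schedule is sketched below. Whether epoch 80 itself triggers a decay is not stated, so the milestones at epochs 30, 40, 50, 60, 70, and 80 are an assumption, as are the function name and its parameters.

```python
def learning_rate(epoch, base_lr=5e-4, gamma=0.65, start=30, stop=80, step=10):
    """Step decay: multiply by gamma at every milestone reached so far."""
    milestones = range(start, stop + 1, step)  # 30, 40, ..., 80 (assumed inclusive)
    n_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** n_decays


# Main learning rate and the 1/10 rate used for the pre-trained layers.
schedule = [(e, learning_rate(e), learning_rate(e) / 10) for e in (0, 30, 50, 119)]
```

Under this reading the rate stays at its initial value for the first 30 epochs and is constant again after the last milestone until training ends at epoch 120.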

IV-D Localization Results

IV-D1 Nr3D/Sr3D

Table I shows the performance of our method and recent works on Nr3D and Sr3D. “V-dep” and “V-indep” represent the evaluation results on view-dependent and view-independent samples. “Easy” indicates the localization accuracy on samples with fewer than 2 objects of the same category as the target in the same scene, and “Hard” is the opposite. Our proposed SeCG reaches the state of the art with overall accuracies of 57.9% on Nr3D and 68.3% on Sr3D, improving in particular on the hard and view-dependent samples of Nr3D by 2.2% and 2.1%, respectively. Compared with the synthetic sentences in Sr3D, which are generated with pairwise-relation templates, the human-defined referential relations in Nr3D are more complex and sometimes require indirect relations to complete the localization. Therefore, our main improvement in relational learning is most evident on Nr3D. This shows that our approach effectively analyzes and resolves cases involving multiple reference relationships.

Fig 5 shows some visualization results on samples with more than 2 references. Our approach adds a relational learning module and a semantic enhancement module on top of MVT, using a similar backbone and multi-view aggregation technique. Note that MVT uses language-guided supervision with 525 instance classes in training to improve the overall accuracy, whereas we only set 40 rough classes in the loss because we want the model to focus more on class-independent understanding. “SeCG-nonsem” represents our model trained without semantic enhancement. Our proposed graph attention network is an effective structure for understanding referential relationships, and semantic-level encoding is also conducive to the extraction of relational information.

IV-D2 ScanRefer

Our method is compared with other two-stage visual grounding methods on ScanRefer in Table II. “Acc@0.25” and “Acc@0.5” represent the accuracy of the predicted bounding boxes with 0.25 and 0.5 IoU thresholds respectively. “✓” marks methods that incorporate multi-view 2D features during inference, and unmarked methods only use 3D point clouds. “*” means that the model is only trained on visual grounding datasets and losses in a joint architecture. “Test Results” only includes published methods that have been evaluated on the online benchmark. The second column shows the detector of the first stage, and the third column shows whether multi-view features from 2D images are added. Our proposed method outperforms the others in overall accuracy on both the validation and test sets. Among the models that also use PointGroup as the detector, we significantly improve on the hard samples with multiple distractors. Note that InstanceRefer has a particularly high accuracy of 66.69% on “Unique”, mainly because it filters objects by target category before localization, which greatly reduces the difficulty of single-object scenes.

In the online benchmark, we test our method with and without 2D information. SeCG uses point clouds only, and SeCG+ projects 2D multi-view features from a pre-trained ENet [40] onto the points, in line with the ScanRefer baseline. Table II indicates that adding 2D features greatly helps in localizing the unique object but is not conducive to scenarios with multiple same-class objects. This is because the latter needs more relationships to support the description, while redundant appearance features may interfere with the direction of relational learning, which is consistent with our previous analysis of the semantic effect.

Some localization results are visualized in Fig 6. Based on the results on various datasets, our method can correctly locate the target in most complex scenarios but has a weak understanding of some rare attributes and relationships. For example, “shorter” implicitly compares the target to other similar objects that are not mentioned directly, and some negative descriptions with “not” may be misleading. In addition, the incorrect predictions usually fall on similar objects near the target. Improvements can be made in these directions in the future.

TABLE III: Ablation studies of the graph attention module

GAT	MP	MU	Overall	M-Num $>$ 2	M-Num $\leq$ 2
			51.8%	45.6%	52.2%
✓			53.9%	52.4%	54.3%
✓	✓		54.6%	53.4%	54.9%
✓	✓	✓	55.2%	54.3%	55.5%

IV-E Ablation Study

IV-E1 Relation Graph

We verify the effectiveness of each sub-module in the relation graph network with ablation experiments in Table III. “GAT”, “MU”, and “MP” represent the graph attention module, memory unit module, and multi-view position embedding module. “M-Num” is the number of mentioned objects, roughly counted by parsing the object classes that occur in the utterances. The evaluation is based on Nr3D, whose unified object boxes and free-form utterances exclude the influence of the detector and directly reflect the ability to understand real language. Compared to the baseline in row 1, our proposed graph network improves the grounding performance on samples with more than 2 object classes mentioned in the description by 8.7%, outstripping the gain on the other samples with fewer mentioned classes. In other words, when there are at least two direct or indirect referential relations in the utterance, our model can better understand them and locate the target.
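The rough “M-Num” statistic could be computed along the following lines. The exact parsing procedure is not described, so this sketch simply counts the distinct class names that appear as words in the utterance, handles only single-word class names with a naive plural rule, and uses a hypothetical class list and example sentence.

```python
import re


def mentioned_classes(utterance, class_names):
    """Return the class names that occur as words in the utterance."""
    tokens = set(re.findall(r"[a-z]+", utterance.lower()))
    # A class counts as mentioned if its singular or naive plural form appears.
    return {name for name in class_names if name in tokens or name + "s" in tokens}


class_names = ["desk", "bed", "bookshelf", "whiteboard", "wall", "chair"]
utterance = "The desk between the bed and the bookshelf, next to the whiteboard."
found = mentioned_classes(utterance, class_names)
m_num = len(found)
is_multi_relation = m_num > 2  # the "M-Num > 2" subset
```

A sample falls into the “M-Num $>$ 2” subset when more than two object classes are detected this way, which is why the statistic is only approximate for paraphrased or multi-word object names.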

IV-E2 Semantic Enhancement

Table IV lists the influence of the semantic-enhanced encoding mode with different backbones. PointNet++ is the most widely used network in 3D visual grounding tasks, so we adopt it for a fair comparison with other methods. Point Transformer [41] is a more recent point cloud network based on the attention mechanism. Smaller networks (0.91M parameters for PointNet++ and 1.53M for Point Transformer) are constructed to capture more location information for relational learning from the semantically rendered point clouds. Semantic point cloud encoding improves the localization performance of both backbones, especially on hard samples. Point Transformer is larger but shows no obvious advantage, which indicates that relational learning benefits more from directional information such as semantic-level position extraction than from rich features.

TABLE IV: Ablation studies of semantic-enhanced encoding module

Backbone	Sem-encoder	Overall	Easy	Hard
PointNet++		55.2%	61.6%	49.1%
PointNet++	✓	57.9%	64.2%	51.9%
Point Transformer		56.4%	62.3%	50.3%
Point Transformer	✓	57.2%	63.4%	51.2%
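The gains attributed to the semantic encoder can be read directly off Table IV. The short calculation below is only arithmetic over the reported accuracies; the dictionary layout and the helper name `semantic_gain` are illustrative choices.

```python
# Accuracies (%) copied from Table IV, ordered as (Overall, Easy, Hard).
table_iv = {
    "PointNet++": {"without": (55.2, 61.6, 49.1), "with": (57.9, 64.2, 51.9)},
    "Point Transformer": {"without": (56.4, 62.3, 50.3), "with": (57.2, 63.4, 51.2)},
}

def semantic_gain(rows):
    """Percentage-point gain of the semantic encoder for each split."""
    return tuple(round(w - wo, 1) for wo, w in zip(rows["without"], rows["with"]))

gains = {backbone: semantic_gain(rows) for backbone, rows in table_iv.items()}
for backbone, (overall, easy, hard) in gains.items():
    print(f"{backbone}: overall +{overall}, easy +{easy}, hard +{hard}")
```

With these numbers, PointNet++ gains 2.7, 2.6, and 2.8 points on the overall, easy, and hard splits, while Point Transformer gains 0.8, 1.1, and 0.9 points.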

V Conclusion

In this paper, we point out a challenge in 3D visual grounding: the weak understanding of utterances that refer to multiple objects. To perceive complex and indirect relationships among the mentioned objects, we propose SeCG, a semantic-enhanced visual grounding model based on graph attention. We construct a cross-modal graph attention network with a language-guided updating layer for relational learning and utilize prior semantic knowledge to enhance its perception. Unlike previous works that directly match visual and language features, our 3D encoding module provides more useful relationship information and improves matching for complex referential descriptions. Experimental results on ReferIt3D and ScanRefer show that our method outperforms others in overall accuracy, especially on targets that require multiple references to locate. Unintuitive or uncommon relationship descriptions remain challenging for the current model. In future work, we will refine our model to strengthen text understanding and cross-modal alignment.

References

[1] Y. Du, Z. Liu, J. Li, and W. X. Zhao, “A survey of vision-language pre-trained models,” arXiv preprint arXiv:2202.10936, 2022.
[2] J. Lahoud, J. Cao, F. S. Khan, H. Cholakkal, R. M. Anwer, S. Khan, and M.-H. Yang, “3d vision with transformers: a survey,” arXiv preprint arXiv:2208.04309, 2022.
[3] Y. Guo, H. Wang, Q. Hu, H. Liu, L. Liu, and M. Bennamoun, “Deep learning for 3d point clouds: A survey,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 12, pp. 4338–4364, 2020.
[4] D. Yin, L. Dong, H. Cheng, X. Liu, K.-W. Chang, F. Wei, and J. Gao, “A survey of knowledge-intensive nlp with pre-trained language models,” arXiv preprint arXiv:2202.08772, 2022.
[5] P. Achlioptas, A. Abdelreheem, F. Xia, M. Elhoseiny, and L. Guibas, “Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16. Springer, 2020, pp. 422–440.
[6] D. Z. Chen, A. X. Chang, and M. Nießner, “Scanrefer: 3d object localization in rgb-d scans using natural language,” in European conference on computer vision. Springer, 2020, pp. 202–221.
[7] J. Roh, K. Desingh, A. Farhadi, and D. Fox, “Languagerefer: Spatial-language model for 3d visual grounding,” in Conference on Robot Learning. PMLR, 2022, pp. 1046–1056.
[8] Z. Yang, S. Zhang, L. Wang, and J. Luo, “Sat: 2d semantics assisted training for 3d visual grounding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1856–1866.
[9] E. Bakr, Y. Alsaedy, and M. Elhoseiny, “Look around and refer: 2d synthetic semantics knowledge distillation for 3d visual grounding,” Advances in Neural Information Processing Systems, vol. 35, pp. 37 146–37 158, 2022.
[10] S. Huang, Y. Chen, J. Jia, and L. Wang, “Multi-view transformer for 3d visual grounding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15 524–15 533.
[11] J. Huang, Y. Qin, J. Qi, Q. Sun, and H. Zhang, “Deconfounded visual grounding,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 1, 2022, pp. 998–1006.
[12] J. Deng, Z. Yang, D. Liu, T. Chen, W. Zhou, Y. Zhang, H. Li, and W. Ouyang, “Transvg++: End-to-end visual grounding with language conditioned vision transformer,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[13] Y. Du, Z. Fu, Q. Liu, and Y. Wang, “Visual grounding with transformers,” in 2022 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2022, pp. 1–6.
[14] S. Chen and B. Li, “Multi-modal dynamic graph transformer for visual grounding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15 534–15 543.
[15] D. Cai, L. Zhao, J. Zhang, L. Sheng, and D. Xu, “3djcg: A unified framework for joint dense captioning and visual grounding on 3d point clouds,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 464–16 473.
[16] D. Z. Chen, Q. Wu, M. Nießner, and A. X. Chang, “D3net: A unified speaker-listener architecture for 3d dense captioning and visual grounding,” in European Conference on Computer Vision. Springer, 2022, pp. 487–505.
[17] D. He, Y. Zhao, J. Luo, T. Hui, S. Huang, A. Zhang, and S. Liu, “Transrefer3d: Entity-and-relation aware transformer for fine-grained 3d visual grounding,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 2344–2352.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[19] Z. Guo, Y. Tang, R. Zhang, D. Wang, Z. Wang, B. Zhao, and X. Li, “Viewrefer: Grasp the multi-view knowledge for 3d visual grounding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 372–15 383.
[20] A. Jain, N. Gkanatsios, I. Mediratta, and K. Fragkiadaki, “Bottom up top down detection transformers for language grounding in images and point clouds,” in European Conference on Computer Vision. Springer, 2022, pp. 417–433.
[21] Y. Wu, X. Cheng, R. Zhang, Z. Cheng, and J. Zhang, “Eda: Explicit text-decoupling and dense alignment for 3d visual grounding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19 231–19 242.
[22] J. Luo, J. Fu, X. Kong, C. Gao, H. Ren, H. Shen, H. Xia, and S. Liu, “3d-sps: Single-stage 3d visual grounding via referred point progressive selection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 454–16 463.
[23] S. Brody, U. Alon, and E. Yahav, “How attentive are graph attention networks?” arXiv preprint arXiv:2105.14491, 2021.
[24] L. Li, Z. Gan, Y. Cheng, and J. Liu, “Relation-aware graph attention network for visual question answering,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 10 313–10 322.
[25] X. Zhu, Z. Mao, Z. Chen, Y. Li, Z. Wang, and B. Wang, “Object-difference drived graph convolutional networks for visual question answering,” Multimedia Tools and Applications, vol. 80, pp. 16 247–16 265, 2021.
[26] S. Long, S. C. Han, X. Wan, and J. Poon, “Gradual: Graph-based dual-modal representation for image-text matching,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 3459–3468.
[27] X. Liu, Y. He, Y.-M. Cheung, X. Xu, and N. Wang, “Learning relationship-enhanced semantic graph for fine-grained image–text matching,” IEEE Transactions on Cybernetics, 2022.
[28] Y. Jing, W. Wang, L. Wang, and T. Tan, “Learning aligned image-text representations using graph attentive relational network,” IEEE Transactions on Image Processing, vol. 30, pp. 1840–1852, 2021.
[29] M. Feng, Z. Li, Q. Li, L. Zhang, X. Zhang, G. Zhu, H. Zhang, Y. Wang, and A. Mian, “Free-form description guided 3d visual graph network for object grounding in point cloud,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3722–3731.
[30] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5828–5839.
[31] L. Jiang, H. Zhao, S. Shi, S. Liu, C.-W. Fu, and J. Jia, “Pointgroup: Dual-set point grouping for 3d instance segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and Pattern recognition, 2020, pp. 4867–4876.
[32] Z. Yuan, X. Yan, Y. Liao, R. Zhang, S. Wang, Z. Li, and S. Cui, “Instancerefer: Cooperative holistic understanding for visual grounding on point clouds through instance multi-level contextual referring,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1791–1800.
[33] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” Advances in neural information processing systems, vol. 30, 2017.
[34] M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara, “Meshed-memory transformer for image captioning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 10 578–10 587.
[35] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of NAACL-HLT, vol. 1, 2019, p. 2.
[36] P.-H. Huang, H.-H. Lee, H.-T. Chen, and T.-L. Liu, “Text-guided graph neural networks for referring 3d instance segmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 2, 2021, pp. 1610–1618.
[37] L. Zhao, D. Cai, L. Sheng, and D. Xu, “3dvg-transformer: Relation modeling for visual grounding on point clouds,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2928–2937.
[38] D. Zhenyu Chen, Q. Wu, M. Nießner, and A. X. Chang, “D3net: A unified speaker-listener architecture for 3d dense captioning and visual grounding,” arXiv e-prints, pp. arXiv–2112, 2021.
[39] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[40] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “Enet: A deep neural network architecture for real-time semantic segmentation,” arXiv preprint arXiv:1606.02147, 2016.
[41] H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V. Koltun, “Point transformer,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 16 259–16 268.