1 Shanghai AI Laboratory    2 The University of Hong Kong

Empowering 3D Visual Grounding with Reasoning Capabilities

Chenming Zhu 1,2    Tai Wang 1    Wenwei Zhang 1    Kai Chen 1    Xihui Liu 2,†
Abstract

Although great progress has been made in 3D visual grounding, current models still rely on explicit textual descriptions for grounding and lack the ability to reason about human intentions from implicit instructions. We propose a new task called 3D reasoning grounding and introduce a new benchmark, ScanReason, which provides over 10K question-answer-location pairs from five reasoning types that require the synergy of reasoning and grounding. We further design our approach, ReGround3D, composed of a visual-centric reasoning module empowered by a Multi-modal Large Language Model (MLLM) and a 3D grounding module that obtains accurate object locations by looking back to the enhanced geometry and fine-grained details of the 3D scenes. A chain-of-grounding mechanism is proposed to further boost the performance with interleaved reasoning and grounding steps during inference. Extensive experiments on the proposed benchmark validate the effectiveness of our approach. Codes, datasets, and benchmarks will be available at https://zcmax.github.io/projects/ScanReason.

Keywords: 3D reasoning grounding, 3D visual grounding, multi-modal large language models

1 Introduction

Figure 1: An embodied agent needs not only to understand the 3D environment and complex human instructions, but also to localize the target objects for interaction and navigation. Although GPT-4 and GPT-4V have strong text and multi-modal reasoning abilities, respectively, they lack the ability to directly perceive the 3D scene, understand 3D spatial relationships, and output the corresponding target object locations. In contrast, our proposed method ReGround3D has 3D perception, reasoning, and grounding capabilities in real-world 3D environments.

Understanding and reasoning in the 3D visual world is critical for applications such as robotics and augmented reality, where embodied agents are expected to understand the 3D layout and predict the 3D locations of objects based on human instructions. The example in Fig. 1 demonstrates a scenario where the question can only be solved with a comprehensive understanding of the 3D scene and joint reasoning over the question and the 3D environment. However, current 3D visual grounding models [23, 38, 49, 42] trained on [6, 1] localize objects based on explicit descriptions of the object category, attribute, and 3D spatial relationships, and lack the ability to reason about user intentions and predict object locations from implicit human instructions such as “I am thirsty, can I have something to drink?”.

To bridge the aforementioned gap and to push the boundaries of what embodied agents can understand and how they can interact with the 3D world, we propose a new task of 3D reasoning grounding and introduce a new benchmark named ScanReason. The task requires the model to conduct joint reasoning on the question and the 3D environment before predicting the 3D locations of target objects. We define five categories of 3D reasoning: spatial reasoning, function reasoning, logistic reasoning, emotional reasoning, and safety reasoning. The first two categories focus on the fundamental understanding of the 3D physical world, while the last three categories are built upon fundamental abilities to address user-centric real-world challenges. The benchmark comprises more than 10K question-answer-3D bounding box pairs from 2K scenes belonging to the five reasoning types mentioned above. The GPT-4-assisted data annotation process largely increases the efficiency of curating such a dataset.

We propose ReGround3D as an initial attempt at the new task of 3D reasoning grounding. Intuitively, for 3D grounding with implicit instructions, we first need to reason over the language instructions and the coarse visual environment. Then, with an idea of which object we want to find, we look back to the 3D scene and ground the target object. For complex instructions, we may need to alternate between reasoning and looking back for multiple iterations. Inspired by this intuition, our framework is composed of a visual-centric reasoning module and a 3D grounding module with geometry-enhanced look-back, together with a Chain-of-Grounding mechanism that alternates reasoning and grounding for multiple rounds during inference. Specifically, the visual-centric reasoning module conducts joint reasoning over the 3D scene and the instructions with an MLLM. This module predicts a special token representing the semantic and location information of the target object, which is used by the grounding module. The 3D grounding module uses the output token embedding from the reasoning module to locate the target object by looking back to the fine-grained 3D scene representation. Unlike previous MLLMs that attempt to directly predict bounding box coordinates, our look-back mechanism enables the model to capture more comprehensive 3D geometry and fine-grained object details for accurate 3D grounding. The Chain-of-Grounding mechanism synergizes reasoning and grounding by allowing multiple rounds that alternate between the two during inference.

In summary, our contributions are threefold: 1) We propose a new task of 3D reasoning grounding, which requires the model to synergize reasoning and grounding. We further introduce a new benchmark, ScanReason, comprising five reasoning types (spatial reasoning, function reasoning, logistic reasoning, emotional reasoning, and safety reasoning) for the task of 3D reasoning grounding in diverse scenes. 2) We design a new framework, ReGround3D, with a visual-centric reasoning module and a 3D grounding module with geometry-enhanced look-back. We further introduce a Chain-of-Grounding mechanism to boost the 3D reasoning grounding ability with a chain of interleaved reasoning and grounding steps. 3) Extensive experiments demonstrate the effectiveness of ReGround3D on the ScanReason benchmark for 3D reasoning grounding.

2 Related Work

2.0.1 3D Vision and Language Learning

3D Vision-language learning (3D-VL) is garnering increasing attention, with many 3D-VL tasks focusing on how to connect the 3D world with natural language. Among them, 3D Question Answering (3D QA) [2, 44] aims to enable models to provide text answers based on natural language questions. Situated Question Answering in 3D Scenes (SQA3D) [28] requires an agent to first understand its location based on a text description, then provide a reasonable answer based on the surrounding environment, which can be seen as an extension of 3D QA in the embodied AI area. 3D Visual Grounding [1, 6, 27, 38, 3, 30] demands that models identify and locate target objects in a 3D scene based on given descriptions, outputting the objects’ coordinates and 3D bounding boxes. These descriptions usually rely explicitly on the objects’ attributes and their spatial relationships. 3D Dense Captioning [13, 15, 48, 45, 35, 24, 7, 4] requires models to output a series of object coordinates and corresponding scene-based descriptions for a given scene. Different from these 3D-VL tasks, the questions in 3D reasoning grounding can be more implicit and complex.

2.0.2 3D Visual Grounding

The task of 3D visual grounding aims at localizing the objects that are explicitly referred to by free-form language expressions in the 3D scene. Inspired by the success of transformers in natural language processing, recent 3D visual grounding approaches [38, 23, 49, 52, 11, 16] have started to adopt transformer [34] architectures for handling the complex relationships between language descriptions and 3D visual data. These methods leverage the self-attention mechanism of transformers to dynamically weigh the importance of different parts of the input data, facilitating more effective grounding of textual descriptions in 3D environments. Compared with 3D visual grounding, our proposed 3D reasoning grounding requires the model to reason about the complex question, ground the target objects, and give an explanation at the same time.

2.0.3 Multi-modal Large Language Models

Recently, there has been an increasing effort to extend the powerful complex reasoning and world knowledge capabilities of LLMs [17, 33, 46] to other modalities [5, 50, 25, 18, 51, 8, 9, 43]. Among these works, some have aimed to enable LLMs to understand the 3D world. [39, 19] focus on delving into LLMs’ ability to comprehend 3D objects, which can not be directly applied to 3D scenes. 3D-LLM [20] is the pioneering work that incorporates the 3D scene into LLM to carry out general 3D understanding tasks. However, by using 3D features constructed through projecting the 2D features of multi-view 2D images extracted by pre-trained 2D Vision-Language Models into 3D space, 3D-LLM struggles to directly capture the complex spatial relationships between objects and the structure of 3D scenes. [12, 26] directly extract 3D features from the reconstructed 3D point cloud and support multi-modal visual prompts (text, images, 3D objects) in an MLLM. To alleviate the difficulty LLMs face in understanding complex 3D scenes, [21] choose to first explicitly segment the objects in the 3D scene and then perform multi-stage object-aware scene-text alignment to achieve 3D scene understanding. However, due to the lack of large-scale 3D-language alignment data and the intricate content of 3D scenes, although current MLLMs can achieve favorable performance in 3D scene understanding tasks, their localization performance is still significantly behind the 3D localization specialists. Our approach seeks to address this issue by introducing a 3D grounding model to enhance the localization capability of MLLMs.

3 ScanReason Benchmark

3.1 3D Reasoning Grounding Task

Given a 3D scene and an implicit free-form query, 3D reasoning grounding requires the model to predict the 3D bounding boxes of the target objects as well as the textual answers and explanations. As shown in Fig. 2, different from traditional 3D visual grounding, the queries of 3D reasoning grounding are implicit and complex, requiring strong reasoning, commonsense, and world knowledge. The number of target objects in 3D reasoning grounding is flexible, and any object in the 3D scene that satisfies the query requirements should be considered as the target object.

Figure 2: The left side shows an overview of our ScanReason dataset. For each reasoning category, we design different prompts to generate the corresponding questions. The right side shows the differences between the traditional 3D visual grounding task and our proposed 3D reasoning grounding task.

3.2 Question Types

To comprehensively evaluate the 3D reasoning grounding abilities, we define 5 types of questions depending on which type of reasoning is required. Spatial reasoning and function reasoning require fundamental understanding of the 3D physical world, while logistic reasoning, emotional reasoning, and safety reasoning are high-level reasoning skills built upon the two fundamental reasoning abilities to address user-centric real-world applications, as shown in Fig. 2.

3.2.1 Spatial Reasoning

measures models’ understanding of 3D spatial relationships among objects in a 3D scene. It encompasses the ability to comprehend the layout and structure of the 3D scene and the 3D locations of objects within it, which serves as the foundation for navigating or planning movements in the 3D environment.

3.2.2 Function Reasoning

involves understanding and inferring the purpose, function, or affordance of objects within the 3D scene. For example, function reasoning allows an embodied agent to recognize that a chair is for sitting, a lamp is for lighting, and a refrigerator is for storing food at low temperatures. Such understanding enables the embodied agent to assist users and to perform complex tasks more effectively (e.g., turning on a lamp when the room gets dark, or navigating to a refrigerator to fetch a drink).

3.2.3 Logistic Reasoning

allows an embodied agent to not only understand its environment but also to interact with it in a goal-directed manner. For example, given the question shown in Fig. 2: “If I’m cooking dinner in the kitchen, where is the nearest place for me to throw the rubbish?”, an agent needs to use such reasoning ability to infer the location of objects (in this question, a rubbish bin) based on their function and spatial relationships under the specific setting (the kitchen).

3.2.4 Emotional Reasoning

plays a critical role in human-robot interaction, where the target objects are determined by understanding human emotions, preferences, and behavioral patterns. This ability makes embodied agents more attuned to the emotional and psychological states of humans, allowing them to provide more personalized, empathetic, and contextually appropriate responses and solutions, such as for the question “I’m very sad, is there something can make me feel happy?” shown in Fig. 2.

3.2.5 Safety Reasoning

focuses on preventing harm and ensuring the well-being of humans and other creatures in the 3D environment. It requires the embodied agent to identify and assess the risk and make safety-aware decisions, such as: “Where should first aid kits be placed for easy access but out of reach of children?”.

3.3 Automatic Data Annotation with GPT-4

We leverage the 3D scenes and bounding box annotations from the EmbodiedScan dataset [36] and apply GPT-4 [29] to automatically generate question-answer-location pairs for the five question types respectively. Specifically, we provide GPT-4 with the categories and bounding box locations of all objects in the scene, and ask GPT-4 to generate questions and answers with the target object ids of the provided objects. The details can be found in the Appendix.
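To make the annotation input more concrete, the following is a minimal sketch of how object categories and boxes might be serialized into a text prompt. The function name `build_annotation_prompt`, the box layout (center x, y, z followed by size dx, dy, dz), and the prompt wording are all assumptions made for illustration; the actual prompts are described in the Appendix and are not reproduced here.

```python
from typing import Dict, List


def build_annotation_prompt(objects: List[Dict], question_type: str) -> str:
    """Serialize scene objects into a text prompt (hypothetical format)."""
    lines = []
    for obj in objects:
        # Assumed box layout: center (x, y, z) followed by size (dx, dy, dz).
        box = ", ".join(f"{v:.2f}" for v in obj["box"])
        lines.append(f"id={obj['id']} category={obj['category']} box=[{box}]")
    # Hypothetical instruction text; not the wording used by the authors.
    header = (
        f"Generate a {question_type} reasoning question, its answer, "
        "and the ids of the target objects for the scene below."
    )
    return header + "\n" + "\n".join(lines)


scene_objects = [
    {"id": 0, "category": "refrigerator", "box": [1.0, 2.0, 0.9, 0.7, 0.7, 1.8]},
    {"id": 1, "category": "chair", "box": [3.2, 1.1, 0.45, 0.5, 0.5, 0.9]},
]
prompt = build_annotation_prompt(scene_objects, "function")
print(prompt)
```

Returning object ids rather than free-form coordinates lets the generated answers be mapped back to the annotated bounding boxes of the provided objects.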

In total, our ScanReason dataset consists of 12929 complex reasoning question-answer-location pairs from 1456 scenes, which are split into 11455 training pairs and 1474 validation pairs. All 1474 validation question-answer pairs have been manually verified, including 342 spatial reasoning questions, 287 function reasoning questions, 581 logistic reasoning questions, 211 emotional reasoning questions, and 53 safety reasoning questions. We provide detailed statistics and more examples of our dataset in the Appendix.

3.4 Evaluation Metric

To evaluate the accuracy of predicted objects and their locations for a flexible number of ground-truth objects, we follow the evaluation metric of 3D object detection tasks. Specifically, we adopt Acc@$k$IoU as our metric, where $k$ is the threshold for the Intersection over Union (IoU) between positive predictions and ground-truths. We evaluate the performance under $k=0.25$ in our experiments.
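As an illustration of an IoU-thresholded accuracy, the sketch below computes the 3D IoU of axis-aligned boxes and the fraction of ground-truth boxes matched by at least one prediction. Both the axis-aligned box format (xmin, ymin, zmin, xmax, ymax, zmax) and the matching rule are simplifying assumptions; the benchmark's exact protocol (for example, oriented boxes or penalties for extra predictions) may differ.

```python
import numpy as np


def box_iou_3d(a, b):
    """IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    lo = np.maximum(a[:3], b[:3])          # lower corner of the overlap
    hi = np.minimum(a[3:], b[3:])          # upper corner of the overlap
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return inter / (vol_a + vol_b - inter)


def acc_at_iou(pred_boxes, gt_boxes, k=0.25):
    """Fraction of ground-truth boxes hit by a prediction with IoU >= k.

    Assumed matching rule for illustration only.
    """
    if len(gt_boxes) == 0:
        return 0.0
    hits = sum(
        any(box_iou_3d(p, g) >= k for p in pred_boxes) for g in gt_boxes
    )
    return hits / len(gt_boxes)


gt = [(0, 0, 0, 1, 1, 1), (2, 2, 2, 3, 3, 3)]
pred = [(0, 0, 0, 1, 1, 0.5)]  # covers the lower half of the first box
iou = box_iou_3d(pred[0], gt[0])
acc = acc_at_iou(pred, gt, k=0.25)
```

In this toy case the single prediction overlaps only the first ground-truth box, so one of the two targets is counted as found.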

4 Method

Solving the task of 3D reasoning grounding requires synergy among the perception, reasoning, and grounding capabilities of the embodied agent. Intuitively, we can first conduct reasoning based on the implicit instruction, such as “where should first aid kits be placed?”, and the visual environment. The reasoning process provides us with information about the rough location and semantics of the object we are looking for. Then, keeping that information in mind, we look back to the 3D environment to precisely locate the object. For complex scenarios, alternating between reasoning and looking back is required for multiple rounds until we obtain the final answer.

Inspired by this intuition, we propose ReGround3D, consisting of a visual-centric reasoning module and a 3D grounding module with geometry-enhanced look-back, as illustrated in Fig. 3. The visual-centric reasoning module (Sec. 4.1) performs joint reasoning over the language instruction and the visual scene, and predicts a special <LOC> token representing the grounding information. The 3D grounding module (Sec. 4.2) looks back to the original 3D scene with comprehensive geometry information and fine-grained details. It takes the hidden embedding of the <LOC> token containing grounding-related information from the 3D features, and eventually predicts the 3D locations of the target objects. Furthermore, we propose a Chain-of-Grounding (CoG) mechanism (Sec. 4.3), i.e., a chain of interleaved reasoning and grounding steps, to further synergize the grounding and reasoning capabilities for the 3D reasoning grounding task, as illustrated in Fig. 4.

4.1 Visual-Centric Reasoning

Due to the complexity of the 3D scene and user instructions, particularly given the implicit intention behind the human instruction, ReGround3D starts with a visual-centric reasoning module that can perceive the scene, comprehend the human instructions, and conduct joint reasoning of 3D scene and instructions. We believe the reasoning process eventually implies the grounding intention, i.e., implicitly encodes the information indicating the target object to solve the task. Thus, we design the visual-centric reasoning module to predict grounding queries for localizing the target objects in the following 3D grounding module.

Specifically, for simplicity, we leverage 3D-LLM to serve as the visual-centric reasoning module because of its strong reasoning abilities inherited from the LLM. Based on the BLIP2 architecture [25], 3D-LLM uses pre-trained image encoders to extract multi-view 2D image features and unprojects them into 3D spaces. The visual features are encoded by the Q-Former to produce 32 tokens as the visual input to the LLM. By leveraging the pretrained image encoders and the Q-Former, the visual tokens encode rich semantics but lack the 3D structures, spatial interactions, and fine-grained details. Therefore, instead of directly predicting the object locations by the 3D-LLM, we ask the 3D-LLM to predict the feature representation as the output of the reasoning process, and the predicted feature is further used to ground the target object in the grounding module.

To enable the prediction of the grounding feature, we expand the original vocabulary of the 3D-LLM with a special <LOC> token. The <LOC> token is laden with the contextual scene and the target object information which can guide the 3D grounding module to accurately localize target objects.

Figure 3: The pipeline of ReGround3D. Given the 3D scene and human instruction, the visual-centric reasoning module first performs joint reasoning over the 3D scene and the instruction, and then guides the 3D grounding module to look back at the 3D scene and localize the target objects.

4.2 3D Grounding with Geometry-Enhanced Look-Back

After obtaining the <LOC> token, ReGround3D extracts the last-layer embedding $h_{loc}$ of the <LOC> token and sends it into the 3D grounding module to predict the 3D bounding boxes. The 3D grounding module is devised with a “look-back” mechanism which allows the model to access the 3D geometry and fine-grained details from a 3D point cloud encoder. The fine-grained geometry-enhanced 3D visual features and $h_{loc}$ are sent into a query selection module to retrieve the most relevant object features. Those features are further decoded into 3D bounding boxes with the 3D box decoder.

4.2.1 3D Visual Encoder

Unlike the 2D image encoder used for 3D-LLM, the 3D visual encoder directly extracts features from 3D point clouds to capture more geometric and spatial information about the 3D structures and fine-grained details. A powerful 3D visual encoder that captures comprehensive geometry, structure, layout, and fine-grained information is critical to accurate 3D grounding. Subsequently, the 3D features $f_{scene}$ produced by the 3D visual encoder and the grounding feature $h_{loc}$ from 3D-LLM are sent to the query selection module.

4.2.2 Query Selection Module

We adopt a cross-attention mechanism, where we treat $f_{scene}$ as $Q$ (Query), and $h_{loc}$ as both $K$ (Key) and $V$ (Value), to implicitly obtain a feature-level reasoning activation heatmap. During this scene look-back process, this module roughly locates scene features that have a high response to the <LOC> token. We then select the $k$ most relevant features as the object query $f_{query}$.
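The selection step can be sketched as follows, under several assumptions that are not taken from the paper: the projection matrices `W_q` and `W_k` are random stand-ins for learned weights, the feature dimensions are arbitrary, and relevance is taken to be the scaled dot-product score between each projected scene feature and the projected <LOC> embedding. With a single key token, a softmax over keys would be identically one, so this sketch ranks scene features by the raw score instead.

```python
import numpy as np


def select_queries(f_scene, h_loc, W_q, W_k, k):
    """Score scene features against the <LOC> embedding and keep the top k."""
    d = W_q.shape[1]
    q = f_scene @ W_q                # (N, d): scene features act as queries
    key = h_loc @ W_k                # (d,):   <LOC> embedding acts as the key
    scores = q @ key / np.sqrt(d)    # (N,):   response of each scene feature
    top_idx = np.argsort(-scores)[:k]  # indices of the k highest responses
    return f_scene[top_idx], top_idx, scores


rng = np.random.default_rng(0)
N, c_scene, c_loc, d = 1024, 64, 32, 16      # hypothetical sizes
f_scene = rng.normal(size=(N, c_scene))      # stand-in for encoder features
h_loc = rng.normal(size=(c_loc,))            # stand-in for the <LOC> embedding
W_q = rng.normal(size=(c_scene, d))          # stand-in for learned projection
W_k = rng.normal(size=(c_loc, d))            # stand-in for learned projection

f_query, idx, scores = select_queries(f_scene, h_loc, W_q, W_k, k=256)
```

The per-feature `scores` array plays the role of the activation heatmap over the scene, and `f_query` holds the 256 selected features that would be passed on as object queries.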

4.2.3 3D Box Decoder

is a classical transformer decoder consisting of $M$ transformer decoder layers. In each decoder layer, the object queries $f_{query}$ go through a cross-attention layer over the text feature $h_{loc}$ and a cross-attention layer over the scene feature $f_{scene}$. Finally, the prediction head takes the updated object queries as input and predicts the final 3D locations and matching scores.

4.2.4 Discussion

In comparison to previous works [10, 14, 20, 12] that directly predict the bounding boxes with the LLM, the extra grounding module has the following advantages: 1) Based on the MLLM reasoning results, it allows the grounding module to perceive the scene again and focus on the relevant region under the implicit guidance of the MLLM, adapted to the user queries and the reasoning results. 2) The scene representation perceived in the grounding module can be more precise and fine-grained, which is complementary to the visual-centric reasoning module. 3) The two-step reasoning-grounding pipeline is flexible and can be generalized to other prediction formats such as segmentation masks (by simply replacing the 3D box decoder with a 3D mask decoder).

Figure 4: Illustration of Chain-of-Grounding (CoG) Mechanism

4.3 Chain-of-Grounding Mechanism

The design above conducts the reasoning and grounding processes sequentially, i.e., the reasoning process is finished before grounding. We argue that the grounding results can also facilitate the reasoning process, especially for questions requiring spatial information. Thus, to further synergize the reasoning and grounding processes, inspired by chain-of-thought (CoT) [37], we propose Chain-of-Grounding (CoG), which introduces a chain of interleaved reasoning and grounding steps to find the target objects during inference, as shown in Fig. 4. Such a process allows the model to actively find relevant objects that help solve the problem, and then conduct reasoning with the additional information of these relevant objects so that the model can find the target objects more precisely.

Specifically, given a question provided by the user, CoG translates it into another question that asks for the objects explicitly mentioned in the original question. The generated question is sent into ReGround3D to ground those objects in the 3D scene with corresponding confidence scores. An object is regarded as successfully located when its confidence score is above a threshold, and the located object information serves as explicit guidance for 3D-LLM in the next reasoning stage. As shown in Fig. 4, after obtaining the 3D information of the objects explicitly mentioned in the original question, the located object information is inserted to update the question. The updated question is then sent to ReGround3D to perform reasoning and grounding and output the target object locations.
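The inference procedure above can be sketched as a short control loop. Everything model-specific is abstracted behind two callables, `rewrite` and `ground`, which are hypothetical stand-ins for the question translation step and for ReGround3D itself; the threshold value of 0.5 and the way located objects are appended to the question are also assumptions, since the text does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Grounded:
    name: str
    box: Tuple[float, ...]
    score: float


def chain_of_grounding(
    question: str,
    rewrite: Callable[[str], str],
    ground: Callable[[str], List[Grounded]],
    threshold: float = 0.5,  # hypothetical value
) -> Tuple[str, List[Grounded]]:
    # Step 1: ask for the objects explicitly mentioned in the question.
    anchor_question = rewrite(question)
    # Step 2: ground them and keep only confidently located objects.
    anchors = [g for g in ground(anchor_question) if g.score > threshold]
    # Step 3: insert the located object information into the question.
    if anchors:
        hints = "; ".join(f"{g.name} at {g.box}" for g in anchors)
        question = f"{question} (Located objects: {hints})"
    # Step 4: reason and ground again with the updated question.
    return question, ground(question)


# Toy stand-ins so the sketch runs end to end.
def toy_rewrite(q: str) -> str:
    return "Where is the kitchen counter?"


def toy_ground(q: str) -> List[Grounded]:
    if q.startswith("Where is the kitchen counter"):
        return [
            Grounded("kitchen counter", (1.0, 2.0, 0.9), 0.9),
            Grounded("table", (4.0, 1.0, 0.7), 0.2),
        ]
    return [Grounded("rubbish bin", (1.3, 2.4, 0.3), 0.8)]


updated, targets = chain_of_grounding(
    "If I'm cooking dinner in the kitchen, where is the nearest place "
    "for me to throw the rubbish?",
    toy_rewrite,
    toy_ground,
)
```

In the toy run, the low-confidence "table" is filtered out, and only the confidently located anchor is inserted into the updated question before the final grounding call.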

4.4 Instruction Tuning

4.4.1 Training Objective

We use the pretrained weights of 3D-LLM as the initialization for the visual-centric reasoning module. Except for the 3D visual encoder pretrained on [36], which is frozen, the remaining parameters of the visual-centric reasoning module and the 3D grounding module are trained in an end-to-end manner. The training supervision is a weighted sum of the next-token prediction loss from 3D-LLM and the 3D detection loss from the 3D grounding module.

$$\mathcal{L} = \lambda_{text}\mathcal{L}_{text} + \lambda_{det}\mathcal{L}_{det} \tag{1}$$

The 3D detection loss is defined as:

$$\mathcal{L}_{det} = \lambda_{L1}\mathcal{L}_{L1} + \lambda_{IOU}\mathcal{L}_{IOU} + \lambda_{contrast}\mathcal{L}_{contrast} \tag{2}$$

4.4.2 Instruction Tuning Dataset

We load the pretrained weights of 3D-LLM and the 3D visual encoder, and fine-tune the LoRA parameters of 3D-LLM, the query selection module, and the 3D box decoder on an instruction tuning dataset. To construct the instruction tuning dataset, we reformulate the data annotations from existing 3D datasets into question-answer or question-answer-bounding-box pairs. Specifically, the 3D visual grounding data from ScanRefer [6], SR3D, NR3D [1] and the 3D object detection data from EmbodiedScan [36] are formulated into question-answer-bbox pairs, and the 3D/spatial question answering data from SR3D [1], CLEVR3D [40], SQA3D [28] are formulated into question-answer pairs without bounding boxes. Tab. 1 illustrates how we unify the instructions and outputs with task-specific templates. More details can be found in the Appendix. The reformulated data, combined with our proposed ScanReason dataset, serve as the instruction tuning dataset.

Table 1: We list part of data templates used to train ReGround3D for each task.
| Task Name | Text Instructions | Output Type Templates | Expected Output |
| --- | --- | --- | --- |
| 3D Visual Grounding | <scene> Here is a description about an object: "<expr>", where is the object in the 3D scene? | Please answer the question only with output 3D box prediction(s). | It is <LOC>. |
| 3D/Spatial Question Answering | <scene> Answer the question: “<question>” | Please answer the question only with text, do not output 3D box prediction(s). | <answer>. |
| | <scene> <situation>, <question> | Please answer the question only with text, do not output 3D box prediction(s). | <answer>. |
| | <scene><question> | Please answer the question only with text, do not output 3D box prediction(s). | <answer>. |
| 3D Object Detection | <scene> Where is the <category> in this 3D scene? | Please answer the question only with output 3D box prediction(s). | Sure, <LOC>. |
| 3D Reasoning Grounding | <scene> Answer the question: “<question>” | Please answer the question with text and output 3D box prediction(s). | Sure, <LOC>, <reason> |
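The templates in Tab. 1 can be applied with plain string formatting, as in the minimal sketch below. The dictionary keys, the function name `format_sample`, and the use of straight quotes with a closing quote around the question are assumptions for illustration; only one of the three question answering variants is included.

```python
# Instruction and expected-output templates adapted from Tab. 1.
BOX_ONLY = "Please answer the question only with output 3D box prediction(s)."
TEXT_ONLY = (
    "Please answer the question only with text, "
    "do not output 3D box prediction(s)."
)
TEXT_AND_BOX = "Please answer the question with text and output 3D box prediction(s)."

TEMPLATES = {
    "visual_grounding": (
        '<scene> Here is a description about an object: "{expr}", '
        "where is the object in the 3D scene? " + BOX_ONLY,
        "It is <LOC>.",
    ),
    "question_answering": (
        '<scene> Answer the question: "{question}" ' + TEXT_ONLY,
        "{answer}.",
    ),
    "object_detection": (
        "<scene> Where is the {category} in this 3D scene? " + BOX_ONLY,
        "Sure, <LOC>.",
    ),
    "reasoning_grounding": (
        '<scene> Answer the question: "{question}" ' + TEXT_AND_BOX,
        "Sure, <LOC>, {reason}",
    ),
}


def format_sample(task: str, **fields: str):
    """Fill the instruction and expected-output templates for one sample."""
    instruction, output = TEMPLATES[task]
    # str.format ignores keyword arguments that a template does not use.
    return instruction.format(**fields), output.format(**fields)


instruction, output = format_sample("object_detection", category="chair")
r_instruction, r_output = format_sample(
    "reasoning_grounding",
    question="I am thirsty, can I have something to drink?",
    reason="the refrigerator may contain drinks.",
)
```

The output-type sentence appended to each instruction tells the model whether a <LOC> token, text, or both are expected, which is what lets a single model be trained on all four tasks.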

5 Experiment

5.1 Implementation Details

5.1.1 Network Architecture

For the 3D grounding module, we adopt the pre-trained multi-modal 3D encoder in EmbodiedScan as the 3D visual encoder. During the training stage, we use LoRA to efficiently fine-tune the 3D-LLM, which preserves its original 3D scene understanding capability and reduces the computation cost. The number of object queries $k$ in the query selection module is set to 256.

5.1.2 Training Parameters

The training is done on 8 NVIDIA A100 GPUs. We adopt the AdamW optimizer with a learning rate of 1e-4 and use the WarmupDecayLR learning rate scheduler with 100 warmup steps. The total batch size is set to 16. The loss weights $\lambda_{text}$ and $\lambda_{det}$ in the total loss $\mathcal{L}$ are set to 1.0 and 1.0, respectively, and the weights $\lambda_{L1}$, $\lambda_{IOU}$ and $\lambda_{contrast}$ in $\mathcal{L}_{det}$ are set to 1.0, 2.0 and 1.0, respectively.
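As a small worked example of how these weights combine, the sketch below evaluates the weighted sums of Eq. (1) and Eq. (2) with the weights stated above. The individual loss values are made-up numbers chosen for easy arithmetic, and the function names are hypothetical.

```python
def detection_loss(l1, iou, contrast, w_l1=1.0, w_iou=2.0, w_contrast=1.0):
    """Weighted sum of the three detection terms, as in Eq. (2)."""
    return w_l1 * l1 + w_iou * iou + w_contrast * contrast


def total_loss(text, det, w_text=1.0, w_det=1.0):
    """Weighted sum of the text and detection losses, as in Eq. (1)."""
    return w_text * text + w_det * det


# Made-up loss values for illustration only.
det = detection_loss(l1=0.5, iou=0.25, contrast=0.75)  # 0.5 + 2*0.25 + 0.75
total = total_loss(text=2.0, det=det)                  # 2.0 + 1.75
print(det, total)
```

With these weights the IoU term counts twice as much as the L1 and contrastive terms inside the detection loss, while the text and detection losses contribute equally to the total.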

5.2 Results Comparison

5.2.1 Evaluation on 3D Visual Grounding

To verify the grounding ability of our model and facilitate comparison with current models, we report the explicit grounding performance on the existing 3D visual grounding task. Since the evaluation settings of Nr3D and Sr3D [1] are based on ground-truth object proposals, while ScanRefer [6] requires models to output 3D bounding boxes, we choose ScanRefer as our benchmark for comparison. We divide the existing methods into two categories: grounding models designed specifically for the 3D visual grounding task, and generalist MLLMs that can handle a variety of 3D vision-language tasks. The original 3D-LLM embeds 3D locations in the vocabulary and represents the grounded 3D bounding boxes by a sequence of discrete location tokens. However, since the 3D-LLM model fine-tuned on ScanRefer and the related location token implementation are not accessible, we adapt 3D-LLM to directly output 3D numerical coordinates representing 3D bounding boxes by fine-tuning the pre-trained model on our reformulated 3D visual grounding data, denoted as 3D-LLM (vg). As shown in Tab. 2, current generalist MLLMs still lag behind the specialist models in terms of grounding ability. By incorporating the 3D grounding module into the MLLM, ReGround3D achieves state-of-the-art performance on the traditional 3D visual grounding task.

Table 2: Results on 3D visual grounding task among ReGround3D (ours) and existing methods.
| Type | Methods | Acc@0.25 | Acc@0.5 |
| --- | --- | --- | --- |
| Specialists | ScanRefer [6] | 37.3 | 24.3 |
| | MVT [22] | 40.8 | 33.3 |
| | 3DVG-Trans [47] | 45.9 | 34.5 |
| | ViL3DRel [11] | 47.9 | 37.7 |
| | BUTD-DETR [23] | 52.2 | 39.8 |
| | L3Det [49] | 52.8 | 40.2 |
| Generalized MLLMs | LLM-Grounder [41] | 17.1 | 5.3 |
| | 3D-LLM [20] | 30.3 | - |
| | 3D-LLM(vg) [20] | 33.1 | 28.7 |
| | Chat3D-v2 [21] | 35.9 | 30.4 |
| ours | ReGround3D | 53.1 | 41.1 |

5.2.2 Evaluation on 3D Reasoning Grounding

Existing visual grounding models rely on explicit text-object alignment in the input object expression to achieve localization, and thus fail to apply to the 3D reasoning grounding task. We compare our proposed method ReGround3D with existing MLLM methods, including 3D-LLM(vg) and Chat3D-v2 [21]. Besides, inspired by Chat-3D v2, which first segments objects and then equips them with unique object identifiers to conduct effective object grounding, we set up an LLM-based 3D reasoning grounding baseline: we first segment the objects from the scene using a 3D instance segmentor [31], then convert the segmented object information, including their categories and 3D bounding boxes, into text as the input of an LLM (InternLM2-7B [32]). In addition, to better validate the performance of our model and ensure a fair comparison, we also train our model with the ScanReason dataset removed from the training data, denoted as ReGround3D*. As shown in Tab. 3, the LLM-based reasoning method (Mask3D [31] + InternLM2-7B [32]) possesses a very strong function reasoning ability, but struggles to understand 3D spatial relationships. Additionally, irrespective of whether ScanReason is used in training, our model significantly outperforms the existing MLLMs. By synergizing the reasoning and grounding processes with the Chain-of-Grounding (CoG) mechanism, the 3D reasoning grounding performance of ReGround3D can be further improved (from 28.98 to 30.62), especially on the spatial reasoning and logistic reasoning questions. The qualitative comparison shown in Fig. 5 demonstrates the superiority of our method in reasoning about complex human instructions based on the 3D scene.

Table 3: Results (Acc@0.25) on the 3D reasoning grounding task for ReGround3D (ours) and existing methods.
Methods LLM Spatial Function Logistic Emotional Safety Overall
Mask3D [31] + InternLM2-7B [32] InternLM2-7B 10.34 36.12 9.98 8.21 8.99 14.86
3D-LLM(vg) [20] FlanT5-XL-3B 18.31 17.42 10.97 8.12 6.33 13.29
Chat3D-v2 [21] Vicuna-7B 20.21 18.39 11.32 7.98 9.88 14.98
ReGround3D* FlanT5-XL-3B 30.76 29.8 18.67 19.22 17.12 23.27
ReGround3D FlanT5-XL-3B 32.98 36.23 26.99 23.12 22.98 28.98
ReGround3D (CoG) FlanT5-XL-3B 34.71 36.79 29.11 24.03 23.21 30.62
Refer to caption
Figure 5: Visualization comparison of 3D reasoning grounding capability between ReGround3D and 3D-LLM. Our method could achieve much more accurate grounding results which satisfy the implicit question intention, and give the corresponding explanation at the same time. More illustrations are given in the Appendix.

5.3 Ablation Study

In this section, we conduct an extensive ablation study to verify the effectiveness of each component.

5.3.1 Effectiveness of 3D Grounding Module

To verify the effectiveness of the proposed 3D grounding module more comprehensively, we track the performance change step by step from 3D-LLM to ReGround3D. Apart from 3D-LLM(vg), we fine-tune the 3D-LLM model on the full reformulated existing data and on all instruction tuning data including ScanReason, denoted as 3D-LLM (full) and 3D-LLM (full+sr), respectively. All <LOC> tokens in the answers of the data used for fine-tuning 3D-LLM are converted to the numerical coordinates of the corresponding boxes. The results in Tab. 4(a) show that, with the same tuning dataset, ReGround3D far outperforms 3D-LLM (28.98 vs. 19.21) thanks to the 3D grounding module.

5.3.2 Effectiveness of Instruction Tuning Dataset

Tab. 4(b) shows the impact of different training data types on 3D visual grounding performance. 3D Object Detection (3D OD) data provides explicit 3D semantic category information and visual alignment, whereas 3D Question Answering (3D QA) data injects basic 3D scene understanding ability into the model; both benefit the model’s visual grounding ability. In addition, we find that training with the 3D reasoning grounding dataset further improves performance on 3D visual grounding.

Table 4: Ablation study on effectiveness of 3D grounding module and training data
(a) When using the same tuning dataset, ReGround3D achieves far better performance on ScanReason than 3D-LLM.
Methods Acc@0.25
3D-LLM(vg) [20] 13.29
3D-LLM(full) [20] 15.31
3D-LLM(full+sr) [20] 19.21
ReGround3D(full+sr) 28.98
(b) Ablation study on training data. We evaluate through the metric of accuracy on the val set of the ScanRefer dataset.
Dataset Acc@0.25 Acc@0.5
VG OD VQA RD
19.3 14.2
48.7 37.6
49.2 38.1
51.8 39.4
53.1 41.1

5.3.3 Discussion of CoG Mechanism

While the CoG mechanism boosts performance with interleaved reasoning and grounding steps during inference, it relies on the relevant object information explicitly mentioned in the question to help find the target objects. A natural question arises: if we feed ReGround3D the information of all objects rather than only the relevant ones during CoG, will the model find the target objects more accurately? We first use an existing 3D segmentor [31] to segment all objects in the scene, then update the original question with the 3D bounding box information of all objects following the template in Sec. 4.3, and send it to ReGround3D. However, experiments show that using all object information instead reduces the 3D reasoning grounding performance from 28.98 to 27.67. One possible reason is that the model’s attention is dispersed by too many irrelevant objects.

6 Conclusion

This paper introduces a new 3D vision-language learning task, 3D reasoning grounding, which requires the model to actively reason over complex and implicit human instructions, localize the target objects, and give a corresponding explanation. Furthermore, we propose ScanReason, a new dataset and benchmark to unlock and thoroughly evaluate the 3D reasoning grounding capability. Based on this dataset, we propose a novel approach, ReGround3D, which uses the strong reasoning capability of an MLLM to guide the 3D grounding module toward accurate object locations, together with a Chain-of-Grounding (CoG) mechanism that further boosts performance with interleaved reasoning and grounding steps during inference. We believe that our work will further the natural interaction between embodied agents and humans in open 3D environments. For the current ScanReason benchmark, we find that the questions in the three high-level 3D reasoning categories may overlap: for a given reasoning question, similar questions may appear in one or two other categories. We leave this problem as a future challenge for better evaluation of reasoning grounding ability.

References

  • [1] Achlioptas, P., Abdelreheem, A., Xia, F., Elhoseiny, M., Guibas, L.: Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16. pp. 422–440. Springer (2020)
  • [2] Azuma, D., Miyanishi, T., Kurita, S., Kawanabe, M.: Scanqa: 3d question answering for spatial scene understanding. In: proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 19129–19139 (2022)
  • [3] Baroni, M.: Linguistic generalization and compositionality in modern artificial neural networks. Philosophical Transactions of the Royal Society B 375(1791), 20190307 (2020)
  • [4] Cai, D., Zhao, L., Zhang, J., Sheng, L., Xu, D.: 3djcg: A unified framework for joint dense captioning and visual grounding on 3d point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16464–16473 (2022)
  • [5] Chen, C., Qin, R., Luo, F., Mi, X., Li, P., Sun, M., Liu, Y.: Position-enhanced visual instruction tuning for multimodal large language models. arXiv preprint arXiv:2308.13437 (2023)
  • [6] Chen, D.Z., Chang, A.X., Nießner, M.: Scanrefer: 3d object localization in rgb-d scans using natural language. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX. pp. 202–221. Springer (2020)
  • [7] Chen, D.Z., Wu, Q., Nießner, M., Chang, A.X.: D3net: A speaker-listener architecture for semi-supervised dense captioning and visual grounding in rgb-d scans. arXiv preprint arXiv:2112.01551 (2021)
  • [8] Chen, F., Han, M., Zhao, H., Zhang, Q., Shi, J., Xu, S., Xu, B.: X-LLM: Bootstrapping advanced large language models by treating multi-modalities as foreign languages. arXiv preprint arXiv:2305.04160 (2023)
  • [9] Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krishnamoorthi, R., Chandra, V., Xiong, Y., Elhoseiny, M.: Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478 (2023)
  • [10] Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., Zhao, R.: Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195 (2023)
  • [11] Chen, S., Guhur, P.L., Tapaswi, M., Schmid, C., Laptev, I.: Language conditioned spatial relation reasoning for 3d object grounding. Advances in Neural Information Processing Systems 35, 20522–20535 (2022)
  • [12] Chen, S., Chen, X., Zhang, C., Li, M., Yu, G., Fei, H., Zhu, H., Fan, J., Chen, T.: Ll3da: Visual interactive instruction tuning for omni-3d understanding, reasoning, and planning. arXiv preprint arXiv:2311.18651 (2023)
  • [13] Chen, S., Zhu, H., Chen, X., Lei, Y., Yu, G., Chen, T.: End-to-end 3d dense captioning with vote2cap-detr. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11124–11133 (2023)
  • [14] Chen, T., Saxena, S., Li, L., Fleet, D.J., Hinton, G.: Pix2seq: A language modeling framework for object detection. arXiv preprint arXiv:2109.10852 (2021)
  • [15] Chen, Z., Gholami, A., Nießner, M., Chang, A.X.: Scan2cap: Context-aware dense captioning in rgb-d scans. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3193–3203 (2021)
  • [16] Chen, Z., Hu, R., Chen, X., Nießner, M., Chang, A.X.: Unit3d: A unified transformer for 3d dense captioning and visual grounding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 18109–18119 (2023)
  • [17] Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., et al.: Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022)
  • [18] Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning (2023)
  • [19] Han, J., Zhang, R., Shao, W., Gao, P., Xu, P., Xiao, H., Zhang, K., Liu, C., Wen, S., Guo, Z., et al.: Imagebind-llm: Multi-modality instruction tuning. arXiv preprint arXiv:2309.03905 (2023)
  • [20] Hong, Y., Zhen, H., Chen, P., Zheng, S., Du, Y., Chen, Z., Gan, C.: 3d-llm: Injecting the 3d world into large language models. Advances in Neural Information Processing Systems 36 (2024)
  • [21] Huang, H., Wang, Z., Huang, R., Liu, L., Cheng, X., Zhao, Y., Jin, T., Zhao, Z.: Chat-3d v2: Bridging 3d scene and large language models with object identifiers. arXiv preprint arXiv:2312.08168 (2023)
  • [22] Huang, S., Chen, Y., Jia, J., Wang, L.: Multi-view transformer for 3d visual grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15524–15533 (2022)
  • [23] Jain, A., Gkanatsios, N., Mediratta, I., Fragkiadaki, K.: Bottom up top down detection transformers for language grounding in images and point clouds. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVI. pp. 417–433. Springer (2022)
  • [24] Jiao, Y., Chen, S., Jie, Z., Chen, J., Ma, L., Jiang, Y.G.: More: Multi-order relation mining for dense captioning in 3d scenes. arXiv preprint arXiv:2203.05203 (2022)
  • [25] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)
  • [26] Li, M., Chen, X., Zhang, C., Chen, S., Zhu, H., Yin, F., Yu, G., Chen, T.: M3dbench: Let’s instruct large models with multi-modal 3d prompts. arXiv preprint arXiv:2312.10763 (2023)
  • [27] Luo, J., Fu, J., Kong, X., Gao, C., Ren, H., Shen, H., Xia, H., Liu, S.: 3d-sps: Single-stage 3d visual grounding via referred point progressive selection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16454–16463 (2022)
  • [28] Ma, X., Yong, S., Zheng, Z., Li, Q., Liang, Y., Zhu, S.C., Huang, S.: Sqa3d: Situated question answering in 3d scenes. arXiv preprint arXiv:2210.07474 (2022)
  • [29] OpenAI: Gpt-4 technical report (2023)
  • [30] Roh, J., Desingh, K., Farhadi, A., Fox, D.: Languagerefer: Spatial-language model for 3d visual grounding. In: Conference on Robot Learning. pp. 1046–1056. PMLR (2022)
  • [31] Schult, J., Engelmann, F., Hermans, A., Litany, O., Tang, S., Leibe, B.: Mask3d: Mask transformer for 3d semantic instance segmentation. In: 2023 IEEE International Conference on Robotics and Automation (ICRA). pp. 8216–8223. IEEE (2023)
  • [32] Team, I.: Internlm: A multilingual language model with progressively enhanced capabilities (2023)
  • [33] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
  • [34] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)
  • [35] Wang, H., Zhang, C., Yu, J., Cai, W.: Spatiality-guided transformer for 3d dense captioning on point clouds. arXiv preprint arXiv:2204.10688 (2022)
  • [36] Wang, T., Mao, X., Zhu, C., Xu, R., Lyu, R., Li, P., Chen, X., Zhang, W., Chen, K., Xue, T., et al.: Embodiedscan: A holistic multi-modal 3d perception suite towards embodied ai. arXiv preprint arXiv:2312.16170 (2023)
  • [37] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
  • [38] Wu, Y., Cheng, X., Zhang, R., Cheng, Z., Zhang, J.: Eda: Explicit text-decoupling and dense alignment for 3d visual and language learning. arXiv preprint arXiv:2209.14941 (2022)
  • [39] Xu, R., Wang, X., Wang, T., Chen, Y., Pang, J., Lin, D.: Pointllm: Empowering large language models to understand point clouds. arXiv preprint arXiv:2308.16911 (2023)
  • [40] Yan, X., Yuan, Z., Du, Y., Liao, Y., Guo, Y., Li, Z., Cui, S.: Clevr3d: Compositional language and elementary visual reasoning for question answering in 3d real-world scenes. arXiv preprint arXiv:2112.11691 (2021)
  • [41] Yang, J., Chen, X., Qian, S., Madaan, N., Iyengar, M., Fouhey, D.F., Chai, J.: Llm-grounder: Open-vocabulary 3d visual grounding with large language model as an agent. arXiv preprint arXiv:2309.12311 (2023)
  • [42] Yang, Z., Zhang, S., Wang, L., Luo, J.: Sat: 2d semantics assisted training for 3d visual grounding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1856–1866 (2021)
  • [43] Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., Shi, Y., et al.: mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023)
  • [44] Ye, S., Chen, D., Han, S., Liao, J.: 3d question answering. IEEE Transactions on Visualization and Computer Graphics (2022)
  • [45] Yuan, Z., Yan, X., Liao, Y., Guo, Y., Li, G., Cui, S., Li, Z.: X-trans2cap: Cross-modal knowledge transfer using transformer for 3d dense captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8563–8573 (2022)
  • [46] Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X.V., et al.: Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022)
  • [47] Zhao, L., Cai, D., Sheng, L., Xu, D.: 3dvg-transformer: Relation modeling for visual grounding on point clouds. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2928–2937 (2021)
  • [48] Zhong, Y., Xu, L., Luo, J., Ma, L.: Contextual modeling for 3d dense captioning on point clouds. arXiv preprint arXiv:2210.03925 (2022)
  • [49] Zhu, C., Zhang, W., Wang, T., Liu, X., Chen, K.: Object2scene: Putting objects in context for open-vocabulary 3d detection. arXiv preprint arXiv:2309.09456 (2023)
  • [50] Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)
  • [51] Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)
  • [52] Zhu, Z., Ma, X., Chen, Y., Deng, Z., Huang, S., Li, Q.: 3d-vista: Pre-trained transformer for 3d vision and text alignment. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2911–2921 (2023)

The supplementary material consists of quantitative evaluations of the text responses for 3D reasoning grounding (Sec. 7), more visualizations (Sec. 8), the ScanReason annotation generation prompts (Sec. 9) and details of the instruction tuning datasets (Sec. 10).

7 More Evaluations

The expected outputs of 3D reasoning grounding questions consist of not only the 3D bounding boxes of the target objects but also a text response that either explains the choice (e.g., why these objects are chosen) or offers reasonable suggestions (e.g., how to use these objects). We therefore argue that it is also necessary to evaluate the correctness of the text response. However, due to the complexity and diversity of the answers, it is non-trivial to design an automatic evaluation method that guarantees accuracy. To ensure evaluation accuracy with limited human and time resources, we uniformly sample 100 reasoning grounding pairs from the evaluation set and test GPT-4 [29], 3D-LLM [20] and ReGround3D on them. We then manually score the 300 responses with an integer from 1 to 5, where 1 indicates an incorrect answer and 5 a correct answer. The matching score $\lambda_{i}$ represents the level of similarity between the response and the ground-truth answer. The correctness metric is defined as:

$$S=\frac{1}{N}\sum_{i}^{N}\left(\frac{\lambda_{i}-1}{4}\right)\times 100\% \qquad (3)$$
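As a worked illustration of Eq. (3), the sketch below maps each manual score from the 1 to 5 scale onto [0, 1] and averages the result as a percentage. The function name `correctness_score` and the input validation are our own additions for illustration.

```python
from typing import Sequence


def correctness_score(scores: Sequence[int]) -> float:
    """Correctness metric S of Eq. (3): mean of (lambda_i - 1) / 4, in percent."""
    if not scores:
        raise ValueError("at least one score is required")
    for s in scores:
        if not 1 <= s <= 5:
            raise ValueError(f"score {s} is outside the 1-5 range")
    # A score of 1 contributes 0, a score of 5 contributes 1.
    return 100.0 * sum((s - 1) / 4 for s in scores) / len(scores)
```

For instance, three responses scored 1, 5 and 3 contribute 0, 1 and 0.5 respectively, giving S = 50%.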

The results in Tab. 5 demonstrate that even though GPT-4 cannot access information about the 3D scene, it can “guess” the answer to complex reasoning questions based on its powerful world knowledge and common sense, and thus serves as a strong baseline for evaluation. ReGround3D achieves superior performance (38.7 vs. 32.4) with a much smaller LLM (FlanT5-XL-3B) while also being able to localize the target objects in the 3D scene.

Table 5: Matching scores of text responses on 3D reasoning grounding task among ReGround3D(ours), 3D-LLM and GPT-4.
Methods Spatial Function Logistic Emotional Safety Overall
GPT-4 [29] 23.7 48.7 29.1 19.7 26.5 32.4
3D-LLM(vg) [20] 17.2 24.4 18.4 11.1 13.8 17.2
ReGround3D 34.9 49.2 35.1 30.1 30.2 38.7

8 More Visualizations

8.1 More Examples

In this section, we illustrate more examples from our ScanReason benchmark for each type of reasoning question in Fig. 7, Fig. 8, Fig. 9, Fig. 10 and Fig. 11. Each example consists of the reasoning question, the target object locations (3D bounding boxes) and the corresponding text response.

8.2 More Qualitative Results

As illustrated by the qualitative results in the paper, our model tends to output far fewer 3D bounding boxes than the ground truth when multiple objects are regarded as target objects. Besides, as shown in Fig. 6, even with the 3D grounding module introduced to localize the target objects more accurately, ReGround3D still struggles to recognize and localize small and long-tailed objects in the 3D scene.

Refer to caption
Figure 6: This example demonstrates that our model sometimes struggles to localize the small and long-tailed objects in the 3D scene.
Refer to caption
Figure 7: Examples of spatial reasoning data. Each row represents one question-answer-location pair. The left column represents the 3D scene and target objects location, and the right column shows the question and corresponding text answer.
Refer to caption
Figure 8: Examples of function reasoning data. Each row represents one question-answer-location pair. The left column represents the 3D scene and target objects location, and the right column shows the question and corresponding text answer.
Refer to caption
Figure 9: Examples of logistic reasoning data. Each row represents one question-answer-location pair. The left column represents the 3D scene and target objects location, and the right column shows the question and corresponding text answer.
Refer to caption
Figure 10: Examples of emotional reasoning data. Each row represents one question-answer-location pair. The left column represents the 3D scene and target objects location, and the right column shows the question and corresponding text answer.
Refer to caption
Figure 11: Examples of safety reasoning data. Each row represents one question-answer-location pair. The left column represents the 3D scene and target objects location, and the right column shows the question and corresponding text answer.

9 ScanReason Annotations Generation Prompts

We show five prompt templates for generating the five types of reasoning question-answer-location pairs, each comprising system messages and manually crafted in-context examples. In our attempts, GPT-3.5 struggled to understand the 3D spatial relationships of objects from the provided 3D coordinates, so we resort to GPT-4 [29] for data generation, which we verified to be much better than GPT-3.5 at understanding spatial relationships. We input the category information and 3D bounding boxes of the objects in each 3D scene, providing the semantics and spatial locations of the scene in a textual representation. We then give GPT-4 [29] specific instructions to generate diverse data. As shown in Fig. 12, Fig. 13, Fig. 14, Fig. 15 and Fig. 16, to make the generated question-answer-location pairs more accurate and responsive, we adopt prompt engineering and give GPT-4 [29] about 3-5 few-shot examples showing what kind of data it should generate. For each few-shot sample, the “content” contains the object ids, category information and 3D bounding boxes of the objects in the scene, and the “response” contains human-written question-answer-location pairs. Finally, we include the 3D bounding boxes and category information of all objects in the scene in the “query” and ask GPT-4 [29] to give us 10 meaningful samples.

Refer to caption
Figure 12: Prompts on generating function reasoning question-answer-location pairs data.
Refer to caption
Figure 13: Prompts on generating spatial reasoning question-answer-location pairs data.
Refer to caption
Figure 14: Prompts on generating logistic reasoning question-answer-location pairs data.
Refer to caption
Figure 15: Prompts on generating emotional reasoning question-answer-location pairs data.
Refer to caption
Figure 16: Prompts on generating safety reasoning question-answer-location pairs data.

10 Details of Instruction Tuning Datasets

10.1 Data Reformulation

10.1.1 3D Object Detection data.

Generally speaking, 3D object detection datasets contain information about 3D bounding boxes of all objects in the pre-defined list of categories. In order to cover as many objects as possible, we chose to construct question and answer pairs based on the EmbodiedScan [36] dataset, which includes 160k 3D-oriented boxes spanning over 760 categories. During the model training process, we convert the annotations of 3D boxes into a specific question answering pair template: “User: <scene> Where is the <category> in this 3D Scene? Assistant: Sure, <LOC>.” Here, <category> is randomly selected from the ground-truth categories contained in the current 3D scene, <scene> is the placeholder of 3D scene tokens.
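The reformulation of detection annotations described above can be sketched as follows. The template string is the one quoted in the text; the input layout (a mapping from category to its boxes), the choice to supervise <LOC> with all boxes of the sampled category, and the name `make_od_sample` are assumptions made for illustration.

```python
import random
from typing import Dict, List, Sequence

OD_TEMPLATE = (
    "User: <scene> Where is the {category} in this 3D Scene? "
    "Assistant: Sure, <LOC>."
)


def make_od_sample(scene_boxes: Dict[str, List[Sequence[float]]],
                   rng: random.Random) -> dict:
    """Build one QA pair from the ground-truth boxes of a single scene."""
    # Randomly pick one of the ground-truth categories present in the scene.
    category = rng.choice(sorted(scene_boxes))
    return {
        "text": OD_TEMPLATE.format(category=category),
        "category": category,
        # Assumption: every box of the sampled category is a <LOC> target.
        "loc_targets": scene_boxes[category],
    }
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across training runs.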

10.1.2 3D Visual Grounding data.

3D visual grounding data aims at localizing a unique object in a 3D scene given a descriptive object expression. The descriptions in these data typically include the object attributes and their spatial relationships with other objects explicitly. To ensure diversity in the training data, we select ScanRefer [6], SR3D [1] and NR3D [1]. Given the variety of object descriptions, it is difficult to reformulate the object expression with a simple template such as “User: <scene> Where is <expr> in the 3D Scene? Assistant: It is <LOC>.” Therefore, we retain the original object description as much as possible and use the template “Here is a description about an object: “<expr>”, where is the object in the 3D Scene? Assistant: It is <LOC>.”, where <expr> represents the object description in the data. In addition, we create a range of similar question templates that are randomly selected during training.
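A minimal sketch of the random template selection described above is shown below. Only the first template comes from the text; the other two are hypothetical stand-ins for the "similar question templates", and the `User: <scene>` prefix is assumed by analogy with the other task templates.

```python
import random

VG_TEMPLATES = [
    # Template quoted in the text.
    'Here is a description about an object: "{expr}", where is the object in the 3D Scene?',
    # Hypothetical paraphrases, for illustration only.
    'Please locate the object described as "{expr}" in the 3D Scene.',
    'An object is described as "{expr}". Where is it in the 3D Scene?',
]


def make_vg_sample(expr: str, rng: random.Random) -> str:
    """Wrap an original object description in a randomly chosen question template."""
    question = rng.choice(VG_TEMPLATES).format(expr=expr)
    return f"User: <scene> {question} Assistant: It is <LOC>."
```

Keeping the original expression verbatim inside the template preserves the attribute and spatial-relation wording that the grounding datasets provide.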

10.1.3 Spatial Question Answering data

We hope the model can understand 3D positions in a more natural way, using numerical values in natural language. We use $[x,y,z,dx,dy,dz]$ to represent a 3D box, where $[x,y,z]$ is the center of the 3D area and $[dx,dy,dz]$ is the 3D box size; these coordinates can appear anywhere in the input text. Since there are no explicit coordinate question-answering pairs in existing 3D vision-language datasets, we turn to the SR3D dataset. SR3D is a template-generated dataset that provides not only expressions of target objects but also object ids for both target objects and anchor objects. Based on SR3D, we construct a 3D QA dataset focusing on 3D positional relationships between objects. For example, the query “select the trash can that is beneath the desk” from the SR3D dataset can be transformed, with the assistance of GPT-3.5, into “User: Is the trash can $[x_{1},y_{1},z_{1},dx_{1},dy_{1},dz_{1}]$ beneath the desk $[x_{2},y_{2},z_{2},dx_{2},dy_{2},dz_{2}]$? Assistant: Yes.”
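To make the target format concrete, the sketch below assembles the question from already parsed SR3D fields. Note that this is a rule-based stand-in: the paper performs the rewriting with GPT-3.5, and the function names, the two-decimal formatting, and the assumption that the target, relation and anchor are available as separate fields are ours.

```python
from typing import Sequence


def format_box(box: Sequence[float]) -> str:
    """Render a box [x, y, z, dx, dy, dz] as plain text (two decimals assumed)."""
    return "[" + ", ".join(f"{v:.2f}" for v in box) + "]"


def make_spatial_qa(target: str, target_box: Sequence[float],
                    relation: str,
                    anchor: str, anchor_box: Sequence[float]) -> str:
    """Turn an SR3D-style (target, relation, anchor) triple into a coordinate QA pair."""
    question = (
        f"Is the {target} {format_box(target_box)} "
        f"{relation} the {anchor} {format_box(anchor_box)}?"
    )
    # The relation holds by construction, since it comes from the SR3D annotation.
    return f"User: {question} Assistant: Yes."
```

Called with the trash can and desk boxes from a scene, this yields a question of the same shape as the example above, with the symbolic coordinates replaced by numbers.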

10.1.4 3D Question Answering data.

Since we expect the model to also output reasonable answers in conversation, we introduce 3D QA data during training to further enhance the model’s 3D visual question answering and scene understanding capabilities. We reformulate CLEVR3D [40] data into a simple question-answer template: “User: <scene> <question>. Assistant: <answer>.” The SQA3D [28] dataset provides not only questions but also the situation in which each question is asked. We reformulate it into the template “User: <scene> <situation> <question>. Assistant: <answer>.”, where <question> is the placeholder for the question and <situation> is the placeholder for the situation in which the question is raised.

10.1.5 3D Reasoning Grounding data.

In addition to the types of data mentioned above, we also employ our proposed 3D reasoning grounding data to train the model, further enhancing its capability to handle complex reasoning questions. The output format is akin to a combination of 3D Question Answering (QA) and localization: the model’s response not only includes the target object but also explains why the target object is selected. We adopt the template: “User: <scene> <question>. Assistant: Sure, <LOC>, <reason>.”

10.2 Output Type Templates

In the actual interaction between users and the model, questions are generally not divided according to the aforementioned tasks but are more concerned with whether the model’s output format meets the user’s needs. For example, in the 3D QA data, there exists a question: “Where is the pillow on the bed?”, with the corresponding answer being “near the headboard”. Simultaneously, such questions may also appear in our reformulated 3D Object Detection and Visual Grounding datasets, where the desired model output is the specific location coordinates of the object in the scene. To make the interaction between the model and users more natural and the outputs more in line with user needs, we accordingly employ output type templates appended to user instructions. Such instructions enable the training data to break free from the constraints of its original task and integrate more naturally according to the output type, thereby further enhancing the model’s understanding and response to complex and varied inputs in natural dialogue.
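One way to picture the output type templates is as a short suffix appended to the user instruction, as in the sketch below. The paper does not list the actual template strings, so the three suffixes, the output type names, and the function `with_output_type` are entirely hypothetical.

```python
# Hypothetical output type suffixes; the real template wording is not given in the paper.
OUTPUT_TYPE_SUFFIXES = {
    "text": "Please answer with a short text response.",
    "location": "Please answer with the 3D location of the object.",
    "location_and_text": "Please answer with the 3D location and a brief explanation.",
}


def with_output_type(instruction: str, output_type: str) -> str:
    """Append the suffix for the desired output type to a user instruction."""
    if output_type not in OUTPUT_TYPE_SUFFIXES:
        raise ValueError(f"unknown output type: {output_type}")
    return f"{instruction.rstrip()} {OUTPUT_TYPE_SUFFIXES[output_type]}"
```

Under this scheme, the pillow question from the example above could be paired with a text suffix when the expected answer is “near the headboard”, and with a location suffix when the expected answer is a box, so the same question is no longer ambiguous across datasets.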