3D-GRAND: A Million-Scale Dataset for 3D-LLMs
with Better Grounding and Less Hallucination

Jianing Yang^∗♣ Xuweiyi Chen^∗♣ Nikhil Madaan Madhavan Iyengar^♣
Shengyi Qian^♣ David F. Fouhey Joyce Chai^♣

^♣ University of Michigan New York University
https://3d-grand.github.io/ Equal contribution. Independent researcher.

Abstract

The integration of language and 3D perception is crucial for developing embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is the absence of large-scale datasets that provide dense grounding between language and 3D scenes. In this paper, we introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose a comprehensive benchmark 3D-POPE to systematically evaluate hallucination in 3D-LLMs, enabling fair comparisons among future models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the critical role of large-scale 3D-text datasets in advancing embodied AI research. Notably, our results demonstrate early signals for effective sim-to-real transfer, indicating that models trained on large synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with essential resources and insights, setting the stage for more reliable and better-grounded 3D-LLMs.

Figure 1: We introduce 3D-GRAND, a large-scale, densely grounded 3D-text dataset, and 3D-POPE, a 3D-LLM hallucination benchmark. Training on 3D-GRAND improves grounding accuracy and reduces hallucinations.

1 Introduction

Embodied Artificial Intelligence (EAI) represents a frontier in robotics and machine learning. In EAI, the integration of perception, language, and action within physical spaces is crucial for developing intelligent systems capable of meaningfully navigating and interacting with their environments. Central to this vision is the concept of grounding language in the physical world [6, 9]. Grounding connects abstract linguistic constructs to concrete objects in three-dimensional space, thereby enabling robots and intelligent agents to effectively understand and manipulate their surroundings.

Recent advancements in Large Language Models (LLMs) have greatly benefited Embodied Artificial Intelligence (EAI). LLMs demonstrate exceptional capabilities in understanding language instructions [49, 64], perceiving the environment [44, 39, 3, 81, 71], and planning detailed actions [7, 33]. The primary inputs to LLMs, other than pure language, have been the combination of language and 2D images, categorizing these models as 2D-LLMs. The significant advancements in 2D-LLMs can be largely attributed to their training on extensive vision-language datasets. These datasets [58, 82], comprising billions of image and text pairs, have been instrumental in enhancing the models’ understanding of visual content and its contextual relevance to textual information. These large datasets have provided the foundational data necessary for training models that excel at integrating visual perception with language processing. Despite some progress in equipping LLMs to understand 3D scenes (3D-LLMs) [27, 29, 66, 30, 83, 11, 53], these models remain in their early stages due to the scarcity of 3D scene and text pairs. In this work, we introduce 3D-GRAND, a pioneering million-scale dataset designed for densely-grounded 3D Instruction Tuning.

Recently, SceneVerse [35] concurrently introduced a large-scale 3D vision-language dataset. However, a significant limitation of this dataset is the absence of object grounding in language, which is crucial for enhancing model usability in robotics tasks and reducing hallucination. Research on 2D-LLMs indicates that grounding language to 2D contexts notably mitigates hallucination in language models [74, 51, 5, 38, 54, 78], thereby enhancing the reliability and interpretability of generated responses. While 2D grounding has been extensively explored, extending these principles to 3D environments is still underdeveloped. This situation raises two critical questions: (1) Is there any hallucination in 3D-LLMs and if so, how severe is it? (2) Can densely-grounded data mitigate hallucination for 3D-LLMs? These questions underscore a critical need within the research community for the development of an evaluation benchmark specifically designed for 3D-LLMs and the construction of a large-scale, 3D-grounded dataset.

To quantify hallucination in 3D-LLMs, this work introduces 3D-POPE (3D Polling-based Object Probing Evaluation). 3D-POPE provides a comprehensive and standardized protocol for evaluating hallucination that enables systematic assessment and facilitates fair comparisons across 3D-LLMs, enhancing our understanding of model capabilities in object hallucination. Specifically, we pose existence questions to 3D-LLMs and evaluate their responses, as illustrated in Fig. 1.

To evaluate the role of densely-grounded data, we introduce a pioneering million-scale dataset, 3D-GRAND, for densely grounded 3D instruction tuning. 3D-GRAND includes 40,087 household scenes paired with 6.2 million scene-language instructions, featuring dense phrase-to-object grounding. We conduct rigorous human evaluations to ensure the dataset’s quality. Results from models trained with 3D-GRAND demonstrate the dataset’s effectiveness in enhancing grounding and reducing hallucination for 3D-LLMs. Fig. 1 summarizes the effect of incorporating 3D-GRAND, and Fig. 2 introduces each category of 3D-GRAND with examples.

To sum up, our contributions include:

• 3D-GRAND, the first million-scale, densely-grounded 3D-text dataset for grounded 3D Instruction Tuning. 3D-GRAND includes 40K household scenes paired with 6.2M densely-grounded scene-language instructions.

• 3D-POPE, a suite of benchmarks and metrics that systematically evaluate hallucination, enabling fair comparisons of future 3D-LLMs in terms of object hallucination.

• Quantitative research findings regarding hallucination, grounding, and scaling that provide guidance to future research: (1) training 3D-LLMs with 3D-GRAND significantly reduces hallucinations, particularly when the data is densely grounded; (2) densely grounded instruction tuning significantly enhances the grounding capabilities of 3D-LLMs; (3) scaling densely grounded data consistently improves grounding accuracy and reduces hallucination; and (4) models can successfully transfer from sim to real, providing an early signal for a low-cost and sustainable future of scaling synthetic 3D data to help on real tasks.

Dataset	Which part is grounded?	Densely Grounded?	Language source	# 3D Scenes	# Language pairs
ReferIt3D [2]	obj-refer	✗	Human,Template	0.7K	125K
ScanRefer [10]	obj-refer	✗	Human	0.7K	51K
Scan2Cap [12]	obj-refer	✗	Human	0.7K	51K
ScanEnts3D [1]	obj-refer	✓	Human	0.7K	84K
PhraseRefer [76]	obj-refer	✓	Human	0.7K	170K
ScanQA [4]	answer	✗	Human	0.7K	41K
SQA3D [46]	question	✗	Human	0.65K	33.4K
3DVQA [22]	✗	✗	Template	0.7K	500K
CLEVR3D [70]	✗	✗	Template	8.7K	171K
3DMV-VQA [26]	✗	✗	Template	4.1K	55K
EmbodiedScan [65]	✗	✗	Template	3.4K	970K
3DMIT [42]	✗	✗	LLM	0.7K	75K
M3DBench [40]	obj-refer, question	✗	LLM	0.7K	327K
3D-DenseOG [32]	scene	✓	Human	0.7K	51K
3D-LLM [27]	obj-refer	✗	LLM	0.9K	200K
LL3DA [11]	question, answer	question	Template,LLM	0.9K	200K
Chat3D-v2 [29]	scene	✓	Human,LLM	0.7K	0.7K
3D-VisTA [83]	question	✗	Template,LLM	3K	278K
LEO [30]	question	✗	LLM	3K	579K
SceneVerse [35]	obj-refer	✗	Template,LLM	62K	2.5M
3D-GRAND	scene, obj-refer, question, answer	✓	Template,LLM	40K	6.2M

Table 1: Comparison of 3D-GRAND with existing 3D scene datasets with language annotations. 3D-GRAND is the largest language-grounded dataset.

2 Related Work

Injecting 3D into LLMs. Recent advancements in large language models (LLMs) have inspired research into extending their capabilities to 3D environments, leading to the development of 3D-LLMs [11, 53, 71, 83]. Notable works in this field include 3D-LLM [27], which integrates 3D point clouds and features into LLMs to enable tasks such as captioning, question answering, and navigation. LEO [30] excels as an embodied multi-modal generalist agent in perception, grounding, reasoning, planning, and action in 3D environments, highlighting the potential of 3D-LLMs in understanding and interacting with the physical world. The most relevant work to our model is Chat-3Dv2 [66, 29], which grounds generated scene captions to objects in 3D scenes. However, Chat-3Dv2’s dataset is limited to one type of 3D-text task (scene captioning) and only includes 705 captions from a subset of ScanNet scenes. In 3D-GRAND, we expand this concept by diversifying 3D-text tasks and increasing the dataset size to a million-scale. Our results demonstrate promising data scaling effects and sim-to-real transfer, paving the way for future large-scale training of 3D-LLMs.

Object Hallucination of VLMs. While 2D VLMs have achieved impressive performance, they are prone to hallucinating objects that do not exist in the provided images, a problem known as object hallucination [15, 56]. Several methods have been suggested to mitigate the object hallucination issue, such as integrating an external object detector [77], applying visually grounded visual instruction tuning [74, 78] or reinforcement learning [61, 24], performing iterative refinement [80], and adapting the decoding strategies [31]. To quantify this issue, several benchmarks have been proposed. CHAIR (Caption Hallucination Assessment with Image Relevance) [56] measures the frequency of hallucinated objects in image captions by comparing the objects mentioned to the ground truth annotations. POPE (Polling-based Object Probing Evaluation) [41] assesses a VLM’s ability to identify the presence or absence of objects through yes/no probing questions. However, these studies primarily focus on 2D image-text datasets like COCO [43]. In contrast, object hallucination in 3D-LLMs remains largely unexplored. Our work addresses this gap by introducing 3D-POPE, a comprehensive benchmark for evaluating object hallucination in 3D-LLMs. To the best of our knowledge, this is the first object hallucination benchmark for 3D-LLMs.

Grounding datasets. In the 2D domain, large-scale datasets with grounding information have been instrumental in advancing vision-language research. Notable examples include RefCOCO [75], which provides referring expressions for objects in COCO images [43]. Additionally, 2D LLMs [51, 54, 69, 38, 74] have been trained with densely-grounded web-crawled image-text pairs. In the 3D domain, there is a growing interest in creating datasets that pair 3D scenes with textual annotations [76, 1, 32, 12]. ScanRefer [10] pioneered this effort by contributing a dataset of ScanNet [14] scenes with referring expressions. Table 1 summarizes these community efforts. However, these datasets have limited grounding annotations and often focus on a single task, such as referring expression comprehension or visual question answering. In contrast, our proposed dataset, 3D-GRAND, stands out by providing 6.2 million densely-grounded scene-language instructions across a diverse set of 3D-text tasks and 40,087 household scenes. This enables a wide range of grounding tasks and facilitates the development of more reliable and better-grounded 3D-LLMs.

3 3D-GRAND: The 3D Ground Anything Dataset

In this section, we introduce 3D-GRAND, a large-scale, densely-grounded 3D-text dataset designed for grounded 3D instruction tuning. We describe the data collection process, dataset statistics, and the unique features that make 3D-GRAND a valuable resource for advancing research in 3D-LLMs.

3D scene collection. The majority of 3D-text research is currently based on ScanNet scenes collected from real camera scans, which are limited in scale. However, recent advancements have led to the development of numerous synthetic data generation pipelines [48, 17, 20, 18, 37, 52, 62, 47, 73, 25, 60, 36, 21]. Given the scalability of these synthetic data generation pipelines, we explore the potential of using synthetic 3D scenes to enhance 3D-text understanding.

Synthetic data offers significant advantages, such as lower costs and greater accessibility, making it an attractive alternative. If models trained on simulated 3D-text data can effectively transfer to real-world 3D scenes, the research community stands to benefit immensely.

To this end, we curate a diverse collection of 40,087 high-quality 3D indoor scenes from the 3D-FRONT [23] and Structured3D [79] datasets. These datasets are chosen for their large quantities of synthetic indoor scenes with professionally designed layouts. The collection includes a variety of room types, such as living rooms, bedrooms, kitchens, office spaces, and conference rooms. We further process these 3D scenes to generate per-room 3D point clouds. Details on point cloud rendering and cleaning are provided in the Appendix.

Densely-grounded text annotation. We define text as densely grounded when every noun phrase that mentions an object is associated with a 3D object in the 3D scene. This is illustrated in Figure 2. Annotations of this kind are difficult to obtain. Early work such as ScanEnts3D [1] relies on hiring professional human annotators. The authors report that crowd-sourced annotators (Amazon Mechanical Turk (AMT) [13]) were not able to reliably complete this task and that they had to hire professional annotators (error rate AMT: 16%, professional: $<$ 5%). Yet our human quality check shows that LLMs (GPT-4 [49]) can achieve a densely-grounding error rate between 5.6% and 8.2% (see Appendix for detail). This finding is in accordance with recent studies [19, 63] reporting that LLMs can be human-level annotators. The accuracy of LLM annotation provides one motivation for considering LLMs as a dense-grounding annotation tool.

The second, and perhaps more critical, motivation is the scalability of annotation. While we can potentially scale up 3D scenes using synthetic data generation pipelines, annotating these scenes with human effort is both costly and time-consuming, especially for complex tasks like densely grounding annotation. To put the cost in money and time in perspective, for the data annotated in this paper, we estimate that obtaining the same annotations from a human annotator would cost at least $539,000 and require 5.76 years of around-the-clock work (no eating, no sleeping) from a professional annotator earning a minimum wage of $10.67 per hour. In contrast, using LLMs (GPT-4 [49]), we achieve the same results for $3,030 within 2 days, representing a 178x reduction in cost and a 1051x reduction in time. At the time of writing, the cost and time decrease by a further 50%, to roughly $1,500 and 1 day, with the introduction of GPT-4o [50].
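The cost and time ratios quoted above can be re-derived from the stated figures. The short check below is only a sanity sketch: it assumes the 5.76-year estimate means nonstop work (24 hours a day, 365 days a year), which is our reading of the text rather than a stated formula. Under that assumption the numbers agree with the paper up to rounding.

```python
# Figures quoted in the text.
HUMAN_COST_USD = 539_000   # estimated cost of human annotation
HOURLY_WAGE_USD = 10.67    # wage assumed for the annotator
LLM_COST_USD = 3_030       # cost of annotating with GPT-4
LLM_DAYS = 2               # wall-clock time with GPT-4

# Assumption (ours): the annotator works 24 hours a day, 365 days a year.
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365

human_hours = HUMAN_COST_USD / HOURLY_WAGE_USD
human_days = human_hours / HOURS_PER_DAY
human_years = human_days / DAYS_PER_YEAR

cost_reduction = HUMAN_COST_USD / LLM_COST_USD
time_reduction = human_days / LLM_DAYS

print(f"human effort: {human_hours:,.0f} hours = {human_years:.2f} years")
print(f"cost reduction: {cost_reduction:.0f}x")
print(f"time reduction: {time_reduction:.0f}x")
```

The computed time ratio differs from the quoted 1051x by about one unit because the paper appears to start from the already rounded value of 5.76 years.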

As previously discussed, using humans to annotate 3D scenes can be an exhausting process. Meanwhile, 2D-LLMs demonstrate remarkable capabilities in understanding visual inputs and generating language, making them well-suited for creating high-quality, grounded language annotations. However, due to the hallucination issues and data issues in 2D-LLMs, aggregating information across images, even those originating from the same scene, is not yet feasible.

In contrast, Large Language Models (LLMs) excel at understanding structured data and generating diverse and fluent language [49]. They have demonstrated capabilities in spatial reasoning [8] and in solving both elementary and sophisticated math problems [68, 34]. To address the limitations of 2D-LLMs when annotating 3D scenes, we leverage the strengths of LLMs. By integrating detailed, accurate information into a reliable scene graph, we provide LLMs with the necessary data to reason effectively and generate precise annotations.

Here are the key steps of applying our pipeline to obtain densely-grounded annotation for any synthetic 3D scene:

$\bullet$ 3D Model to 2D Image. In the 3D-FRONT dataset, each object is sourced from 3D-FUTURE [23], which provides a ground truth 2D image for each object. For the Structured3D dataset, individual images for each object are not available. Therefore, we utilize the set-of-mark prompting technique [72], where each object to be annotated is circled in red in the images.
$\bullet$ 2D Image to Attributes. We use GPT-4V to generate detailed language annotations for each 2D object image, including attributes like name, color, finish, and texture. The resulting names are open-vocabulary rather than class-agnostic.
$\bullet$ List of Attributes to Scene Graph. We structure each individual object’s annotations into a JSON-based scene graph that captures the relationships and attributes of objects within the scene. Note that we obtain this scene graph from synthetic data, so we can guarantee its correctness.
$\bullet$ Scene Graph to Generated Annotations. Based on the given scene graph, we produce 3D-Grounded Object Reference, 3D-Grounded Scene Description, and 3D-Grounded QA annotations using GPT-4 [49] and GPT-4 Turbo with various prompts, which we show in the Appendix.
$\bullet$ Generated Annotations to Processed Annotations. After we acquire the raw annotations, we apply hallucination filters to remove low-quality annotations and template augmentation on the phrase tags to augment the remaining ones.
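To make the scene-graph and filtering steps above more concrete, the following is a minimal sketch and not the pipeline's actual implementation. The JSON layout, the `<p>phrase</p>[obj-ID]` tag syntax, and the function names `build_scene_graph` and `passes_hallucination_filter` are all hypothetical, chosen only for illustration; object relationships are omitted. The filter shown checks one plausible criterion: every grounded phrase must point to an object that exists in the scene graph.

```python
import json
import re

def build_scene_graph(objects):
    """Collect per-object attribute dicts into one JSON-serializable graph.

    Hypothetical schema: {"objects": {object_id: {attribute: value, ...}}}.
    """
    return {
        "objects": {
            obj["id"]: {k: v for k, v in obj.items() if k != "id"}
            for obj in objects
        }
    }

# Hypothetical grounding tag: <p>noun phrase</p>[obj-ID]
TAG_PATTERN = re.compile(r"<p>(.+?)</p>\[(obj-\d+)\]")

def passes_hallucination_filter(annotation, scene_graph):
    """Keep an annotation only if it has at least one grounded phrase and
    every referenced object ID exists in the scene graph."""
    references = TAG_PATTERN.findall(annotation)
    if not references:
        return False
    return all(obj_id in scene_graph["objects"] for _, obj_id in references)

objects = [
    {"id": "obj-0", "name": "sofa", "color": "gray", "centroid": [1.0, 2.0, 0.4]},
    {"id": "obj-1", "name": "coffee table", "color": "oak", "centroid": [1.1, 3.0, 0.2]},
]
graph = build_scene_graph(objects)
print(json.dumps(graph, indent=2))

good = "<p>The gray sofa</p>[obj-0] faces <p>the oak coffee table</p>[obj-1]."
bad = "<p>The gray sofa</p>[obj-0] is next to <p>a floor lamp</p>[obj-7]."
print(passes_hallucination_filter(good, graph))
print(passes_hallucination_filter(bad, graph))
```

Because the scene graph comes from synthetic data, a membership check of this kind can be done exactly; checking that ungrounded noun phrases are absent would need an additional parsing step that is not sketched here.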

The resulting annotations span a set of grounded language tasks that cover a diverse range of language understanding and generation scenarios in 3D environments, as shown in Figure 2:

$\bullet$ 3D-Grounded Object Reference: Given a 3D scene and an object of interest, the 3D-LLM is required to generate a description that uniquely identifies the target object. The description includes text and grounding information, not only for the target object but also for any landmark objects mentioned in the description. This task is conceptually similar to Visual Grounding, Scene-aware Object Captioning, and Dense Captioning in 2D vision-language research.
$\bullet$ 3D-Grounded Scene Description: Given a 3D scene, the 3D-LLM generates a description that captures the salient aspects of the environment. The description includes both text and grounding information, linking the language to specific objects or regions in the scene.
$\bullet$ 3D-Grounded QA: Given a 3D scene and a question about the environment, the 3D-LLM generates an answer that is grounded in the scene. Both the question and answer include text and grounding information, ensuring that the 3D-LLM’s responses are contextually relevant and accurate.

Dataset highlights. 3D-GRAND possesses several unique features that distinguish it from existing 3D-language datasets: (1) Large-scale: With 40,087 scenes and 6.2 million annotations, 3D-GRAND is the largest 3D-language dataset to date, providing ample data for training and evaluating 3D-LLMs. (2) Dense grounding: Unlike recent million-scale datasets like SceneVerse, which lack grounded language annotations, each language annotation in 3D-GRAND is densely grounded to specific objects or regions within the 3D scenes, facilitating fine-grained language understanding and generation. (3) Diverse language tasks: 3D-GRAND supports a broad array of grounded language tasks, including object reference, spatial reasoning, and scene understanding, making it a comprehensive benchmark for evaluating 3D-LLMs. (4) High-quality annotations: The language annotations in 3D-GRAND are meticulously collected, filtered, and evaluated to ensure they are of high quality, diverse, and natural.

These unique features establish 3D-GRAND as a valuable resource for advancing research in 3D-LLMs and embodied AI. By providing a large-scale, densely-grounded 3D-text dataset, 3D-GRAND enables the development and evaluation of more capable and reliable 3D-LLMs that can effectively understand and interact with the physical world.

4 3D-POPE: A Benchmark for Evaluating Hallucination in 3D-LLMs

To systematically evaluate the hallucination behavior of 3D-LLMs, we introduce the 3D Polling-based Object Probing Evaluation (3D-POPE) benchmark. 3D-POPE is designed to assess a model’s ability to accurately identify the presence or absence of objects in a given 3D scene.

Dataset. To facilitate the 3D-POPE benchmark, we curate a dedicated dataset from the ScanNet [14] dataset, utilizing the semantic classes from ScanNet200 [57]. Specifically, we use the ScanNet validation set as the foundation for evaluating 3D-LLMs on the 3D-POPE benchmark.

Benchmark design. 3D-POPE consists of a set of triples, each comprising a 3D scene, a posed question, and a binary answer (“Yes” or “No”) indicating the presence or absence of an object (Fig. 1 middle). To ensure a balanced dataset, we maintain a 1:1 ratio of existent to nonexistent objects when constructing these triples. For the selection of negative samples (i.e., nonexistent objects), we employ three distinct sampling strategies:

• Random Sampling: Nonexistent objects are randomly selected from the set of objects not present in the 3D scene.

• Popular Sampling: We select the top-$k$ most frequent objects not present in the 3D scene, where $k$ equals the number of objects currently in the scene.

• Adversarial Sampling: For each positively identified object in the scene, we rank objects that are not present and have not been used as negative samples based on their frequency of co-occurrence with the positive object in the training dataset. The highest-ranking co-occurring object is then selected as the adversarial sample. This approach differs from the original POPE [41] to avoid adversarial samples mirroring popular samples, as indoor scenes often contain similar objects.

These sampling strategies are designed to challenge the model’s robustness and assess its susceptibility to different levels of object hallucination.
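The three strategies can be sketched in a few lines. This is an illustrative reading of the descriptions above, not the benchmark's released code: the function names, the `frequency` counter, and the `cooccurrence` dictionary keyed by `(positive, candidate)` pairs are assumptions, and tie-breaking (first candidate wins) is an arbitrary choice.

```python
import random
from collections import Counter

def random_negatives(present, vocabulary, k, rng):
    """Sample k absent classes uniformly at random."""
    absent = sorted(set(vocabulary) - set(present))
    return rng.sample(absent, min(k, len(absent)))

def popular_negatives(present, frequency):
    """Top-k most frequent absent classes, with k = number of present objects."""
    k = len(present)
    ranked_absent = [c for c, _ in frequency.most_common() if c not in present]
    return ranked_absent[:k]

def adversarial_negatives(present, vocabulary, cooccurrence):
    """For each present object, pick the unused absent class that co-occurs
    with it most often (counts assumed to come from the training data)."""
    used, negatives = set(), []
    for positive in present:
        candidates = [c for c in vocabulary if c not in present and c not in used]
        if not candidates:
            break
        best = max(candidates, key=lambda c: cooccurrence.get((positive, c), 0))
        used.add(best)
        negatives.append(best)
    return negatives

# Toy data (made up for illustration only).
vocabulary = ["chair", "table", "sofa", "lamp", "bed", "sink"]
present = ["chair", "table"]
frequency = Counter({"chair": 10, "table": 9, "bed": 8, "lamp": 6, "sofa": 3, "sink": 1})
cooccurrence = {("chair", "sofa"): 5, ("chair", "lamp"): 2,
                ("table", "sofa"): 7, ("table", "lamp"): 4}

rand_neg = random_negatives(present, vocabulary, k=2, rng=random.Random(0))
pop_neg = popular_negatives(present, frequency)
adv_neg = adversarial_negatives(present, vocabulary, cooccurrence)
print(rand_neg, pop_neg, adv_neg)
```

In the toy example, the adversarial sampler assigns "sofa" to "chair" and then must fall back to "lamp" for "table" because "sofa" is already used, which is the mechanism that keeps adversarial negatives from collapsing onto a single class.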

Metrics. To evaluate the model’s performance on the 3D-POPE benchmark, we use key metrics including Precision, Recall, F1 Score, Accuracy, and Yes (%). Precision is the fraction of the model’s “Yes” answers that refer to objects actually present in the scene, and Recall is the fraction of present objects that the model correctly affirms. Precision is particularly important because a low value indicates that many of the objects the 3D-LLM affirms do not exist. The F1 Score, combining Precision and Recall, offers a balanced view of performance and serves as the primary evaluation metric. Accuracy measures the proportion of correctly answered questions, encompassing both “Yes” and “No” responses. Additionally, the Yes (%) metric reports the proportion of questions answered “Yes”, which reveals the model’s tendency toward object hallucination; given the 1:1 ratio of existent to nonexistent objects, a value far above 50% signals a bias toward affirming objects.
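As a worked illustration of these metrics, the sketch below computes them from binary predictions and labels using the standard confusion-matrix definitions. The function name `pope_metrics` and the boolean encoding (True for “Yes” or “present”) are assumptions made for this example.

```python
def pope_metrics(predictions, labels):
    """Compute 3D-POPE style metrics, all expressed in percent.

    predictions: list of bool, True if the model answered "Yes".
    labels:      list of bool, True if the object is present in the scene.
    """
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p and l)
    fp = sum(1 for p, l in pairs if p and not l)
    fn = sum(1 for p, l in pairs if not p and l)
    tn = sum(1 for p, l in pairs if not p and not l)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs)
    yes_rate = (tp + fp) / len(pairs)

    return {name: 100.0 * value for name, value in {
        "precision": precision, "recall": recall, "f1": f1,
        "accuracy": accuracy, "yes": yes_rate,
    }.items()}

# A model that answers "Yes" to everything on a balanced set of questions.
labels = [True, False] * 50
always_yes = pope_metrics([True] * 100, labels)
print(always_yes)
```

An always-“Yes” model on a balanced set obtains 50% precision, 100% recall, and an F1 of about 66.7, which is close to the pattern Table 2 reports for 3D-LLM and shows why F1 alone should be read together with Precision and Yes (%).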

Leaderboard. We establish a public leaderboard for the 3D-POPE benchmark, allowing researchers to submit their 3D-LLM results and compare their performance against other state-of-the-art models. The leaderboard reports the evaluation metrics for each model under the three sampling strategies, providing a transparent and standardized way to assess the hallucination performance of 3D-LLMs.

Dataset	3D-POPE	Model	Precision	Recall	F1 Score	Accuracy	Yes (%)
ScanNet200 Val	Random	Random Baseline	50.00	50.00	50.00	50.00	50.00
		3D-LLM [27]	50.03	99.88	66.67	50.07	99.81
		3D-VisTA [83]	50.12	53.58	51.79	49.66	53.95
		LEO [30]	51.95	77.65	62.25	52.91	74.73
		Ours zero-shot (Grounding)	93.34	84.25	88.56	89.12	45.13
	Popular	Random Baseline	50.00	50.00	50.00	50.00	50.00
		3D-LLM [27]	49.97	99.88	66.61	49.94	99.94
		3D-VisTA [83]	47.40	51.88	49.54	49.49	52.30
		LEO [30]	48.30	77.65	59.55	47.27	80.38
		Ours zero-shot (Grounding)	73.05	84.28	78.26	76.59	57.69
	Adversarial	Random Baseline	50.00	50.00	50.00	50.00	50.00
		3D-LLM [27]	49.97	99.88	66.61	49.94	99.94
		3D-VisTA [83]	48.28	54.39	51.15	51.14	52.99
		LEO [30]	48.47	77.98	59.78	47.52	80.45
		Ours zero-shot (Grounding)	69.86	84.21	76.37	73.95	60.26

Table 2: 3D-POPE benchmark results for evaluating hallucination in 3D language models. Random Baseline refers to a model randomly predicting “yes” or “no” with 50% chance, given the 1:1 positive/negative sample ratio in the dataset.

	Model	Acc@0.25	Acc@0.5
Non-zero-shot	3D-LLM (flamingo) [27]	21.2	-
	3D-LLM (BLIP2-opt) [27]	29.6	-
	3D-LLM (BLIP2-flant5) [27]	30.3	-
Zero-shot	LLM-Grounder [71]	17.1	5.3
Sim-to-real zero-shot	3D-GRAND	38.0	27.4

Table 3: ScanRefer Results for evaluating visual grounding capability of 3D-LLMs. 3D-GRAND achieves the best zero-shot performance among 3D-LLMs, providing signals for sim-to-real transfer.

3D-POPE	Model	Precision
Random	3D-GRAND	93.34
Random	w/o grounding tokens	91.96
Popular	3D-GRAND	73.05
Popular	w/o grounding tokens	70.37
Adversarial	3D-GRAND	69.86
Adversarial	w/o grounding tokens	67.48

Table 4: Ablation on 3D-POPE. Without the grounding tokens, 3D-GRAND hallucinates more.

5 Experiments

In this section, we present our experimental setup, including the baselines, datasets, and implementation details. We then report the results of our approach, denoted as 3D-GRAND, on the ScanRefer [10] and 3D-POPE benchmarks, demonstrating its effectiveness in improving grounding and reducing hallucination. Finally, we conduct an ablation study to analyze the impact of different components of our model and training strategy.

5.1 Experimental Setup

Baselines. We compare our 3D-GRAND against the following baselines: 3D-LLM [27], LEO [30], and 3D-Vista [83]. Each model, along with the specific checkpoint used to obtain the results, is documented in the appendix. Our proposed model is based on Llama-2 [64]. The input is object-centric context, including a scene graph with each object’s category, centroid (x, y, z), and extent (width, height, depth), along with the text instruction and user query. During training, we utilized ground-truth centroids and extents. For inference, we used bounding boxes predicted by Mask3D [59]. Examples of input/output and details of the model can be found in the supplementary material.
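To illustrate what an object-centric context of this kind could look like, here is a minimal sketch that serializes objects into a text prompt. The exact prompt layout, the `<obj-ID>` token syntax, and the function name `format_scene_context` are hypothetical; the actual input/output format is given in the supplementary material. The `max_objects` cap mirrors the top-40 proposal limit mentioned in the ablation study and assumes the object list is already sorted by proposal confidence.

```python
def format_scene_context(objects, instruction, query, max_objects=40):
    """Serialize objects (category, centroid, extent) into a text prompt.

    objects: list of dicts with keys "id", "category",
             "centroid" (x, y, z) and "extent" (width, height, depth).
    """
    lines = [instruction, "Objects:"]
    for obj in objects[:max_objects]:
        cx, cy, cz = obj["centroid"]
        w, h, d = obj["extent"]
        lines.append(
            f"<obj-{obj['id']}> {obj['category']} "
            f"centroid=({cx:.2f}, {cy:.2f}, {cz:.2f}) "
            f"extent=({w:.2f}, {h:.2f}, {d:.2f})"
        )
    lines.append(f"Query: {query}")
    return "\n".join(lines)

scene = [
    {"id": 0, "category": "sofa", "centroid": (1.0, 2.0, 0.4), "extent": (2.0, 0.8, 0.9)},
    {"id": 1, "category": "table", "centroid": (1.1, 3.0, 0.2), "extent": (1.2, 0.4, 0.6)},
]
prompt = format_scene_context(
    scene,
    instruction="Answer using the objects listed below.",
    query="Which object is in front of the sofa?",
)
print(prompt)
```

At training time the centroids and extents would come from ground truth, and at inference time from Mask3D predictions, as described above; the serialization itself stays the same.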

Datasets. We evaluate our model 3D-GRAND on two datasets: 3D-POPE and ScanRefer. 3D-POPE is our newly introduced benchmark dataset for evaluating object hallucination in 3D-LLMs, as described in Section 4. For ScanRefer, we use the validation split, which contains 9,508 natural language descriptions of 2,068 objects in 141 ScanNet [14] scenes.

Metrics. For the ScanRefer benchmark, we use the official evaluation metrics, including Acc@0.25 and Acc@0.5. For the 3D-POPE benchmark, we report accuracy, precision, recall, F1 score, and “Yes” rate under the three sampling strategies described in Section 4.
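The ScanRefer accuracy metrics count a prediction as correct when the 3D intersection-over-union (IoU) between the predicted and ground-truth boxes reaches a threshold (0.25 or 0.5). The sketch below shows this computation for axis-aligned boxes given as centroid plus extent. It is an illustration, not the official evaluation script: the box encoding, the function names, and the use of `>=` at the threshold are assumptions.

```python
def box_iou_3d(box_a, box_b):
    """IoU of two axis-aligned boxes given as (cx, cy, cz, dx, dy, dz)."""
    intersection = 1.0
    for axis in range(3):
        lo = max(box_a[axis] - box_a[axis + 3] / 2, box_b[axis] - box_b[axis + 3] / 2)
        hi = min(box_a[axis] + box_a[axis + 3] / 2, box_b[axis] + box_b[axis + 3] / 2)
        intersection *= max(0.0, hi - lo)  # zero overlap on any axis gives zero volume
    volume_a = box_a[3] * box_a[4] * box_a[5]
    volume_b = box_b[3] * box_b[4] * box_b[5]
    union = volume_a + volume_b - intersection
    return intersection / union if union > 0 else 0.0

def accuracy_at_iou(predicted, ground_truth, threshold):
    """Percentage of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(
        1 for p, g in zip(predicted, ground_truth)
        if box_iou_3d(p, g) >= threshold
    )
    return 100.0 * hits / len(ground_truth)

unit = (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)      # unit cube at the origin
shifted = (0.5, 0.0, 0.0, 1.0, 1.0, 1.0)   # same cube moved 0.5 along x

predicted = [unit, shifted]
ground_truth = [unit, unit]
print(box_iou_3d(unit, shifted))                      # overlap 0.5, union 1.5
print(accuracy_at_iou(predicted, ground_truth, 0.25))
print(accuracy_at_iou(predicted, ground_truth, 0.5))
```

The half-shifted cube has an IoU of one third, so it counts as a hit at the 0.25 threshold but a miss at 0.5, which is why the stricter metric is always the lower of the two in the result tables.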

Implementation Details. The 3D-GRAND model is LoRA-finetuned [28] from Llama-2. We use DeepSpeed ZeRO-2 [55] and FlashAttention [16] to save GPU memory and speed up training. The model is trained in BF16 precision on 12 NVIDIA A40 GPUs with a combined batch size of 96 and a learning rate of 2e-4. We use the AdamW [45] optimizer with a weight decay of 0.01 and a cosine learning rate scheduler. We train the model for 10k steps, which takes approximately 48 hours.

5.2 Results on 3D-POPE

We first evaluate these approaches on 3D-POPE and report results in Table 2. Results show that 3D-LLM [27] almost always produces yes responses to any question. 3D-VisTA [83] performs similarly to the random baseline. LEO [30] tends to answer yes frequently, but its precision indicates a similar object hallucination rate to the random baseline. In our evaluation, 3D-GRAND achieves exceptional performance, with 93.34% precision and 89.12% accuracy when tested with random sampling. However, our model struggles with the more difficult splits, Popular and Adversarial, which demonstrates the effectiveness and rigor of 3D-POPE as a benchmark. Moreover, we emphasize that our model has never encountered ScanNet during training. More analysis on 3D hallucination can be found in the supplementary material.

5.3 Results on ScanRefer

We report results on ScanRefer in Table 3. Our model significantly outperforms all the baselines, achieving state-of-the-art zero-shot results on both Acc@0.25 and Acc@0.5. Notably, our model surpasses the previous best-performing model, 3D-LLM, by 7.7 points on Acc@0.25 (38.0 vs. 30.3). We emphasize that our model, unlike 3D-LLM, has never seen ScanNet scenes during its training (zero-shot) and is trained only on synthetic 3D scenes instead of real scans. Therefore, these results provide a promising early signal that sim-to-real transfer can be achieved via our densely-grounded large-scale dataset.

Table 5: Ablation Study on Grounding Accuracy (%) on ScanRefer: Training with densely-grounded data significantly improves grounding accuracy, particularly when multiple distractor objects of the same category are present in the room.

Method	Det.	Unique Acc@0.25	Unique Acc@0.5	Multiple Acc@0.25	Multiple Acc@0.5	Overall Acc@0.25	Overall Acc@0.5
Best IoU (upper bound)	Mask3D (Top100)	93.7	66.8	91.6	70.7	92.4	69.2
Best IoU (upper bound)	Mask3D (Top40)	81.2	58.7	80.7	62.4	80.9	61.0
Non-grounded Model	Mask3D (Top40)	51.8	33.1	21.3	17.9	34.2	24.3
Grounded Model (ground later)	Mask3D (Top40)	50.4	32.4	26.0	20.5	36.3	25.5
Grounded Model (ground first)	Mask3D (Top40)	54.4	36.4	26.0	20.8	38.0	27.4
Best IoU (upper bound)	GT	100.0	100.0	100.0	100.0	100.0	100.0
Non-grounded Model	GT	90.8	90.8	26.0	26.0	53.4	53.4
Grounded Model	GT	91.0	91.0	32.1	32.1	57.0	57.0

5.4 Ablation Study

To better understand the impact of different components of our 3D-LLM, we conduct an ablation study on the ScanRefer and 3D-POPE benchmarks.

Grounding tokens. We show the results of our model with different types of grounding methods in Table 5. We also show results on 3D-POPE in Table 4. In general, the model has worse grounding performance and more hallucinations without grounding tokens. “Ground first” and “ground later” refer to whether the dense grounding (grounding every single object mentioned) of the object reference query happens before or after the model outputs the final answer for the referring expression. The former effectively constitutes a chain-of-thought reasoning process [67], which is likely why the performance increases compared to the latter. See Appendix for details.

Mask3D proposals. Finally, we show the upper bound of our approach in Table 5. Our results are based on Mask3D proposals. Due to the limited context length of the LLM, we only use the top-40 proposals.

5.5 Data Scaling and Sim-to-Real Transfer

The results are presented in Figure 4. Our model is trained on synthetic 3D scenes from 3D-FRONT and Structured3D [79, 23], and evaluated on real-world 3D scans from ScanNet [14]. The grounding performance consistently improves, and the hallucination rate drops as the densely-grounded data scales up. Notably, our model trained on densely grounded data scales better than the same model trained without such data. These findings pave the way for a future where we can scale 3D-text understanding using synthetic scenes obtained from simulation, which are much cheaper and easier to obtain.

6 Conclusion

In this paper, we introduced 3D-GRAND, a large-scale, densely-grounded 3D-text dataset designed for grounded 3D instruction tuning, and 3D-POPE, a comprehensive benchmark for evaluating object hallucination in 3D-LLMs. Through extensive experiments, we demonstrated the effectiveness of our dataset on 3D-LLMs in improving grounding and reducing hallucination, achieving state-of-the-art performance on the ScanRefer and 3D-POPE benchmarks. Our ablation study and qualitative analysis highlighted the importance of densely-grounded instruction tuning, the data scaling law, and effective sim-to-real transfer in developing high-performing 3D-LLMs. We hope our contributions and findings can spark further research and innovation in this field, ultimately leading to the development of more advanced and capable 3D-LLMs for a wide range of applications.

7 Acknowledgement

This work is generously supported by NSF IIS-1949634, NSF SES-2128623, and has benefited from the Microsoft Accelerate Foundation Models Research (AFMR) grant program.

References

[1] A. Abdelreheem, K. Olszewski, H.-Y. Lee, P. Wonka, and P. Achlioptas. Scanents3d: Exploiting phrase-to-3d-object correspondences for improved visio-linguistic models in 3d scenes. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3524–3534, 2024.
[2] P. Achlioptas, A. Abdelreheem, F. Xia, M. Elhoseiny, and L. J. Guibas. ReferIt3D: Neural listeners for fine-grained 3d object identification in real-world scenes. In 16th European Conference on Computer Vision (ECCV), 2020.
[3] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
[4] D. Azuma, T. Miyanishi, S. Kurita, and M. Kawanabe. Scanqa: 3d question answering for spatial scene understanding. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129–19139, 2022.
[5] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
[6] Y. Bisk, A. Holtzman, J. Thomason, J. Andreas, Y. Bengio, J. Chai, M. Lapata, A. Lazaridou, J. May, A. Nisnevich, N. Pinto, and J. Turian. Experience grounds language, 2020.
[7] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.
[8] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, and Y. Zhang. Sparks of artificial general intelligence: Early experiments with gpt-4, 2023.
[9] K. R. Chandu, Y. Bisk, and A. W. Black. Grounding ‘grounding’ in NLP. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4283–4305, Online, Aug. 2021. Association for Computational Linguistics.
[10] D. Z. Chen, A. X. Chang, and M. Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. 16th European Conference on Computer Vision (ECCV), 2020.
[11] S. Chen, X. Chen, C. Zhang, M. Li, G. Yu, H. Fei, H. Zhu, J. Fan, and T. Chen. Ll3da: Visual interactive instruction tuning for omni-3d understanding, reasoning, and planning. arXiv preprint arXiv:2311.18651, 2023.
[12] Z. Chen, A. Gholami, M. Nießner, and A. X. Chang. Scan2cap: Context-aware dense captioning in rgb-d scans. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3193–3203, 2021.
[13] K. Crowston. Amazon mechanical turk: A research tool for organizations and information systems scholars. In A. Bhattacherjee and B. Fitzgerald, editors, Shaping the Future of ICT Research. Methods and Approaches, pages 210–221, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
[14] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017.
[15] W. Dai, Z. Liu, Z. Ji, D. Su, and P. Fung. Plausible may not be faithful: Probing object hallucination in vision-language pre-training, 2023.
[16] T. Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024.
[17] M. Deitke, W. Han, A. Herrasti, A. Kembhavi, E. Kolve, R. Mottaghi, J. Salvador, D. Schwenk, E. VanderBilt, M. Wallingford, L. Weihs, M. Yatskar, and A. Farhadi. RoboTHOR: An Open Simulation-to-Real Embodied AI Platform. In CVPR, 2020.
[18] M. Deitke, E. VanderBilt, A. Herrasti, L. Weihs, J. Salvador, K. Ehsani, W. Han, E. Kolve, A. Farhadi, A. Kembhavi, and R. Mottaghi. ProcTHOR: Large-Scale Embodied AI Using Procedural Generation. In NeurIPS, 2022. Outstanding Paper Award.
[19] B. Ding, C. Qin, L. Liu, Y. K. Chia, S. Joty, B. Li, and L. Bing. Is gpt-3 a good data annotator?, 2023.
[20] K. Ehsani, W. Han, A. Herrasti, E. VanderBilt, L. Weihs, E. Kolve, A. Kembhavi, and R. Mottaghi. ManipulaTHOR: A Framework for Visual Object Manipulation. In CVPR, 2021.
[21] Epic Games. Unreal engine.
[22] Y. Etesam, L. Kochiev, and A. X. Chang. 3dvqa: Visual question answering for 3d environments. In 2022 19th Conference on Robots and Vision (CRV), pages 233–240. IEEE, 2022.
[23] H. Fu, B. Cai, L. Gao, L.-X. Zhang, J. Wang, C. Li, Q. Zeng, C. Sun, R. Jia, B. Zhao, et al. 3d-front: 3d furnished rooms with layouts and semantics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10933–10942, 2021.
[24] A. Gunjal, J. Yin, and E. Bas. Detecting and preventing hallucinations in large vision language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18135–18143, 2024.
[25] L. Höllein, A. Cao, A. Owens, J. Johnson, and M. Nießner. Text2room: Extracting textured 3d meshes from 2d text-to-image models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7909–7920, 2023.
[26] Y. Hong, C. Lin, Y. Du, Z. Chen, J. B. Tenenbaum, and C. Gan. 3d concept learning and reasoning from multi-view images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9202–9212, 2023.
[27] Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan. 3d-llm: Injecting the 3d world into large language models. Advances in Neural Information Processing Systems, 36:20482–20494, 2023.
[28] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
[29] H. Huang, Z. Wang, R. Huang, L. Liu, X. Cheng, Y. Zhao, T. Jin, and Z. Zhao. Chat-3d v2: Bridging 3d scene and large language models with object identifiers. arXiv preprint arXiv:2312.08168, 2023.
[30] J. Huang, S. Yong, X. Ma, X. Linghu, P. Li, Y. Wang, Q. Li, S.-C. Zhu, B. Jia, and S. Huang. An embodied generalist agent in 3d world. In ICML, 2024.
[31] Q. Huang, X. Dong, P. Zhang, B. Wang, C. He, J. Wang, D. Lin, W. Zhang, and N. Yu. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[32] W. Huang, D. Liu, and W. Hu. Dense object grounding in 3d scenes. Proceedings of the 31st ACM International Conference on Multimedia, 2023.
[33] W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023.
[34] S. Imani, L. Du, and H. Shrivastava. Mathprompter: Mathematical reasoning using large language models, 2023.
[35] B. Jia, Y. Chen, H. Yu, Y. Wang, X. Niu, T. Liu, Q. Li, and S. Huang. Sceneverse: Scaling 3d vision-language learning for grounded scene understanding. arXiv preprint arXiv:2401.09340, 2024.
[36] A. Juliani, V.-P. Berges, E. Teng, A. Cohen, J. Harper, C. Elion, C. Goy, Y. Gao, H. Henry, M. Mattar, and D. Lange. Unity: A general platform for intelligent agents, 2020.
[37] E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi. AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv, 2017.
[38] X. Lai, Z. Tian, Y. Chen, Y. Li, Y. Yuan, S. Liu, and J. Jia. Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692, 2023.
[39] J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.
[40] M. Li, X. Chen, C. Zhang, S. Chen, H. Zhu, F. Yin, G. Yu, and T. Chen. M3dbench: Let’s instruct large models with multi-modal 3d prompts. arXiv preprint arXiv:2312.10763, 2023.
[41] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
[42] Z. Li, C. Zhang, X. Wang, R. Ren, Y. Xu, R. Ma, and X. Liu. 3dmit: 3d multi-modal instruction tuning for scene understanding. arXiv preprint arXiv:2401.03201, 2024.
[43] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
[44] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In NeurIPS, 2023.
[45] I. Loshchilov and F. Hutter. Decoupled weight decay regularization, 2019.
[46] X. Ma, S. Yong, Z. Zheng, Q. Li, Y. Liang, S.-C. Zhu, and S. Huang. Sqa3d: Situated question answering in 3d scenes. In International Conference on Learning Representations, 2023.
[47] Manolis Savva*, Abhishek Kadian*, Oleksandr Maksymets*, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, D. Parikh, and D. Batra. Habitat: A Platform for Embodied AI Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[48] M. Mittal, C. Yu, Q. Yu, J. Liu, N. Rudin, D. Hoeller, J. L. Yuan, R. Singh, Y. Guo, H. Mazhar, A. Mandlekar, B. Babich, G. State, M. Hutter, and A. Garg. Orbit: A unified simulation framework for interactive robot learning environments. IEEE Robotics and Automation Letters, 8(6):3740–3747, 2023.
[49] OpenAI. Gpt-4 technical report, 2024.
[50] OpenAI. Hello gpt-4o, May 2024.
[51] Z. Peng, W. Wang, L. Dong, Y. Hao, S. Huang, S. Ma, and F. Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
[52] X. Puig, E. Undersander, A. Szot, M. D. Cote, R. Partsey, J. Yang, R. Desai, A. W. Clegg, M. Hlavac, T. Min, T. Gervet, V. Vondrus, V.-P. Berges, J. Turner, O. Maksymets, Z. Kira, M. Kalakrishnan, J. Malik, D. S. Chaplot, U. Jain, D. Batra, A. Rai, and R. Mottaghi. Habitat 3.0: A co-habitat for humans, avatars and robots, 2023.
[53] Z. Qi, Y. Fang, Z. Sun, X. Wu, T. Wu, J. Wang, D. Lin, and H. Zhao. Gpt4point: A unified framework for point-language understanding and generation, 2023.
[54] H. Rasheed, M. Maaz, S. Shaji, A. Shaker, S. Khan, H. Cholakkal, R. M. Anwer, E. Xing, M.-H. Yang, and F. S. Khan. Glamm: Pixel grounding large multimodal model. arXiv preprint arXiv:2311.03356, 2023.
[55] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020.
[56] A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko. Object hallucination in image captioning. In E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035–4045, Brussels, Belgium, Oct.-Nov. 2018. Association for Computational Linguistics.
[57] D. Rozenberszki, O. Litany, and A. Dai. Language-grounded indoor 3d semantic segmentation in the wild. In Proceedings of the European Conference on Computer Vision (ECCV), 2022.
[58] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text models, 2022.
[59] J. Schult, F. Engelmann, A. Hermans, O. Litany, S. Tang, and B. Leibe. Mask3d: Mask transformer for 3d semantic instance segmentation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 8216–8223. IEEE, 2023.
[60] J. Schult, S. Tsai, L. Höllein, B. Wu, J. Wang, C.-Y. Ma, K. Li, X. Wang, F. Wimbauer, Z. He, P. Zhang, B. Leibe, P. Vajda, and J. Hou. Controlroom3d: Room generation using semantic proxy rooms, 2023.
[61] Z. Sun, S. Shen, S. Cao, H. Liu, C. Li, Y. Shen, C. Gan, L.-Y. Gui, Y.-X. Wang, Y. Yang, et al. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525, 2023.
[62] A. Szot, A. Clegg, E. Undersander, E. Wijmans, Y. Zhao, J. Turner, N. Maestre, M. Mukadam, D. Chaplot, O. Maksymets, A. Gokaslan, V. Vondrus, S. Dharur, F. Meier, W. Galuba, A. Chang, Z. Kira, V. Koltun, J. Malik, M. Savva, and D. Batra. Habitat 2.0: Training home assistants to rearrange their habitat. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[63] Z. Tan, A. Beigi, S. Wang, R. Guo, A. Bhattacharjee, B. Jiang, M. Karami, J. Li, L. Cheng, and H. Liu. Large language models for data annotation: A survey, 2024.
[64] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023.
[65] T. Wang, X. Mao, C. Zhu, R. Xu, R. Lyu, P. Li, X. Chen, W. Zhang, K. Chen, T. Xue, et al. Embodiedscan: A holistic multi-modal 3d perception suite towards embodied ai. arXiv preprint arXiv:2312.16170, 2023.
[66] Z. Wang, H. Huang, Y. Zhao, Z. Zhang, and Z. Zhao. Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes. arXiv preprint arXiv:2308.08769, 2023.
[67] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
[68] Y. Wu, F. Jia, S. Zhang, H. Li, E. Zhu, Y. Wang, Y. T. Lee, R. Peng, Q. Wu, and C. Wang. An empirical study on challenging math problem solving with gpt-4, 2023.
[69] J. Xu, X. Zhou, S. Yan, X. Gu, A. Arnab, C. Sun, X. Wang, and C. Schmid. Pixel Aligned Language Models. arXiv preprint arXiv: 2312.09237, 2023.
[70] X. Yan, Z. Yuan, Y. Du, Y. Liao, Y. Guo, Z. Li, and S. Cui. Clevr3d: Compositional language and elementary visual reasoning for question answering in 3d real-world scenes. arXiv preprint arXiv:2112.11691, 2021.
[71] J. Yang, X. Chen, S. Qian, N. Madaan, M. Iyengar, D. F. Fouhey, and J. Chai. Llm-grounder: Open-vocabulary 3d visual grounding with large language model as an agent. In ICRA, 2024.
[72] J. Yang, H. Zhang, F. Li, X. Zou, C. Li, and J. Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v, 2023.
[73] Y. Yang, F.-Y. Sun, L. Weihs, E. VanderBilt, A. Herrasti, W. Han, J. Wu, N. Haber, R. Krishna, L. Liu, et al. Holodeck: Language guided generation of 3d embodied ai environments. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), volume 30, pages 20–25. IEEE/CVF, 2024.
[74] H. You, H. Zhang, Z. Gan, X. Du, B. Zhang, Z. Wang, L. Cao, S.-F. Chang, and Y. Yang. Ferret: Refer and ground anything anywhere at any granularity. In The Twelfth International Conference on Learning Representations, 2023.
[75] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 69–85. Springer, 2016.
[76] Z. Yuan, X. Yan, Z. Li, X. Li, Y. Guo, S. Cui, and Z. Li. Toward explainable and fine-grained 3d grounding through referring textual phrases. arXiv preprint arXiv:2207.01821, 2022.
[77] B. Zhai, S. Yang, X. Zhao, C. Xu, S. Shen, D. Zhao, K. Keutzer, M. Li, T. Yan, and X. Fan. Halle-switch: Rethinking and controlling object existence hallucinations in large vision language models for detailed caption. arXiv preprint arXiv:2310.01779, 2023.
[78] Y. Zhang, Z. Ma, X. Gao, S. Shakiah, Q. Gao, and J. Chai. Groundhog: Grounding large language models to holistic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[79] J. Zheng, J. Zhang, J. Li, R. Tang, S. Gao, and Z. Zhou. Structured3d: A large photo-realistic dataset for structured 3d modeling. In Proceedings of The European Conference on Computer Vision (ECCV), 2020.
[80] Y. Zhou, C. Cui, J. Yoon, L. Zhang, Z. Deng, C. Finn, M. Bansal, and H. Yao. Analyzing and mitigating object hallucination in large vision-language models. In The Twelfth International Conference on Learning Representations, 2024.
[81] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
[82] W. Zhu, J. Hessel, A. Awadalla, S. Y. Gadre, J. Dodge, A. Fang, Y. Yu, L. Schmidt, W. Y. Wang, and Y. Choi. Multimodal c4: An open, billion-scale corpus of images interleaved with text, 2023.
[83] Z. Zhu, X. Ma, Y. Chen, Z. Deng, S. Huang, and Q. Li. 3d-vista: Pre-trained transformer for 3d vision and text alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2911–2921, 2023.

Appendix A Model Details

A.1 Model input and output demonstration

In Figure 5, we show an example of the 3D-GRAND model’s input and output on the Grounded Object Reference task. Note how, in the “Response”, we train the model to generate a pair of $\langle$detailed_grounding$\rangle$ tags that densely ground every single object mentioned in the referring expression, after generating the grounding answer in $\langle$refer_expression_grounding$\rangle$. “Ground first” and “ground later” in Table 5 refer to whether the $\langle$detailed_grounding$\rangle$ tags appear before or after the $\langle$refer_expression_grounding$\rangle$ tags. Figure 5 is an example of “ground later”, and Figure 6 shows an example of “ground first”.
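The distinction between the two variants comes down to tag order in the response, which can be sketched as follows. This is an illustration only: we assume the tags appear literally as `<detailed_grounding>` and `<refer_expression_grounding>` in the output string, and the tag contents in the toy strings are placeholders rather than the actual format shown in Figures 5 and 6.

```python
DETAILED_TAG = "<detailed_grounding>"
REFER_TAG = "<refer_expression_grounding>"


def grounding_order(response: str) -> str:
    """Classify a response by which grounding tag opens first."""
    detailed_pos = response.find(DETAILED_TAG)
    refer_pos = response.find(REFER_TAG)
    if detailed_pos == -1 or refer_pos == -1:
        return "missing tag"
    return "ground first" if detailed_pos < refer_pos else "ground later"


# Placeholder contents ("..."); only the tag order matters here.
ground_later = (
    "<refer_expression_grounding>...</refer_expression_grounding>"
    "<detailed_grounding>...</detailed_grounding>"
)
ground_first = (
    "<detailed_grounding>...</detailed_grounding>"
    "<refer_expression_grounding>...</refer_expression_grounding>"
)
print(grounding_order(ground_later))
print(grounding_order(ground_first))
```

In the “ground first” ordering, the final answer is generated after every mentioned object has been grounded, so the dense grounding is available as context when the answer is produced.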

A.2 Training Data

We fine-tuned two flavors of models: a grounded object reference model and a grounded QA model. The grounded object reference model was trained on the grounded object reference data from the 3D-FRONT train split, which consists of 234,791 3D-text pairs, each of which is densely grounded. This model was used to generate the ScanRefer results presented in Tables 4 and 5 and Figure 4. The grounded QA model was trained on a subset of 200k “grounded QA: object existence” examples from the 3D-FRONT train split. We select a subset of 200k QAs simply because the entire grounded QA dataset is too large and we do not have enough resources to train on all of it. However, as shown in Table 2 and Figure 4, we find that even such a subset is very effective in reducing hallucinations in 3D-LLMs.

We provide official train, val, and test splits (90%, 5%, 5%) in our dataset release. The val and test proportions might seem small, but given the million-scale size of our dataset, they should be sufficiently large for development and evaluation purposes.

A.3 Training Details

The two flavors of model mentioned above are fine-tuned with LoRA [28] on top of Llama-2. We use DeepSpeed ZeRO-2 [55] and FlashAttention [16] to save GPU memory and speed up training. The model is trained in BF16 precision on 12 NVIDIA A40 GPUs with a combined batch size of 96 and a learning rate of 2e-4. We use the AdamW [45] optimizer with a weight decay of 0.01 and a cosine learning rate scheduler. We train the model for 10k steps, which takes approximately 48 hours.
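To make the schedule concrete, here is a minimal sketch of a cosine learning-rate curve using the hyperparameters above. The paper does not state whether warmup is used or what the final learning rate is; this sketch assumes no warmup and decay to zero, and the per-GPU batch size assumes no gradient accumulation. The function name `cosine_lr` is ours.

```python
import math

PEAK_LR = 2e-4        # learning rate stated in the text
TOTAL_STEPS = 10_000  # training steps stated in the text
NUM_GPUS = 12
GLOBAL_BATCH = 96


def cosine_lr(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS, min_lr=0.0):
    """Cosine decay from peak_lr to min_lr (no warmup; min_lr=0 is an assumption)."""
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Per-GPU batch size if the global batch is split evenly with no accumulation.
per_gpu_batch = GLOBAL_BATCH // NUM_GPUS

for step in (0, 2_500, 5_000, 7_500, 10_000):
    print(step, f"{cosine_lr(step):.2e}")
```

Under these assumptions the learning rate starts at 2e-4, passes through 1e-4 at the midpoint, and reaches zero at step 10k.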

Appendix B Additional 3D-GRAND Data Collection

B.1 Point Cloud Generation Pipeline for 3D-FRONT

Here, we present an expanded version of Section 3, focusing on the methodologies employed in the collection and cleaning of 3D scenes, specifically detailing our process for deriving 3D point clouds from existing datasets.

In our workflow with 3D-FRONT, layouts and meshes are initially processed in Blender to produce multi-view images. These images are subsequently used to construct comprehensive point clouds for entire houses. Both point clouds and per-room meshes are utilized to generate scene-level point clouds. We avoid direct use of room meshes because they lack color information in ceilings, walls, and floors, necessitating the final output to be a point cloud.

For Structured3D, while per-scene multi-view images facilitate direct rendering of per-scene point clouds, we frequently encounter issues where parts of adjacent scenes are inadvertently reconstructed due to window transparency. To address this, we employ the layout of each scene to trim extraneous points, thus enhancing the precision of the resulting point clouds.
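The trimming step can be sketched as a point-in-layout filter. The paper does not specify the layout representation or the exact test used; the sketch below assumes the layout is a 2D floor polygon plus a floor and ceiling height, and uses a standard ray-casting test. All function and parameter names are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon given as a list of (x, y) vertices?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def trim_to_layout(points, floor_polygon, z_min, z_max):
    """Keep only (x, y, z) points inside the floor polygon and between floor and ceiling."""
    return [
        (x, y, z)
        for (x, y, z) in points
        if z_min <= z <= z_max and point_in_polygon(x, y, floor_polygon)
    ]


# Toy room: a 4 m x 3 m rectangle with a 2.8 m ceiling.
room = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
points = [
    (1.0, 1.0, 1.0),   # inside the room
    (5.0, 1.0, 1.0),   # seen through a window, belongs to a neighboring scene
    (2.0, 2.0, 5.0),   # above the ceiling
    (3.9, 2.9, 0.1),   # inside, near a corner
]
trimmed = trim_to_layout(points, room, z_min=0.0, z_max=2.8)
print(trimmed)
```

A polygon test, rather than an axis-aligned bounding box, also handles non-rectangular rooms.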

Appendix C Additional 3D-POPE Results

C.1 3D-POPE Results on NYU40

Table 6 presents evaluation results for 3D-POPE using the NYU40 class set. NYU40 includes a subset of the classes from ScanNet200 featured in the main results table. The NYU40 class set consolidates many fine-grained classes into an “other” category, potentially reducing the challenge of negative sampling in the Popular and Adversarial settings compared to the ScanNet200 scenario.

Dataset	3D-POPE	Model	Accuracy	Precision	Recall	F1 Score	Yes (%)
ScanNet Val (NYU40)	Random	3D-LLM	50.00	50.00	100.00	66.67	100.00
		3D-VisTA	50.12	50.08	77.13	60.73	77.01
		LEO	54.03	52.70	78.52	63.07	74.50
		Ours zero-shot (No Grounding)	86.45	87.26	85.36	86.30	48.91
		Ours zero-shot (Grounding)	85.68	88.22	82.34	85.18	46.67
	Popular	3D-LLM	50.00	50.00	100.00	66.67	100.00
		3D-VisTA	50.27	50.23	77.13	60.84	76.91
		LEO	48.86	49.28	77.44	60.23	78.58
		Ours zero-shot (No Grounding)	80.85	78.30	85.35	81.68	54.50
		Ours zero-shot (Grounding)	81.69	81.32	82.28	81.80	50.59
	Adversarial	3D-LLM	50.00	50.00	100.00	66.67	100.00
		3D-VisTA	50.44	50.48	77.14	61.03	76.86
		LEO	49.77	49.85	77.67	60.73	77.91
		Ours zero-shot (No Grounding)	81.47	78.98	85.78	82.24	54.31
		Ours zero-shot (Grounding)	82.10	81.72	82.72	82.22	50.61

Table 6: Results of 3D-LLMs under three evaluation settings of 3D-POPE on the validation set of ScanNet using NYU40 class set. Yes denotes the proportion of answering “Yes” to the given question. The best results in each block are denoted in bold.
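The columns of Table 6 follow the usual binary-classification definitions, with “Yes” as the positive class. A minimal sketch, assuming answers have already been parsed to booleans (True for “Yes”); the function name `pope_metrics` is ours and is not taken from any released evaluation code:

```python
def pope_metrics(predictions, labels):
    """Return accuracy, precision, recall, F1, and yes-rate, all in percent.

    predictions, labels: equal-length sequences of bools, True meaning "Yes".
    """
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p and l)
    fp = sum(1 for p, l in pairs if p and not l)
    fn = sum(1 for p, l in pairs if not p and l)
    tn = sum(1 for p, l in pairs if not p and not l)
    total = len(pairs)

    def pct(num, den):
        # Percentage with a guard against empty denominators.
        return 100.0 * num / den if den else 0.0

    precision = pct(tp, tp + fp)
    recall = pct(tp, tp + fn)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": pct(tp + tn, total),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_rate": pct(tp + fp, total),
    }


# A model that answers "Yes" to everything on a balanced question set.
labels = [True] * 6 + [False] * 6
always_yes = [True] * 12
m = pope_metrics(always_yes, labels)
print({k: round(v, 2) for k, v in m.items()})
```

On a balanced question set, an always-“Yes” model scores 50 accuracy, 50 precision, 100 recall, 66.67 F1, and a 100% yes-rate, which is the pattern in the 3D-LLM rows. This is why the Yes (%) column matters: a high recall can come entirely from a bias toward answering “Yes”.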

Appendix D Human Validation

Because our dataset generation process involves GPT-4V, there is a potential for hallucinations. We identify three types of possible hallucinations that could impact our dataset: the text might inaccurately describe an object’s property, such as color or size (termed incorrect object attribute); it might incorrectly depict the spatial relationship between two objects (termed incorrect spatial relation); or it might describe an object that does not exist in the referenced scene at all (termed incorrect object existence). Additionally, inaccuracies in our dataset may also arise from incorrectly grounding the wrong object.

To validate our dataset against these potential failures, we verify a subset of our data through crowdsourcing to measure the frequency of these failure cases.

D.1 Crowd-sourcing

We crowd-source the validation of annotations using Hive, a platform commonly used for sourcing annotations for computer vision tasks. The platform can be accessed at https://thehive.ai/.

We conceptualize our dataset validation as a data annotation problem, employing scene-text pairs as the data unit. Annotators are instructed to label these pairs as “True” or “False” to indicate the presence or absence of hallucinations or inaccuracies. Additionally, a “Cannot Decide” option is provided to accommodate cases where the scene view is unclear.

D.1.1 Task Generation

Hive only supports presenting static images to annotators, so we generate annotation tasks by composing snapshots of a scene with corresponding text annotations. For each task, we take snapshots from four different angles and pair them with a corresponding annotation. To maintain simplicity and conciseness, we require validation of just one sentence per task, providing some surrounding context and highlighting the target sentence. For grounding validation, the grounded object is outlined in the scene with a bounding box, and the referring phrase in the sentence is emphasized. An example of such a task, along with the annotation interface, is depicted in Figure 8. Figure 9 displays two text validation tasks and two grounding validation tasks that were presented to annotators.

D.1.2 Crowd-sourcing Validity

Validating a dataset necessitates a high level of attention from annotators. We curate sets of instructions, qualifying tasks, and honeypot tasks to ensure that the annotations obtained from crowdsourcing are reliable. The crowdsourcing process is illustrated in Figure 10.

Before presenting any tasks to the workers, we present them with a set of instruction tasks that show an example annotation, the correct response (as determined by us), and the reason why that response is correct. They are paired with an incorrect example and an explanation of why it is incorrect in order to ensure unbiased annotations. Examples of qualifying instructions are shown in Figure 11. These instructions are intentionally brief, as we found through trial-and-error that longer, paragraph-based instructions were largely ignored by annotators.

Qualifying tasks are presented to the annotators before they are shown any real tasks in order to train them to complete the real task with high accuracy. For every qualifier, annotators are shown both the correct answer and the reasoning as to why it is correct. We set the minimum qualifier accuracy to 0.75 to ensure that annotators must achieve a minimum competency before annotating real tasks. Every dataset is given between 12 and 30 specially crafted qualifying tasks that demonstrate the possible inaccuracies that could appear in the data. These qualifiers are divided equally between true and false examples so as not to bias workers towards any one answer.

Honeypot tasks are randomly mixed in with real tasks in order to ensure that annotators are maintaining a high quality of annotations throughout the entire job. Because we annotate the honeypot tasks before showing them to annotators, we are able to evaluate any given worker’s accuracy on honeypot tasks. We set the minimum honeypot accuracy to 0.89 to ensure that annotators are maintaining correct annotations. Workers that do not maintain this accuracy are banned from annotating our tasks. This is higher than the required accuracy for qualifiers because we expect annotators to already be well trained in our annotation tasks from the instructions and qualifiers. Every data type is given between 18 and 35 honeypot tasks. The honeypots are also approximately divided equally between true and false examples so that workers who consistently select a single answer without paying attention to the task (e.g., someone who always selects “True”) will be banned.
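The two thresholds can be illustrated with a short sketch. The thresholds (0.75 and 0.89) come from the text; how and when the platform evaluates them is not described, so this sketch simply computes accuracy over all gold-labeled tasks a worker has answered. Function names are hypothetical and do not correspond to any Hive API.

```python
QUALIFIER_MIN_ACCURACY = 0.75  # from the text
HONEYPOT_MIN_ACCURACY = 0.89   # from the text


def accuracy(answers, gold):
    """Fraction of answers that match the gold labels."""
    if not gold:
        raise ValueError("need at least one gold-labeled task")
    correct = sum(1 for a, g in zip(answers, gold) if a == g)
    return correct / len(gold)


def passes_qualifier(answers, gold):
    return accuracy(answers, gold) >= QUALIFIER_MIN_ACCURACY


def is_banned(answers, gold):
    return accuracy(answers, gold) < HONEYPOT_MIN_ACCURACY


# 18 honeypots, evenly split between "True" and "False".
gold = ["True", "False"] * 9
always_true = ["True"] * 18

one_mistake = list(gold)
one_mistake[0] = "False"      # 17/18 correct

two_mistakes = list(one_mistake)
two_mistakes[1] = "True"      # 16/18 correct

print(is_banned(always_true, gold))   # accuracy 0.5
print(is_banned(one_mistake, gold))   # accuracy ~0.944
print(is_banned(two_mistakes, gold))  # accuracy ~0.889
```

With 18 balanced honeypots, a worker who always answers “True” scores 0.5 and is banned, and under this reading even two mistakes (16/18 ≈ 0.889) fall just below the 0.89 threshold.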

To further ensure high-quality annotations, we send each question to 3 different annotators and only accept an annotation if at least 2 out of the 3 annotators agree with each other on the truthfulness of an item. If agreement is not reached, the task is returned as inconclusive.
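The agreement rule can be sketched as a simple vote count. We assume here that only “True” and “False” votes count toward agreement on truthfulness, so two “Cannot Decide” votes yield an inconclusive task; the text does not spell out this case. The accuracy helper mirrors the definition in the Table 7 caption (number of “True” responses over the total number of tasks). Function names are ours.

```python
from collections import Counter


def aggregate_votes(votes, min_agreement=2):
    """Return "True" or "False" if enough annotators agree, else "Inconclusive"."""
    counts = Counter(v for v in votes if v in ("True", "False"))
    for label, count in counts.items():
        if count >= min_agreement:
            return label
    return "Inconclusive"


def truthfulness_accuracy(final_labels):
    """Share of tasks whose aggregated label is "True"."""
    return sum(1 for label in final_labels if label == "True") / len(final_labels)


tasks = [
    ["True", "True", "False"],
    ["False", "Cannot Decide", "False"],
    ["True", "False", "Cannot Decide"],
    ["True", "True", "True"],
]
final = [aggregate_votes(v) for v in tasks]
print(final)
print(truthfulness_accuracy(final))
```

Because the denominator is the total number of tasks, inconclusive tasks count against accuracy in the same way as tasks labeled “False”.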

D.2 Results

We perform validation on 10,200 room-annotation pairs. From each of the three data types, 1,700 pairs are sampled for validation of both text truthfulness and grounding accuracy. A subset of 800 rooms is uniformly chosen, with 400 designated for text truthfulness and another 400 for grounding accuracy. The text data is uniformly sampled from these rooms. We report accuracies for both text truthfulness and grounding accuracy in Table 7.

We report comprehensive statistics from the annotation process in Table 8. We observe a very low qualifier pass rate, ranging from 11% to 20% across the different tasks in our data, suggesting that our qualifiers were effective in allowing only the most attentive annotators to qualify for real tasks. In addition, none of these annotators were banned due to honeypots. This increases our confidence that our qualification process is effective in training annotators and filtering out those who were not attentive. We also observe that workers spend roughly the same time on real tasks and honeypot tasks, suggesting that the honeypots are indistinguishable from real tasks for the annotators. This further supports the validity of our annotations.

Table 7: Text Truthfulness and Grounding Accuracy from crowdsourcing. Accuracy is computed by dividing the number of “True” responses by the total number of tasks (1700).

Method	Text Truthfulness	Grounding Accuracy
Grounded Scene Description	0.877	0.944
Grounded QA	0.852	0.956
Grounded Object Reference	0.863	0.918

Table 8: Comprehensive annotation metrics. Includes qualifier pass rate, honeypot count, honeypot ban rate, percent of tasks marked inconclusive (where workers could not come to an agreement on the label), and the average time that workers spend on both real tasks and honeypot tasks. Each dataset was evaluated on 1700 annotations. At least 2 workers must agree on the label for an annotation to be considered valid.

Category	Type	Qualifier Pass Rate (%)	# Honeypots	Honeypot Ban Rate (%)	Inconclusive Tasks (%)	Avg. Real Task Speed (s)	Avg. Honeypot Speed (s)
Text Accuracy	Scene Description	17	35	0	2.03	17.91	17.08
	QA	19	18	0	1.35	16.22	16.64
	Object Reference	10	38	0	0.82	19.88	17.57
Grounding Accuracy	Scene Description	20	18	0	1.66	9.69	12.84
	QA	16	18	0	1.11	6.65	11.14
	Object Reference	20	18	0	1.88	9.69	12.84
