-
OpenSUN3D: 1st Workshop Challenge on Open-Vocabulary 3D Scene Understanding
Authors:
Francis Engelmann,
Ayca Takmaz,
Jonas Schult,
Elisabetta Fedele,
Johanna Wald,
Songyou Peng,
Xi Wang,
Or Litany,
Siyu Tang,
Federico Tombari,
Marc Pollefeys,
Leonidas Guibas,
Hongbo Tian,
Chunjie Wang,
Xiaosheng Yan,
Bingwen Wang,
Xuanyang Zhang,
Xiao Liu,
Phuc Nguyen,
Khoi Nguyen,
Anh Tran,
Cuong Pham,
Zhening Huang,
Xiaoyang Wu,
Xi Chen
, et al. (3 additional authors not shown)
Abstract:
This report provides an overview of the challenge hosted at the OpenSUN3D Workshop on Open-Vocabulary 3D Scene Understanding held in conjunction with ICCV 2023. The goal of this workshop series is to provide a platform for exploration and discussion of open-vocabulary 3D scene understanding tasks, including but not limited to segmentation, detection and mapping. We provide an overview of the challenge hosted at the workshop, present the challenge dataset, the evaluation methodology, and brief descriptions of the winning methods. For additional details, please see https://opensun3d.github.io/index_iccv23.html.
Submitted 17 March, 2024; v1 submitted 23 February, 2024;
originally announced February 2024.
-
Segment3D: Learning Fine-Grained Class-Agnostic 3D Segmentation without Manual Labels
Authors:
Rui Huang,
Songyou Peng,
Ayca Takmaz,
Federico Tombari,
Marc Pollefeys,
Shiji Song,
Gao Huang,
Francis Engelmann
Abstract:
Current 3D scene segmentation methods are heavily dependent on manually annotated 3D training datasets. Such manual annotations are labor-intensive, and often lack fine-grained details. Importantly, models trained on this data typically struggle to recognize object classes beyond the annotated classes, i.e., they do not generalize well to unseen domains and require additional domain-specific annotations. In contrast, 2D foundation models demonstrate strong generalization and impressive zero-shot abilities, inspiring us to incorporate these characteristics from 2D models into 3D models. Therefore, we explore the use of image segmentation foundation models to automatically generate training labels for 3D segmentation. We propose Segment3D, a method for class-agnostic 3D scene segmentation that produces high-quality 3D segmentation masks. It improves over existing 3D segmentation models (especially on fine-grained masks), and enables easily adding new training data to further boost the segmentation performance -- all without the need for manual training labels.
Submitted 28 December, 2023;
originally announced December 2023.
-
OpenMask3D: Open-Vocabulary 3D Instance Segmentation
Authors:
Ayça Takmaz,
Elisabetta Fedele,
Robert W. Sumner,
Marc Pollefeys,
Federico Tombari,
Francis Engelmann
Abstract:
We introduce the task of open-vocabulary 3D instance segmentation. Current approaches for 3D instance segmentation can typically only recognize object categories from a pre-defined closed set of classes that are annotated in the training datasets. This results in important limitations for real-world applications where one might need to perform tasks guided by novel, open-vocabulary queries related to a wide variety of objects. Recently, open-vocabulary 3D scene understanding methods have emerged to address this problem by learning queryable features for each point in the scene. While such a representation can be directly employed to perform semantic segmentation, existing methods cannot separate multiple object instances. In this work, we address this limitation, and propose OpenMask3D, which is a zero-shot approach for open-vocabulary 3D instance segmentation. Guided by predicted class-agnostic 3D instance masks, our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings. Experiments and ablation studies on ScanNet200 and Replica show that OpenMask3D outperforms other open-vocabulary methods, especially on the long-tail distribution. Qualitative experiments further showcase OpenMask3D's ability to segment object properties based on free-form queries describing geometry, affordances, and materials.
Submitted 29 October, 2023; v1 submitted 23 June, 2023;
originally announced June 2023.
-
3D Segmentation of Humans in Point Clouds with Synthetic Data
Authors:
Ayça Takmaz,
Jonas Schult,
Irem Kaftan,
Mertcan Akçay,
Bastian Leibe,
Robert Sumner,
Francis Engelmann,
Siyu Tang
Abstract:
Segmenting humans in 3D indoor scenes has become increasingly important with the rise of human-centered robotics and AR/VR applications. To this end, we propose the task of joint 3D human semantic segmentation, instance segmentation and multi-human body-part segmentation. Few works have attempted to directly segment humans in cluttered 3D scenes, which is largely due to the lack of annotated training data of humans interacting with 3D scenes. We address this challenge and propose a framework for generating training data of synthetic humans interacting with real 3D scenes. Furthermore, we propose a novel transformer-based model, Human3D, which is the first end-to-end model for segmenting multiple human instances and their body-parts in a unified manner. The key advantage of our synthetic data generation framework is its ability to generate diverse and realistic human-scene interactions, with highly accurate ground truth. Our experiments show that pre-training on synthetic data improves performance on a wide variety of 3D human segmentation tasks. Finally, we demonstrate that Human3D outperforms even task-specific state-of-the-art 3D segmentation methods.
Submitted 18 August, 2023; v1 submitted 1 December, 2022;
originally announced December 2022.
-
Unsupervised Monocular Depth Reconstruction of Non-Rigid Scenes
Authors:
Ayça Takmaz,
Danda Pani Paudel,
Thomas Probst,
Ajad Chhatkuli,
Martin R. Oswald,
Luc Van Gool
Abstract:
Monocular depth reconstruction of complex and dynamic scenes is a highly challenging problem. While for rigid scenes learning-based methods have been offering promising results even in unsupervised cases, there exists little to no literature addressing the same for dynamic and deformable scenes. In this work, we present an unsupervised monocular framework for dense depth estimation of dynamic scenes, which jointly reconstructs rigid and non-rigid parts without explicitly modelling the camera motion. Using dense correspondences, we derive a training objective that aims to opportunistically preserve pairwise distances between reconstructed 3D points. In this process, the dense depth map is learned implicitly using the as-rigid-as-possible hypothesis. Our method provides promising results, demonstrating its capability of reconstructing 3D from challenging videos of non-rigid scenes. Furthermore, the proposed method also provides unsupervised motion segmentation results as an auxiliary output.
Submitted 28 October, 2021; v1 submitted 31 December, 2020;
originally announced December 2020.