Search | arXiv e-print repository

ALSTER: A Local Spatio-Temporal Expert for Online 3D Semantic Reconstruction

Authors: Silvan Weder, Francis Engelmann, Johannes L. Schönberger, Akihito Seki, Marc Pollefeys, Martin R. Oswald

Abstract: We propose an online 3D semantic segmentation method that incrementally reconstructs a 3D semantic map from a stream of RGB-D frames. Unlike offline methods, ours is directly applicable to scenarios with real-time constraints, such as robotics or mixed reality. To overcome the inherent challenges of online methods, we make two main contributions. First, to effectively extract information from the… ▽ More We propose an online 3D semantic segmentation method that incrementally reconstructs a 3D semantic map from a stream of RGB-D frames. Unlike offline methods, ours is directly applicable to scenarios with real-time constraints, such as robotics or mixed reality. To overcome the inherent challenges of online methods, we make two main contributions. First, to effectively extract information from the input RGB-D video stream, we jointly estimate geometry and semantic labels per frame in 3D. A key focus of our approach is to reason about semantic entities both in the 2D input and the local 3D domain to leverage differences in spatial context and network architectures. Our method predicts 2D features using an off-the-shelf segmentation network. The extracted 2D features are refined by a lightweight 3D network to enable reasoning about the local 3D structure. Second, to efficiently deal with an infinite stream of input RGB-D frames, a subsequent network serves as a temporal expert predicting the incremental scene updates by leveraging 2D, 3D, and past information in a learned manner. These updates are then integrated into a global scene representation. Using these main contributions, our method can enable scenarios with real-time constraints and can scale to arbitrary scene sizes by processing and updating the scene only in a local region defined by the new measurement. Our experiments demonstrate improved results compared to existing online methods that purely operate in local regions and show that complementary sources of information can boost the performance. We provide a thorough ablation study on the benefits of different architectural as well as algorithmic design decisions. Our method yields competitive results on the popular ScanNet benchmark and SceneNN dataset. △ Less

Submitted 3 December, 2023; v1 submitted 29 November, 2023; originally announced November 2023.

arXiv:2210.10770 [pdf, other]

LaMAR: Benchmarking Localization and Map** for Augmented Reality

Authors: Paul-Edouard Sarlin, Mihai Dusmanu, Johannes L. Schönberger, Pablo Speciale, Lukas Gruber, Viktor Larsson, Ondrej Miksik, Marc Pollefeys

Abstract: Localization and map** is the foundational technology for augmented reality (AR) that enables sharing and persistence of digital content in the real world. While significant progress has been made, researchers are still mostly driven by unrealistic benchmarks not representative of real-world AR scenarios. These benchmarks are often based on small-scale datasets with low scene diversity, captured… ▽ More Localization and map** is the foundational technology for augmented reality (AR) that enables sharing and persistence of digital content in the real world. While significant progress has been made, researchers are still mostly driven by unrealistic benchmarks not representative of real-world AR scenarios. These benchmarks are often based on small-scale datasets with low scene diversity, captured from stationary cameras, and lack other sensor inputs like inertial, radio, or depth data. Furthermore, their ground-truth (GT) accuracy is mostly insufficient to satisfy AR requirements. To close this gap, we introduce LaMAR, a new benchmark with a comprehensive capture and GT pipeline that co-registers realistic trajectories and sensor streams captured by heterogeneous AR devices in large, unconstrained scenes. To establish an accurate GT, our pipeline robustly aligns the trajectories against laser scans in a fully automated manner. As a result, we publish a benchmark dataset of diverse and large-scale scenes recorded with head-mounted and hand-held AR devices. We extend several state-of-the-art methods to take advantage of the AR-specific setup and evaluate them on our benchmark. The results offer new insights on current research and reveal promising avenues for future work in the field of localization and map** for AR. △ Less

Submitted 19 October, 2022; originally announced October 2022.

Comments: Accepted at ECCV 2022, website at https://lamar.ethz.ch/

arXiv:2109.10165 [pdf, other]

doi 10.1109/ICRA46639.2022.9811877

Panoptic Multi-TSDFs: a Flexible Representation for Online Multi-resolution Volumetric Map** and Long-term Dynamic Scene Consistency

Authors: Lukas Schmid, Jeffrey Delmerico, Johannes Schönberger, Juan Nieto, Marc Pollefeys, Roland Siegwart, Cesar Cadena

Abstract: For robotic interaction in environments shared with other agents, access to volumetric and semantic maps of the scene is crucial. However, such environments are inevitably subject to long-term changes, which the map needs to account for. We thus propose panoptic multi-TSDFs as a novel representation for multi-resolution volumetric map** in changing environments. By leveraging high-level informat… ▽ More For robotic interaction in environments shared with other agents, access to volumetric and semantic maps of the scene is crucial. However, such environments are inevitably subject to long-term changes, which the map needs to account for. We thus propose panoptic multi-TSDFs as a novel representation for multi-resolution volumetric map** in changing environments. By leveraging high-level information for 3D reconstruction, our proposed system allocates high resolution only where needed. Through reasoning on the object level, semantic consistency over time is achieved. This enables our method to maintain up-to-date reconstructions with high accuracy while improving coverage by incorporating previous data. We show in thorough experimental evaluation that our map can be efficiently constructed, maintained, and queried during online operation, and that the presented approach can operate robustly on real depth sensors using non-optimized panoptic segmentation as input. △ Less

Submitted 22 February, 2022; v1 submitted 21 September, 2021; originally announced September 2021.

Comments: Accepted for ICRA 2022. 7 pages, 8 figures. Code: https://github.com/ethz-asl/panoptic_map**, Video: https://www.youtube.com/watch?v=A7o2Vy7_TV4

Journal ref: 2022 International Conference on Robotics and Automation (ICRA), 2022, pp. 8018-8024

arXiv:2109.04409 [pdf, other]

Reconstructing and grounding narrated instructional videos in 3D

Authors: Dimitri Zhukov, Ignacio Rocco, Ivan Laptev, Josef Sivic, Johannes L. Schönberger, Bugra Tekin, Marc Pollefeys

Abstract: Narrated instructional videos often show and describe manipulations of similar objects, e.g., repairing a particular model of a car or laptop. In this work we aim to reconstruct such objects and to localize associated narrations in 3D. Contrary to the standard scenario of instance-level 3D reconstruction, where identical objects or scenes are present in all views, objects in different instructiona… ▽ More Narrated instructional videos often show and describe manipulations of similar objects, e.g., repairing a particular model of a car or laptop. In this work we aim to reconstruct such objects and to localize associated narrations in 3D. Contrary to the standard scenario of instance-level 3D reconstruction, where identical objects or scenes are present in all views, objects in different instructional videos may have large appearance variations given varying conditions and versions of the same product. Narrations may also have large variation in natural language expressions. We address these challenges by three contributions. First, we propose an approach for correspondence estimation combining learnt local features and dense flow. Second, we design a two-step divide and conquer reconstruction approach where the initial 3D reconstructions of individual videos are combined into a 3D alignment graph. Finally, we propose an unsupervised approach to ground natural language in obtained 3D reconstructions. We demonstrate the effectiveness of our approach for the domain of car maintenance. Given raw instructional videos and no manual supervision, our method successfully reconstructs engines of different car models and associates textual descriptions with corresponding objects in 3D. △ Less

Submitted 10 September, 2021; v1 submitted 9 September, 2021; originally announced September 2021.

arXiv:2012.01377 [pdf, other]

Cross-Descriptor Visual Localization and Map**

Authors: Mihai Dusmanu, Ondrej Miksik, Johannes L. Schönberger, Marc Pollefeys

Abstract: Visual localization and map** is the key technology underlying the majority of mixed reality and robotics systems. Most state-of-the-art approaches rely on local features to establish correspondences between images. In this paper, we present three novel scenarios for localization and map** which require the continuous update of feature representations and the ability to match across different… ▽ More Visual localization and map** is the key technology underlying the majority of mixed reality and robotics systems. Most state-of-the-art approaches rely on local features to establish correspondences between images. In this paper, we present three novel scenarios for localization and map** which require the continuous update of feature representations and the ability to match across different feature types. While localization and map** is a fundamental computer vision problem, the traditional setup supposes the same local features are used throughout the evolution of a map. Thus, whenever the underlying features are changed, the whole process is repeated from scratch. However, this is typically impossible in practice, because raw images are often not stored and re-building the maps could lead to loss of the attached digital content. To overcome the limitations of current approaches, we present the first principled solution to cross-descriptor localization and map**. Our data-driven approach is agnostic to the feature descriptor type, has low computational requirements, and scales linearly with the number of description algorithms. Extensive experiments demonstrate the effectiveness of our approach on state-of-the-art benchmarks for a variety of handcrafted and learned features. △ Less

Submitted 21 September, 2021; v1 submitted 2 December, 2020; originally announced December 2020.

Comments: Accepted at ICCV 2021. 18 pages, 15 figures, 6 tables

arXiv:2011.14791 [pdf, other]

NeuralFusion: Online Depth Fusion in Latent Space

Authors: Silvan Weder, Johannes L. Schönberger, Marc Pollefeys, Martin R. Oswald

Abstract: We present a novel online depth map fusion approach that learns depth map aggregation in a latent feature space. While previous fusion methods use an explicit scene representation like signed distance functions (SDFs), we propose a learned feature representation for the fusion. The key idea is a separation between the scene representation used for the fusion and the output scene representation, vi… ▽ More We present a novel online depth map fusion approach that learns depth map aggregation in a latent feature space. While previous fusion methods use an explicit scene representation like signed distance functions (SDFs), we propose a learned feature representation for the fusion. The key idea is a separation between the scene representation used for the fusion and the output scene representation, via an additional translator network. Our neural network architecture consists of two main parts: a depth and feature fusion sub-network, which is followed by a translator sub-network to produce the final surface representation (e.g. TSDF) for visualization or other tasks. Our approach is an online process, handles high noise levels, and is particularly able to deal with gross outliers common for photometric stereo-based depth maps. Experiments on real and synthetic data demonstrate improved results compared to the state of the art, especially in challenging scenarios with large amounts of noise and outliers. △ Less

Submitted 8 June, 2021; v1 submitted 30 November, 2020; originally announced November 2020.

arXiv:2008.11239 [pdf, other]

HoloLens 2 Research Mode as a Tool for Computer Vision Research

Authors: Dorin Ungureanu, Federica Bogo, Silvano Galliani, Pooja Sama, Xin Duan, Casey Meekhof, Jan Stühmer, Thomas J. Cashman, Bugra Tekin, Johannes L. Schönberger, Pawel Olszta, Marc Pollefeys

Abstract: Mixed reality headsets, such as the Microsoft HoloLens 2, are powerful sensing devices with integrated compute capabilities, which makes it an ideal platform for computer vision research. In this technical report, we present HoloLens 2 Research Mode, an API and a set of tools enabling access to the raw sensor streams. We provide an overview of the API and explain how it can be used to build mixed… ▽ More Mixed reality headsets, such as the Microsoft HoloLens 2, are powerful sensing devices with integrated compute capabilities, which makes it an ideal platform for computer vision research. In this technical report, we present HoloLens 2 Research Mode, an API and a set of tools enabling access to the raw sensor streams. We provide an overview of the API and explain how it can be used to build mixed reality applications based on processing sensor data. We also show how to combine the Research Mode sensor data with the built-in eye and hand tracking capabilities provided by HoloLens 2. By releasing the Research Mode API and a set of open-source tools, we aim to foster further research in the fields of computer vision as well as robotics and encourage contributions from the research community. △ Less

Submitted 25 August, 2020; originally announced August 2020.

arXiv:2006.06634 [pdf, other]

Privacy-Preserving Image Features via Adversarial Affine Subspace Embeddings

Authors: Mihai Dusmanu, Johannes L. Schönberger, Sudipta N. Sinha, Marc Pollefeys

Abstract: Many computer vision systems require users to upload image features to the cloud for processing and storage. These features can be exploited to recover sensitive information about the scene or subjects, e.g., by reconstructing the appearance of the original image. To address this privacy concern, we propose a new privacy-preserving feature representation. The core idea of our work is to drop const… ▽ More Many computer vision systems require users to upload image features to the cloud for processing and storage. These features can be exploited to recover sensitive information about the scene or subjects, e.g., by reconstructing the appearance of the original image. To address this privacy concern, we propose a new privacy-preserving feature representation. The core idea of our work is to drop constraints from each feature descriptor by embedding it within an affine subspace containing the original feature as well as adversarial feature samples. Feature matching on the privacy-preserving representation is enabled based on the notion of subspace-to-subspace distance. We experimentally demonstrate the effectiveness of our method and its high practical relevance for the applications of visual localization and map** as well as face authentication. Compared to the original features, our approach makes it significantly more difficult for an adversary to recover private information. △ Less

Submitted 30 March, 2021; v1 submitted 11 June, 2020; originally announced June 2020.

Comments: Accepted at CVPR 2021. 16 pages, 10 figures, 4 tables

arXiv:2003.08348 [pdf, other]

Multi-View Optimization of Local Feature Geometry

Authors: Mihai Dusmanu, Johannes L. Schönberger, Marc Pollefeys

Abstract: In this work, we address the problem of refining the geometry of local image features from multiple views without known scene or camera geometry. Current approaches to local feature detection are inherently limited in their keypoint localization accuracy because they only operate on a single view. This limitation has a negative impact on downstream tasks such as Structure-from-Motion, where inaccu… ▽ More In this work, we address the problem of refining the geometry of local image features from multiple views without known scene or camera geometry. Current approaches to local feature detection are inherently limited in their keypoint localization accuracy because they only operate on a single view. This limitation has a negative impact on downstream tasks such as Structure-from-Motion, where inaccurate keypoints lead to large errors in triangulation and camera localization. Our proposed method naturally complements the traditional feature extraction and matching paradigm. We first estimate local geometric transformations between tentative matches and then optimize the keypoint locations over multiple views jointly according to a non-linear least squares formulation. Throughout a variety of experiments, we show that our method consistently improves the triangulation and camera localization performance for both hand-crafted and learned local features. △ Less

Submitted 22 July, 2020; v1 submitted 18 March, 2020; originally announced March 2020.

Comments: Accepted at ECCV 2020. 28 pages, 11 figures, 6 tables

arXiv:2001.04388 [pdf, other]

RoutedFusion: Learning Real-time Depth Map Fusion

Authors: Silvan Weder, Johannes L. Schönberger, Marc Pollefeys, Martin R. Oswald

Abstract: The efficient fusion of depth maps is a key part of most state-of-the-art 3D reconstruction methods. Besides requiring high accuracy, these depth fusion methods need to be scalable and real-time capable. To this end, we present a novel real-time capable machine learning-based method for depth map fusion. Similar to the seminal depth map fusion approach by Curless and Levoy, we only update a local… ▽ More The efficient fusion of depth maps is a key part of most state-of-the-art 3D reconstruction methods. Besides requiring high accuracy, these depth fusion methods need to be scalable and real-time capable. To this end, we present a novel real-time capable machine learning-based method for depth map fusion. Similar to the seminal depth map fusion approach by Curless and Levoy, we only update a local group of voxels to ensure real-time capability. Instead of a simple linear fusion of depth information, we propose a neural network that predicts non-linear updates to better account for typical fusion errors. Our network is composed of a 2D depth routing network and a 3D depth fusion network which efficiently handle sensor-specific noise and outliers. This is especially useful for surface edges and thin objects for which the original approach suffers from thickening artifacts. Our method outperforms the traditional fusion approach and related learned approaches on both synthetic and real data. We demonstrate the performance of our method in reconstructing fine geometric details from noise and outlier contaminated data on various scenes. △ Less

Submitted 3 April, 2020; v1 submitted 13 January, 2020; originally announced January 2020.

Comments: 11 pages, 8 figures, accepted to CVPR 2020

arXiv:1903.05572 [pdf, other]

Privacy Preserving Image-Based Localization

Authors: Pablo Speciale, Johannes L. Schönberger, Sing Bing Kang, Sudipta N. Sinha, Marc Pollefeys

Abstract: Image-based localization is a core component of many augmented/mixed reality (AR/MR) and autonomous robotic systems. Current localization systems rely on the persistent storage of 3D point clouds of the scene to enable camera pose estimation, but such data reveals potentially sensitive scene information. This gives rise to significant privacy risks, especially as for many applications 3D map** i… ▽ More Image-based localization is a core component of many augmented/mixed reality (AR/MR) and autonomous robotic systems. Current localization systems rely on the persistent storage of 3D point clouds of the scene to enable camera pose estimation, but such data reveals potentially sensitive scene information. This gives rise to significant privacy risks, especially as for many applications 3D map** is a background process that the user might not be fully aware of. We pose the following question: How can we avoid disclosing confidential information about the captured 3D scene, and yet allow reliable camera pose estimation? This paper proposes the first solution to what we call privacy preserving image-based localization. The key idea of our approach is to lift the map representation from a 3D point cloud to a 3D line cloud. This novel representation obfuscates the underlying scene geometry while providing sufficient geometric constraints to enable robust and accurate 6-DOF camera pose estimation. Extensive experiments on several datasets and localization scenarios underline the high practical relevance of our proposed approach. △ Less

Submitted 13 March, 2019; originally announced March 2019.

arXiv:1712.05773 [pdf, other]

Semantic Visual Localization

Authors: Johannes L. Schönberger, Marc Pollefeys, Andreas Geiger, Torsten Sattler

Abstract: Robust visual localization under a wide range of viewing conditions is a fundamental problem in computer vision. Handling the difficult cases of this problem is not only very challenging but also of high practical relevance, e.g., in the context of life-long localization for augmented reality or autonomous robots. In this paper, we propose a novel approach based on a joint 3D geometric and semanti… ▽ More Robust visual localization under a wide range of viewing conditions is a fundamental problem in computer vision. Handling the difficult cases of this problem is not only very challenging but also of high practical relevance, e.g., in the context of life-long localization for augmented reality or autonomous robots. In this paper, we propose a novel approach based on a joint 3D geometric and semantic understanding of the world, enabling it to succeed under conditions where previous approaches failed. Our method leverages a novel generative model for descriptor learning, trained on semantic scene completion as an auxiliary task. The resulting 3D descriptors are robust to missing observations by encoding high-level 3D geometric and semantic information. Experiments on several challenging large-scale localization datasets demonstrate reliable localization under extreme viewpoint, illumination, and geometry changes. △ Less

Submitted 16 April, 2018; v1 submitted 15 December, 2017; originally announced December 2017.

arXiv:1407.6245 [pdf, other]

doi 10.7717/peerj.453

scikit-image: Image processing in Python

Authors: Stefan van der Walt, Johannes L. Schönberger, Juan Nunez-Iglesias, François Boulogne, Joshua D. Warner, Neil Yager, Emmanuelle Gouillart, Tony Yu, the scikit-image contributors

Abstract: scikit-image is an image processing library that implements algorithms and utilities for use in research, education and industry applications. It is released under the liberal "Modified BSD" open source license, provides a well-documented API in the Python programming language, and is developed by an active, international team of collaborators. In this paper we highlight the advantages of open sou… ▽ More scikit-image is an image processing library that implements algorithms and utilities for use in research, education and industry applications. It is released under the liberal "Modified BSD" open source license, provides a well-documented API in the Python programming language, and is developed by an active, international team of collaborators. In this paper we highlight the advantages of open source to achieve the goals of the scikit-image library, and we showcase several real-world image processing applications that use scikit-image. △ Less

Submitted 23 July, 2014; originally announced July 2014.

Comments: Distributed under Creative Commons CC-BY 4.0. Published in PeerJ

Showing 1–13 of 13 results for author: Schönberger, J