-
DeepLiDAR: Deep Surface Normal Guided Depth Prediction for Outdoor Scene from Sparse LiDAR Data and Single Color Image
Authors:
Jiaxiong Qiu,
Zhaopeng Cui,
Yinda Zhang,
Xingdi Zhang,
Shuaicheng Liu,
Bing Zeng,
Marc Pollefeys
Abstract:
In this paper, we propose a deep learning architecture that produces accurate dense depth for the outdoor scene from a single color image and a sparse depth. Inspired by the indoor depth completion, our network estimates surface normals as the intermediate representation to produce dense depth, and can be trained end-to-end. With a modified encoder-decoder structure, our network effectively fuses the dense color image and the sparse LiDAR depth. To address outdoor-specific challenges, our network predicts a confidence mask to handle mixed LiDAR signals near foreground boundaries due to occlusion, and combines estimates from the color image and surface normals with learned attention maps to improve the depth accuracy, especially for distant areas. Extensive experiments demonstrate that our model improves upon the state-of-the-art performance on the KITTI depth completion benchmark. An ablation study shows the positive impact of each model component on the final performance, and comprehensive analysis shows that our model generalizes well to the input with higher sparsity or from indoor scenes.
Submitted 9 April, 2019; v1 submitted 2 December, 2018;
originally announced December 2018.
-
DGC-Net: Dense Geometric Correspondence Network
Authors:
Iaroslav Melekhov,
Aleksei Tiulpin,
Torsten Sattler,
Marc Pollefeys,
Esa Rahtu,
Juho Kannala
Abstract:
This paper addresses the challenge of dense pixel correspondence estimation between two images. This problem is closely related to the optical flow estimation task, where ConvNets (CNNs) have recently achieved significant progress. While optical flow methods produce very accurate results for the small pixel translation and limited appearance variation scenarios, they can hardly deal with the strong geometric transformations that we consider in this work. In this paper, we propose a coarse-to-fine CNN-based framework that can leverage the advantages of optical flow approaches and extend them to the case of large transformations providing dense and subpixel accurate estimates. It is trained on synthetic transformations and demonstrates very good performance on unseen, realistic data. Further, we apply our method to the problem of relative camera pose estimation and demonstrate that the model outperforms existing dense approaches.
Submitted 22 October, 2018; v1 submitted 19 October, 2018;
originally announced October 2018.
-
Episodic Curiosity through Reachability
Authors:
Nikolay Savinov,
Anton Raichuk,
Raphaël Marinier,
Damien Vincent,
Marc Pollefeys,
Timothy Lillicrap,
Sylvain Gelly
Abstract:
Rewards are sparse in the real world and most of today's reinforcement learning algorithms struggle with such sparsity. One solution to this problem is to allow the agent to create rewards for itself - thus making rewards dense and more suitable for learning. In particular, inspired by curious behaviour in animals, observing something novel could be rewarded with a bonus. Such bonus is summed up with the real task reward - making it possible for RL algorithms to learn from the combined reward. We propose a new curiosity method which uses episodic memory to form the novelty bonus. To determine the bonus, the current observation is compared with the observations in memory. Crucially, the comparison is done based on how many environment steps it takes to reach the current observation from those in memory - which incorporates rich information about environment dynamics. This allows us to overcome the known "couch-potato" issues of prior work - when the agent finds a way to instantly gratify itself by exploiting actions which lead to hardly predictable consequences. We test our approach in visually rich 3D environments in ViZDoom, DMLab and MuJoCo. In navigational tasks from ViZDoom and DMLab, our agent outperforms the state-of-the-art curiosity method ICM. In MuJoCo, an ant equipped with our curiosity module learns locomotion out of the first-person-view curiosity only.
Submitted 6 August, 2019; v1 submitted 4 October, 2018;
originally announced October 2018.
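The episodic novelty bonus described in the abstract above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract compares observations by how many environment steps separate them, whereas this sketch substitutes a plain cosine similarity between embedding vectors. The names `EpisodicMemory`, `novelty_threshold`, `combined_reward` and `scale` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class EpisodicMemory:
    """Stores embeddings of observations seen so far in the current episode.

    ASSUMPTION: cosine similarity stands in for the step-based reachability
    comparison mentioned in the abstract; a real system would use a learned
    comparator instead.
    """

    def __init__(self, novelty_threshold=0.9):
        self.novelty_threshold = novelty_threshold  # hypothetical parameter
        self.embeddings = []

    def bonus(self, embedding):
        """Return 1.0 if the observation looks novel w.r.t. memory, else 0.0."""
        closest = max(
            (cosine_similarity(embedding, m) for m in self.embeddings),
            default=-1.0,  # empty memory: everything is novel
        )
        if closest < self.novelty_threshold:
            self.embeddings.append(list(embedding))  # remember novel observations
            return 1.0
        return 0.0


def combined_reward(task_reward, bonus, scale=0.1):
    """Sum the task reward and the scaled curiosity bonus."""
    return task_reward + scale * bonus
```

An RL algorithm would then train on `combined_reward(task_reward, memory.bonus(embedding))` at each step, with the memory reset at the start of every episode.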
-
SurfelMeshing: Online Surfel-Based Mesh Reconstruction
Authors:
Thomas Schöps,
Torsten Sattler,
Marc Pollefeys
Abstract:
We address the problem of mesh reconstruction from live RGB-D video, assuming a calibrated camera and poses provided externally (e.g., by a SLAM system). In contrast to most existing approaches, we do not fuse depth measurements in a volume but in a dense surfel cloud. We asynchronously (re)triangulate the smoothed surfels to reconstruct a surface mesh. This novel approach makes it possible to maintain a dense surface representation of the scene during SLAM which can quickly adapt to loop closures. This is possible by deforming the surfel cloud and asynchronously remeshing the surface where necessary. The surfel-based representation also naturally supports strongly varying scan resolution. In particular, it reconstructs colors at the input camera's resolution. Moreover, in contrast to many volumetric approaches, ours can reconstruct thin objects since objects do not need to enclose a volume. We demonstrate our approach in a number of experiments, showing that it produces reconstructions that are competitive with the state-of-the-art, and we discuss its advantages and limitations. The algorithm (excluding loop closure functionality) is available as open source at https://github.com/puzzlepaint/surfelmeshing .
Submitted 20 November, 2019; v1 submitted 1 October, 2018;
originally announced October 2018.
-
Night-to-Day Image Translation for Retrieval-based Localization
Authors:
Asha Anoosheh,
Torsten Sattler,
Radu Timofte,
Marc Pollefeys,
Luc Van Gool
Abstract:
Visual localization is a key step in many robotics pipelines, allowing the robot to (approximately) determine its position and orientation in the world. An efficient and scalable approach to visual localization is to use image retrieval techniques. These approaches identify the image most similar to a query photo in a database of geo-tagged images and approximate the query's pose via the pose of the retrieved database image. However, image retrieval across drastically different illumination conditions, e.g. day and night, is still a problem with unsatisfactory results, even in this age of powerful neural models. This is due to a lack of a suitably diverse dataset with true correspondences to perform end-to-end learning. A recent class of neural models allows for realistic translation of images among visual domains with relatively little training data and, most importantly, without ground-truth pairings. In this paper, we explore the task of accurately localizing images captured from two traversals of the same area in both day and night. We propose ToDayGAN - a modified image-translation model to alter nighttime driving images to a more useful daytime representation. We then compare the daytime and translated night images to obtain a pose estimate for the night image using the known 6-DOF position of the closest day image. Our approach improves localization performance by over 250% compared to the current state-of-the-art, in the context of standard metrics in multiple categories.
Submitted 4 March, 2019; v1 submitted 25 September, 2018;
originally announced September 2018.
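The retrieval step described in the abstract above (approximating the query's pose by the pose of the most similar database image) can be illustrated with a minimal sketch. It assumes that each image has already been reduced to a fixed-length descriptor vector; how the descriptors are computed, and the translation of night images to day images, are outside this sketch. The function name `retrieve_pose` and the tuple layout of the database are hypothetical.

```python
import math


def retrieve_pose(query_descriptor, database):
    """Return the pose stored with the database entry nearest to the query.

    `database` is a list of (descriptor, pose) pairs. ASSUMPTION: plain
    Euclidean distance between descriptors is used as the similarity
    measure; the abstract does not specify the comparison.
    """
    if not database:
        raise ValueError("database must contain at least one image")
    _, pose = min(database, key=lambda entry: math.dist(query_descriptor, entry[0]))
    return pose
```

For example, with `database = [([0.0, 0.0], "pose_A"), ([1.0, 1.0], "pose_B")]`, the query `[0.9, 0.8]` is closer to the second descriptor, so its pose is the one returned.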
-
Efficient 2D-3D Matching for Multi-Camera Visual Localization
Authors:
Marcel Geppert,
Peidong Liu,
Zhaopeng Cui,
Marc Pollefeys,
Torsten Sattler
Abstract:
Visual localization, i.e., determining the position and orientation of a vehicle with respect to a map, is a key problem in autonomous driving. We present a multi-camera visual inertial localization algorithm for large-scale environments. To efficiently and effectively match features against a pre-built global 3D map, we propose a prioritized feature matching scheme for multi-camera systems. In contrast to existing works, designed for monocular cameras, we (1) tailor the prioritization function to the multi-camera setup and (2) run feature matching and pose estimation in parallel. This significantly accelerates the matching and pose estimation stages and allows us to dynamically adapt the matching efforts based on the surrounding environment. In addition, we show how pose priors can be integrated into the localization system to increase efficiency and robustness. Finally, we extend our algorithm by fusing the absolute pose estimates with motion estimates from a multi-camera visual inertial odometry pipeline (VIO). This results in a system that provides reliable and drift-less pose estimation. Extensive experiments show that our localization runs fast and robustly under varying conditions, and that our extended algorithm enables reliable real-time pose estimation.
Submitted 14 May, 2019; v1 submitted 17 September, 2018;
originally announced September 2018.
-
Real-Time Dense Mapping for Self-driving Vehicles using Fisheye Cameras
Authors:
Zhaopeng Cui,
Lionel Heng,
Ye Chuan Yeo,
Andreas Geiger,
Marc Pollefeys,
Torsten Sattler
Abstract:
We present a real-time dense geometric mapping algorithm for large-scale environments. Unlike existing methods which use pinhole cameras, our implementation is based on fisheye cameras which have larger field of view and benefit some other tasks including Visual-Inertial Odometry, localization and object detection around vehicles. Our algorithm runs on in-vehicle PCs at 15 Hz approximately, enabling vision-only 3D scene perception for self-driving vehicles. For each synchronized set of images captured by multiple cameras, we first compute a depth map for a reference camera using plane-sweeping stereo. To maintain both accuracy and efficiency, while accounting for the fact that fisheye images have a rather low resolution, we recover the depths using multiple image resolutions. We adopt the fast object detection framework YOLOv3 to remove potentially dynamic objects. At the end of the pipeline, we fuse the fisheye depth images into the truncated signed distance function (TSDF) volume to obtain a 3D map. We evaluate our method on large-scale urban datasets, and results show that our method works well even in complex environments.
Submitted 18 April, 2019; v1 submitted 17 September, 2018;
originally announced September 2018.
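The final fusion step mentioned in the abstract above, merging depth images into a truncated signed distance function (TSDF) volume, is commonly done with a per-voxel weighted running average. The sketch below shows that generic update for a single voxel; it is an assumption about the general technique, not the authors' pipeline, and the names `truncated_sdf`, `fuse`, `truncation` and `new_weight` are hypothetical.

```python
def truncated_sdf(measured_depth, voxel_depth, truncation):
    """Signed distance from a voxel to the observed surface along the ray,
    normalized by the truncation band and clamped to [-1, 1].

    Positive values lie in front of the surface, negative values behind it.
    """
    sdf = measured_depth - voxel_depth
    return max(-1.0, min(1.0, sdf / truncation))


def fuse(tsdf, weight, new_tsdf, new_weight=1.0):
    """Weighted running average of the stored TSDF value and a new observation.

    Returns the updated (tsdf, weight) pair for the voxel.
    """
    total = weight + new_weight
    return (tsdf * weight + new_tsdf * new_weight) / total, total
```

Each new depth image would apply `fuse` to every voxel it observes; a practical implementation would typically also skip voxels that lie far behind the measured surface.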
-
Project AutoVision: Localization and 3D Scene Perception for an Autonomous Vehicle with a Multi-Camera System
Authors:
Lionel Heng,
Benjamin Choi,
Zhaopeng Cui,
Marcel Geppert,
Sixing Hu,
Benson Kuan,
Peidong Liu,
Rang Nguyen,
Ye Chuan Yeo,
Andreas Geiger,
Gim Hee Lee,
Marc Pollefeys,
Torsten Sattler
Abstract:
Project AutoVision aims to develop localization and 3D scene perception capabilities for a self-driving vehicle. Such capabilities will enable autonomous navigation in urban and rural environments, in day and night, and with cameras as the only exteroceptive sensors. The sensor suite employs many cameras for both 360-degree coverage and accurate multi-view stereo; the use of low-cost cameras keeps the cost of this sensor suite to a minimum. In addition, the project seeks to extend the operating envelope to include GNSS-less conditions which are typical for environments with tall buildings, foliage, and tunnels. Emphasis is placed on leveraging multi-view geometry and deep learning to enable the vehicle to localize and perceive in 3D space. This paper presents an overview of the project, and describes the sensor suite and current progress in the areas of calibration, localization, and perception.
Submitted 4 March, 2019; v1 submitted 14 September, 2018;
originally announced September 2018.
-
Uncertainty Quantification in CNN-Based Surface Prediction Using Shape Priors
Authors:
Katarína Tóthová,
Sarah Parisot,
Matthew C. H. Lee,
Esther Puyol-Antón,
Lisa M. Koch,
Andrew P. King,
Ender Konukoglu,
Marc Pollefeys
Abstract:
Surface reconstruction is a vital tool in a wide range of areas of medical image analysis and clinical research. Despite the fact that many methods have proposed solutions to the reconstruction problem, most, due to their deterministic nature, do not directly address the issue of quantifying uncertainty associated with their predictions. We remedy this by proposing a novel probabilistic deep learning approach capable of simultaneous surface reconstruction and associated uncertainty prediction. The method incorporates prior shape information in the form of a principal component analysis (PCA) model. Experiments using the UK Biobank data show that our probabilistic approach outperforms an analogous deterministic PCA-based method in the task of 2D organ delineation and quantifies uncertainty by formulating distributions over predicted surface vertex positions.
Submitted 30 July, 2018;
originally announced July 2018.
-
Hybrid Scene Compression for Visual Localization
Authors:
Federico Camposeco,
Andrea Cohen,
Marc Pollefeys,
Torsten Sattler
Abstract:
Localizing an image wrt. a 3D scene model represents a core task for many computer vision applications. An increasing number of real-world applications of visual localization on mobile devices, e.g., Augmented Reality or autonomous robots such as drones or self-driving cars, demand localization approaches to minimize storage and bandwidth requirements. Compressing the 3D models used for localization thus becomes a practical necessity. In this work, we introduce a new hybrid compression algorithm that uses a given memory limit in a more effective way. Rather than treating all 3D points equally, it represents a small set of points with full appearance information and an additional, larger set of points with compressed information. This enables our approach to obtain a more complete scene representation without increasing the memory requirements, leading to a superior performance compared to previous compression schemes. As part of our contribution, we show how to handle ambiguous matches arising from point compression during RANSAC. Besides outperforming previous compression techniques in terms of pose accuracy under the same memory constraints, our compression scheme itself is also more efficient. Furthermore, the localization rates and accuracy obtained with our approach are comparable to state-of-the-art feature-based methods, while using a small fraction of the memory.
Submitted 22 April, 2019; v1 submitted 19 July, 2018;
originally announced July 2018.
-
TrimBot2020: an outdoor robot for automatic gardening
Authors:
Nicola Strisciuglio,
Radim Tylecek,
Michael Blaich,
Nicolai Petkov,
Peter Bieber,
Jochen Hemming,
Eldert van Henten,
Torsten Sattler,
Marc Pollefeys,
Theo Gevers,
Thomas Brox,
Robert B. Fisher
Abstract:
Robots are increasingly present in modern industry and also in everyday life. Their applications range from health-related situations, for assistance to elderly people or in surgical operations, to automatic and driver-less vehicles (on wheels or flying) or for driving assistance. Recently, an interest towards robotics applied in agriculture and gardening has arisen, with applications to automatic seeding and cropping or to plant disease control, etc. Autonomous lawn mowers are successful market applications of gardening robotics. In this paper, we present a novel robot that is developed within the TrimBot2020 project, funded by the EU H2020 program. The project aims at prototyping the first outdoor robot for automatic bush trimming and rose pruning.
Submitted 15 May, 2018; v1 submitted 5 April, 2018;
originally announced April 2018.
-
InLoc: Indoor Visual Localization with Dense Matching and View Synthesis
Authors:
Hajime Taira,
Masatoshi Okutomi,
Torsten Sattler,
Mircea Cimpoi,
Marc Pollefeys,
Josef Sivic,
Tomas Pajdla,
Akihiko Torii
Abstract:
We seek to predict the 6 degree-of-freedom (6DoF) pose of a query photograph with respect to a large indoor 3D map. The contributions of this work are three-fold. First, we develop a new large-scale visual localization method targeted for indoor environments. The method proceeds along three steps: (i) efficient retrieval of candidate poses that ensures scalability to large-scale environments, (ii) pose estimation using dense matching rather than local features to deal with textureless indoor scenes, and (iii) pose verification by virtual view synthesis to cope with significant changes in viewpoint, scene layout, and occluders. Second, we collect a new dataset with reference 6DoF poses for large-scale indoor localization. Query photographs are captured by mobile phones at a different time than the reference 3D map, thus presenting a realistic indoor localization scenario. Third, we demonstrate that our method significantly outperforms current state-of-the-art indoor localization approaches on this new challenging data.
Submitted 8 April, 2018; v1 submitted 27 March, 2018;
originally announced March 2018.
-
Semantic Visual Localization
Authors:
Johannes L. Schönberger,
Marc Pollefeys,
Andreas Geiger,
Torsten Sattler
Abstract:
Robust visual localization under a wide range of viewing conditions is a fundamental problem in computer vision. Handling the difficult cases of this problem is not only very challenging but also of high practical relevance, e.g., in the context of life-long localization for augmented reality or autonomous robots. In this paper, we propose a novel approach based on a joint 3D geometric and semantic understanding of the world, enabling it to succeed under conditions where previous approaches failed. Our method leverages a novel generative model for descriptor learning, trained on semantic scene completion as an auxiliary task. The resulting 3D descriptors are robust to missing observations by encoding high-level 3D geometric and semantic information. Experiments on several challenging large-scale localization datasets demonstrate reliable localization under extreme viewpoint, illumination, and geometry changes.
Submitted 16 April, 2018; v1 submitted 15 December, 2017;
originally announced December 2017.
-
An Exploration of 2D and 3D Deep Learning Techniques for Cardiac MR Image Segmentation
Authors:
Christian F. Baumgartner,
Lisa M. Koch,
Marc Pollefeys,
Ender Konukoglu
Abstract:
Accurate segmentation of the heart is an important step towards evaluating cardiac function. In this paper, we present a fully automated framework for segmentation of the left (LV) and right (RV) ventricular cavities and the myocardium (Myo) on short-axis cardiac MR images. We investigate the suitability of various state-of-the-art 2D and 3D convolutional neural network architectures, as well as slight modifications thereof, for this task. Experiments were performed on the ACDC 2017 challenge training dataset comprising cardiac MR images of 100 patients, where manual reference segmentations were made available for end-diastolic (ED) and end-systolic (ES) frames. We find that processing the images in a slice-by-slice fashion using 2D networks is beneficial due to a relatively large slice thickness. However, the exact network architecture only plays a minor role. We report mean Dice coefficients of 0.950 (LV), 0.893 (RV), and 0.899 (Myo), respectively, with an average evaluation time of 1.1 seconds per volume on a modern GPU.
Submitted 10 October, 2017; v1 submitted 13 September, 2017;
originally announced September 2017.
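The Dice coefficient reported in the abstract above measures the overlap between a predicted and a reference segmentation as twice the size of their intersection divided by the sum of their sizes. A minimal sketch for one label follows; it assumes the label maps are given as flat, equal-length sequences of integer labels, and the function name `dice_coefficient` and the convention of returning 1.0 when the label is absent from both maps are choices made for this illustration.

```python
def dice_coefficient(prediction, reference, label):
    """Dice overlap for one label between two flat label maps.

    Dice = 2 * |P ∩ R| / (|P| + |R|), where P and R are the sets of
    positions carrying `label` in the prediction and the reference.
    """
    if len(prediction) != len(reference):
        raise ValueError("label maps must have the same length")
    pred_mask = [p == label for p in prediction]
    ref_mask = [r == label for r in reference]
    intersection = sum(p and r for p, r in zip(pred_mask, ref_mask))
    total = sum(pred_mask) + sum(ref_mask)
    # ASSUMPTION: a label absent from both maps counts as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

A mean Dice score per structure, as quoted in the abstract, would then be the average of this value over all evaluated volumes for that structure's label.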
-
3D Visual Perception for Self-Driving Cars using a Multi-Camera System: Calibration, Mapping, Localization, and Obstacle Detection
Authors:
Christian Häne,
Lionel Heng,
Gim Hee Lee,
Friedrich Fraundorfer,
Paul Furgale,
Torsten Sattler,
Marc Pollefeys
Abstract:
Cameras are a crucial exteroceptive sensor for self-driving cars as they are low-cost and small, provide appearance information about the environment, and work in various weather conditions. They can be used for multiple purposes such as visual navigation and obstacle detection. We can use a surround multi-camera system to cover the full 360-degree field-of-view around the car. In this way, we avoid blind spots which can otherwise lead to accidents. To minimize the number of cameras needed for surround perception, we utilize fisheye cameras. Consequently, standard vision pipelines for 3D mapping, visual localization, obstacle detection, etc. need to be adapted to take full advantage of the availability of multiple cameras rather than treat each camera individually. In addition, processing of fisheye images has to be supported. In this paper, we describe the camera calibration and subsequent processing pipeline for multi-fisheye-camera systems developed as part of the V-Charge project. This project seeks to enable automated valet parking for self-driving cars. Our pipeline is able to precisely calibrate multi-camera systems, build sparse 3D maps for visual navigation, visually localize the car with respect to these maps, generate accurate dense maps, as well as detect obstacles based on real-time depth map extraction.
Submitted 31 August, 2017;
originally announced August 2017.
-
Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions
Authors:
Torsten Sattler,
Will Maddern,
Carl Toft,
Akihiko Torii,
Lars Hammarstrand,
Erik Stenborg,
Daniel Safari,
Masatoshi Okutomi,
Marc Pollefeys,
Josef Sivic,
Fredrik Kahl,
Tomas Pajdla
Abstract:
Visual localization enables autonomous vehicles to navigate in their surroundings and augmented reality applications to link virtual to real worlds. Practical visual localization approaches need to be robust to a wide variety of viewing conditions, including day-night changes, as well as weather and seasonal variations, while providing highly accurate 6 degree-of-freedom (6DOF) camera pose estimates. In this paper, we introduce the first benchmark datasets specifically designed for analyzing the impact of such factors on visual localization. Using carefully created ground truth poses for query images taken under a wide variety of conditions, we evaluate the impact of various factors on 6DOF camera pose estimation accuracy through extensive experiments with state-of-the-art localization approaches. Based on our results, we draw conclusions about the difficulty of different conditions, showing that long-term localization is far from solved, and propose promising avenues for future work, including sequence-based localization approaches and the need for better local features. Our benchmark is available at visuallocalization.net.
Submitted 4 April, 2018; v1 submitted 27 July, 2017;
originally announced July 2017.
-
Slanted Stixels: Representing San Francisco's Steepest Streets
Authors:
Daniel Hernandez-Juarez,
Lukas Schneider,
Antonio Espinosa,
David Vázquez,
Antonio M. López,
Uwe Franke,
Marc Pollefeys,
Juan C. Moure
Abstract:
In this work we present a novel compact scene representation based on Stixels that infers geometric and semantic information. Our approach overcomes the previous rather restrictive geometric assumptions for Stixels by introducing a novel depth model to account for non-flat roads and slanted objects. Both semantic and depth cues are used jointly to infer the scene representation in a sound global energy minimization formulation. Furthermore, a novel approximation scheme is introduced that uses an extremely efficient over-segmentation. In doing so, the computational complexity of the Stixel inference algorithm is reduced significantly, achieving real-time computation capabilities with only a slight drop in accuracy. We evaluate the proposed approach in terms of semantic and geometric accuracy as well as run-time on four publicly available benchmark datasets. Our approach maintains accuracy on flat road scene datasets while improving substantially on a novel non-flat road dataset.
Submitted 17 July, 2017;
originally announced July 2017.
-
Information-Flow Matting
Authors:
Yağız Aksoy,
Tunç Ozan Aydın,
Marc Pollefeys
Abstract:
We present a novel, purely affinity-based natural image matting algorithm. Our method relies on carefully defined pixel-to-pixel connections that enable effective use of information available in the image. We control the information flow from the known-opacity regions into the unknown region, as well as within the unknown region itself, by utilizing multiple definitions of pixel affinities. Among other forms of information flow, we introduce color-mixture flow, which builds upon local linear embedding and effectively encapsulates the relation between different pixel opacities. Our resulting novel linear system formulation can be solved in closed-form and is robust against several fundamental challenges of natural matting such as holes and remote intricate structures. While our method is primarily designed as a standalone matting tool, we show that it can also be used for regularizing mattes obtained by sampling-based methods. The formulation is also extended to layer color estimation and we show that the use of multiple channels of flow increases the layer color quality. We also demonstrate our performance in green-screen keying and analyze the characteristics of the utilized affinities.
Submitted 12 April, 2019; v1 submitted 17 July, 2017;
originally announced July 2017.
-
Semantically Informed Multiview Surface Refinement
Authors:
Maros Blaha,
Mathias Rothermel,
Martin R. Oswald,
Torsten Sattler,
Audrey Richard,
Jan D. Wegner,
Marc Pollefeys,
Konrad Schindler
Abstract:
We present a method to jointly refine the geometry and semantic segmentation of 3D surface meshes. Our method alternates between updating the shape and the semantic labels. In the geometry refinement step, the mesh is deformed with variational energy minimization, such that it simultaneously maximizes photo-consistency and the compatibility of the semantic segmentations across a set of calibrated images. Label-specific shape priors account for interactions between the geometry and the semantic labels in 3D. In the semantic segmentation step, the labels on the mesh are updated with MRF inference, such that they are compatible with the semantic segmentations in the input images. Also, this step includes prior assumptions about the surface shape of different semantic classes. The priors induce a tight coupling, where semantic information influences the shape update and vice versa. Specifically, we introduce priors that favor (i) adaptive smoothing, depending on the class label; (ii) straightness of class boundaries; and (iii) semantic labels that are consistent with the surface orientation. The novel mesh-based reconstruction is evaluated in a series of experiments with real and synthetic data. We compare both to state-of-the-art, voxel-based semantic 3D reconstruction, and to purely geometric mesh refinement, and demonstrate that the proposed scheme yields improved 3D geometry as well as an improved semantic segmentation.
Submitted 26 June, 2017;
originally announced June 2017.
-
Matching neural paths: transfer from recognition to correspondence search
Authors:
Nikolay Savinov,
Lubor Ladicky,
Marc Pollefeys
Abstract:
Many machine learning tasks require finding per-part correspondences between objects. In this work we focus on low-level correspondences - a highly ambiguous matching problem. We propose to use a hierarchical semantic representation of the objects, coming from a convolutional neural network, to solve this ambiguity. Training it for low-level correspondence prediction directly might not be an option in some domains where the ground-truth correspondences are hard to obtain. We show how transfer from recognition can be used to avoid such training. Our idea is to mark parts as "matching" if their features are close to each other at all the levels of convolutional feature hierarchy (neural paths). Although the overall number of such paths is exponential in the number of layers, we propose a polynomial algorithm for aggregating all of them in a single backward pass. The empirical validation is done on the task of stereo correspondence and demonstrates that we achieve competitive results among the methods which do not use labeled target domain data.
Submitted 5 November, 2017; v1 submitted 19 May, 2017;
originally announced May 2017.
-
Semantic3D.net: A new Large-scale Point Cloud Classification Benchmark
Authors:
Timo Hackel,
Nikolay Savinov,
Lubor Ladicky,
Jan D. Wegner,
Konrad Schindler,
Marc Pollefeys
Abstract:
This paper presents a new 3D point cloud classification benchmark data set with over four billion manually labelled points, meant as input for data-hungry (deep) learning methods. We also discuss first submissions to the benchmark that use deep convolutional neural networks (CNNs) as a work horse, which already show remarkable performance improvements over state-of-the-art. CNNs have become the de-facto standard for many tasks in computer vision and machine learning like semantic segmentation or object detection in images, but have not yet led to a true breakthrough for 3D point cloud labelling tasks due to lack of training data. With the massive data set presented in this paper, we aim at closing this data gap to help unleash the full potential of deep learning methods for 3D labelling tasks. Our semantic3D.net data set consists of dense point clouds acquired with static terrestrial laser scanners. It contains 8 semantic classes and covers a wide range of urban outdoor scenes: churches, streets, railroad tracks, squares, villages, soccer fields and castles. We describe our labelling interface and show that our data set provides more dense and complete point clouds with much higher overall number of labelled points compared to those already available to the research community. We further provide baseline method descriptions and comparison between methods submitted to our online system. We hope semantic3D.net will pave the way for deep learning methods in 3D point cloud labelling to learn richer, more general 3D representations, and first submissions after only a few months indicate that this might indeed be the case.
Submitted 12 April, 2017;
originally announced April 2017.
-
The Stixel world: A medium-level representation of traffic scenes
Authors:
Marius Cordts,
Timo Rehfeld,
Lukas Schneider,
David Pfeiffer,
Markus Enzweiler,
Stefan Roth,
Marc Pollefeys,
Uwe Franke
Abstract:
Recent progress in advanced driver assistance systems and the race towards autonomous vehicles is mainly driven by two factors: (1) increasingly sophisticated algorithms that interpret the environment around the vehicle and react accordingly, and (2) the continuous improvements of sensor technology itself. In terms of cameras, these improvements typically include higher spatial resolution, which as a consequence requires more data to be processed. The trend to add multiple cameras to cover the entire surrounding of the vehicle is not conducive in that matter. At the same time, an increasing number of special purpose algorithms need access to the sensor input data to correctly interpret the various complex situations that can occur, particularly in urban traffic.
By observing those trends, it becomes clear that a key challenge for vision architectures in intelligent vehicles is to share computational resources. We believe this challenge should be faced by introducing a representation of the sensory data that provides compressed and structured access to all relevant visual content of the scene. The Stixel World discussed in this paper is such a representation. It is a medium-level model of the environment that is specifically designed to compress information about obstacles by leveraging the typical layout of outdoor traffic scenes. It has proven useful for a multitude of automotive vision applications, including object detection, tracking, segmentation, and mapping.
In this paper, we summarize the ideas behind the model and generalize it to take into account multiple dense input streams: the image itself, stereo depth maps, and semantic class probability maps that can be generated, e.g., by CNNs. Our generalization is embedded into a novel mathematical formulation for the Stixel model. We further sketch how the free parameters of the model can be learned using structured SVMs.
Submitted 2 April, 2017;
originally announced April 2017.
-
Quad-networks: unsupervised learning to rank for interest point detection
Authors:
Nikolay Savinov,
Akihito Seki,
Lubor Ladicky,
Torsten Sattler,
Marc Pollefeys
Abstract:
Several machine learning tasks require to represent the data using only a sparse set of interest points. An ideal detector is able to find the corresponding interest points even if the data undergo a transformation typical for a given domain. Since the task is of high practical interest in computer vision, many hand-crafted solutions were proposed. In this paper, we ask a fundamental question: can we learn such detectors from scratch? Since it is often unclear what points are "interesting", human labelling cannot be used to find a truly unbiased solution. Therefore, the task requires an unsupervised formulation. We are the first to propose such a formulation: training a neural network to rank points in a transformation-invariant manner. Interest points are then extracted from the top/bottom quantiles of this ranking. We validate our approach on two tasks: standard RGB image interest point detection and challenging cross-modal interest point detection between RGB and depth images. We quantitatively show that our unsupervised method performs better or on-par with baselines.
Submitted 10 April, 2017; v1 submitted 22 November, 2016;
originally announced November 2016.
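As a rough illustration of the extraction step described in the Quad-networks abstract above (interest points taken from the top and bottom quantiles of a learned ranking), the following Python sketch selects points from a small score map. The scores are made up, the names `quantile_interest_points` and `fraction` are hypothetical, and the ranking network that would produce the scores is not shown.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]


def quantile_interest_points(
    scores: Dict[Point, float], fraction: float
) -> Tuple[List[Point], List[Point]]:
    """Return (bottom, top) points of a ranking.

    `scores` maps a pixel location to its ranking score (assumed to come
    from some learned ranker). `fraction` is the share of points kept at
    each end of the ranking.
    """
    if not 0.0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5]")
    # Sort point locations by ascending score.
    ranked = sorted(scores, key=scores.get)
    # Keep at least one point at each end.
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


# Toy score map with four pixel locations (values are illustrative only).
scores = {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.5, (1, 1): -0.3}
bottom, top = quantile_interest_points(scores, fraction=0.25)
```

Both ends of the ranking are kept because the abstract mentions the top and bottom quantiles; how the fraction is chosen is left open here.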
-
Semantically Guided Depth Upsampling
Authors:
Nick Schneider,
Lukas Schneider,
Peter Pinggera,
Uwe Franke,
Marc Pollefeys,
Christoph Stiller
Abstract:
We present a novel method for accurate and efficient upsampling of sparse depth data, guided by high-resolution imagery. Our approach goes beyond the use of intensity cues only and additionally exploits object boundary cues through structured edge detection and semantic scene labeling for guidance. Both cues are combined within a geodesic distance measure that allows for boundary-preserving depth interpolation while utilizing local context. We model the observed scene structure by locally planar elements and formulate the upsampling task as a global energy minimization problem. Our method determines globally consistent solutions and preserves fine details and sharp depth boundaries. In our experiments on several public datasets at different levels of application, we demonstrate superior performance of our approach over the state-of-the-art, even for very sparse measurements.
Submitted 2 August, 2016;
originally announced August 2016.
-
TI-POOLING: transformation-invariant pooling for feature learning in Convolutional Neural Networks
Authors:
Dmitry Laptev,
Nikolay Savinov,
Joachim M. Buhmann,
Marc Pollefeys
Abstract:
In this paper we present a deep neural network topology that incorporates a simple to implement transformation invariant pooling operator (TI-POOLING). This operator is able to efficiently handle prior knowledge on nuisance variations in the data, such as rotation or scale changes. Most current methods usually make use of dataset augmentation to address this issue, but this requires larger number of model parameters and more training data, and results in significantly increased training time and larger chance of under- or overfitting. The main reason for these drawbacks is that the learned model needs to capture adequate features for all the possible transformations of the input. On the other hand, we formulate features in convolutional neural networks to be transformation-invariant. We achieve that using parallel siamese architectures for the considered transformation set and applying the TI-POOLING operator on their outputs before the fully-connected layers. We show that this topology internally finds the most optimal "canonical" instance of the input image for training and therefore limits the redundancy in learned features. This more efficient use of training data results in better performance on popular benchmark datasets with smaller number of parameters when comparing to standard convolutional neural networks with dataset augmentation and to other baselines.
Submitted 22 September, 2016; v1 submitted 21 April, 2016;
originally announced April 2016.
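The pooling idea in the TI-POOLING abstract above can be sketched in a few lines of Python. This is a minimal toy, not the authors' implementation: it assumes the pooling operator is an element-wise maximum over the branch outputs, uses the four 90-degree rotations as the transformation set, and replaces the shared network branch with a hypothetical hand-written function `toy_features`.

```python
from typing import Callable, List

Image = List[List[float]]


def rotate90(img: Image) -> Image:
    """Rotate a 2D list clockwise by 90 degrees."""
    return [list(row) for row in zip(*img[::-1])]


def rotations(img: Image) -> List[Image]:
    """Return the four 90-degree rotations of img (the transformation set)."""
    out = [img]
    for _ in range(3):
        out.append(rotate90(out[-1]))
    return out


def toy_features(img: Image) -> List[float]:
    """Hypothetical stand-in for the shared branch; not rotation-invariant."""
    return [img[0][0], sum(img[0]), img[-1][-1] - img[0][-1]]


def ti_pool(img: Image, features: Callable[[Image], List[float]] = toy_features) -> List[float]:
    """Apply the same feature function to every transformed copy of the
    input, then take the element-wise maximum over the copies (assumed
    pooling operator)."""
    per_branch = [features(t) for t in rotations(img)]
    return [max(vals) for vals in zip(*per_branch)]


img = [[1.0, 2.0], [3.0, 4.0]]
pooled = ti_pool(img)
pooled_rotated = ti_pool(rotate90(img))
```

Because rotating the input only permutes the set of transformed copies, `pooled` and `pooled_rotated` agree even though `toy_features` alone gives different outputs for the two inputs.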
-
Automatic 3D Reconstruction of Manifold Meshes via Delaunay Triangulation and Mesh Sweeping
Authors:
Andrea Romanoni,
Amaël Delaunoy,
Marc Pollefeys,
Matteo Matteucci
Abstract:
In this paper we propose a new approach to incrementally initialize a manifold surface for automatic 3D reconstruction from images. More precisely we focus on the automatic initialization of a 3D mesh as close as possible to the final solution; indeed many approaches require a good initial solution for further refinement via multi-view stereo techniques. Our novel algorithm automatically estimates an initial manifold mesh for surface evolving multi-view stereo algorithms, where the manifold property needs to be enforced. It bootstraps from 3D points extracted via Structure from Motion, then iterates between a state-of-the-art manifold reconstruction step and a novel mesh sweeping algorithm that looks for new 3D points in the neighborhood of the reconstructed manifold to be added in the manifold reconstruction. The experimental results show quantitatively that the mesh sweeping improves the resolution and the accuracy of the manifold reconstruction, allowing a better convergence of state-of-the-art surface evolution multi-view stereo algorithms.
Submitted 21 April, 2016;
originally announced April 2016.
-
Semantic 3D Reconstruction with Continuous Regularization and Ray Potentials Using a Visibility Consistency Constraint
Authors:
Nikolay Savinov,
Christian Haene,
Lubor Ladicky,
Marc Pollefeys
Abstract:
We propose an approach for dense semantic 3D reconstruction which uses a data term that is defined as potentials over viewing rays, combined with continuous surface area penalization. Our formulation is a convex relaxation which we augment with a crucial non-convex constraint that ensures exact handling of visibility. To tackle the non-convex minimization problem, we propose a majorize-minimize type strategy which converges to a critical point. We demonstrate the benefits of using the non-convex constraint experimentally. For the geometry-only case, we set a new state of the art on two datasets of the commonly used Middlebury multi-view stereo benchmark. Moreover, our general-purpose formulation directly reconstructs thin objects, which are usually treated with specialized algorithms. A qualitative evaluation on the dense semantic 3D reconstruction task shows that we improve significantly over previous methods.
Submitted 26 August, 2019; v1 submitted 11 April, 2016;
originally announced April 2016.
-
Capturing Hands in Action using Discriminative Salient Points and Physics Simulation
Authors:
Dimitrios Tzionas,
Luca Ballan,
Abhilash Srikantha,
Pablo Aponte,
Marc Pollefeys,
Juergen Gall
Abstract:
Hand motion capture is a popular research field, recently gaining more attention due to the ubiquity of RGB-D sensors. However, even most recent approaches focus on the case of a single isolated hand. In this work, we focus on hands that interact with other hands or objects and present a framework that successfully captures motion in such interaction scenarios for both rigid and articulated objects. Our framework combines a generative model with discriminatively trained salient points to achieve a low tracking error and with collision detection and physics simulation to achieve physically plausible estimates even in case of occlusions and missing visual data. Since all components are unified in a single objective function which is almost everywhere differentiable, it can be optimized with standard optimization techniques. Our approach works for monocular RGB-D sequences as well as setups with multiple synchronized RGB cameras. For a qualitative and quantitative evaluation, we captured 29 sequences with a large variety of interactions and up to 150 degrees of freedom.
Submitted 7 March, 2016; v1 submitted 6 June, 2015;
originally announced June 2015.
-
Learning the Matching Function
Authors:
Ľubor Ladický,
Christian Häne,
Marc Pollefeys
Abstract:
The matching function for the problem of stereo reconstruction or optical flow has been traditionally designed as a function of the distance between the features describing matched pixels. This approach works under assumption, that the appearance of pixels in two stereo cameras or in two consecutive video frames does not change dramatically. However, this might not be the case, if we try to match pixels over a large interval of time.
In this paper we propose a method, which learns the matching function, that automatically finds the space of allowed changes in visual appearance, such as due to the motion blur, chromatic distortions, different colour calibration or seasonal changes. Furthermore, it automatically learns the importance of matching scores of contextual features at different relative locations and scales. Proposed classifier gives reliable estimations of pixel disparities already without any form of regularization.
We evaluated our method on two standard problems - stereo matching on KITTI outdoor dataset, optical flow on Sintel data set, and on newly introduced TimeLapse change detection dataset. Our algorithm obtained very promising results comparable to the state-of-the-art.
Submitted 2 February, 2015;
originally announced February 2015.
-
Efficient Structured Prediction with Latent Variables for General Graphical Models
Authors:
Alexander Schwing,
Tamir Hazan,
Marc Pollefeys,
Raquel Urtasun
Abstract:
In this paper we propose a unified framework for structured prediction with latent variables which includes hidden conditional random fields and latent structured support vector machines as special cases. We describe a local entropy approximation for this general formulation using duality, and derive an efficient message passing algorithm that is guaranteed to converge. We demonstrate its effectiveness in the tasks of image segmentation as well as 3D indoor scene understanding from single images, showing that our approach is superior to latent structured support vector machines and hidden conditional random fields.
Submitted 27 June, 2012;
originally announced June 2012.