Search | arXiv e-print repository

arXiv:2008.07971 [pdf, other]

doi 10.1109/LRA.2021.3064284

Super-Human Performance in Gran Turismo Sport Using Deep Reinforcement Learning

Authors: Florian Fuchs, Yunlong Song, Elia Kaufmann, Davide Scaramuzza, Peter Duerr

Abstract: Autonomous car racing is a major challenge in robotics. It raises fundamental problems for classical approaches such as planning minimum-time trajectories under uncertain dynamics and controlling the car at the limits of its handling. Besides, the requirement of minimizing the lap time, which is a sparse objective, and the difficulty of collecting training data from human experts have also hindere… ▽ More Autonomous car racing is a major challenge in robotics. It raises fundamental problems for classical approaches such as planning minimum-time trajectories under uncertain dynamics and controlling the car at the limits of its handling. Besides, the requirement of minimizing the lap time, which is a sparse objective, and the difficulty of collecting training data from human experts have also hindered researchers from directly applying learning-based approaches to solve the problem. In the present work, we propose a learning-based system for autonomous car racing by leveraging a high-fidelity physical car simulation, a course-progress proxy reward, and deep reinforcement learning. We deploy our system in Gran Turismo Sport, a world-leading car simulator known for its realistic physics simulation of different race cars and tracks, which is even used to recruit human race car drivers. Our trained policy achieves autonomous racing performance that goes beyond what had been achieved so far by the built-in AI, and, at the same time, outperforms the fastest driver in a dataset of over 50,000 human players. △ Less

Submitted 9 May, 2021; v1 submitted 18 August, 2020; originally announced August 2020.

Comments: Accepted for Publication at the IEEE Robotics and Automation Letters (RA-L) 2021, and International Conference on Robots and Automation (ICRA) 2021

Journal ref: IEEE Robotics and Automation Letters (RAL) 2021

arXiv:2008.03324 [pdf, other]

Fisher Information Field: an Efficient and Differentiable Map for Perception-aware Planning

Authors: Zichao Zhang, Davide Scaramuzza

Abstract: Considering visual localization accuracy at the planning time gives preference to robot motion that can be better localized and thus has the potential of improving vision-based navigation, especially in visually degraded environments. To integrate the knowledge about localization accuracy in motion planning algorithms, a central task is to quantify the amount of information that an image taken at… ▽ More Considering visual localization accuracy at the planning time gives preference to robot motion that can be better localized and thus has the potential of improving vision-based navigation, especially in visually degraded environments. To integrate the knowledge about localization accuracy in motion planning algorithms, a central task is to quantify the amount of information that an image taken at a 6 degree-of-freedom pose brings for localization, which is often represented by the Fisher information. However, computing the Fisher information from a set of sparse landmarks (i.e., a point cloud), which is the most common map for visual localization, is inefficient. This approach scales linearly with the number of landmarks in the environment and does not allow the reuse of the computed Fisher information. To overcome these drawbacks, we propose the first dedicated map representation for evaluating the Fisher information of 6 degree-of-freedom visual localization for perception-aware motion planning. By formulating the Fisher information and sensor visibility carefully, we are able to separate the rotational invariant component from the Fisher information and store it in a voxel grid, namely the Fisher information field. This step only needs to be performed once for a known environment. The Fisher information for arbitrary poses can then be computed from the field in constant time, eliminating the need of costly iterating all the 3D landmarks at the planning time. Experimental results show that the proposed Fisher information field can be applied to different motion planning algorithms and is at least one order-of-magnitude faster than using the point cloud directly. Moreover,the proposed map representation is differentiable, resulting in better performance than the point cloud when used in trajectory optimization algorithms. △ Less

Submitted 7 August, 2020; originally announced August 2020.

Comments: 18 pages, 15 figures

arXiv:2008.02532 [pdf, other]

Online Weight-adaptive Nonlinear Model Predictive Control

Authors: Dimche Kostadinov, Davide Scaramuzza

Abstract: Nonlinear Model Predictive Control (NMPC) is a powerful and widely used technique for nonlinear dynamic process control under constraints. In NMPC, the state and control weights of the corresponding state and control costs are commonly selected based on human-expert knowledge, which usually reflects the acceptable stability in practice. Although broadly used, this approach might not be optimal for… ▽ More Nonlinear Model Predictive Control (NMPC) is a powerful and widely used technique for nonlinear dynamic process control under constraints. In NMPC, the state and control weights of the corresponding state and control costs are commonly selected based on human-expert knowledge, which usually reflects the acceptable stability in practice. Although broadly used, this approach might not be optimal for the execution of a trajectory with the lowest positional error and sufficiently "smooth" changes in the predicted controls. Furthermore, NMPC with an online weight update strategy for fast, agile, and precise unmanned aerial vehicle navigation, has not been studied extensively. To this end, we propose a novel control problem formulation that allows online updates of the state and control weights. As a solution, we present an algorithm that consists of two alternating stages: (i) state and command variable prediction and (ii) weights update. We present a numerical evaluation with a comparison and analysis of different trade-offs for the problem of quadrotor navigation. Our computer simulation results show improvements of up to 70% in the accuracy of the executed trajectory compared to the standard solution of NMPC with fixed weights. △ Less

Submitted 6 August, 2020; originally announced August 2020.

Journal ref: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, 2020

arXiv:2007.10284 [pdf, other]

doi 10.1109/IROS45743.2020.9340823

Learning High-Level Policies for Model Predictive Control

Authors: Yunlong Song, Davide Scaramuzza

Abstract: The combination of policy search and deep neural networks holds the promise of automating a variety of decision-making tasks. Model Predictive Control (MPC) provides robust solutions to robot control tasks by making use of a dynamical model of the system and solving an optimization problem online over a short planning horizon. In this work, we leverage probabilistic decision-making approaches and… ▽ More The combination of policy search and deep neural networks holds the promise of automating a variety of decision-making tasks. Model Predictive Control (MPC) provides robust solutions to robot control tasks by making use of a dynamical model of the system and solving an optimization problem online over a short planning horizon. In this work, we leverage probabilistic decision-making approaches and the generalization capability of artificial neural networks to the powerful online optimization by learning a deep high-level policy for the MPC (High-MPC). Conditioning on robot's local observations, the trained neural network policy is capable of adaptively selecting high-level decision variables for the low-level MPC controller, which then generates optimal control commands for the robot. First, we formulate the search of high-level decision variables for MPC as a policy search problem, specifically, a probabilistic inference problem. The problem can be solved in a closed-form solution. Second, we propose a self-supervised learning algorithm for learning a neural network high-level policy, which is useful for online hyperparameter adaptations in highly dynamic environments. We demonstrate the importance of incorporating the online adaption into autonomous robots by using the proposed method to solve a challenging control problem, where the task is to control a simulated quadrotor to fly through a swinging gate. We show that our approach can handle situations that are difficult for standard MPC. △ Less

Submitted 9 May, 2021; v1 submitted 20 July, 2020; originally announced July 2020.

Comments: Accepted for Publication at the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Journal ref: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, 2020

arXiv:2007.06255 [pdf, other]

CPC: Complementary Progress Constraints for Time-Optimal Quadrotor Trajectories

Authors: Philipp Foehn, Davide Scaramuzza

Abstract: In many mobile robotics scenarios, such as drone racing, the goal is to generate a trajectory that passes through multiple waypoints in minimal time. This problem is referred to as time-optimal planning. State-of-the-art approaches either use polynomial trajectory formulations, which are suboptimal due to their smoothness, or numerical optimization, which requires waypoints to be allocated as cost… ▽ More In many mobile robotics scenarios, such as drone racing, the goal is to generate a trajectory that passes through multiple waypoints in minimal time. This problem is referred to as time-optimal planning. State-of-the-art approaches either use polynomial trajectory formulations, which are suboptimal due to their smoothness, or numerical optimization, which requires waypoints to be allocated as costs or constraints to specific discrete-time nodes. For time-optimal planning, this time-allocation is a priori unknown and renders traditional approaches incapable of producing truly time-optimal trajectories. We introduce a novel formulation of progress bound to waypoints by a complementarity constraint. While the progress variables indicate the completion of a waypoint, change of this progress is only allowed in local proximity to the waypoint via complementarity constraints. This enables the simultaneous optimization of the trajectory and the time-allocation of the waypoints. To the best of our knowledge, this is the first approach allowing for truly time-optimal trajectory planning for quadrotors and other systems. We perform and discuss evaluations on optimality and convexity, compare to other related approaches, and qualitatively to an expert-human baseline. △ Less

Submitted 3 August, 2020; v1 submitted 13 July, 2020; originally announced July 2020.

arXiv:2006.05768 [pdf, other]

Deep Drone Acrobatics

Authors: Elia Kaufmann, Antonio Loquercio, René Ranftl, Matthias Müller, Vladlen Koltun, Davide Scaramuzza

Abstract: Performing acrobatic maneuvers with quadrotors is extremely challenging. Acrobatic flight requires high thrust and extreme angular accelerations that push the platform to its physical limits. Professional drone pilots often measure their level of mastery by flying such maneuvers in competitions. In this paper, we propose to learn a sensorimotor policy that enables an autonomous quadrotor to fly ex… ▽ More Performing acrobatic maneuvers with quadrotors is extremely challenging. Acrobatic flight requires high thrust and extreme angular accelerations that push the platform to its physical limits. Professional drone pilots often measure their level of mastery by flying such maneuvers in competitions. In this paper, we propose to learn a sensorimotor policy that enables an autonomous quadrotor to fly extreme acrobatic maneuvers with only onboard sensing and computation. We train the policy entirely in simulation by leveraging demonstrations from an optimal controller that has access to privileged information. We use appropriate abstractions of the visual input to enable transfer to a real quadrotor. We show that the resulting policy can be directly deployed in the physical world without any fine-tuning on real data. Our methodology has several favorable properties: it does not require a human expert to provide demonstrations, it cannot harm the physical system during training, and it can be used to learn maneuvers that are challenging even for the best human pilots. Our approach enables a physical quadrotor to fly maneuvers such as the Power Loop, the Barrel Roll, and the Matty Flip, during which it incurs accelerations of up to 3g. △ Less

Submitted 11 June, 2020; v1 submitted 10 June, 2020; originally announced June 2020.

Comments: 8 pages + 2 pages references. Video: https://youtu.be/2N_wKXQ6MXA. Code: https://github.com/uzh-rpg/deep_drone_acrobatics

Journal ref: Robotics, Science, and Systems (RSS), 2020

arXiv:2005.12813 [pdf, other]

AlphaPilot: Autonomous Drone Racing

Authors: Philipp Foehn, Dario Brescianini, Elia Kaufmann, Titus Cieslewski, Mathias Gehrig, Manasi Muglikar, Davide Scaramuzza

Abstract: This paper presents a novel system for autonomous, vision-based drone racing combining learned data abstraction, nonlinear filtering, and time-optimal trajectory planning. The system has successfully been deployed at the first autonomous drone racing world championship: the 2019 AlphaPilot Challenge. Contrary to traditional drone racing systems, which only detect the next gate, our approach makes… ▽ More This paper presents a novel system for autonomous, vision-based drone racing combining learned data abstraction, nonlinear filtering, and time-optimal trajectory planning. The system has successfully been deployed at the first autonomous drone racing world championship: the 2019 AlphaPilot Challenge. Contrary to traditional drone racing systems, which only detect the next gate, our approach makes use of any visible gate and takes advantage of multiple, simultaneous gate detections to compensate for drift in the state estimate and build a global map of the gates. The global map and drift-compensated state estimate allow the drone to navigate through the race course even when the gates are not immediately visible and further enable to plan a near time-optimal path through the race course in real time based on approximate drone dynamics. The proposed system has been demonstrated to successfully guide the drone through tight race courses reaching speeds up to 8m/s and ranked second at the 2019 AlphaPilot Challenge. △ Less

Submitted 20 August, 2021; v1 submitted 26 May, 2020; originally announced May 2020.

Comments: This paper is an extended version of an accepted publication from Robotics: Science and Systems, 2020. This version has been accepted for publication in Autonomous Robots (Springer). Please cite as "AlphaPilot: Autonomous Drone Racing", P. Foehn, Autonomous Robots 2021. Associated video at https://youtu.be/DGjwm5PZQT8

arXiv:2005.05179 [pdf, other]

doi 10.1007/s11263-020-01399-8

Reference Pose Generation for Long-term Visual Localization via Learned Features and View Synthesis

Authors: Zichao Zhang, Torsten Sattler, Davide Scaramuzza

Abstract: Visual Localization is one of the key enabling technologies for autonomous driving and augmented reality. High quality datasets with accurate 6 Degree-of-Freedom (DoF) reference poses are the foundation for benchmarking and improving existing methods. Traditionally, reference poses have been obtained via Structure-from-Motion (SfM). However, SfM itself relies on local features which are prone to f… ▽ More Visual Localization is one of the key enabling technologies for autonomous driving and augmented reality. High quality datasets with accurate 6 Degree-of-Freedom (DoF) reference poses are the foundation for benchmarking and improving existing methods. Traditionally, reference poses have been obtained via Structure-from-Motion (SfM). However, SfM itself relies on local features which are prone to fail when images were taken under different conditions, e.g., day/ night changes. At the same time, manually annotating feature correspondences is not scalable and potentially inaccurate. In this work, we propose a semi-automated approach to generate reference poses based on feature matching between renderings of a 3D model and real images via learned features. Given an initial pose estimate, our approach iteratively refines the pose based on feature matches against a rendering of the model from the current pose estimate. We significantly improve the nighttime reference poses of the popular Aachen Day-Night dataset, showing that state-of-the-art visual localization methods perform better (up to $47\%$) than predicted by the original reference poses. We extend the dataset with new nighttime test images, provide uncertainty estimates for our new reference poses, and introduce a new evaluation criterion. We will make our reference poses and our framework publicly available upon publication. △ Less

Submitted 30 December, 2020; v1 submitted 11 May, 2020; originally announced May 2020.

Comments: 25 pages, 16 figures. Int J Comput Vis (2020)

arXiv:2003.13493 [pdf, other]

Faster than FAST: GPU-Accelerated Frontend for High-Speed VIO

Authors: Balazs Nagy, Philipp Foehn, Davide Scaramuzza

Abstract: The recent introduction of powerful embedded graphics processing units (GPUs) has allowed for unforeseen improvements in real-time computer vision applications. It has enabled algorithms to run onboard, well above the standard video rates, yielding not only higher information processing capability, but also reduced latency. This work focuses on the applicability of efficient low-level, GPU hardwar… ▽ More The recent introduction of powerful embedded graphics processing units (GPUs) has allowed for unforeseen improvements in real-time computer vision applications. It has enabled algorithms to run onboard, well above the standard video rates, yielding not only higher information processing capability, but also reduced latency. This work focuses on the applicability of efficient low-level, GPU hardware-specific instructions to improve on existing computer vision algorithms in the field of visual-inertial odometry (VIO). While most steps of a VIO pipeline work on visual features, they rely on image data for detection and tracking, of which both steps are well suited for parallelization. Especially non-maxima suppression and the subsequent feature selection are prominent contributors to the overall image processing latency. Our work first revisits the problem of non-maxima suppression for feature detection specifically on GPUs, and proposes a solution that selects local response maxima, imposes spatial feature distribution, and extracts features simultaneously. Our second contribution introduces an enhanced FAST feature detector that applies the aforementioned non-maxima suppression method. Finally, we compare our method to other state-of-the-art CPU and GPU implementations, where we always outperform all of them in feature tracking and detection, resulting in over 1000fps throughput on an embedded Jetson TX2 platform. Additionally, we demonstrate our work integrated in a VIO pipeline achieving a metric state estimation at ~200fps. △ Less

Submitted 3 August, 2020; v1 submitted 30 March, 2020; originally announced March 2020.

Comments: IEEE International Conference on Intelligent Robots and Systems (IROS), 2020. Open-source implementation available at https://github.com/uzh-rpg/vilib

arXiv:2003.09148 [pdf, other]

Event-based Asynchronous Sparse Convolutional Networks

Authors: Nico Messikommer, Daniel Gehrig, Antonio Loquercio, Davide Scaramuzza

Abstract: Event cameras are bio-inspired sensors that respond to per-pixel brightness changes in the form of asynchronous and sparse "events". Recently, pattern recognition algorithms, such as learning-based methods, have made significant progress with event cameras by converting events into synchronous dense, image-like representations and applying traditional machine learning methods developed for standar… ▽ More Event cameras are bio-inspired sensors that respond to per-pixel brightness changes in the form of asynchronous and sparse "events". Recently, pattern recognition algorithms, such as learning-based methods, have made significant progress with event cameras by converting events into synchronous dense, image-like representations and applying traditional machine learning methods developed for standard cameras. However, these approaches discard the spatial and temporal sparsity inherent in event data at the cost of higher computational complexity and latency. In this work, we present a general framework for converting models trained on synchronous image-like event representations into asynchronous models with identical output, thus directly leveraging the intrinsic asynchronous and sparse nature of the event data. We show both theoretically and experimentally that this drastically reduces the computational complexity and latency of high-capacity, synchronous neural networks without sacrificing accuracy. In addition, our framework has several desirable characteristics: (i) it exploits spatio-temporal sparsity of events explicitly, (ii) it is agnostic to the event representation, network architecture, and task, and (iii) it does not require any train-time change, since it is compatible with the standard neural networks' training process. We thoroughly validate the proposed framework on two computer vision tasks: object detection and object recognition. In these tasks, we reduce the computational complexity up to 20 times with respect to high-latency neural networks. At the same time, we outperform state-of-the-art asynchronous approaches up to 24% in prediction accuracy. △ Less

Submitted 17 July, 2020; v1 submitted 20 March, 2020; originally announced March 2020.

Journal ref: European Conference on Computer Vision (ECCV), 2020

arXiv:2003.09078 [pdf, other]

Reducing the Sim-to-Real Gap for Event Cameras

Authors: Timo Stoffregen, Cedric Scheerlinck, Davide Scaramuzza, Tom Drummond, Nick Barnes, Lindsay Kleeman, Robert Mahony

Abstract: Event cameras are paradigm-shifting novel sensors that report asynchronous, per-pixel brightness changes called 'events' with unparalleled low latency. This makes them ideal for high speed, high dynamic range scenes where conventional cameras would fail. Recent work has demonstrated impressive results using Convolutional Neural Networks (CNNs) for video reconstruction and optic flow with events. W… ▽ More Event cameras are paradigm-shifting novel sensors that report asynchronous, per-pixel brightness changes called 'events' with unparalleled low latency. This makes them ideal for high speed, high dynamic range scenes where conventional cameras would fail. Recent work has demonstrated impressive results using Convolutional Neural Networks (CNNs) for video reconstruction and optic flow with events. We present strategies for improving training data for event based CNNs that result in 20-40% boost in performance of existing state-of-the-art (SOTA) video reconstruction networks retrained with our method, and up to 15% for optic flow networks. A challenge in evaluating event based video reconstruction is lack of quality ground truth images in existing datasets. To address this, we present a new High Quality Frames (HQF) dataset, containing events and ground truth frames from a DAVIS240C that are well-exposed and minimally motion-blurred. We evaluate our method on HQF + several existing major event camera datasets. △ Less

Submitted 22 August, 2020; v1 submitted 19 March, 2020; originally announced March 2020.

Comments: Figure 5 fixed (had a glitch)

Journal ref: European Conference on Computer Vision, 2020

arXiv:2003.05654 [pdf, other]

AirSim Drone Racing Lab

Authors: Ratnesh Madaan, Nicholas Gyde, Sai Vemprala, Matthew Brown, Keiko Nagami, Tim Taubner, Eric Cristofalo, Davide Scaramuzza, Mac Schwager, Ashish Kapoor

Abstract: Autonomous drone racing is a challenging research problem at the intersection of computer vision, planning, state estimation, and control. We introduce AirSim Drone Racing Lab, a simulation framework for enabling fast prototy** of algorithms for autonomy and enabling machine learning research in this domain, with the goal of reducing the time, money, and risks associated with field robotics. Our… ▽ More Autonomous drone racing is a challenging research problem at the intersection of computer vision, planning, state estimation, and control. We introduce AirSim Drone Racing Lab, a simulation framework for enabling fast prototy** of algorithms for autonomy and enabling machine learning research in this domain, with the goal of reducing the time, money, and risks associated with field robotics. Our framework enables generation of racing tracks in multiple photo-realistic environments, orchestration of drone races, comes with a suite of gate assets, allows for multiple sensor modalities (monocular, depth, neuromorphic events, optical flow), different camera models, and benchmarking of planning, control, computer vision, and learning-based algorithms. We used our framework to host a simulation based drone racing competition at NeurIPS 2019. The competition binaries are available at our github repository. △ Less

Submitted 12 March, 2020; originally announced March 2020.

Comments: 14 pages, 6 figures

arXiv:2003.04159 [pdf, other]

Tightly-coupled Fusion of Global Positional Measurements in Optimization-based Visual-Inertial Odometry

Authors: Giovanni Cioffi, Davide Scaramuzza

Abstract: Motivated by the goal of achieving robust, drift-free pose estimation in long-term autonomous navigation, in this work we propose a methodology to fuse global positional information with visual and inertial measurements in a tightly-coupled nonlinear-optimization-based estimator. Differently from previous works, which are loosely-coupled, the use of a tightly-coupled approach allows exploiting the… ▽ More Motivated by the goal of achieving robust, drift-free pose estimation in long-term autonomous navigation, in this work we propose a methodology to fuse global positional information with visual and inertial measurements in a tightly-coupled nonlinear-optimization-based estimator. Differently from previous works, which are loosely-coupled, the use of a tightly-coupled approach allows exploiting the correlations amongst all the measurements. A sliding window of the most recent system states is estimated by minimizing a cost function that includes visual re-projection errors, relative inertial errors, and global positional residuals. We use IMU preintegration to formulate the inertial residuals and leverage the outcome of such algorithm to efficiently compute the global position residuals. The experimental results show that the proposed method achieves accurate and globally consistent estimates, with negligible increase of the optimization computational cost. Our method consistently outperforms the loosely-coupled fusion approach. The mean position error is reduced up to 50% with respect to the loosely-coupled approach in outdoor Unmanned Aerial Vehicle (UAV) flights, where the global position information is given by noisy GPS measurements. To the best of our knowledge, this is the first work where global positional measurements are tightly fused in an optimization-based visual-inertial odometry algorithm, leveraging the IMU preintegration method to define the global positional factors. △ Less

Submitted 10 July, 2020; v1 submitted 9 March, 2020; originally announced March 2020.

Journal ref: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, 2020

arXiv:2003.03929 [pdf, other]

doi 10.1109/ICRA48506.2021.9561774

Geometry-aware Compensation Scheme for Morphing Drones

Authors: Amedeo Fabris, Kevin Kleber, Davide Falanga, Davide Scaramuzza

Abstract: Morphing multirotors, such as the Foldable Drone , can increase the versatility of drones employing in-flight-adaptive-morphology. To further increase precision in their tasks, recent works have investigated stable flight in asymmetric morphologies mainly leveraging the low-level controller. However, the aerodynamic effects embedded in multirotors are only analyzed in fixed shape aerial vehicles a… ▽ More Morphing multirotors, such as the Foldable Drone , can increase the versatility of drones employing in-flight-adaptive-morphology. To further increase precision in their tasks, recent works have investigated stable flight in asymmetric morphologies mainly leveraging the low-level controller. However, the aerodynamic effects embedded in multirotors are only analyzed in fixed shape aerial vehicles and are completely ignored in morphing drones. In this paper, we investigate the effects of the partial overlap between the propellers and the main body of a morphing quadrotor. We perform experiments to characterize such effects and design a morphology-aware control scheme to account for them. To guarantee the right trade-off between efficiency and compactness of the vehicle, we propose a simple geometry-aware compensation scheme based on the results of these experiments. We demonstrate the effectiveness of our approach by deploying the compensation scheme on the Foldable Drone, a quadrotor that can fold its arms around the main body. The same set of experiments are performed and compared against one another by activating and deactivating the compensation scheme offline or during the flight. To the best of our knowledge, this is the first work counteracting the aerodynamic effects of a morphing quadrotor during flight and showing the effects of partial overlap between a propeller and the central body of the drone. △ Less

Submitted 29 December, 2021; v1 submitted 9 March, 2020; originally announced March 2020.

Journal ref: IEEE International Conference on Robotics and Automation (ICRA), Xi'an, 2021

arXiv:2003.02790 [pdf, other]

Event-Based Angular Velocity Regression with Spiking Networks

Authors: Mathias Gehrig, Sumit Bam Shrestha, Daniel Mouritzen, Davide Scaramuzza

Abstract: Spiking Neural Networks (SNNs) are bio-inspired networks that process information conveyed as temporal spikes rather than numeric values. A spiking neuron of an SNN only produces a spike whenever a significant number of spikes occur within a short period of time. Due to their spike-based computational model, SNNs can process output from event-based, asynchronous sensors without any pre-processing… ▽ More Spiking Neural Networks (SNNs) are bio-inspired networks that process information conveyed as temporal spikes rather than numeric values. A spiking neuron of an SNN only produces a spike whenever a significant number of spikes occur within a short period of time. Due to their spike-based computational model, SNNs can process output from event-based, asynchronous sensors without any pre-processing at extremely lower power unlike standard artificial neural networks. This is possible due to specialized neuromorphic hardware that implements the highly-parallelizable concept of SNNs in silicon. Yet, SNNs have not enjoyed the same rise of popularity as artificial neural networks. This not only stems from the fact that their input format is rather unconventional but also due to the challenges in training spiking networks. Despite their temporal nature and recent algorithmic advances, they have been mostly evaluated on classification problems. We propose, for the first time, a temporal regression problem of numerical values given events from an event camera. We specifically investigate the prediction of the 3-DOF angular velocity of a rotating event camera with an SNN. The difficulty of this problem arises from the prediction of angular velocities continuously in time directly from irregular, asynchronous event-based input. Directly utilising the output of event cameras without any pre-processing ensures that we inherit all the benefits that they provide over conventional cameras. That is high-temporal resolution, high-dynamic range and no motion blur. To assess the performance of SNNs on this task, we introduce a synthetic event camera dataset generated from real-world panoramic images and show that we can successfully train an SNN to perform angular velocity regression. △ Less

Submitted 5 March, 2020; originally announced March 2020.

Journal ref: IEEE International Conference on Robotics and Automation (ICRA), Paris, 2020

arXiv:2003.02247 [pdf, other]

Voxel Map for Visual SLAM

Authors: Manasi Muglikar, Zichao Zhang, Davide Scaramuzza

Abstract: In modern visual SLAM systems, it is a standard practice to retrieve potential candidate map points from overlap** keyframes for further feature matching or direct tracking. In this work, we argue that keyframes are not the optimal choice for this task, due to several inherent limitations, such as weak geometric reasoning and poor scalability. We propose a voxel-map representation to efficiently… ▽ More In modern visual SLAM systems, it is a standard practice to retrieve potential candidate map points from overlap** keyframes for further feature matching or direct tracking. In this work, we argue that keyframes are not the optimal choice for this task, due to several inherent limitations, such as weak geometric reasoning and poor scalability. We propose a voxel-map representation to efficiently retrieve map points for visual SLAM. In particular, we organize the map points in a regular voxel grid. Visible points from a camera pose are queried by sampling the camera frustum in a raycasting manner, which can be done in constant time using an efficient voxel hashing method. Compared with keyframes, the retrieved points using our method are geometrically guaranteed to fall in the camera field-of-view, and occluded points can be identified and removed to a certain extend. This method also naturally scales up to large scenes and complicated multicamera configurations. Experimental results show that our voxel map representation is as efficient as a keyframe map with 5 keyframes and provides significantly higher localization accuracy (average 46% improvement in RMSE) on the EuRoC dataset. The proposed voxel-map representation is a general approach to a fundamental functionality in visual SLAM and widely applicable. △ Less

Submitted 4 March, 2020; originally announced March 2020.

Journal ref: IEEE Conference on Robotics and Automation(ICRA), Paris, 2020

arXiv:2003.02014 [pdf, other]

doi 10.1109/ICRA40945.2020.9197553

Redesigning SLAM for Arbitrary Multi-Camera Systems

Authors: Juichung Kuo, Manasi Muglikar, Zichao Zhang, Davide Scaramuzza

Abstract: Adding more cameras to SLAM systems improves robustness and accuracy but complicates the design of the visual front-end significantly. Thus, most systems in the literature are tailored for specific camera configurations. In this work, we aim at an adaptive SLAM system that works for arbitrary multi-camera setups. To this end, we revisit several common building blocks in visual SLAM. In particular,… ▽ More Adding more cameras to SLAM systems improves robustness and accuracy but complicates the design of the visual front-end significantly. Thus, most systems in the literature are tailored for specific camera configurations. In this work, we aim at an adaptive SLAM system that works for arbitrary multi-camera setups. To this end, we revisit several common building blocks in visual SLAM. In particular, we propose an adaptive initialization scheme, a sensor-agnostic, information-theoretic keyframe selection algorithm, and a scalable voxel-based map. These techniques make little assumption about the actual camera setups and prefer theoretically grounded methods over heuristics. We adapt a state-of-the-art visual-inertial odometry with these modifications, and experimental results show that the modified pipeline can adapt to a wide range of camera setups (e.g., 2 to 6 cameras in one experiment) without the need of sensor-specific modifications or tuning. △ Less

Submitted 4 March, 2020; originally announced March 2020.

Journal ref: IEEE Conference on Robotics and Automation (ICRA), Paris, 2020

arXiv:2003.00752 [pdf, other]

doi 10.1109/LRA.2020.3009067

Learning Depth With Very Sparse Supervision

Authors: Antonio Loquercio, Alexey Dosovitskiy, Davide Scaramuzza

Abstract: Motivated by the astonishing capabilities of natural intelligent agents and inspired by theories from psychology, this paper explores the idea that perception gets coupled to 3D properties of the world via interaction with the environment. Existing works for depth estimation require either massive amounts of annotated training data or some form of hard-coded geometrical constraint. This paper expl… ▽ More Motivated by the astonishing capabilities of natural intelligent agents and inspired by theories from psychology, this paper explores the idea that perception gets coupled to 3D properties of the world via interaction with the environment. Existing works for depth estimation require either massive amounts of annotated training data or some form of hard-coded geometrical constraint. This paper explores a new approach to learning depth perception requiring neither of those. Specifically, we train a specialized global-local network architecture with what would be available to a robot interacting with the environment: from extremely sparse depth measurements down to even a single pixel per image. From a pair of consecutive images, our proposed network outputs a latent representation of the observer's motion between the images and a dense depth map. Experiments on several datasets show that, when ground truth is available even for just one of the image pixels, the proposed network can learn monocular dense depth estimation up to 22.5% more accurately than state-of-the-art approaches. We believe that this work, despite its scientific interest, lays the foundations to learn depth from extremely sparse supervision, which can be valuable to all robotic systems acting under severe bandwidth or sensing constraints. △ Less

Submitted 16 July, 2020; v1 submitted 2 March, 2020; originally announced March 2020.

Comments: Accepted for Publication at the IEEE Robotics and Automation Letters (RA-L) 2020, and International Conference on Intelligent Robots and Systems (IROS) 2020

Journal ref: IEEE Robotics and Automation Letters (RA-L) 2020

arXiv:2003.00278 [pdf, other]

doi 10.1109/LRA.2020.3009077

Augmenting Visual Place Recognition with Structural Cues

Authors: Amadeus Oertel, Titus Cieslewski, Davide Scaramuzza

Abstract: In this paper, we propose to augment image-based place recognition with structural cues. Specifically, these structural cues are obtained using structure-from-motion, such that no additional sensors are needed for place recognition. This is achieved by augmenting the 2D convolutional neural network (CNN) typically used for image-based place recognition with a 3D CNN that takes as input a voxel gri… ▽ More In this paper, we propose to augment image-based place recognition with structural cues. Specifically, these structural cues are obtained using structure-from-motion, such that no additional sensors are needed for place recognition. This is achieved by augmenting the 2D convolutional neural network (CNN) typically used for image-based place recognition with a 3D CNN that takes as input a voxel grid derived from the structure-from-motion point cloud. We evaluate different methods for fusing the 2D and 3D features and obtain best performance with global average pooling and simple concatenation. On the Oxford RobotCar dataset, the resulting descriptor exhibits superior recognition performance compared to descriptors extracted from only one of the input modalities, including state-of-the-art image-based descriptors. Especially at low descriptor dimensionalities, we outperform state-of-the-art descriptors by up to 90%. △ Less

Submitted 16 July, 2020; v1 submitted 29 February, 2020; originally announced March 2020.

Comments: 8 pages, published in RA-L & IROS 2020

Journal ref: IEEE Robotics and Automation Letters, 2020

arXiv:2001.06209 [pdf, other]

Registration made easy -- standalone orthopedic navigation with HoloLens

Authors: Florentin Liebmann, Simon Roner, Marco von Atzigen, Florian Wanivenhaus, Caroline Neuhaus, José Spirig, Davide Scaramuzza, Reto Sutter, Jess Snedeker, Mazda Farshad, Philipp Fürnstahl

Abstract: In surgical navigation, finding correspondence between preoperative plan and intraoperative anatomy, the so-called registration task, is imperative. One promising approach is to intraoperatively digitize anatomy and register it with the preoperative plan. State-of-the-art commercial navigation systems implement such approaches for pedicle screw placement in spinal fusion surgery. Although these sy… ▽ More In surgical navigation, finding correspondence between preoperative plan and intraoperative anatomy, the so-called registration task, is imperative. One promising approach is to intraoperatively digitize anatomy and register it with the preoperative plan. State-of-the-art commercial navigation systems implement such approaches for pedicle screw placement in spinal fusion surgery. Although these systems improve surgical accuracy, they are not gold standard in clinical practice. Besides economical reasons, this may be due to their difficult integration into clinical workflows and unintuitive navigation feedback. Augmented Reality has the potential to overcome these limitations. Consequently, we propose a surgical navigation approach comprising intraoperative surface digitization for registration and intuitive holographic navigation for pedicle screw placement that runs entirely on the Microsoft HoloLens. Preliminary results from phantom experiments suggest that the method may meet clinical accuracy requirements. △ Less

Submitted 17 January, 2020; originally announced January 2020.

Comments: 6 pages, 5 figures, accepted at CVPR 2019 workshop on Computer Vision Applications for Mixed Reality Headsets (https://docs.microsoft.com/en-us/windows/mixed-reality/cvpr-2019)

ACM Class: I.4.1

arXiv:1912.03095 [pdf, other]

Video to Events: Recycling Video Datasets for Event Cameras

Authors: Daniel Gehrig, Mathias Gehrig, Javier Hidalgo-Carrió, Davide Scaramuzza

Abstract: Event cameras are novel sensors that output brightness changes in the form of a stream of asynchronous "events" instead of intensity frames. They offer significant advantages with respect to conventional cameras: high dynamic range (HDR), high temporal resolution, and no motion blur. Recently, novel learning approaches operating on event data have achieved impressive results. Yet, these methods re… ▽ More Event cameras are novel sensors that output brightness changes in the form of a stream of asynchronous "events" instead of intensity frames. They offer significant advantages with respect to conventional cameras: high dynamic range (HDR), high temporal resolution, and no motion blur. Recently, novel learning approaches operating on event data have achieved impressive results. Yet, these methods require a large amount of event data for training, which is hardly available due the novelty of event sensors in computer vision research. In this paper, we present a method that addresses these needs by converting any existing video dataset recorded with conventional cameras to synthetic event data. This unlocks the use of a virtually unlimited number of existing video datasets for training networks designed for real event data. We evaluate our method on two relevant vision tasks, i.e., object recognition and semantic segmentation, and show that models trained on synthetic events have several benefits: (i) they generalize well to real event data, even in scenarios where standard-camera images are blurry or overexposed, by inheriting the outstanding properties of event cameras; (ii) they can be used for fine-tuning on real data to improve over state-of-the-art for both classification and semantic segmentation. △ Less

Submitted 1 April, 2020; v1 submitted 6 December, 2019; originally announced December 2019.

arXiv:1911.04553 [pdf, other]

Towards Low-Latency High-Bandwidth Control of Quadrotors using Event Cameras

Authors: Rika Sugimoto Dimitrova, Mathias Gehrig, Dario Brescianini, Davide Scaramuzza

Abstract: Event cameras are a promising candidate to enable high speed vision-based control due to their low sensor latency and high temporal resolution. However, purely event-based feedback has yet to be used in the control of drones. In this work, a first step towards implementing low-latency high-bandwidth control of quadrotors using event cameras is taken. In particular, this paper addresses the problem… ▽ More Event cameras are a promising candidate to enable high speed vision-based control due to their low sensor latency and high temporal resolution. However, purely event-based feedback has yet to be used in the control of drones. In this work, a first step towards implementing low-latency high-bandwidth control of quadrotors using event cameras is taken. In particular, this paper addresses the problem of one-dimensional attitude tracking using a dualcopter platform equipped with an event camera. The event-based state estimation consists of a modified Hough transform algorithm combined with a Kalman filter that outputs the roll angle and angular velocity of the dualcopter relative to a horizon marked by a black-and-white disk. The estimated state is processed by a proportional-derivative attitude control law that computes the rotor thrusts required to track the desired attitude. The proposed attitude tracking scheme shows promising results of event-camera-driven closed loop control: the state estimator performs with an update rate of 1 kHz and a latency determined to be 12 ms, enabling attitude tracking at speeds of over 1600 deg/s. △ Less

Submitted 28 March, 2020; v1 submitted 11 November, 2019; originally announced November 2019.

Journal ref: IEEE International Conference on Robotics and Automation (ICRA), Paris, 2020

arXiv:1909.01423 [pdf, other]

Exploration Without Global Consistency Using Local Volume Consolidation

Authors: Titus Cieslewski, Andreas Ziegler, Davide Scaramuzza

Abstract: In exploration, the goal is to build a map of an unknown environment. Most state-of-the-art approaches use map representations that require drift-free state estimates to function properly. Real-world state estimators, however, exhibit drift. In this paper, we present a 2D map representation for exploration that is robust to drift. Rather than a global map, it uses local metric volumes connected by… ▽ More In exploration, the goal is to build a map of an unknown environment. Most state-of-the-art approaches use map representations that require drift-free state estimates to function properly. Real-world state estimators, however, exhibit drift. In this paper, we present a 2D map representation for exploration that is robust to drift. Rather than a global map, it uses local metric volumes connected by relative pose estimates. This pose-graph does not need to be globally consistent. Overlaps between the volumes are resolved locally, rather than on the faulty estimate of space. We demonstrate our representation with a frontier-based exploration approach, evaluate it under different conditions and compare it with a commonly-used grid-based representation. We show that, at the cost of longer exploration time, using the proposed representation allows full coverage of space even for very large drift in the state estimate, contrary to the grid-based representation. The system is validated in a real world experiment and we discuss its extension to 3D. △ Less

Submitted 3 September, 2019; originally announced September 2019.

Comments: 16 pages with large margins, accepted for publication at the International Symposium on Robotics Research (ISRR), Hanoi, 2019

Journal ref: International Symposium on Robotics Research (ISRR), Hanoi, 2019

arXiv:1907.06890 [pdf, other]

doi 10.1109/LRA.2020.2974682

A General Framework for Uncertainty Estimation in Deep Learning

Authors: Antonio Loquercio, Mattia Segù, Davide Scaramuzza

Abstract: Neural networks predictions are unreliable when the input sample is out of the training distribution or corrupted by noise. Being able to detect such failures automatically is fundamental to integrate deep learning algorithms into robotics. Current approaches for uncertainty estimation of neural networks require changes to the network and optimization process, typically ignore prior knowledge abou… ▽ More Neural networks predictions are unreliable when the input sample is out of the training distribution or corrupted by noise. Being able to detect such failures automatically is fundamental to integrate deep learning algorithms into robotics. Current approaches for uncertainty estimation of neural networks require changes to the network and optimization process, typically ignore prior knowledge about the data, and tend to make over-simplifying assumptions which underestimate uncertainty. To address these limitations, we propose a novel framework for uncertainty estimation. Based on Bayesian belief networks and Monte-Carlo sampling, our framework not only fully models the different sources of prediction uncertainty, but also incorporates prior data information, e.g. sensor noise. We show theoretically that this gives us the ability to capture uncertainty better than existing methods. In addition, our framework has several desirable properties: (i) it is agnostic to the network architecture and task; (ii) it does not require changes in the optimization process; (iii) it can be applied to already trained architectures. We thoroughly validate the proposed framework through extensive experiments on both computer vision and control tasks, where we outperform previous methods by up to 23% in accuracy. △ Less

Submitted 7 February, 2020; v1 submitted 16 July, 2019; originally announced July 2019.

Comments: Accepted for publication in the Robotics and Automation Letters 2020, and for presentation at the International Conference on Robotics and Automation (ICRA) 2020

Journal ref: IEEE Robotics and Automation Letters 2020

arXiv:1906.07165 [pdf, other]

High Speed and High Dynamic Range Video with an Event Camera

Authors: Henri Rebecq, René Ranftl, Vladlen Koltun, Davide Scaramuzza

Abstract: Event cameras are novel sensors that report brightness changes in the form of a stream of asynchronous "events" instead of intensity frames. They offer significant advantages with respect to conventional cameras: high temporal resolution, high dynamic range, and no motion blur. While the stream of events encodes in principle the complete visual signal, the reconstruction of an intensity image from… ▽ More Event cameras are novel sensors that report brightness changes in the form of a stream of asynchronous "events" instead of intensity frames. They offer significant advantages with respect to conventional cameras: high temporal resolution, high dynamic range, and no motion blur. While the stream of events encodes in principle the complete visual signal, the reconstruction of an intensity image from a stream of events is an ill-posed problem in practice. Existing reconstruction approaches are based on hand-crafted priors and strong assumptions about the imaging process as well as the statistics of natural images. In this work we propose to learn to reconstruct intensity images from event streams directly from data instead of relying on any hand-crafted priors. We propose a novel recurrent network to reconstruct videos from a stream of events, and train it on a large amount of simulated event data. During training we propose to use a perceptual loss to encourage reconstructions to follow natural image statistics. We further extend our approach to synthesize color images from color event streams. Our network surpasses state-of-the-art reconstruction methods by a large margin in terms of image quality (> 20%), while comfortably running in real-time. We show that the network is able to synthesize high framerate videos (> 5,000 frames per second) of high-speed phenomena (e.g. a bullet hitting an object) and is able to provide high dynamic range reconstructions in challenging lighting conditions. We also demonstrate the effectiveness of our reconstructions as an intermediate representation for event data. We show that off-the-shelf computer vision algorithms can be applied to our reconstructions for tasks such as object classification and visual-inertial odometry and that this strategy consistently outperforms algorithms that were specifically designed for event data. △ Less

Submitted 15 June, 2019; originally announced June 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1904.08298

arXiv:1906.03996 [pdf, other]

Rethinking Trajectory Evaluation for SLAM: a Probabilistic, Continuous-Time Approach

Authors: Zichao Zhang, Davide Scaramuzza

Abstract: Despite the existence of different error metrics for trajectory evaluation in SLAM, their theoretical justifications and connections are rarely studied, and few methods handle temporal association properly. In this work, we propose to formulate the trajectory evaluation problem in a probabilistic, continuous-time framework. By modeling the groundtruth as random variables, the concepts of absolute… ▽ More Despite the existence of different error metrics for trajectory evaluation in SLAM, their theoretical justifications and connections are rarely studied, and few methods handle temporal association properly. In this work, we propose to formulate the trajectory evaluation problem in a probabilistic, continuous-time framework. By modeling the groundtruth as random variables, the concepts of absolute and relative error are generalized to be likelihood. Moreover, the groundtruth is represented as a piecewise Gaussian Process in continuous-time. Within this framework, we are able to establish theoretical connections between relative and absolute error metrics and handle temporal association in a principled manner. △ Less

Submitted 10 June, 2019; originally announced June 2019.

Comments: Accepted at ICRA19 Workshop on Dataset Generation and Benchmarking of SLAM Algorithms for Robotics and VR/AR. Best paper award

arXiv:1906.03289 [pdf, other]

Visual-Inertial Odometry of Aerial Robots

Authors: Davide Scaramuzza, Zichao Zhang

Abstract: Visual-Inertial odometry (VIO) is the process of estimating the state (pose and velocity) of an agent (e.g., an aerial robot) by using only the input of one or more cameras plus one or more Inertial Measurement Units (IMUs) attached to it. VIO is the only viable alternative to GPS and lidar-based odometry to achieve accurate state estimation. Since both cameras and IMUs are very cheap, these senso… ▽ More Visual-Inertial odometry (VIO) is the process of estimating the state (pose and velocity) of an agent (e.g., an aerial robot) by using only the input of one or more cameras plus one or more Inertial Measurement Units (IMUs) attached to it. VIO is the only viable alternative to GPS and lidar-based odometry to achieve accurate state estimation. Since both cameras and IMUs are very cheap, these sensor types are ubiquitous in all today's aerial robots. △ Less

Submitted 14 June, 2019; v1 submitted 7 June, 2019; originally announced June 2019.

Comments: Accepted in the Encyclopedia of Robotics, Springer

arXiv:1906.02919 [pdf, other]

EVDodgeNet: Deep Dynamic Obstacle Dodging with Event Cameras

Authors: Nitin J. Sanket, Chethan M. Parameshwara, Chahat Deep Singh, Ashwin V. Kuruttukulam, Cornelia Fermüller, Davide Scaramuzza, Yiannis Aloimonos

Abstract: Dynamic obstacle avoidance on quadrotors requires low latency. A class of sensors that are particularly suitable for such scenarios are event cameras. In this paper, we present a deep learning -- based solution for dodging multiple dynamic obstacles on a quadrotor with a single event camera and on-board computation. Our approach uses a series of shallow neural networks for estimating both the ego-… ▽ More Dynamic obstacle avoidance on quadrotors requires low latency. A class of sensors that are particularly suitable for such scenarios are event cameras. In this paper, we present a deep learning -- based solution for dodging multiple dynamic obstacles on a quadrotor with a single event camera and on-board computation. Our approach uses a series of shallow neural networks for estimating both the ego-motion and the motion of independently moving objects. The networks are trained in simulation and directly transfer to the real world without any fine-tuning or retraining. We successfully evaluate and demonstrate the proposed approach in many real-world experiments with obstacles of different shapes and sizes, achieving an overall success rate of 70% including objects of unknown shape and a low light testing scenario. To our knowledge, this is the first deep learning -- based solution to the problem of dynamic obstacle avoidance using event cameras on a quadrotor. Finally, we also extend our work to the pursuit task by merely reversing the control policy, proving that our navigation stack can cater to different scenarios. △ Less

Submitted 1 March, 2020; v1 submitted 7 June, 2019; originally announced June 2019.

Comments: 15 pages, 16 figures, Code and Video can be found at: https://prg.cs.umd.edu/EVDodgeNet

Journal ref: IEEE International Conference on Robotics and Automation (ICRA), 2020

arXiv:1905.09727 [pdf, other]

doi 10.1109/TRO.2019.2942989

Deep Drone Racing: From Simulation to Reality with Domain Randomization

Authors: Antonio Loquercio, Elia Kaufmann, René Ranftl, Alexey Dosovitskiy, Vladlen Koltun, Davide Scaramuzza

Abstract: Dynamically changing environments, unreliable state estimation, and operation under severe resource constraints are fundamental challenges that limit the deployment of small autonomous drones. We address these challenges in the context of autonomous, vision-based drone racing in dynamic environments. A racing drone must traverse a track with possibly moving gates at high speed. We enable this func… ▽ More Dynamically changing environments, unreliable state estimation, and operation under severe resource constraints are fundamental challenges that limit the deployment of small autonomous drones. We address these challenges in the context of autonomous, vision-based drone racing in dynamic environments. A racing drone must traverse a track with possibly moving gates at high speed. We enable this functionality by combining the performance of a state-of-the-art planning and control system with the perceptual awareness of a convolutional neural network (CNN). The resulting modular system is both platform- and domain-independent: it is trained in simulation and deployed on a physical quadrotor without any fine-tuning. The abundance of simulated data, generated via domain randomization, makes our system robust to changes of illumination and gate appearance. To the best of our knowledge, our approach is the first to demonstrate zero-shot sim-to-real transfer on the task of agile drone flight. We extensively test the precision and robustness of our system, both in simulation and on a physical platform, and show significant improvements over the state of the art. △ Less

Submitted 25 November, 2019; v1 submitted 20 May, 2019; originally announced May 2019.

Comments: Accepted as a Regular Paper to the IEEE Transactions on Robotics Journal. arXiv admin note: substantial text overlap with arXiv:1806.08548

Journal ref: IEEE Transactions on Robotics 2019

arXiv:1904.10772 [pdf, other]

CED: Color Event Camera Dataset

Authors: Cedric Scheerlinck, Henri Rebecq, Timo Stoffregen, Nick Barnes, Robert Mahony, Davide Scaramuzza

Abstract: Event cameras are novel, bio-inspired visual sensors, whose pixels output asynchronous and independent timestamped spikes at local intensity changes, called 'events'. Event cameras offer advantages over conventional frame-based cameras in terms of latency, high dynamic range (HDR) and temporal resolution. Until recently, event cameras have been limited to outputting events in the intensity channel… ▽ More Event cameras are novel, bio-inspired visual sensors, whose pixels output asynchronous and independent timestamped spikes at local intensity changes, called 'events'. Event cameras offer advantages over conventional frame-based cameras in terms of latency, high dynamic range (HDR) and temporal resolution. Until recently, event cameras have been limited to outputting events in the intensity channel, however, recent advances have resulted in the development of color event cameras, such as the Color-DAVIS346. In this work, we present and release the first Color Event Camera Dataset (CED), containing 50 minutes of footage with both color frames and events. CED features a wide variety of indoor and outdoor scenes, which we hope will help drive forward event-based vision research. We also present an extension of the event camera simulator ESIM that enables simulation of color events. Finally, we present an evaluation of three state-of-the-art image reconstruction methods that can be used to convert the Color-DAVIS346 into a continuous-time, HDR, color video camera to visualise the event stream, and for use in downstream vision applications. △ Less

Submitted 24 April, 2019; originally announced April 2019.

Comments: Conference on Computer Vision and Pattern Recognition Workshops

arXiv:1904.08405 [pdf, other]

doi 10.1109/TPAMI.2020.3008413

Event-based Vision: A Survey

Authors: Guillermo Gallego, Tobi Delbruck, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew Davison, Joerg Conradt, Kostas Daniilidis, Davide Scaramuzza

Abstract: Event cameras are bio-inspired sensors that differ from conventional frame cameras: Instead of capturing images at a fixed rate, they asynchronously measure per-pixel brightness changes, and output a stream of events that encode the time, location and sign of the brightness changes. Event cameras offer attractive properties compared to traditional cameras: high temporal resolution (in the order of… ▽ More Event cameras are bio-inspired sensors that differ from conventional frame cameras: Instead of capturing images at a fixed rate, they asynchronously measure per-pixel brightness changes, and output a stream of events that encode the time, location and sign of the brightness changes. Event cameras offer attractive properties compared to traditional cameras: high temporal resolution (in the order of microseconds), very high dynamic range (140 dB vs. 60 dB), low power consumption, and high pixel bandwidth (on the order of kHz) resulting in reduced motion blur. Hence, event cameras have a large potential for robotics and computer vision in challenging scenarios for traditional cameras, such as low-latency, high speed, and high dynamic range. However, novel methods are required to process the unconventional output of these sensors in order to unlock their potential. This paper provides a comprehensive overview of the emerging field of event-based vision, with a focus on the applications and the algorithms developed to unlock the outstanding properties of event cameras. We present event cameras from their working principle, the actual sensors that are available and the tasks that they have been used for, from low-level vision (feature detection and tracking, optic flow, etc.) to high-level vision (reconstruction, segmentation, recognition). We also discuss the techniques developed to process events, including learning-based techniques, as well as specialized processors for these novel sensors, such as spiking neural networks. Additionally, we highlight the challenges that remain to be tackled and the opportunities that lie ahead in the search for a more efficient, bio-inspired way for machines to perceive and interact with the world. △ Less

Submitted 8 August, 2020; v1 submitted 17 April, 2019; originally announced April 2019.

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020

arXiv:1904.08298 [pdf, other]

Events-to-Video: Bringing Modern Computer Vision to Event Cameras

Authors: Henri Rebecq, René Ranftl, Vladlen Koltun, Davide Scaramuzza

Abstract: Event cameras are novel sensors that report brightness changes in the form of asynchronous "events" instead of intensity frames. They have significant advantages over conventional cameras: high temporal resolution, high dynamic range, and no motion blur. Since the output of event cameras is fundamentally different from conventional cameras, it is commonly accepted that they require the development… ▽ More Event cameras are novel sensors that report brightness changes in the form of asynchronous "events" instead of intensity frames. They have significant advantages over conventional cameras: high temporal resolution, high dynamic range, and no motion blur. Since the output of event cameras is fundamentally different from conventional cameras, it is commonly accepted that they require the development of specialized algorithms to accommodate the particular nature of events. In this work, we take a different view and propose to apply existing, mature computer vision techniques to videos reconstructed from event data. We propose a novel recurrent network to reconstruct videos from a stream of events, and train it on a large amount of simulated event data. Our experiments show that our approach surpasses state-of-the-art reconstruction methods by a large margin (> 20%) in terms of image quality. We further apply off-the-shelf computer vision algorithms to videos reconstructed from event data on tasks such as object classification and visual-inertial odometry, and show that this strategy consistently outperforms algorithms that were specifically designed for event data. We believe that our approach opens the door to bringing the outstanding properties of event cameras to an entirely new range of tasks. A video of the experiments is available at https://youtu.be/IdYrC4cUO0I △ Less

Submitted 17 April, 2019; originally announced April 2019.

Journal ref: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, 2019

arXiv:1904.08245 [pdf, other]

End-to-End Learning of Representations for Asynchronous Event-Based Data

Authors: Daniel Gehrig, Antonio Loquercio, Konstantinos G. Derpanis, Davide Scaramuzza

Abstract: Event cameras are vision sensors that record asynchronous streams of per-pixel brightness changes, referred to as "events". They have appealing advantages over frame-based cameras for computer vision, including high temporal resolution, high dynamic range, and no motion blur. Due to the sparse, non-uniform spatiotemporal layout of the event signal, pattern recognition algorithms typically aggregat… ▽ More Event cameras are vision sensors that record asynchronous streams of per-pixel brightness changes, referred to as "events". They have appealing advantages over frame-based cameras for computer vision, including high temporal resolution, high dynamic range, and no motion blur. Due to the sparse, non-uniform spatiotemporal layout of the event signal, pattern recognition algorithms typically aggregate events into a grid-based representation and subsequently process it by a standard vision pipeline, e.g., Convolutional Neural Network (CNN). In this work, we introduce a general framework to convert event streams into grid-based representations through a sequence of differentiable operations. Our framework comes with two main advantages: (i) allows learning the input event representation together with the task dedicated network in an end to end manner, and (ii) lays out a taxonomy that unifies the majority of extant event representations in the literature and identifies novel ones. Empirically, we show that our approach to learning the event representation end-to-end yields an improvement of approximately 12% on optical flow estimation and object recognition over state-of-the-art methods. △ Less

Submitted 20 August, 2019; v1 submitted 17 April, 2019; originally announced April 2019.

Comments: To appear at ICCV 2019

arXiv:1904.07235 [pdf, other]

doi 10.1109/CVPR.2019.01256

Focus Is All You Need: Loss Functions For Event-based Vision

Authors: Guillermo Gallego, Mathias Gehrig, Davide Scaramuzza

Abstract: Event cameras are novel vision sensors that output pixel-level brightness changes ("events") instead of traditional video frames. These asynchronous sensors offer several advantages over traditional cameras, such as, high temporal resolution, very high dynamic range, and no motion blur. To unlock the potential of such sensors, motion compensation methods have been recently proposed. We present a c… ▽ More Event cameras are novel vision sensors that output pixel-level brightness changes ("events") instead of traditional video frames. These asynchronous sensors offer several advantages over traditional cameras, such as, high temporal resolution, very high dynamic range, and no motion blur. To unlock the potential of such sensors, motion compensation methods have been recently proposed. We present a collection and taxonomy of twenty two objective functions to analyze event alignment in motion compensation approaches (Fig. 1). We call them Focus Loss Functions since they have strong connections with functions used in traditional shape-from-focus applications. The proposed loss functions allow bringing mature computer vision tools to the realm of event cameras. We compare the accuracy and runtime performance of all loss functions on a publicly available dataset, and conclude that the variance, the gradient and the Laplacian magnitudes are among the best loss functions. The applicability of the loss functions is shown on multiple tasks: rotational motion, depth and optical flow estimation. The proposed focus loss functions allow to unlock the outstanding properties of event cameras. △ Less

Submitted 15 April, 2019; originally announced April 2019.

Comments: 29 pages, 19 figures, 4 tables

Journal ref: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, 2019

arXiv:1904.01293 [pdf, other]

doi 10.1109/ICCV.2019.00734

Event-Based Motion Segmentation by Motion Compensation

Authors: Timo Stoffregen, Guillermo Gallego, Tom Drummond, Lindsay Kleeman, Davide Scaramuzza

Abstract: In contrast to traditional cameras, whose pixels have a common exposure time, event-based cameras are novel bio-inspired sensors whose pixels work independently and asynchronously output intensity changes (called "events"), with microsecond resolution. Since events are caused by the apparent motion of objects, event-based cameras sample visual information based on the scene dynamics and are, there… ▽ More In contrast to traditional cameras, whose pixels have a common exposure time, event-based cameras are novel bio-inspired sensors whose pixels work independently and asynchronously output intensity changes (called "events"), with microsecond resolution. Since events are caused by the apparent motion of objects, event-based cameras sample visual information based on the scene dynamics and are, therefore, a more natural fit than traditional cameras to acquire motion, especially at high speeds, where traditional cameras suffer from motion blur. However, distinguishing between events caused by different moving objects and by the camera's ego-motion is a challenging task. We present the first per-event segmentation method for splitting a scene into independently moving objects. Our method jointly estimates the event-object associations (i.e., segmentation) and the motion parameters of the objects (or the background) by maximization of an objective function, which builds upon recent results on event-based motion-compensation. We provide a thorough evaluation of our method on a public dataset, outperforming the state-of-the-art by as much as 10%. We also show the first quantitative evaluation of a segmentation algorithm for event cameras, yielding around 90% accuracy at 4 pixels relative displacement. △ Less

Submitted 22 August, 2019; v1 submitted 2 April, 2019; originally announced April 2019.

Comments: When viewed in Acrobat Reader, several of the figures animate. Video: https://youtu.be/0q6ap_OSBAk

Journal ref: IEEE International Conference on Computer Vision 2019

arXiv:1901.03360 [pdf, other]

Unsupervised Moving Object Detection via Contextual Information Separation

Authors: Yanchao Yang, Antonio Loquercio, Davide Scaramuzza, Stefano Soatto

Abstract: We propose an adversarial contextual model for detecting moving objects in images. A deep neural network is trained to predict the optical flow in a region using information from everywhere else but that region (context), while another network attempts to make such context as uninformative as possible. The result is a model where hypotheses naturally compete with no need for explicit regularizatio… ▽ More We propose an adversarial contextual model for detecting moving objects in images. A deep neural network is trained to predict the optical flow in a region using information from everywhere else but that region (context), while another network attempts to make such context as uninformative as possible. The result is a model where hypotheses naturally compete with no need for explicit regularization or hyper-parameter tuning. Although our method requires no supervision whatsoever, it outperforms several methods that are pre-trained on large annotated datasets. Our model can be thought of as a generalization of classical variational generative region-based segmentation, but in a way that avoids explicit regularization or solution of partial differential equations at run-time. △ Less

Submitted 14 April, 2019; v1 submitted 10 January, 2019; originally announced January 2019.

arXiv:1811.10681 [pdf, other]

Matching Features without Descriptors: Implicitly Matched Interest Points

Authors: Titus Cieslewski, Michael Bloesch, Davide Scaramuzza

Abstract: The extraction and matching of interest points is a prerequisite for many geometric computer vision problems. Traditionally, matching has been achieved by assigning descriptors to interest points and matching points that have similar descriptors. In this paper, we propose a method by which interest points are instead already implicitly matched at detection time. With this, descriptors do not need… ▽ More The extraction and matching of interest points is a prerequisite for many geometric computer vision problems. Traditionally, matching has been achieved by assigning descriptors to interest points and matching points that have similar descriptors. In this paper, we propose a method by which interest points are instead already implicitly matched at detection time. With this, descriptors do not need to be calculated, stored, communicated, or matched any more. This is achieved by a convolutional neural network with multiple output channels and can be thought of as a collection of a variety of detectors, each specialized to specific visual features. This paper describes how to design and train such a network in a way that results in successful relative pose estimation performance despite the limitation on interest point count. While the overall matching score is slightly lower than with traditional methods, the approach is descriptor free and thus enables localization systems with a significantly smaller memory footprint and multi-agent localization systems with lower bandwidth requirements. The network also outputs the confidence for a specific interest point resulting in a valid match. We evaluate performance relative to state-of-the-art alternatives. △ Less

Submitted 5 August, 2019; v1 submitted 26 November, 2018; originally announced November 2018.

Comments: 10 pages without references, accepted for publication at the British Machine Vision Conference (BMVC), Cardiff, 2019. v2 contains additional results, and a bug in the evaluation of LF-NET has been fixed

Journal ref: British Machine Vision Conference (BMVC), Cardiff, 2019

arXiv:1810.06224 [pdf, other]

Beauty and the Beast: Optimal Methods Meet Learning for Drone Racing

Authors: Elia Kaufmann, Mathias Gehrig, Philipp Foehn, René Ranftl, Alexey Dosovitskiy, Vladlen Koltun, Davide Scaramuzza

Abstract: Autonomous micro aerial vehicles still struggle with fast and agile maneuvers, dynamic environments, imperfect sensing, and state estimation drift. Autonomous drone racing brings these challenges to the fore. Human pilots can fly a previously unseen track after a handful of practice runs. In contrast, state-of-the-art autonomous navigation algorithms require either a precise metric map of the envi… ▽ More Autonomous micro aerial vehicles still struggle with fast and agile maneuvers, dynamic environments, imperfect sensing, and state estimation drift. Autonomous drone racing brings these challenges to the fore. Human pilots can fly a previously unseen track after a handful of practice runs. In contrast, state-of-the-art autonomous navigation algorithms require either a precise metric map of the environment or a large amount of training data collected in the track of interest. To bridge this gap, we propose an approach that can fly a new track in a previously unseen environment without a precise map or expensive data collection. Our approach represents the global track layout with coarse gate locations, which can be easily estimated from a single demonstration flight. At test time, a convolutional network predicts the poses of the closest gates along with their uncertainty. These predictions are incorporated by an extended Kalman filter to maintain optimal maximum-a-posteriori estimates of gate locations. This allows the framework to cope with misleading high-variance estimates that could stem from poor observability or lack of visible gates. Given the estimated gate poses, we use model predictive control to quickly and accurately navigate through the track. We conduct extensive experiments in the physical world, demonstrating agile and robust flight through complex and diverse previously-unseen race tracks. The presented approach was used to win the IROS 2018 Autonomous Drone Race Competition, outracing the second-placing team by a factor of two. △ Less

Submitted 1 March, 2019; v1 submitted 15 October, 2018; originally announced October 2018.

Comments: 6 pages (+1 references)

Journal ref: IEEE International Conference on Robotics and Automation (ICRA), 2019

arXiv:1807.09713 [pdf, other]

doi 10.1007/978-3-030-01258-8_46

Asynchronous, Photometric Feature Tracking using Events and Frames

Authors: Daniel Gehrig, Henri Rebecq, Guillermo Gallego, Davide Scaramuzza

Abstract: We present a method that leverages the complementarity of event cameras and standard cameras to track visual features with low-latency. Event cameras are novel sensors that output pixel-level brightness changes, called "events". They offer significant advantages over standard cameras, namely a very high dynamic range, no motion blur, and a latency in the order of microseconds. However, because the… ▽ More We present a method that leverages the complementarity of event cameras and standard cameras to track visual features with low-latency. Event cameras are novel sensors that output pixel-level brightness changes, called "events". They offer significant advantages over standard cameras, namely a very high dynamic range, no motion blur, and a latency in the order of microseconds. However, because the same scene pattern can produce different events depending on the motion direction, establishing event correspondences across time is challenging. By contrast, standard cameras provide intensity measurements (frames) that do not depend on motion direction. Our method extracts features on frames and subsequently tracks them asynchronously using events, thereby exploiting the best of both types of data: the frames provide a photometric representation that does not depend on motion direction and the events provide low-latency updates. In contrast to previous works, which are based on heuristics, this is the first principled method that uses raw intensity measurements directly, based on a generative event model within a maximum-likelihood framework. As a result, our method produces feature tracks that are both more accurate (subpixel accuracy) and longer than the state of the art, across a wide variety of scenes. △ Less

Submitted 25 July, 2018; originally announced July 2018.

Comments: 22 pages, 15 figures, Video: https://youtu.be/A7UfeUnG6c4

Journal ref: European Conference on Computer Vision (ECCV), Munich, 2018

arXiv:1807.07429 [pdf, other]

doi 10.1007/978-3-030-01246-5_15

Semi-Dense 3D Reconstruction with a Stereo Event Camera

Authors: Yi Zhou, Guillermo Gallego, Henri Rebecq, Laurent Kneip, Hongdong Li, Davide Scaramuzza

Abstract: Event cameras are bio-inspired sensors that offer several advantages, such as low latency, high-speed and high dynamic range, to tackle challenging scenarios in computer vision. This paper presents a solution to the problem of 3D reconstruction from data captured by a stereo event-camera rig moving in a static scene, such as in the context of stereo Simultaneous Localization and Map**. The propo… ▽ More Event cameras are bio-inspired sensors that offer several advantages, such as low latency, high-speed and high dynamic range, to tackle challenging scenarios in computer vision. This paper presents a solution to the problem of 3D reconstruction from data captured by a stereo event-camera rig moving in a static scene, such as in the context of stereo Simultaneous Localization and Map**. The proposed method consists of the optimization of an energy function designed to exploit small-baseline spatio-temporal consistency of events triggered across both stereo image planes. To improve the density of the reconstruction and to reduce the uncertainty of the estimation, a probabilistic depth-fusion strategy is also developed. The resulting method has no special requirements on either the motion of the stereo event-camera rig or on prior knowledge about the scene. Experiments demonstrate our method can deal with both texture-rich scenes as well as sparse scenes, outperforming state-of-the-art stereo methods based on event data image representations. △ Less

Submitted 19 July, 2018; originally announced July 2018.

Comments: 19 pages, 8 figures, Video: https://youtu.be/Qrnpj2FD1e4

Journal ref: European Conference on Computer Vision (ECCV), Munich, 2018

arXiv:1806.08548 [pdf, other]

Deep Drone Racing: Learning Agile Flight in Dynamic Environments

Authors: Elia Kaufmann, Antonio Loquercio, Rene Ranftl, Alexey Dosovitskiy, Vladlen Koltun, Davide Scaramuzza

Abstract: Autonomous agile flight brings up fundamental challenges in robotics, such as co** with unreliable state estimation, reacting optimally to dynamically changing environments, and coupling perception and action in real time under severe resource constraints. In this paper, we consider these challenges in the context of autonomous, vision-based drone racing in dynamic environments. Our approach com… ▽ More Autonomous agile flight brings up fundamental challenges in robotics, such as co** with unreliable state estimation, reacting optimally to dynamically changing environments, and coupling perception and action in real time under severe resource constraints. In this paper, we consider these challenges in the context of autonomous, vision-based drone racing in dynamic environments. Our approach combines a convolutional neural network (CNN) with a state-of-the-art path-planning and control system. The CNN directly maps raw images into a robust representation in the form of a waypoint and desired speed. This information is then used by the planner to generate a short, minimum-jerk trajectory segment and corresponding motor commands to reach the desired goal. We demonstrate our method in autonomous agile flight scenarios, in which a vision-based quadrotor traverses drone-racing tracks with possibly moving gates. Our method does not require any explicit map of the environment and runs fully onboard. We extensively test the precision and robustness of the approach in simulation and in the physical world. We also evaluate our method against state-of-the-art navigation approaches and professional human drone pilots. △ Less

Submitted 9 October, 2018; v1 submitted 22 June, 2018; originally announced June 2018.

Comments: Accepted for publication in the Conference on Robotic Learning (CoRL) 2018, Zurich. 10 pages (+3 supplementary)

Journal ref: Conference on Robotic Learning (CoRL), 2018

arXiv:1805.01831 [pdf, other]

doi 10.1109/JIOT.2019.2917066

A 64mW DNN-based Visual Navigation Engine for Autonomous Nano-Drones

Authors: Daniele Palossi, Antonio Loquercio, Francesco Conti, Eric Flamand, Davide Scaramuzza, Luca Benini

Abstract: Fully-autonomous miniaturized robots (e.g., drones), with artificial intelligence (AI) based visual navigation capabilities are extremely challenging drivers of Internet-of-Things edge intelligence capabilities. Visual navigation based on AI approaches, such as deep neural networks (DNNs) are becoming pervasive for standard-size drones, but are considered out of reach for nanodrones with size of a… ▽ More Fully-autonomous miniaturized robots (e.g., drones), with artificial intelligence (AI) based visual navigation capabilities are extremely challenging drivers of Internet-of-Things edge intelligence capabilities. Visual navigation based on AI approaches, such as deep neural networks (DNNs) are becoming pervasive for standard-size drones, but are considered out of reach for nanodrones with size of a few cm${}^\mathrm{2}$. In this work, we present the first (to the best of our knowledge) demonstration of a navigation engine for autonomous nano-drones capable of closed-loop end-to-end DNN-based visual navigation. To achieve this goal we developed a complete methodology for parallel execution of complex DNNs directly on-bard of resource-constrained milliwatt-scale nodes. Our system is based on GAP8, a novel parallel ultra-low-power computing platform, and a 27 g commercial, open-source CrazyFlie 2.0 nano-quadrotor. As part of our general methodology we discuss the software map** techniques that enable the state-of-the-art deep convolutional neural network presented in [1] to be fully executed on-board within a strict 6 fps real-time constraint with no compromise in terms of flight results, while all processing is done with only 64 mW on average. Our navigation engine is flexible and can be used to span a wide performance range: at its peak performance corner it achieves 18 fps while still consuming on average just 3.5% of the power envelope of the deployed nano-aircraft. △ Less

Submitted 14 May, 2019; v1 submitted 4 May, 2018; originally announced May 2018.

Comments: 15 pages, 13 figures, 5 tables, 2 listings, accepted for publication in the IEEE Internet of Things Journal (IEEE IOTJ)

arXiv:1805.01358 [pdf, other]

SIPs: Succinct Interest Points from Unsupervised Inlierness Probability Learning

Authors: Titus Cieslewski, Konstantinos G. Derpanis, Davide Scaramuzza

Abstract: A wide range of computer vision algorithms rely on identifying sparse interest points in images and establishing correspondences between them. However, only a subset of the initially identified interest points results in true correspondences (inliers). In this paper, we seek a detector that finds the minimum number of points that are likely to result in an application-dependent "sufficient" number… ▽ More A wide range of computer vision algorithms rely on identifying sparse interest points in images and establishing correspondences between them. However, only a subset of the initially identified interest points results in true correspondences (inliers). In this paper, we seek a detector that finds the minimum number of points that are likely to result in an application-dependent "sufficient" number of inliers k. To quantify this goal, we introduce the "k-succinctness" metric. Extracting a minimum number of interest points is attractive for many applications, because it can reduce computational load, memory, and data transmission. Alongside succinctness, we introduce an unsupervised training methodology for interest point detectors that is based on predicting the probability of a given pixel being an inlier. In comparison to previous learned detectors, our method requires the least amount of data pre-processing. Our detector and other state-of-the-art detectors are extensively evaluated with respect to succinctness on popular public datasets covering both indoor and outdoor scenes, and both wide and narrow baselines. In certain cases, our detector is able to obtain an equivalent amount of inliers with as little as 60% of the amount of points of other detectors. The code and trained networks are provided at https://github.com/uzh-rpg/sips2_open . △ Less

Submitted 19 August, 2019; v1 submitted 3 May, 2018; originally announced May 2018.

Comments: 8 pages, 2p references, 1p supplementary material. Accepted for publication at the IEEE International Conference on 3D Vision (3DV), Québec City, 2019. v2 contains significant changes VS v1

Journal ref: IEEE International Conference on 3D Vision (3DV), Québec City, 2019

arXiv:1804.04811 [pdf, other]

PAMPC: Perception-Aware Model Predictive Control for Quadrotors

Authors: Davide Falanga, Philipp Foehn, Peng Lu, Davide Scaramuzza

Abstract: We present the first perception-aware model predictive control framework for quadrotors that unifies control and planning with respect to action and perception objectives. Our framework leverages numerical optimization to compute trajectories that satisfy the system dynamics and require control inputs within the limits of the platform. Simultaneously, it optimizes perception objectives for robust… ▽ More We present the first perception-aware model predictive control framework for quadrotors that unifies control and planning with respect to action and perception objectives. Our framework leverages numerical optimization to compute trajectories that satisfy the system dynamics and require control inputs within the limits of the platform. Simultaneously, it optimizes perception objectives for robust and reliable sens- ing by maximizing the visibility of a point of interest and minimizing its velocity in the image plane. Considering both perception and action objectives for motion planning and control is challenging due to the possible conflicts arising from their respective requirements. For example, for a quadrotor to track a reference trajectory, it needs to rotate to align its thrust with the direction of the desired acceleration. However, the perception objective might require to minimize such rotation to maximize the visibility of a point of interest. A model-based optimization framework, able to consider both perception and action objectives and couple them through the system dynamics, is therefore necessary. Our perception-aware model predictive control framework works in a receding-horizon fashion by iteratively solving a non-linear optimization problem. It is capable of running in real-time, fully onboard our lightweight, small-scale quadrotor using a low-power ARM computer, to- gether with a visual-inertial odometry pipeline. We validate our approach in experiments demonstrating (I) the contradiction between perception and action objectives, and (II) improved behavior in extremely challenging lighting conditions. △ Less

Submitted 10 July, 2018; v1 submitted 13 April, 2018; originally announced April 2018.

Journal ref: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, 2018

arXiv:1804.01310 [pdf, other]

doi 10.1109/CVPR.2018.00568

Event-based Vision meets Deep Learning on Steering Prediction for Self-driving Cars

Authors: Ana I. Maqueda, Antonio Loquercio, Guillermo Gallego, Narciso Garcia, Davide Scaramuzza

Abstract: Event cameras are bio-inspired vision sensors that naturally capture the dynamics of a scene, filtering out redundant information. This paper presents a deep neural network approach that unlocks the potential of event cameras on a challenging motion-estimation task: prediction of a vehicle's steering angle. To make the best out of this sensor-algorithm combination, we adapt state-of-the-art convol… ▽ More Event cameras are bio-inspired vision sensors that naturally capture the dynamics of a scene, filtering out redundant information. This paper presents a deep neural network approach that unlocks the potential of event cameras on a challenging motion-estimation task: prediction of a vehicle's steering angle. To make the best out of this sensor-algorithm combination, we adapt state-of-the-art convolutional architectures to the output of event sensors and extensively evaluate the performance of our approach on a publicly available large scale event-camera dataset (~1000 km). We present qualitative and quantitative explanations of why event cameras allow robust steering prediction even in cases where traditional cameras fail, e.g. challenging illumination conditions and fast motion. Finally, we demonstrate the advantages of leveraging transfer learning from traditional to event-based vision, and show that our approach outperforms state-of-the-art algorithms based on standard cameras. △ Less

Submitted 4 April, 2018; originally announced April 2018.

Comments: 9 pages, 8 figures, 6 tables. Video: https://youtu.be/_r_bsjkJTHA

Journal ref: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, 2018

arXiv:1804.01306 [pdf, other]

doi 10.1109/CVPR.2018.00407

A Unifying Contrast Maximization Framework for Event Cameras, with Applications to Motion, Depth, and Optical Flow Estimation

Authors: Guillermo Gallego, Henri Rebecq, Davide Scaramuzza

Abstract: We present a unifying framework to solve several computer vision problems with event cameras: motion, depth and optical flow estimation. The main idea of our framework is to find the point trajectories on the image plane that are best aligned with the event data by maximizing an objective function: the contrast of an image of warped events. Our method implicitly handles data association between th… ▽ More We present a unifying framework to solve several computer vision problems with event cameras: motion, depth and optical flow estimation. The main idea of our framework is to find the point trajectories on the image plane that are best aligned with the event data by maximizing an objective function: the contrast of an image of warped events. Our method implicitly handles data association between the events, and therefore, does not rely on additional appearance information about the scene. In addition to accurately recovering the motion parameters of the problem, our framework produces motion-corrected edge-like images with high dynamic range that can be used for further scene analysis. The proposed method is not only simple, but more importantly, it is, to the best of our knowledge, the first method that can be successfully applied to such a diverse set of important vision tasks with event cameras. △ Less

Submitted 4 April, 2018; originally announced April 2018.

Comments: 16 pages, 16 figures. Video: https://youtu.be/KFMZFhi-9Aw

Journal ref: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, 2018

arXiv:1801.02302 [pdf, other]

A Real-Time Game Theoretic Planner for Autonomous Two-Player Drone Racing

Authors: Riccardo Spica, Davide Falanga, Eric Cristofalo, Eduardo Montijano, Davide Scaramuzza, Mac Schwager

Abstract: To be successful in multi-player drone racing, a player must not only follow the race track in an optimal way, but also compete with other drones through strategic blocking, faking, and opportunistic passing while avoiding collisions. Since unveiling one's own strategy to the adversaries is not desirable, this requires each player to independently predict the other players' future actions. Nash eq… ▽ More To be successful in multi-player drone racing, a player must not only follow the race track in an optimal way, but also compete with other drones through strategic blocking, faking, and opportunistic passing while avoiding collisions. Since unveiling one's own strategy to the adversaries is not desirable, this requires each player to independently predict the other players' future actions. Nash equilibria are a powerful tool to model this and similar multi-agent coordination problems in which the absence of communication impedes full coordination between the agents. In this paper, we propose a novel receding horizon planning algorithm that, exploiting sensitivity analysis within an iterated best response computational scheme, can approximate Nash equilibria in real time. We also describe a vision-based pipeline that allows each player to estimate its opponent's relative position. We demonstrate that our solution effectively competes against alternative strategies in a large number of drone racing simulations. Hardware experiments with onboard vision sensing prove the practicality of our strategy. △ Less

Submitted 26 January, 2018; v1 submitted 7 January, 2018; originally announced January 2018.

arXiv:1712.02402 [pdf, other]

doi 10.1109/LRA.2017.2776353

Differential Flatness of Quadrotor Dynamics Subject to Rotor Drag for Accurate Tracking of High-Speed Trajectories

Authors: Matthias Faessler, Antonio Franchi, Davide Scaramuzza

Abstract: In this paper, we prove that the dynamical model of a quadrotor subject to linear rotor drag effects is differentially flat in its position and heading. We use this property to compute feed-forward control terms directly from a reference trajectory to be tracked. The obtained feed-forward terms are then used in a cascaded, nonlinear feedback control law that enables accurate agile flight with quad… ▽ More In this paper, we prove that the dynamical model of a quadrotor subject to linear rotor drag effects is differentially flat in its position and heading. We use this property to compute feed-forward control terms directly from a reference trajectory to be tracked. The obtained feed-forward terms are then used in a cascaded, nonlinear feedback control law that enables accurate agile flight with quadrotors. Compared to state-of-the-art control methods, which treat the rotor drag as an unknown disturbance, our method reduces the trajectory tracking error significantly. Finally, we present a method based on a gradient-free optimization to identify the rotor drag coefficients, which are required to compute the feed-forward control terms. The new theoretical results are thoroughly validated trough extensive comparative experiments. △ Less

Submitted 28 March, 2018; v1 submitted 6 December, 2017; originally announced December 2017.

Journal ref: Robot.Autom.Lett. 3 (2018) 620-626

arXiv:1712.02052 [pdf, other]

doi 10.1002/rob.21774

Fast, Autonomous Flight in GPS-Denied and Cluttered Environments

Authors: Kartik Mohta, Michael Watterson, Yash Mulgaonkar, Sikang Liu, Chao Qu, Anurag Makineni, Kelsey Saulnier, Ke Sun, Alex Zhu, Jeffrey Delmerico, Konstantinos Karydis, Nikolay Atanasov, Giuseppe Loianno, Davide Scaramuzza, Kostas Daniilidis, Camillo Jose Taylor, Vijay Kumar

Abstract: One of the most challenging tasks for a flying robot is to autonomously navigate between target locations quickly and reliably while avoiding obstacles in its path, and with little to no a-priori knowledge of the operating environment. This challenge is addressed in the present paper. We describe the system design and software architecture of our proposed solution, and showcase how all the distinc… ▽ More One of the most challenging tasks for a flying robot is to autonomously navigate between target locations quickly and reliably while avoiding obstacles in its path, and with little to no a-priori knowledge of the operating environment. This challenge is addressed in the present paper. We describe the system design and software architecture of our proposed solution, and showcase how all the distinct components can be integrated to enable smooth robot operation. We provide critical insight on hardware and software component selection and development, and present results from extensive experimental testing in real-world warehouse environments. Experimental testing reveals that our proposed solution can deliver fast and robust aerial robot autonomous navigation in cluttered, GPS-denied environments. △ Less

Submitted 6 December, 2017; originally announced December 2017.

Comments: Pre-peer reviewed version of the article accepted in Journal of Field Robotics

arXiv:1710.05772 [pdf, other]

Data-Efficient Decentralized Visual SLAM

Authors: Titus Cieslewski, Siddharth Choudhary, Davide Scaramuzza

Abstract: Decentralized visual simultaneous localization and map** (SLAM) is a powerful tool for multi-robot applications in environments where absolute positioning systems are not available. Being visual, it relies on cameras, cheap, lightweight and versatile sensors, and being decentralized, it does not rely on communication to a central ground station. In this work, we integrate state-of-the-art decent… ▽ More Decentralized visual simultaneous localization and map** (SLAM) is a powerful tool for multi-robot applications in environments where absolute positioning systems are not available. Being visual, it relies on cameras, cheap, lightweight and versatile sensors, and being decentralized, it does not rely on communication to a central ground station. In this work, we integrate state-of-the-art decentralized SLAM components into a new, complete decentralized visual SLAM system. To allow for data association and co-optimization, existing decentralized visual SLAM systems regularly exchange the full map data between all robots, incurring large data transfers at a complexity that scales quadratically with the robot count. In contrast, our method performs efficient data association in two stages: in the first stage a compact full-image descriptor is deterministically sent to only one robot. In the second stage, which is only executed if the first stage succeeded, the data required for relative pose estimation is sent, again to only one robot. Thus, data association scales linearly with the robot count and uses highly compact place representations. For optimization, a state-of-the-art decentralized pose-graph optimization method is used. It exchanges a minimum amount of data which is linear with trajectory overlap. We characterize the resulting system and identify bottlenecks in its components. The system is evaluated on publicly available data and we provide open access to the code. △ Less

Submitted 16 October, 2017; originally announced October 2017.

Comments: 8 pages, submitted to ICRA 2018

Journal ref: IEEE International Conference on Robotics and Automation (ICRA), 2018

Showing 101–150 of 164 results for author: Scaramuzza, D