Search | arXiv e-print repository

LingoQA: Video Question Answering for Autonomous Driving

Authors: Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, Elahe Arani, Oleg Sinavski

Abstract: Autonomous driving has long faced a challenge with public acceptance due to the lack of explainability in the decision-making process. Video question-answering (QA) in natural language provides the opportunity for bridging this gap. Nonetheless, evaluating the performance of Video QA models has proved particularly tough due to the absence of comprehensive benchmarks. To fill this gap, we introduce… ▽ More Autonomous driving has long faced a challenge with public acceptance due to the lack of explainability in the decision-making process. Video question-answering (QA) in natural language provides the opportunity for bridging this gap. Nonetheless, evaluating the performance of Video QA models has proved particularly tough due to the absence of comprehensive benchmarks. To fill this gap, we introduce LingoQA, a benchmark specifically for autonomous driving Video QA. The LingoQA trainable metric demonstrates a 0.95 Spearman correlation coefficient with human evaluations. We introduce a Video QA dataset of central London consisting of 419k samples that we release with the paper. We establish a baseline vision-language model and run extensive ablation studies to understand its performance. △ Less

Submitted 19 March, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

Comments: Benchmark and dataset are available at https://github.com/wayveai/LingoQA/

arXiv:2309.17080 [pdf, other]

GAIA-1: A Generative World Model for Autonomous Driving

Authors: Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, Gianluca Corrado

Abstract: Autonomous driving promises transformative improvements to transportation, but building systems capable of safely navigating the unstructured complexity of real-world scenarios remains challenging. A critical problem lies in effectively predicting the various potential outcomes that may emerge in response to the vehicle's actions as the world evolves. To address this challenge, we introduce GAIA… ▽ More Autonomous driving promises transformative improvements to transportation, but building systems capable of safely navigating the unstructured complexity of real-world scenarios remains challenging. A critical problem lies in effectively predicting the various potential outcomes that may emerge in response to the vehicle's actions as the world evolves. To address this challenge, we introduce GAIA-1 ('Generative AI for Autonomy'), a generative world model that leverages video, text, and action inputs to generate realistic driving scenarios while offering fine-grained control over ego-vehicle behavior and scene features. Our approach casts world modeling as an unsupervised sequence modeling problem by map** the inputs to discrete tokens, and predicting the next token in the sequence. Emerging properties from our model include learning high-level structures and scene dynamics, contextual awareness, generalization, and understanding of geometry. The power of GAIA-1's learned representation that captures expectations of future events, combined with its ability to generate realistic samples, provides new possibilities for innovation in the field of autonomy, enabling enhanced and accelerated training of autonomous driving technology. △ Less

Submitted 29 September, 2023; originally announced September 2023.

Comments: Technical Report

arXiv:2307.07147 [pdf, other]

Linking vision and motion for self-supervised object-centric perception

Authors: Kaylene C. Stocking, Zak Murez, Vijay Badrinarayanan, Jamie Shotton, Alex Kendall, Claire Tomlin, Christopher P. Burgess

Abstract: Object-centric representations enable autonomous driving algorithms to reason about interactions between many independent agents and scene features. Traditionally these representations have been obtained via supervised learning, but this decouples perception from the downstream driving task and could harm generalization. In this work we adapt a self-supervised object-centric vision model to perfor… ▽ More Object-centric representations enable autonomous driving algorithms to reason about interactions between many independent agents and scene features. Traditionally these representations have been obtained via supervised learning, but this decouples perception from the downstream driving task and could harm generalization. In this work we adapt a self-supervised object-centric vision model to perform object decomposition using only RGB video and the pose of the vehicle as inputs. We demonstrate that our method obtains promising results on the Waymo Open perception dataset. While object mask quality lags behind supervised methods or alternatives that use more privileged information, we find that our model is capable of learning a representation that fuses multiple camera viewpoints over time and successfully tracks many vehicles and pedestrians in the dataset. Code for our model is available at https://github.com/wayveai/SOCS. △ Less

Submitted 14 July, 2023; originally announced July 2023.

Comments: Presented at the CVPR 2023 Vision-Centric Autonomous Driving workshop

arXiv:2210.07729 [pdf, other]

Model-Based Imitation Learning for Urban Driving

Authors: Anthony Hu, Gianluca Corrado, Nicolas Griffiths, Zak Murez, Corina Gurau, Hudson Yeo, Alex Kendall, Roberto Cipolla, Jamie Shotton

Abstract: An accurate model of the environment and the dynamic agents acting in it offers great potential for improving motion planning. We present MILE: a Model-based Imitation LEarning approach to jointly learn a model of the world and a policy for autonomous driving. Our method leverages 3D geometry as an inductive bias and learns a highly compact latent space directly from high-resolution videos of expe… ▽ More An accurate model of the environment and the dynamic agents acting in it offers great potential for improving motion planning. We present MILE: a Model-based Imitation LEarning approach to jointly learn a model of the world and a policy for autonomous driving. Our method leverages 3D geometry as an inductive bias and learns a highly compact latent space directly from high-resolution videos of expert demonstrations. Our model is trained on an offline corpus of urban driving data, without any online interaction with the environment. MILE improves upon prior state-of-the-art by 31% in driving score on the CARLA simulator when deployed in a completely new town and new weather conditions. Our model can predict diverse and plausible states and actions, that can be interpretably decoded to bird's-eye view semantic segmentation. Further, we demonstrate that it can execute complex driving manoeuvres from plans entirely predicted in imagination. Our approach is the first camera-only method that models static scene, dynamic scene, and ego-behaviour in an urban driving environment. The code and model weights are available at https://github.com/wayveai/mile. △ Less

Submitted 3 November, 2022; v1 submitted 14 October, 2022; originally announced October 2022.

Comments: NeurIPS 2022

arXiv:2108.05805 [pdf, other]

Reimagining an autonomous vehicle

Authors: Jeffrey Hawke, Haibo E, Vijay Badrinarayanan, Alex Kendall

Abstract: The self driving challenge in 2021 is this century's technological equivalent of the space race, and is now entering the second major decade of development. Solving the technology will create social change which parallels the invention of the automobile itself. Today's autonomous driving technology is laudable, though rooted in decisions made a decade ago. We argue that a rethink is required, reco… ▽ More The self driving challenge in 2021 is this century's technological equivalent of the space race, and is now entering the second major decade of development. Solving the technology will create social change which parallels the invention of the automobile itself. Today's autonomous driving technology is laudable, though rooted in decisions made a decade ago. We argue that a rethink is required, reconsidering the autonomous vehicle (AV) problem in the light of the body of knowledge that has been gained since the DARPA challenges which seeded the industry. What does AV2.0 look like? We present an alternative vision: a recipe for driving with machine learning, and grand challenges for research in driving. △ Less

Submitted 12 August, 2021; originally announced August 2021.

Comments: Under review

arXiv:2105.03533 [pdf, other]

Video Class Agnostic Segmentation with Contrastive Learning for Autonomous Driving

Authors: Mennatullah Siam, Alex Kendall, Martin Jagersand

Abstract: Semantic segmentation in autonomous driving predominantly focuses on learning from large-scale data with a closed set of known classes without considering unknown objects. Motivated by safety reasons, we address the video class agnostic segmentation task, which considers unknown objects outside the closed set of known classes in our training data. We propose a novel auxiliary contrastive loss to l… ▽ More Semantic segmentation in autonomous driving predominantly focuses on learning from large-scale data with a closed set of known classes without considering unknown objects. Motivated by safety reasons, we address the video class agnostic segmentation task, which considers unknown objects outside the closed set of known classes in our training data. We propose a novel auxiliary contrastive loss to learn the segmentation of known classes and unknown objects. Unlike previous work in contrastive learning that samples the anchor, positive and negative examples on an image level, our contrastive learning method leverages pixel-wise semantic and temporal guidance. We conduct experiments on Cityscapes-VPS by withholding four classes from training and show an improvement gain for both known and unknown objects segmentation with the auxiliary contrastive loss. We further release a large-scale synthetic dataset for different autonomous driving scenarios that includes distinct and rare unknown objects. We conduct experiments on the full synthetic dataset and a reduced small-scale version, and show how contrastive learning is more effective in small scale datasets. Our proposed models, dataset, and code will be released at https://github.com/MSiam/video_class_agnostic_segmentation. △ Less

Submitted 10 May, 2021; v1 submitted 7 May, 2021; originally announced May 2021.

arXiv:2104.10490 [pdf, other]

FIERY: Future Instance Prediction in Bird's-Eye View from Surround Monocular Cameras

Authors: Anthony Hu, Zak Murez, Nikhil Mohan, Sofía Dudas, Jeffrey Hawke, Vijay Badrinarayanan, Roberto Cipolla, Alex Kendall

Abstract: Driving requires interacting with road agents and predicting their future behaviour in order to navigate safely. We present FIERY: a probabilistic future prediction model in bird's-eye view from monocular cameras. Our model predicts future instance segmentation and motion of dynamic agents that can be transformed into non-parametric future trajectories. Our approach combines the perception, sensor… ▽ More Driving requires interacting with road agents and predicting their future behaviour in order to navigate safely. We present FIERY: a probabilistic future prediction model in bird's-eye view from monocular cameras. Our model predicts future instance segmentation and motion of dynamic agents that can be transformed into non-parametric future trajectories. Our approach combines the perception, sensor fusion and prediction components of a traditional autonomous driving stack by estimating bird's-eye-view prediction directly from surround RGB monocular camera inputs. FIERY learns to model the inherent stochastic nature of the future solely from camera driving data in an end-to-end manner, without relying on HD maps, and predicts multimodal future trajectories. We show that our model outperforms previous prediction baselines on the NuScenes and Lyft datasets. The code and trained models are available at https://github.com/wayveai/fiery. △ Less

Submitted 18 October, 2021; v1 submitted 21 April, 2021; originally announced April 2021.

Comments: ICCV 2021

arXiv:2103.11015 [pdf, other]

Video Class Agnostic Segmentation Benchmark for Autonomous Driving

Authors: Mennatullah Siam, Alex Kendall, Martin Jagersand

Abstract: Semantic segmentation approaches are typically trained on large-scale data with a closed finite set of known classes without considering unknown objects. In certain safety-critical robotics applications, especially autonomous driving, it is important to segment all objects, including those unknown at training time. We formalize the task of video class agnostic segmentation from monocular video seq… ▽ More Semantic segmentation approaches are typically trained on large-scale data with a closed finite set of known classes without considering unknown objects. In certain safety-critical robotics applications, especially autonomous driving, it is important to segment all objects, including those unknown at training time. We formalize the task of video class agnostic segmentation from monocular video sequences in autonomous driving to account for unknown objects. Video class agnostic segmentation can be formulated as an open-set or a motion segmentation problem. We discuss both formulations and provide datasets and benchmark different baseline approaches for both tracks. In the motion-segmentation track we benchmark real-time joint panoptic and motion instance segmentation, and evaluate the effect of ego-flow suppression. In the open-set segmentation track we evaluate baseline methods that combine appearance, and geometry to learn prototypes per semantic class. We then compare it to a model that uses an auxiliary contrastive loss to improve the discrimination between known and unknown objects. Datasets and models are publicly released at https://msiam.github.io/vca/. △ Less

Submitted 19 April, 2021; v1 submitted 19 March, 2021; originally announced March 2021.

Comments: Accepted in WAD workshop, CVPR 2021

arXiv:2102.09144 [pdf, other]

Stochastic Spatio-Temporal Optimization for Control and Co-Design of Systems in Robotics and Applied Physics

Authors: Ethan N. Evans, Andrew P. Kendall, Evangelos A. Theodorou

Abstract: Correlated with the trend of increasing degrees of freedom in robotic systems is a similar trend of rising interest in Spatio-Temporal systems described by Partial Differential Equations (PDEs) among the robotics and control communities. These systems often exhibit dramatic under-actuation, high dimensionality, bifurcations, and multimodal instabilities. Their control represents many of the curren… ▽ More Correlated with the trend of increasing degrees of freedom in robotic systems is a similar trend of rising interest in Spatio-Temporal systems described by Partial Differential Equations (PDEs) among the robotics and control communities. These systems often exhibit dramatic under-actuation, high dimensionality, bifurcations, and multimodal instabilities. Their control represents many of the current-day challenges facing the robotics and automation communities. Not only are these systems challenging to control, but the design of their actuation is an NP-hard problem on its own. Recent methods either discretize the space before optimization, or apply tools from linear systems theory under restrictive linearity assumptions in order to arrive at a control solution. This manuscript provides a novel sampling-based stochastic optimization framework based entirely in Hilbert spaces suitable for the general class of \textit{semi-linear} SPDEs which describes many systems in robotics and applied physics. This framework is utilized for simultaneous policy optimization and actuator co-design optimization. The resulting algorithm is based on variational optimization, and performs joint episodic optimization of the feedback control law and the actuation design over episodes. We study first and second order systems, and in doing so, extend several results to the case of second order SPDEs. Finally, we demonstrate the efficacy of the proposed approach with several simulated experiments on a variety of SPDEs in robotics and applied physics including an infinite degree-of-freedom soft robotic manipulator. △ Less

Submitted 17 February, 2021; originally announced February 2021.

Comments: 34 pages, 10 figures. Submitted to Autonomous Robots special issue of RSS 2020. arXiv admin note: text overlap with arXiv:2002.01397

arXiv:2003.06409 [pdf, other]

Probabilistic Future Prediction for Video Scene Understanding

Authors: Anthony Hu, Fergal Cotter, Nikhil Mohan, Corina Gurau, Alex Kendall

Abstract: We present a novel deep learning architecture for probabilistic future prediction from video. We predict the future semantics, geometry and motion of complex real-world urban scenes and use this representation to control an autonomous vehicle. This work is the first to jointly predict ego-motion, static scene, and the motion of dynamic agents in a probabilistic manner, which allows sampling consis… ▽ More We present a novel deep learning architecture for probabilistic future prediction from video. We predict the future semantics, geometry and motion of complex real-world urban scenes and use this representation to control an autonomous vehicle. This work is the first to jointly predict ego-motion, static scene, and the motion of dynamic agents in a probabilistic manner, which allows sampling consistent, highly probable futures from a compact latent space. Our model learns a representation from RGB video with a spatio-temporal convolutional module. The learned representation can be explicitly decoded to future semantic segmentation, depth, and optical flow, in addition to being an input to a learnt driving policy. To model the stochasticity of the future, we introduce a conditional variational approach which minimises the divergence between the present distribution (what could happen given what we have seen) and the future distribution (what we observe actually happens). During inference, diverse futures are generated by sampling from the present distribution. △ Less

Submitted 17 July, 2020; v1 submitted 13 March, 2020; originally announced March 2020.

Comments: Accepted as a conference paper at ECCV 2020

arXiv:1912.08969 [pdf, other]

Learning a Spatio-Temporal Embedding for Video Instance Segmentation

Authors: Anthony Hu, Alex Kendall, Roberto Cipolla

Abstract: We present a novel embedding approach for video instance segmentation. Our method learns a spatio-temporal embedding integrating cues from appearance, motion, and geometry; a 3D causal convolutional network models motion, and a monocular self-supervised depth loss models geometry. In this embedding space, video-pixels of the same instance are clustered together while being separated from other ins… ▽ More We present a novel embedding approach for video instance segmentation. Our method learns a spatio-temporal embedding integrating cues from appearance, motion, and geometry; a 3D causal convolutional network models motion, and a monocular self-supervised depth loss models geometry. In this embedding space, video-pixels of the same instance are clustered together while being separated from other instances, to naturally track instances over time without any complex post-processing. Our network runs in real-time as our architecture is entirely causal - we do not incorporate information from future frames, contrary to previous methods. We show that our model can accurately track and segment instances, even with occlusions and missed detections, advancing the state-of-the-art on the KITTI Multi-Object and Tracking Dataset. △ Less

Submitted 18 December, 2019; originally announced December 2019.

arXiv:1912.00177 [pdf, other]

Urban Driving with Conditional Imitation Learning

Authors: Jeffrey Hawke, Richard Shen, Corina Gurau, Siddharth Sharma, Daniele Reda, Nikolay Nikolov, Przemyslaw Mazur, Sean Micklethwaite, Nicolas Griffiths, Amar Shah, Alex Kendall

Abstract: Hand-crafting generalised decision-making rules for real-world urban autonomous driving is hard. Alternatively, learning behaviour from easy-to-collect human driving demonstrations is appealing. Prior work has studied imitation learning (IL) for autonomous driving with a number of limitations. Examples include only performing lane-following rather than following a user-defined route, only using a… ▽ More Hand-crafting generalised decision-making rules for real-world urban autonomous driving is hard. Alternatively, learning behaviour from easy-to-collect human driving demonstrations is appealing. Prior work has studied imitation learning (IL) for autonomous driving with a number of limitations. Examples include only performing lane-following rather than following a user-defined route, only using a single camera view or heavily cropped frames lacking state observability, only lateral (steering) control, but not longitudinal (speed) control and a lack of interaction with traffic. Importantly, the majority of such systems have been primarily evaluated in simulation - a simple domain, which lacks real-world complexities. Motivated by these challenges, we focus on learning representations of semantics, geometry and motion with computer vision for IL from human driving demonstrations. As our main contribution, we present an end-to-end conditional imitation learning approach, combining both lateral and longitudinal control on a real vehicle for following urban routes with simple traffic. We address inherent dataset bias by data balancing, training our final policy on approximately 30 hours of demonstrations gathered over six months. We evaluate our method on an autonomous vehicle by driving 35km of novel routes in European urban streets. △ Less

Submitted 5 December, 2019; v1 submitted 30 November, 2019; originally announced December 2019.

Comments: Under submission; added acknowledgements

arXiv:1901.02709 [pdf]

A taxonomy of circular economy indicators

Authors: Michael Saidani, Bernard Yannou, Yann Leroy, François Cluzel, Alissa Kendall

Abstract: Implementing circular economy (CE) principles is increasingly recommended as a convenient solution to meet the goals of sustainable development. New tools are required to support practitioners, decision-makers and policy-makers towards more CE practices, as well as to monitor the effects of CE adoption. Worldwide, academics, industrialists and politicians all agree on the need to use CE-related me… ▽ More Implementing circular economy (CE) principles is increasingly recommended as a convenient solution to meet the goals of sustainable development. New tools are required to support practitioners, decision-makers and policy-makers towards more CE practices, as well as to monitor the effects of CE adoption. Worldwide, academics, industrialists and politicians all agree on the need to use CE-related measuring instruments to manage this transition at different systemic levels. In this context, a wide range of circularity indicators (C-indicators) has been developed in recent years. Yet, as there is not one single definition of the CE concept, it is of the utmost importance to know what the available indicators measure in order to use them properly. Indeed, through a systematic literature review-considering both academic and grey literature-55 sets of C-indicators, developed by scholars, consulting companies and governmental agencies, have been identified, encompassing different purposes, scopes, and potential usages. Inspired by existing taxonomies of eco-design tools and sustainability indicators, and in line with the CE characteristics, a classification of indicators aiming to assess, improve, monitor and communicate on the CE performance is proposed and discussed. In the developed taxonomy including 10 categories, C-indicators are differentiated regarding criteria such as the levels of CE implementation (e.g. micro, meso, macro), the CE loops (maintain, reuse, remanufacture, recycle), the performance (intrinsic, impacts), the perspective of circularity (actual, potential) they are taking into account, or their degree of transversality (generic, sector-specific). In addition, the database inventorying the 55 sets of C-indicators is linked to an Excel-based query tool to facilitate the selection of appropriate indicators according to the specific user's needs and requirements. This study enriches the literature by giving a first need-driven taxonomy of C-indicators, which is experienced on several use cases. It provides a synthesis and clarification to the emerging and must-needed research theme of C-indicators, and sheds some light on remaining key challenges like their effective uptake by industry. Eventually, limitations, improvement areas, as well as implications of the proposed taxonomy are intently addressed to guide future research on C-indicators and CE implementation. △ Less

Submitted 29 December, 2018; originally announced January 2019.

Journal ref: Journal of Cleaner Production, Elsevier, 2019, 207, pp.542-559

arXiv:1812.03823 [pdf, other]

Learning to Drive from Simulation without Real World Labels

Authors: Alex Bewley, Jessica Rigley, Yuxuan Liu, Jeffrey Hawke, Richard Shen, Vinh-Dieu Lam, Alex Kendall

Abstract: Simulation can be a powerful tool for understanding machine learning systems and designing methods to solve real-world problems. Training and evaluating methods purely in simulation is often "doomed to succeed" at the desired task in a simulated environment, but the resulting models are incapable of operation in the real world. Here we present and evaluate a method for transferring a vision-based… ▽ More Simulation can be a powerful tool for understanding machine learning systems and designing methods to solve real-world problems. Training and evaluating methods purely in simulation is often "doomed to succeed" at the desired task in a simulated environment, but the resulting models are incapable of operation in the real world. Here we present and evaluate a method for transferring a vision-based lane following driving policy from simulation to operation on a rural road without any real-world labels. Our approach leverages recent advances in image-to-image translation to achieve domain transfer while jointly learning a single-camera control policy from simulation control labels. We assess the driving performance of this method using both open-loop regression metrics, and closed-loop performance operating an autonomous vehicle on rural and urban roads. △ Less

Submitted 13 December, 2018; v1 submitted 10 December, 2018; originally announced December 2018.

arXiv:1811.08188 [pdf, other]

Orthographic Feature Transform for Monocular 3D Object Detection

Authors: Thomas Roddick, Alex Kendall, Roberto Cipolla

Abstract: 3D object detection from monocular images has proven to be an enormously challenging task, with the performance of leading systems not yet achieving even 10\% of that of LiDAR-based counterparts. One explanation for this performance gap is that existing systems are entirely at the mercy of the perspective image-based representation, in which the appearance and scale of objects varies drastically w… ▽ More 3D object detection from monocular images has proven to be an enormously challenging task, with the performance of leading systems not yet achieving even 10\% of that of LiDAR-based counterparts. One explanation for this performance gap is that existing systems are entirely at the mercy of the perspective image-based representation, in which the appearance and scale of objects varies drastically with depth and meaningful distances are difficult to infer. In this work we argue that the ability to reason about the world in 3D is an essential element of the 3D object detection task. To this end, we introduce the orthographic feature transform, which enables us to escape the image domain by map** image-based features into an orthographic 3D space. This allows us to reason holistically about the spatial configuration of the scene in a domain where scale is consistent and distances between objects are meaningful. We apply this transformation as part of an end-to-end deep learning architecture and achieve state-of-the-art performance on the KITTI 3D object benchmark.\footnote{We will release full source code and pretrained models upon acceptance of this manuscript for publication. △ Less

Submitted 20 November, 2018; originally announced November 2018.

arXiv:1807.00412 [pdf, other]

Learning to Drive in a Day

Authors: Alex Kendall, Jeffrey Hawke, David Janz, Przemyslaw Mazur, Daniele Reda, John-Mark Allen, Vinh-Dieu Lam, Alex Bewley, Amar Shah

Abstract: We demonstrate the first application of deep reinforcement learning to autonomous driving. From randomly initialised parameters, our model is able to learn a policy for lane following in a handful of training episodes using a single monocular image as input. We provide a general and easy to obtain reward: the distance travelled by the vehicle without the safety driver taking control. We use a cont… ▽ More We demonstrate the first application of deep reinforcement learning to autonomous driving. From randomly initialised parameters, our model is able to learn a policy for lane following in a handful of training episodes using a single monocular image as input. We provide a general and easy to obtain reward: the distance travelled by the vehicle without the safety driver taking control. We use a continuous, model-free deep reinforcement learning algorithm, with all exploration and optimisation performed on-vehicle. This demonstrates a new framework for autonomous driving which moves away from reliance on defined logical rules, map**, and direct supervision. We discuss the challenges and opportunities to scale this approach to a broader range of autonomous driving tasks. △ Less

Submitted 11 September, 2018; v1 submitted 1 July, 2018; originally announced July 2018.

Comments: Further results and demo videos can be viewed at: https://wayve.ai/blog/l2diad

arXiv:1705.07115 [pdf, other]

Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics

Authors: Alex Kendall, Yarin Gal, Roberto Cipolla

Abstract: Numerous deep learning applications benefit from multi-task learning with multiple regression and classification objectives. In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task's loss. Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice. We prop… ▽ More Numerous deep learning applications benefit from multi-task learning with multiple regression and classification objectives. In this paper we make the observation that the performance of such systems is strongly dependent on the relative weighting between each task's loss. Tuning these weights by hand is a difficult and expensive process, making multi-task learning prohibitive in practice. We propose a principled approach to multi-task deep learning which weighs multiple loss functions by considering the homoscedastic uncertainty of each task. This allows us to simultaneously learn various quantities with different units or scales in both classification and regression settings. We demonstrate our model learning per-pixel depth regression, semantic and instance segmentation from a monocular input image. Perhaps surprisingly, we show our model can learn multi-task weightings and outperform separate models trained individually on each task. △ Less

Submitted 24 April, 2018; v1 submitted 19 May, 2017; originally announced May 2017.

Comments: CVPR 2018

arXiv:1704.00390 [pdf, other]

Geometric Loss Functions for Camera Pose Regression with Deep Learning

Authors: Alex Kendall, Roberto Cipolla

Abstract: Deep learning has shown to be effective for robust and real-time monocular image relocalisation. In particular, PoseNet is a deep convolutional neural network which learns to regress the 6-DOF camera pose from a single image. It learns to localize using high level features and is robust to difficult lighting, motion blur and unknown camera intrinsics, where point based SIFT registration fails. How… ▽ More Deep learning has shown to be effective for robust and real-time monocular image relocalisation. In particular, PoseNet is a deep convolutional neural network which learns to regress the 6-DOF camera pose from a single image. It learns to localize using high level features and is robust to difficult lighting, motion blur and unknown camera intrinsics, where point based SIFT registration fails. However, it was trained using a naive loss function, with hyper-parameters which require expensive tuning. In this paper, we give the problem a more fundamental theoretical treatment. We explore a number of novel loss functions for learning camera pose which are based on geometry and scene reprojection error. Additionally we show how to automatically learn an optimal weighting to simultaneously regress position and orientation. By leveraging geometry, we demonstrate that our technique significantly improves PoseNet's performance across datasets ranging from indoor rooms to a small city. △ Less

Submitted 23 May, 2017; v1 submitted 2 April, 2017; originally announced April 2017.

Comments: CVPR 2017

arXiv:1703.04977 [pdf, other]

What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?

Authors: Alex Kendall, Yarin Gal

Abstract: There are two major types of uncertainty one can model. Aleatoric uncertainty captures noise inherent in the observations. On the other hand, epistemic uncertainty accounts for uncertainty in the model -- uncertainty which can be explained away given enough data. Traditionally it has been difficult to model epistemic uncertainty in computer vision, but with new Bayesian deep learning tools this is… ▽ More There are two major types of uncertainty one can model. Aleatoric uncertainty captures noise inherent in the observations. On the other hand, epistemic uncertainty accounts for uncertainty in the model -- uncertainty which can be explained away given enough data. Traditionally it has been difficult to model epistemic uncertainty in computer vision, but with new Bayesian deep learning tools this is now possible. We study the benefits of modeling epistemic vs. aleatoric uncertainty in Bayesian deep learning models for vision tasks. For this we present a Bayesian deep learning framework combining input-dependent aleatoric uncertainty together with epistemic uncertainty. We study models under the framework with per-pixel semantic segmentation and depth regression tasks. Further, our explicit uncertainty formulation leads to new loss functions for these tasks, which can be interpreted as learned attenuation. This makes the loss more robust to noisy data, also giving new state-of-the-art results on segmentation and depth regression benchmarks. △ Less

Submitted 5 October, 2017; v1 submitted 15 March, 2017; originally announced March 2017.

Comments: NIPS 2017

arXiv:1703.04309 [pdf, other]

End-to-End Learning of Geometry and Context for Deep Stereo Regression

Authors: Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter Henry, Ryan Kennedy, Abraham Bachrach, Adam Bry

Abstract: We propose a novel deep learning architecture for regressing disparity from a rectified pair of stereo images. We leverage knowledge of the problem's geometry to form a cost volume using deep feature representations. We learn to incorporate contextual information using 3-D convolutions over this volume. Disparity values are regressed from the cost volume using a proposed differentiable soft argmin… ▽ More We propose a novel deep learning architecture for regressing disparity from a rectified pair of stereo images. We leverage knowledge of the problem's geometry to form a cost volume using deep feature representations. We learn to incorporate contextual information using 3-D convolutions over this volume. Disparity values are regressed from the cost volume using a proposed differentiable soft argmin operation, which allows us to train our method end-to-end to sub-pixel accuracy without any additional post-processing or regularization. We evaluate our method on the Scene Flow and KITTI datasets and on KITTI we set a new state-of-the-art benchmark, while being significantly faster than competing approaches. △ Less

Submitted 13 March, 2017; originally announced March 2017.

arXiv:1511.02680 [pdf, other]

Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding

Authors: Alex Kendall, Vijay Badrinarayanan, Roberto Cipolla

Abstract: We present a deep learning framework for probabilistic pixel-wise semantic segmentation, which we term Bayesian SegNet. Semantic segmentation is an important tool for visual scene understanding and a meaningful measure of uncertainty is essential for decision making. Our contribution is a practical system which is able to predict pixel-wise class labels with a measure of model uncertainty. We achi… ▽ More We present a deep learning framework for probabilistic pixel-wise semantic segmentation, which we term Bayesian SegNet. Semantic segmentation is an important tool for visual scene understanding and a meaningful measure of uncertainty is essential for decision making. Our contribution is a practical system which is able to predict pixel-wise class labels with a measure of model uncertainty. We achieve this by Monte Carlo sampling with dropout at test time to generate a posterior distribution of pixel class labels. In addition, we show that modelling uncertainty improves segmentation performance by 2-3% across a number of state of the art architectures such as SegNet, FCN and Dilation Network, with no additional parametrisation. We also observe a significant improvement in performance for smaller datasets where modelling uncertainty is more effective. We benchmark Bayesian SegNet on the indoor SUN Scene Understanding and outdoor CamVid driving scenes datasets. △ Less

Submitted 10 October, 2016; v1 submitted 9 November, 2015; originally announced November 2015.

arXiv:1511.00561 [pdf, other]

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

Authors: Vijay Badrinarayanan, Alex Kendall, Roberto Cipolla

Abstract: We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network, a corresponding decoder network followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16… ▽ More We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network, a corresponding decoder network followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network. The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies is in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN and also with the well known DeepLab-LargeFOV, DeconvNet architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance. SegNet was primarily motivated by scene understanding applications. Hence, it is designed to be efficient both in terms of memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than other competing architectures. We also performed a controlled benchmark of SegNet and other architectures on both road scenes and SUN RGB-D indoor scene segmentation tasks. We show that SegNet provides good performance with competitive inference time and more efficient inference memory-wise as compared to other architectures. We also provide a Caffe implementation of SegNet and a web demo at http://mi.eng.cam.ac.uk/projects/segnet/. △ Less

Submitted 10 October, 2016; v1 submitted 2 November, 2015; originally announced November 2015.

arXiv:1509.05909 [pdf, other]

Modelling Uncertainty in Deep Learning for Camera Relocalization

Authors: Alex Kendall, Roberto Cipolla

Abstract: We present a robust and real-time monocular six degree of freedom visual relocalization system. We use a Bayesian convolutional neural network to regress the 6-DOF camera pose from a single RGB image. It is trained in an end-to-end manner with no need of additional engineering or graph optimisation. The algorithm can operate indoors and outdoors in real time, taking under 6ms to compute. It obtain… ▽ More We present a robust and real-time monocular six degree of freedom visual relocalization system. We use a Bayesian convolutional neural network to regress the 6-DOF camera pose from a single RGB image. It is trained in an end-to-end manner with no need of additional engineering or graph optimisation. The algorithm can operate indoors and outdoors in real time, taking under 6ms to compute. It obtains approximately 2m and 6 degrees accuracy for very large scale outdoor scenes and 0.5m and 10 degrees accuracy indoors. Using a Bayesian convolutional neural network implementation we obtain an estimate of the model's relocalization uncertainty and improve state of the art localization accuracy on a large scale outdoor dataset. We leverage the uncertainty measure to estimate metric relocalization error and to detect the presence or absence of the scene in the input image. We show that the model's uncertainty is caused by images being dissimilar to the training dataset in either pose or appearance. △ Less

Submitted 18 February, 2016; v1 submitted 19 September, 2015; originally announced September 2015.

Comments: ICRA 2016; Fixed numerical error with rotation results

arXiv:1505.07427 [pdf, other]

PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization

Authors: Alex Kendall, Matthew Grimes, Roberto Cipolla

Abstract: We present a robust and real-time monocular six degree of freedom relocalization system. Our system trains a convolutional neural network to regress the 6-DOF camera pose from a single RGB image in an end-to-end manner with no need of additional engineering or graph optimisation. The algorithm can operate indoors and outdoors in real time, taking 5ms per frame to compute. It obtains approximately… ▽ More We present a robust and real-time monocular six degree of freedom relocalization system. Our system trains a convolutional neural network to regress the 6-DOF camera pose from a single RGB image in an end-to-end manner with no need of additional engineering or graph optimisation. The algorithm can operate indoors and outdoors in real time, taking 5ms per frame to compute. It obtains approximately 2m and 6 degree accuracy for large scale outdoor scenes and 0.5m and 10 degree accuracy indoors. This is achieved using an efficient 23 layer deep convnet, demonstrating that convnets can be used to solve complicated out of image plane regression problems. This was made possible by leveraging transfer learning from large scale classification data. We show the convnet localizes from high level features and is robust to difficult lighting, motion blur and different camera intrinsics where point based SIFT registration fails. Furthermore we show how the pose feature that is produced generalizes to other scenes allowing us to regress pose with only a few dozen training examples. PoseNet code, dataset and an online demonstration is available on our project webpage, at http://mi.eng.cam.ac.uk/projects/relocalisation/ △ Less

Submitted 18 February, 2016; v1 submitted 27 May, 2015; originally announced May 2015.

Comments: 9 pages, 13 figures; Corrected numerical error in orientation results

Showing 1–24 of 24 results for author: Kendall, A