Search | arXiv e-print repository

PixRO: Pixel-Distributed Rotational Odometry with Gaussian Belief Propagation

Authors: Ignacio Alzugaray, Riku Murai, Andrew Davison

Abstract: Visual sensors are not only becoming better at capturing high-quality images but also they have steadily increased their capabilities in processing data on their own on-chip. Yet the majority of VO pipelines rely on the transmission and processing of full images in a centralized unit (e.g. CPU or GPU), which often contain much redundant and low-quality information for the task. In this paper, we a… ▽ More Visual sensors are not only becoming better at capturing high-quality images but also they have steadily increased their capabilities in processing data on their own on-chip. Yet the majority of VO pipelines rely on the transmission and processing of full images in a centralized unit (e.g. CPU or GPU), which often contain much redundant and low-quality information for the task. In this paper, we address the task of frame-to-frame rotational estimation but, instead of reasoning about relative motion between frames using the full images, distribute the estimation at pixel-level. In this paradigm, each pixel produces an estimate of the global motion by only relying on local information and local message-passing with neighbouring pixels. The resulting per-pixel estimates can then be communicated to downstream tasks, yielding higher-level, informative cues instead of the original raw pixel-readings. We evaluate the proposed approach on real public datasets, where we offer detailed insights about this novel technique and open-source our implementation for the future benefit of the community. △ Less

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2404.03531 [pdf, other]

COMO: Compact Map** and Odometry

Authors: Eric Dexheimer, Andrew J. Davison

Abstract: We present COMO, a real-time monocular map** and odometry system that encodes dense geometry via a compact set of 3D anchor points. Decoding anchor point projections into dense geometry via per-keyframe depth covariance functions guarantees that depth maps are joined together at visible anchor points. The representation enables joint optimization of camera poses and dense geometry, intrinsic 3D… ▽ More We present COMO, a real-time monocular map** and odometry system that encodes dense geometry via a compact set of 3D anchor points. Decoding anchor point projections into dense geometry via per-keyframe depth covariance functions guarantees that depth maps are joined together at visible anchor points. The representation enables joint optimization of camera poses and dense geometry, intrinsic 3D consistency, and efficient second-order inference. To maintain a compact yet expressive map, we introduce a frontend that leverages the covariance function for tracking and initializing potentially visually indistinct 3D points across frames. Altogether, we introduce a real-time system capable of estimating accurate poses and consistent geometry. △ Less

Submitted 4 April, 2024; originally announced April 2024.

arXiv:2403.15583 [pdf, other]

U-ARE-ME: Uncertainty-Aware Rotation Estimation in Manhattan Environments

Authors: Aalok Patwardhan, Callum Rhodes, Gwangbin Bae, Andrew J. Davison

Abstract: Camera rotation estimation from a single image is a challenging task, often requiring depth data and/or camera intrinsics, which are generally not available for in-the-wild videos. Although external sensors such as inertial measurement units (IMUs) can help, they often suffer from drift and are not applicable in non-inertial reference frames. We present U-ARE-ME, an algorithm that estimates camera… ▽ More Camera rotation estimation from a single image is a challenging task, often requiring depth data and/or camera intrinsics, which are generally not available for in-the-wild videos. Although external sensors such as inertial measurement units (IMUs) can help, they often suffer from drift and are not applicable in non-inertial reference frames. We present U-ARE-ME, an algorithm that estimates camera rotation along with uncertainty from uncalibrated RGB images. Using a Manhattan World assumption, our method leverages the per-pixel geometric priors encoded in single-image surface normal predictions and performs optimisation over the SO(3) manifold. Given a sequence of images, we can use the per-frame rotation estimates and their uncertainty to perform multi-frame optimisation, achieving robustness and temporal consistency. Our experiments demonstrate that U-ARE-ME performs comparably to RGB-D methods and is more robust than sparse feature-based SLAM methods. We encourage the reader to view the accompanying video at https://callum-rhodes.github.io/U-ARE-ME for a visual overview of our method. △ Less

Submitted 22 March, 2024; originally announced March 2024.

Comments: For the project page and video see https://callum-rhodes.github.io/U-ARE-ME

arXiv:2403.00712 [pdf, other]

Rethinking Inductive Biases for Surface Normal Estimation

Authors: Gwangbin Bae, Andrew J. Davison

Abstract: Despite the growing demand for accurate surface normal estimation models, existing methods use general-purpose dense prediction models, adopting the same inductive biases as other tasks. In this paper, we discuss the inductive biases needed for surface normal estimation and propose to (1) utilize the per-pixel ray direction and (2) encode the relationship between neighboring surface normals by lea… ▽ More Despite the growing demand for accurate surface normal estimation models, existing methods use general-purpose dense prediction models, adopting the same inductive biases as other tasks. In this paper, we discuss the inductive biases needed for surface normal estimation and propose to (1) utilize the per-pixel ray direction and (2) encode the relationship between neighboring surface normals by learning their relative rotation. The proposed method can generate crisp - yet, piecewise smooth - predictions for challenging in-the-wild images of arbitrary resolution and aspect ratio. Compared to a recent ViT-based state-of-the-art model, our method shows a stronger generalization ability, despite being trained on an orders of magnitude smaller dataset. The code is available at https://github.com/baegwangbin/DSINE. △ Less

Submitted 1 March, 2024; originally announced March 2024.

Comments: CVPR 2024 (camera-ready version will be uploaded in March 2024)

arXiv:2402.03908 [pdf, other]

EscherNet: A Generative Model for Scalable View Synthesis

Authors: Xin Kong, Shikun Liu, Xiaoyang Lyu, Marwan Taher, Xiaojuan Qi, Andrew J. Davison

Abstract: We introduce EscherNet, a multi-view conditioned diffusion model for view synthesis. EscherNet learns implicit and generative 3D representations coupled with a specialised camera positional encoding, allowing precise and continuous relative control of the camera transformation between an arbitrary number of reference and target views. EscherNet offers exceptional generality, flexibility, and scala… ▽ More We introduce EscherNet, a multi-view conditioned diffusion model for view synthesis. EscherNet learns implicit and generative 3D representations coupled with a specialised camera positional encoding, allowing precise and continuous relative control of the camera transformation between an arbitrary number of reference and target views. EscherNet offers exceptional generality, flexibility, and scalability in view synthesis -- it can generate more than 100 consistent target views simultaneously on a single consumer-grade GPU, despite being trained with a fixed number of 3 reference views to 3 target views. As a result, EscherNet not only addresses zero-shot novel view synthesis, but also naturally unifies single- and multi-image 3D reconstruction, combining these diverse tasks into a single, cohesive framework. Our extensive experiments demonstrate that EscherNet achieves state-of-the-art performance in multiple benchmarks, even when compared to methods specifically tailored for each individual problem. This remarkable versatility opens up new directions for designing scalable neural architectures for 3D vision. Project page: https://kxhit.github.io/EscherNet. △ Less

Submitted 19 March, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

Comments: CVPR2024 Project Page: https://kxhit.github.io/EscherNet

arXiv:2401.15036 [pdf, other]

doi 10.1109/LRA.2024.3352361

Distributed Simultaneous Localisation and Auto-Calibration using Gaussian Belief Propagation

Authors: Riku Murai, Ignacio Alzugaray, Paul H. J. Kelly, Andrew J. Davison

Abstract: We present a novel scalable, fully distributed, and online method for simultaneous localisation and extrinsic calibration for multi-robot setups. Individual a priori unknown robot poses are probabilistically inferred as robots sense each other while simultaneously calibrating their sensors and markers extrinsic using Gaussian Belief Propagation. In the presented experiments, we show how our method… ▽ More We present a novel scalable, fully distributed, and online method for simultaneous localisation and extrinsic calibration for multi-robot setups. Individual a priori unknown robot poses are probabilistically inferred as robots sense each other while simultaneously calibrating their sensors and markers extrinsic using Gaussian Belief Propagation. In the presented experiments, we show how our method not only yields accurate robot localisation and auto-calibration but also is able to perform under challenging circumstances such as highly noisy measurements, significant communication failures or limited communication range. △ Less

Submitted 26 January, 2024; originally announced January 2024.

Comments: Published in IEEE Robotics and Automation Letters (RA-L) 2024

Journal ref: IEEE Robotics and Automation Letters, vol. 9, no. 3, pp. 2136-2143, March 2024

arXiv:2401.02357 [pdf, other]

Fit-NGP: Fitting Object Models to Neural Graphics Primitives

Authors: Marwan Taher, Ignacio Alzugaray, Andrew J. Davison

Abstract: Accurate 3D object pose estimation is key to enabling many robotic applications that involve challenging object interactions. In this work, we show that the density field created by a state-of-the-art efficient radiance field reconstruction method is suitable for highly accurate and robust pose estimation for objects with known 3D models, even when they are very small and with challenging reflecti… ▽ More Accurate 3D object pose estimation is key to enabling many robotic applications that involve challenging object interactions. In this work, we show that the density field created by a state-of-the-art efficient radiance field reconstruction method is suitable for highly accurate and robust pose estimation for objects with known 3D models, even when they are very small and with challenging reflective surfaces. We present a fully automatic object pose estimation system based on a robot arm with a single wrist-mounted camera, which can scan a scene from scratch, detect and estimate the 6-Degrees of Freedom (DoF) poses of multiple objects within a couple of minutes of operation. Small objects such as bolts and nuts are estimated with accuracy on order of 1mm. △ Less

Submitted 4 January, 2024; originally announced January 2024.

arXiv:2312.06741 [pdf, other]

Gaussian Splatting SLAM

Authors: Hidenobu Matsuki, Riku Murai, Paul H. J. Kelly, Andrew J. Davison

Abstract: We present the first application of 3D Gaussian Splatting in monocular SLAM, the most fundamental but the hardest setup for Visual SLAM. Our method, which runs live at 3fps, utilises Gaussians as the only 3D representation, unifying the required representation for accurate, efficient tracking, map**, and high-quality rendering. Designed for challenging monocular settings, our approach is seamles… ▽ More We present the first application of 3D Gaussian Splatting in monocular SLAM, the most fundamental but the hardest setup for Visual SLAM. Our method, which runs live at 3fps, utilises Gaussians as the only 3D representation, unifying the required representation for accurate, efficient tracking, map**, and high-quality rendering. Designed for challenging monocular settings, our approach is seamlessly extendable to RGB-D SLAM when an external depth sensor is available. Several innovations are required to continuously reconstruct 3D scenes with high fidelity from a live camera. First, to move beyond the original 3DGS algorithm, which requires accurate poses from an offline Structure from Motion (SfM) system, we formulate camera tracking for 3DGS using direct optimisation against the 3D Gaussians, and show that this enables fast and robust tracking with a wide basin of convergence. Second, by utilising the explicit nature of the Gaussians, we introduce geometric verification and regularisation to handle the ambiguities occurring in incremental 3D dense reconstruction. Finally, we introduce a full SLAM system which not only achieves state-of-the-art results in novel view synthesis and trajectory estimation but also reconstruction of tiny and even transparent objects. △ Less

Submitted 14 April, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

Comments: CVPR2024 Highlight. First two authors contributed equally to this work. Project Page: https://rmurai.co.uk/projects/GaussianSplattingSLAM/

arXiv:2312.05889 [pdf, other]

SuperPrimitive: Scene Reconstruction at a Primitive Level

Authors: Kirill Mazur, Gwangbin Bae, Andrew J. Davison

Abstract: Joint camera pose and dense geometry estimation from a set of images or a monocular video remains a challenging problem due to its computational complexity and inherent visual ambiguities. Most dense incremental reconstruction systems operate directly on image pixels and solve for their 3D positions using multi-view geometry cues. Such pixel-level approaches suffer from ambiguities or violations o… ▽ More Joint camera pose and dense geometry estimation from a set of images or a monocular video remains a challenging problem due to its computational complexity and inherent visual ambiguities. Most dense incremental reconstruction systems operate directly on image pixels and solve for their 3D positions using multi-view geometry cues. Such pixel-level approaches suffer from ambiguities or violations of multi-view consistency (e.g. caused by textureless or specular surfaces). We address this issue with a new image representation which we call a SuperPrimitive. SuperPrimitives are obtained by splitting images into semantically correlated local regions and enhancing them with estimated surface normal directions, both of which are predicted by state-of-the-art single image neural networks. This provides a local geometry estimate per SuperPrimitive, while their relative positions are adjusted based on multi-view observations. We demonstrate the versatility of our new representation by addressing three 3D reconstruction tasks: depth completion, few-view structure from motion, and monocular dense visual odometry. △ Less

Submitted 17 April, 2024; v1 submitted 10 December, 2023; originally announced December 2023.

Comments: CVPR2024. Project Page: https://makezur.github.io/SuperPrimitive/

arXiv:2311.14649 [pdf, other]

Learning in Deep Factor Graphs with Gaussian Belief Propagation

Authors: Seth Nabarro, Mark van der Wilk, Andrew J Davison

Abstract: We propose an approach to do learning in Gaussian factor graphs. We treat all relevant quantities (inputs, outputs, parameters, latents) as random variables in a graphical model, and view both training and prediction as inference problems with different observed nodes. Our experiments show that these problems can be efficiently solved with belief propagation (BP), whose updates are inherently loca… ▽ More We propose an approach to do learning in Gaussian factor graphs. We treat all relevant quantities (inputs, outputs, parameters, latents) as random variables in a graphical model, and view both training and prediction as inference problems with different observed nodes. Our experiments show that these problems can be efficiently solved with belief propagation (BP), whose updates are inherently local, presenting exciting opportunities for distributed and asynchronous training. Our approach can be scaled to deep networks and provides a natural means to do continual learning: use the BP-estimated parameter marginals of the current task as parameter priors for the next. On a video denoising task we demonstrate the benefit of learnable parameters over a classical factor graph approach and we show encouraging performance of deep factor graphs for continual image classification. △ Less

Submitted 28 February, 2024; v1 submitted 24 November, 2023; originally announced November 2023.

arXiv:2310.17712 [pdf, other]

Community Detection and Classification Guarantees Using Embeddings Learned by Node2Vec

Authors: Andrew Davison, S. Carlyle Morgan, Owen G. Ward

Abstract: Embedding the nodes of a large network into an Euclidean space is a common objective in modern machine learning, with a variety of tools available. These embeddings can then be used as features for tasks such as community detection/node clustering or link prediction, where they achieve state of the art performance. With the exception of spectral clustering methods, there is little theoretical unde… ▽ More Embedding the nodes of a large network into an Euclidean space is a common objective in modern machine learning, with a variety of tools available. These embeddings can then be used as features for tasks such as community detection/node clustering or link prediction, where they achieve state of the art performance. With the exception of spectral clustering methods, there is little theoretical understanding for other commonly used approaches to learning embeddings. In this work we examine the theoretical properties of the embeddings learned by node2vec. Our main result shows that the use of k-means clustering on the embedding vectors produced by node2vec gives weakly consistent community recovery for the nodes in (degree corrected) stochastic block models. We also discuss the use of these embeddings for node and link prediction tasks. We demonstrate this result empirically, and examine how this relates to other embedding tools for network data. △ Less

Submitted 26 October, 2023; originally announced October 2023.

arXiv:2310.01930 [pdf, other]

A Distributed Multi-Robot Framework for Exploration, Information Acquisition and Consensus

Authors: Aalok Patwardhan, Andrew J. Davison

Abstract: The distributed coordination of robot teams performing complex tasks is challenging to formulate. The different aspects of a complete task such as local planning for obstacle avoidance, global goal coordination and collaborative map** are often solved separately, when clearly each of these should influence the others for the most efficient behaviour. In this paper we use the example application… ▽ More The distributed coordination of robot teams performing complex tasks is challenging to formulate. The different aspects of a complete task such as local planning for obstacle avoidance, global goal coordination and collaborative map** are often solved separately, when clearly each of these should influence the others for the most efficient behaviour. In this paper we use the example application of distributed information acquisition as a robot team explores a large space to show that we can formulate the whole problem as a single factor graph with multiple connected layers representing each aspect. We use Gaussian Belief Propagation (GBP) as the inference mechanism, which permits parallel, on-demand or asynchronous computation for efficiency when different aspects are more or less important. This is the first time that a distributed GBP multi-robot solver has been proven to enable intelligent collaborative behaviour rather than just guiding robots to individual, selfish goals. We encourage the reader to view our demos at https://aalpatya.github.io/gbpstack △ Less

Submitted 3 October, 2023; originally announced October 2023.

Comments: We encourage the reader to view our demos at https://aalpatya.github.io/gbpstack

arXiv:2303.12157 [pdf, other]

Learning a Depth Covariance Function

Authors: Eric Dexheimer, Andrew J. Davison

Abstract: We propose learning a depth covariance function with applications to geometric vision tasks. Given RGB images as input, the covariance function can be flexibly used to define priors over depth functions, predictive distributions given observations, and methods for active point selection. We leverage these techniques for a selection of downstream tasks: depth completion, bundle adjustment, and mono… ▽ More We propose learning a depth covariance function with applications to geometric vision tasks. Given RGB images as input, the covariance function can be flexibly used to define priors over depth functions, predictive distributions given observations, and methods for active point selection. We leverage these techniques for a selection of downstream tasks: depth completion, bundle adjustment, and monocular dense visual odometry. △ Less

Submitted 21 March, 2024; v1 submitted 21 March, 2023; originally announced March 2023.

Comments: CVPR 2023. Project page: https://edexheim.github.io/DepthCov/

arXiv:2302.01838 [pdf, other]

vMAP: Vectorised Object Map** for Neural Field SLAM

Authors: Xin Kong, Shikun Liu, Marwan Taher, Andrew J. Davison

Abstract: We present vMAP, an object-level dense SLAM system using neural field representations. Each object is represented by a small MLP, enabling efficient, watertight object modelling without the need for 3D priors. As an RGB-D camera browses a scene with no prior information, vMAP detects object instances on-the-fly, and dynamically adds them to its map. Specifically, thanks to the power of vectorised… ▽ More We present vMAP, an object-level dense SLAM system using neural field representations. Each object is represented by a small MLP, enabling efficient, watertight object modelling without the need for 3D priors. As an RGB-D camera browses a scene with no prior information, vMAP detects object instances on-the-fly, and dynamically adds them to its map. Specifically, thanks to the power of vectorised training, vMAP can optimise as many as 50 individual objects in a single scene, with an extremely efficient training speed of 5Hz map update. We experimentally demonstrate significantly improved scene-level and object-level reconstruction quality compared to prior neural field SLAM systems. Project page: https://kxhit.github.io/vMAP. △ Less

Submitted 13 March, 2023; v1 submitted 3 February, 2023; originally announced February 2023.

Comments: CVPR2023 Project Page:https://kxhit.github.io/vMAP

arXiv:2210.17325 [pdf, other]

Real-time Map** of Physical Scene Properties with an Autonomous Robot Experimenter

Authors: Iain Haughton, Edgar Sucar, Andre Mouton, Edward Johns, Andrew J. Davison

Abstract: Neural fields can be trained from scratch to represent the shape and appearance of 3D scenes efficiently. It has also been shown that they can densely map correlated properties such as semantics, via sparse interactions from a human labeller. In this work, we show that a robot can densely annotate a scene with arbitrary discrete or continuous physical properties via its own fully-autonomous experi… ▽ More Neural fields can be trained from scratch to represent the shape and appearance of 3D scenes efficiently. It has also been shown that they can densely map correlated properties such as semantics, via sparse interactions from a human labeller. In this work, we show that a robot can densely annotate a scene with arbitrary discrete or continuous physical properties via its own fully-autonomous experimental interactions, as it simultaneously scans and maps it with an RGB-D camera. A variety of scene interactions are possible, including poking with force sensing to determine rigidity, measuring local material type with single-pixel spectroscopy or predicting force distributions by pushing. Sparse experimental interactions are guided by entropy to enable high efficiency, with tabletop scene properties densely mapped from scratch in a few minutes from a few tens of interactions. △ Less

Submitted 31 October, 2022; originally announced October 2022.

arXiv:2210.03043 [pdf, other]

Feature-Realistic Neural Fusion for Real-Time, Open Set Scene Understanding

Authors: Kirill Mazur, Edgar Sucar, Andrew J. Davison

Abstract: General scene understanding for robotics requires flexible semantic representation, so that novel objects and structures which may not have been known at training time can be identified, segmented and grouped. We present an algorithm which fuses general learned features from a standard pre-trained network into a highly efficient 3D geometric neural field representation during real-time SLAM. The f… ▽ More General scene understanding for robotics requires flexible semantic representation, so that novel objects and structures which may not have been known at training time can be identified, segmented and grouped. We present an algorithm which fuses general learned features from a standard pre-trained network into a highly efficient 3D geometric neural field representation during real-time SLAM. The fused 3D feature maps inherit the coherence of the neural field's geometry representation. This means that tiny amounts of human labelling interacting at runtime enable objects or even parts of objects to be robustly and accurately segmented in an open set manner. △ Less

Submitted 6 October, 2022; originally announced October 2022.

Comments: For our project page, see https://makezur.github.io/FeatureRealisticFusion/

arXiv:2208.05067 [pdf, other]

Learning to Complete Object Shapes for Object-level Map** in Dynamic Scenes

Authors: Binbin Xu, Andrew J. Davison, Stefan Leutenegger

Abstract: In this paper, we propose a novel object-level map** system that can simultaneously segment, track, and reconstruct objects in dynamic scenes. It can further predict and complete their full geometries by conditioning on reconstructions from depth inputs and a category-level shape prior with the aim that completed object geometry leads to better object reconstruction and tracking accuracy. For ea… ▽ More In this paper, we propose a novel object-level map** system that can simultaneously segment, track, and reconstruct objects in dynamic scenes. It can further predict and complete their full geometries by conditioning on reconstructions from depth inputs and a category-level shape prior with the aim that completed object geometry leads to better object reconstruction and tracking accuracy. For each incoming RGB-D frame, we perform instance segmentation to detect objects and build data associations between the detection and the existing object maps. A new object map will be created for each unmatched detection. For each matched object, we jointly optimise its pose and latent geometry representations using geometric residual and differential rendering residual towards its shape prior and completed geometry. Our approach shows better tracking and reconstruction performance compared to methods using traditional volumetric map** or learned shape prior approaches. We evaluate its effectiveness by quantitatively and qualitatively testing it in both synthetic and real-world sequences. △ Less

Submitted 9 August, 2022; originally announced August 2022.

Comments: International Conference on Intelligent Robots and Systems (IROS) 2022

arXiv:2203.11618 [pdf, other]

doi 10.1109/LRA.2022.3227858

Distributing Collaborative Multi-Robot Planning with Gaussian Belief Propagation

Authors: Aalok Patwardhan, Riku Murai, Andrew J. Davison

Abstract: Precise coordinated planning over a forward time window enables safe and highly efficient motion when many robots must work together in tight spaces, but this would normally require centralised control of all devices which is difficult to scale. We demonstrate GBP Planning, a new purely distributed technique based on Gaussian Belief Propagation for multi-robot planning problems, formulated by a ge… ▽ More Precise coordinated planning over a forward time window enables safe and highly efficient motion when many robots must work together in tight spaces, but this would normally require centralised control of all devices which is difficult to scale. We demonstrate GBP Planning, a new purely distributed technique based on Gaussian Belief Propagation for multi-robot planning problems, formulated by a generic factor graph defining dynamics and collision constraints over a forward time window. In simulations, we show that our method allows high performance collaborative planning where robots are able to cross each other in busy, intricate scenarios. They maintain shorter, quicker and smoother trajectories than alternative distributed planning techniques even in cases of communication failure. We encourage the reader to view the accompanying video demonstration at https://youtu.be/8VSrEUjH610. △ Less

Submitted 26 January, 2023; v1 submitted 22 March, 2022; originally announced March 2022.

Journal ref: IEEE Robotics and Automation Letters, vol. 8, no. 2, pp. 552-559, Feb. 2023

arXiv:2203.08040 [pdf, other]

Simultaneous Localisation and Map** with Quadric Surfaces

Authors: Tristan Laidlow, Andrew J. Davison

Abstract: There are many possibilities for how to represent the map in simultaneous localisation and map** (SLAM). While sparse, keypoint-based SLAM systems have achieved impressive levels of accuracy and robustness, their maps may not be suitable for many robotic tasks. Dense SLAM systems are capable of producing dense reconstructions, but can be computationally expensive and, like sparse systems, lack h… ▽ More There are many possibilities for how to represent the map in simultaneous localisation and map** (SLAM). While sparse, keypoint-based SLAM systems have achieved impressive levels of accuracy and robustness, their maps may not be suitable for many robotic tasks. Dense SLAM systems are capable of producing dense reconstructions, but can be computationally expensive and, like sparse systems, lack higher-level information about the structure of a scene. Human-made environments contain a lot of structure, and we seek to take advantage of this by enabling the use of quadric surfaces as features in SLAM systems. We introduce a minimal representation for quadric surfaces and show how this can be included in a least-squares formulation. We also show how our representation can be easily extended to include additional constraints on quadrics such as those found in quadrics of revolution. Finally, we introduce a proof-of-concept SLAM system using our representation, and provide some experimental results using an RGB-D dataset. △ Less

Submitted 15 March, 2022; originally announced March 2022.

Comments: 7 pages, 4 figures

arXiv:2202.11092 [pdf, other]

ReorientBot: Learning Object Reorientation for Specific-Posed Placement

Authors: Kentaro Wada, Stephen James, Andrew J. Davison

Abstract: Robots need the capability of placing objects in arbitrary, specific poses to rearrange the world and achieve various valuable tasks. Object reorientation plays a crucial role in this as objects may not initially be oriented such that the robot can grasp and then immediately place them in a specific goal pose. In this work, we present a vision-based manipulation system, ReorientBot, which consists… ▽ More Robots need the capability of placing objects in arbitrary, specific poses to rearrange the world and achieve various valuable tasks. Object reorientation plays a crucial role in this as objects may not initially be oriented such that the robot can grasp and then immediately place them in a specific goal pose. In this work, we present a vision-based manipulation system, ReorientBot, which consists of 1) visual scene understanding with pose estimation and volumetric reconstruction using an onboard RGB-D camera; 2) learned waypoint selection for successful and efficient motion generation for reorientation; 3) traditional motion planning to generate a collision-free trajectory from the selected waypoints. We evaluate our method using the YCB objects in both simulation and the real world, achieving 93% overall success, 81% improvement in success rate, and 22% improvement in execution time compared to a heuristic approach. We demonstrate extended multi-object rearrangement showing the general capability of the system. △ Less

Submitted 22 February, 2022; originally announced February 2022.

Comments: 7 pages, 6 figures, IEEE International Conference on Robotics and Automation (ICRA) 2022

arXiv:2202.05832 [pdf, other]

SafePicking: Learning Safe Object Extraction via Object-Level Map**

Authors: Kentaro Wada, Stephen James, Andrew J. Davison

Abstract: Robots need object-level scene understanding to manipulate objects while reasoning about contact, support, and occlusion among objects. Given a pile of objects, object recognition and reconstruction can identify the boundary of object instances, giving important cues as to how the objects form and support the pile. In this work, we present a system, SafePicking, that integrates object-level mappin… ▽ More Robots need object-level scene understanding to manipulate objects while reasoning about contact, support, and occlusion among objects. Given a pile of objects, object recognition and reconstruction can identify the boundary of object instances, giving important cues as to how the objects form and support the pile. In this work, we present a system, SafePicking, that integrates object-level map** and learning-based motion planning to generate a motion that safely extracts occluded target objects from a pile. Planning is done by learning a deep Q-network that receives observations of predicted poses and a depth-based heightmap to output a motion trajectory, trained to maximize a safety metric reward. Our results show that the observation fusion of poses and depth-sensing gives both better performance and robustness to the model. We evaluate our methods using the YCB objects in both simulation and the real world, achieving safe object extraction from piles. △ Less

Submitted 1 March, 2022; v1 submitted 11 February, 2022; originally announced February 2022.

Comments: 7 pages, 6 figures, IEEE International Conference on Robotics and Automation (ICRA) 2022

arXiv:2202.03314 [pdf, other]

doi 10.1109/TRO.2023.3324127

A Robot Web for Distributed Many-Device Localisation

Authors: Riku Murai, Joseph Ortiz, Sajad Saeedi, Paul H. J. Kelly, Andrew J. Davison

Abstract: We show that a distributed network of robots or other devices which make measurements of each other can collaborate to globally localise via efficient ad-hoc peer to peer communication. Our Robot Web solution is based on Gaussian Belief Propagation on the fundamental non-linear factor graph describing the probabilistic structure of all of the observations robots make internally or of each other, a… ▽ More We show that a distributed network of robots or other devices which make measurements of each other can collaborate to globally localise via efficient ad-hoc peer to peer communication. Our Robot Web solution is based on Gaussian Belief Propagation on the fundamental non-linear factor graph describing the probabilistic structure of all of the observations robots make internally or of each other, and is flexible for any type of robot, motion or sensor. We define a simple and efficient communication protocol which can be implemented by the publishing and reading of web pages or other asynchronous communication technologies. We show in simulations with up to 1000 robots interacting in arbitrary patterns that our solution convergently achieves global accuracy as accurate as a centralised non-linear factor graph solver while operating with high distributed efficiency of computation and communication. Via the use of robust factors in GBP, our method is tolerant to a high percentage of faults in sensor measurements or dropped communication packets. △ Less

Submitted 26 January, 2024; v1 submitted 7 February, 2022; originally announced February 2022.

Comments: Published in IEEE Transactions on Robotics (TRO) 2023

Journal ref: IEEE Transactions on Robotics, vol. 40, pp. 121-138, 2024

arXiv:2202.03091 [pdf, other]

Auto-Lambda: Disentangling Dynamic Task Relationships

Authors: Shikun Liu, Stephen James, Andrew J. Davison, Edward Johns

Abstract: Understanding the structure of multiple related tasks allows for multi-task learning to improve the generalisation ability of one or all of them. However, it usually requires training each pairwise combination of tasks together in order to capture task relationships, at an extremely high computational cost. In this work, we learn task relationships via an automated weighting framework, named Auto-… ▽ More Understanding the structure of multiple related tasks allows for multi-task learning to improve the generalisation ability of one or all of them. However, it usually requires training each pairwise combination of tasks together in order to capture task relationships, at an extremely high computational cost. In this work, we learn task relationships via an automated weighting framework, named Auto-Lambda. Unlike previous methods where task relationships are assumed to be fixed, Auto-Lambda is a gradient-based meta learning framework which explores continuous, dynamic task relationships via task-specific weightings, and can optimise any choice of combination of tasks through the formulation of a meta-loss; where the validation loss automatically influences task weightings throughout training. We apply the proposed framework to both multi-task and auxiliary learning problems in computer vision and robotics, and show that Auto-Lambda achieves state-of-the-art performance, even when compared to optimisation strategies designed specifically for each problem and data domain. Finally, we observe that Auto-Lambda can discover interesting learning behaviors, leading to new insights in multi-task learning. Code is available at https://github.com/lorenmt/auto-lambda. △ Less

Submitted 2 June, 2022; v1 submitted 7 February, 2022; originally announced February 2022.

Comments: Published at TMLR 2022. Project Page: https://shikun.io/projects/auto-lambda Code: https://github.com/lorenmt/auto-lambda

arXiv:2201.01689 [pdf, other]

Asymptotics of $\ell_2$ Regularized Network Embeddings

Authors: Andrew Davison

Abstract: A common approach to solving prediction tasks on large networks, such as node classification or link prediction, begin by learning a Euclidean embedding of the nodes of the network, from which traditional machine learning methods can then be applied. This includes methods such as DeepWalk and node2vec, which learn embeddings by optimizing stochastic losses formed over subsamples of the graph at ea… ▽ More A common approach to solving prediction tasks on large networks, such as node classification or link prediction, begin by learning a Euclidean embedding of the nodes of the network, from which traditional machine learning methods can then be applied. This includes methods such as DeepWalk and node2vec, which learn embeddings by optimizing stochastic losses formed over subsamples of the graph at each iteration of stochastic gradient descent. In this paper, we study the effects of adding an $\ell_2$ penalty of the embedding vectors to the training loss of these types of methods. We prove that, under some exchangeability assumptions on the graph, this asymptotically leads to learning a graphon with a nuclear-norm-type penalty, and give guarantees for the asymptotic distribution of the learned embedding vectors. In particular, the exact form of the penalty depends on the choice of subsampling method used as part of stochastic gradient descent. We also illustrate empirically that concatenating node covariates to $\ell_2$ regularized node2vec embeddings leads to comparable, when not superior, performance to methods which incorporate node covariates and the network structure in a non-linear manner. △ Less

Submitted 18 December, 2022; v1 submitted 5 January, 2022; originally announced January 2022.

Comments: Accepted in Neural Information Processing Systems 2022. 44 pages, 2 figures, 2 tables

arXiv:2111.14637 [pdf, other]

ILabel: Interactive Neural Scene Labelling

Authors: Shuaifeng Zhi, Edgar Sucar, Andre Mouton, Iain Haughton, Tristan Laidlow, Andrew J. Davison

Abstract: Joint representation of geometry, colour and semantics using a 3D neural field enables accurate dense labelling from ultra-sparse interactions as a user reconstructs a scene in real-time using a handheld RGB-D sensor. Our iLabel system requires no training data, yet can densely label scenes more accurately than standard methods trained on large, expensively labelled image datasets. Furthermore, it… ▽ More Joint representation of geometry, colour and semantics using a 3D neural field enables accurate dense labelling from ultra-sparse interactions as a user reconstructs a scene in real-time using a handheld RGB-D sensor. Our iLabel system requires no training data, yet can densely label scenes more accurately than standard methods trained on large, expensively labelled image datasets. Furthermore, it works in an 'open set' manner, with semantic classes defined on the fly by the user. ILabel's underlying model is a multilayer perceptron (MLP) trained from scratch in real-time to learn a joint neural scene representation. The scene model is updated and visualised in real-time, allowing the user to focus interactions to achieve efficient labelling. A room or similar scene can be accurately labelled into 10+ semantic categories with only a few tens of clicks. Quantitative labelling accuracy scales powerfully with the number of clicks, and rapidly surpasses standard pre-trained semantic segmentation methods. We also demonstrate a hierarchical labelling variant. △ Less

Submitted 3 December, 2021; v1 submitted 29 November, 2021; originally announced November 2021.

Comments: Project page: https://edgarsucar.github.io/ilabel/ Video: https://youtu.be/bL7RZaMhRbk

arXiv:2109.06241 [pdf, other]

Incremental Abstraction in Distributed Probabilistic SLAM Graphs

Authors: Joseph Ortiz, Talfan Evans, Edgar Sucar, Andrew J. Davison

Abstract: Scene graphs represent the key components of a scene in a compact and semantically rich way, but are difficult to build during incremental SLAM operation because of the challenges of robustly identifying abstract scene elements and optimising continually changing, complex graphs. We present a distributed, graph-based SLAM framework for incrementally building scene graphs based on two novel compone… ▽ More Scene graphs represent the key components of a scene in a compact and semantically rich way, but are difficult to build during incremental SLAM operation because of the challenges of robustly identifying abstract scene elements and optimising continually changing, complex graphs. We present a distributed, graph-based SLAM framework for incrementally building scene graphs based on two novel components. First, we propose an incremental abstraction framework in which a neural network proposes abstract scene elements that are incorporated into the factor graph of a feature-based monocular SLAM system. Scene elements are confirmed or rejected through optimisation and incrementally replace the points yielding a more dense, semantic and compact representation. Second, enabled by our novel routing procedure, we use Gaussian Belief Propagation (GBP) for distributed inference on a graph processor. The time per iteration of GBP is structure-agnostic and we demonstrate the speed advantages over direct methods for inference of heterogeneous factor graphs. We run our system on real indoor datasets using planar abstractions and recover the major planes with significant compression. △ Less

Submitted 4 April, 2022; v1 submitted 13 September, 2021; originally announced September 2021.

Comments: Published at ICRA 2022. Project page: https://joeaortiz.github.io/incremental_abstraction/

arXiv:2107.08994 [pdf, other]

CodeMap**: Real-Time Dense Map** for Sparse SLAM using Compact Scene Representations

Authors: Hidenobu Matsuki, Raluca Scona, Jan Czarnowski, Andrew J. Davison

Abstract: We propose a novel dense map** framework for sparse visual SLAM systems which leverages a compact scene representation. State-of-the-art sparse visual SLAM systems provide accurate and reliable estimates of the camera trajectory and locations of landmarks. While these sparse maps are useful for localization, they cannot be used for other tasks such as obstacle avoidance or scene understanding. I… ▽ More We propose a novel dense map** framework for sparse visual SLAM systems which leverages a compact scene representation. State-of-the-art sparse visual SLAM systems provide accurate and reliable estimates of the camera trajectory and locations of landmarks. While these sparse maps are useful for localization, they cannot be used for other tasks such as obstacle avoidance or scene understanding. In this paper we propose a dense map** framework to complement sparse visual SLAM systems which takes as input the camera poses, keyframes and sparse points produced by the SLAM system and predicts a dense depth image for every keyframe. We build on CodeSLAM and use a variational autoencoder (VAE) which is conditioned on intensity, sparse depth and reprojection error images from sparse SLAM to predict an uncertainty-aware dense depth map. The use of a VAE then enables us to refine the dense depth images through multi-view optimization which improves the consistency of overlap** frames. Our mapper runs in a separate thread in parallel to the SLAM system in a loosely coupled manner. This flexible design allows for integration with arbitrary metric sparse SLAM systems without delaying the main SLAM process. Our dense mapper can be used not only for local map** but also globally consistent dense 3D reconstruction through TSDF fusion. We demonstrate our system running with ORB-SLAM3 and show accurate dense depth estimation which could enable applications such as robotics and augmented reality. △ Less

Submitted 19 July, 2021; originally announced July 2021.

Comments: Accepted to IEEE Robotics and Automation Letters (RA-L) 2021

arXiv:2107.02363 [pdf, other]

Asymptotics of Network Embeddings Learned via Subsampling

Authors: Andrew Davison, Morgane Austern

Abstract: Network data are ubiquitous in modern machine learning, with tasks of interest including node classification, node clustering and link prediction. A frequent approach begins by learning an Euclidean embedding of the network, to which algorithms developed for vector-valued data are applied. For large networks, embeddings are learned using stochastic gradient methods where the sub-sampling scheme ca… ▽ More Network data are ubiquitous in modern machine learning, with tasks of interest including node classification, node clustering and link prediction. A frequent approach begins by learning an Euclidean embedding of the network, to which algorithms developed for vector-valued data are applied. For large networks, embeddings are learned using stochastic gradient methods where the sub-sampling scheme can be freely chosen. Despite the strong empirical performance of such methods, they are not well understood theoretically. Our work encapsulates representation methods using a subsampling approach, such as node2vec, into a single unifying framework. We prove, under the assumption that the graph is exchangeable, that the distribution of the learned embedding vectors asymptotically decouples. Moreover, we characterize the asymptotic distribution and provided rates of convergence, in terms of the latent parameters, which includes the choice of loss function and the embedding dimension. This provides a theoretical foundation to understand what the embedding vectors represent and how well these methods perform on downstream tasks. Notably, we observe that typically used loss functions may lead to shortcomings, such as a lack of Fisher consistency. △ Less

Submitted 17 May, 2023; v1 submitted 5 July, 2021; originally announced July 2021.

Comments: Accepted at Journal of Machine Learning Research (JMLR). 120 pages, 3 figures, 1 table

Journal ref: Journal of Machine Learning Research 24 (2023) 1-120. Published 5/23

arXiv:2107.02308 [pdf, other]

A visual introduction to Gaussian Belief Propagation

Authors: Joseph Ortiz, Talfan Evans, Andrew J. Davison

Abstract: In this article, we present a visual introduction to Gaussian Belief Propagation (GBP), an approximate probabilistic inference algorithm that operates by passing messages between the nodes of arbitrarily structured factor graphs. A special case of loopy belief propagation, GBP updates rely only on local information and will converge independently of the message schedule. Our key argument is that,… ▽ More In this article, we present a visual introduction to Gaussian Belief Propagation (GBP), an approximate probabilistic inference algorithm that operates by passing messages between the nodes of arbitrarily structured factor graphs. A special case of loopy belief propagation, GBP updates rely only on local information and will converge independently of the message schedule. Our key argument is that, given recent trends in computing hardware, GBP has the right computational properties to act as a scalable distributed probabilistic inference framework for future machine learning systems. △ Less

Submitted 5 July, 2021; originally announced July 2021.

Comments: See online version of this article: https://gaussianbp.github.io/

arXiv:2106.12534 [pdf, other]

Coarse-to-Fine Q-attention: Efficient Learning for Visual Robotic Manipulation via Discretisation

Authors: Stephen James, Kentaro Wada, Tristan Laidlow, Andrew J. Davison

Abstract: We present a coarse-to-fine discretisation method that enables the use of discrete reinforcement learning approaches in place of unstable and data-inefficient actor-critic methods in continuous robotics domains. This approach builds on the recently released ARM algorithm, which replaces the continuous next-best pose agent with a discrete one, with coarse-to-fine Q-attention. Given a voxelised scen… ▽ More We present a coarse-to-fine discretisation method that enables the use of discrete reinforcement learning approaches in place of unstable and data-inefficient actor-critic methods in continuous robotics domains. This approach builds on the recently released ARM algorithm, which replaces the continuous next-best pose agent with a discrete one, with coarse-to-fine Q-attention. Given a voxelised scene, coarse-to-fine Q-attention learns what part of the scene to 'zoom' into. When this 'zooming' behaviour is applied iteratively, it results in a near-lossless discretisation of the translation space, and allows the use of a discrete action, deep Q-learning method. We show that our new coarse-to-fine algorithm achieves state-of-the-art performance on several difficult sparsely rewarded RLBench vision-based robotics tasks, and can train real-world policies, tabula rasa, in a matter of minutes, with as little as 3 demonstrations. △ Less

Submitted 14 March, 2022; v1 submitted 23 June, 2021; originally announced June 2021.

Comments: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2022). Videos and code: https://sites.google.com/view/c2f-q-attention

arXiv:2105.14829 [pdf, other]

Q-attention: Enabling Efficient Learning for Vision-based Robotic Manipulation

Authors: Stephen James, Andrew J. Davison

Abstract: Despite the success of reinforcement learning methods, they have yet to have their breakthrough moment when applied to a broad range of robotic manipulation tasks. This is partly due to the fact that reinforcement learning algorithms are notoriously difficult and time consuming to train, which is exacerbated when training from images rather than full-state inputs. As humans perform manipulation ta… ▽ More Despite the success of reinforcement learning methods, they have yet to have their breakthrough moment when applied to a broad range of robotic manipulation tasks. This is partly due to the fact that reinforcement learning algorithms are notoriously difficult and time consuming to train, which is exacerbated when training from images rather than full-state inputs. As humans perform manipulation tasks, our eyes closely monitor every step of the process with our gaze focusing sequentially on the objects being manipulated. With this in mind, we present our Attention-driven Robotic Manipulation (ARM) algorithm, which is a general manipulation algorithm that can be applied to a range of sparse-rewarded tasks, given only a small number of demonstrations. ARM splits the complex task of manipulation into a 3 stage pipeline: (1) a Q-attention agent extracts relevant pixel locations from RGB and point cloud inputs, (2) a next-best pose agent that accepts crops from the Q-attention agent and outputs poses, and (3) a control agent that takes the goal pose and outputs joint actions. We show that current learning algorithms fail on a range of RLBench tasks, whilst ARM is successful. △ Less

Submitted 3 February, 2022; v1 submitted 31 May, 2021; originally announced May 2021.

Comments: IEEE Robotics and Automation Letters, 2022 (+ presentation at ICRA 2022). Videos and code found at: https://sites.google.com/view/q-attention

arXiv:2105.06340 [pdf, other]

3D-CNN for Facial Micro- and Macro-expression Spotting on Long Video Sequences using Temporal Oriented Reference Frame

Authors: Chuin Hong Yap, Moi Hoon Yap, Adrian K. Davison, Connah Kendrick, **gting Li, Su**g Wang, Ryan Cunningham

Abstract: Facial expression spotting is the preliminary step for micro- and macro-expression analysis. The task of reliably spotting such expressions in video sequences is currently unsolved. The current best systems depend upon optical flow methods to extract regional motion features, before categorisation of that motion into a specific class of facial movement. Optical flow is susceptible to drift error,… ▽ More Facial expression spotting is the preliminary step for micro- and macro-expression analysis. The task of reliably spotting such expressions in video sequences is currently unsolved. The current best systems depend upon optical flow methods to extract regional motion features, before categorisation of that motion into a specific class of facial movement. Optical flow is susceptible to drift error, which introduces a serious problem for motions with long-term dependencies, such as high frame-rate macro-expression. We propose a purely deep learning solution which, rather than tracking frame differential motion, compares via a convolutional model, each frame with two temporally local reference frames. Reference frames are sampled according to calculated micro- and macro-expression duration. As baseline for MEGC2021 using leave-one-subject-out evaluation method, we show that our solution achieves F1-score of 0.105 in a high frame-rate (200 fps) SAMM long videos dataset (SAMM-LV) and is competitive in a low frame-rate (30 fps) (CAS(ME)2) dataset. On unseen MEGC2022 challenge dataset, the baseline results are 0.1176 on SAMM Challenge dataset, 0.1739 on CAS(ME)3 and overall performance of 0.1531 on both dataset. △ Less

Submitted 26 May, 2022; v1 submitted 13 May, 2021; originally announced May 2021.

arXiv:2104.04465 [pdf, other]

Bootstrap** Semantic Segmentation with Regional Contrast

Authors: Shikun Liu, Shuaifeng Zhi, Edward Johns, Andrew J. Davison

Abstract: We present ReCo, a contrastive learning framework designed at a regional level to assist learning in semantic segmentation. ReCo performs semi-supervised or supervised pixel-level contrastive learning on a sparse set of hard negative pixels, with minimal additional memory footprint. ReCo is easy to implement, being built on top of off-the-shelf segmentation networks, and consistently improves perf… ▽ More We present ReCo, a contrastive learning framework designed at a regional level to assist learning in semantic segmentation. ReCo performs semi-supervised or supervised pixel-level contrastive learning on a sparse set of hard negative pixels, with minimal additional memory footprint. ReCo is easy to implement, being built on top of off-the-shelf segmentation networks, and consistently improves performance in both semi-supervised and supervised semantic segmentation methods, achieving smoother segmentation boundaries and faster convergence. The strongest effect is in semi-supervised learning with very few labels. With ReCo, we achieve high-quality semantic segmentation models, requiring only 5 examples of each semantic class. Code is available at https://github.com/lorenmt/reco. △ Less

Submitted 31 January, 2022; v1 submitted 9 April, 2021; originally announced April 2021.

Comments: Published at ICLR 2022. Project Page: https://shikun.io/projects/regional-contrast. Code: https://github.com/lorenmt/reco

arXiv:2103.16442 [pdf, other]

SIMstack: A Generative Shape and Instance Model for Unordered Object Stacks

Authors: Zoe Landgraf, Raluca Scona, Tristan Laidlow, Stephen James, Stefan Leutenegger, Andrew J. Davison

Abstract: By estimating 3D shape and instances from a single view, we can capture information about an environment quickly, without the need for comprehensive scanning and multi-view fusion. Solving this task for composite scenes (such as object stacks) is challenging: occluded areas are not only ambiguous in shape but also in instance segmentation; multiple decompositions could be valid. We observe that ph… ▽ More By estimating 3D shape and instances from a single view, we can capture information about an environment quickly, without the need for comprehensive scanning and multi-view fusion. Solving this task for composite scenes (such as object stacks) is challenging: occluded areas are not only ambiguous in shape but also in instance segmentation; multiple decompositions could be valid. We observe that physics constrains decomposition as well as shape in occluded regions and hypothesise that a latent space learned from scenes built under physics simulation can serve as a prior to better predict shape and instances in occluded regions. To this end we propose SIMstack, a depth-conditioned Variational Auto-Encoder (VAE), trained on a dataset of objects stacked under physics simulation. We formulate instance segmentation as a centre voting task which allows for class-agnostic detection and doesn't require setting the maximum number of objects in the scene. At test time, our model can generate 3D shape and instance segmentation from a single depth view, probabilistically sampling proposals for the occluded region from the learned latent space. Our method has practical applications in providing robots some of the ability humans have to make rapid intuitive inferences of partially observed scenes. We demonstrate an application for precise (non-disruptive) object gras** of unknown objects from a single depth view. △ Less

Submitted 26 September, 2021; v1 submitted 30 March, 2021; originally announced March 2021.

Journal ref: ICCV 2021

arXiv:2103.15875 [pdf, other]

In-Place Scene Labelling and Understanding with Implicit Scene Representation

Authors: Shuaifeng Zhi, Tristan Laidlow, Stefan Leutenegger, Andrew J. Davison

Abstract: Semantic labelling is highly correlated with geometry and radiance reconstruction, as scene entities with similar shape and appearance are more likely to come from similar classes. Recent implicit neural reconstruction techniques are appealing as they do not require prior training data, but the same fully self-supervised approach is not possible for semantics because labels are human-defined prope… ▽ More Semantic labelling is highly correlated with geometry and radiance reconstruction, as scene entities with similar shape and appearance are more likely to come from similar classes. Recent implicit neural reconstruction techniques are appealing as they do not require prior training data, but the same fully self-supervised approach is not possible for semantics because labels are human-defined properties. We extend neural radiance fields (NeRF) to jointly encode semantics with appearance and geometry, so that complete and accurate 2D semantic labels can be achieved using a small amount of in-place annotations specific to the scene. The intrinsic multi-view consistency and smoothness of NeRF benefit semantics by enabling sparse labels to efficiently propagate. We show the benefit of this approach when labels are either sparse or very noisy in room-scale scenes. We demonstrate its advantageous properties in various interesting applications such as an efficient scene labelling tool, novel semantic view synthesis, label denoising, super-resolution, label interpolation and multi-view semantic label fusion in visual semantic map** systems. △ Less

Submitted 21 August, 2021; v1 submitted 29 March, 2021; originally announced March 2021.

Comments: Camera ready version. To be published in Proceedings of IEEE International Conference on Computer Vision (ICCV 2021) as Oral Presentation. Project page with more videos: https://shuaifengzhi.com/Semantic-NeRF/

arXiv:2103.12352 [pdf, other]

iMAP: Implicit Map** and Positioning in Real-Time

Authors: Edgar Sucar, Shikun Liu, Joseph Ortiz, Andrew J. Davison

Abstract: We show for the first time that a multilayer perceptron (MLP) can serve as the only scene representation in a real-time SLAM system for a handheld RGB-D camera. Our network is trained in live operation without prior data, building a dense, scene-specific implicit 3D model of occupancy and colour which is also immediately used for tracking. Achieving real-time SLAM via continual training of a neu… ▽ More We show for the first time that a multilayer perceptron (MLP) can serve as the only scene representation in a real-time SLAM system for a handheld RGB-D camera. Our network is trained in live operation without prior data, building a dense, scene-specific implicit 3D model of occupancy and colour which is also immediately used for tracking. Achieving real-time SLAM via continual training of a neural network against a live image stream requires significant innovation. Our iMAP algorithm uses a keyframe structure and multi-processing computation flow, with dynamic information-guided pixel sampling for speed, with tracking at 10 Hz and global map updating at 2 Hz. The advantages of an implicit MLP over standard dense SLAM techniques include efficient geometry representation with automatic detail control and smooth, plausible filling-in of unobserved regions such as the back surfaces of objects. △ Less

Submitted 13 September, 2021; v1 submitted 23 March, 2021; originally announced March 2021.

Comments: Typos, make pdf smaller

arXiv:2102.07764 [pdf, other]

End-to-End Egospheric Spatial Memory

Authors: Daniel Lenton, Stephen James, Ronald Clark, Andrew J. Davison

Abstract: Spatial memory, or the ability to remember and recall specific locations and objects, is central to autonomous agents' ability to carry out tasks in real environments. However, most existing artificial memory modules are not very adept at storing spatial information. We propose a parameter-free module, Egospheric Spatial Memory (ESM), which encodes the memory in an ego-sphere around the agent, ena… ▽ More Spatial memory, or the ability to remember and recall specific locations and objects, is central to autonomous agents' ability to carry out tasks in real environments. However, most existing artificial memory modules are not very adept at storing spatial information. We propose a parameter-free module, Egospheric Spatial Memory (ESM), which encodes the memory in an ego-sphere around the agent, enabling expressive 3D representations. ESM can be trained end-to-end via either imitation or reinforcement learning, and improves both training efficiency and final performance against other memory baselines on both drone and manipulator visuomotor control tasks. The explicit egocentric geometry also enables us to seamlessly combine the learned controller with other non-learned modalities, such as local obstacle avoidance. We further show applications to semantic segmentation on the ScanNet dataset, where ESM naturally combines image-level and map-level inference modalities. Through our broad set of experiments, we show that ESM provides a general computation graph for embodied spatial reasoning, and the module forms a bridge between real-time map** systems and differentiable memory architectures. Implementation at: https://github.com/ivy-dl/memory. △ Less

Submitted 17 February, 2021; v1 submitted 15 February, 2021; originally announced February 2021.

Comments: Conference paper at ICLR 2021. Implementation: https://github.com/ivy-dl/memory Project page: https://djl11.github.io/ESM/

arXiv:2011.01975 [pdf, other]

Rearrangement: A Challenge for Embodied AI

Authors: Dhruv Batra, Angel X. Chang, Sonia Chernova, Andrew J. Davison, Jia Deng, Vladlen Koltun, Sergey Levine, Jitendra Malik, Igor Mordatch, Roozbeh Mottaghi, Manolis Savva, Hao Su

Abstract: We describe a framework for research and evaluation in Embodied AI. Our proposal is based on a canonical task: Rearrangement. A standard task can focus the development of new techniques and serve as a source of trained models that can be transferred to other settings. In the rearrangement task, the goal is to bring a given physical environment into a specified state. The goal state can be specifie… ▽ More We describe a framework for research and evaluation in Embodied AI. Our proposal is based on a canonical task: Rearrangement. A standard task can focus the development of new techniques and serve as a source of trained models that can be transferred to other settings. In the rearrangement task, the goal is to bring a given physical environment into a specified state. The goal state can be specified by object poses, by images, by a description in language, or by letting the agent experience the environment in the goal state. We characterize rearrangement scenarios along different axes and describe metrics for benchmarking rearrangement performance. To facilitate research and exploration, we present experimental testbeds of rearrangement scenarios in four different simulation environments. We anticipate that other datasets will be released and new simulation platforms will be built to support training of rearrangement agents and their deployment on physical systems. △ Less

Submitted 3 November, 2020; originally announced November 2020.

Comments: Authors are listed in alphabetical order

arXiv:2008.13504 [pdf, other]

Deep Probabilistic Feature-metric Tracking

Authors: Binbin Xu, Andrew J. Davison, Stefan Leutenegger

Abstract: Dense image alignment from RGB-D images remains a critical issue for real-world applications, especially under challenging lighting conditions and in a wide baseline setting. In this paper, we propose a new framework to learn a pixel-wise deep feature map and a deep feature-metric uncertainty map predicted by a Convolutional Neural Network (CNN), which together formulate a deep probabilistic featu… ▽ More Dense image alignment from RGB-D images remains a critical issue for real-world applications, especially under challenging lighting conditions and in a wide baseline setting. In this paper, we propose a new framework to learn a pixel-wise deep feature map and a deep feature-metric uncertainty map predicted by a Convolutional Neural Network (CNN), which together formulate a deep probabilistic feature-metric residual of the two-view constraint that can be minimised using Gauss-Newton in a coarse-to-fine optimisation framework. Furthermore, our network predicts a deep initial pose for faster and more reliable convergence. The optimisation steps are differentiable and unrolled to train in an end-to-end fashion. Due to its probabilistic essence, our approach can easily couple with other residuals, where we show a combination with ICP. Experimental results demonstrate state-of-the-art performances on the TUM RGB-D dataset and the 3D rigid object tracking dataset. We further demonstrate our method's robustness and convergence qualitatively. △ Less

Submitted 25 November, 2020; v1 submitted 31 August, 2020; originally announced August 2020.

Comments: RAL 2020. 8 pages, 9 figures, video link: https://youtu.be/6pMosl6ZAPE

arXiv:2007.05385 [pdf, ps, other]

doi 10.1002/sam.11486

Next Waves in Veridical Network Embedding

Authors: Owen G. Ward, Zhen Huang, Andrew Davison, Tian Zheng

Abstract: Embedding nodes of a large network into a metric (e.g., Euclidean) space has become an area of active research in statistical machine learning, which has found applications in natural and social sciences. Generally, a representation of a network object is learned in a Euclidean geometry and is then used for subsequent tasks regarding the nodes and/or edges of the network, such as community detecti… ▽ More Embedding nodes of a large network into a metric (e.g., Euclidean) space has become an area of active research in statistical machine learning, which has found applications in natural and social sciences. Generally, a representation of a network object is learned in a Euclidean geometry and is then used for subsequent tasks regarding the nodes and/or edges of the network, such as community detection, node classification and link prediction. Network embedding algorithms have been proposed in multiple disciplines, often with domain-specific notations and details. In addition, different measures and tools have been adopted to evaluate and compare the methods proposed under different settings, often dependent of the downstream tasks. As a result, it is challenging to study these algorithms in the literature systematically. Motivated by the recently proposed Veridical Data Science (VDS) framework, we propose a framework for network embedding algorithms and discuss how the principles of predictability, computability and stability apply in this context. The utilization of this framework in network embedding holds the potential to motivate and point to new directions for future research. △ Less

Submitted 12 August, 2021; v1 submitted 10 July, 2020; originally announced July 2020.

arXiv:2004.04485 [pdf, other]

NodeSLAM: Neural Object Descriptors for Multi-View Shape Reconstruction

Authors: Edgar Sucar, Kentaro Wada, Andrew Davison

Abstract: The choice of scene representation is crucial in both the shape inference algorithms it requires and the smart applications it enables. We present efficient and optimisable multi-class learned object descriptors together with a novel probabilistic and differential rendering engine, for principled full object shape inference from one or more RGB-D images. Our framework allows for accurate and robus… ▽ More The choice of scene representation is crucial in both the shape inference algorithms it requires and the smart applications it enables. We present efficient and optimisable multi-class learned object descriptors together with a novel probabilistic and differential rendering engine, for principled full object shape inference from one or more RGB-D images. Our framework allows for accurate and robust 3D object reconstruction which enables multiple applications including robot gras** and placing, augmented reality, and the first object-level SLAM system capable of optimising object poses and shapes jointly with camera trajectory. △ Less

Submitted 10 October, 2020; v1 submitted 9 April, 2020; originally announced April 2020.

Comments: to be published in 3DV

arXiv:2004.04336 [pdf, other]

MoreFusion: Multi-object Reasoning for 6D Pose Estimation from Volumetric Fusion

Authors: Kentaro Wada, Edgar Sucar, Stephen James, Daniel Lenton, Andrew J. Davison

Abstract: Robots and other smart devices need efficient object-based scene representations from their on-board vision systems to reason about contact, physics and occlusion. Recognized precise object models will play an important role alongside non-parametric reconstructions of unrecognized structures. We present a system which can estimate the accurate poses of multiple known objects in contact and occlusi… ▽ More Robots and other smart devices need efficient object-based scene representations from their on-board vision systems to reason about contact, physics and occlusion. Recognized precise object models will play an important role alongside non-parametric reconstructions of unrecognized structures. We present a system which can estimate the accurate poses of multiple known objects in contact and occlusion from real-time, embodied multi-view vision. Our approach makes 3D object pose proposals from single RGB-D views, accumulates pose estimates and non-parametric occupancy information from multiple views as the camera moves, and performs joint optimization to estimate consistent, non-intersecting poses for multiple objects in contact. We verify the accuracy and robustness of our approach experimentally on 2 object datasets: YCB-Video, and our own challenging Cluttered YCB-Video. We demonstrate a real-time robotics application where a robot arm precisely and orderly disassembles complicated piles of objects, using only on-board RGB-D vision. △ Less

Submitted 8 April, 2020; originally announced April 2020.

Comments: 10 pages, 10 figures, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020

arXiv:2003.03134 [pdf, other]

Bundle Adjustment on a Graph Processor

Authors: Joseph Ortiz, Mark Pupilli, Stefan Leutenegger, Andrew J. Davison

Abstract: Graph processors such as Graphcore's Intelligence Processing Unit (IPU) are part of the major new wave of novel computer architecture for AI, and have a general design with massively parallel computation, distributed on-chip memory and very high inter-core communication bandwidth which allows breakthrough performance for message passing algorithms on arbitrary graphs. We show for the first time th… ▽ More Graph processors such as Graphcore's Intelligence Processing Unit (IPU) are part of the major new wave of novel computer architecture for AI, and have a general design with massively parallel computation, distributed on-chip memory and very high inter-core communication bandwidth which allows breakthrough performance for message passing algorithms on arbitrary graphs. We show for the first time that the classical computer vision problem of bundle adjustment (BA) can be solved extremely fast on a graph processor using Gaussian Belief Propagation. Our simple but fully parallel implementation uses the 1216 cores on a single IPU chip to, for instance, solve a real BA problem with 125 keyframes and 1919 points in under 40ms, compared to 1450ms for the Ceres CPU library. Further code optimisation will surely increase this difference on static problems, but we argue that the real promise of graph processing is for flexible in-place optimisation of general, dynamically changing factor graphs representing Spatial AI problems. We give indications of this with experiments showing the ability of GBP to efficiently solve incremental SLAM problems, and deal with robust cost functions and different types of factors. △ Less

Submitted 30 March, 2020; v1 submitted 6 March, 2020; originally announced March 2020.

Comments: Published in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2020). Video: https://www.youtube.com/watch?v=TqeN8aQNgd0

arXiv:2002.10342 [pdf, other]

Comparing View-Based and Map-Based Semantic Labelling in Real-Time SLAM

Authors: Zoe Landgraf, Fabian Falck, Michael Bloesch, Stefan Leutenegger, Andrew Davison

Abstract: Generally capable Spatial AI systems must build persistent scene representations where geometric models are combined with meaningful semantic labels. The many approaches to labelling scenes can be divided into two clear groups: view-based which estimate labels from the input view-wise data and then incrementally fuse them into the scene model as it is built; and map-based which label the generated… ▽ More Generally capable Spatial AI systems must build persistent scene representations where geometric models are combined with meaningful semantic labels. The many approaches to labelling scenes can be divided into two clear groups: view-based which estimate labels from the input view-wise data and then incrementally fuse them into the scene model as it is built; and map-based which label the generated scene model. However, there has so far been no attempt to quantitatively compare view-based and map-based labelling. Here, we present an experimental framework and comparison which uses real-time height map fusion as an accessible platform for a fair comparison, opening up the route to further systematic research in this area. △ Less

Submitted 24 February, 2020; originally announced February 2020.

Comments: ICRA 2020

arXiv:2001.05049 [pdf, other]

doi 10.1109/LRA.2020.2965415

DeepFactors: Real-Time Probabilistic Dense Monocular SLAM

Authors: Jan Czarnowski, Tristan Laidlow, Ronald Clark, Andrew J. Davison

Abstract: The ability to estimate rich geometry and camera motion from monocular imagery is fundamental to future interactive robotics and augmented reality applications. Different approaches have been proposed that vary in scene geometry representation (sparse landmarks, dense maps), the consistency metric used for optimising the multi-view problem, and the use of learned priors. We present a SLAM system t… ▽ More The ability to estimate rich geometry and camera motion from monocular imagery is fundamental to future interactive robotics and augmented reality applications. Different approaches have been proposed that vary in scene geometry representation (sparse landmarks, dense maps), the consistency metric used for optimising the multi-view problem, and the use of learned priors. We present a SLAM system that unifies these methods in a probabilistic framework while still maintaining real-time performance. This is achieved through the use of a learned compact depth map representation and reformulating three different types of errors: photometric, reprojection and geometric, which we make use of within standard factor graph software. We evaluate our system on trajectory estimation and depth reconstruction on real-world sequences and present various examples of estimated dense geometry. △ Less

Submitted 14 January, 2020; originally announced January 2020.

Comments: RA-L

arXiv:1911.05116 [pdf, other]

An Unethical Optimization Principle

Authors: Nicholas Beale, Heather Battey, Anthony C. Davison, Robert S. MacKay

Abstract: If an artificial intelligence aims to maximise risk-adjusted return, then under mild conditions it is disproportionately likely to pick an unethical strategy unless the objective function allows sufficiently for this risk. Even if the proportion $η$ of available unethical strategies is small, the probability ${p_U}$ of picking an unethical strategy can become large; indeed unless returns are fat-t… ▽ More If an artificial intelligence aims to maximise risk-adjusted return, then under mild conditions it is disproportionately likely to pick an unethical strategy unless the objective function allows sufficiently for this risk. Even if the proportion $η$ of available unethical strategies is small, the probability ${p_U}$ of picking an unethical strategy can become large; indeed unless returns are fat-tailed ${p_U}$ tends to unity as the strategy space becomes large. We define an Unethical Odds Ratio Upsilon ($Υ$) that allows us to calculate ${p_U}$ from $η$, and we derive a simple formula for the limit of $Υ$ as the strategy space becomes large. We give an algorithm for estimating $Υ$ and ${p_U}$ in finite cases and discuss how to deal with infinite strategy spaces. We show how this principle can be used to help detect unethical strategies and to estimate $η$. Finally we sketch some policy implications of this work. △ Less

Submitted 12 November, 2019; originally announced November 2019.

arXiv:1911.01103 [pdf, other]

Learning One-Shot Imitation from Humans without Humans

Authors: Alessandro Bonardi, Stephen James, Andrew J. Davison

Abstract: Humans can naturally learn to execute a new task by seeing it performed by other individuals once, and then reproduce it in a variety of configurations. Endowing robots with this ability of imitating humans from third person is a very immediate and natural way of teaching new tasks. Only recently, through meta-learning, there have been successful attempts to one-shot imitation learning from humans… ▽ More Humans can naturally learn to execute a new task by seeing it performed by other individuals once, and then reproduce it in a variety of configurations. Endowing robots with this ability of imitating humans from third person is a very immediate and natural way of teaching new tasks. Only recently, through meta-learning, there have been successful attempts to one-shot imitation learning from humans; however, these approaches require a lot of human resources to collect the data in the real world to train the robot. But is there a way to remove the need for real world human demonstrations during training? We show that with Task-Embedded Control Networks, we can infer control polices by embedding human demonstrations that can condition a control policy and achieve one-shot imitation learning. Importantly, we do not use a real human arm to supply demonstrations during training, but instead leverage domain randomisation in an application that has not been seen before: sim-to-real transfer on humans. Upon evaluating our approach on pushing and placing tasks in both simulation and in the real world, we show that in comparison to a system that was trained on real-world data we are able to achieve similar results by utilising only simulation data. △ Less

Submitted 4 November, 2019; originally announced November 2019.

Comments: Videos can be found here: https://sites.google.com/view/tecnets-humans

arXiv:1910.14139 [pdf, other]

FutureMap** 2: Gaussian Belief Propagation for Spatial AI

Authors: Andrew J. Davison, Joseph Ortiz

Abstract: We argue the case for Gaussian Belief Propagation (GBP) as a strong algorithmic framework for the distributed, generic and incremental probabilistic estimation we need in Spatial AI as we aim at high performance smart robots and devices which operate within the constraints of real products. Processor hardware is changing rapidly, and GBP has the right character to take advantage of highly distribu… ▽ More We argue the case for Gaussian Belief Propagation (GBP) as a strong algorithmic framework for the distributed, generic and incremental probabilistic estimation we need in Spatial AI as we aim at high performance smart robots and devices which operate within the constraints of real products. Processor hardware is changing rapidly, and GBP has the right character to take advantage of highly distributed processing and storage while estimating global quantities, as well as great flexibility. We present a detailed tutorial on GBP, relating to the standard factor graph formulation used in robotics and computer vision, and give several simulation examples with code which demonstrate its properties. △ Less

Submitted 7 November, 2022; v1 submitted 30 October, 2019; originally announced October 2019.

arXiv:1909.12271 [pdf, other]

RLBench: The Robot Learning Benchmark & Learning Environment

Authors: Stephen James, Zicong Ma, David Rovick Arrojo, Andrew J. Davison

Abstract: We present a challenging new benchmark and learning-environment for robot learning: RLBench. The benchmark features 100 completely unique, hand-designed tasks ranging in difficulty, from simple target reaching and door opening, to longer multi-stage tasks, such as opening an oven and placing a tray in it. We provide an array of both proprioceptive observations and visual observations, which includ… ▽ More We present a challenging new benchmark and learning-environment for robot learning: RLBench. The benchmark features 100 completely unique, hand-designed tasks ranging in difficulty, from simple target reaching and door opening, to longer multi-stage tasks, such as opening an oven and placing a tray in it. We provide an array of both proprioceptive observations and visual observations, which include rgb, depth, and segmentation masks from an over-the-shoulder stereo camera and an eye-in-hand monocular camera. Uniquely, each task comes with an infinite supply of demos through the use of motion planners operating on a series of waypoints given during task creation time; enabling an exciting flurry of demonstration-based learning. RLBench has been designed with scalability in mind; new tasks, along with their motion-planned demos, can be easily created and then verified by a series of tools, allowing users to submit their own tasks to the RLBench task repository. This large-scale benchmark aims to accelerate progress in a number of vision-guided manipulation research areas, including: reinforcement learning, imitation learning, multi-task learning, geometric computer vision, and in particular, few-shot learning. With the benchmark's breadth of tasks and demonstrations, we propose the first large-scale few-shot challenge in robotics. We hope that the scale and diversity of RLBench offers unparalleled research opportunities in the robot learning community and beyond. △ Less

Submitted 26 September, 2019; originally announced September 2019.

Comments: Videos and code: https://sites.google.com/view/rlbench

arXiv:1906.11176 [pdf, other]

PyRep: Bringing V-REP to Deep Robot Learning

Authors: Stephen James, Marc Freese, Andrew J. Davison

Abstract: PyRep is a toolkit for robot learning research, built on top of the virtual robotics experimentation platform (V-REP). Through a series of modifications and additions, we have created a tailored version of V-REP built with robot learning in mind. The new PyRep toolkit offers three improvements: (1) a simple and flexible API for robot control and scene manipulation, (2) a new rendering engine, and… ▽ More PyRep is a toolkit for robot learning research, built on top of the virtual robotics experimentation platform (V-REP). Through a series of modifications and additions, we have created a tailored version of V-REP built with robot learning in mind. The new PyRep toolkit offers three improvements: (1) a simple and flexible API for robot control and scene manipulation, (2) a new rendering engine, and (3) speed boosts upwards of 10,000x in comparison to the previous Python Remote API. With these improvements, we believe PyRep is the ideal toolkit to facilitate rapid prototy** of learning algorithms in the areas of reinforcement learning, imitation learning, state estimation, map**, and computer vision. △ Less

Submitted 26 June, 2019; originally announced June 2019.

Showing 1–50 of 86 results for author: Davison, A