-
Contrastive Learning for Self-Supervised Pre-Training of Point Cloud Segmentation Networks With Image Data
Authors:
Andrej Janda,
Brandon Wagstaff,
Edwin G. Ng,
Jonathan Kelly
Abstract:
Reducing the quantity of annotations required for supervised training is vital when labels are scarce and costly. This reduction is particularly important for semantic segmentation tasks involving 3D datasets, which are often significantly smaller and more challenging to annotate than their image-based counterparts. Self-supervised pre-training on unlabelled data is one way to reduce the amount of manual annotations needed. Previous work has focused on pre-training with point clouds exclusively. While useful, this approach often requires two or more registered views. In the present work, we combine image and point cloud modalities by first learning self-supervised image features and then using these features to train a 3D model. By incorporating image data, which is often included in many 3D datasets, our pre-training method only requires a single scan of a scene and can be applied to cases where localization information is unavailable. We demonstrate that our pre-training approach, despite using single scans, achieves comparable performance to other multi-scan, point cloud-only methods.
Submitted 4 September, 2023; v1 submitted 17 January, 2023;
originally announced January 2023.
-
Self-Supervised Pre-training of 3D Point Cloud Networks with Image Data
Authors:
Andrej Janda,
Brandon Wagstaff,
Edwin G. Ng,
Jonathan Kelly
Abstract:
Reducing the quantity of annotations required for supervised training is vital when labels are scarce and costly. This reduction is especially important for semantic segmentation tasks involving 3D datasets that are often significantly smaller and more challenging to annotate than their image-based counterparts. Self-supervised pre-training on large unlabelled datasets is one way to reduce the amount of manual annotations needed. Previous work has focused on pre-training with point cloud data exclusively; this approach often requires two or more registered views. In the present work, we combine image and point cloud modalities, by first learning self-supervised image features and then using these features to train a 3D model. By incorporating image data, which is often included in many 3D datasets, our pre-training method only requires a single scan of a scene. We demonstrate that our pre-training approach, despite using single scans, achieves comparable performance to other multi-scan, point cloud-only methods.
Submitted 16 December, 2022; v1 submitted 21 November, 2022;
originally announced November 2022.
-
A Self-Supervised, Differentiable Kalman Filter for Uncertainty-Aware Visual-Inertial Odometry
Authors:
Brandon Wagstaff,
Emmett Wise,
Jonathan Kelly
Abstract:
Visual-inertial odometry (VIO) systems traditionally rely on filtering or optimization-based techniques for egomotion estimation. While these methods are accurate under nominal conditions, they are prone to failure during severe illumination changes, rapid camera motions, or on low-texture image sequences. Learning-based systems have the potential to outperform classical implementations in challenging environments, but, currently, do not perform as well as classical methods in nominal settings. Herein, we introduce a framework for training a hybrid VIO system that leverages the advantages of learning and standard filtering-based state estimation. Our approach is built upon a differentiable Kalman filter, with an IMU-driven process model and a robust, neural network-derived relative pose measurement model. The use of the Kalman filter framework enables the principled treatment of uncertainty at training time and at test time. We show that our self-supervised loss formulation outperforms a similar, supervised method, while also enabling online retraining. We evaluate our system on a visually degraded version of the EuRoC dataset and find that our estimator operates without a significant reduction in accuracy in cases where classical estimators consistently diverge. Finally, by properly utilizing the metric information contained in the IMU measurements, our system is able to recover metric scene scale, while other self-supervised monocular VIO approaches cannot.
Submitted 1 October, 2022; v1 submitted 14 March, 2022;
originally announced March 2022.
-
On the Coupling of Depth and Egomotion Networks for Self-Supervised Structure from Motion
Authors:
Brandon Wagstaff,
Valentin Peretroukhin,
Jonathan Kelly
Abstract:
Structure from motion (SfM) has recently been formulated as a self-supervised learning problem, where neural network models of depth and egomotion are learned jointly through view synthesis. Herein, we address the open problem of how to best couple, or link, the depth and egomotion network components, so that information such as a common scale factor can be shared between the networks. Towards this end, we introduce several notions of coupling, categorize existing approaches, and present a novel tightly-coupled approach that leverages the interdependence of depth and egomotion at training time and at test time. Our approach uses iterative view synthesis to recursively update the egomotion network input, permitting contextual information to be passed between the components. We demonstrate through substantial experiments that our approach promotes consistency between the depth and egomotion predictions at test time, improves generalization, and leads to state-of-the-art accuracy on indoor and outdoor depth and egomotion evaluation benchmarks.
Submitted 11 July, 2022; v1 submitted 7 June, 2021;
originally announced June 2021.
-
Learned Camera Gain and Exposure Control for Improved Visual Feature Detection and Matching
Authors:
Justin Tomasi,
Brandon Wagstaff,
Steven L. Waslander,
Jonathan Kelly
Abstract:
Successful visual navigation depends upon capturing images that contain sufficient useful information. In this letter, we explore a data-driven approach to account for environmental lighting changes, improving the quality of images for use in visual odometry (VO) or visual simultaneous localization and mapping (SLAM). We train a deep convolutional neural network model to predictively adjust camera gain and exposure time parameters such that consecutive images contain a maximal number of matchable features. The training process is fully self-supervised: our training signal is derived from an underlying VO or SLAM pipeline and, as a result, the model is optimized to perform well with that specific pipeline. We demonstrate through extensive real-world experiments that our network can anticipate and compensate for dramatic lighting changes (e.g., transitions into and out of road tunnels), maintaining a substantially higher number of inlier feature matches than competing camera parameter control algorithms.
Submitted 11 July, 2022; v1 submitted 8 February, 2021;
originally announced February 2021.
-
Self-Supervised Scale Recovery for Monocular Depth and Egomotion Estimation
Authors:
Brandon Wagstaff,
Jonathan Kelly
Abstract:
The self-supervised loss formulation for jointly training depth and egomotion neural networks with monocular images is well studied and has demonstrated state-of-the-art accuracy. One of the main limitations of this approach, however, is that the depth and egomotion estimates are only determined up to an unknown scale. In this paper, we present a novel scale recovery loss that enforces consistency between a known camera height and the estimated camera height, generating metric (scaled) depth and egomotion predictions. We show that our proposed method is competitive with other scale recovery techniques that require more information. Further, we demonstrate that our method facilitates network retraining within new environments, whereas other scale-resolving approaches are incapable of doing so. Notably, our egomotion network is able to produce more accurate estimates than a similar method which recovers scale at test time only.
Submitted 1 May, 2022; v1 submitted 8 September, 2020;
originally announced September 2020.
-
Heteroscedastic Uncertainty for Robust Generative Latent Dynamics
Authors:
Oliver Limoyo,
Bryan Chan,
Filip Marić,
Brandon Wagstaff,
Rupam Mahmood,
Jonathan Kelly
Abstract:
Learning or identifying dynamics from a sequence of high-dimensional observations is a difficult challenge in many domains, including reinforcement learning and control. The problem has recently been studied from a generative perspective through latent dynamics: high-dimensional observations are embedded into a lower-dimensional space in which the dynamics can be learned. Despite some successes, latent dynamics models have not yet been applied to real-world robotic systems where learned representations must be robust to a variety of perceptual confounds and noise sources not seen during training. In this paper, we present a method to jointly learn a latent state representation and the associated dynamics that is amenable for long-term planning and closed-loop control under perceptually difficult conditions. As our main contribution, we describe how our representation is able to capture a notion of heteroscedastic or input-specific uncertainty at test time by detecting novel or out-of-distribution (OOD) inputs. We present results from prediction and control experiments on two image-based tasks: a simulated pendulum balancing task and a real-world robotic manipulator reaching task. We demonstrate that our model produces significantly more accurate predictions and exhibits improved control performance, compared to a model that assumes homoscedastic uncertainty only, in the presence of varying degrees of input degradation.
Submitted 11 July, 2022; v1 submitted 18 August, 2020;
originally announced August 2020.
-
Self-Supervised Deep Pose Corrections for Robust Visual Odometry
Authors:
Brandon Wagstaff,
Valentin Peretroukhin,
Jonathan Kelly
Abstract:
We present a self-supervised deep pose correction (DPC) network that applies pose corrections to a visual odometry estimator to improve its accuracy. Instead of regressing inter-frame pose changes directly, we build on prior work that uses data-driven learning to regress pose corrections that account for systematic errors due to violations of modelling assumptions. Our self-supervised formulation removes any requirement for six-degrees-of-freedom ground truth and, in contrast to expectations, often improves overall navigation accuracy compared to a supervised approach. Through extensive experiments, we show that our self-supervised DPC network can significantly enhance the performance of classical monocular and stereo odometry estimators and substantially out-performs state-of-the-art learning-only approaches.
Submitted 14 October, 2020; v1 submitted 27 February, 2020;
originally announced February 2020.
-
Robust Data-Driven Zero-Velocity Detection for Foot-Mounted Inertial Navigation
Authors:
Brandon Wagstaff,
Valentin Peretroukhin,
Jonathan Kelly
Abstract:
We present two novel techniques for detecting zero-velocity events to improve foot-mounted inertial navigation. Our first technique augments a classical zero-velocity detector by incorporating a motion classifier that adaptively updates the detector's threshold parameter. Our second technique uses a long short-term memory (LSTM) recurrent neural network to classify zero-velocity events from raw inertial data, in contrast to the majority of zero-velocity detection methods that rely on basic statistical hypothesis testing. We demonstrate that both of our proposed detectors achieve higher accuracies than existing detectors for trajectories including walking, running, and stair-climbing motions. Additionally, we present a straightforward data augmentation method that is able to extend the LSTM-based model to different inertial sensors without the need to collect new training data.
Submitted 1 May, 2020; v1 submitted 1 October, 2019;
originally announced October 2019.
-
Probabilistic Regression of Rotations using Quaternion Averaging and a Deep Multi-Headed Network
Authors:
Valentin Peretroukhin,
Brandon Wagstaff,
Matthew Giamou,
Jonathan Kelly
Abstract:
Accurate estimates of rotation are crucial to vision-based motion estimation in augmented reality and robotics. In this work, we present a method to extract probabilistic estimates of rotation from deep regression models. First, we build on prior work and argue that a multi-headed network structure we name HydraNet provides better calibrated uncertainty estimates than methods that rely on stochastic forward passes. Second, we extend HydraNet to targets that belong to the rotation group, SO(3), by regressing unit quaternions and using the tools of rotation averaging and uncertainty injection onto the manifold to produce three-dimensional covariances. Finally, we present results and analysis on a synthetic dataset, learn consistent orientation estimates on the 7-Scenes dataset, and show how we can use our learned covariances to fuse deep estimates of relative orientation with classical stereo visual odometry to improve localization on the KITTI dataset.
Submitted 8 May, 2020; v1 submitted 1 April, 2019;
originally announced April 2019.
-
LSTM-Based Zero-Velocity Detection for Robust Inertial Navigation
Authors:
Brandon Wagstaff,
Jonathan Kelly
Abstract:
We present a method to improve the accuracy of a zero-velocity-aided inertial navigation system (INS) by replacing the standard zero-velocity detector with a long short-term memory (LSTM) neural network. While existing threshold-based zero-velocity detectors are not robust to varying motion types, our learned model accurately detects stationary periods of the inertial measurement unit (IMU) despite changes in the motion of the user. Upon detection, zero-velocity pseudo-measurements are fused with a dead reckoning motion model in an extended Kalman filter (EKF). We demonstrate that our LSTM-based zero-velocity detector, used within a zero-velocity-aided INS, improves zero-velocity detection during human localization tasks. Consequently, localization accuracy is also improved.
Our system is evaluated on more than 7.5 km of indoor pedestrian locomotion data, acquired from five different subjects. We show that 3D positioning error is reduced by over 34% compared to existing fixed-threshold zero-velocity detectors for walking, running, and stair climbing motions. Additionally, we demonstrate how our learned zero-velocity detector operates effectively during crawling and ladder climbing. Our system is calibration-free (no careful threshold-tuning is required) and operates consistently with differing users, IMU placements, and shoe types, while being compatible with any generic zero-velocity-aided INS.
Submitted 13 August, 2019; v1 submitted 13 July, 2018;
originally announced July 2018.
-
Improving Foot-Mounted Inertial Navigation Through Real-Time Motion Classification
Authors:
Brandon Wagstaff,
Valentin Peretroukhin,
Jonathan Kelly
Abstract:
We present a method to improve the accuracy of a foot-mounted, zero-velocity-aided inertial navigation system (INS) by varying estimator parameters based on a real-time classification of motion type. We train a support vector machine (SVM) classifier using inertial data recorded by a single foot-mounted sensor to differentiate between six motion types (walking, jogging, running, sprinting, crouch-walking, and ladder-climbing) and report mean test classification accuracy of over 90% on a dataset with five different subjects. From these motion types, we select two of the most common (walking and running), and describe a method to compute optimal zero-velocity detection parameters tailored to both a specific user and motion type by maximizing the detector F-score. By combining the motion classifier with a set of optimal detection parameters, we show how we can reduce INS position error during mixed walking and running motion. We evaluate our adaptive system on a total of 5.9 km of indoor pedestrian navigation performed by five different subjects moving along a 130 m path with surveyed ground truth markers.
Submitted 13 July, 2018; v1 submitted 4 July, 2017;
originally announced July 2017.