
BACS: Background Aware Continual Semantic Segmentation

Authors: Mostafa ElAraby, Ali Harakeh, Liam Paull

Abstract: Semantic segmentation plays a crucial role in enabling comprehensive scene understanding for robotic systems. However, generating annotations is challenging, requiring labels for every pixel in an image. In scenarios like autonomous driving, there's a need to progressively incorporate new classes as the operating environment of the deployed agent becomes more complex. For enhanced annotation efficiency, ideally, only pixels belonging to new classes would be annotated. This approach is known as Continual Semantic Segmentation (CSS). Besides the common problem of classical catastrophic forgetting in the continual learning setting, CSS suffers from the inherent ambiguity of the background, a phenomenon we refer to as the "background shift", since pixels labeled as background could correspond to future classes (forward background shift) or previous classes (backward background shift). As a result, continual learning approaches tend to fail. This paper proposes a Backward Background Shift Detector (BACS) to detect previously observed classes based on their distance in the latent space from the foreground centroids of previous steps. Moreover, we propose a modified version of the cross-entropy loss function, incorporating the BACS detector to down-weight background pixels associated with formerly observed classes. To combat catastrophic forgetting, we employ masked feature distillation alongside dark experience replay. Additionally, our approach includes a transformer decoder capable of adjusting to new classes without necessitating an additional classification head. We validate BACS's superior performance over existing state-of-the-art methods on standard CSS benchmarks.

Submitted 19 April, 2024; originally announced April 2024.

Comments: 8 pages, 4 figures, CRV 2024

arXiv:2301.05709 [pdf, other]

Self-Supervised Image-to-Point Distillation via Semantically Tolerant Contrastive Loss

Authors: Anas Mahmoud, Jordan S. K. Hu, Tianshu Kuai, Ali Harakeh, Liam Paull, Steven L. Waslander

Abstract: An effective framework for learning 3D representations for perception tasks is distilling rich self-supervised image features via contrastive learning. However, image-to-point representation learning for autonomous driving datasets faces two main challenges: 1) the abundance of self-similarity, which results in the contrastive losses pushing away semantically similar point and image regions and thus disturbing the local semantic structure of the learned representations, and 2) severe class imbalance as pretraining gets dominated by over-represented classes. We propose to alleviate the self-similarity problem through a novel semantically tolerant image-to-point contrastive loss that takes into consideration the semantic distance between positive and negative image regions to minimize contrasting semantically similar point and image regions. Additionally, we address class imbalance by designing a class-agnostic balanced loss that approximates the degree of class imbalance through an aggregate sample-to-samples semantic similarity measure. We demonstrate that our semantically tolerant contrastive loss with class balancing improves state-of-the-art 2D-to-3D representation learning in all evaluation settings on 3D semantic segmentation. Our method consistently outperforms state-of-the-art 2D-to-3D representation learning frameworks across a wide range of 2D self-supervised pretrained models.

Submitted 24 March, 2023; v1 submitted 12 January, 2023; originally announced January 2023.

Comments: Accepted in CVPR 2023

arXiv:2211.13724 [pdf, other]

Estimating Regression Predictive Distributions with Sample Networks

Authors: Ali Harakeh, Jordan Hu, Naiqing Guan, Steven L. Waslander, Liam Paull

Abstract: Estimating the uncertainty in deep neural network predictions is crucial for many real-world applications. A common approach to model uncertainty is to choose a parametric distribution and fit the data to it using maximum likelihood estimation. The chosen parametric form can be a poor fit to the data-generating distribution, resulting in unreliable uncertainty estimates. In this work, we propose SampleNet, a flexible and scalable architecture for modeling uncertainty that avoids specifying a parametric form on the output distribution. SampleNets do so by defining an empirical distribution using samples that are learned with the Energy Score and regularized with the Sinkhorn Divergence. SampleNets are shown to fit a wide range of distributions well and to outperform baselines on large-scale real-world regression tasks.

Submitted 24 November, 2022; originally announced November 2022.

Comments: Accepted for publication in AAAI 2023. Example code at: https://samplenet.github.io/

arXiv:2107.13682 [pdf, other]

Bayesian Embeddings for Few-Shot Open World Recognition

Authors: John Willes, James Harrison, Ali Harakeh, Chelsea Finn, Marco Pavone, Steven Waslander

Abstract: As autonomous decision-making agents move from narrow operating environments to unstructured worlds, learning systems must move from a closed-world formulation to an open-world and few-shot setting in which agents continuously learn new classes from small amounts of information. This stands in stark contrast to modern machine learning systems that are typically designed with a known set of classes and a large number of examples for each class. In this work we extend embedding-based few-shot learning algorithms to the open-world recognition setting. We combine Bayesian non-parametric class priors with an embedding-based pre-training scheme to yield a highly flexible framework which we refer to as few-shot learning for open world recognition (FLOWR). We benchmark our framework on open-world extensions of the common MiniImageNet and TieredImageNet few-shot learning datasets. Our results show, compared to prior methods, strong classification accuracy performance and up to a 12% improvement in H-measure (a measure of novel class detection) from our non-parametric open-world few-shot learning scheme.

Submitted 5 October, 2022; v1 submitted 28 July, 2021; originally announced July 2021.

arXiv:2103.01100 [pdf, other]

Categorical Depth Distribution Network for Monocular 3D Object Detection

Authors: Cody Reading, Ali Harakeh, Julia Chae, Steven L. Waslander

Abstract: Monocular 3D object detection is a key problem for autonomous vehicles, as it provides a solution with simple configuration compared to typical multi-sensor systems. The main challenge in monocular 3D detection lies in accurately predicting object depth, which must be inferred from object and scene cues due to the lack of direct range measurement. Many methods attempt to directly estimate depth to assist in 3D detection, but show limited performance as a result of depth inaccuracy. Our proposed solution, Categorical Depth Distribution Network (CaDDN), uses a predicted categorical depth distribution for each pixel to project rich contextual feature information to the appropriate depth interval in 3D space. We then use the computationally efficient bird's-eye-view projection and single-stage detector to produce the final output bounding boxes. We design CaDDN as a fully differentiable end-to-end approach for joint depth estimation and object detection. We validate our approach on the KITTI 3D object detection benchmark, where we rank 1st among published monocular methods. We also provide the first monocular 3D detection results on the newly released Waymo Open Dataset. A code release for CaDDN is available.

Submitted 23 March, 2021; v1 submitted 1 March, 2021; originally announced March 2021.

Comments: Accepted in CVPR 2021
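
Note on the CaDDN abstract above: the idea of using a per-pixel categorical depth distribution to place image features at depth intervals can be illustrated with a small sketch. The code below is not the authors' implementation. It assumes (as an illustration only) that the lifting is a softmax over depth-bin logits followed by an outer product with the pixel's feature vector; the names `softmax` and `lift_pixel` are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn depth-bin logits into a categorical distribution over bins."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lift_pixel(feature: List[float], depth_logits: List[float]) -> List[List[float]]:
    """Spread one pixel's feature vector across depth bins.

    Assumption (not stated in the abstract): each depth bin receives the
    feature vector scaled by that bin's probability, i.e. an outer product
    of the depth distribution and the feature vector.
    Returns a list of length D (depth bins), each entry a C-dim vector.
    """
    probs = softmax(depth_logits)
    return [[p * f for f in feature] for p in probs]


# Hypothetical example: a 2-channel feature and 4 equally likely depth bins.
lifted = lift_pixel([1.0, 2.0], [0.0, 0.0, 0.0, 0.0])
```

Because the bin probabilities sum to one, summing the lifted vectors over the depth axis recovers the original feature; a sharply peaked distribution would instead concentrate the feature in a single depth interval.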

arXiv:2101.05036 [pdf, other]

Estimating and Evaluating Regression Predictive Uncertainty in Deep Object Detectors

Authors: Ali Harakeh, Steven L. Waslander

Abstract: Predictive uncertainty estimation is an essential next step for the reliable deployment of deep object detectors in safety-critical tasks. In this work, we focus on estimating predictive distributions for bounding box regression output with variance networks. We show that in the context of object detection, training variance networks with negative log likelihood (NLL) can lead to high entropy predictive distributions regardless of the correctness of the output mean. We propose to use the energy score as a non-local proper scoring rule and find that when used for training, the energy score leads to better calibrated and lower entropy predictive distributions than NLL. We also address the widespread use of non-proper scoring metrics for evaluating predictive distributions from deep object detectors by proposing an alternate evaluation approach founded on proper scoring rules. Using the proposed evaluation tools, we show that although variance networks can be used to produce high quality predictive distributions, ad-hoc approaches used by seminal object detectors for choosing regression targets during training do not provide wide enough data support for reliable variance learning. We hope that our work helps shift evaluation in probabilistic object detection to better align with predictive uncertainty evaluation in other machine learning domains. Code for all models, evaluation, and datasets is available at: https://github.com/asharakeh/probdet.git.

Submitted 12 March, 2021; v1 submitted 13 January, 2021; originally announced January 2021.

Comments: Published as a conference paper at ICLR 2021. Link: https://openreview.net/forum?id=YLewtnvKgR7. This is the final camera-ready version
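
Note on the energy score mentioned in the abstract above: for a predictive distribution P and an observed target y, the energy score is commonly written as ES(P, y) = E||X - y|| - 0.5 E||X - X'||, with X and X' independent draws from P; lower is better. The sketch below estimates it from a finite set of samples. It is a generic Monte Carlo estimator and makes no claim about how the paper computes or trains with the score; the function names are hypothetical.

```python
import math
from typing import Sequence


def _euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def energy_score(samples: Sequence[Sequence[float]], target: Sequence[float]) -> float:
    """Sample-based estimate of the energy score (lower is better).

    samples: M draws from the predictive distribution, each a vector.
    target:  the observed ground-truth vector.
    The pairwise term averages over the M*(M-1) ordered pairs of distinct
    samples, which is one common choice of estimator.
    """
    m = len(samples)
    if m < 2:
        raise ValueError("need at least two samples for the pairwise term")
    to_target = sum(_euclidean(s, target) for s in samples) / m
    pairwise = sum(
        _euclidean(samples[i], samples[j])
        for i in range(m)
        for j in range(m)
        if i != j
    ) / (m * (m - 1))
    return to_target - 0.5 * pairwise


# Hypothetical 1-D example: the same two samples scored against two targets.
score_near = energy_score([[0.0], [2.0]], [1.0])
score_far = energy_score([[0.0], [2.0]], [5.0])
```

The first term rewards samples that land close to the target, while the second term rewards spread among the samples, so a distribution cannot lower its score simply by collapsing to a point or by inflating its variance.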

arXiv:2011.10671 [pdf, other]

doi 10.1109/TITS.2021.3096854

A Review and Comparative Study on Probabilistic Object Detection in Autonomous Driving

Authors: Di Feng, Ali Harakeh, Steven Waslander, Klaus Dietmayer

Abstract: Capturing uncertainty in object detection is indispensable for safe autonomous driving. In recent years, deep learning has become the de-facto approach for object detection, and many probabilistic object detectors have been proposed. However, there is no summary on uncertainty estimation in deep object detection, and existing methods are not only built with different network architectures and uncertainty estimation methods, but also evaluated on different datasets with a wide range of evaluation metrics. As a result, a comparison among methods remains challenging, as does the selection of a model that best suits a particular application. This paper aims to alleviate this problem by providing a review and comparative study on existing probabilistic object detection methods for autonomous driving applications. First, we provide an overview of generic uncertainty estimation in deep learning, and then systematically survey existing methods and evaluation metrics for probabilistic object detection. Next, we present a strict comparative study for probabilistic object detection based on an image detector and three public autonomous driving datasets. Finally, we present a discussion of the remaining challenges and future work. Code has been made available at https://github.com/asharakeh/pod_compare.git

Submitted 11 July, 2021; v1 submitted 20 November, 2020; originally announced November 2020.

Comments: Accepted in the IEEE Transactions on Intelligent Transportation Systems

arXiv:1903.03838 [pdf, other]

BayesOD: A Bayesian Approach for Uncertainty Estimation in Deep Object Detectors

Authors: Ali Harakeh, Michael Smart, Steven L. Waslander

Abstract: When incorporating deep neural networks into robotic systems, a major challenge is the lack of uncertainty measures associated with their output predictions. Methods for uncertainty estimation in the output of deep object detectors (DNNs) have been proposed in recent works, but have had limited success due to 1) information loss at the detector's non-maximum suppression (NMS) stage, and 2) failure to take into account the multitask, many-to-one nature of anchor-based object detection. To that end, we introduce BayesOD, an uncertainty estimation approach that reformulates the standard object detector inference and non-maximum suppression components from a Bayesian perspective. Experiments performed on four common object detection datasets show that BayesOD provides uncertainty estimates that are better correlated with the accuracy of detections, manifesting as a significant reduction of 9.77%-13.13% on the minimum Gaussian uncertainty error metric and a reduction of 1.63%-5.23% on the minimum Categorical uncertainty error metric. Code will be released at https://github.com/asharakeh/bayes-od-rc.

Submitted 16 September, 2019; v1 submitted 9 March, 2019; originally announced March 2019.

arXiv:1807.06072 [pdf, other]

Leveraging Pre-Trained 3D Object Detection Models For Fast Ground Truth Generation

Authors: Jungwook Lee, Sean Walsh, Ali Harakeh, Steven L. Waslander

Abstract: Training 3D object detectors for autonomous driving has been limited to small datasets due to the effort required to generate annotations. Reducing both task complexity and the amount of task switching done by annotators is key to reducing the effort and time required to generate 3D bounding box annotations. This paper introduces a novel ground truth generation method that combines human supervision with pretrained neural networks to generate per-instance 3D point cloud segmentation, 3D bounding boxes, and class annotations. The annotators provide object anchor clicks which behave as a seed to generate instance segmentation results in 3D. The points belonging to each instance are then used to regress object centroids, bounding box dimensions, and object orientation. Our proposed annotation scheme requires 30x lower human annotation time. We use the KITTI 3D object detection dataset to evaluate the efficiency and the quality of our annotation scheme. We also test the proposed scheme on previously unseen data from the Autonomoose self-driving vehicle to demonstrate generalization capabilities of the network.

Submitted 16 July, 2018; originally announced July 2018.

arXiv:1807.06056 [pdf, other]

Unlimited Road-scene Synthetic Annotation (URSA) Dataset

Authors: Matt Angus, Mohamed ElBalkini, Samin Khan, Ali Harakeh, Oles Andrienko, Cody Reading, Steven Waslander, Krzysztof Czarnecki

Abstract: In training deep neural networks for semantic segmentation, the main limiting factor is the low amount of ground truth annotation data that is available in currently existing datasets. The limited availability of such data is due to the time cost and human effort required to accurately and consistently label real images on a pixel level. Modern sandbox video game engines provide open world environments where traffic and pedestrians behave in a pseudo-realistic manner. This caters well to the collection of a believable road-scene dataset. Utilizing open-source tools and resources found in single-player modding communities, we provide a method for persistent, ground truth, asset annotation of a game world. By collecting a synthetic dataset containing upwards of 1,000,000 images, we demonstrate real-time, on-demand, ground truth data annotation capability of our method. Supplementing the Cityscapes dataset with this synthetic data, we show that our data generation method provides qualitative as well as quantitative improvements for training networks over previous methods that use video games as a surrogate.

Submitted 16 July, 2018; originally announced July 2018.

Comments: Accepted in The 21st IEEE International Conference on Intelligent Transportation Systems

arXiv:1806.07987 [pdf, other]

A Hierarchical Deep Architecture and Mini-Batch Selection Method For Joint Traffic Sign and Light Detection

Authors: Alex D. Pon, Oles Andrienko, Ali Harakeh, Steven L. Waslander

Abstract: Traffic light and sign detectors on autonomous cars are integral for road scene perception. The literature is abundant with deep learning networks that detect either lights or signs, not both, which makes them unsuitable for real-life deployment due to the limited graphics processing unit (GPU) memory and power available on embedded systems. The root cause of this issue is that no public dataset contains both traffic light and sign labels, which leads to difficulties in developing a joint detection framework. We present a deep hierarchical architecture in conjunction with a mini-batch proposal selection mechanism that allows a network to detect both traffic lights and signs from training on separate traffic light and sign datasets. Our method solves the overlapping issue where instances from one dataset are not labelled in the other dataset. We are the first to present a network that performs joint detection on traffic lights and signs. We measure our network on the Tsinghua-Tencent 100K benchmark for traffic sign detection and the Bosch Small Traffic Lights benchmark for traffic light detection and show it outperforms the existing Bosch Small Traffic light state-of-the-art method. We focus on autonomous car deployment and show our network is more suitable than others because of its low memory footprint and real-time image processing time. Qualitative results can be viewed at https://youtu.be/_YmogPzBXOw

Submitted 13 September, 2018; v1 submitted 20 June, 2018; originally announced June 2018.

Comments: Accepted in the IEEE 15th Conference on Computer and Robot Vision

arXiv:1802.00036 [pdf, other]

In Defense of Classical Image Processing: Fast Depth Completion on the CPU

Authors: Jason Ku, Ali Harakeh, Steven L. Waslander

Abstract: With the rise of data driven deep neural networks as a realization of universal function approximators, most research on computer vision problems has moved away from hand crafted classical image processing algorithms. This paper shows that with a well designed algorithm, we are capable of outperforming neural network based methods on the task of depth completion. The proposed algorithm is simple and fast, runs on the CPU, and relies only on basic image processing operations to perform depth completion of sparse LIDAR depth data. We evaluate our algorithm on the challenging KITTI depth completion benchmark, and at the time of submission, our method ranks first on the KITTI test server among all published methods. Furthermore, our algorithm is data independent, requiring no training data to perform the task at hand. The code written in Python will be made publicly available at https://github.com/kujason/ip_basic.

Submitted 31 January, 2018; originally announced February 2018.

arXiv:1712.02294 [pdf, other]

Joint 3D Proposal Generation and Object Detection from View Aggregation

Authors: Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, Steven Waslander

Abstract: We present AVOD, an Aggregate View Object Detection network for autonomous driving scenarios. The proposed neural network architecture uses LIDAR point clouds and RGB images to generate features that are shared by two subnetworks: a region proposal network (RPN) and a second stage detector network. The proposed RPN uses a novel architecture capable of performing multimodal feature fusion on high resolution feature maps to generate reliable 3D object proposals for multiple object classes in road scenes. Using these proposals, the second stage detection network performs accurate oriented 3D bounding box regression and category classification to predict the extents, orientation, and classification of objects in 3D space. Our proposed architecture is shown to produce state of the art results on the KITTI 3D object detection benchmark while running in real time with a low memory footprint, making it a suitable candidate for deployment on autonomous vehicles. Code is at: https://github.com/kujason/avod

Submitted 12 July, 2018; v1 submitted 6 December, 2017; originally announced December 2017.

Comments: For any inquiries contact aharakeh(at)uwaterloo(dot)ca

Showing 1–13 of 13 results for author: Harakeh, A