-
On Advantages of Mask-level Recognition for Outlier-aware Segmentation
Authors:
Matej Grcić,
Josip Šarić,
Siniša Šegvić
Abstract:
Most dense recognition approaches make a separate decision in each particular pixel. These approaches deliver competitive performance in usual closed-set setups. However, important applications in the wild typically require strong performance in the presence of outliers. We show that this demanding setup greatly benefits from mask-level predictions, even in the case of non-finetuned baseline models. Moreover, we propose an alternative formulation of dense recognition uncertainty that effectively reduces false positive responses at semantic borders. The proposed formulation produces a further improvement over a very strong baseline and sets the new state of the art in outlier-aware semantic segmentation with and without training on negative data. Our contributions also lead to performance improvement in a recent panoptic setup. In-depth experiments confirm that our approach succeeds due to implicit aggregation of pixel-level cues into mask-level predictions.
Submitted 5 April, 2023; v1 submitted 9 January, 2023;
originally announced January 2023.
-
Weakly supervised training of universal visual concepts for multi-domain semantic segmentation
Authors:
Petra Bevandić,
Marin Oršić,
Ivan Grubišić,
Josip Šarić,
Siniša Šegvić
Abstract:
Deep supervised models have an unprecedented capacity to absorb large quantities of training data. Hence, training on multiple datasets becomes a method of choice towards strong generalization in usual scenes and graceful performance degradation in edge cases. Unfortunately, different datasets often have incompatible labels. For instance, the Cityscapes road class subsumes all driving surfaces, while Vistas defines separate classes for road markings, manholes etc. Furthermore, many datasets have overlapping labels. For instance, pickups are labeled as trucks in VIPER, cars in Vistas, and vans in ADE20k. We address this challenge by considering labels as unions of universal visual concepts. This allows seamless and principled learning on multi-domain dataset collections without requiring any relabeling effort. Our method achieves competitive within-dataset and cross-dataset generalization, as well as the ability to learn visual concepts which are not separately labeled in any of the training datasets. Experiments reveal competitive or state-of-the-art performance on two multi-domain dataset collections and on the WildDash 2 benchmark.
Submitted 12 March, 2024; v1 submitted 20 December, 2022;
originally announced December 2022.
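The idea of treating a dataset label as a union of universal visual concepts can be illustrated with a small sketch. It is not the authors' implementation: the four-concept taxonomy, the label-to-concept mappings and the function names below are hypothetical, chosen only to show how the probability of a coarse label can be obtained by summing the probabilities of the universal concepts it covers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def union_nll(logits, concept_ids):
    """Negative log-likelihood of a label modelled as a union of concepts.

    The label probability is the sum of the probabilities of the
    universal concepts that the dataset-specific label covers.
    """
    probs = softmax(logits)
    p_label = sum(probs[i] for i in concept_ids)
    return -math.log(p_label)

# Hypothetical universal taxonomy (an assumption for this sketch):
# 0 = road surface, 1 = road marking, 2 = manhole, 3 = car
CITYSCAPES_ROAD = [0, 1, 2]   # coarse label covering all driving surfaces
VISTAS_MARKING = [1]          # fine label covering a single concept

logits = [2.0, 1.0, 0.5, -1.0]  # per-pixel logits over universal concepts
loss_coarse = union_nll(logits, CITYSCAPES_ROAD)
loss_fine = union_nll(logits, VISTAS_MARKING)
```

With this formulation a pixel labeled with a coarse class does not penalize the model for preferring any particular concept inside the union, which is why no relabeling is needed in this toy setting.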
-
Panoptic SwiftNet: Pyramidal Fusion for Real-time Panoptic Segmentation
Authors:
Josip Šarić,
Marin Oršić,
Siniša Šegvić
Abstract:
Dense panoptic prediction is a key ingredient in many existing applications such as autonomous driving, automated warehouses or remote sensing. Many of these applications require fast inference over large input resolutions on affordable or even embedded hardware. We propose to achieve this goal by trading off backbone capacity for multi-scale feature extraction. In comparison with contemporaneous approaches to panoptic segmentation, the main novelties of our method are efficient scale-equivariant feature extraction, cross-scale upsampling through pyramidal fusion and boundary-aware learning of pixel-to-instance assignment. The proposed method is very well suited for remote sensing imagery due to the huge number of pixels in typical city-wide and region-wide datasets. We present panoptic experiments on Cityscapes, Vistas, COCO and the BSB-Aerial dataset. Our models outperform the state of the art on the BSB-Aerial dataset while being able to process more than a hundred 1MPx images per second on an RTX3090 GPU with FP16 precision and TensorRT optimization.
Submitted 18 April, 2023; v1 submitted 15 March, 2022;
originally announced March 2022.
-
Multi-domain semantic segmentation with overlapping labels
Authors:
Petra Bevandić,
Marin Oršić,
Ivan Grubišić,
Josip Šarić,
Siniša Šegvić
Abstract:
Deep supervised models have an unprecedented capacity to absorb large quantities of training data. Hence, training on many datasets becomes a method of choice towards graceful degradation in unusual scenes. Unfortunately, different datasets often use incompatible labels. For instance, the Cityscapes road class subsumes all driving surfaces, while Vistas defines separate classes for road markings, manholes etc. We address this challenge by proposing a principled method for seamless learning on datasets with overlapping classes based on partial labels and probabilistic loss. Our method achieves competitive within-dataset and cross-dataset generalization, as well as the ability to learn visual concepts which are not separately labeled in any of the training datasets. Experiments reveal competitive or state-of-the-art performance on two multi-domain dataset collections and on the WildDash 2 benchmark.
Submitted 2 November, 2021; v1 submitted 25 August, 2021;
originally announced August 2021.
-
Dense Semantic Forecasting in Video by Joint Regression of Features and Feature Motion
Authors:
Josip Šarić,
Sacha Vražić,
Siniša Šegvić
Abstract:
Dense semantic forecasting anticipates future events in video by inferring pixel-level semantics of an unobserved future image. We present a novel approach that is applicable to various single-frame architectures and tasks. Our approach consists of two modules. The feature-to-motion (F2M) module forecasts a dense deformation field that warps past features into their future positions. The feature-to-feature (F2F) module regresses the future features directly and is therefore able to account for emergent scenery. The compound F2MF model decouples the effects of motion from the effects of novelty in a task-agnostic manner. We aim to apply F2MF forecasting to the most subsampled and the most abstract representation of a desired single-frame model. Our design takes advantage of deformable convolutions and spatial correlation coefficients across neighbouring time instants. We perform experiments on three dense prediction tasks: semantic segmentation, instance-level segmentation, and panoptic segmentation. The results reveal state-of-the-art forecasting accuracy across all three tasks.
Submitted 16 December, 2021; v1 submitted 26 January, 2021;
originally announced January 2021.
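Warping past features with a dense displacement field, as the F2M module is described above, can be pictured with a toy example. The sketch below is only an illustration under simplifying assumptions: the field is given rather than forecast, a single-channel feature map is used, sampling is nearest-neighbour with border clamping, and the function name `warp_backward` is hypothetical. It does not reproduce the paper's architecture.

```python
def warp_backward(features, flow):
    """Backward-warp a 2D feature map with a per-pixel displacement field.

    features: list of rows (H x W) of floats, the past feature map.
    flow:     H x W grid of (dy, dx) offsets; each output location (y, x)
              reads the past feature at (y + dy, x + dx).
    Sampling is nearest-neighbour and out-of-range reads are clamped
    to the border (both are simplifications for this sketch).
    """
    h, w = len(features), len(features[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y][x]
            sy = min(max(int(round(y + dy)), 0), h - 1)
            sx = min(max(int(round(x + dx)), 0), w - 1)
            out[y][x] = features[sy][sx]
    return out

# Toy input: every output pixel reads from its left neighbour,
# so the content moves one pixel to the right.
past = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
flow = [[(0, -1)] * 3 for _ in range(2)]
future = warp_backward(past, flow)
```

Warping alone can only rearrange features that were already observed, which is consistent with the abstract's motivation for a separate F2F module that accounts for emergent scenery.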
-
Multimodal semantic forecasting based on conditional generation of future features
Authors:
Kristijan Fugošić,
Josip Šarić,
Siniša Šegvić
Abstract:
This paper considers semantic forecasting in road-driving scenes. Most existing approaches address this problem as deterministic regression of future features or future predictions given observed frames. However, such approaches ignore the fact that the future cannot always be guessed with certainty. For example, when a car is about to turn around a corner, the road which is currently occluded by buildings may turn out to be either free to drive, or occupied by people, other vehicles or roadworks. When a deterministic model confronts such a situation, its best guess is to forecast the most likely outcome. However, this is not acceptable since it defeats the purpose of forecasting to improve security. It also throws away valuable training data, since a deterministic model is unable to learn any deviation from the norm. We address this problem by providing more freedom to the model through allowing it to forecast different futures. We propose to formulate multimodal forecasting as sampling of a multimodal generative model conditioned on the observed frames. Experiments on the Cityscapes dataset reveal that our multimodal model outperforms its deterministic counterpart in short-term forecasting while performing slightly worse in the mid-term case.
Submitted 18 October, 2020;
originally announced October 2020.
-
Multi-domain semantic segmentation with pyramidal fusion
Authors:
Petra Bevandić,
Marin Oršić,
Ivan Grubišić,
Josip Šarić,
Siniša Šegvić
Abstract:
We present our submission to the semantic segmentation contest of the Robust Vision Challenge held at ECCV 2020. The contest requires submitting the same model to seven benchmarks from three different domains. Our approach is based on the SwiftNet architecture with pyramidal fusion. We address inconsistent taxonomies with a single-level 193-dimensional softmax output. We strive to train with large batches in order to stabilize optimization of a hard recognition problem, and to favour smooth evolution of batchnorm statistics. We achieve this by implementing a custom backward step through log-sum-prob loss, and by using small crops before freezing the population statistics. Our model ranks first on the RVC semantic segmentation challenge as well as on the WildDash 2 leaderboard. This suggests that pyramidal fusion is competitive not only for efficient inference with lightweight backbones, but also in large-scale setups for multi-domain application.
Submitted 7 October, 2021; v1 submitted 2 September, 2020;
originally announced September 2020.
-
Single Level Feature-to-Feature Forecasting with Deformable Convolutions
Authors:
Josip Šarić,
Marin Oršić,
Tonći Antunović,
Sacha Vražić,
Siniša Šegvić
Abstract:
Future anticipation is of vital importance in autonomous driving and other decision-making systems. We present a method to anticipate semantic segmentation of future frames in driving scenarios based on feature-to-feature forecasting. Our method is based on a semantic segmentation model without lateral connections within the upsampling path. Such a design ensures that the forecasting addresses only the most abstract features on a very coarse resolution. We further propose to express feature-to-feature forecasting with deformable convolutions. This increases the modelling power due to being able to represent different motion patterns within a single feature map. Experiments show that our models with deformable convolutions outperform their regular and dilated counterparts while minimally increasing the number of parameters. Our method achieves state-of-the-art performance on the Cityscapes validation set when forecasting nine timesteps into the future.
Submitted 26 July, 2019;
originally announced July 2019.