Search | arXiv e-print repository

On Advantages of Mask-level Recognition for Outlier-aware Segmentation

Authors: Matej Grcić, Josip Šarić, Siniša Šegvić

Abstract: Most dense recognition approaches bring a separate decision in each particular pixel. These approaches deliver competitive performance in usual closed-set setups. However, important applications in the wild typically require strong performance in the presence of outliers. We show that this demanding setup greatly benefits from mask-level predictions, even in the case of non-finetuned baseline models. Moreover, we propose an alternative formulation of dense recognition uncertainty that effectively reduces false positive responses at semantic borders. The proposed formulation produces a further improvement over a very strong baseline and sets the new state of the art in outlier-aware semantic segmentation with and without training on negative data. Our contributions also lead to performance improvement in a recent panoptic setup. In-depth experiments confirm that our approach succeeds due to implicit aggregation of pixel-level cues into mask-level predictions.

Submitted 5 April, 2023; v1 submitted 9 January, 2023; originally announced January 2023.

Comments: Accepted to CVPR 2023 workshop on Visual Anomaly and Novelty Detection (VAND)

arXiv:2212.10340 [pdf, other]

doi 10.1007/s11263-024-01986-z

Weakly supervised training of universal visual concepts for multi-domain semantic segmentation

Authors: Petra Bevandić, Marin Oršić, Ivan Grubišić, Josip Šarić, Siniša Šegvić

Abstract: Deep supervised models have an unprecedented capacity to absorb large quantities of training data. Hence, training on multiple datasets becomes a method of choice towards strong generalization in usual scenes and graceful performance degradation in edge cases. Unfortunately, different datasets often have incompatible labels. For instance, the Cityscapes road class subsumes all driving surfaces, while Vistas defines separate classes for road markings, manholes etc. Furthermore, many datasets have overlapping labels. For instance, pickups are labeled as trucks in VIPER, cars in Vistas, and vans in ADE20k. We address this challenge by considering labels as unions of universal visual concepts. This allows seamless and principled learning on multi-domain dataset collections without requiring any relabeling effort. Our method achieves competitive within-dataset and cross-dataset generalization, as well as the ability to learn visual concepts which are not separately labeled in any of the training datasets. Experiments reveal competitive or state-of-the-art performance on two multi-domain dataset collections and on the WildDash 2 benchmark.

Submitted 12 March, 2024; v1 submitted 20 December, 2022; originally announced December 2022.

Comments: 27 pages, 16 figures, 10 tables, accepted to International Journal of Computer Vision

Journal ref: International Journal of Computer Vision, 2024, 1-23

arXiv:2203.07908 [pdf, other]

doi 10.3390/rs15081968

Panoptic SwiftNet: Pyramidal Fusion for Real-time Panoptic Segmentation

Authors: Josip Šarić, Marin Oršić, Siniša Šegvić

Abstract: Dense panoptic prediction is a key ingredient in many existing applications such as autonomous driving, automated warehouses or remote sensing. Many of these applications require fast inference over large input resolutions on affordable or even embedded hardware. We propose to achieve this goal by trading off backbone capacity for multi-scale feature extraction. In comparison with contemporaneous approaches to panoptic segmentation, the main novelties of our method are efficient scale-equivariant feature extraction, cross-scale upsampling through pyramidal fusion and boundary-aware learning of pixel-to-instance assignment. The proposed method is very well suited for remote sensing imagery due to the huge number of pixels in typical city-wide and region-wide datasets. We present panoptic experiments on Cityscapes, Vistas, COCO and the BSB-Aerial dataset. Our models outperform the state of the art on the BSB-Aerial dataset while being able to process more than a hundred 1MPx images per second on an RTX3090 GPU with FP16 precision and TensorRT optimization.

Submitted 18 April, 2023; v1 submitted 15 March, 2022; originally announced March 2022.

Comments: Code available at: https://github.com/jsaric/panoptic-swiftnet

Journal ref: Remote Sensing, 2023, 15(8), 1968

arXiv:2108.11224 [pdf, other]

Multi-domain semantic segmentation with overlapping labels

Authors: Petra Bevandić, Marin Oršić, Ivan Grubišić, Josip Šarić, Siniša Šegvić

Abstract: Deep supervised models have an unprecedented capacity to absorb large quantities of training data. Hence, training on many datasets becomes a method of choice towards graceful degradation in unusual scenes. Unfortunately, different datasets often use incompatible labels. For instance, the Cityscapes road class subsumes all driving surfaces, while Vistas defines separate classes for road markings, manholes etc. We address this challenge by proposing a principled method for seamless learning on datasets with overlapping classes based on partial labels and probabilistic loss. Our method achieves competitive within-dataset and cross-dataset generalization, as well as the ability to learn visual concepts which are not separately labeled in any of the training datasets. Experiments reveal competitive or state-of-the-art performance on two multi-domain dataset collections and on the WildDash 2 benchmark.

Submitted 2 November, 2021; v1 submitted 25 August, 2021; originally announced August 2021.

Comments: 18 pages, 8 figures, 11 tables

arXiv:2101.10777 [pdf, other]

doi 10.1109/TNNLS.2021.3136624

Dense Semantic Forecasting in Video by Joint Regression of Features and Feature Motion

Authors: Josip Šarić, Sacha Vražić, Siniša Šegvić

Abstract: Dense semantic forecasting anticipates future events in video by inferring pixel-level semantics of an unobserved future image. We present a novel approach that is applicable to various single-frame architectures and tasks. Our approach consists of two modules. The feature-to-motion (F2M) module forecasts a dense deformation field that warps past features into their future positions. The feature-to-feature (F2F) module regresses the future features directly and is therefore able to account for emergent scenery. The compound F2MF model decouples the effects of motion from the effects of novelty in a task-agnostic manner. We aim to apply F2MF forecasting to the most subsampled and the most abstract representation of a desired single-frame model. Our design takes advantage of deformable convolutions and spatial correlation coefficients across neighbouring time instants. We perform experiments on three dense prediction tasks: semantic segmentation, instance-level segmentation, and panoptic segmentation. The results reveal state-of-the-art forecasting accuracy across three dense prediction tasks.

Submitted 16 December, 2021; v1 submitted 26 January, 2021; originally announced January 2021.

Comments: 13 pages, 10 figures

arXiv:2010.09067 [pdf, other]

Multimodal semantic forecasting based on conditional generation of future features

Authors: Kristijan Fugošić, Josip Šarić, Siniša Šegvić

Abstract: This paper considers semantic forecasting in road-driving scenes. Most existing approaches address this problem as deterministic regression of future features or future predictions given observed frames. However, such approaches ignore the fact that the future cannot always be guessed with certainty. For example, when a car is about to turn around a corner, the road which is currently occluded by buildings may turn out to be either free to drive, or occupied by people, other vehicles or roadworks. When a deterministic model confronts such a situation, its best guess is to forecast the most likely outcome. However, this is not acceptable since it defeats the purpose of forecasting to improve security. It also throws away valuable training data, since a deterministic model is unable to learn any deviation from the norm. We address this problem by providing more freedom to the model through allowing it to forecast different futures. We propose to formulate multimodal forecasting as sampling of a multimodal generative model conditioned on the observed frames. Experiments on the Cityscapes dataset reveal that our multimodal model outperforms its deterministic counterpart in short-term forecasting while performing slightly worse in the mid-term case.

Submitted 18 October, 2020; originally announced October 2020.

Comments: Accepted to German Conference on Pattern Recognition 2020. 24 pages, 11 figures, 5 tables

arXiv:2009.01636 [pdf, ps, other]

Multi-domain semantic segmentation with pyramidal fusion

Authors: Petra Bevandić, Marin Oršić, Ivan Grubišić, Josip Šarić, Siniša Šegvić

Abstract: We present our submission to the semantic segmentation contest of the Robust Vision Challenge held at ECCV 2020. The contest requires submitting the same model to seven benchmarks from three different domains. Our approach is based on the SwiftNet architecture with pyramidal fusion. We address inconsistent taxonomies with a single-level 193-dimensional softmax output. We strive to train with large batches in order to stabilize optimization of a hard recognition problem, and to favour smooth evolution of batchnorm statistics. We achieve this by implementing a custom backward step through log-sum-prob loss, and by using small crops before freezing the population statistics. Our model ranks first on the RVC semantic segmentation challenge as well as on the WildDash 2 leaderboard. This suggests that pyramidal fusion is competitive not only for efficient inference with lightweight backbones, but also in large-scale setups for multi-domain application.

Submitted 7 October, 2021; v1 submitted 2 September, 2020; originally announced September 2020.

Comments: 2 pages, 2 tables, no figures

arXiv:1907.11475 [pdf, other]

Single Level Feature-to-Feature Forecasting with Deformable Convolutions

Authors: Josip Šarić, Marin Oršić, Tonći Antunović, Sacha Vražić, Siniša Šegvić

Abstract: Future anticipation is of vital importance in autonomous driving and other decision-making systems. We present a method to anticipate semantic segmentation of future frames in driving scenarios based on feature-to-feature forecasting. Our method is based on a semantic segmentation model without lateral connections within the upsampling path. Such design ensures that the forecasting addresses only the most abstract features on a very coarse resolution. We further propose to express feature-to-feature forecasting with deformable convolutions. This increases the modelling power due to being able to represent different motion patterns within a single feature map. Experiments show that our models with deformable convolutions outperform their regular and dilated counterparts while minimally increasing the number of parameters. Our method achieves state-of-the-art performance on the Cityscapes validation set when forecasting nine timesteps into the future.

Submitted 26 July, 2019; originally announced July 2019.

Comments: Accepted to German Conference on Pattern Recognition 2019. 19 pages, 8 figures, 7 tables

Showing 1–8 of 8 results for author: Šarić, J