-
Gradient-based Class Weighting for Unsupervised Domain Adaptation in Dense Prediction Visual Tasks
Authors:
Roberto Alcover-Couso,
Marcos Escudero-Viñolo,
Juan C. SanMiguel,
Jesus Bescós
Abstract:
In unsupervised domain adaptation (UDA), where models are trained on source data (e.g., synthetic) and adapted to target data (e.g., real-world) without target annotations, addressing the challenge of significant class imbalance remains an open issue. Despite considerable progress in bridging the domain gap, existing methods often experience performance degradation when confronted with highly imbalanced dense prediction visual tasks like semantic and panoptic segmentation. This discrepancy becomes especially pronounced due to the lack of equivalent priors between the source and target domains, rendering class-imbalance techniques used in other areas (e.g., image classification) ineffective in UDA scenarios. This paper proposes a class-imbalance mitigation strategy that incorporates class weights into the UDA learning losses, with the novelty of estimating these weights dynamically through the loss gradient, defining Gradient-based class weighting (GBW) learning. GBW naturally increases the contribution of classes whose learning is hindered by highly represented classes, and has the advantage of adapting automatically and quickly to the training outcomes of each iteration, avoiding the explicit curricular learning patterns common in loss-weighting strategies. Extensive experimentation validates the effectiveness of GBW across architectures (convolutional and transformer), UDA strategies (adversarial, self-training and entropy minimization), tasks (semantic and panoptic segmentation), and datasets (GTA and Synthia). An analysis of the source of this advantage shows that GBW consistently increases the recall of under-represented classes.
Submitted 1 July, 2024;
originally announced July 2024.
-
Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models
Authors:
Pablo Marcos-Manchón,
Roberto Alcover-Couso,
Juan C. SanMiguel,
Jose M. Martínez
Abstract:
Diffusion models represent a new paradigm in text-to-image generation. Beyond generating high-quality images from text prompts, models such as Stable Diffusion have been successfully extended to the joint generation of semantic segmentation pseudo-masks. However, current extensions primarily rely on extracting attentions linked to prompt words used for image synthesis. This approach limits the generation of segmentation masks derived from word tokens not contained in the text prompt. In this work, we introduce Open-Vocabulary Attention Maps (OVAM), a training-free method for text-to-image diffusion models that enables the generation of attention maps for any word. In addition, we propose a lightweight optimization process based on OVAM for finding tokens that generate accurate attention maps for an object class with a single annotation. We evaluate these tokens within existing state-of-the-art Stable Diffusion extensions. The best-performing model improves its mIoU from 52.1 to 86.6 for the synthetic images' pseudo-masks, demonstrating that our optimized tokens are an efficient way to improve the performance of existing methods without architectural changes or retraining.
Submitted 21 March, 2024;
originally announced March 2024.
-
The Robust Semantic Segmentation UNCV2023 Challenge Results
Authors:
Xuanlong Yu,
Yi Zuo,
Zitao Wang,
Xiaowen Zhang,
Jiaxuan Zhao,
Yuting Yang,
Licheng Jiao,
Rui Peng,
Xinyi Wang,
Junpei Zhang,
Kexin Zhang,
Fang Liu,
Roberto Alcover-Couso,
Juan C. SanMiguel,
Marcos Escudero-Viñolo,
Hanlin Tian,
Kenta Matsui,
Tianhao Wang,
Fahmy Adan,
Zhitong Gao,
Xuming He,
Quentin Bouniot,
Hossein Moghaddam,
Shyam Nandan Rai,
Fabio Cermelli
, et al. (12 additional authors not shown)
Abstract:
This paper outlines the winning solutions employed in addressing the MUAD uncertainty quantification challenge held at ICCV 2023. The challenge was centered around semantic segmentation in urban environments, with a particular focus on natural adversarial scenarios. The report presents the results of 19 submitted entries, with numerous techniques drawing inspiration from cutting-edge uncertainty quantification methodologies presented at prominent conferences in the fields of computer vision and machine learning and journals over the past few years. Within this document, the challenge is introduced, shedding light on its purpose and objectives, which primarily revolved around enhancing the robustness of semantic segmentation in urban scenes under varying natural adversarial conditions. The report then delves into the top-performing solutions. Moreover, the document aims to provide a comprehensive overview of the diverse solutions deployed by all participants. By doing so, it seeks to offer readers a deeper insight into the array of strategies that can be leveraged to effectively handle the inherent uncertainties associated with autonomous driving and semantic segmentation, especially within urban environments.
Submitted 27 September, 2023;
originally announced September 2023.
-
Soft labelling for semantic segmentation: Bringing coherence to label down-sampling
Authors:
Roberto Alcover-Couso,
Marcos Escudero-Vinolo,
Juan C. SanMiguel,
Jose M. Martinez
Abstract:
In semantic segmentation, training data down-sampling is commonly performed due to limited resources, the need to adapt image size to the model input, or to improve data augmentation. This down-sampling typically employs different strategies for the image data and the annotated labels. Such discrepancy leads to mismatches between the down-sampled color and label images. Hence, the training performance significantly decreases as the down-sampling factor increases. In this paper, we bring together the down-sampling strategies for the image data and the training labels. To that aim, we propose a novel framework for label down-sampling via soft-labeling that better conserves label information after down-sampling. This fully aligns the soft-labels with the image data, preserving the distribution of the sampled pixels. This proposal also produces reliable annotations for under-represented semantic classes. Altogether, it allows training competitive models at lower resolutions. Experiments show that the proposal outperforms other down-sampling strategies. Moreover, state-of-the-art performance is achieved for reference benchmarks while employing significantly fewer computational resources than foremost approaches. This proposal enables competitive research for semantic segmentation under resource constraints.
Submitted 19 February, 2024; v1 submitted 27 February, 2023;
originally announced February 2023.
-
Detection-aware multi-object tracking evaluation
Authors:
Juan C. SanMiguel,
Jorge Muñoz,
Fabio Poiesi
Abstract:
How would you fairly evaluate two multi-object tracking algorithms (i.e. trackers), each one employing a different object detector? Detectors keep improving, thus trackers can make less effort to estimate object states over time. Is it then fair to compare a new tracker employing a new detector with another tracker using an old detector? In this paper, we propose a novel performance measure, named Tracking Effort Measure (TEM), to evaluate trackers that use different detectors. TEM estimates the improvement that the tracker makes with respect to its input data (i.e. detections) at frame level (intra-frame complexity) and sequence level (inter-frame complexity). We evaluate TEM over well-known datasets, four trackers and eight detection sets. Results show that, unlike conventional tracking evaluation measures, TEM can quantify the effort made by the tracker with a reduced correlation with the input detections. Its implementation is publicly available online at https://github.com/vpulab/MOT-evaluation.
Submitted 16 December, 2022;
originally announced December 2022.
-
Attention-based Knowledge Distillation in Multi-attention Tasks: The Impact of a DCT-driven Loss
Authors:
Alejandro López-Cifuentes,
Marcos Escudero-Viñolo,
Jesús Bescós,
Juan C. SanMiguel
Abstract:
Knowledge Distillation (KD) is a strategy for the definition of a set of transferability gangways to improve the efficiency of Convolutional Neural Networks. Feature-based Knowledge Distillation is a subfield of KD that relies on intermediate network representations, either unaltered or depth-reduced via maximum activation maps, as the source knowledge. In this paper, we propose and analyse the use of a 2D frequency transform of the activation maps before transferring them. We posit that, by using global image cues rather than pixel estimates, this strategy enhances knowledge transferability in tasks such as scene recognition, defined by strong spatial and contextual relationships between multiple and varied concepts. To validate the proposed method, an extensive evaluation of the state-of-the-art in scene recognition is presented. Experimental results provide strong evidence that the proposed strategy enables the student network to better focus on the relevant image areas learnt by the teacher network, hence leading to better descriptive features and higher transferred performance than every other state-of-the-art alternative. We publicly release the training and evaluation framework used throughout this paper at http://www-vpu.eps.uam.es/publications/DCTBasedKDForSceneRecognition.
Submitted 6 June, 2022; v1 submitted 4 May, 2022;
originally announced May 2022.
-
Graph Neural Networks for Cross-Camera Data Association
Authors:
Elena Luna,
Juan C. SanMiguel,
José M. Martínez,
Pablo Carballeira
Abstract:
Cross-camera image data association is essential for many multi-camera computer vision tasks, such as multi-camera pedestrian detection, multi-camera multi-target tracking, 3D pose estimation, etc. This association task is typically stated as a bipartite graph matching problem and often solved by applying minimum-cost flow techniques, which may be computationally inefficient with large data. Furthermore, cameras are usually treated by pairs, obtaining local solutions, rather than finding a global solution at once. Another key issue is that of the affinity measurement: the widespread usage of non-learnable pre-defined distances, such as the Euclidean and Cosine ones. This paper proposes an efficient approach for cross-camera data association focused on a global solution, instead of processing cameras by pairs. To avoid the usage of fixed distances, we leverage the connectivity of Graph Neural Networks, previously unused in this scope, using a Message Passing Network to jointly learn features and similarity. We validate the proposal for pedestrian multi-view association, showing results over the EPFL multi-camera pedestrian dataset. Our approach considerably outperforms the data association techniques in the literature, without needing to be trained in the same scenario in which it is tested. Our code is available at http://www-vpu.eps.uam.es/publications/gnn_cca.
Submitted 17 January, 2022;
originally announced January 2022.
-
Improved skin lesion recognition by a Self-Supervised Curricular Deep Learning approach
Authors:
Kirill Sirotkin,
Marcos Escudero-Viñolo,
Pablo Carballeira,
Juan Carlos SanMiguel
Abstract:
State-of-the-art deep learning approaches for skin lesion recognition often require pretraining on larger and more varied datasets, to overcome the generalization limitations derived from the reduced size of the skin lesion imaging datasets. ImageNet is often used as the pretraining dataset, but its transferring potential is hindered by the domain gap between the source dataset and the target dermatoscopic scenario. In this work, we introduce a novel pretraining approach that sequentially trains a series of Self-Supervised Learning pretext tasks and only requires the unlabeled skin lesion imaging data. We present a simple methodology to establish an ordering that defines a pretext task curriculum. For the multi-class skin lesion classification problem, and ISIC-2019 dataset, we provide experimental evidence showing that: i) a model pretrained by a curriculum of pretext tasks outperforms models pretrained by individual pretext tasks, and ii) a model pretrained by the optimal pretext task curriculum outperforms a model pretrained on ImageNet. We demonstrate that this performance gain is related to the fact that the curriculum of pretext tasks better focuses the attention of the final model on the skin lesion. Beyond performance improvement, this strategy allows for a large reduction in the training time with respect to ImageNet pretraining, which is especially advantageous for network architectures tailored for a specific problem.
Submitted 22 December, 2021;
originally announced December 2021.
-
Deep Anomaly Generation: An Image Translation Approach of Synthesizing Abnormal Banded Chromosome Images
Authors:
Lukas Uzolas,
Javier Rico,
Pierrick Coupé,
Juan C. SanMiguel,
György Cserey
Abstract:
Advances in deep-learning-based pipelines have led to breakthroughs in a variety of microscopy image diagnostics. However, a sufficiently big training data set is usually difficult to obtain due to high annotation costs. In the case of banded chromosome images, the creation of big enough libraries is difficult for multiple pathologies due to the rarity of certain genetic disorders. Generative Adversarial Networks (GANs) have proven to be effective in generating synthetic images and extending training data sets. In our work, we implement a conditional adversarial network that allows generation of realistic single chromosome images following user-defined banding patterns. To this end, an image-to-image translation approach based on self-generated 2D chromosome segmentation label maps is used. Our validation shows promising results when synthesizing chromosomes with seen as well as unseen banding patterns. We believe that this approach can be exploited for data augmentation of chromosome data sets with structural abnormalities. Therefore, the proposed method could help to tackle medical image analysis problems such as data simulation, segmentation, detection, or classification in the field of cytogenetics.
Submitted 20 September, 2021;
originally announced September 2021.
-
Online Clustering-based Multi-Camera Vehicle Tracking in Scenarios with overlapping FOVs
Authors:
Elena Luna,
Juan C. SanMiguel,
Jose M. Martínez,
Marcos Escudero-Viñolo
Abstract:
Multi-Target Multi-Camera (MTMC) vehicle tracking is an essential task of visual traffic monitoring, one of the main research fields of Intelligent Transportation Systems. Several offline approaches have been proposed to address this task; however, they are not compatible with real-world applications due to their high latency and post-processing requirements. In this paper, we present a new low-latency online approach for MTMC tracking in scenarios with partially overlapping fields of view (FOVs), such as road intersections. Firstly, the proposed approach detects vehicles at each camera. Then, the detections are merged between cameras by applying cross-camera clustering based on appearance and location. Lastly, the clusters containing different detections of the same vehicle are temporally associated to compute the tracks on a frame-by-frame basis. The experiments show promising low-latency results while addressing real-world challenges such as the a priori unknown and time-varying number of targets and the continuous state estimation of them without performing any post-processing of the trajectories.
Submitted 8 February, 2021;
originally announced February 2021.
-
On guiding video object segmentation
Authors:
Diego Ortego,
Kevin McGuinness,
Juan C. SanMiguel,
Eric Arazo,
José M. Martínez,
Noel E. O'Connor
Abstract:
This paper presents a novel approach for segmenting moving objects in unconstrained environments using guided convolutional neural networks. This guiding process relies on foreground masks from independent algorithms (i.e. state-of-the-art algorithms) to implement an attention mechanism that incorporates the spatial location of foreground and background to compute their separated representations. Our approach initially extracts two kinds of features for each frame using colour and optical flow information. Such features are combined following a multiplicative scheme to benefit from their complementarity. These unified colour and motion features are later processed to obtain the separated foreground and background representations. Then, both independent representations are concatenated and decoded to perform foreground segmentation. Experiments conducted on the challenging DAVIS 2016 dataset demonstrate that our guided representations not only outperform non-guided, but also recent and top-performing video object segmentation algorithms.
Submitted 25 April, 2019;
originally announced April 2019.