-
A Simple Framework for Open-Vocabulary Zero-Shot Segmentation
Authors:
Thomas Stegmüller,
Tim Lebailly,
Nikola Dukic,
Behzad Bozorgtabar,
Tinne Tuytelaars,
Jean-Philippe Thiran
Abstract:
Zero-shot classification capabilities naturally arise in models trained within a vision-language contrastive framework. Despite their classification prowess, these models struggle in dense tasks like zero-shot open-vocabulary segmentation. This deficiency is often attributed to the absence of localization cues in captions and the intertwined nature of the learning process, which encompasses both image representation learning and cross-modality alignment. To tackle these issues, we propose SimZSS, a Simple framework for open-vocabulary Zero-Shot Segmentation. The method is founded on two key principles: i) leveraging frozen vision-only models that exhibit spatial awareness while exclusively aligning the text encoder and ii) exploiting the discrete nature of text and linguistic knowledge to pinpoint local concepts within captions. By capitalizing on the quality of the visual representations, our method requires only image-caption pairs datasets and adapts to both small curated and large-scale noisy datasets. When trained on COCO Captions across 8 GPUs, SimZSS achieves state-of-the-art results on 7 out of 8 benchmark datasets in less than 15 minutes.
Submitted 1 July, 2024; v1 submitted 23 June, 2024;
originally announced June 2024.
-
CrIBo: Self-Supervised Learning via Cross-Image Object-Level Bootstrapping
Authors:
Tim Lebailly,
Thomas Stegmüller,
Behzad Bozorgtabar,
Jean-Philippe Thiran,
Tinne Tuytelaars
Abstract:
Leveraging nearest neighbor retrieval for self-supervised representation learning has proven beneficial with object-centric images. However, this approach faces limitations when applied to scene-centric datasets, where multiple objects within an image are only implicitly captured in the global representation. Such global bootstrapping can lead to undesirable entanglement of object representations. Furthermore, even object-centric datasets stand to benefit from a finer-grained bootstrapping approach. In response to these challenges, we introduce a novel Cross-Image Object-Level Bootstrapping method tailored to enhance dense visual representation learning. By employing object-level nearest neighbor bootstrapping throughout the training, CrIBo emerges as a notably strong and adequate candidate for in-context learning, leveraging nearest neighbor retrieval at test time. CrIBo shows state-of-the-art performance on the latter task while being highly competitive in more standard downstream segmentation tasks. Our code and pretrained models are publicly available at https://github.com/tileb1/CrIBo.
Submitted 3 March, 2024; v1 submitted 11 October, 2023;
originally announced October 2023.
-
Adaptive Similarity Bootstrapping for Self-Distillation based Representation Learning
Authors:
Tim Lebailly,
Thomas Stegmüller,
Behzad Bozorgtabar,
Jean-Philippe Thiran,
Tinne Tuytelaars
Abstract:
Most self-supervised methods for representation learning leverage a cross-view consistency objective i.e., they maximize the representation similarity of a given image's augmented views. Recent work NNCLR goes beyond the cross-view paradigm and uses positive pairs from different images obtained via nearest neighbor bootstrapping in a contrastive setting. We empirically show that as opposed to the contrastive learning setting which relies on negative samples, incorporating nearest neighbor bootstrapping in a self-distillation scheme can lead to a performance drop or even collapse. We scrutinize the reason for this unexpected behavior and provide a solution. We propose to adaptively bootstrap neighbors based on the estimated quality of the latent space. We report consistent improvements compared to the naive bootstrapping approach and the original baselines. Our approach leads to performance improvements for various self-distillation method/backbone combinations and standard downstream tasks. Our code is publicly available at https://github.com/tileb1/AdaSim.
Submitted 7 September, 2023; v1 submitted 23 March, 2023;
originally announced March 2023.
-
CrOC: Cross-View Online Clustering for Dense Visual Representation Learning
Authors:
Thomas Stegmüller,
Tim Lebailly,
Behzad Bozorgtabar,
Tinne Tuytelaars,
Jean-Philippe Thiran
Abstract:
Learning dense visual representations without labels is an arduous task and more so from scene-centric data. We tackle this challenging problem with a Cross-view consistency objective coupled with an Online Clustering mechanism (CrOC) to discover and segment the semantics of the views. In the absence of hand-crafted priors, the resulting method is more generalizable and does not require a cumbersome pre-processing step. More importantly, the clustering algorithm conjointly operates on the features of both views, thereby elegantly bypassing the issue of content not represented in both views and the ambiguous matching of objects from one crop to the other. We demonstrate excellent performance on linear and unsupervised segmentation transfer tasks on various datasets and similarly for video object segmentation. Our code and pre-trained models are publicly available at https://github.com/stegmuel/CrOC.
Submitted 23 March, 2023;
originally announced March 2023.
-
Self-supervised learning-based cervical cytology for the triage of HPV-positive women in resource-limited settings and low-data regime
Authors:
Thomas Stegmüller,
Christian Abbet,
Behzad Bozorgtabar,
Holly Clarke,
Patrick Petignat,
Pierre Vassilakos,
Jean-Philippe Thiran
Abstract:
Screening Papanicolaou test samples has proven to be highly effective in reducing cervical cancer-related mortality. However, the lack of trained cytopathologists hinders its widespread implementation in low-resource settings. Deep learning-based telecytology diagnosis emerges as an appealing alternative, but it requires the collection of large annotated training datasets, which is costly and time-consuming. In this paper, we demonstrate that the abundance of unlabeled images that can be extracted from Pap smear test whole slide images presents a fertile ground for self-supervised learning methods, yielding performance improvements relative to readily available pre-trained models for various downstream tasks. In particular, we propose Cervical Cell Copy-Pasting (C³P) as an effective augmentation method, which enables knowledge transfer from open-source and labeled single-cell datasets to unlabeled tiles. Not only does C³P outperform naive transfer from single-cell images, but we also demonstrate its advantageous integration into multiple instance learning methods. Importantly, all our experiments are conducted on our introduced in-house dataset comprising liquid-based cytology Pap smear images obtained using low-cost technologies. This aligns with our objective of leveraging deep learning-based telecytology for diagnosis in low-resource settings.
Submitted 7 June, 2023; v1 submitted 10 February, 2023;
originally announced February 2023.
-
ScoreNet: Learning Non-Uniform Attention and Augmentation for Transformer-Based Histopathological Image Classification
Authors:
Thomas Stegmüller,
Behzad Bozorgtabar,
Antoine Spahr,
Jean-Philippe Thiran
Abstract:
Progress in digital pathology is hindered by high-resolution images and the prohibitive cost of exhaustive localized annotations. The commonly used paradigm to categorize pathology images is patch-based processing, which often incorporates multiple instance learning (MIL) to aggregate local patch-level representations yielding image-level prediction. Nonetheless, diagnostically relevant regions may only take a small fraction of the whole tissue, and current MIL-based approaches often process images uniformly, discarding the inter-patches interactions. To alleviate these issues, we propose ScoreNet, a new efficient transformer that exploits a differentiable recommendation stage to extract discriminative image regions and dedicate computational resources accordingly. The proposed transformer leverages the local and global attention of a few dynamically recommended high-resolution regions at an efficient computational cost. We further introduce a novel mixing data-augmentation, namely ScoreMix, by leveraging the image's semantic distribution to guide the data mixing and produce coherent sample-label pairs. ScoreMix is embarrassingly simple and mitigates the pitfalls of previous augmentations, which assume a uniform semantic distribution and risk mislabeling the samples. Thorough experiments and ablation studies on three breast cancer histology datasets of Haematoxylin & Eosin (H&E) have validated the superiority of our approach over prior arts, including transformer-based models on tumour regions-of-interest (TRoIs) classification. ScoreNet equipped with proposed ScoreMix augmentation demonstrates better generalization capabilities and achieves new state-of-the-art (SOTA) results with only 50% of the data compared to other mixing augmentation variants. Finally, ScoreNet yields high efficacy and outperforms SOTA efficient transformers, namely TransPath and SwinTransformer.
Submitted 18 July, 2022; v1 submitted 15 February, 2022;
originally announced February 2022.