A Simple Framework for Open-Vocabulary Zero-Shot Segmentation

Thomas Stegmüller1           Tim Lebailly2*           Nikola Dukic2

Behzad Bozorgtabar1,3           Tinne Tuytelaars2           Jean-Philippe Thiran1,3

1EPFL           2KU Leuven           3CHUV

1{firstname}.{lastname}@epfl.ch 2{firstname}.{lastname}@esat.kuleuven.be
* denotes equal contribution.
Abstract

Zero-shot classification capabilities naturally arise in models trained within a vision-language contrastive framework. Despite their classification prowess, these models struggle in dense tasks like zero-shot open-vocabulary segmentation. This deficiency is often attributed to the absence of localization cues in captions and the intertwined nature of the learning process, which encompasses both image representation learning and cross-modality alignment. To tackle these issues, we propose SimZSS, a Simple framework for open-vocabulary Zero-Shot Segmentation. The method is founded on two key principles: i) leveraging frozen vision-only models that exhibit spatial awareness while exclusively aligning the text encoder and ii) exploiting the discrete nature of text and linguistic knowledge to pinpoint local concepts within captions. By capitalizing on the quality of the visual representations, our method requires only image-caption pair datasets and adapts to both small curated and large-scale noisy datasets. When trained on COCO Captions across 8 GPUs, SimZSS achieves state-of-the-art results on 7 out of 8 benchmark datasets in less than 15 minutes.

Figure 1: Visualization of the patch-level visual representations and text concepts in the RGB color space. PCA is used to map the dense representation of a single image into a three-dimensional space. The three-dimensional representations (color) of the text concepts from the concept bank and the corresponding caption are obtained and shown for each image. Each row includes the original image and dense feature visualization at different resolutions. These include the training resolution ($16\times 16$) and higher resolutions ($2\times$, $4\times$, and $8\times$).

1 Introduction

Semantic segmentation stands as a cornerstone task within the realm of computer vision, playing a pivotal role in various applications ranging from autonomous driving to medical image analysis. However, its widespread adoption and scalability are hindered by the inherent label-intensive nature of traditional methods, demanding copious amounts of fine-grained annotations for training. Furthermore, traditional approaches (e.g., Strudel et al., 2021; Cheng et al., 2022) are typically closed-vocabulary, meaning they only work for a pre-defined set of categories and generalize poorly to unseen classes. Self-supervised learning paradigms (Caron et al., 2021; Zhou et al., 2022b; He et al., 2020; Balazevic et al., 2024; Lebailly et al., 2023), which use pretext tasks to learn discriminative representations from the data itself, offer a promising solution to alleviate the annotation burden. Representations obtained in this manner are typically clustered semantically, potentially even at a fine-grained level (Balažević et al., 2023; Lebailly et al., 2023; Stegmüller et al., 2023) and, as such, yield excellent performance in various applications. Nonetheless, a degree of labeling is still necessary, whether for finetuning or constructing the support set of a $k$-NN classifier, for instance.

Recently, vision-language contrastive learning has proven to be a simple and effective approach for transforming web-scale amounts of noisy image-caption pairs into zero-shot classification capabilities (Radford et al., 2021; Jia et al., 2021). Nonetheless, the substantial computational and data requirements of these methods pose significant challenges. To address these limitations, LiT (Zhai et al., 2022) proposed aligning a text encoder with a pretrained and frozen pure vision model. By decoupling the image representation learning process from the vision-language alignment process, this methodology enhances both computational and data efficiency while also improving performance in zero-shot classification tasks. Orthogonal to this, various studies (Hamilton et al., 2022; Siméoni et al., 2021; Wang et al., 2023; Zhang et al., 2023; Siméoni et al., 2023) have observed and capitalized on the spatial awareness of self-supervised pretrained vision transformers (ViTs). These works have demonstrated remarkable capabilities for dense downstream tasks with minimal or even zero trainable parameters (Balazevic et al., 2024). This evidence suggests that self-supervised vision transformers may just be one text encoder away from becoming open-vocabulary segmenters. We investigate this hypothesis and propose a simple framework for open-vocabulary zero-shot segmentation. In essence, we leverage linguistic knowledge to identify concepts in captions and build upon the quality of a frozen vision tower to retrieve corresponding concepts in images.

Our main contributions are as follows:

  • (i) SimZSS is designed to be simple, with minimal hyperparameters, making it highly compatible with both small curated datasets and large-scale noisy datasets.

  • (ii) The proposed framework is robust, supporting various pretrained vision towers and accommodating both supervised and self-supervised pretraining of the vision backbone.

  • (iii) By decoupling visual representation learning from cross-modality concept-level alignment, our proposed framework, SimZSS, achieves high computational and data efficiency.

2 Related works

Open-vocabulary learning.

Traditionally, computer vision methods operate under the closed-vocabulary hypothesis. This assumption presumes that all object categories a model is expected to classify, detect, or segment at test time are already known and labeled during training. This presents significant challenges due to the extensive labeling required and the limited generalizability of the resulting models. Open-vocabulary learning aims to eliminate these limitations. Notably, this is a more challenging objective than the one targeted by self-supervised visual representation learning (Chen & He, 2020; Caron et al., 2021; Li et al., 2021; Grill et al., 2020; Oquab et al., 2023), which assumes the availability of labels at test time. To address this additional constraint, a text encoder is typically trained jointly with the vision tower to maximize vision-language alignment on large amounts of image-caption pairs via contrastive learning (Radford et al., 2021; Jia et al., 2021). Subsequent studies have either focused on improving the cross-modality alignment pretext task (Alayrac et al., 2022; Yu et al., 2022; Tschannen et al., 2024) or on improving the computational efficiency (Zhai et al., 2022; Li et al., 2023). However, while these models excel at classification, their performance on dense downstream tasks is usually subpar.

Open-vocabulary segmentation.

The shortcomings of open-vocabulary methods on dense downstream tasks have been the focus of significant research efforts. A recurring approach in the literature is to use dense annotations for learning segmentation and to leverage a pretrained text encoder to provide the weights of a generalizable classifier (Li et al., 2022; Ghiasi et al., 2022; Liang et al., 2023; Ding et al., 2022; Zhou et al., 2022a; Xu et al., 2023b; Ma et al., 2022; Yu et al., 2024). While these methods demonstrate excellent zero-shot segmentation performance, they only partially alleviate the pixel-level annotation requirements.

To overcome this limitation, approaches that do not require segmentation masks have been proposed. PACL (Mukhoti et al., 2023) demonstrates the effectiveness of training only the projections at the interface of the two modalities and bootstrapping patch-level cross-modal similarities. Similarly, TCL (Cha et al., 2023) leverages existing fine-grained similarities to jointly learn text-grounded masks and impose contrastive region-text alignment based on the obtained masks. Alternatively, GroupViT (Xu et al., 2022) proposes a specialized architecture that performs hierarchical grouping/pooling of visual tokens based on their similarity with learnable query tokens. Another line of work showed the benefits of integrating a consistency objective between augmented and/or masked views of the input image (Dong et al., 2023; Ren et al., 2023) in the context of vision-language contrastive learning. Related to that, ReCo (Shin et al., 2022) reported improved localization capabilities by exploiting the co-occurrence of objects across multiple images. Finally, Wysoczańska et al. (2024; 2024) and Rewatbowornwong et al. (2023) combine the vision-language alignment capabilities of CLIP (Radford et al., 2021) with the spatial awareness of self-supervised vision transformers, e.g., DINO (Caron et al., 2021), to develop open-vocabulary zero-shot segmenters.

Figure 2: Overview of SimZSS. On the text side (a.), each concept in the caption is represented using a trainable text encoder. On the vision side (c.), visual representations of each concept are obtained via a similarity-based pooling of the visual tokens. These visual concept representations are then projected onto a linear classifier, with weights derived from the text concept representations of the current batch. Cross-modality consistency is enforced using cross-entropy loss (b.).

3 Method

3.1 Preliminaries

We first provide a brief primer on the terminology used throughout the paper.

Dense representation.

Transformers encode input signals as a sequence of tokens, and we refer to the corresponding output sequence as the dense representation. For vision transformers, this corresponds to a tensor $\bm{z}_v \in \mathbb{R}^{n_v \times d_v}$, where $d_v$ is the dimension of the representation space and $n_v$ is the number of patches in the input image. For text input, the dense representation $\bm{z}_t \in \mathbb{R}^{n_t \times d_t}$ is a sequence of $d_t$-dimensional tokens. The number of tokens, $n_t$, is determined by the byte-pair encoding algorithm (Gage, 1994).

Local representation.

We refer to the result of indexing a dense representation with a sequence index $i$ as a local representation. This translates to $\bm{z}_t^i \in \mathbb{R}^{d_t}$ for textual input signals and $\bm{z}_v^i \in \mathbb{R}^{d_v}$ for visual input signals.

Concept representation.

Aggregating local representations corresponding to the $l^{th}$ semantic concept in the input signal yields a concept representation. For visual input, this is denoted as $\mathbf{c}_v^l \in \mathbb{R}^{d_v}$, whereas for textual input, we use $\mathbf{c}_t^l \in \mathbb{R}^{d_t}$.

Global representation.

A representation of the entire input signal is referred to as the global representation. Specialized tokens are typically used for this purpose. In vision transformers, the global representation $\bar{\bm{z}}_v \in \mathbb{R}^{d_v}$ is the [CLS] token, while for a text encoder $\bar{\bm{z}}_t \in \mathbb{R}^{d_t}$ denotes the [EOS] token.

3.2 Vision & language concept identification

At a conceptual level, enforcing cross-modality consistency faces a significant challenge due to the complexity of identifying and matching concepts across modalities. In a nutshell, our proposed solution leverages the discrete nature of textual data, enabling the straightforward identification of concepts within captions. Subsequently, we retrieve associated visual concepts, conditioned on those identified within the text, effectively circumventing the complexity of cross-modality concept matching.

3.2.1 Textual concept identification

Given an image-caption pair $(\bm{x}_v, \bm{x}_t)$, we utilize the text modality to identify semantic concepts. Language adheres to structured grammatical rules, allowing us to leverage linguistic knowledge to pinpoint the main subjects and objects within a sentence. Since captions are intended to describe images, the primary subjects and objects in the captions are likely to correspond to visual elements in the images. To achieve this, we employ a part-of-speech (POS) tagger, which automatically classifies words according to their grammatical roles. In our context, noun phrases (NPs) are of particular interest, as they generally encapsulate the main subjects and objects in the sentences, providing a direct link to the visual concepts depicted in the images. However, captions can be quite noisy, so further filtering is necessary to discard NPs that are unlikely to appear in the images. To address this, we refine a noun phrase into a concept by restricting it to the noun and its first compound. We then discard any concept that is absent in a predefined bank of concepts.
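The filtering step can be illustrated with a small sketch. It assumes the caption has already been tagged by some POS tagger and is given as `(word, tag)` pairs with a `"NOUN"` tag; the function name `extract_concepts`, the tag set, and the reading of "the noun and its first compound" as "head noun plus the noun immediately preceding it" are all assumptions made for illustration, not details from the paper. Note also that the indices returned here are word positions, whereas the method tracks indices of byte-pair tokens.

```python
def extract_concepts(tagged_tokens, concept_bank):
    """Return (concept, word_indices) pairs found in a pre-tagged caption.

    tagged_tokens: list of (word, tag) pairs; only the hypothetical tag
                   "NOUN" is inspected here.
    concept_bank:  set of lowercase concept strings.
    """
    concepts = []
    i, n = 0, len(tagged_tokens)
    while i < n:
        if tagged_tokens[i][1] != "NOUN":
            i += 1
            continue
        # Collect a run of consecutive nouns, e.g. "tennis ball".
        j = i
        while j < n and tagged_tokens[j][1] == "NOUN":
            j += 1
        head = j - 1  # assumption: the last noun of the run is the head
        # Prefer "compound + head", then fall back to the head alone.
        candidates = [(head - 1, head)] if head > i else []
        candidates.append((head, head))
        for start, end in candidates:
            phrase = " ".join(w.lower() for w, _ in tagged_tokens[start:end + 1])
            if phrase in concept_bank:  # discard anything outside the bank
                concepts.append((phrase, list(range(start, end + 1))))
                break
        i = j
    return concepts
```

In this sketch a noun that is missing from the bank is silently dropped, which mirrors the discarding step described above.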

Once the concepts in the caption have been identified, it remains to obtain their equivalent in the representation space. This is accomplished by tracking the token indices spanned by each concept. We denote by $\mathcal{S}_l$ the set of indices associated with the $l^{th}$ concept. The concept representation $\mathbf{c}_t^l \in \mathbb{R}^{d_t}$ is then derived by averaging the local representations at the corresponding indices:

$$\mathbf{c}_t^l = \frac{1}{|\mathcal{S}_l|} \sum_{i \in \mathcal{S}_l} \bm{z}_t^i \tag{1}$$

The $l^{th}$ textual concept $\mathbf{c}_t^l \in \mathbb{R}^{d_t}$ is then mapped to the visual space via a linear projection $g: \mathbb{R}^{d_t} \rightarrow \mathbb{R}^{d_v}$:

$$\tilde{\mathbf{c}}_t^l = g\left(\mathbf{c}_t^l\right) \tag{2}$$

In the remainder of this paper, we also refer to $\tilde{\mathbf{c}}_t^l$ as a text concept.
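As a minimal sketch of Eqs. (1) and (2), the following NumPy snippet averages the token representations spanned by one concept and projects the result. The function name `text_concept` and the bias-free weight matrix `W` standing in for $g$ are assumptions for illustration.

```python
import numpy as np

def text_concept(z_t, span, W):
    """Average the tokens of one concept and project them to the visual space.

    z_t:  (n_t, d_t) dense text representation of a caption.
    span: token indices S_l covered by the l-th concept.
    W:    (d_t, d_v) weights of the linear projection g (bias omitted here).
    """
    c_t = z_t[list(span)].mean(axis=0)  # Eq. (1): mean over the indices in S_l
    return c_t @ W                      # Eq. (2): map to the visual space
```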

3.2.2 Visual concept identification

Upon identifying the $l^{th}$ text concept $\tilde{\mathbf{c}}_t^l \in \mathbb{R}^{d_v}$, we use it to query the dense visual representation and obtain the corresponding visual concept $\mathbf{c}_v^l \in \mathbb{R}^{d_v}$. To do this, we first embed the image $\bm{x}_v$ using a vision encoder $f_v$, which outputs the dense visual representation $\bm{z}_v \in \mathbb{R}^{n_v \times d_v}$ and a global visual representation $\bar{\bm{z}}_v \in \mathbb{R}^{d_v}$. We then compute the similarities between each local visual representation and the query text concept:

$$\mathbf{s} = \texttt{softmax}\left(\frac{\bm{z}_v \tilde{\mathbf{c}}_t^l}{\tau}\right) \tag{3}$$

where $\tau$ is a temperature parameter that regulates the sharpness of the similarity (see Tab. 7). Consequently, the $l^{th}$ visual concept is obtained by bootstrapping existing cross-modality similarities, i.e., via similarity-based pooling (see Fig. 2):

$$\mathbf{c}_v = \bm{z}_v^\top \mathbf{s} \tag{4}$$

This process is performed in parallel for all concepts from the $b$ captions in the batch, resulting in $\tilde{b}$ cross-modality concept pairs for which consistency can be encouraged.
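The similarity-based pooling of Eqs. (3) and (4) can be sketched for a single concept as follows. The function name `visual_concept` and the default value of `tau` are placeholders, and no feature normalization is applied because the equations do not mention one.

```python
import numpy as np

def visual_concept(z_v, c_t_proj, tau=0.1):
    """Pool patch representations, weighted by similarity to a text concept.

    z_v:      (n_v, d_v) dense visual representation of one image.
    c_t_proj: (d_v,) text concept already projected to the visual space.
    tau:      temperature (placeholder value, not taken from the paper).
    """
    logits = z_v @ c_t_proj / tau
    logits = logits - logits.max()   # shift for numerical stability
    s = np.exp(logits)
    s = s / s.sum()                  # Eq. (3): softmax over the n_v patches
    return z_v.T @ s, s              # Eq. (4): similarity-weighted sum
```

A small `tau` makes the pooling concentrate on the most similar patches, while a large `tau` approaches plain average pooling over all patches.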

3.3 Cross-modality consistency

In Section 3.2, we outline a methodology for identifying pairs of local concepts across textual and visual modalities. This method leverages established similarities and thus may benefit from additional supervision at the beginning of training, as discussed in Section 3.3.1.

3.3.1 Global consistency

We now discuss the global consistency objective, which ensures similarities between images and captions. To this end, let $\bar{\bm{Z}}_v \in \mathbb{R}^{b \times d_v}$ be the matrix containing all global visual representations $\bar{\bm{z}}_v$ within a batch, and let $\bar{\bm{Z}}_t \in \mathbb{R}^{b \times d_t}$ be its equivalent in the text modality. After projecting the global text representations to the visual space, we can compute the global cross-modality similarity:

$$\bar{\mathbf{s}} = g\left(\bar{\bm{Z}}_t\right) \bar{\bm{Z}}_v^\top \tag{5}$$

The learning objective is to maximize the similarity of paired entries and minimize that of unpaired ones:

$$\mathcal{L}_g = -\frac{1}{2b} \sum_i \log\left(\frac{\exp\left(\bar{\mathbf{s}}_{ii}\right)}{\sum_j \exp\left(\bar{\mathbf{s}}_{ij}\right)}\right) - \frac{1}{2b} \sum_j \log\left(\frac{\exp\left(\bar{\mathbf{s}}_{jj}\right)}{\sum_i \exp\left(\bar{\mathbf{s}}_{ij}\right)}\right) \tag{6}$$

Overall, the global consistency objective is identical to the one used in CLIP (Radford et al., 2021).
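For concreteness, Eqs. (5) and (6) can be written as a symmetric cross-entropy over the similarity matrix. The sketch below follows the equations as printed, so it includes neither feature normalization nor a learnable temperature; the function names are assumptions.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def global_loss(Z_t_proj, Z_v):
    """Symmetric contrastive loss between global text and image representations.

    Z_t_proj: (b, d_v) global text representations after the projection g.
    Z_v:      (b, d_v) global visual representations.
    """
    s = Z_t_proj @ Z_v.T                       # Eq. (5): (b, b) similarities
    b = s.shape[0]
    text_to_image = np.trace(log_softmax(s, axis=1))  # normalize over j
    image_to_text = np.trace(log_softmax(s, axis=0))  # normalize over i
    return -(text_to_image + image_to_text) / (2 * b)  # Eq. (6)
```

With uninformative similarities the loss equals $\log b$, and it approaches zero as the diagonal entries dominate their rows and columns.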

3.3.2 Concept-level consistency

After obtaining pairs of vision-language concepts, an intuitive approach to enforce consistency at the concept level is to use a contrastive objective, akin to the one described in Section 3.3.1, but applied between concepts. Empirically, we found that this approach did not yield the desired performance improvements on dense downstream tasks. At a global level, images and captions represent complex scenes, supporting the hypothesis that only $b$ positive image-caption pairs exist among the $b^2$ possible ones in a batch. Conversely, concepts typically encode individual objects that are likely to occur multiple times within a batch. This suggests that an instance-level objective is ill-suited for our setting. Therefore, we opt for a semantic-level objective.

Let $\mathbf{C}_t \in \mathbb{R}^{\tilde{b} \times d_t}$ denote the set of all $\tilde{b}$ text concepts in a batch and $\mathbf{C}_v \in \mathbb{R}^{\tilde{b} \times d_v}$ its counterpart in the visual space. Thanks to the discrete nature of text, it is straightforward to keep track of concepts occurring in a batch, with each unique concept assigned a specific index, conveniently stored in $\mathbf{q} \in \{0, 1, \ldots, k-1\}^{\tilde{b}}$ (where $k$ denotes the number of unique concepts in a batch). The weights $\mathbf{h} \in \mathbb{R}^{k \times d_v}$ of a linear classifier can be computed by summing the representations of identical concepts:

$$\mathbf{h}_i = \sum_j \mathds{1}_{\{\mathbf{q}_j = i\}} \, g\left(\mathbf{C}_t\right)_j \tag{7}$$

After $\ell_2$-normalizing the columns of both $\mathbf{h}$ and $\mathbf{C}_v$, we project one onto the other to derive a probability distribution over the visual concepts:

$$\mathbf{p} = \underset{k}{\texttt{softmax}}\left(\mathbf{C}_v \mathbf{h}^\top\right) \tag{8}$$

It follows that the cross-entropy loss can be used to ensure consistency between the query text concept and the retrieved visual concept:

$$\mathcal{L}_l = \frac{1}{\tilde{b}} \sum_i \sum_j -\mathds{1}_{\{\mathbf{q}_i = j\}} \log\left(\mathbf{p}_{ij}\right) \tag{9}$$

The overall objective of SimZSS, denoted as $\mathcal{L}_{\text{tot}}$, is a weighted sum of the global and local consistency objectives:

$$\mathcal{L}_{\text{tot}} = \mathcal{L}_g + \lambda \mathcal{L}_l \tag{10}$$

where $\lambda$ is a weighting parameter whose impact is ablated in Table 7.
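Eqs. (7) to (10) can be sketched in a few lines of NumPy. Two choices here are assumptions rather than details from the paper: the normalization is applied to each concept vector (each row of $\mathbf{h}$ and $\mathbf{C}_v$) so that the logits are cosine similarities, and every index in `range(k)` is assumed to occur in `q` so that no classifier row is zero. The function names and the default `lam` are placeholders.

```python
import numpy as np

def concept_loss(C_t_proj, C_v, q, k):
    """Cross-entropy between visual concepts and a text-derived classifier.

    C_t_proj: (b~, d_v) text concepts after the projection g.
    C_v:      (b~, d_v) visual concepts retrieved by similarity-based pooling.
    q:        (b~,) integer index of the unique concept behind each pair.
    k:        number of unique concepts in the batch.
    """
    b_tilde, d_v = C_t_proj.shape
    h = np.zeros((k, d_v))
    np.add.at(h, q, C_t_proj)                          # Eq. (7): sum per concept
    h = h / np.linalg.norm(h, axis=1, keepdims=True)   # assumed per-vector norm
    C_v = C_v / np.linalg.norm(C_v, axis=1, keepdims=True)
    logits = C_v @ h.T                                 # (b~, k)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))  # Eq. (8)
    return -log_p[np.arange(b_tilde), q].mean()        # Eq. (9)

def total_loss(loss_global, loss_local, lam=1.0):
    """Eq. (10); the default lam is a placeholder value."""
    return loss_global + lam * loss_local
```

Because identical concepts share one classifier row, two occurrences of the same concept in a batch are never treated as negatives of each other, which is the motivation given above for preferring a semantic-level objective.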

4 Experiments

In this section, we investigate the properties of SimZSS through various experiments. Details of the experimental setup are provided in Appendix A, and additional experiments can be found in Appendix B.

4.1 Zero-shot segmentation of foreground

Table 1: Zero-shot foreground segmentation. Pixel-wise predictions are obtained by projecting patch representations onto pre-computed text embeddings of the class names, followed by up-sampling. The mIoU scores are reported across five standard segmentation datasets. $\dagger$ refers to our reproduction using DINOv2 pretrained vision backbones. The remaining results are as reported in Wysoczańska et al. (2024).

| Method | Params (frozen) | Params (trainable) | Pascal VOC | Pascal Context | COCO-Stuff | Cityscapes | ADE20K |
|---|---|---|---|---|---|---|---|
| ReCo (Shin et al., 2022) | 313M | 0 | 57.7 | 22.3 | 14.8 | 21.1 | 11.2 |
| GroupViT (Xu et al., 2022) | 0 | 55M | 79.7 | 23.4 | 15.3 | 11.1 | 9.2 |
| TCL (Cha et al., 2023) | 156M | 21M | 77.5 | 30.3 | 19.6 | 23.1 | 14.9 |
| MaskCLIP (Dong et al., 2023) | 291M | 0 | 74.9 | 26.4 | 16.4 | 12.6 | 9.8 |
| OVDiff (Karazija et al., 2023) | 1,226M | 0 | 81.7 | 33.7 | - | - | 14.9 |
| LiT (Zhai et al., 2022) (ViT-B + LAION-400M) | 94M | 63M | 80.5 | 31.8 | 23.3 | 24.7 | 18.7 |
| LiT (Zhai et al., 2022) (ViT-B + COCO Captions) | 94M | 63M | 86.1 | 35.5 | 25.6 | 25.8 | 18.1 |
| CLIP-DNOiser (Wysoczańska et al., 2024) | - | - | 80.9 | 35.9 | 24.6 | 31.7 | 20.0 |
| Ours | | | | | | | |
| SimZSS (ViT-B + LAION-400M) | 94M | 63M | 85.1 | 34.2 | 24.9 | 27.8 | 19.6 |
| SimZSS (ViT-S + COCO Captions) | 21M | 40M | 87.2 | 37.3 | 23.8 | 29.2 | 17.9 |
| SimZSS (ViT-B + COCO Captions) | 94M | 63M | 90.3 | 43.1 | 29.0 | 33.0 | 21.8 |

We validate the proposed method on a pixel-level zero-shot segmentation task. In this scenario, the model relies solely on textual class descriptions to classify image pixels. However, accurately representing the background class can be challenging due to dataset-specific properties not fully captured by text descriptions. As such, we follow previous works (Wysoczańska et al., 2024; Cha et al., 2023) and first evaluate without the background class.

Figure 3: Vision-language alignment of text concepts and dense visual representations. Concepts present in the image are embedded independently by the text encoder and then projected onto the representations of each patch within the image. The images are processed at a resolution of $896\times 896$ pixels, corresponding to $4\times$ the training resolution. The alignment is performed on LAION-400M using a ViT-B/14 as the vision tower.

We follow the MMSegmentation (Contributors, 2020) implementation of Cha et al. (2023). Specifically, images are resized to have a shorter side of 448 pixels, and inference is performed with a sliding window using a stride of 224 pixels. ImageNet templates from Radford et al. (2021) are used to contextualize each class name before obtaining its textual representation. Finally, the predictions obtained by projecting patch representations onto class names are up-sampled to the original image size using bilinear interpolation. We report the mIoU scores across five standard datasets, namely Pascal VOC (Everingham et al., 2012), Pascal Context (Mottaghi et al., 2014), COCO-Stuff (Caesar et al., 2018), Cityscapes (Cordts et al., 2016) and ADE20K (Zhou et al., 2017). For datasets that include a background class, the corresponding annotations and pixels are simply ignored. For the sake of clarity, we report the results without any post-processing techniques such as Pixel-Adaptive Mask Refinement (PAMR) (Araslanov & Roth, 2020).
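
The projection step described above can be illustrated with a minimal sketch. It assumes that "projecting" means taking the cosine similarity between each patch representation and each class embedding and then picking the best-scoring class; the function names and the toy two-dimensional features are hypothetical, and the sliding window and bilinear up-sampling are omitted for brevity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def classify_patches(patch_feats, class_embs):
    """Assign each patch to the class whose text embedding is most similar.

    patch_feats: H x W grid of D-dimensional patch representations.
    class_embs:  list of C text embeddings of dimension D.
    Returns an H x W grid of class indices (assumption: cosine similarity).
    """
    class_embs = [l2_normalize(c) for c in class_embs]
    label_grid = []
    for row in patch_feats:
        label_row = []
        for feat in row:
            feat = l2_normalize(feat)
            scores = [sum(a * b for a, b in zip(feat, c)) for c in class_embs]
            label_row.append(max(range(len(scores)), key=scores.__getitem__))
        label_grid.append(label_row)
    return label_grid


# Toy example: a 2 x 2 grid of patches and two class embeddings.
patches = [[[1.0, 0.1], [0.2, 0.9]],
           [[0.9, 0.0], [0.0, 2.0]]]
classes = [[1.0, 0.0], [0.0, 1.0]]
label_map = classify_patches(patches, classes)
```

In a full pipeline, the per-class score maps rather than the hard labels would typically be up-sampled before taking the argmax; this ordering is also an assumption of the sketch.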

In Table 1, we observe that our method using a ViT-S/14 as the vision tower is already competitive and that shifting to a larger vision backbone yields state-of-the-art results on all datasets when trained on COCO Captions. In particular, it can be seen that the proposed pipeline significantly outperforms models trained under the same conditions with LiT (Zhai et al., 2022). This suggests that the reported performance is not solely attributable to the freezing of the vision tower or DINOv2 (Oquab et al., 2023) visual pretraining. The results obtained with LAION-400M indicate that SimZSS is compatible with large-scale uncurated datasets, exhibiting overall strong performance and improvements compared to models trained with LiT in the same setting. However, it is also apparent that both LiT and SimZSS seem to benefit more from curation than scale in the segmentation tasks. This trend contrasts with findings in classification tasks, as shown in Table 3.

4.2 Zero-shot segmentation

Table 2: Zero-shot segmentation. Pixel-wise predictions of the foreground classes are obtained by projecting patch representations onto pre-computed text embeddings of the class names, followed by up-sampling. Predictions that fall below a predetermined threshold are assigned to the background class. The mIoU scores are reported across three standard segmentation datasets. † refers to our reproduction using DINOv2 pretrained vision backbones. The remaining results are as reported in Wysoczańska et al. (2024).
| Method | Params (frozen) | Params (trainable) | Pascal Context | COCO-Object | Pascal VOC |
| --- | --- | --- | --- | --- | --- |
| ReCo (Shin et al., 2022) | 313M | 0 | 19.9 | 15.7 | 25.1 |
| OVDiff (Karazija et al., 2023) | 1,226M | 0 | 30.1 | 34.8 | 67.1 |
| GroupViT (Xu et al., 2022) | 0 | 55M | 18.7 | 27.5 | 50.4 |
| ZeroSeg (Chen et al., 2023) | - | - | 21.8 | 22.1 | 42.9 |
| SegCLIP (Luo et al., 2023) | - | - | 24.7 | 26.5 | 52.6 |
| TCL (Cha et al., 2023) | 156M | 21M | 24.3 | 30.4 | 51.2 |
| CLIPpy (Ranasinghe et al., 2023) | - | - | - | 32.0 | 52.2 |
| OVSegmentor (Xu et al., 2023a) | - | - | 20.4 | 25.1 | 53.8 |
| CLIP-DIY (Wysoczańska et al., 2024) | - | - | 19.7 | 31.0 | 59.9 |
| MaskCLIP (Dong et al., 2023) | 291M | 0 | 23.6 | 20.6 | 38.8 |
| CLIP-DINOiser (Wysoczańska et al., 2024) | - | - | 32.4 | 34.8 | 62.1 |
| LiT† (Zhai et al., 2022) (ViT-B + LAION-400M) | 94M | 63M | 29.6 | 38.3 | 48.1 |
| LiT† (Zhai et al., 2022) (ViT-B + COCO Captions) | 94M | 63M | 31.5 | 39.5 | 51.4 |
| Ours | | | | | |
| SimZSS (ViT-B + LAION-400M) | 94M | 63M | 31.1 | 38.1 | 48.6 |
| SimZSS (ViT-S + COCO Captions) | 21M | 40M | 32.8 | 39.5 | 55.5 |
| SimZSS (ViT-B + COCO Captions) | 94M | 63M | 37.2 | 43.5 | 58.4 |

In a second scenario, we explore a zero-shot segmentation task including the background class. Similar to Cha et al. (2023), we do not rely on the textual representation to predict the background class, but rather on the confidence levels of the predictions on the remaining classes. More precisely, we assign a given pixel to the background class if the highest confidence score among the other class predictions falls below a dataset- and model-specific threshold. The remaining implementation details are identical to the above-described setting. The mIoU scores are reported on three datasets, namely Pascal Context (Mottaghi et al., 2014), COCO-Object (Caesar et al., 2018) and Pascal VOC (Everingham et al., 2012).
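
The thresholding rule can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure: confidence is taken to be a softmax over the foreground similarities, the background is given index 0, and the threshold value 0.6 is arbitrary; none of these choices are specified in the text.

```python
import math

BACKGROUND = 0  # hypothetical index reserved for the background class


def softmax(scores):
    """Numerically stable softmax over a list of similarity scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_background(scores, threshold):
    """Return BACKGROUND if no foreground class is confident enough.

    scores: similarities of one pixel to each foreground class.
    Foreground classes are returned as 1 + their index in `scores`.
    Assumption: confidence is the softmax probability of the best class.
    """
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return BACKGROUND if probs[best] < threshold else best + 1


# A pixel with one dominant class keeps its foreground label.
confident = predict_with_background([4.0, 0.5, 0.2], threshold=0.6)
# A pixel with near-uniform scores falls back to the background.
ambiguous = predict_with_background([1.0, 0.9, 1.1], threshold=0.6)
```

Because the threshold is dataset- and model-specific, it would have to be selected separately for each benchmark and each backbone.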

In Table 2, we observe similar trends as in the scenario without the background class (see Table 1). However, the reported results are not as unequivocal as in the former evaluation. This is in part due to the crudeness of the background detection mechanism, which contrasts with the more sophisticated approaches used by some of the competing baselines. For instance, CLIP-DINOiser (Wysoczańska et al., 2024) relies on FOUND (Siméoni et al., 2023), whereas OVDiff (Karazija et al., 2023) uses different background prototypes for each class. Once again, training on COCO Captions leads to improved performance. When training on LAION-400M, a concept is identified in less than 15% of the samples on average. As a result, concepts rarely co-occur, making it sufficient to detect the concept without localizing it in the image to minimize the loss. Therefore, the model is not trained to be less confident in the background. Exploring the possibility of enlarging the list of concepts could provide further insights.

4.3 Zero-shot classification

Table 3: Zero-shot classification on ImageNet-1k. Image-level predictions are obtained by projecting the image [CLS] token onto pre-computed text embeddings of class names. Accuracy is reported for various visual pretraining and vision-language alignment methods. † refers to our reproduction using DINOv2 pretrained vision backbones. The remaining results are as reported in Zhai et al. (2022).
| Method | Visual pretraining | Backbone | Pretraining dataset | Alignment dataset | Labels | Top-1 accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| LiT (Zhai et al., 2022) | MoCo-v3 (Chen et al., 2021) | ViT-B/16 | ImageNet-1k | CC12M+YFCC100M | ✗ | 55.4 |
| LiT (Zhai et al., 2022) | DINOv1 (Caron et al., 2021) | ViT-B/16 | ImageNet-1k | CC12M+YFCC100M | ✗ | 55.5 |
| LiT (Zhai et al., 2022) | AugReg (Steiner et al., 2022) | ViT-B/16 | ImageNet-21k | CC12M+YFCC100M | ✓ | 55.9 |
| LiT† (Zhai et al., 2022) | DINOv2 (Oquab et al., 2023) | ViT-B/14 | LVD-142M | COCO Captions | ✗ | 22.6 |
| LiT† (Zhai et al., 2022) | DINOv2 (Oquab et al., 2023) | ViT-B/14 | LVD-142M | LAION-400M | ✗ | 63.6 |
| Ours | | | | | | |
| SimZSS | DINOv2 (Oquab et al., 2023) | ViT-B/14 | LVD-142M | COCO Captions | ✗ | 24.3 |
| SimZSS | DINOv2 (Oquab et al., 2023) | ViT-B/14 | LVD-142M | LAION-400M | ✗ | 64.1 |
Figure 4: Zero-shot segmentation performance as a function of the number of processed image-caption pairs in LAION-400M. The left plot shows the mIoU percentages for different datasets, while the right plot shows the relative performance percentages. Each data point represents the result of running a vision-language alignment from scratch using SimZSS; these are not training curves.

The proposed method derives from LiT (Zhai et al., 2022), noted for its excellent zero-shot classification performance on ImageNet-1k (Deng et al., 2009) among other benchmarks. Although our approach is tailored for dense downstream tasks, we investigate whether SimZSS exhibits zero-shot classification capabilities or if the introduced local consistency objective breaks this property. We follow the evaluation protocol from OpenCLIP (Ilharco et al., 2021). Class names from ImageNet-1k are contextualized with the corresponding templates from CLIP (Radford et al., 2021) and embedded through the text encoder. On the vision side, images are passed through the vision encoder, and the global representation is used to obtain class predictions via projection onto the class embeddings. We report the top-1 accuracy of these predictions.
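
The protocol above can be sketched in a few lines. The sketch assumes that the embeddings of all prompts for a class are averaged and renormalized into a single class embedding, and that the prediction is the class with the highest cosine similarity to the global image representation. The two prompts are merely in the style of the CLIP templates, and `toy_encode_text` is a hypothetical stand-in for a real text encoder.

```python
import math

# Two illustrative prompts in the style of the CLIP templates (assumption).
TEMPLATES = ["a photo of a {}.", "a blurry photo of a {}."]


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def class_embedding(name, encode_text, templates=TEMPLATES):
    """Average the normalized prompt embeddings of a class, then renormalize."""
    embs = [l2_normalize(encode_text(t.format(name))) for t in templates]
    mean = [sum(column) / len(embs) for column in zip(*embs)]
    return l2_normalize(mean)


def zero_shot_predict(image_emb, class_embs):
    """Return the index of the class embedding closest to the image embedding."""
    image_emb = l2_normalize(image_emb)
    scores = [sum(a * b for a, b in zip(image_emb, c)) for c in class_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def top1_accuracy(preds, targets):
    """Fraction of predictions that match the ground-truth labels."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


def toy_encode_text(prompt):
    """Hypothetical text encoder: keyword indicators instead of a transformer."""
    return [float("cat" in prompt), float("dog" in prompt),
            0.1 * float("blurry" in prompt)]


class_embs = [class_embedding(n, toy_encode_text) for n in ["cat", "dog"]]
image_embs = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.0], [0.6, 0.4, 0.0]]
targets = [0, 1, 1]
preds = [zero_shot_predict(e, class_embs) for e in image_embs]
accuracy = top1_accuracy(preds, targets)
```

In this toy run the third image is closer to the "cat" embedding than to its "dog" label, so two of the three predictions are correct.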

The results reported in Table 3 reveal that SimZSS not only preserves classification capabilities comparable to LiT but also exhibits slight improvements in the low-data regime. It is worth mentioning that it takes CLIP 32 epochs on LAION-400M to train a ViT-B/16 to reach a zero-shot accuracy of 67.0 (see Ilharco et al. (2021), Tab. 18). Overall, while pretraining on COCO Captions excels for segmentation tasks, zero-shot classification benefits from larger datasets. This suggests that pretraining on LAION-400M followed by finetuning on COCO Captions could optimize the benefits of both datasets.

4.4 Scalability of SimZSS

An important property of vision-language pretraining methods is their capacity to leverage large datasets effectively, which requires them to be i) compatible with noisy image-caption pairs, and ii) computationally efficient.

We first verify in Tables 8 and 9 that SimZSS-specific operations incur only moderate memory overhead and negligible time overhead compared to LiT. Considering this, along with the data efficiency of the proposed approach, SimZSS is orders of magnitude cheaper to train than CLIP. Secondly, we increase the number of image-caption pairs from LAION-400M used for training and report the zero-shot segmentation performance at the end of each training run. The results depicted in Figure 4 underscore the scalability of SimZSS and suggest potential gains from even larger datasets, such as LAION-5B (Schuhmann et al., 2022), or from additional training epochs.

4.5 Bank of concepts

Table 4: Ablation study on the impact of concept sets. Training is conducted on COCO Captions with and without class names from Pascal VOC included in the concept set, using a ViT-B/14 vision backbone. Performance on the zero-shot segmentation task is then compared based on achieved mIoU scores. † refers to our reproduction using DINOv2 pretrained vision backbones.
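| Method | Pascal VOC concepts | Pascal VOC (w/o bg) | Pascal Context (w/o bg) | COCO-Stuff (w/o bg) | Cityscapes (w/o bg) | ADE20K (w/o bg) | Pascal Context (w/ bg) | COCO-Object (w/ bg) | Pascal VOC (w/ bg) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LiT† (Zhai et al., 2022) | - | 86.1 | 35.5 | 25.6 | 25.8 | 18.1 | 31.5 | 39.5 | 51.4 |
| SimZSS | ✗ | 88.1 | 42.9 | 28.7 | 32.9 | 22.6 | 36.6 | 42.4 | 53.4 |
| SimZSS | ✓ | 90.3 | 43.1 | 29.0 | 33.0 | 21.8 | 37.2 | 43.5 | 58.4 |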

At its core, SimZSS relies on identifying noun phrases in captions and filtering them through a bank of concepts. This filtering ensures that the identified noun phrases are valid candidates for visual concepts and can be used to query the images. A potential question that arises is whether the segmentation improvements of the proposed method are limited to the classes included in the bank of concepts or if they generalize to any class. To investigate this, we train SimZSS with the class names from Pascal VOC (Everingham et al., 2012) removed from the bank of concepts and report the resulting zero-shot segmentation mIoU scores across all downstream datasets.
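
The filtering step can be sketched as below. The sketch assumes that the noun phrases have already been extracted from the caption and that a phrase is kept when its head noun, approximated here by its last word, appears in the bank. The matching rule, the function name, and the four-entry bank are hypothetical simplifications.

```python
# Hypothetical subset of the concept bank, for illustration only.
CONCEPT_BANK = {"dog", "frisbee", "grass", "person"}


def filter_concepts(noun_phrases, concept_bank):
    """Keep the noun phrases whose head noun belongs to the concept bank.

    Assumption: the head noun is the last word of the phrase, which holds
    for simple English noun phrases such as "a brown dog".
    Returns a list of (phrase, concept) pairs.
    """
    kept = []
    for phrase in noun_phrases:
        words = phrase.lower().split()
        if not words:
            continue  # skip empty strings
        head = words[-1]
        if head in concept_bank:
            kept.append((phrase, head))
    return kept


# Noun phrases as they might come out of a caption such as
# "A brown dog catches a red frisbee on tall grass in the afternoon."
phrases = ["A brown dog", "a red frisbee", "the afternoon", "tall grass"]
matches = filter_concepts(phrases, CONCEPT_BANK)
```

Here "the afternoon" is discarded because it does not correspond to an entry in the bank, whereas the three remaining phrases are retained as candidate visual concepts.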

Most importantly, Table 4 shows that SimZSS outperforms LiT on Pascal VOC both with and without background. This invalidates the hypothesis that only classes contained in the bank of concepts benefit from the proposed local consistency objective. Furthermore, the performance on other downstream datasets only slightly deviates from the one obtained using the complete list of concepts, which further supports our claim, considering that the labels from Pascal VOC account for approximately 22%, 26%, and 33% of the labels from COCO-Object (Caesar et al., 2018), Cityscapes (Cordts et al., 2016), and Pascal Context (Mottaghi et al., 2014), respectively. Overall, these results suggest that our method effectively trains the text encoder to align with local features.

Alternatively, Figure 1 suggests that the vision tower, here a ViT-B/14 pretrained with DINOv2 (Oquab et al., 2023), could handle more concepts than those present in the concept bank. Indeed, some visual concepts, such as bike (third row, columns 1-5), appear well-segmented in the image but are not highlighted in the caption, meaning the concept is absent from the concept bank, and thus, no constraint is enforced. This claim is further supported by the fine-grained vision-language alignment depicted in Figure 3. The visualization further shows that, even though concepts such as forest, chef, water buffalo, or bike are not in the concept bank, the text concepts and the corresponding patches appear well aligned.

5 Conclusion

We introduce SimZSS, a simple framework to endow pretrained pure vision models with open-vocabulary segmentation capabilities. This approach is versatile: it can accommodate various backbones, pretraining methods, and datasets, irrespective of their scale and degree of curation. Overall, SimZSS is both a computationally and data-efficient method that yields state-of-the-art results across standard zero-shot segmentation benchmarks.

Acknowledgement.

This research is supported by the Personalized Health and Related Technologies (PHRT), grant number 2021/344. Additional funding is provided by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant Agreement No. 101021347). This work is supported by a grant from the Swiss National Supercomputing Centre (CSCS) on the Swiss share of the LUMI system under project ID 606.

References

  • Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
  • Araslanov & Roth (2020) Nikita Araslanov and Stefan Roth. Single-stage semantic segmentation from image labels. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  4253–4262, 2020.
  • Balažević et al. (2023) Ivana Balažević, David Steiner, Nikhil Parthasarathy, Relja Arandjelović, and Olivier J Hénaff. Towards in-context scene understanding. arXiv preprint arXiv:2306.01667, 2023.
  • Balazevic et al. (2024) Ivana Balazevic, David Steiner, Nikhil Parthasarathy, Relja Arandjelović, and Olivier Henaff. Towards in-context scene understanding. Advances in Neural Information Processing Systems, 36, 2024.
  • Caesar et al. (2018) Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  1209–1218, 2018.
  • Caron et al. (2021) Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  9650–9660, 2021.
  • Cha et al. (2023) Junbum Cha, Jonghwan Mun, and Byungseok Roh. Learning to generate text-grounded mask for open-world semantic segmentation from only image-text pairs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  11165–11174, 2023.
  • Chen et al. (2023) Jun Chen, Deyao Zhu, Guocheng Qian, Bernard Ghanem, Zhicheng Yan, Chenchen Zhu, Fanyi Xiao, Sean Chang Culatana, and Mohamed Elhoseiny. Exploring open-vocabulary semantic segmentation from clip vision encoder distillation only. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  699–710, 2023.
  • Chen & He (2020) Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. CoRR, abs/2011.10566, 2020. URL https://arxiv.org/abs/2011.10566.
  • Chen et al. (2015) Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  • Chen et al. (2021) Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  9640–9649, 2021.
  • Cheng et al. (2022) Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  1290–1299, 2022.
  • Contributors (2020) MMSegmentation Contributors. MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
  • Cordts et al. (2016) Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  3213–3223, 2016.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp.  248–255, 2009. doi: 10.1109/CVPR.2009.5206848.
  • Ding et al. (2022) Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. Decoupling zero-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  11583–11592, 2022.
  • Dong et al. (2023) Xiaoyi Dong, Jianmin Bao, Yinglin Zheng, Ting Zhang, Dongdong Chen, Hao Yang, Ming Zeng, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, and Nenghai Yu. Maskclip: Masked self-distillation advances contrastive language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  10995–11005, June 2023.
  • Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Everingham et al. (2012) M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html, 2012.
  • Everingham et al. (2010) Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88:303–338, 2010.
  • Gage (1994) Philip Gage. A new algorithm for data compression. The C Users Journal, 12(2):23–38, 1994.
  • Ghiasi et al. (2022) Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scaling open-vocabulary image segmentation with image-level labels. In European Conference on Computer Vision, pp.  540–557. Springer, 2022.
  • Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning, 2020.
  • Hamilton et al. (2022) Mark Hamilton, Zhoutong Zhang, Bharath Hariharan, Noah Snavely, and William T Freeman. Unsupervised semantic segmentation by distilling feature correspondences. arXiv preprint arXiv:2203.08414, 2022.
  • He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  9729–9738, 2020.
  • Honnibal et al. (2020) Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spacy: Industrial-strength natural language processing in python, 2020.
  • Ilharco et al. (2021) Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, July 2021. URL https://doi.org/10.5281/zenodo.5143773.
  • Jia et al. (2021) Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pp.  4904–4916. PMLR, 2021.
  • Karazija et al. (2023) Laurynas Karazija, Iro Laina, Andrea Vedaldi, and Christian Rupprecht. Diffusion models for zero-shot open-vocabulary segmentation. arXiv preprint arXiv:2306.09316, 2023.
  • Lebailly et al. (2023) Tim Lebailly, Thomas Stegmüller, Behzad Bozorgtabar, Jean-Philippe Thiran, and Tinne Tuytelaars. Cribo: Self-supervised learning via cross-image object-level bootstrapping. arXiv preprint arXiv:2310.07855, 2023.
  • Li et al. (2022) Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Rene Ranftl. Language-driven semantic segmentation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=RriDjddCLN.
  • Li et al. (2021) Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, and Jianfeng Gao. Efficient self-supervised vision transformers for representation learning. CoRR, abs/2106.09785, 2021. URL https://arxiv.org/abs/2106.09785.
  • Li et al. (2023) Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, and Kaiming He. Scaling language-image pre-training via masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  23390–23400, 2023.
  • Liang et al. (2023) Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  7061–7070, June 2023.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp.  740–755. Springer, 2014.
  • Luo et al. (2023) Huaishao Luo, Junwei Bao, Youzheng Wu, Xiaodong He, and Tianrui Li. Segclip: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In International Conference on Machine Learning, pp.  23033–23044. PMLR, 2023.
  • Ma et al. (2022) Chaofan Ma, Yuhuan Yang, Yanfeng Wang, Ya Zhang, and Weidi Xie. Open-vocabulary semantic segmentation with frozen vision-language models. arXiv preprint arXiv:2210.15138, 2022.
  • Mottaghi et al. (2014) Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  891–898, 2014.
  • Mukhoti et al. (2023) Jishnu Mukhoti, Tsung-Yu Lin, Omid Poursaeed, Rui Wang, Ashish Shah, Philip HS Torr, and Ser-Nam Lim. Open vocabulary semantic segmentation with patch aligned contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  19413–19423, 2023.
  • Oquab et al. (2023) Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. Dinov2: Learning robust visual features without supervision, 2023.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
  • Ranasinghe et al. (2023) Kanchana Ranasinghe, Brandon McKinzie, Sachin Ravi, Yinfei Yang, Alexander Toshev, and Jonathon Shlens. Perceptual grouping in contrastive vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  5571–5584, 2023.
  • Ren et al. (2023) Pengzhen Ren, Changlin Li, Hang Xu, Yi Zhu, Guangrun Wang, Jianzhuang Liu, Xiaojun Chang, and Xiaodan Liang. Viewco: Discovering text-supervised segmentation masks via multi-view semantic consistency. arXiv preprint arXiv:2302.10307, 2023.
  • Rewatbowornwong et al. (2023) Pitchaporn Rewatbowornwong, Nattanat Chatthee, Ekapol Chuangsuwanich, and Supasorn Suwajanakorn. Zero-guidance segmentation using zero segment labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  1162–1172, 2023.
  • Schuhmann et al. (2021) Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  • Schuhmann et al. (2022) Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  • Shin et al. (2022) Gyungin Shin, Weidi Xie, and Samuel Albanie. Reco: Retrieve and co-segment for zero-shot transfer. Advances in Neural Information Processing Systems, 35:33754–33767, 2022.
  • Siméoni et al. (2021) Oriane Siméoni, Gilles Puy, Huy V Vo, Simon Roburin, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Renaud Marlet, and Jean Ponce. Localizing objects with self-supervised transformers and no labels. arXiv preprint arXiv:2109.14279, 2021.
  • Siméoni et al. (2023) Oriane Siméoni, Chloé Sekkat, Gilles Puy, Antonín Vobeckỳ, Éloi Zablocki, and Patrick Pérez. Unsupervised object localization: Observing the background to discover objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  3176–3186, 2023.
  • Stegmüller et al. (2023) Thomas Stegmüller, Tim Lebailly, Behzad Bozorgtabar, Tinne Tuytelaars, and Jean-Philippe Thiran. Croc: Cross-view online clustering for dense visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  7000–7009, June 2023.
  • Steiner et al. (2022) Andreas Peter Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your vit? data, augmentation, and regularization in vision transformers. Transactions on Machine Learning Research, 2022.
  • Strudel et al. (2021) Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  7262–7272, 2021.
  • Tschannen et al. (2024) Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, and Lucas Beyer. Image captioners are scalable vision learners too. Advances in Neural Information Processing Systems, 36, 2024.
  • Wang et al. (2023) Yangtao Wang, Xi Shen, Yuan Yuan, Yuming Du, Maomao Li, Shell Xu Hu, James L Crowley, and Dominique Vaufreydaz. Tokencut: Segmenting objects in images and videos with self-supervised transformer and normalized cut. IEEE transactions on pattern analysis and machine intelligence, 2023.
  • Wysoczańska et al. (2024) Monika Wysoczańska, Michaël Ramamonjisoa, Tomasz Trzciński, and Oriane Siméoni. Clip-diy: Clip dense inference yields open-vocabulary semantic segmentation for-free. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp.  1403–1413, 2024.
  • Wysoczańska et al. (2024) Monika Wysoczańska, Oriane Siméoni, Michaël Ramamonjisoa, Andrei Bursuc, Tomasz Trzciński, and Patrick Pérez. Clip-dinoiser: Teaching clip a few dino tricks for open-vocabulary semantic segmentation, 2024.
  • Xu et al. (2022) Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, and Xiaolong Wang. Groupvit: Semantic segmentation emerges from text supervision. arXiv preprint arXiv:2202.11094, 2022.
  • Xu et al. (2023a) Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng, Yi Wang, Yu Qiao, and Weidi Xie. Learning open-vocabulary semantic segmentation models from natural language supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  2935–2944, 2023a.
  • Xu et al. (2023b) Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. Side adapter network for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  2945–2954, 2023b.
  • Yang et al. (2024) Jiawei Yang, Katie Z Luo, Jiefeng Li, Kilian Q Weinberger, Yonglong Tian, and Yue Wang. Denoising vision transformers. arXiv preprint arXiv:2401.02957, 2024.
  • Yu et al. (2022) Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
  • Yu et al. (2024) Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, and Liang-Chieh Chen. Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip. Advances in Neural Information Processing Systems, 36, 2024.
  • Zhai et al. (2022) Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  18102–18112, 2022. doi: 10.1109/CVPR52688.2022.01759.
  • Zhang et al. (2023) Xinyu Zhang, Yuting Wang, and Abdeslam Boularias. Detect every thing with few examples, 2023.
  • Zhou et al. (2017) Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  633–641, 2017.
  • Zhou et al. (2022a) Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. In European Conference on Computer Vision, pp.  696–712. Springer, 2022a.
  • Zhou et al. (2022b) Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. ibot: Image bert pre-training with online tokenizer. International Conference on Learning Representations (ICLR), 2022b.

Appendix

Appendix A Experimental setup

A.1 Pretraining

A.1.1 Pretraining datasets

We train our models on two distinct datasets: COCO Captions (Lin et al., 2014; Chen et al., 2015) and LAION-400M (Schuhmann et al., 2021). The former comprises 600K image-caption pairs with high-quality human-generated captions, while the latter is derived from LAION-5B (Schuhmann et al., 2022) consisting of image-caption pairs with vision-language cosine similarity exceeding 0.3, as determined by a pretrained CLIP model (Radford et al., 2021). These datasets represent opposing ends of the spectrum in terms of curation and scale.

A.1.2 Network architectures

Vision transformers (Dosovitskiy et al., 2020) are used to obtain image representations. More precisely, we experiment with ViT-S/14 and ViT-B/14 pretrained with DINOv2 (Oquab et al., 2023) on LVD-142M. A ViT-B/16 pretrained with AugReg (Steiner et al., 2022) on ImageNet-21k is also tested. The architecture of the text transformer is identical to the one used in CLIP, and its weights are randomly initialized. Overall, the only architectural difference w.r.t. CLIP is the removal of the learnable linear layer that maps visual representations to the cross-modal representation space, i.e., we project textual representations directly onto the visual space rather than projecting both the textual and visual representations onto an intermediary space.

A.1.3 Optimization

For COCO Captions, we train for 10 epochs using a global batch size of 16,384. We use a warm-up strategy spanning 10% of the training steps, linearly ramping up the learning rate until it reaches its peak value, chosen from the set {8e-5, 3e-5, 8e-6, 3e-6} (scaled by batch size / 256). Subsequently, we employ a cosine decay schedule for the remaining steps. Similarly, for LAION-400M, we train for 1 epoch with a global batch size of 32,768, and we select the learning rate from the options {3e-5, 8e-6, 3e-6, 8e-7, 3e-7}. The remaining optimization settings align with those of OpenCLIP (Ilharco et al., 2021).
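The schedule described above can be sketched as follows. This is a minimal illustration, not the training code: the function name `learning_rate`, the decay to exactly zero, and the example step count are assumptions made here for clarity.

```python
import math


def learning_rate(step, total_steps, base_lr, global_batch_size, warmup_frac=0.10):
    """Linear warm-up over the first 10% of steps, then cosine decay.

    Assumption: the cosine schedule decays to zero; the text does not
    state the final learning-rate value.
    """
    # Peak value: base learning rate scaled by batch size / 256.
    peak_lr = base_lr * global_batch_size / 256
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Hypothetical example: 360 steps, base learning rate 8e-5, batch size 16,384.
schedule = [learning_rate(s, 360, 8e-5, 16384) for s in range(360)]
```

With these example values, the peak learning rate is 8e-5 × 16,384 / 256 = 5.12e-3, reached at the end of the warm-up phase.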

A.1.4 Text concepts

We use the en_core_web_trf model from SpaCy (Honnibal et al., 2020) as a part-of-speech tagger to identify noun phrases. The concept bank is obtained as the union of the class names from Pascal VOC (Everingham et al., 2012), Pascal Context (Mottaghi et al., 2014), COCO-Stuff (Caesar et al., 2018), Cityscapes (Cordts et al., 2016) and ADE20K (Zhou et al., 2017). This results in 574 concepts.
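As a rough illustration of how such a concept bank could be built and queried, consider the sketch below. It assumes that noun phrases have already been extracted by the tagger, uses tiny illustrative subsets of the class lists, and applies a simple matching rule (full phrase first, then last word) that is an assumption of this sketch and not necessarily the rule used in SimZSS. The helper names `build_concept_bank` and `match_concepts` are hypothetical.

```python
def build_concept_bank(*class_name_lists):
    """Union of class names across datasets, normalized to lower case."""
    bank = set()
    for names in class_name_lists:
        bank.update(name.strip().lower() for name in names)
    return bank


def match_concepts(noun_phrases, bank):
    """Map each noun phrase to a concept of the bank, if any.

    Assumed rule: try the whole phrase, then fall back to its last word,
    treated here as the head noun.
    """
    matches = []
    for phrase in noun_phrases:
        text = phrase.strip().lower()
        head = text.split()[-1] if text else ""
        if text in bank:
            matches.append((phrase, text))
        elif head in bank:
            matches.append((phrase, head))
    return matches


# Tiny illustrative subsets of the real class lists (not the full 574 concepts).
bank = build_concept_bank(
    ["person", "dog", "bicycle"],
    ["road", "person", "bicycle"],
    ["grass", "road", "sky"],
)
pairs = match_concepts(["a brown dog", "the green grass", "a sunny afternoon"], bank)
```

Because the bank is a set union, class names shared by several datasets (e.g., "person") are counted once.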

Appendix B Additional experiments

B.1 Impact of the vision tower

Table 5: Ablation study on the vision backbone for zero-shot foreground segmentation. We investigate the effect of the vision backbone’s size and pretraining method on the downstream task. The impact of using artifact-free models is also evaluated (with/without DVT (Yang et al., 2024)). We report the mIoU scores across five standard datasets. † refers to our reproduction using DINOv2 pretrained vision backbones.

| Method | Pretraining | Backbone | DVT | Pascal VOC | Pascal Context | COCO-Stuff | Cityscapes | ADE20K |
|---|---|---|---|---|---|---|---|---|
| Supervised | | | | | | | | |
| LiT (Zhai et al., 2022) | AugReg | ViT-B/16 | ✗ | 74.1 | 18.3 | 13.3 | 15.6 | 10.2 |
| SimZSS | AugReg | ViT-B/16 | ✗ | 81.2 | 23.3 | 16.1 | 18.9 | 12.6 |
| Self-supervised | | | | | | | | |
| LiT (Zhai et al., 2022) | DINOv2 | ViT-S/14 | ✗ | 84.0 | 32.8 | 22.0 | 24.0 | 16.3 |
| SimZSS | DINOv2 | ViT-S/14 | ✗ | 87.2 | 37.3 | 23.8 | 29.2 | 17.9 |
| LiT (Zhai et al., 2022) | DINOv2 | ViT-B/14 | ✗ | 86.3 | 34.1 | 24.6 | 25.8 | 17.1 |
| SimZSS | DINOv2 | ViT-B/14 | ✗ | 87.4 | 41.1 | 26.6 | 31.6 | 19.9 |
| LiT (Zhai et al., 2022) | DINOv2 | ViT-B/14 | ✓ | 86.1 | 35.5 | 25.6 | 25.8 | 18.1 |
| SimZSS | DINOv2 | ViT-B/14 | ✓ | 90.3 | 43.1 | 29.0 | 33.0 | 21.8 |

The underlying assumption of SimZSS is that the visual representations do not require further training, and only the vision-language alignment must be learned. As such, the quality of the vision backbone is pivotal to the overall performance of SimZSS.

Table 6: Ablation study on the vision backbone for zero-shot segmentation. We investigate the effect of the vision backbone’s size, pretraining method, and patch size on the downstream task. The impact of using artifact-free models is also evaluated (with/without DVT (Yang et al., 2024)). We report the mIoU scores on three standard datasets.

| Method | Pretraining | Backbone | DVT | Pascal Context | COCO-Object | Pascal VOC |
|---|---|---|---|---|---|---|
| Supervised | | | | | | |
| LiT (Zhai et al., 2022) | AugReg | ViT-B/16 | ✗ | 17.8 | 26.1 | 32.2 |
| SimZSS | AugReg | ViT-B/16 | ✗ | 21.6 | 29.7 | 35.9 |
| Self-supervised | | | | | | |
| LiT (Zhai et al., 2022) | DINOv2 | ViT-S/14 | ✗ | 29.9 | 36.5 | 49.6 |
| SimZSS | DINOv2 | ViT-S/14 | ✗ | 32.8 | 39.5 | 55.5 |
| LiT (Zhai et al., 2022) | DINOv2 | ViT-B/14 | ✗ | 30.5 | 28.7 | 49.8 |
| SimZSS | DINOv2 | ViT-B/14 | ✗ | 35.7 | 40.5 | 57.5 |
| LiT (Zhai et al., 2022) | DINOv2 | ViT-B/14 | ✓ | 31.5 | 39.5 | 51.4 |
| SimZSS | DINOv2 | ViT-B/14 | ✓ | 37.2 | 43.5 | 58.4 |

In Tables 5 and 6, we explore the performance of various pretrained vision towers in zero-shot segmentation tasks. Our observations indicate that SimZSS is versatile and can integrate diverse models and pretraining approaches effectively. As expected, models with superior performance on pure vision tasks also show enhanced results when used with our method, as well as with LiT. Most importantly, regardless of the vision backbone employed, SimZSS consistently surpasses LiT. Additionally, employing denoised ViTs (Yang et al., 2024), which exhibit better semantic feature correlation, leads to improved segmentation. This outcome is anticipated since our approach assumes that features corresponding to semantically related regions are highly correlated (see Sec. 3.2).

B.2 SimZSS-specific hyperparameters

Table 7: Ablation study on the impact of the loss weight λ and the temperature parameter τ. Training is performed on COCO Captions using a ViT-B/14 vision backbone. The performance is evaluated through a comparison of mIoU scores on the zero-shot segmentation task.

| Loss weight | Temperature | Pascal VOC (without background) | Pascal Context (without background) | COCO-Stuff (without background) | Cityscapes (without background) | ADE20K (without background) | Pascal Context (with background) | COCO-Object (with background) | Pascal VOC (with background) |
|---|---|---|---|---|---|---|---|---|---|
| λ = 0.00 | - | 86.1 | 35.5 | 25.6 | 25.8 | 18.1 | 31.5 | 39.5 | 51.4 |
| λ = 0.01 | τ = 0.1 | 89.8 | 40.2 | 27.6 | 29.7 | 19.6 | 34.9 | 42.1 | 54.9 |
| λ = 0.02 | τ = 0.1 | 89.8 | 41.9 | 28.2 | 30.9 | 20.4 | 36.1 | 42.3 | 57.1 |
| λ = 0.05 | τ = 0.1 | 90.3 | 43.1 | 29.0 | 33.0 | 21.8 | 37.2 | 43.5 | 58.4 |
| λ = 0.10 | τ = 0.1 | 90.1 | 41.8 | 28.6 | 33.6 | 22.1 | 36.1 | 43.1 | 58.5 |
| λ = 0.05 | τ = 0.01 | 89.2 | 42.3 | 28.8 | 33.1 | 21.6 | 36.6 | 43.2 | 57.0 |
| λ = 0.05 | τ = 0.04 | 89.7 | 42.7 | 28.7 | 34.3 | 22.1 | 36.7 | 42.9 | 57.0 |
| λ = 0.05 | τ = 0.07 | 90.0 | 42.9 | 28.9 | 33.4 | 21.3 | 37.0 | 43.1 | 58.8 |
| λ = 0.05 | τ = 0.10 | 90.3 | 43.1 | 29.0 | 33.0 | 21.8 | 37.2 | 43.5 | 58.4 |
| λ = 0.05 | τ = 0.40 | 88.7 | 40.0 | 27.6 | 28.2 | 19.5 | 34.4 | 42.1 | 52.6 |

Thanks to its simplicity, SimZSS has few hyperparameters. The primary ones are τ, the temperature used to query the dense visual representation, and λ, which modulates the contribution of the dense loss to the overall objective (see Eqs. 4 and 10). After a grid search over λ, τ, and the learning rate, we find that the best-performing setting for training on COCO Captions is (λ = 0.05, τ = 0.1). In Table 7, we report the resulting zero-shot segmentation performance when varying λ and τ around this optimal point.

It can be observed that the setting with λ = 0, corresponding to LiT (Zhai et al., 2022), performs the worst on every benchmark. Regarding τ, SimZSS works best with low temperatures: results are stable for τ ≤ 0.1 and degrade noticeably at τ = 0.4. This is not surprising, as in the low-temperature setting, fewer patches contribute to the representation of the visual concepts, aligning more closely with the downstream evaluations.
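The effect of the temperature on the number of contributing patches can be illustrated with a small sketch. It assumes that a concept queries the patches through a softmax over patch-concept similarities divided by τ; the exact form of Eqs. 3 and 4 is not reproduced here, and the similarity values and the helper names `softmax_weights` and `effective_patches` are hypothetical.

```python
import math


def softmax_weights(similarities, tau):
    """Softmax over patch-concept similarities with temperature tau."""
    scaled = [s / tau for s in similarities]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def effective_patches(weights):
    """Inverse participation ratio: about 1 if a single patch dominates,
    and equal to the number of patches if all weights are equal."""
    return 1.0 / sum(w * w for w in weights)


# Hypothetical cosine similarities between one text concept and six patches.
sims = [0.9, 0.8, 0.3, 0.2, 0.1, 0.0]
for tau in (0.01, 0.1, 0.4):
    weights = softmax_weights(sims, tau)
    print(f"tau={tau}: effective number of patches = {effective_patches(weights):.2f}")
```

In this toy setting, lowering τ concentrates the weights on the most similar patches, so the pooled concept representation is dominated by a small region of the image.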

B.3 High-level profiling & runtime comparison

Table 8: High-level profiling of SimZSS. We report the absolute and relative times of the various operations performed in SimZSS. Experiments are conducted on a single node with 4x AMD MI250x GPUs (2 compute dies per GPU, i.e., worldsize = 8) with a memory usage of 38GB per compute die. The backbone used is ViT-B/14, and the batch size is set to 1024 per compute die, totaling 8192. SimZSS-specific operations are highlighted.

| Description | Operation | Absolute time per iteration [ms] | Relative time [%] |
|---|---|---|---|
| Forward pass vision | f_v(·) | 1435.0 | 69.3 |
| Forward pass text | f_t(·) | 194.0 | 9.3 |
| Vision concept extraction | Equations 4 and 3 | 12.0 | 0.6 |
| Features gathering | - | 22.5 | 1.1 |
| Global consistency | Equation 6 | 1.0 | 0.05 |
| Concept-level consistency | Equation 9 | 0.9 | 0.04 |
| Weights update | backpropagation | 404.4 | 19.5 |
| Total | | 2069.8 | 100 |

Table 9: Computational and memory efficiency. The efficiency of SimZSS is compared to that of related methods, i.e., LiT and CLIP. When feasible, we report results using the local training batch size; otherwise, the largest power of 2 that fits into memory is utilized. The reported values are obtained on a single node equipped with 4x AMD MI250x (2 compute dies per GPU, i.e., worldsize = 8).

| Method | Batchsize per compute die | Memory per compute die [GB] | Time per step [ms] | Throughput [img/s] |
|---|---|---|---|---|
| CLIP | 256 | ∼40 | 1196.0 | 1712 |
| LiT | 1024 | ∼27 | 2049.2 | 3997 |
| SimZSS | 1024 | ∼38 | 2069.8 | 3957 |

In Table 8, we present a profiling of the operations performed during a training step of SimZSS. Overall, SimZSS-specific operations account for less than 1% of the total runtime. Table 9 further confirms that the runtime of SimZSS is comparable to that of LiT (Zhai et al., 2022), with a modest memory overhead (∼38 GB versus ∼27 GB per compute die).
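The throughput figures of Table 9 can be recomputed from the batch size and step time, which also gives a quick check on the overhead relative to LiT. The sketch below assumes that throughput is defined as the global batch size (batch per compute die × world size) divided by the time per step; the function name `throughput` is hypothetical.

```python
WORLD_SIZE = 8  # 4 GPUs x 2 compute dies


def throughput(batch_per_die, time_per_step_ms, world_size=WORLD_SIZE):
    """Images per second, assuming throughput = global batch / step time."""
    return batch_per_die * world_size / (time_per_step_ms / 1000.0)


# (batch size per compute die, time per step in ms), taken from Table 9.
runs = {
    "CLIP": (256, 1196.0),
    "LiT": (1024, 2049.2),
    "SimZSS": (1024, 2069.8),
}
for name, (batch, step_ms) in runs.items():
    print(name, int(throughput(batch, step_ms)))

# Extra time per step of SimZSS over LiT, as a fraction of the SimZSS step time.
overhead = (runs["SimZSS"][1] - runs["LiT"][1]) / runs["SimZSS"][1]
print(f"overhead vs. LiT: {100 * overhead:.2f}%")
```

Under this assumption, the truncated values match the tabulated throughputs (1712, 3997, and 3957 img/s), and the 20.6 ms gap between SimZSS and LiT stays just below 1% of the SimZSS step time.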

Appendix C Evaluation datasets

Pascal VOC 2012

The Pascal VOC dataset (Everingham et al., 2010) contains 20 classes with semantic segmentation annotations. The training set consists of 1,464 images, while the validation set includes 1,449 images. An additional background class is provided.

Pascal Context

The Pascal Context dataset (Mottaghi et al., 2014) extends the Pascal VOC dataset by providing detailed annotations for entire scenes. It provides semantic segmentation annotations for 60 classes, one of which is a background class. The training set contains 4,998 images, while the validation set includes 5,105 images.

COCO-Stuff

COCO-Stuff (Caesar et al., 2018) is an extension of COCO (Lin et al., 2014), providing pixel-wise annotations for 80 thing classes and 91 stuff classes. The annotations are exhaustive, i.e., no pixel remains unlabeled. It comprises 164K images in total, of which 118K are used for training and 5K for validation, covering a wide range of scenes and objects.

COCO-Object

COCO-Object uses the same set of images as the above-described COCO-Stuff, but only contains labels for the “things” categories. An additional background label covers the remaining pixels.

ADE20K

The ADE20K dataset (Zhou et al., 2017) comprises a diverse set of images with annotations for 150 semantic categories. The training set includes 20,210 images, and the validation set consists of 2,000 images.

Cityscapes

The Cityscapes dataset (Cordts et al., 2016) contains high-resolution images of urban street scenes captured in 50 cities and annotated for 30 classes. It is designed for semantic segmentation and urban scene understanding. The training set includes 2,975 images, and the validation set includes 500 images.