License: CC BY-SA 4.0
arXiv:2401.05224v2 [cs.CV] 22 Mar 2024

Do Vision and Language Encoders Represent the World Similarly?

Mayug Maniparambil (1, 2), Raiymbek Akshulakov (1, 3), Yasser Abdelaziz Dahou Djilali (2, 4), Sanath Narayan (4), Mohamed El Amine Seddik (4), Karttikeya Mangalam (3), Noel E. O’Connor (2)
(1) Joint first authors. (2) ML Labs, Dublin City University. (3) University of California Berkeley. (4) Technological Innovation Institute.
Abstract

Aligned text-image encoders such as CLIP have become the de facto model for vision-language tasks. Furthermore, modality-specific encoders achieve impressive performance in their respective domains. This raises a central question: does an alignment exist between uni-modal vision and language encoders, since they fundamentally represent the same physical world? Analyzing the latent space structure of vision and language models on image-caption benchmarks using Centered Kernel Alignment (CKA), we find that the representation spaces of unaligned and aligned encoders are semantically similar. Although unaligned encoders lack the statistical similarity of aligned encoders like CLIP, we show that a possible matching of unaligned encoders exists without any training. We frame this as a seeded graph-matching problem exploiting the semantic similarity between graphs and propose two methods: a Fast Quadratic Assignment Problem optimization, and a novel localized CKA metric-based matching/retrieval. We demonstrate the effectiveness of this on several downstream tasks including cross-lingual, cross-domain caption matching and image classification. Code available at github.com/mayug/0-shot-llm-vision.

1 Introduction

Figure 1: For matching, we calculate the kernels for image and text embeddings and employ QAP-based seeded matching to maximize CKA for obtaining the optimal permutation $\mathbf{P}$. For retrieval, we append query embeddings to base embeddings and retrieve the best caption that maximizes the local CKA for a query image.

The recent success of deep learning on vision-language tasks mainly relies on jointly trained language and image encoders, following the success of CLIP and ALIGN [20, 40]. The standard procedure for training these models aims at aligning text and image representations using a contrastive loss that maximizes the similarity between image-text pairs while pushing negative captions away [19, 36, 10]. This achieves a statistical similarity across the two latent spaces, which is key to retrieving the closest cross-modal representations using cosine similarity. This property does not hold for unaligned encoders; hence, extra transformations are needed to bridge the gap. One such transformation is a trained mapping network that captures the prior distribution over the text and image representations [31, 34, 35]. The work of [31] has shown that it is possible to train a linear mapping from the output embeddings of vision encoders to the input embeddings of language models and obtain impressive performance on image captioning and VQA tasks. This indicates that the representations of unaligned uni-modal vision and language encoders are sufficiently high level and differ only by a linear transformation. However, this linear layer is trained on CC-3M [9], which consists of three million image-caption pairs.

Is this training step necessary? In an ideal scenario, we anticipate an alignment between vision and language encoders as they inherently capture representations of the same physical world. To test this, we employ Centered Kernel Alignment (CKA) [42, 12, 22], which is known for measuring representation similarity both within and between networks. As shown in Figure 2, we measure the CKA between a variety of unaligned vision and language encoders [16, 47, 28, 37, 8] on the image-caption pairs of the COCO [27] dataset and observe that some have scores comparable to those of aligned encoders like CLIP [40], which points to semantic similarity.

We then ask the question: If the unaligned image and text encoders are semantically similar, is there a way to connect them in a zero-shot manner? Do they build a similar representation graph over the same information coming from the two modalities? We study these questions, revealing key similarities between unaligned image and text encoders, and how these similarities can be exploited for downstream tasks. Furthermore, we devise a caption matching downstream task and show using two novel methods that latent space communication between unaligned encoders could be achieved by leveraging the semantic similarities between the cross-modal spaces. Our contributions are:

  • We present a matching method that seeks the permutation of the captions that maximizes the CKA (see Fig. 1). We formulate maximizing CKA as a quadratic assignment problem and introduce transformations and normalizations that greatly improve the matching performance.

  • We propose a local CKA metric and use it to perform retrieval between two unaligned embedding spaces, demonstrating superior performance to relative representations [34] on COCO caption-image retrieval.

  • The method is benchmarked on COCO and NoCaps [2] cross-domain caption and image retrieval, as well as on ImageNet-100 [15] classification. Although our method is not optimized to align the representations in any manner, it demonstrates zero-shot communication between the encoders’ latent spaces.

  • Finally, we show a practical application of our method on cross-lingual image retrieval by making use of sentence transformers trained in various languages and a CLIP vision encoder trained only in English.

2 Related Work

Recently, there has been an increasing consensus that good networks, when trained independently, learn general representations across different architectures and tasks. The works of [33, 26, 22, 6] show that these networks exhibit representation similarity by learning similar latent spaces when trained on similar tasks and data [44, 5, 46, 11, 24, 32, 3]. Specifically, [22] introduced centered kernel alignment (CKA) as a similarity metric for comparing the inner representations across networks. CKA mitigates a limitation of canonical correlation analysis (CCA) [41]: its invariance to any invertible linear transformation, which often makes it difficult to measure meaningful similarities between representations. [48] uses CKA for comparing the representations from different layers of different language models and the effect of downstream task-finetuning on the representation similarities, while [6] utilizes CKA along with Procrustes similarity for understanding the ability of variational autoencoders (VAEs) [21] in learning disentangled representations. In general, these approaches study the representation similarity in unimodal models, either vision or language. Moreover, the use of CKA has been limited to visualization and analysis purposes, whereas we attempt to exploit CKA as an optimization objective.

Recent works [34, 35] employ relative representations to match embeddings of unaligned encoders using the cosine similarity to a set of anchors. However, these relative representations are sensitive to the selection of anchors and noise in the original embeddings. Similarly, [4, 14] analyze networks and empirically verify the “good networks learn similar representations” hypothesis by utilizing model stitching [24], which introduces trainable stitching layers to enable swapping parts of different networks. LiMBeR [31] can be seen as stitching the output of an image encoder to the input of a language model in the form of soft prompts [25]. However, these approaches involve training of stitching layers for evaluating the representation similarity between two models.

In this work, we argue that using an explicit similarity measure as done in [34, 35] is sensitive to the selection of anchors and noise in the original embeddings. An alternative design choice is an implicit measure that captures the similarity of similarities, which makes the alignment process more robust. Furthermore, we explore how this similarity can be leveraged for downstream cross-modal tasks in a training-free manner with the aid of CKA and a set of parallel anchors in the image and text latent embedding spaces.

3 Preliminaries

Centered Kernel Alignment (CKA) has shown its relevance in understanding and comparing the information encoded by different layers of a neural network [22]. Formally, CKA relies on two sets of data $\mathbf{X} \in \mathbb{R}^{p \times N}$ and $\mathbf{Y} \in \mathbb{R}^{q \times N}$ through their corresponding kernels $\mathbf{K} = k(\mathbf{X}^\top, \mathbf{X}) \in \mathbb{R}^{N \times N}$ and $\mathbf{L} = \ell(\mathbf{Y}^\top, \mathbf{Y}) \in \mathbb{R}^{N \times N}$, where $k, \ell$ are some kernel functions applied on the columns of $\mathbf{X}$ and $\mathbf{Y}$ respectively (e.g., linear or RBF kernels). Therefore, the CKA is computed in terms of $\mathbf{K}$ and $\mathbf{L}$ as:

$$\operatorname{CKA}(\mathbf{K}, \mathbf{L}) = \frac{\operatorname{HSIC}(\mathbf{K}, \mathbf{L})}{\sqrt{\operatorname{HSIC}(\mathbf{K}, \mathbf{K}) \operatorname{HSIC}(\mathbf{L}, \mathbf{L})}}, \tag{1}$$

where $\operatorname{HSIC}(\cdot, \cdot)$ is the Hilbert-Schmidt Independence Criterion [18, 30] defined as:

$$\operatorname{HSIC}(\mathbf{K}, \mathbf{L}) = \frac{1}{(N-1)^{2}} \operatorname{tr}\left(\mathbf{K}\mathbf{C}\mathbf{L}\mathbf{C}\right), \tag{2}$$

with $\mathbf{C} = \mathbf{I} - \frac{1}{N}\bm{1}\bm{1}^\top$ the centring matrix. We refer the reader to [22] for broader properties and studies of the CKA metric on neural network representations.
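
A minimal NumPy sketch of Equations 1 and 2 may help make the definitions concrete. It assumes linear kernels and synthetic data; the function names and the toy matrices are illustrative and are not taken from the paper's code.

```python
import numpy as np


def linear_kernel(X: np.ndarray) -> np.ndarray:
    """Linear kernel K = X^T X for data X of shape (p, N), one column per sample."""
    return X.T @ X


def hsic(K: np.ndarray, L: np.ndarray) -> float:
    """Eq. (2): tr(K C L C) / (N - 1)^2 with C = I - (1/N) 1 1^T."""
    n = K.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(K @ C @ L @ C)) / (n - 1) ** 2


def cka(K: np.ndarray, L: np.ndarray) -> float:
    """Eq. (1): HSIC(K, L) normalized by sqrt(HSIC(K, K) * HSIC(L, L))."""
    return float(hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L)))


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 50))                   # p = 8 features, N = 50 samples
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix
Y_rotated = 3.0 * (Q @ X)                      # rotated and rescaled copy of X
Y_random = rng.normal(size=(6, 50))            # unrelated data with q = 6

cka_rotated = cka(linear_kernel(X), linear_kernel(Y_rotated))
cka_random = cka(linear_kernel(X), linear_kernel(Y_random))
print(f"rotated copy: {cka_rotated:.3f}, unrelated data: {cka_random:.3f}")
```

With a linear kernel, the rotated and rescaled copy has the same kernel as $\mathbf{X}$ up to a constant factor, so its CKA with $\mathbf{X}$ equals 1 up to rounding, while unrelated data scores lower.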

4 Proposed Method

Consider a set of $N$ image-caption pairs, $\mathcal{S} = \{(\bm{x}_i, \bm{c}_i)\}_{i=1}^{N}$, where $\bm{x}_i \in \mathcal{X}$ and $\bm{c}_i \in \mathcal{C}$ represent the $i$-th image and its corresponding caption, respectively. In this particular example, we perform caption-to-image retrieval, but the method is applicable in the reverse direction as well. Let $\bm{f}: \mathcal{X} \mapsto \mathbb{R}^{d_1}$ and $\bm{g}: \mathcal{C} \mapsto \mathbb{R}^{d_2}$ denote some vision and language encoders respectively. The image-caption pairs are mapped into their corresponding sets of representations $\mathbf{Z} = [\bm{z}_1, \ldots, \bm{z}_N] \in \mathbb{R}^{d_1 \times N}$ and $\mathbf{H} = [\bm{h}_1, \ldots, \bm{h}_N] \in \mathbb{R}^{d_2 \times N}$, where $\bm{z}_i = \bm{f}(\bm{x}_i)$ and $\bm{h}_i = \bm{g}(\bm{c}_i)$.

As shown in Table 1, the maximum CKA score is obtained on the ground-truth ordering of the representations, $\operatorname{CKA}_{\text{max}} = \operatorname{CKA}(\mathbf{K}_{\mathbf{Z}}, \mathbf{K}_{\mathbf{H}})$, where $\mathbf{K}_{\mathbf{Z}}$ and $\mathbf{K}_{\mathbf{H}}$ are the kernels for the image and text representations, defined respectively as $\mathbf{K}_{\mathbf{Z}} = k(\mathbf{Z}^\top, \mathbf{Z})$ and $\mathbf{K}_{\mathbf{H}} = k(\mathbf{H}^\top, \mathbf{H})$. We find that the CKA is sensitive to the data ordering. Specifically, we shuffle $x\%$ of the data to obtain wrong matches while keeping the remaining $(100-x)\%$ aligned, measure the CKA on each new data set, and observe that it decreases monotonically with random shuffling. This motivates our methodology for finding an optimal permutation of the image data that maximizes the CKA.

Table 1: CKA reduces with shuffling. We measure the CKA score between DINOv2 [37] and All-Roberta-large-v1 [28] on the 5k COCO [27] image-caption representation pairs of the validation set. The exact ordering yields the best score, whereas randomly shuffling the representations reduces the CKA score.

Shuffling (%) | 0    | 20   | 40   | 60   | 80   | 100
CKA score     | 0.72 | 0.46 | 0.27 | 0.13 | 0.04 | 0.01
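
The shuffling protocol can be imitated on synthetic data. The sketch below is not a reproduction of Table 1: it assumes toy Gaussian "image" embeddings, "caption" embeddings produced by a hypothetical random linear map `W` plus noise, and a linear kernel. The helper names are illustrative.

```python
import numpy as np


def linear_cka(A: np.ndarray, B: np.ndarray) -> float:
    """Linear-kernel CKA for A of shape (d1, N) and B of shape (d2, N)."""
    A = A - A.mean(axis=1, keepdims=True)   # centering the samples plays the role of C
    B = B - B.mean(axis=1, keepdims=True)
    cross = np.linalg.norm(A @ B.T) ** 2    # equals tr(K_A C K_B C)
    return float(cross / (np.linalg.norm(A @ A.T) * np.linalg.norm(B @ B.T)))


def shuffle_fraction(X: np.ndarray, fraction: float, rng) -> np.ndarray:
    """Permute a random subset holding `fraction` of the columns among themselves."""
    n = X.shape[1]
    idx = rng.choice(n, size=int(round(fraction * n)), replace=False)
    shuffled = X.copy()
    shuffled[:, idx] = X[:, rng.permutation(idx)]
    return shuffled


rng = np.random.default_rng(0)
N, d1, d2 = 500, 16, 12
Z = rng.normal(size=(d1, N))                    # toy image embeddings
W = rng.normal(size=(d2, d1))                   # hypothetical cross-modal map
H = W @ Z + 0.1 * rng.normal(size=(d2, N))      # toy caption embeddings

scores = {pct: linear_cka(shuffle_fraction(Z, pct / 100, rng), H)
          for pct in (0, 20, 40, 60, 80, 100)}
for pct, score in scores.items():
    print(f"{pct:3d}% shuffled -> CKA {score:.3f}")
```

A random permutation of the selected columns can leave a few of them in place, so the effective fraction of wrong matches may be slightly below the nominal value.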

Formally, let $\sigma$ be some permutation of the set $\{1, \cdots, N\}$ and denote by $\sigma(\mathbf{Z}) = [\bm{z}_{\sigma(1)}, \cdots, \bm{z}_{\sigma(N)}] \in \mathbb{R}^{d_1 \times N}$ the set of image representations permuted by $\sigma$. If $\sigma$ is not the identity, it disrupts the original ordering of the image representations, leading to a lower CKA score as shown in Table 1. Therefore, our goal is to find a permutation $\sigma^{*}$ that maximizes the CKA. Formally:

$$\sigma^{*} = \operatorname*{arg\,max}_{\sigma} \operatorname{CKA}(\mathbf{K}_{\sigma(\mathbf{Z})}, \mathbf{K}_{\mathbf{H}}). \tag{3}$$

The solution to this problem seeks to realign the permuted set of images in a way that maximizes the CKA, potentially recovering the ground-truth pairing between images and their corresponding captions.

To solve the aforementioned optimization problem, we explore two main approaches (visualized in Fig. 1): the Quadratic Assignment Problem (QAP) algorithm and Local CKA-based retrieval and matching. The QAP algorithm provides a global matching solution, seeking the optimal permutation across the query set considered. On the other hand, Local CKA-based retrieval and matching focuses on aligning images and captions using a localized metric, facilitating retrieval on a more granular level. This approach is more suitable where a single query image is given for a set of captions or vice versa.

4.1 QAP Matching

For some random permutation $\sigma$, the optimization problem in Equation 3 can be reformulated as a quadratic optimization problem [45] which reads as:

$$\max_{\mathbf{P} \in \mathcal{P}_N} \operatorname{tr}\left(\mathbf{P}^\top \bar{\mathbf{K}}_{\sigma(\mathbf{Z})} \mathbf{P} \bar{\mathbf{K}}_{\mathbf{H}}\right), \tag{4}$$

where $\mathcal{P}_N$ is the set of all permutation matrices of size $N$ and $\bar{\mathbf{K}} = \operatorname{HSIC}(\mathbf{K}, \mathbf{K})^{-\frac{1}{2}} \mathbf{K}\mathbf{C}$ stands for the centered and re-scaled kernel. In principle, maximizing the above objective is a relaxation of a graph-matching problem. Moreover, finding a global maximum of Equation 4 is NP-hard due to the combinatorial nature of the problem and therefore optimizing it can lead to sub-optimal or approximate solutions.
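
The step from Equation 3 to Equation 4 can be spelled out as follows. This is a sketch under the assumption that re-ordering the samples by a permutation matrix $\mathbf{P}$ turns a kernel $\mathbf{K}$ into $\mathbf{P}^\top \mathbf{K} \mathbf{P}$, which holds for any kernel evaluated pairwise on the columns. Here $\mathbf{K}$ stands for the image kernel, $\mathbf{L}$ for the caption kernel, and $\bar{\mathbf{K}}, \bar{\mathbf{L}}$ for their centered and re-scaled versions.

```latex
% A permutation matrix fixes the all-ones vector, so it commutes with C.
\mathbf{P}\bm{1} = \bm{1}, \quad \bm{1}^\top \mathbf{P} = \bm{1}^\top
\;\Longrightarrow\; \mathbf{P}\mathbf{C} = \mathbf{C}\mathbf{P},
\quad \mathbf{P}\mathbf{C}\mathbf{P}^\top = \mathbf{C}.

\begin{aligned}
\operatorname{HSIC}(\mathbf{P}^\top \mathbf{K} \mathbf{P}, \mathbf{P}^\top \mathbf{K} \mathbf{P})
  &= \tfrac{1}{(N-1)^2} \operatorname{tr}\left(\mathbf{P}^\top \mathbf{K} \mathbf{P} \mathbf{C} \mathbf{P}^\top \mathbf{K} \mathbf{P} \mathbf{C}\right)
   = \operatorname{HSIC}(\mathbf{K}, \mathbf{K}), \\
\operatorname{HSIC}(\mathbf{P}^\top \mathbf{K} \mathbf{P}, \mathbf{L})
  &= \tfrac{1}{(N-1)^2} \operatorname{tr}\left(\mathbf{P}^\top \mathbf{K} \mathbf{C} \mathbf{P} \mathbf{L} \mathbf{C}\right), \\
\operatorname{CKA}(\mathbf{P}^\top \mathbf{K} \mathbf{P}, \mathbf{L})
  &= \tfrac{1}{(N-1)^2} \operatorname{tr}\left(\mathbf{P}^\top \bar{\mathbf{K}} \mathbf{P} \bar{\mathbf{L}}\right).
\end{aligned}
```

The denominator of the CKA therefore does not depend on $\mathbf{P}$, and maximizing the CKA over permutations amounts to maximizing the trace objective of Equation 4.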

To mitigate the NP-hardness of the QAP in practice, we suppose that we have access to a base set $\mathcal{B} = \{(\bm{z}_i^b, \bm{h}_i^b)\}_{i=1}^{M}$ of image-caption representation pairs and solve an equivalent objective to Equation 4 only partially on some unmatched query set $\mathcal{Q} = \{\bm{z}_i^q\}_{i=1}^{N} \times \{\bm{h}_i^q\}_{i=1}^{N}$ using a seeded version of the fast QAP algorithm [17]. Formally, let $\mathbf{Z} = [\bm{z}_1^b, \cdots, \bm{z}_M^b, \bm{z}_1^q, \cdots, \bm{z}_N^q] \in \mathbb{R}^{d_1 \times (M+N)}$ and $\mathbf{H} = [\bm{h}_1^b, \cdots, \bm{h}_M^b, \bm{h}_1^q, \cdots, \bm{h}_N^q] \in \mathbb{R}^{d_2 \times (M+N)}$ be the matrices concatenating all base and query representations of images and captions respectively, and denote by $\bar{\mathbf{K}}_{\mathbf{Z}}, \bar{\mathbf{K}}_{\mathbf{H}} \in \mathbb{R}^{(M+N) \times (M+N)}$ the corresponding centered and re-scaled kernels. The partial matching for aligning the query samples is then performed by solving the following:

$$\max_{\mathbf{P} \in \mathcal{P}_N} \operatorname{tr}\left((\mathbf{I}_M \oplus \mathbf{P})^\top \bar{\mathbf{K}}_{\mathbf{Z}} (\mathbf{I}_M \oplus \mathbf{P}) \bar{\mathbf{K}}_{\mathbf{H}}\right), \tag{5}$$

where $\mathbf{I}_M \oplus \mathbf{P} \in \mathbb{R}^{(M+N) \times (M+N)}$ stands for the block-diagonal matrix having diagonal blocks $\mathbf{I}_M$ and $\mathbf{P}$.
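
A seeded matching of this kind can be sketched with SciPy. The example assumes that `scipy.optimize.quadratic_assignment` with `method="faq"` and its `partial_match` option is an acceptable stand-in for the seeded fast QAP solver; whether the authors use this exact routine is not stated here. The data are synthetic, the map `W` is hypothetical, and the kernels are doubly centered and scaled to unit Frobenius norm, which changes the objective of Equation 5 only by a constant factor.

```python
import numpy as np
from scipy.optimize import quadratic_assignment


def centered_unit_kernel(X: np.ndarray) -> np.ndarray:
    """Doubly centered linear kernel of X (d, n), scaled to unit Frobenius norm."""
    n = X.shape[1]
    C = np.eye(n) - np.ones((n, n)) / n
    K = C @ (X.T @ X) @ C
    return K / np.linalg.norm(K)


rng = np.random.default_rng(0)
M, N, d1, d2 = 40, 10, 16, 12                 # base size, query size, embedding dims
Z = rng.normal(size=(d1, M + N))              # toy image embeddings: base, then query
W = rng.normal(size=(d2, d1))                 # hypothetical map producing toy captions
H = W @ Z + 0.01 * rng.normal(size=(d2, M + N))

true_perm = rng.permutation(N)
H[:, M:] = H[:, M + true_perm]                # shuffle only the query captions

K_Z, K_H = centered_unit_kernel(Z), centered_unit_kernel(H)
seeds = np.column_stack([np.arange(M), np.arange(M)])   # base pair i <-> i stays fixed
res = quadratic_assignment(K_Z, K_H, method="faq",
                           options={"maximize": True, "partial_match": seeds})

pred = res.col_ind[M:] - M                    # caption slot assigned to each query image
truth = np.argsort(true_perm)                 # slot that actually holds each caption
print("objective (CKA of the matched kernels):", res.fun)
print("query matching accuracy:", np.mean(pred == truth))
```

The seeds play the role of the identity block $\mathbf{I}_M$: the solver only searches over assignments of the $N$ query nodes. Because the solver is approximate, the returned matching is not guaranteed to be the global optimum.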

4.2 Local CKA based Retrieval and Matching

The concept of a global CKA metric is extended to derive local similarity measures suitable for retrieval. This process begins with a base set $\mathcal{B} = \{(\bm{z}_i^b, \bm{h}_i^b)\}_{i=1}^{M}$ consisting of aligned pairs of image and caption representations. The objective is to facilitate caption-image retrieval/matching within an unaligned query set $\mathcal{Q} = \{\bm{z}_i^q\}_{i=1}^{N} \times \{\bm{h}_i^q\}_{i=1}^{N}$.

A local CKA score, denoted as $\operatorname{localCKA}(\bm{z}^q, \bm{h}^q)$ for a pair $(\bm{z}^q, \bm{h}^q) \in \mathcal{Q}$, is calculated by computing a global CKA score for the image-caption pairs in $\mathcal{B}$, augmented with the query pair $(\bm{z}^q, \bm{h}^q)$. The local CKA is computed as follows:

$$\operatorname{localCKA}(\bm{z}^q, \bm{h}^q) = \operatorname{CKA}(\mathbf{K}_{[\mathbf{Z}, \bm{z}^q]}, \mathbf{K}_{[\mathbf{H}, \bm{h}^q]}), \tag{6}$$

where $[\mathbf{M}, \bm{v}]$ denotes the column-wise concatenation of the matrix $\mathbf{M}$ and the vector $\bm{v}$, $\mathbf{Z} = [\bm{z}_1^b, \cdots, \bm{z}_M^b] \in \mathbb{R}^{d_1 \times M}$ and $\mathbf{H} = [\bm{h}_1^b, \cdots, \bm{h}_M^b] \in \mathbb{R}^{d_2 \times M}$. In essence, a correctly matched image-caption pair in $\mathcal{Q}$ would exhibit a higher degree of alignment with the base set $\mathcal{B}$ in terms of the CKA score, resulting in an elevated $\operatorname{localCKA}$ score. This metric can be used to calculate a score between one source query and $N$ target queries, enabling effective retrieval. Furthermore, this framework allows for the use of linear sum assignment [23] for matching tasks.
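
Equation 6 translates into a short scoring loop. The sketch below assumes a linear kernel, synthetic embeddings, and a hypothetical linear map `W` between the two spaces; the helper names are illustrative.

```python
import numpy as np


def linear_cka(A: np.ndarray, B: np.ndarray) -> float:
    """Linear-kernel CKA for A of shape (d1, n) and B of shape (d2, n)."""
    A = A - A.mean(axis=1, keepdims=True)
    B = B - B.mean(axis=1, keepdims=True)
    cross = np.linalg.norm(A @ B.T) ** 2
    return float(cross / (np.linalg.norm(A @ A.T) * np.linalg.norm(B @ B.T)))


def local_cka(z_q, h_q, Z_base, H_base) -> float:
    """Eq. (6): CKA of the base set augmented with one candidate query pair."""
    return linear_cka(np.column_stack([Z_base, z_q]),
                      np.column_stack([H_base, h_q]))


rng = np.random.default_rng(0)
M, N, d1, d2 = 30, 5, 16, 12                  # base size, query size, embedding dims
W = rng.normal(size=(d2, d1))                 # hypothetical cross-modal map
Z_base, Z_query = rng.normal(size=(d1, M)), rng.normal(size=(d1, N))
H_base, H_query = W @ Z_base, W @ Z_query     # toy captions aligned with the images

# scores[i, j]: local CKA of query image i paired with candidate caption j
scores = np.array([[local_cka(Z_query[:, i], H_query[:, j], Z_base, H_base)
                    for j in range(N)] for i in range(N)])
retrieved = scores.argmax(axis=1)             # best caption index per query image
print(retrieved)
```

For one-to-one matching instead of retrieval, the same score matrix could be passed to a linear sum assignment solver such as `scipy.optimize.linear_sum_assignment(scores, maximize=True)`.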

4.3 Stretching and Clustering

The choice of base samples and the spread of the representations in each embedding space affect the performance of the QAP and Local CKA algorithms. To spread the representations out in each domain for matching, we introduce a stretching matrix that normalizes each feature dimension by its standard deviation, calculated from the query and base sets. Given $\mathbf{X}=[\bm{x}_1,\cdots,\bm{x}_d]^{\top}\in\mathbb{R}^{d\times N}$, the stretched matrix $\mathbf{X}_s$ is computed as $\mathbf{X}_s=\mathbf{S}\mathbf{X}$, where the stretching matrix $\mathbf{S}\in\mathbb{R}^{d\times d}$ is a diagonal matrix with the inverse empirical standard deviation of each feature dimension as entries, i.e., $\mathbf{S}=\text{diag}\left(\frac{1}{\text{std}(\bm{x}_1)},\cdots,\frac{1}{\text{std}(\bm{x}_d)}\right)$, and $\bm{x}_i\in\mathbb{R}^{N}$ is the $i^{th}$ row of $\mathbf{X}$. This stretching operation is performed for both the image and text embeddings before calculating the kernels for both the QAP and local CKA matching algorithms. For picking the most effective base samples, we find that simple $k$-means clustering on the image embeddings works best. An ablation on how these choices affect the QAP and local CKA matching and retrieval accuracies is provided in Sec. 7.
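The stretching step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the released implementation: the function name `stretch`, the small `eps` guard against constant dimensions, and the toy data are assumptions introduced here.

```python
import numpy as np

def stretch(X: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale each feature dimension (row of X) by its inverse standard deviation.

    X has shape (d, N): d feature dimensions, N samples (base and query together).
    Equivalent to S @ X with S = diag(1 / std(x_1), ..., 1 / std(x_d)).
    """
    std = X.std(axis=1, keepdims=True)   # empirical std of each row, shape (d, 1)
    return X / (std + eps)               # eps is an assumption, not part of the formula

rng = np.random.default_rng(0)
# Toy features whose dimensions have very different scales.
X = rng.normal(size=(4, 200)) * np.array([[0.1], [1.0], [5.0], [20.0]])
X_s = stretch(X)
```

After stretching, every feature dimension has (approximately) unit standard deviation, so no single dimension dominates the kernel computed afterwards.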

5 Experiments

We assess the performance of the proposed method using various vision and language encoders on a set of downstream tasks. We first detail the encoders, datasets, downstream tasks, and the baselines used.

5.1 Vision and Language Encoders

The experimental setup covers vision encoders of different architectures, such as ViTs [16] and ConvNeXt [29], trained in various ways: supervised, language-supervised, and self-supervised, across different training data regimes. For the language encoder, an encoder capable of producing a global embedding for a caption is essential. We consider such encoders across multiple architectures, model sizes, languages, and training data sizes. We use the Hugging Face sentence-transformers library [43], in which each sentence transformer is first pre-trained on the masked language modeling task using a large text corpus and then fine-tuned on a sentence-pair dataset with a contrastive loss. It is not straightforward to obtain a global sentence embedding from decoder-only models such as the GPT models [39, 7]; hence, we did not study the semantic alignment of this class of models with vision encoders.

The CKA and Matching Score (MS) of the various combinations of vision and language encoders are reported in the supplementary material. The findings indicate that All-Roberta-large-v1 [28] achieves the best CKA/MS across all vision models, so we use it as the primary language encoder for subsequent tasks unless specified otherwise.

5.2 Baselines

Here, we briefly describe three baselines that we compare our methods against for caption matching/retrieval, image classification, and cross-lingual tasks.

Linear Regression: We propose a baseline that learns a linear transformation from the image embedding space to the text embedding space using $M$ aligned base examples and applies the transformation to the query image embeddings. Concretely, given query image embeddings $\mathbf{Z}^{q}=[\bm{z}_{1}^{q},\cdots,\bm{z}_{N}^{q}]\in\mathbb{R}^{d_{1}\times N}$ and text embeddings $\mathbf{H}^{q}=[\bm{h}_{1}^{q},\cdots,\bm{h}_{N}^{q}]\in\mathbb{R}^{d_{2}\times N}$, and a set of aligned base samples $\mathbf{Z}^{b}=[\bm{z}_{1}^{b},\cdots,\bm{z}_{M}^{b}]\in\mathbb{R}^{d_{1}\times M}$ and $\mathbf{H}^{b}=[\bm{h}_{1}^{b},\cdots,\bm{h}_{M}^{b}]\in\mathbb{R}^{d_{2}\times M}$, we first construct a linear transformation between $\mathbf{Z}^{b}$ and $\mathbf{H}^{b}$ by minimizing the MSE loss as $\mathbf{W}=\operatorname*{arg\,min}_{\mathbf{W}}\|\mathbf{W}^{\top}\mathbf{Z}^{b}-\mathbf{H}^{b}\|_{F}^{2}$. Then we use $\mathbf{W}$ to transform the query image embeddings $\mathbf{Z}^{q}$ to the text domain as $\hat{\mathbf{H}}^{q}=\mathbf{W}^{\top}\mathbf{Z}^{q}$. Cosine similarity between $\hat{\mathbf{H}}^{q}$ and $\mathbf{H}^{q}$ can be used to perform caption retrieval.
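A minimal sketch of this baseline is given below, assuming embeddings are stored column-wise as in the notation above. The helper names (`fit_linear_map`, `cosine_similarity`) and the synthetic data, in which the text embeddings are an exact linear function of the image embeddings, are illustrative assumptions and not part of our experimental setup.

```python
import numpy as np

def fit_linear_map(Z_b: np.ndarray, H_b: np.ndarray) -> np.ndarray:
    """Least-squares W (d1, d2) minimizing ||W.T @ Z_b - H_b||_F^2."""
    # Transposed form of the same objective: min_W ||Z_b.T @ W - H_b.T||_F^2.
    W, *_ = np.linalg.lstsq(Z_b.T, H_b.T, rcond=None)
    return W

def cosine_similarity(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Columns are samples; returns a (N_A, N_B) matrix of cosine similarities."""
    A = A / np.linalg.norm(A, axis=0, keepdims=True)
    B = B / np.linalg.norm(B, axis=0, keepdims=True)
    return A.T @ B

rng = np.random.default_rng(0)
d1, d2, M, N = 8, 6, 50, 20
T = rng.normal(size=(d2, d1))            # hidden linear relation used only for toy data
Z_b = rng.normal(size=(d1, M)); H_b = T @ Z_b
Z_q = rng.normal(size=(d1, N)); H_q = T @ Z_q

W = fit_linear_map(Z_b, H_b)
H_hat = W.T @ Z_q                        # query images mapped to the text space
sim = cosine_similarity(H_hat, H_q)
retrieved = sim.argmax(axis=1)           # top-1 caption index for each query image
```

On real encoders the relation between the two spaces is not exactly linear, so the fit is only approximate and retrieval is correspondingly imperfect.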

Relative Representations [34]: These enable latent space communication between unaligned encoders by representing each query point relative to an aligned base set. Concretely, let the $\ell_{2}$-normalized embeddings for image and text queries be $\mathbf{Z}^{q}=[\bm{z}_{1}^{q},\cdots,\bm{z}_{N}^{q}]\in\mathbb{R}^{d_{1}\times N}$ and $\mathbf{H}^{q}=[\bm{h}_{1}^{q},\cdots,\bm{h}_{N}^{q}]\in\mathbb{R}^{d_{2}\times N}$, respectively. Utilizing a set of aligned, $\ell_{2}$-normalized base sample embeddings $\mathbf{Z}^{b}=[\bm{z}_{1}^{b},\cdots,\bm{z}_{M}^{b}]\in\mathbb{R}^{d_{1}\times M}$ and $\mathbf{H}^{b}=[\bm{h}_{1}^{b},\cdots,\bm{h}_{M}^{b}]\in\mathbb{R}^{d_{2}\times M}$, we can construct relative image and text query representations as $\mathbf{Z}_{rel}^{q}=(\mathbf{Z}^{b})^{\top}\mathbf{Z}^{q}$ and $\mathbf{H}_{rel}^{q}=(\mathbf{H}^{b})^{\top}\mathbf{H}^{q}$. Each relative representation is a single vector of dimension $M$ per query, specifying the cosine similarity of the query sample with all the base samples. We can then use cosine similarity on the relative representations to perform retrieval. Sec. D in the appendix provides a further comparison with our method.
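The construction can be illustrated as follows. This is a sketch under stated assumptions: the function names are hypothetical, and the toy "text encoder" is simply an isometric embedding of the image features into a higher-dimensional space, chosen so that the two relative representations coincide exactly.

```python
import numpy as np

def l2_normalize(X: np.ndarray) -> np.ndarray:
    """Normalize each column (sample) to unit Euclidean norm."""
    return X / np.linalg.norm(X, axis=0, keepdims=True)

def relative_representation(X_b: np.ndarray, X_q: np.ndarray) -> np.ndarray:
    """Return (M, N): cosine similarity of every query to every base sample."""
    return l2_normalize(X_b).T @ l2_normalize(X_q)

rng = np.random.default_rng(0)
d1, d2, M, N = 16, 24, 40, 10
Q, _ = np.linalg.qr(rng.normal(size=(d2, d1)))   # (d2, d1) with orthonormal columns
Z_b, Z_q = rng.normal(size=(d1, M)), rng.normal(size=(d1, N))
H_b, H_q = Q @ Z_b, Q @ Z_q                      # toy text features, same geometry

Z_rel = relative_representation(Z_b, Z_q)        # (M, N)
H_rel = relative_representation(H_b, H_q)        # (M, N)

# Retrieval: cosine similarity between relative representations.
sim = l2_normalize(Z_rel).T @ l2_normalize(H_rel)
retrieved = sim.argmax(axis=1)
```

Although the raw embeddings live in spaces of different dimension ($d_1 \neq d_2$), both relative representations have dimension $M$ and can therefore be compared directly.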

CLIP [40]: We also compare against CLIP, which has been contrastively trained to obtain a joint embedding space, as an upper limit on performance for both retrieval and matching tasks. We perform retrieval using cosine similarity.

For all three methods, caption matching can be achieved by constructing a cost matrix from the cosine similarities and using linear sum assignment to find the permutation matrix.
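The assignment step can be sketched with SciPy's `linear_sum_assignment`. The wrapper name `match_by_assignment` and the synthetic similarity matrix are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_by_assignment(sim: np.ndarray) -> np.ndarray:
    """sim[i, j] is the similarity of image i and caption j.

    Returns perm with perm[i] = index of the caption assigned to image i,
    chosen so that the total similarity of the one-to-one assignment is maximal.
    """
    rows, cols = linear_sum_assignment(sim, maximize=True)
    perm = np.empty(sim.shape[0], dtype=int)
    perm[rows] = cols
    return perm

rng = np.random.default_rng(0)
N = 6
true_perm = rng.permutation(N)              # ground-truth caption for each image
sim = 0.1 * rng.random((N, N))              # weak background similarity
sim[np.arange(N), true_perm] += 1.0         # strong similarity for the true pairs
perm = match_by_assignment(sim)
```

Unlike row-wise argmax retrieval, the assignment enforces that each caption is used exactly once.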

5.3 Downstream Tasks

Table 2: Caption matching and retrieval task performance comparison in cross-domain and in-domain settings. Base samples from COCO are utilized for matching/retrieval tasks on queries from NoCaps (cross-domain) and COCO (in-domain). CLIP-V denotes the vision encoder of CLIP [40]. We use the Large version of all vision encoders. Table A.5 shows the reverse setting.

| Method | Vision Model | NoCaps [2]: Matching accuracy | NoCaps [2]: Top-5 retrieval | COCO [27]: Matching accuracy | COCO [27]: Top-5 retrieval |
|---|---|---|---|---|---|
| Cosine Similarity* | CLIP [40] | 99.5 | 99.6 | 97.1 | 96.1 |
| Linear regression | CLIP-V [40] | 29.3 | 44.7 | 42.7 | 59.1 |
| Linear regression | ConvNeXt [47] | 19.0 | 28.5 | 31.3 | 46.1 |
| Linear regression | DINOv2 [37] | 38.1 | 50.3 | 45.1 | 65.4 |
| Relative representations [34] | CLIP-V [40] | 61.3 | 37.6 | 61.6 | 41.3 |
| Relative representations [34] | ConvNeXt [47] | 25.5 | 17.8 | 38.6 | 34.1 |
| Relative representations [34] | DINOv2 [37] | 46.0 | 46.4 | 47.7 | 52.3 |
| Ours: QAP | CLIP-V [40] | 67.3 | - | 72.3 | - |
| Ours: QAP | ConvNeXt [47] | 46.7 | - | 66.1 | - |
| Ours: QAP | DINOv2 [37] | 57.7 | - | 66.0 | - |
| Ours: Local CKA | CLIP-V [40] | 65.1 | 60.5 | 71.9 | 69.9 |
| Ours: Local CKA | ConvNeXt [47] | 43.7 | 44.4 | 64.8 | 65.5 |
| Ours: Local CKA | DINOv2 [37] | 58.7 | 61.8 | 64.3 | 70.5 |

Caption Matching: Given $N$ query images and their corresponding captions, a query set is constructed by shuffling the captions. The task involves finding the correct permutation over captions for perfect matching. In Retrieval, the objective is, given one caption, to retrieve the correct image from the overall set of $N$ images. The alignment between unaligned vision and text encoders is investigated using our methods on the COCO and NoCaps validation sets.
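The two evaluation metrics can be computed as in the following sketch. The function names and the tiny hand-made score matrix are illustrative assumptions; any method that yields a predicted permutation or a query-by-candidate score matrix can be plugged in.

```python
import numpy as np

def matching_accuracy(pred_perm, true_perm) -> float:
    """Fraction of queries whose assigned item equals the ground-truth item."""
    return float(np.mean(np.asarray(pred_perm) == np.asarray(true_perm)))

def top_k_retrieval_accuracy(sim: np.ndarray, targets, k: int = 5) -> float:
    """sim[i, j]: score of candidate j for query i; targets[i]: correct candidate.

    A query counts as a hit if its target is among the k highest-scoring candidates.
    """
    top_k = np.argsort(-sim, axis=1)[:, :k]
    hits = (top_k == np.asarray(targets)[:, None]).any(axis=1)
    return float(hits.mean())

sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.3, 0.8],
                [0.7, 0.6, 0.1]])
targets = [0, 2, 1]
top1 = top_k_retrieval_accuracy(sim, targets, k=1)   # third query misses at k=1
top2 = top_k_retrieval_accuracy(sim, targets, k=2)   # all targets within the top 2
acc = matching_accuracy([0, 2, 1], [0, 1, 2])        # only the first entry agrees
```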

The COCO dataset [27] comprises over 120,000 images with multiple captions per image. It is used for testing unimodal representation quality via a caption-matching task, utilizing a validation set of 5,000 image-caption pairs. The NoCaps dataset [2] is designed for testing image captioning models on unseen objects, with 166,100 captions for 15,100 images from OpenImages. Its validation set includes novel concepts absent from COCO.

Cross-lingual Caption Matching/Retrieval: The task mirrors the matching and retrieval tasks above but uses captions in another language, for example German. Given $N$ images and shuffled German captions, the objective is to match each image with the correct caption. In retrieval, the goal is to select the most fitting German caption for a given query image from the set.

The XTD-10 dataset [1] enhances COCO2014 with 1,000 human-annotated multi-lingual captions in ten languages for cross-lingual image retrieval and tagging, serving as a zero-shot model benchmark.

ImageNet-100 Classification: The task setup is similar to the conventional classification task, with small differences to account for the methods used. Given $N$ query images and their corresponding classes, image representations are obtained by processing them through a vision encoder. In parallel, textual representations are generated in a multi-step process. Initially, several text captions are derived from the lemmas, definitions, and hypernyms of the WordNet synsets associated with each class. These captions are then passed through the language encoder and averaged to get the text representations. The classification task is performed by retrieving the closest text representation to each image representation using our local CKA metric. We employ the ImageNet-100 dataset, a subset of the larger ImageNet dataset featuring only 100 classes, with 130,000 training images and 50,000 validation images.
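The averaging step that produces one text representation per class can be sketched as below. Everything here is illustrative: `toy_encode` is a hypothetical stand-in for a sentence encoder (it merely counts letters), and the example captions are invented rather than taken from WordNet.

```python
import numpy as np

def toy_encode(text: str) -> np.ndarray:
    """Hypothetical stand-in for a sentence encoder: normalized letter counts."""
    counts = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1
    return counts / max(counts.sum(), 1.0)

def class_text_prototypes(captions_per_class, encode) -> np.ndarray:
    """Average the caption embeddings of each class into one text representation."""
    prototypes = []
    for captions in captions_per_class:
        embeddings = np.stack([encode(c) for c in captions])   # (num_captions, d)
        prototypes.append(embeddings.mean(axis=0))
    return np.stack(prototypes)                                # (num_classes, d)

# Invented captions standing in for lemma-, definition-, and hypernym-based text.
captions_per_class = [
    ["tabby cat", "a domestic cat with striped fur", "feline"],
    ["golden retriever", "a breed of dog with a golden coat", "canine"],
]
prototypes = class_text_prototypes(captions_per_class, toy_encode)
```

Classification then reduces to retrieval: each image representation is scored against the class prototypes and the best-scoring class is predicted.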

5.4 Results

Importance of Good Initialization: For all tasks, we make use of a set of base samples of size $S$ that is kept fixed at 320 samples. The size of the query set is analogously fixed at 500 samples (see Sec. A for more details). These base samples are selected by clustering the image embeddings and choosing the sample closest to each of the $S$ cluster centers. By aligning the initial samples with the diverse cluster centers, we ensure sufficient coverage of the sample space. This enhances the accuracy of the matching process, as the initial alignment closely mirrors the inherent structure and variability within the data. In the case of linear regression, uniform sampling is employed to select the base samples. For relative representations [34], the same clustering methodology is applied to select base samples, ensuring a fair and consistent comparison between all methods.
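The base-sample selection can be sketched with a plain NumPy implementation of Lloyd's $k$-means. This is an illustrative stand-in: the function names, the fixed iteration count, and the random initialization are assumptions, and any standard $k$-means implementation could be substituted.

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, n_iter: int = 50, seed: int = 0) -> np.ndarray:
    """Plain Lloyd's k-means on the rows of X (n, d). Returns (k, d) centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]    # copy of k random rows
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (n, k)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):              # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def select_base_indices(X: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Index of the sample closest to each of the k cluster centers."""
    centers = kmeans(X, k, seed=seed)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=0)              # shape (k,)

rng = np.random.default_rng(0)
image_embeddings = rng.normal(size=(300, 8))    # toy embeddings, rows are samples
base_idx = select_base_indices(image_embeddings, k=20)
```

In rare cases two centers may share the same nearest sample, so a production version might deduplicate the returned indices.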

COCO and NoCaps Caption Matching: We present the results of cross-domain and in-domain caption matching/retrieval, as detailed in Table 2. We tested each baseline against three different vision models, while employing a consistent language model—specifically, the all-roberta-large-v1. The vision models utilized are OpenAI’s CLIP ViT-L/14, the ConvNeXT-Base model (trained on the ImageNet-22k dataset at a resolution of 224x224), and the ViT-L/14 model trained using the DINOv2 method. It is important to note that the first row of the results table features vision and language models both being OpenAI’s CLIP ViT-L/14. To effectively analyze cross-domain capabilities, our experiment design involved the use of the COCO validation set as the source of the base set and the NoCaps validation set for querying. Additionally, in-domain results are shown, when using COCO validation for both base and queries. We uniformly sample the query set and average the results over three different seeds. Although CLIP’s cosine similarity metric emerges as the most robust due to the training paradigm inherent in CLIP models, our methods demonstrate commendable performance without necessitating any training. The DINOv2 model, trained solely through self-supervision, demonstrates the formation of semantic concepts independently of language supervision. This is evident in its remarkable top-5 retrieval scores of 70.5% and 61.8% on COCO and NoCaps datasets when coupled with an unaligned language encoder through our Local Kernel CKA method. However, the best-performing vision encoder is CLIP’s vision encoder which has been trained using language supervision.

ImageNet-100 Classification: In Table 3, we detail the performance of our methods on the ImageNet-100 classification task. Mirroring our approach in cross-domain matching and retrieval, we evaluated three different vision models for each method. Notably, the first row of the table highlights the performance using CLIP’s embedding cosine similarity. The results are averaged over three different seeds for sampling the query set. A significant observation from this table is the comparatively narrower performance gap between the CLIP’s cosine similarity and our methods, as well as the baseline linear regression method, in contrast to the results observed in cross-domain caption matching/retrieval tasks.

It is interesting that the ConvNeXt encoder trained on ImageNet achieves a top-1 classification accuracy more than 14 percentage points higher than CLIP and DINOv2, whereas on the caption matching task DINOv2 and CLIP perform much better.

Cross-lingual Caption Retrieval: The results of cross-lingual caption matching/retrieval are presented in Table 4 for the languages in the XTD-10 dataset. OpenAI CLIP's ViT-L vision encoder, trained on English image-caption pairs, and a multilingual sentence transformer, paraphrase-multilingual-mpnet-base-v2, were utilized for this task. The accuracy of CLIP's cosine retrieval method exhibits a significant drop when applied to languages other than English. For example, CLIP's retrieval at 5 experiences a drop of about 30 points when switching from English to other Latin-alphabet languages (Spanish, French, German, and Italian). For non-Latin alphabet languages such as Korean, Chinese, Turkish, etc., CLIP's performance decreases substantially, collapsing to nearly zero, primarily due to most words resulting in unknown tokens. In contrast, the QAP and local CKA matching methods demonstrate consistent performance across all languages, including non-Latin languages, which we attribute to the robustness of a multilingual sentence transformer trained solely on text. On average, QAP surpasses CLIP by 12% in the caption matching task and also outperforms other baselines like relative representations and linear regression. For retrieval at 5, the local CKA-based method exceeds CLIP's performance by over 17%.

It is possible to push the performance further by using language-specific sentence encoders, and we report these results for a few languages in Sec. I of the supplementary material. This is a practical application of our method: a well-trained English CLIP model's vision encoder can be turned into a CLIP-like model for any low-resource language, provided that a text-only Sentence Transformer trained on that language is available.

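Table 3: ImageNet-100 classification performance comparison. We observe a narrow performance gap between the CLIP model and our methods. CLIP-V denotes the vision encoder of CLIP.

| Method | Vision Model | Top 1 | Top 5 |
|---|---|---|---|
| Cosine Similarity* | CLIP | 86.1 | 99.2 |
| Linear Regression | CLIP-V | 76.1 | 93.0 |
| Linear Regression | ConvNeXt | 84.5 | 95.4 |
| Linear Regression | DINOv2 | 73.5 | 92.1 |
| Relative representations [34] | CLIP-V | 8.90 | 30.3 |
| Relative representations [34] | ConvNeXt | 7.20 | 15.7 |
| Relative representations [34] | DINOv2 | 49.7 | 75.5 |
| Local CKA | CLIP-V | 68.7 | 91.2 |
| Local CKA | ConvNeXt | 83.3 | 95.8 |
| Local CKA | DINOv2 | 67.7 | 88.3 |
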
Table 4: Cross-lingual caption matching and retrieval performance comparison. Using QAP and local CKA-based methods, we are able to do cross-lingual caption matching/retrieval using CLIP's ViT-L vision encoder and a multilingual sentence transformer, paraphrase-multilingual-mpnet-base-v2. While CLIP performs well on the Latin languages, it degrades on non-Latin languages. In comparison, our QAP and Local CKA-based methods perform comparably on Latin languages while outperforming CLIP on non-Latin languages, highlighting the efficacy of our training-free transfer approach. See Table A.6 and Table A.7 in the appendix for additional results.

| Group | Language | Kernel CKA: CLIP | Kernel CKA: Ours | Matching Accuracy: CLIP | Matching Accuracy: Relative [34] | Matching Accuracy: Linear | Matching Accuracy: Ours (QAP) | Retrieval @ 5: CLIP | Retrieval @ 5: Ours (Local) |
|---|---|---|---|---|---|---|---|---|---|
| Latin | de | 0.472 | 0.627 | 41.8 | 35.0 | 34.0 | 39.6 | 65.1 | 56.7 |
| Latin | en | 0.567 | 0.646 | 81.5 | 52.5 | 40.9 | 51.6 | 92.5 | 69.0 |
| Latin | es | 0.471 | 0.634 | 50.2 | 37.8 | 31.7 | 41.4 | 68.5 | 61.6 |
| Latin | fr | 0.477 | 0.624 | 49.4 | 37.5 | 30.7 | 40.2 | 68.7 | 57.6 |
| Latin | it | 0.472 | 0.638 | 41.0 | 37.2 | 34.9 | 38.5 | 61.3 | 59.7 |
| Non-Latin | jp | 0.337 | 0.598 | 13.2 | 28.3 | 23.5 | 30.5 | 30.0 | 49.4 |
| Non-Latin | ko | 0.154 | 0.620 | 0.50 | 30.4 | 23.5 | 30.9 | 3.30 | 53.4 |
| Non-Latin | pl | 0.261 | 0.642 | 5.40 | 36.6 | 30.2 | 40.2 | 18.8 | 59.5 |
| Non-Latin | ru | 0.077 | 0.632 | 0.80 | 31.9 | 30.7 | 35.1 | 4.10 | 53.2 |
| Non-Latin | tr | 0.301 | 0.624 | 4.30 | 35.8 | 29.6 | 38.9 | 15.2 | 59.3 |
| Non-Latin | zh | 0.133 | 0.641 | 2.70 | 36.5 | 31.1 | 40.3 | 8.90 | 57.8 |
| Avg. | | | | 26.4 | 36.3 | 30.9 | 38.8 | 39.6 | 57.9 |

Figure 2: Kernel CKA and QAP matching accuracy are correlated with the size and quality of the training set. Here the language encoder is kept fixed to the best BERT-based sentence encoder (i.e., All-Roberta-large-v1). There is a clear correlation between CKA and QAP matching accuracy across all architectures, training paradigms, and data regimes.

5.5 Matching complexity

Table 5: Run times for different methods

| Method | QAP | Local CKA | Relative | Linear |
|---|---|---|---|---|
| Run times | 40 seconds | 5 mins | 1 second | 1 second |
| Complexity | $\mathcal{O}(n^{3})$ | $\mathcal{O}(n^{4})$ | $\mathcal{O}(n^{2})$ | $\mathcal{O}(n\times d)$ |

In Table 5, we go over the time complexity and runtimes of QAP matching and local CKA-based retrieval in comparison to the other baselines for matching when the numbers of base samples and query samples are 320 and 500, respectively. For all time complexities, we assume the number of base samples $m$ to be of the order of the number of query samples $n$. QAP uses the seeded version of the fast QAP algorithm from the SciPy library, which has a worst-case time complexity of $\mathcal{O}(n^{3})$ [17], while local CKA retrieval requires constructing a graph over all the query image and text pairs, $\mathcal{O}(n^{2})$, using local CKA, which is also $\mathcal{O}(n^{2})$, resulting in $\mathcal{O}(n^{4})$. Relative involves the calculation of the relative representations for every query image and text pair, resulting in a time complexity of $\mathcal{O}(n^{2})$, but it is fast due to highly optimized algorithms for matrix multiplications in PyTorch [38]. Linear has a time complexity of $\mathcal{O}(nd)$, where $n$ is the number of samples and $d$ is the number of dimensions. Note that QAP runs on the CPU, and a CUDA-optimized version could bring the runtime down further from 40 seconds. An efficient implementation of Local Kernel CKA is also possible, where the CKA of base samples is precalculated and the graph is constructed in an additive manner, which would bring down the time complexity to $\mathcal{O}(n^{3})$. For both relative and linear matching, we make use of SciPy's modified Jonker-Volgenant algorithm [13] for linear sum assignment, which has a worst-case time complexity of $\mathcal{O}(n^{3})$.
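A seeded matching call with SciPy's fast approximate QAP solver can be sketched as follows. This is an illustration under stated assumptions, not the released code: the wrapper `seeded_qap_match`, the use of plain (uncentered) linear kernels, and the toy data, in which the "text" features are a rotated and shuffled copy of the "image" features, are all introduced here. Because FAQ is a heuristic, the sketch does not claim that the exact permutation is always recovered.

```python
import numpy as np
from scipy.optimize import quadratic_assignment

def seeded_qap_match(K: np.ndarray, L: np.ndarray, n_seeds: int) -> np.ndarray:
    """Match nodes of L to nodes of K with the first n_seeds nodes fixed as seeds.

    K, L: (n, n) symmetric similarity matrices. Returns perm such that node i
    of K is matched to node perm[i] of L.
    """
    seeds = np.column_stack([np.arange(n_seeds), np.arange(n_seeds)])
    res = quadratic_assignment(
        K, L, method="faq",
        options={"maximize": True, "partial_match": seeds},
    )
    return res.col_ind

rng = np.random.default_rng(0)
d, m, n_q = 16, 20, 10                  # feature dim, base (seed) size, query size
n = m + n_q
X = rng.normal(size=(d, n))             # toy "image" features, columns are samples
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))      # rotation standing in for another encoder
order = np.concatenate([np.arange(m), m + rng.permutation(n_q)])  # shuffle queries only
Y = Q @ X[:, order]                     # toy "text" features
K, L = X.T @ X, Y.T @ Y                 # linear kernels over base + query samples

perm = seeded_qap_match(K, L, n_seeds=m)
true_perm = np.argsort(order)           # ground-truth matching
accuracy = float(np.mean(perm[m:] == true_perm[m:]))
```

The seeds pin the base samples in place, so the solver only has to search over permutations of the query block.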

6 Analysis

This section focuses on how training paradigms, data regimes, and encoder size/architecture influence a vision encoder’s ability to represent the world similarly to a language encoder. This is assessed by comparing the semantic alignment of their representation spaces using CKA as well as QAP matching accuracy. Figure 2 compares the kernel CKA and caption matching accuracy of different vision encoders with a fixed text-encoder (i.e., All-Roberta-large-v1), against the training datasets on which the vision encoder was trained for all pairs in the COCO captions validation set. The findings are summarized below:

Scale and quality of the training dataset result in encoders with high semantic alignment with the language space: It is observed that SSL methods like DINOv2 can learn semantic concepts in a relative manner even without language supervision during training. The CKA and QAP matching accuracy for DINOv2 embeddings are comparable to those of CLIP models, despite lacking language supervision and having significantly less data (LVD-142M's 142M vs. OpenAI CLIP's 400M). A general trend emerges where more training data leads to semantically richer visual embeddings, evident when comparing CKA and QAP accuracies from ImageNet-1k to DFN-5B datasets. Notably, training on a curated dataset proves more effective than training on an uncurated dataset of the same size, especially for smaller models. This is illustrated by the higher CKA and QAP accuracy of ViT-Large trained on the curated DFN-2B dataset compared to ViT-Large/Giant and ConvNeXt-XXLarge trained on LAION-2B. Additionally, SSL methods show less semantic consistency when trained on ImageNet-1k, as indicated by the clear difference in QAP accuracies between DINO trained on ImageNet-1k and DINOv2 trained on LVD-142M.

Vision Encoders Trained with Language Supervision Exhibit Greater Semantic Alignment with Language Encoders: In line with the findings of Merullo et al. [31], it is observed in our experiments that vision encoders trained with more language supervision on datasets of comparable size exhibit a higher degree of semantic alignment with language encoders compared to self-supervised methods. For example, ViT-Large trained on CLIP-400M with language supervision demonstrates superior caption-matching capabilities compared to DINOv2's ViT-Large trained on LVD-142M. Similarly, we verify that class label supervision, like that from ImageNet, leads to more semantically aligned image encoders than self-supervision when similarly sized models are compared on ImageNet-1k. For example, all supervised encoders trained on ImageNet-1k have higher CKA as well as QAP matching accuracy than all the self-supervised models.

7 Ablations

This section justifies our method choices through ablation studies on clustering, stretching, and the global CKA metric. We demonstrate the impact of these components on the performance of our methods, primarily through Table 6, which reports the effectiveness of QAP and the local CKA metric under various configurations, with each main component either included or omitted. Notably, in instances where the CKA metric is not used, we opt for normalized correlation matrices for each graph. The empirical results presented are derived from the caption matching/retrieval task, utilizing base and query sets of sizes 320 and 500, respectively, both extracted from the COCO validation set.

Choice of the metric: CKA is more beneficial than using just the scaled correlation matrix to represent the semantic relationships in an embedding space as matching accuracy increases from 10.1% to 48.8%. The choice of a robust metric is core to aligning vision and language latent spaces.
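For reference, kernel CKA between two kernel matrices can be computed as in the sketch below, which uses the standard HSIC-based form with linear kernels. The function names and the toy features are assumptions for illustration; the normalization constants of HSIC cancel in the ratio and are therefore omitted.

```python
import numpy as np

def center(K: np.ndarray) -> np.ndarray:
    """Double-center a kernel matrix: H K H with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def cka(K: np.ndarray, L: np.ndarray) -> float:
    """CKA(K, L) = HSIC(K, L) / sqrt(HSIC(K, K) * HSIC(L, L))."""
    Kc, Lc = center(K), center(L)
    hsic_kl = np.sum(Kc * Lc)            # equals tr(Kc @ Lc) for symmetric matrices
    return float(hsic_kl / np.sqrt(np.sum(Kc * Kc) * np.sum(Lc * Lc)))

rng = np.random.default_rng(0)
d1, d2, n = 8, 12, 50
Z = rng.normal(size=(d1, n))                       # toy image features (columns = samples)
Q, _ = np.linalg.qr(rng.normal(size=(d2, d1)))     # isometry into a d2-dimensional space
H_txt = Q @ Z                                      # toy text features with the same geometry
R = rng.normal(size=(d1, n))                       # unrelated random features

cka_same = cka(Z.T @ Z, H_txt.T @ H_txt)
cka_rand = cka(Z.T @ Z, R.T @ R)
```

Two embedding spaces with identical pairwise geometry reach a CKA of 1 even when their dimensions differ, whereas unrelated features score markedly lower.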

Impact of Stretching: It is clear that stretching facilitates better alignment of embeddings in our methods as stretching spreads the representations out in each modality without sacrificing the relative positions of the different embeddings within each embedding space. This is reflected in the increase of QAP accuracy from 48.8% to 57.3%.

Clustering vs. Uniform Sampling: The choice of the base set is important in QAP matching and local CKA retrieval, as it measures any query pair alignment with the base set. A diverse base set is essential to capture a broad semantic range, and clustering within one of the embedding spaces aids in achieving this diversity. The third and fifth rows of the table demonstrate that clustering enhances the QAP performance from 57.3% to 65.5%. Consequently, these results highlight that all the components together significantly enhance the efficacy of our proposed approach.
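To make the role of the base set concrete, the sketch below scores a candidate image-text pair by the CKA of the base set augmented with that pair. This reading of the local CKA score is an assumption made for illustration (the exact definition is given in the method section), as are the function names, the linear kernels, and the toy data.

```python
import numpy as np

def cka(K: np.ndarray, L: np.ndarray) -> float:
    """Kernel CKA between two (n, n) kernel matrices."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc, Lc = H @ K @ H, H @ L @ H
    return float(np.sum(Kc * Lc) / np.sqrt(np.sum(Kc * Kc) * np.sum(Lc * Lc)))

def local_cka(Z_b, H_b, z, h) -> float:
    """Assumed score: CKA over the aligned base set plus one candidate (z, h) pair."""
    Z = np.column_stack([Z_b, z])        # (d1, M + 1)
    Hm = np.column_stack([H_b, h])       # (d2, M + 1)
    return cka(Z.T @ Z, Hm.T @ Hm)

def local_cka_scores(Z_b, H_b, Z_q, H_q) -> np.ndarray:
    """Score every (query image i, query text j) pair; returns an (N, N) matrix."""
    N = Z_q.shape[1]
    scores = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            scores[i, j] = local_cka(Z_b, H_b, Z_q[:, i], H_q[:, j])
    return scores

rng = np.random.default_rng(0)
d1, d2, M, N = 10, 14, 30, 5
Q, _ = np.linalg.qr(rng.normal(size=(d2, d1)))   # isometry standing in for a text encoder
Z_b, Z_q = rng.normal(size=(d1, M)), rng.normal(size=(d1, N))
H_b, H_q = Q @ Z_b, Q @ Z_q

scores = local_cka_scores(Z_b, H_b, Z_q, H_q)
retrieved = scores.argmax(axis=1)                # best-scoring text for each image
```

Since every score is computed relative to the same base set, a base set that covers only a narrow semantic range gives little signal for distinguishing candidate pairs, which is why a diverse base set matters.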

Table 6: Impact of clustering and stretching. The matching and retrieval performance is the best when both clustering and stretching are employed, which justifies this choice.
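
| Clustering | Stretching | CKA | QAP Matching | Local CKA Matching | Local CKA Retrieval @ 5 |
|---|---|---|---|---|---|
| | | | 10.1 | 16.2 | 1.0 |
| | | | 48.8 | 48.5 | 60.2 |
| | | | 57.3 | 56.7 | 73.0 |
| | | | 56.2 | 55.1 | 66.4 |
| | | | 65.5 | 63.3 | 77.2 |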

8 Conclusion

In this work, we ask the question, ‘Do vision encoders and language encoders represent the world similarly?’ and study this using CKA and a caption-matching task. We find that well-trained vision encoders on sufficiently large datasets exhibit surprisingly high semantic similarity with language encoders comparable to aligned encoders, irrespective of the training paradigm. Inspired by this, we draw parallels between CKA and the QAP matching objective and use seeded graph matching to align vision and language encoders by maximizing CKA. We also devise a local CKA-based metric to enable retrieval between unaligned vision and language encoders demonstrating a better performance than that of relative representations on cross-domain and cross-lingual caption matching/retrieval tasks, facilitating zero-shot latent space communication between unaligned encoders.

References

  • Aggarwal and Kale [2020] Pranav Aggarwal and Ajinkya Kale. Towards zero-shot cross-lingual image retrieval. arXiv preprint arXiv:2012.05107, 2020.
  • Agrawal et al. [2019] Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8948–8957, 2019.
  • Antonello et al. [2021] Richard Antonello, Javier S Turek, Vy Vo, and Alexander Huth. Low-dimensional structure in the space of language representations is reflected in brain responses. Advances in neural information processing systems, 2021.
  • Bansal et al. [2021] Yamini Bansal, Preetum Nakkiran, and Boaz Barak. Revisiting model stitching to compare neural representations. Advances in neural information processing systems, 2021.
  • Barannikov et al. [2021] Serguei Barannikov, Ilya Trofimov, Nikita Balabin, and Evgeny Burnaev. Representation topology divergence: A method for comparing neural network representations. arXiv preprint arXiv:2201.00058, 2021.
  • Bonheme and Grzes [2022] Lisa Bonheme and Marek Grzes. How do variational autoencoders learn? insights from representational similarity. arXiv preprint arXiv:2205.08399, 2022.
  • Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Caron et al. [2021] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
  • Changpinyo et al. [2021] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021.
  • Chen et al. [2020] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
  • Conneau et al. [2017] Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Word translation without parallel data. arXiv preprint arXiv:1710.04087, 2017.
  • Cortes et al. [2012] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Algorithms for learning kernels based on centered alignment. The Journal of Machine Learning Research, 13(1):795–828, 2012.
  • Crouse [2016] David F Crouse. On implementing 2d rectangular assignment algorithms. IEEE Transactions on Aerospace and Electronic Systems, 52(4):1679–1696, 2016.
  • Csiszárik et al. [2021] Adrián Csiszárik, Péter Korösi-Szabó, Akos K Matszangosz, Gergely Papp, and Dániel Varga. Similarity and matching of neural network representations. arXiv preprint arXiv:2110.14633, 2021.
  • Deng et al. [2009] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2009.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Fishkind et al. [2019] Donniell E Fishkind, Sancar Adali, Heather G Patsolic, Lingyao Meng, Digvijay Singh, Vince Lyzinski, and Carey E Priebe. Seeded graph matching. Pattern recognition, 87:203–215, 2019.
  • Gretton et al. [2005] Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with hilbert-schmidt norms. In International conference on algorithmic learning theory, pages 63–77. Springer, 2005.
  • Gutmann and Hyvärinen [2010] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304. JMLR Workshop and Conference Proceedings, 2010.
  • Jia et al. [2021] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.
  • Kingma and Welling [2014] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. ICLR, 2014.
  • Kornblith et al. [2019] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International conference on machine learning, pages 3519–3529. PMLR, 2019.
  • Kuhn [1955] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955.
  • Lenc and Vedaldi [2015] Karel Lenc and Andrea Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 991–999, 2015.
  • Lester et al. [2021] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
  • Li et al. [2015] Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John Hopcroft. Convergent learning: Do different neural networks learn the same representations? arXiv preprint arXiv:1511.07543, 2015.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
  • Liu et al. [2019] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • Liu et al. [2022] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022.
  • Ma et al. [2020] Wan-Duo Kurt Ma, JP Lewis, and W Bastiaan Kleijn. The hsic bottleneck: Deep learning without back-propagation. In Proceedings of the AAAI conference on artificial intelligence, pages 5085–5092, 2020.
  • Merullo et al. [2022] Jack Merullo, Louis Castricato, Carsten Eickhoff, and Ellie Pavlick. Linearly mapping from image to text space. arXiv preprint arXiv:2209.15162, 2022.
  • Mikolov et al. [2022] Tomas Mikolov, Quoc V Le, and Ilya Sutskever. Exploiting similarities among languages for machine translation (2013). arXiv preprint arXiv:1309.4168, 2022.
  • Morcos et al. [2018] Ari Morcos, Maithra Raghu, and Samy Bengio. Insights on representational similarity in neural networks with canonical correlation. Advances in neural information processing systems, 31, 2018.
  • Moschella et al. [2022] Luca Moschella, Valentino Maiorca, Marco Fumero, Antonio Norelli, Francesco Locatello, and Emanuele Rodolà. Relative representations enable zero-shot latent space communication. In The Eleventh International Conference on Learning Representations, 2022.
  • Norelli et al. [2022] Antonio Norelli, Marco Fumero, Valentino Maiorca, Luca Moschella, Emanuele Rodola, and Francesco Locatello. Asif: Coupled data turns unimodal models to multimodal without training. arXiv preprint arXiv:2210.01738, 2022.
  • Oord et al. [2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • Radford et al. [2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Raghu et al. [2017] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in neural information processing systems, 30, 2017.
  • Raghu et al. [2021] Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? Advances in Neural Information Processing Systems, 34:12116–12128, 2021.
  • Reimers and Gurevych [2019] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2019.
  • Tsitsulin et al. [2019] Anton Tsitsulin, Marina Munkhoeva, Davide Mottin, Panagiotis Karras, Alex Bronstein, Ivan Oseledets, and Emmanuel Müller. The shape of data: Intrinsic distance for data distributions. arXiv preprint arXiv:1905.11141, 2019.
  • Vogelstein et al. [2015] Joshua T Vogelstein, John M Conroy, Vince Lyzinski, Louis J Podrazik, Steven G Kratzer, Eric T Harley, Donniell E Fishkind, R Jacob Vogelstein, and Carey E Priebe. Fast approximate quadratic programming for graph matching. PLOS one, 10(4):e0121002, 2015.
  • Vulić et al. [2020] Ivan Vulić, Sebastian Ruder, and Anders Søgaard. Are all good word vector spaces isomorphic? arXiv preprint arXiv:2004.04070, 2020.
  • Woo et al. [2023] Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. Convnext v2: Co-designing and scaling convnets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16133–16142, 2023.
  • Wu et al. [2020] John M Wu, Yonatan Belinkov, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. Similarity analysis of contextual word representation models. arXiv preprint arXiv:2005.01172, 2020.

Supplementary Material

Appendix A Varying the Number of Samples

In Figure A.1, we show QAP and local CKA matching accuracies and retrieval scores for different numbers of base samples $M$, keeping the number of query samples $N$ constant at 500. As $M$ increases, the accuracy and retrieval scores improve, demonstrating the importance of seed initialization for matching algorithms. Figure A.2 shows the accuracy and retrieval scores as the number of query samples $N$ changes, keeping the number of base samples constant at $M=320$. Both the QAP matching accuracy and the local CKA-based retrieval scores decrease as $N$ increases, but we still obtain 70% matching accuracy when $\frac{M}{N}=1$.
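The seeded matching setup studied here can be sketched with SciPy's FAQ solver, passing the $M$ base pairs as fixed seeds and letting the optimizer match the $N$ queries. This is a minimal illustration on synthetic data, not the released implementation: the cosine kernel, the toy encoders (two rotations of one shared latent space), and the helper names `cosine_kernel` and `seeded_qap_match` are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import quadratic_assignment


def cosine_kernel(X):
    """Pairwise cosine similarities between the rows of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T


def seeded_qap_match(K_a, K_b, num_seeds):
    """Match samples of modality A to samples of modality B.

    K_a, K_b: (M + N, M + N) kernel matrices. The first `num_seeds`
    rows/columns of both are base samples with a known pairing.
    Returns `perm`, where sample i of A is matched to sample perm[i] of B.
    """
    seeds = np.arange(num_seeds)
    partial = np.stack([seeds, seeds], axis=1)  # seed i of A <-> seed i of B
    result = quadratic_assignment(
        K_a, K_b, method="faq",
        options={"maximize": True, "partial_match": partial},
    )
    return result.col_ind


# Toy data (assumption): both "encoders" are rotations of a shared latent space.
rng = np.random.default_rng(0)
M, N, d = 20, 20, 16  # number of base (seed) samples, query samples, latent dim
Z = rng.normal(size=(M + N, d))
X_img = Z @ np.linalg.qr(rng.normal(size=(d, d)))[0]
X_txt = Z @ np.linalg.qr(rng.normal(size=(d, d)))[0]

# Shuffle the query captions so their true correspondence is hidden.
shuffle = np.concatenate([np.arange(M), M + rng.permutation(N)])
X_txt = X_txt[shuffle]

perm = seeded_qap_match(cosine_kernel(X_img), cosine_kernel(X_txt), num_seeds=M)
# Query image i is correct if its matched caption came from latent sample i.
accuracy = np.mean(shuffle[perm[M:]] == np.arange(M, M + N))
print(f"query matching accuracy: {accuracy:.2f}")
```

Changing `M` and `N` in this toy setup gives a quick way to explore how the ratio of seeds to queries affects matching accuracy.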

Figure A.1: Accuracy and Retrieval Scores of QAP Matching and Local CKA-based retrieval as the number of base samples is varied, keeping the number of query samples fixed at 500.
Figure A.2: Accuracy and Retrieval Scores of QAP Matching and Local CKA-based retrieval as the number of query samples is varied, keeping the number of base samples fixed at 320.
Table A.1: Image Encoders Summary. List of Hugging Face vision encoder names and information regarding their train data, paradigm, dataset size, model type, and model sizes for the comparison in Figure A.3 and Table A.13.

| Model Name | Training Data | Training Paradigm | Model Type | Training Data Size (M images) | Model Size (M params) |
|---|---|---|---|---|---|
| facebook\dino-vits8 | ImageNet-1k | DinoV1 | vit-small | 1.2 | 22 |
| openai\clip-vit-large-patch14-336 | CLIP-400M | Language Supervised | vit-large | 400 | 307 |
| facebook\dinov2-base | LVD-142M | DinoV2 | vit-base | 142 | 86 |
| facebook\dinov2-small | LVD-142M | DinoV2 | vit-small | 142 | 22 |
| facebook\dinov2-large | LVD-142M | DinoV2 | vit-large | 142 | 307 |
| facebook\dinov2-giant | LVD-142M | DinoV2 | vit-giant | 142 | 1000 |
| openai\clip-vit-base-patch16 | CLIP-400M | Language Supervised | vit-base | 400 | 86 |
| facebook\dino-vitb8 | ImageNet-1k | DinoV1 | vit-base | 1.2 | 86 |
| timm\convnext_base.fb_in1k | ImageNet-1k | Supervised | convnext-base | 1.2 | 89 |
| timm\convnext_tiny.fb_in1k | ImageNet-1k | Supervised | convnext-tiny | 1.2 | 29 |
| facebook\convnext-base-224-22k | ImageNet-21k | Supervised | convnext-base | 14.1 | 89 |
| timm\convnext_base.fb_in22k | ImageNet-21k | Supervised | convnext-base | 14.1 | 89 |
| timm\vit_base_patch16_224.augreg_in21k | ImageNet-21k | Supervised | vit-base | 14.1 | 86 |
| timm\vit_small_patch16_224.augreg_in1k | ImageNet-1k | Supervised | vit-small | 1.2 | 22 |

Table A.2: Text Encoders Summary. List of Hugging Face text encoder names and information regarding their train data, paradigm, dataset size, and model sizes for the comparison in Figure A.3 and Table A.13.

| Model Name | Model Size (M params) | Train Data | Training Paradigm | Training Data Size |
|---|---|---|---|---|
| all-mpnet-base-v1 | 109 | multiple datasets | contr. sent. | 1.12B sent. pairs |
| gtr-t5-base | 110 | multiple datasets | contr. sent. | 2B sent. pairs |
| paraphrase-MiniLM-L12-v2 | 33 | multiple datasets | contr. sent. | 10M sent. pairs |
| gtr-t5-large | 335 | multiple datasets | contr. sent. | 2B sent. pairs |
| all-mpnet-base-v2 | 109 | multiple datasets | contr. sent. | 1.12B sent. pairs |
| average_word_embeddings_komninos | 66 | Wiki2015 | skipgram | 2 billion words |
| average_word_embeddings_glove.6B.300d | 120 | Wiki2014, GigaWord 5 | glove | 6 billion tokens |
| all-MiniLM-L12-v1 | 33 | multiple datasets | contr. sent. | 1B sent. pairs |
| openai_clip-vit-large-patch14 | 123 | CLIP-400M | contr. img-text | 400M image-text pairs |
| all-MiniLM-L12-v2 | 33 | multiple datasets | contr. sent. | 1B sent. pairs |
| all-MiniLM-L6-v2 | 22 | multiple datasets | contr. sent. | 1B sent. pairs |
| sentence-t5-base | 110 | multiple datasets | contr. sent. | 2B sent. pairs |
| msmarco-distilbert-dot-v5 | 66 | MSMarco | contr. sent. | 500k sent. pairs |
| paraphrase-MiniLM-L3-v2 | 17 | multiple datasets | contr. sent. | 10M sent. pairs |
| paraphrase-albert-small-v2 | 11 | multiple datasets | contr. sent. | 10M sent. pairs |
| all-MiniLM-L6-v1 | 22 | multiple datasets | contr. sent. | 1B sent. pairs |
| all-distilroberta-v1 | 82 | OpenWebTextCorpus | contr. sent. | 1B sent. pairs |
| sentence-t5-large | 335 | multiple datasets | contr. sent. | 2B sent. pairs |
| All-Roberta-large-v1 | 355 | multiple datasets | contr. sent. | 1B sent. pairs |
| msmarco-bert-base-dot-v5 | 109 | MSMarco | contr. sent. | 500k sent. pairs |
| sentence-t5-xxl | 4870 | multiple datasets | contr. sent. | 2B sent. pairs |
| paraphrase-TinyBERT-L6-v2 | 66 | multiple datasets | contr. sent. | 10M sent. pairs |
| sentence-t5-xl | 1240 | multiple datasets | contr. sent. | 2B sent. pairs |
| gtr-t5-xxl | 4870 | multiple datasets | contr. sent. | 2B sent. pairs |
| paraphrase-distilroberta-base-v2 | 82 | multiple datasets | contr. sent. | 10M sent. pairs |
| gtr-t5-xl | 1240 | multiple datasets | contr. sent. | 2B sent. pairs |

Appendix B Vision and Text Encoders

CKA is measured on combinations of a wide variety of vision and text encoders to examine the impact of model size, dataset regime, and training paradigm on vision-language alignment. This analysis also identifies the best pair of unaligned vision and text encoders for caption-matching tasks. Hugging Face’s transformers library is used for vision models, while the sentence transformers library is used for text encoders. Table A.1 details the vision models, their training data, paradigms, and model types and sizes. Similarly, Table A.2 presents information on the text encoders. The study covers three training paradigms for vision models: supervised, self-supervised, and language-supervised, with training dataset sizes ranging from 1 million to 400 million images. The text encoders are predominantly sentence transformers, trained for semantic search using a contrastive sentence-pair loss, with dataset sizes varying from 500k to 2B sentence pairs.

Kernel CKA of various model combinations is presented in Table A.13. The top-performing text encoder trained exclusively on text information is identified as All-Roberta-large-v1 paired with DINOv2, achieving a CKA of 0.706. Consequently, All-Roberta-large-v1 is selected as the text encoder for all tasks and experiments in the main paper, except for cross-lingual experiments. For these, paraphrase-multilingual-mpnet-base-v2 emerges as the most effective text encoder.
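For readers who want to reproduce this kind of score, kernel CKA can be sketched as a normalized HSIC between two kernel matrices built on paired samples, in the spirit of Kornblith et al. [2019]. The snippet is an illustration under assumed choices, not the exact configuration behind Table A.13: the RBF bandwidth rule (median squared pairwise distance), the biased HSIC estimator, and the function names are assumptions.

```python
import numpy as np


def rbf_kernel(X, sigma_scale=1.0):
    """RBF kernel on the rows of X.

    Assumption: the bandwidth sigma^2 is sigma_scale^2 times the median
    squared distance over all distinct pairs of samples.
    """
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    off_diag = d2[np.triu_indices(len(X), k=1)]
    sigma2 = sigma_scale**2 * np.median(off_diag)
    return np.exp(-d2 / (2.0 * sigma2))


def hsic(K, L):
    """Biased empirical HSIC between two n x n kernel matrices."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2


def cka(K, L):
    """Centered kernel alignment: HSIC normalized to be at most 1."""
    return hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L))


# Toy check: a rotated copy of the same embeddings has CKA of (numerically) 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 16))
Q = np.linalg.qr(rng.normal(size=(16, 16)))[0]
print(cka(rbf_kernel(X), rbf_kernel(X @ Q)))
```

In practice `X` and the second argument would hold image and caption embeddings of the same $n$ image-caption pairs, with rows in matching order; the two embedding dimensions do not need to agree because only the $n \times n$ kernels are compared.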

Figure A.3 illustrates the relationship between CKA and text model size across different vision encoder types, training paradigms, and sizes. It is observed that text model size has a limited impact on achieving high CKA with the vision model. Well-trained vision models on large datasets consistently show high kernel CKA with text encoders, regardless of text model size. For instance, language-supervised models (green) and DINOv2 models, which are trained on datasets with hundreds of millions of instances (such as LVD-142M’s 142 million images and CLIP-400M’s 400 million image-caption pairs), demonstrate high CKA with language encoders of various sizes.

Figure A.3: CKA vs. text model size for vision encoders of different training paradigms, model types, and model sizes. We see that text model size is not the most important factor for high semantic similarity with vision models.

Appendix C Layerwise CKA Analysis

Figure A.4, Table A.3, and Table A.4 show the progression of CKA and QAP matching scores across layers for both text and vision models. We explore two configurations: one compares layers of All-Roberta-large-v1 and DINOv2 ViT-L/14, while the other examines CLIP’s vision and text hidden states. For CLIP, the layer $proj$ denotes the final image and text embeddings after they have passed through the final projection layers. In the first configuration, CKA and QAP scores improve gradually, and the choice of image model layer has a far greater effect on the similarity than the choice of text model layer. In the second configuration, by contrast, high QAP matching scores in CLIP appear only in the very last layers of the vision and text encoders.

As shown in Table A.3, the CLIP model obtains a significant jump in matching score after the projection head, highlighting the central role of this layer in aligning text and image modalities within a unified representation space. Here, the QAP matching accuracy does not increase linearly over the layers for CLIP, but rather jumps suddenly from 0.29 to 0.79 between the last layer and the projection head. This suggests that most of the CLIP performance comes from the projection heads, which ensure a high statistical similarity. In contrast, Table A.4 shows that DINOv2 and All-Roberta-large-v1 demonstrate a consistent improvement in matching accuracy across successive layers, suggesting a hierarchical alignment process inherent to their architectures. Here, the QAP matching accuracy increases steadily for the DINOv2 and All-Roberta-large-v1 combination when we fix the last layer of All-Roberta-large-v1 and vary the layers of DINOv2. Conversely, when we fix the last layer of DINOv2 and vary the layers of the text encoder, the QAP accuracy starts high at 0.448 and reaches 0.672 at the top layer; we therefore hypothesize that the text encoder representations do not change as much as the image representations.

Figure A.4: Layer-wise CKA heatmap illustration. The heatmaps depict the CKA scores obtained by varying the layers from which the text and visual embeddings are taken. (a) Left: CKA scores for the unaligned combination of All-Roberta-large-v1 and DINOv2. (b) Right: CKA scores for the CLIP text and vision encoders. In both cases, we observe that the CKA scores are low for embeddings from earlier layers of the vision model and improve when embeddings from later layers are considered. This illustrates that both aligned and unaligned text-vision encoders behave similarly in terms of cross-modal similarity w.r.t. CKA.
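Table A.3: QAP accuracy for different layers of vision and text encoder of CLIP model. Rows index the text layer and columns index the vision layer.

| Text \ Vision | 6th | 11th | 16th | 21st | 26th | proj |
|---|---|---|---|---|---|---|
| 6th | 0.02 | 0.022 | 0.022 | 0.098 | 0.126 | 0.118 |
| 11th | 0.028 | 0.038 | 0.016 | 0.248 | 0.278 | 0.278 |
| 14th | 0.026 | 0.03 | 0.036 | 0.238 | 0.282 | 0.296 |
| proj | 0.038 | 0.026 | 0.034 | 0.622 | 0.716 | 0.792 |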

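Table A.4: QAP accuracy for different layers of DINOv2 and All-Roberta-large-v1 models. Rows index the text layer and columns index the vision layer.

| Text \ Vision | 6th | 11th | 16th | 21st | 26th |
|---|---|---|---|---|---|
| 6th | 0.008 | 0.020 | 0.150 | 0.314 | 0.448 |
| 11th | 0.010 | 0.022 | 0.146 | 0.360 | 0.498 |
| 16th | 0.008 | 0.016 | 0.194 | 0.334 | 0.500 |
| 21st | 0.002 | 0.004 | 0.148 | 0.420 | 0.538 |
| 26th | 0.008 | 0.016 | 0.198 | 0.450 | 0.672 |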
Table A.5: Impact of adding noise to the embeddings. Performance comparison, in terms of matching accuracy, between relative representations [34] and our global CKA-based QAP approach is shown for the image-caption matching task with 320 base samples and 500 query samples on the COCO validation set. Gaussian noise with std-dev ($\sigma$) being a multiple of the embeddings std-dev is added to both image and textual embeddings. A noise level of 0 ($\sigma=0$) denotes the performance for the original embeddings. The relative performance drop for a noise level from its reference ($\sigma=0$) is shown in parentheses. In comparison to relative representations, the performance of our QAP approach drops at a slower rate as $\sigma$ increases, illustrating better noise robustness for our approach.

| Method | $\sigma=0.0$ | $\sigma=0.1$ | $\sigma=0.2$ | $\sigma=0.3$ | $\sigma=0.4$ | $\sigma=0.5$ |
|---|---|---|---|---|---|---|
| Relative representations [34] | 47.3 | 45.3 (↓4.4) | 44.2 (↓6.5) | 41.3 (↓12.7) | 39.0 (↓17.6) | 35.6 (↓24.8) |
| Ours (QAP) | 53.9 | 53.7 (↓0.3) | 51.8 (↓3.9) | 48.7 (↓9.5) | 46.9 (↓13.0) | 43.3 (↓19.6) |

Appendix D Mathematical Relationship between Local CKA-based Retrieval and Relative Representations

In this section, we provide derivations that show that the relative representations method [34] can be seen as a particular case of our proposed $\operatorname{localCKA}$ method. Denote the sets of query and base representation samples respectively as $\mathbf{Q}_A = \left[\bm{q}_1^A, \ldots, \bm{q}_N^A\right] \in \mathbb{R}^{d_A \times N}$ and $\mathbf{B}_A = \left[\bm{b}_1^A, \ldots, \bm{b}_M^A\right] \in \mathbb{R}^{d_A \times M}$, where $A \in \{I, C\}$ for images and captions. The retrieval matrix for the relative representations (RR) method is then given by:

$$\mathbf{R}^{\text{RR}} = \mathbf{Q}_I^\top \mathbf{B}_I \mathbf{B}_C^\top \mathbf{Q}_C \in \mathbb{R}^{N \times N}.$$

From which, for instance, the i𝑖iitalic_i-th image query is mapped to its corresponding caption via:

$$\operatorname*{arg\,max}_{j} R_{ij}^{\text{RR}} = \operatorname*{arg\,max}_{j} \, (\bm{q}_i^I)^\top \mathbf{B}_I \mathbf{B}_C^\top \bm{q}_j^C. \tag{7}$$

In contrast, our proposed $\operatorname{localCKA}$ method constructs the retrieval matrix $\mathbf{R}^{\text{Ours}}$ with entries $R_{ij}^{\text{Ours}} = \operatorname{localCKA}\left(\bm{q}_i^I, \bm{q}_j^C\right)$, where:

$$\operatorname{localCKA}\left(\bm{q}_i^I, \bm{q}_j^C\right) = \operatorname{CKA}\left(\mathbf{K}_{[\mathbf{B}_I, \bm{q}_i^I]}, \mathbf{K}_{[\mathbf{B}_C, \bm{q}_j^C]}\right). \tag{8}$$

In particular, consider the linear kernel and define the CKA score as the trace of the product of the two kernels, i.e., $\operatorname{CKA}(\mathbf{K}, \mathbf{L}) = \operatorname{tr}\left(\mathbf{K}\mathbf{L}\right)$. We first have, for $A \in \{I, C\}$:

$$\mathbf{K}_{[\mathbf{B}_A, \bm{q}_i^A]} = [\mathbf{B}_A, \bm{q}_i^A]^\top [\mathbf{B}_A, \bm{q}_i^A] = \begin{bmatrix} \mathbf{B}_A^\top \mathbf{B}_A & \mathbf{B}_A^\top \bm{q}_i^A \\ \left(\mathbf{B}_A^\top \bm{q}_i^A\right)^\top & \|\bm{q}_i^A\|^2 \end{bmatrix}.$$

Hence, we have:

$$\operatorname{tr}\left(\mathbf{K}_{[\mathbf{B}_I, \bm{q}_i^I]} \mathbf{K}_{[\mathbf{B}_C, \bm{q}_j^C]}\right) = \operatorname{tr}\left(\mathbf{B}_I^\top \mathbf{B}_I \mathbf{B}_C^\top \mathbf{B}_C\right) + 2 \underbrace{\left(\bm{q}_i^I\right)^\top \mathbf{B}_I \mathbf{B}_C^\top \bm{q}_j^C}_{\text{relative representations term}} + \|\bm{q}_i^I\|^2 \|\bm{q}_j^C\|^2.$$

Therefore, in this particular case, our method is equivalent to the relative representations method, since $R_{ij}^{\text{Ours}} = 2R_{ij}^{\text{RR}} + c$, where $c$ is a constant scalar if the representations are normalized, so the $\operatorname*{arg\,max}$ over $j$ is unchanged. As such, the relative representations method falls within our proposed $\operatorname{localCKA}$ method if one considers the linear kernel and takes the trace instead of the $\operatorname{HSIC}$ metric. Our proposed method is therefore more general, since it relies on general kernel functions and the $\operatorname{HSIC}$ metric, which might explain its performance.
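The trace identity in this derivation is easy to check numerically. The sketch below uses random unit-norm vectors as stand-ins for real embeddings (an assumption, as are the variable and helper names), builds the augmented linear kernels explicitly, and confirms that the trace score equals twice the relative representations score plus a constant, so both rank the captions identically.

```python
import numpy as np

rng = np.random.default_rng(0)
d_I, d_C, M, N = 12, 9, 6, 5  # embedding dims, base samples, query samples


def unit_columns(A):
    """Normalize every column (one sample per column) to unit length."""
    return A / np.linalg.norm(A, axis=0, keepdims=True)


B_I = unit_columns(rng.normal(size=(d_I, M)))  # base image embeddings
Q_I = unit_columns(rng.normal(size=(d_I, N)))  # query image embeddings
B_C = unit_columns(rng.normal(size=(d_C, M)))  # base caption embeddings
Q_C = unit_columns(rng.normal(size=(d_C, N)))  # query caption embeddings


def augmented_kernel(B, q):
    """Linear kernel K_[B, q] = [B, q]^T [B, q], of shape (M + 1, M + 1)."""
    A = np.column_stack([B, q])
    return A.T @ A


# Relative representations retrieval matrix, shape (N, N).
R_rr = Q_I.T @ B_I @ B_C.T @ Q_C

# Trace-of-product score between the augmented kernels for every (i, j).
R_tr = np.array([
    [np.trace(augmented_kernel(B_I, Q_I[:, i]) @ augmented_kernel(B_C, Q_C[:, j]))
     for j in range(N)]
    for i in range(N)
])

# With unit-norm queries, ||q_i^I||^2 ||q_j^C||^2 = 1 for all i, j.
const = np.trace(B_I.T @ B_I @ B_C.T @ B_C) + 1.0
assert np.allclose(R_tr, 2.0 * R_rr + const)
assert np.array_equal(R_tr.argmax(axis=1), R_rr.argmax(axis=1))
```

If the queries are not normalized, the last term varies with $i$ and $j$ and the two rankings are no longer guaranteed to coincide.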

Impact of noise addition: Table A.5 shows the performance comparison between relative representations [34] and our global CKA-based QAP approach for the image-caption matching task with 320 base samples and 500 query samples on the COCO validation set. For this experiment, 10 trials were conducted with different seeds, and clustering of base samples was employed. Gaussian noise whose standard deviation ($\sigma$) is a multiple of the embeddings' standard deviation is added to both image and textual embeddings. The performance of the original embeddings is also shown for reference (noise level of 0, i.e., $\sigma=0$). The relative performance drop at each noise level from its reference ($\sigma=0$) is shown in parentheses. Compared to relative representations, the performance of our QAP approach drops at a slower rate as $\sigma$ increases. E.g., for $\sigma=0.2$, the matching accuracy of relative representations drops 6.5% from its maximum of 47.3, while ours is more robust and drops only 3.9% from its maximum of 53.9 at $\sigma=0$. These results show that our QAP approach is more robust to noise addition than relative representations.
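The noise protocol can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the embedding standard deviation is taken over all entries of the matrix (the text does not say whether it is global or per-dimension), the embeddings are synthetic stand-ins, and the helper names `add_gaussian_noise` and `relative_drop` are hypothetical.

```python
import numpy as np

def add_gaussian_noise(emb, noise_level, rng):
    """Add zero-mean Gaussian noise with sigma = noise_level * std(emb).

    Assumption: std(emb) is computed over all entries of the embedding matrix.
    """
    sigma = noise_level * emb.std()
    return emb + sigma * rng.standard_normal(emb.shape)

def relative_drop(reference, value):
    """Percentage drop of `value` relative to the noise-free `reference`."""
    return 100.0 * (reference - value) / reference

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 64))           # synthetic stand-in for encoder embeddings
noisy = add_gaussian_noise(emb, 0.2, rng)   # noise level of 0.2
clean = add_gaussian_noise(emb, 0.0, rng)   # noise level of 0 leaves embeddings unchanged
```

In the experiment described above, the same noise level would be applied to both the image and the text embeddings before running the matching, and `relative_drop` corresponds to the values reported in parentheses.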

Appendix E Other text encoders

Evaluating on COCO with M=320 and N=500, Table A.6 shows that DINOv2-large achieves high QAP accuracy and retrieval performance when combined with different text encoders. This underscores the potential of pairing well-trained sentence and vision encoders for achieving high semantic similarity between image and text embeddings.
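For readers who want to reproduce a similarity score of this kind, the sketch below computes CKA with a linear kernel. This is a simplification and an assumption on our part: the table reports kernel CKA, whose kernel choice and bandwidth are specified in the main paper, whereas the linear kernel is used here only to keep the example short. The function name `linear_cka` and the toy data are illustrative.

```python
import numpy as np

def linear_cka(x, y):
    """Linear-kernel CKA between two sets of n paired embeddings.

    x: (n, d_x), y: (n, d_y); row i of x and row i of y describe the same sample.
    """
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    k = h @ (x @ x.T) @ h                 # centered linear Gram matrix of x
    l = h @ (y @ y.T) @ h                 # centered linear Gram matrix of y
    hsic_xy = np.sum(k * l)               # equals trace(k @ l) because l is symmetric
    # The 1 / (n - 1)^2 factor of HSIC cancels between numerator and denominator.
    return hsic_xy / np.sqrt(np.sum(k * k) * np.sum(l * l))

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 8))
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix
rotated = 3.0 * x @ q                          # rotated and rescaled copy of x
unrelated = rng.normal(size=(50, 8))           # independent embeddings

cka_same = linear_cka(x, rotated)
cka_unrelated = linear_cka(x, unrelated)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling, so `cka_same` is 1 up to floating-point error, while `cka_unrelated` is much smaller.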

Table A.6: Comparison of CKA, QAP acc. and local CKA retrieval for different text encoders with DINOv2-large image encoder.

Text Encoder                      Kernel CKA  QAP Acc.  Ret @ 5
all-roberta-large-v1              0.690       64.93     77.27
paraphrase-distilroberta-base-v2  0.689       65.07     76.33
paraphrase-mpnet-base-v2          0.695       68.20     81.07
sentence-t5-large                 0.660       57.87     69.13
sentence-t5-xxl                   0.677       63.40     73.00

Appendix F Simple projection

We trained a 2-layer MLP on top of the frozen DINOv2-large encoder until convergence, using either the CLIP loss or the MSE loss. For a fair comparison with our setting, we use 320 training and 500 query image-text samples. Results in Table A.7 are averaged over 3 seeds. Notably, QAP matching and local-CKA retrieval outperform projection learning, which demands hyperparameter tuning. In contrast, QAP and local-CKA provide a novel, training-free mechanism to evaluate encoder representational similarity, demonstrating effective latent space communication.
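The two projection objectives can be sketched in NumPy as below. This is a hedged illustration of the loss functions only: the MLP, the optimizer, and the training loop are omitted, the temperature of 0.07 is an assumed value, and the function names `clip_loss` and `mse_loss` are hypothetical rather than taken from the authors' code.

```python
import numpy as np

def clip_loss(img, txt, temperature=0.07):
    """Symmetric contrastive (CLIP-style) loss for paired rows of img and txt.

    img, txt: (n, d) arrays; row i of img is the positive for row i of txt.
    The temperature value is an assumption made for this sketch.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def cross_entropy_diag(z):
        z = z - z.max(axis=1, keepdims=True)                        # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))                         # targets on the diagonal

    # Average of the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

def mse_loss(pred, target):
    """Mean squared error between projected and target embeddings."""
    return np.mean((pred - target) ** 2)

rng = np.random.default_rng(0)
projected = rng.normal(size=(8, 16))           # stand-in for MLP outputs
aligned = projected.copy()                     # perfectly matched text embeddings
shuffled = np.roll(projected, 1, axis=0)       # every pair mismatched

loss_aligned = clip_loss(projected, aligned)
loss_shuffled = clip_loss(projected, shuffled)
```

Both losses require gradient-based training and a choice of hyperparameters such as the temperature and learning rate, whereas QAP matching and local-CKA retrieval operate directly on the frozen embeddings.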

Appendix G Effect of unimodal tasks on alignment

Table A.8 reports QAP matching and local-CKA retrieval results on COCO captions (M=320, N=500) using ViT, DETR, DPT, and SegFormer vision encoders. ViT is trained on ImageNet-1k (classification), DETR on COCO 2017 (detection), DPT on 1.4M depth images (depth estimation), and SegFormer is fine-tuned on ADE20k (semantic segmentation). Results indicate that classification models exhibit higher semantic similarity to the all-roberta-large text encoder, in both QAP accuracy and local-CKA scores, than models trained on pixel-level tasks such as object detection, segmentation, and depth estimation.

Table A.7: QAP acc. and Top-5 retrieval scores on COCO.
Method        QAP acc  Ret @ 5
Proj. + MSE   59.8     73.0
Proj. + CLIP  55.4     68.1
QAP           65.9     -
Local CKA     64.3     76.0

Table A.8: Unimodal tasks’ effect on image-text alignment.
Vision model  QAP acc  Ret @ 5
ViT           35.3     56.1
DETR          26.5     39.8
DPT           22.7     34.1
SegFormer     16.8     33.4

Appendix H Additional Retrieval Results

While the performance on the image retrieval task was reported in Table 2 of the main manuscript, here in Table A.9 we show the NoCaps and COCO caption retrieval results in the reverse setting. In this configuration, the retrieval objective shifts to finding the correct caption from a pool of $N$ captions when given a single image. The matching objective remains the same, but instead of shuffling the captions, the images themselves are shuffled. While the matching accuracies change minimally in this setting, the retrieval accuracies display notable discrepancies.

A plausible explanation for the reduced retrieval scores associated with the relative representation method is the heightened semantic variability inherent in the image domain compared to the caption domain. A considerable number of images share very similar captions, leading to a compressed semantic space for the captions. Consequently, caption embeddings lie closer to one another, making retrieval considerably harder.
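The difference between the two retrieval directions can be made concrete with a small sketch. It assumes a caption-by-image similarity matrix whose ground-truth pairs lie on the diagonal; the function name `topk_accuracy` and the toy scores are illustrative and do not come from the paper's experiments.

```python
import numpy as np

def topk_accuracy(sim, k=5):
    """Fraction of queries whose true match is among the k highest-scoring candidates.

    sim[i, j] is the score between query i and candidate j; the true match of
    query i is candidate i.
    """
    topk = np.argsort(-sim, axis=1)[:, :k]                        # best k candidates per query
    hits = (topk == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return hits.mean()

# Toy scores: rows are captions, columns are images.
sim = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.7],
    [0.1, 0.3, 0.6],
])

image_retrieval = topk_accuracy(sim, k=1)      # caption -> image
caption_retrieval = topk_accuracy(sim.T, k=1)  # image -> caption (reverse setting)
```

Here two of the three captions retrieve their image at rank 1, but only one of the three images retrieves its caption, so the same score matrix can yield different retrieval accuracies in the two directions. A one-to-one matching, by contrast, is defined on the whole matrix and does not depend on which side is treated as the query.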

Table A.9: Reverse Caption Retrieval Results for COCO and NoCaps. In this setting, the retrieval objective is, given one image, to retrieve the correct caption from the overall set of $N$ captions. The matching objective remains quite similar but instead of shuffling the captions, this time, the images are shuffled.
Method                         Vision Model   NoCaps [2]                          COCO [27]
                                              Matching accuracy  Top-5 retrieval  Matching accuracy  Top-5 retrieval
Cosine Similarity*             CLIP [40]      99.5               99.6             97.1               98.5
Linear regression              CLIP-V [40]    63.6               70.1             72.6               83.9
Linear regression              ConvNeXt [47]  22.8               38.9             43.8               65.7
Linear regression              DINOv2 [37]    46.8               59.9             56.2               75.9
Relative representations [34]  CLIP-V [40]    61.3               3.0              61.6               2.9
Relative representations [34]  ConvNeXt [47]  25.5               2.7              38.6               12.9
Relative representations [34]  DINOv2 [37]    45.9               38.1             47.7               43.7
Ours: QAP                      CLIP-V [40]    67.3               -                72.8               -
Ours: QAP                      ConvNeXt [47]  45.9               -                65.1               -
Ours: QAP                      DINOv2 [37]    58.5               -                65.9               -
Ours: Local CKA                CLIP-V [40]    65.1               65.9             71.9               80.5
Ours: Local CKA                ConvNeXt [47]  44.8               33.0             63.8               74.3
Ours: Local CKA                DINOv2 [37]    55.7               64.2             64.3               76.0

Appendix I Additional Cross-Lingual Matching Results

For completeness, we report in Table A.10 the results for the reverse setting of the cross-lingual image-caption matching/retrieval task mentioned in the main paper. Given $N$ captions in, say, German, and $N$ shuffled images, the objective is to match each German caption with the correct image. In retrieval, the goal is to select the most fitting image from the retrieval set given a German caption. We notice that the matching accuracies remain the same, as the direction doesn’t affect the matching. However, in the case of reverse retrieval, CLIP’s retrieval@5 drops by over 4.5% on average, compared to a drop of 2.1% for our local CKA-based retrieval.

In Table A.11, we report the results obtained with language-specific BERT sentence encoders on the cross-lingual caption matching/retrieval task for 5 languages. In all these cases, the vision encoder is kept fixed as OpenAI’s CLIP-VIT-L-14, trained on English image-caption pairs. We notice that the semantic alignment with the vision encoder, in terms of CKA as well as matching/retrieval performance, drops with language-specific encoders when compared to a multi-lingual model like multilingual-mpnet-base-v2. We believe this could be because the multi-lingual model is trained on considerably more data than the language-specific ones, resulting in more meaningful embedding spaces.

Table A.10: Cross-lingual image matching and retrieval performance comparison. Here we use multilingual captions to retrieve images from the COCO validation set. Using QAP and local CKA-based methods, we are able to do cross-lingual image matching/retrieval using CLIP’s ViT-L vision encoder and the multi-lingual sentence transformer paraphrase-multilingual-mpnet-base-v2. While CLIP performs well on the Latin languages, it degrades on non-Latin languages. In comparison, our QAP and Local-CKA-based methods perform comparably on Latin languages while outperforming CLIP on non-Latin languages, highlighting the efficacy of our training-free transfer approach.
Language        Kernel CKA     Matching Accuracy                        Retrieval @ 5
                CLIP   Ours    CLIP  Relative [34]  Linear  Ours (QAP)  CLIP  Ours (Local)
Latin      de   0.472  0.627   43.5  35.0           19.3    39.7        54.9  57.2
           en   0.567  0.646   80.9  52.5           25.6    51.3        90.4  66.7
           es   0.471  0.634   50.4  37.8           19.7    40.9        63.9  57.9
           fr   0.477  0.624   50.8  37.5           18.8    40.3        65.9  56.9
           it   0.472  0.638   41.9  37.2           19.7    38.7        52.9  57.0
Non-Latin  jp   0.337  0.598   12.9  28.3           15.2    30.2        17.8  48.6
           ko   0.154  0.620   0.9   30.4           15.3    31.3        2.2   48.4
           pl   0.261  0.642   8.1   36.6           21.0    40.0        15.7  55.9
           ru   0.077  0.632   1.7   31.8           16.3    34.8        3.5   53.9
           tr   0.301  0.624   7.8   35.8           18.7    38.9        14.6  53.1
           zh   0.133  0.641   2.4   36.5           19.2    39.9        4.8   53.7
Avg.                           27.4  36.3           18.9    38.7        35.1  55.4
Table A.11: Language-specific encoders for cross-lingual caption matching/retrieval for 5 languages. Language-specific encoders have less semantic similarity with the vision encoder in terms of CKA, as well as poorer matching/retrieval performance, when compared to multi-lingual models like multilingual-mpnet-base-v2, which is reported in Table 4.

Language  Language model                                    CKA    Linear  Relative  QAP   Retrieval@5
es        hiiamsid\sentence_similarity_spanish_es           0.568  15.9    25.1      28.6  50.0
fr        dangvantuan\sentence-camembert-large              0.569  22.5    31.5      35.0  53.1
it        nickprock\sentence-bert-base-italian-uncased      0.543  16.0    22.0      26.4  47.8
jp        colorfulscoop\sbert-base-ja                       0.457  9.2     12.1      14.5  33.7
tr        emrecan\bert-base-turkish-cased-mean-nli-stsb-tr  0.564  23.1    34.7      38.3  54.3

Original Image    Caption    Top-3 Retrieved Images
[image]    Two desktop computers sitting on top of a desk.    [image] [image] [image]
[image]    A mother and baby elephant walking in green grass in front of a bond.    [image] [image] [image]
[image]    a man is riding a surfboard at the beach    [image] [image] [image]
[image]    The Big Ben clock tower towering over the city of London.    [image] [image] [image]
[image]    A computer mouse is beside a notebook computer.    [image] [image] [image]
Table A.12: Local Kernel CKA Retrieval Mispredictions. In accordance with the experimental protocol detailed in the main paper, we selected 320 base samples and conducted local Kernel CKA retrieval using an additional 500 query samples. Presented above are five example prediction retrievals for instances where the original image failed to secure a position within the top-5 retrievals. We observe that although the original image was not in the retrieved top-5, the retrieved images (top-3 shown here) closely resemble the corresponding caption, thereby highlighting the efficacy of our approach.
Table A.13: CKA for combinations of different vision and text encoders. V, V_tr, V_tr_size, V_mod_size stand for Vision model name, Vision train set, Vision train set size, and Vision model size respectively. T_mod_size stands for text model size. OpenAI’s CLIP text encoder shows the highest CKA with facebook_dinov2-base, closely followed by All-Roberta-large-v1. We make use of All-Roberta-large-v1 as the language encoder for all downstream tasks and analysis in the main text because All-Roberta-large-v1 has been trained using only text data and can be considered a purely textual encoder.

V                                       T                              CKA    V_tr          V_tr_p            V_tr_size  V_mod_size  T_mod_size
facebook_dinov2-base                    openai_clip-vit-large-patch14  0.719  LVD-142M      DinoV2            142        86          123
facebook_dinov2-base                    All-Roberta-large-v1           0.706  LVD-142M      DinoV2            142        86          355
timm_vit_base_patch16_224.augreg_in21k  openai_clip-vit-large-patch14  0.698  ImageNet-21k  Supervised        14.1       86          123
facebook_dinov2-large                   sentence-t5-xxl                0.684  LVD-142M      DinoV2            142        307         4870
openai_clip-vit-large-patch14-336       All-Roberta-large-v1           0.677  CLIP-400M     Lang. Supervised  400        307         355
facebook_dinov2-large                   sentence-t5-large              0.668  LVD-142M      DinoV2            142        307         335
facebook_dinov2-small                   sentence-t5-xl                 0.661  LVD-142M      DinoV2            142        22          1240
facebook_dinov2-small                   all-mpnet-base-v2              0.655  LVD-142M      DinoV2            142        22          109
facebook_dinov2-small                   all-MiniLM-L6-v1               0.644  LVD-142M      DinoV2            142        22          22
facebook_convnext-base-224-22k          gtr-t5-xxl                     0.626  ImageNet-21k  Supervised        14.1       89          4870
timm_vit_small_patch16_224.augreg_in1k  gtr-t5-xl                      0.602  ImageNet-1k   Supervised        1.2        22          1240
timm_convnext_base.fb_in22k             all-MiniLM-L6-v2               0.590  ImageNet-21k  Supervised        14.1       89          22
timm_convnext_tiny.fb_in1k              gtr-t5-xl                      0.540  ImageNet-1k   Supervised        1.2        29          1240
timm_convnext_base.fb_in1k              msmarco-bert-base-dot-v5       0.512  ImageNet-1k   Supervised        1.2        89          109
facebook_dino-vitb8                     msmarco-distilbert-dot-v5      0.445  ImageNet-1k   DinoV1            1.2        86          66
facebook_dino-vits8                     all-mpnet-base-v2              0.423  ImageNet-1k   DinoV1            1.2        22          109
facebook_dino-vits8                     paraphrase-TinyBERT-L6-v2      0.398  ImageNet-1k   DinoV1            1.2        22          66

Appendix J Qualitative results

In Table A.12, we present instances of retrieval mispredictions where the original image fails to rank within the top five closest images to the given caption, as determined by the local Kernel CKA method. Following the experimental protocol of the main paper, we selected 320 base samples and conducted local Kernel CKA retrieval using an additional 500 query samples. We used All-Roberta-large-v1 for text embeddings and DINOv2 ViT-L/14 for image embeddings. The results illustrate that, even when the exact original image is not retrieved, the alternative images returned in the top five still exhibit a considerable degree of semantic similarity to the provided caption. This underscores the robustness of the local Kernel CKA retrieval approach: it identifies images that, while not the precise match, remain semantically coherent with the specified caption.