School of Computer and Electronic Information, Nanjing Normal University, Nanjing, 210023, CHN

Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval

Hanwen Su 0009-0003-4335-5290    Ge Song Corresponding author 0000-0002-2159-8203    Kai Huang 0009-0000-1735-1788    Jiyan Wang 0009-0000-4962-4496    Ming Yang 0000-0001-8936-4270
Abstract

In this paper, we study the problem of zero-shot sketch-based image retrieval (ZS-SBIR). Prior methods tackle the problem in a two-modality setting, using only category labels or no textual information at all. However, the growing prevalence of large-scale pre-trained language models (LLMs), which have absorbed rich knowledge from web-scale data, gives us an opportunity to gather conclusive textual information. Our key innovation lies in using text data as auxiliary information for images, thus leveraging the inherent zero-shot generalization ability that language offers. To this end, we propose an approach called Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval. The network consists of three components: (i) a Description Generation Module that generates textual descriptions for each training category by prompting an LLM with several interrogative sentences; (ii) a Feature Extraction Module that includes two ViTs for sketch and image data and a transformer for extracting tokens from the sentences of each training category; and (iii) a Cross-modal Alignment Module that exchanges the token features of both text-sketch and text-image pairs using a cross-attention mechanism and aligns the tokens locally and globally. Extensive experiments on three benchmark datasets show superior performance over state-of-the-art ZS-SBIR methods.

Keywords:
Zero-shot Learning · LLMs · Sketch-based Image Retrieval

1 Introduction

Figure 1: (i) During training, the sketch, the image, and the corresponding text are fed into the model to learn the alignment of corresponding regions. (ii) During inference, the transferred knowledge is used to perform ZS-SBIR.

Sketch-based image retrieval (SBIR) is a practical problem that uses a hand-drawn sketch as the query to retrieve the image of interest from a gallery. It has many application scenarios, such as e-commerce: we can retrieve a pair of shoes we want to purchase by drawing a sketch. Conventional SBIR requires the training and testing data to come from the same categories, which limits its real-world applications. Moreover, traditional SBIR models perform poorly on data from unseen classes. In addition, annotating large quantities of images is labor-intensive and time-consuming. Driven by these practical constraints, recent research on SBIR has shifted to the zero-shot setup, i.e., zero-shot sketch-based image retrieval (ZS-SBIR) [1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], because of the prevailing data-scarcity problem [10, 13, 15]. The basic assumption behind standard zero-shot learning (ZSL) methods is that the test data contain only unknown classes [14, 25]. ZS-SBIR is more challenging because the cross-modal and zero-shot paradigm carries inherent modal and semantic gaps [31, 34]. So far, many methods have made progress on ZS-SBIR. While [4, 5, 7] tried methods that involve no class labels, [1, 2, 6, 8, 9, 10, 12] used only class-label embeddings during training. However, recent works have shown that text documents from internet sources can provide valuable auxiliary information for ZSL [16]. We therefore try to make use of text information to boost the performance of ZS-SBIR models.

Our motivation can be summarized as follows: (i) Because images inherently include complex information, such as coloration and background, simply aligning sketch and image features brings limited benefit to the model's generalization ability. (ii) As shown in Fig. 1, since the detailed visual features of a sketch or an image can be summarized in text, we can align the regions of a sketch and an image with the help of the same phrases. In other words, textual information can be treated as a summary of both sketches and images. (iii) Textual information itself contains rich generalization knowledge, making it possible to use it as a bridge between sketch and image.

Thus, our main approach treats language as an auxiliary representation for ZS-SBIR, aiming to boost the zero-shot generalization capability of existing models. The problem then lies in how the textual information should be leveraged and, first of all, in how to obtain detailed and conclusive texts. Thanks to large-scale pre-trained language models (LLMs) such as GPT-3 [17], which have shown broad world knowledge on a wide range of topics, we can collect texts in a k-prompt way. The prompt can be, for example, “A caption describing a photo of a {category name}.” [24] or take other formats, which we discuss in Section 4.3. We then collect the outputs and pick reasonable ones as the textual information for each category. Although generating and collecting the text for each category requires some human labor, it is less laborious than annotating the training images and sketches. The model then learns to align the regions of a sketch and an image. The knowledge learned in this way can be generalized to unseen categories to boost the performance of ZS-SBIR methods.

In this work, we introduce a new framework for the ZS-SBIR problem that uses auxiliary textual information. The framework consists of three components: (i) a Description Generation Module that generates texts describing the distinguishing features of each training category by prompting an LLM with several interrogative sentences; (ii) a Feature Extraction Module that includes two ViTs for sketch and image data and a Transformer for extracting tokens from the sentences of each training category; and (iii) a Cross-modal Alignment Module that exchanges the modality-specific semantic information of both text-sketch and text-image pairs using cross-modal attention and aligns the tokens locally and globally. Extensive experiments on three ZS-SBIR benchmark datasets verify the superiority of our method. We summarize the main contributions of this work as follows:

  • We introduce textual information into the ZS-SBIR framework. More specifically, we leverage texts to capture and align the local features of sketches and images with better zero-shot generalization ability.

  • We design several prompts to collect more thorough descriptions of the training categories, and we combine them with a global triplet loss over the three modalities and a local matching loss to acquire knowledge for ZS-SBIR.

  • We conduct experiments on three ZS-SBIR benchmark datasets and achieve state-of-the-art results.

2 Related Work

2.1 Zero-shot Sketch-Based Image Retrieval

ZS-SBIR aims to generalize the knowledge learned from seen training classes to unseen testing categories, using a query sketch to retrieve images of the same category. The problem was introduced by [13], which aims to reduce the domain gap between sketches and photos by approximating image-to-image translation. Subsequent works such as [1, 2, 6, 8, 9, 10, 12] used CNNs as backbones to extract features and then used projections to create a joint space. In addition to the joint semantic space, [6] introduced a generative adversarial network for alignment. Meanwhile, [4, 7] used ViT [18] as the backbone, achieving better results than CNN-based methods. Moreover, the teacher-student paradigm was employed by [1, 2] to distill knowledge. In addition, a test-time training paradigm [5] on the sketches in the test set was introduced to adapt to the test set distribution. [7] highlighted the importance and the generalization capability of aligning corresponding local patches for ZS-SBIR. However, the above-mentioned methods obtain sub-optimal results because they do not exploit the potential zero-shot generalization ability of textual information.

2.2 Cross Attention Based on Transformer

The Transformer architecture was first introduced for machine translation by [19]. After that, [18] applied the Vision Transformer (ViT) directly to sequences of image patches, achieving strong results on image recognition. Later methods [20, 22] adopted ViT as their backbone and made use of the reasoning capability of cross-attention. While [22] finds pixel-level correspondences between images in the few-shot image classification task, [20] offers a dual-branch ViT that extracts tokens at multiple scales and exchanges class tokens between the two branches, enhancing the class token for classification. For the cross-modal retrieval task, [3] directly shares a unified multi-head attention classification network between the two modalities and therefore makes better use of the regional structural information shared by the cross-modal instances themselves. More recently, [4] introduced a shared fusion ViT that provides an extra fusion token for interaction with other tokens through the self-attention mechanism, resulting in a smaller domain gap. Applying cross-attention to tokens after self-attention, [7] introduced a patch-aligning method that learns the patch-level correspondence between a query sketch and an image candidate. However, computing cross-attention directly between sketch and image may yield tokens with weaker semantic discriminability, resulting in mismatches during training.

2.3 Language-assisted Vision Models

The main challenge of vision-language retrieval is the semantic divergence of heterogeneous data [21]. Moreover, linguistic information normally contains knowledge that is complementary to the image modality. Large language models (LLMs), such as GPT-3 [17], are trained on web-scale datasets. Furthermore, trained LLMs show impressive few-shot and even zero-shot inference capabilities on multiple tasks. Recent papers [16, 23, 24] have shown that textual information generated by LLMs with appropriate prompts can help a model learn complementary knowledge for vision tasks [23, 24], even for more fine-grained correspondences [16]. In this paper, we leverage textual information to align the local patches of sketch and image.

Figure 2: Overview. (i) Description Generation Module collects the textual knowledge describing the visual clues for each specific category, which is generated by LLM with some prompts. (ii) Feature Extraction Module takes the sketch, image, and text data and feeds them into the encoders respectively for extracting token-level features by self-attention. (iii) Cross-attention Alignment Module utilizes the cross-attention mechanism to exchange information between sketch-text and image-text tokens. The correspondences of tokens will be measured locally (matching loss) and globally (triplet loss).

3 Proposed Method

The overall scheme of our proposed framework is shown in Fig.2. Each key module is detailed in the following.

3.1 Description Generation Module

Since we want to exploit the zero-shot generalization capability of text, we need category-specific descriptions of the visual features of the training categories. As shown in Fig. 2, for N training categories, we prompt an LLM with inputs such as “What are useful visual features for distinguishing a {category name} in a photo?” [33]. (Other appropriate prompts can also be used, as we discuss in Section 4.3.) After prompting, we gather and select sentences that describe the distinguishing features of each training category. We denote the generated text descriptions for the N categories as $P_N$, formulated as:

$$P_N = \mathrm{LLM}(\mathrm{prompts}). \tag{1}$$

The textual descriptions $P_N$ then serve as the input of the text encoder.
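The prompting step can be illustrated with a minimal sketch. It only fills the question template with each training category name; the call to the LLM and the manual selection of outputs are deliberately left out. The helper name `build_prompts` and the placeholder `{category}` are assumptions made for this illustration.

```python
# Hypothetical helper: fill the question template with each training category.
# The placeholder "{category}" stands in for "{category name}" in the text.
TEMPLATE = "What are useful visual features for distinguishing a {category} in a photo?"


def build_prompts(categories, template=TEMPLATE):
    """Return a {category: prompt} mapping; no LLM is called here."""
    return {name: template.format(category=name) for name in categories}


prompts = build_prompts(["cat", "bicycle"])
print(prompts["cat"])
```

Each prompt would then be sent to the language model, and the useful sentences among its answers kept as the description of that category.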

3.2 Feature Extraction Module

For a given query sketch $S \in \mathbb{R}^{h \times w \times c}$ and a gallery image $I \in \mathbb{R}^{h \times w \times c}$, we use ViT [18] to extract features. More specifically, the ViT partitions the input into non-overlapping patches and then uses a projection head to map them into vanilla sketch tokens $S_v \in \mathbb{R}^{h \times w \times c}$ and image tokens $I_v \in \mathbb{R}^{h \times w \times c}$. In addition, a hand-drawn sketch usually contains only the object contour, which indicates the primary structure. So, during feature extraction, we adopt the learnable tokenization method [7], which enlarges the receptive field when constructing visual tokens through hierarchical convolution, thereby better preserving structural cues from nearby regions. It is a network containing four convolution layers with non-linear activation functions. In this way, we obtain new tokens $S_n$ and $I_n$, which adjust the vanilla visual tokens through a residual connection. The final output embedding is formulated as:

$$X = X_v + X_n, \tag{2}$$

where $X$ stands for Sketch ($S$) or Image ($I$).
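As a small worked example of the tokenization arithmetic, the sketch below counts the patch tokens and applies the residual sum of Eq. (2). It assumes a 224×224 input and 16×16 patches, as used by ViT-B/16; the function names and the toy token values are illustrative only, and the convolutional tokenizer itself is not modeled.

```python
def num_patches(height, width, patch):
    """Number of non-overlapping patch tokens for an input of the given size."""
    assert height % patch == 0 and width % patch == 0
    return (height // patch) * (width // patch)


def residual_tokens(vanilla, learned):
    """Eq. (2): X = X_v + X_n, applied token by token and channel by channel."""
    return [[v + n for v, n in zip(tok_v, tok_n)]
            for tok_v, tok_n in zip(vanilla, learned)]


n = num_patches(224, 224, 16)                      # 14 * 14 tokens
x = residual_tokens([[1.0, 2.0]], [[0.5, -1.0]])   # one toy token with two channels
```

With these assumptions each sketch or image yields 196 patch tokens, and the learned tokens act as a correction added to the vanilla ones.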

Then, the output embedding serves as the input of the ViT. However, we do not use the MLP head of the vanilla ViT [18]; instead, we use the feature before the final MLP, which is semantically richer. We call this feature the global token $[X_{[Glo]}]$. Specifically, an additional learnable global token $[X_{[Glo]}] \in \mathbb{R}^{d}$ is inserted into the multi-head self-attention and MLP blocks to learn the global representation of an image or a sketch. For both sketch and image, the tokens after multi-head attention are obtained by:

$$z_0 = [X_{[Glo]}; X^{1}; \dots; X^{n}], \tag{3}$$
$$z'_{\ell} = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \quad \ell = 1 \dots L, \tag{4}$$
$$z_{\ell} = \mathrm{MLP}(\mathrm{LN}(z'_{\ell})) + z'_{\ell}, \quad \ell = 1 \dots L, \tag{5}$$

where $X \in \{S, I\}$, LN is layer normalization, $L$ is the number of layers, and $z'_{\ell}$ is the intermediate output of the attention sub-layer, as in the standard ViT [18]. Both sub-layers use a residual connection. MSA is the multi-head self-attention, whose attention function is formulated as:

$$\mathrm{S\text{-}Atten}(Q^{x}, K^{x}, V^{x}) = \mathrm{softmax}\left(\frac{Q^{x} {K^{x}}^{T}}{\sqrt{d}}\right) V^{x}, \tag{6}$$

where $Q$, $K$, $V$ are the query, key, and value obtained by mapping the same token with three different linear projection heads $[W_q, W_k, W_v]$.

For a given text description generated by the LLM, we use a Transformer model to extract its token features. The textual feature $T$ is obtained by:

$$T = f_t(E_t(P_N)), \tag{7}$$

where $E_t$ stands for the text encoder, which is the CLIP [26] text encoder corresponding to ViT-B/16, and $f_t$ is a linear layer that projects the textual tokens to the same dimension as the image and sketch features.

3.3 Cross-modal Alignment Module

Given the features of the three modalities, we align the tokens using a cross-attention mechanism. More specifically, we use the text tokens to better align the sketch and the image. Cross-modal attention is computed for the sketch-text and image-text pairs. Words that describe the same region in an image and a sketch should receive larger attention weights. We update the sketch and image tokens by querying them with the textual tokens. The tokens after cross-modal attention are formulated as:

$$\mathrm{C\text{-}Atten}_{x,t}(Q_t, K_x, V_x) = \mathrm{softmax}\left(\frac{Q_t K_x^{T}}{\sqrt{d}}\right) V_x. \tag{8}$$
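The cross-modal attention of Eq. (8) can be sketched in plain Python as follows. This is a single-head, framework-free illustration under simplifying assumptions: the learned projections that produce the queries, keys, and values are omitted, and the toy matrices are hypothetical. Queries come from the text tokens, while keys and values come from the visual (sketch or image) tokens.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(q_text, k_vis, v_vis):
    """softmax(Q_t K_x^T / sqrt(d)) V_x for list-of-list matrices."""
    d = len(q_text[0])
    out = []
    for q in q_text:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_vis]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, v_vis))
                    for c in range(len(v_vis[0]))])
    return out


q_text = [[1.0, 0.0], [0.0, 1.0]]               # 2 text tokens, d = 2
k_vis = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 3 visual tokens
v_vis = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attended = cross_attention(q_text, k_vis, v_vis)
```

The output has one row per text token, and each row is a convex combination of the visual value vectors, so a word attends most strongly to the visual tokens whose keys it matches.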

Next, we align the tokens after cross-modal attention. We denote the sketch and image tokens after cross-modal attention as $X_{s-t}$ and $X_{i-t}$. We use the triplet loss for global alignment. Consider a triplet $\langle X_{[Glo]}(X_{s-t}),\ X_{[Glo]}^{+}(X_{i-t}),\ X_{[Glo]}^{-}(X_{i-t}) \rangle$, where $X_{[Glo]}(X_{s-t})$ is the anchor sketch feature, $X_{[Glo]}^{+}(X_{i-t})$ is the image feature with the same class label, and $X_{[Glo]}^{-}(X_{i-t})$ is the image feature with a different class label. The triplet loss pulls the positive pair together and pushes the negative pair apart. It is formulated as:

$$L_{tri} = \frac{1}{T} \sum_{i=1}^{T} \max\left\{ \left\| X_{[Glo]}(X_{s-t}) - X_{[Glo]}^{+}(X_{i-t}) \right\| - \left\| X_{[Glo]}(X_{s-t}) - X_{[Glo]}^{-}(X_{i-t}) \right\| + m,\ 0 \right\}, \tag{9}$$

where $\|\cdot\|$ is the L2 distance, $m$ is the margin parameter, and $T$ denotes the total number of triplets.
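A minimal sketch of the triplet loss in Eq. (9) is given below. The margin value and the two-dimensional toy features are assumptions chosen so that the distances are easy to check by hand; they are not values from our experiments.

```python
import math


def l2(a, b):
    """Euclidean (L2) distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(triplets, margin):
    """Mean over (anchor, positive, negative) of max(d(a,p) - d(a,n) + margin, 0)."""
    losses = [max(l2(a, p) - l2(a, n) + margin, 0.0) for a, p, n in triplets]
    return sum(losses) / len(losses)


easy = ([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])   # d(a,p) = 1, d(a,n) = 5
hard = ([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])   # d(a,p) = 5, d(a,n) = 1
loss = triplet_loss([easy, hard], margin=1.0)
```

The easy triplet already satisfies the margin and contributes nothing, whereas the hard triplet, whose positive is farther from the anchor than its negative, contributes a positive penalty.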

In addition to global alignment, the features of local patches also need to be aligned. We use the relation network [7] to explicitly measure the similarity between each pair of visual tokens after cross-modal attention. We first apply the cosine kernel function, computing the cosine similarity between all pairs of tokens, which yields a kernel matrix. Next, the relation network is used to explicitly measure the alignment scores. The relation network $\rho$ is a stack of two FC-ReLU-Dropout layers, and the relation score $r$ is a scalar in the range (0, 1). The matching loss is defined as a mean squared error (MSE). The relation score $r$ and the matching loss $L_{rn}$ are formulated as:

$$r(X_{s-t}, X_{i-t}) = \mathrm{sigmoid}\left(\rho\left(\frac{X_{s-t} \cdot X_{i-t}}{\left\| X_{s-t} \right\| \left\| X_{i-t} \right\|}\right)\right), \tag{10}$$
$$L_{rn} = \sum_{i=1}^{N} \sum_{j=1}^{M} \left( r_{i,j} - pre(y_i, y_j) \right)^{2}, \tag{11}$$
$$pre = \mathbf{1}(y_i == y_j), \tag{12}$$

where $pre$ is the ground-truth matching indicator against which the predicted score $r_{i,j}$ is compared: it equals 1 when the sketch and the image share the same class label and 0 otherwise. $y$ is the class label. $N$ and $M$ are the total numbers of query sketches and candidate images.
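Eqs. (10) to (12) can be illustrated with the following simplified sketch. It uses a single feature vector per sketch and image rather than a full kernel matrix over token pairs, and the default `rho` is a hypothetical stand-in (a fixed scaling) for the learned relation network; the labels and features are toy values.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def relation_score(sketch_tok, image_tok, rho=lambda s: 4.0 * s):
    """Eq. (10): sigmoid(rho(cosine)); `rho` is a stand-in, not a trained network."""
    return 1.0 / (1.0 + math.exp(-rho(cosine(sketch_tok, image_tok))))


def matching_loss(scores, sketch_labels, image_labels):
    """Eq. (11): squared error between r_ij and the indicator 1(y_i == y_j)."""
    return sum((scores[i][j] - float(ys == yi)) ** 2
               for i, ys in enumerate(sketch_labels)
               for j, yi in enumerate(image_labels))


sketches = [[1.0, 0.0], [0.0, 1.0]]
images = [[2.0, 0.0], [0.0, 3.0]]
labels = ["cat", "dog"]
scores = [[relation_score(s, g) for g in images] for s in sketches]
loss = matching_loss(scores, labels, labels)
```

Pairs with matching labels are pushed toward a score of 1 and mismatched pairs toward 0, so the orthogonal toy pairs (score 0.5) dominate the loss here.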

The overall loss $L$ is the weighted sum of $L_{tri}$ and $L_{rn}$:

$$L = \lambda_{tri} L_{tri} + \lambda_{rn} L_{rn}, \tag{13}$$

where $\lambda_{tri}$ and $\lambda_{rn}$ are hyper-parameters.

3.4 Inference

Notably, during inference, we do not use the textual information; only sketch and image features are required. Self-attention is applied to both the sketch and the image to produce the uni-modal features. Then, cross-attention is applied to the uni-modal features to compute the global and local tokens. Finally, the relation network produces the relation score, which represents the similarity between a query sketch and an image.
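Once a relation score is available for every gallery image, retrieval reduces to sorting the gallery by that score. The sketch below shows this last step only; the function name and the three scores are hypothetical.

```python
def rank_gallery(scores):
    """Return gallery indices sorted by descending similarity to one query sketch."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


ranking = rank_gallery([0.12, 0.87, 0.45])   # hypothetical relation scores
top_1 = ranking[0]                           # index of the best-matching image
```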

4 Experiment

4.1 Implementation

Datasets. Three datasets are commonly used for ZS-SBIR. Sketchy [29] is composed of 75,471 sketches and 12,500 natural images from 125 classes. Sketchy-Ext [15] is an extended version of the original Sketchy dataset that keeps the 125 categories and adds 60,502 photos, creating a larger photo gallery. Sketchy-25 refers to a partition of 100 training classes and 25 testing classes. Sketchy-21 [13] refers to the version with 104/21 train/test classes, which selects classes that do not overlap with ImageNet [30] categories as unseen classes. TU-Berlin [32] contains a total of 250 categories, each with 80 sketches. It is extended with a collection of 204,489 images provided by [15]. For the extended TU-Berlin dataset, following [6], we use 220 classes for training and test on the remaining 30 classes. QuickDraw [11] is the largest SBIR dataset, with a total of 110 categories. It contains 330,000 sketches drawn by amateurs and 204,000 photos. The split is 80 classes for training and 30 for testing.

Table 1: Comparison results. “-”: not reported, “†”: approximate result. The best and second-best scores are colored in red and blue.

| Methods | $R^D$ | TU-Berlin mAP@all | TU-Berlin Prec@100 | Sketchy-25 mAP@all | Sketchy-25 Prec@100 | Sketchy-21 mAP@200 | Sketchy-21 Prec@200 | QuickDraw mAP@all | QuickDraw Prec@200 |
|---|---|---|---|---|---|---|---|---|---|
| ZSIH[27] | 64 | 0.220 | 0.291 | 0.254 | 0.340 | - | - | - | - |
| CC-DG[28] | 256 | 0.247 | 0.392 | 0.311 | 0.468 | - | - | - | - |
| DOODLE[11] | 256 | 0.109 | - | 0.369 | - | - | - | 0.075 | 0.068 |
| SEM-PCYC[9] | 64 | 0.297 | 0.426 | 0.349 | 0.463 | - | - | - | - |
| SAKE[6] | 512 | 0.475 | 0.599 | 0.547 | 0.692 | 0.497 | 0.598 | 0.130 | 0.179 |
| SketchGCN[8] | 300 | 0.323 | 0.505 | 0.382 | 0.538 | - | - | - | - |
| StyleGuide[10] | 200 | 0.254 | 0.355 | 0.376 | 0.484 | 0.358 | 0.400 | - | - |
| PDFD[12] | 512 | 0.483 | 0.600 | 0.661 | 0.781 | - | - | - | - |
| ViT-Vis[18] | 512 | 0.360 | 0.503 | 0.410 | 0.569 | 0.403 | 0.512 | 0.101 | 0.113 |
| ViT-Ret[18] | 512 | 0.438 | 0.578 | 0.483 | 0.637 | 0.416 | 0.522 | 0.115 | 0.127 |
| DSN[1] | 512 | 0.484 | 0.591 | 0.583 | 0.704 | - | - | - | - |
| SBTKNet[2] | 512 | 0.480 | 0.608 | 0.553 | 0.698 | 0.502 | 0.596 | - | - |
| Sketch3T[5] | 512 | 0.507 | - | 0.575 | - | - | - | - | - |
| TVT[4] | 384 | 0.484 | 0.662 | 0.648 | 0.796 | 0.531 | 0.618 | 0.149 | 0.293 |
| ZSE-SBIR-RN[7] | 512 | 0.540 | 0.647 | 0.698 | 0.789 | 0.525 | 0.620 | 0.145 | 0.216 |
| Ours | 512 | 0.560 | 0.665 | 0.730 | 0.809 | 0.528 | 0.624 | †0.139 | †0.209 |

Competitors. We compare our model with ZSIH[27], CC-DG[28], DOODLE[11], SEM-PCYC[9], SAKE[6], SketchGCN[8], StyleGuide[10], PDFD[12], ViT-Ret/ViT-Vis[18], DSN[1], SBTKNet[2], Sketch3T[5], TVT[4] and ZSE-SBIR[7]. ViT-Ret: replacing the class token in ViT with a retrieval token. ViT-Vis: using the visual tokens. ZSE-SBIR-RN refers to using the relation network for retrieval.

Evaluation protocol. Following the standard evaluation protocol, we test our model with mean average precision (mAP) and precision on top 100/200 (Prec@100/200).
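For reference, the two metrics can be sketched as follows. Each query is represented by a ranked list of 0/1 relevance flags (1 when the retrieved image shares the query's class). This is a minimal illustration with toy rankings; details such as how truncated variants like mAP@200 are normalized differ between implementations and are not fixed by this sketch.

```python
def precision_at_k(relevant, k):
    """Fraction of the top-k retrieved items that share the query's class."""
    top = relevant[:k]
    return sum(top) / len(top)


def average_precision(relevant):
    """Mean of precision@i over the ranks i at which a relevant item appears."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """Average of the per-query average precision."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


q1 = [1, 0, 1, 0]   # relevant items at ranks 1 and 3
q2 = [0, 1, 0, 0]   # relevant item at rank 2
m_ap = mean_average_precision([q1, q2])
p_at_2 = precision_at_k(q1, 2)
```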

Implementation details. We implement our model with the PyTorch toolkit. Each sketch or image is resized to 224×224. For the network architecture, the self-attention consists of 12 layers, and the ViT-B/16 backbone is pre-trained on ImageNet-1K. The cross-attention is designed as one layer with 12 heads. For the optimizer, we choose Adam with a weight decay of 1e-2 and a learning rate of 5e-6.

4.2 Results

From Table 1, we can see that our proposed method ranks in the top three among all competitors on every metric. More specifically, our method achieves the best results on the Sketchy-25 and TU-Berlin datasets. Our model does not rank first in mAP@200 on Sketchy-21, possibly because the unseen classes do not overlap with the ImageNet categories on which ViT-B/16 is pre-trained. Because QuickDraw is very large, we test our method by randomly sampling 40,000 sketches as queries. TVT [4] achieves the best result on QuickDraw, possibly because it has (i) a fusion ViT for distilling tokens of different modalities and (ii) a classification loss, which can learn more zero-shot generalization knowledge from large-scale noisy data. In addition, the sketches in QuickDraw are of relatively low quality; because our method focuses more on details, the noise in the sketches may hurt its performance. Nevertheless, our results, marked with †, remain comparable.

Table 2: Ablation study results showing the importance of textual information.
Methods Sketchy TU-Berlin
mAP@all Prec@100 mAP@all Prec@100
w/o T-S 0.686 0.788 0.536 0.631
w/o T-I 0.689 0.790 0.556 0.649
w/o text 0.698 0.789 0.540 0.647
Ours 0.730 0.809 0.560 0.665

4.3 Further Study

Ablation study. We test the effectiveness of using textual information in various ways. Note that the descriptions yielded by the LLM are used only in the training phase. w/o T-S: the cross-attention between the text tokens and the sketch tokens is removed, leaving the rest of the network untouched. w/o T-I: the cross-attention between the text tokens and the image tokens is removed, leaving the rest of the network untouched. w/o text: the textual information is completely removed, and the cross-attention is calculated directly between sketch and image tokens. We can see from Table 2 that: (i) without both the text-sketch and the text-image cross-attention (w/o text), mAP@all drops from 0.730 to 0.698 on Sketchy and from 0.560 to 0.540 on TU-Berlin, indicating the importance of using text information; (ii) using only the text-sketch cross-attention (w/o T-I) performs better than using only the text-image cross-attention (w/o T-S), possibly because image backgrounds are more complex than those of sketches.
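
To make the three ablation variants concrete, the following is a minimal single-head sketch of cross-attention in plain Python. It is an illustration under assumptions, not the implementation used in our experiments: taking visual tokens as queries and text tokens as keys and values, the helper names `cross_attention` and `fuse`, and the omission of learned projections and multiple heads are all simplifications introduced here.

```python
import math
from typing import List

Matrix = List[List[float]]  # one row per token


def softmax(row: List[float]) -> List[float]:
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention: each query token attends over all key tokens."""
    scale = math.sqrt(len(keys[0]))
    dim_out = len(values[0])
    output = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        output.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_out)])
    return output


def fuse(visual: Matrix, text: Matrix, use_text: bool = True) -> Matrix:
    """Hypothetical helper: update visual tokens from text tokens, or pass them through."""
    return cross_attention(visual, text, text) if use_text else visual


# Toy tokens with 2-dimensional features.
sketch_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[0.5, 0.5], [1.0, 1.0]]
text_tokens = [[2.0, 3.0]]

full = (fuse(sketch_tokens, text_tokens), fuse(image_tokens, text_tokens))
without_ts = (fuse(sketch_tokens, text_tokens, use_text=False), fuse(image_tokens, text_tokens))
without_ti = (fuse(sketch_tokens, text_tokens), fuse(image_tokens, text_tokens, use_text=False))
without_text = cross_attention(sketch_tokens, image_tokens, image_tokens)  # sketch attends to image
```

In this toy setting, disabling one branch simply leaves that modality's tokens unchanged, which mirrors "leaving the rest of the network untouched" in the variant definitions.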

Test of text prompt variation. We test the effect of using different text prompts on GPT-3. Text prompt1 is not generated by the LLM; it is the fixed text "a photo of a {category name}." For the remaining prompts, we select and collect useful LLM outputs as our model's input. The prompts are adapted from previous works with only minor changes. (We tried only these three LLM prompts; other appropriate prompts may also work.) Text prompts 1 to 4 are:

  • a photo of a {category name}.

  • A caption describing a photo of a {category name}. [24]

  • What does a {category name} look like? [24]

  • What are useful visual features for distinguishing a {category name} in a photo? [33]

As shown in Table 3, text prompt4 yields the best results. The simple fixed text (text prompt1) brings little benefit to the performance, indicating the importance of text generated by the LLM, which describes the attributes of categories in detail. Prompting the LLM with text prompt4 offers more detailed information for recognizing an image, making it possible for us to gather descriptions that summarize the shared attributes of both sketches and images. This suggests that a more specific prompt leads to better descriptions from the LLM, thus boosting the model's performance.
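
As an illustration of how the four prompts can be instantiated for a training category, consider the short sketch below. The dictionary keys and the function name `build_prompts` are hypothetical, and the placeholder is written as `{category}` so that it is a valid field name for Python's `str.format`. The call to the LLM itself is omitted.

```python
from typing import Dict

PROMPT_TEMPLATES: Dict[str, str] = {
    # prompt1 is used directly as the text input and is not sent to the LLM
    "prompt1": "a photo of a {category}.",
    "prompt2": "A caption describing a photo of a {category}.",
    "prompt3": "What does a {category} look like?",
    "prompt4": "What are useful visual features for distinguishing a {category} in a photo?",
}


def build_prompts(category: str) -> Dict[str, str]:
    """Fill every template with the given category name."""
    return {name: template.format(category=category) for name, template in PROMPT_TEMPLATES.items()}


for name, prompt in build_prompts("suitcase").items():
    print(name, "->", prompt)
```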

Table 3: Results on Sketchy-25 and Sketchy-21 datasets using texts generated by different text prompts on GPT-3.
Prompt Sketchy-25 Sketchy-21
mAP@all Prec@100 mAP@200 Prec@200
Text prompt1 0.695 0.796 0.522 0.621
Text prompt2 0.720 0.801 0.520 0.618
Text prompt3 0.727 0.807 0.525 0.618
Text prompt4 0.730 0.809 0.528 0.624

Qualitative analysis. The top 10 retrieved candidates for the sketch queries are shown in Fig. 3. We can observe that, compared with ZSE-SBIR-RN, our model retrieves more relevant images. ZSE-SBIR-RN retrieves false positive images that share a similar overall object pose and shape with the query but may ignore some key features. For example, the shape of a suitcase is similar to that of a lighter, as both include a rectangular body, but the sketch query of a suitcase also includes a handle. Our model successfully retrieves the true positive images of a suitcase with a handle, corresponding to the text description of a suitcase as "equipped with handles". This shows the effectiveness of utilizing the textual information rather than relying only on the similar shape shared by the query sketches and image candidates.

Figure 3: Exemplar comparison retrieval results for the given query sketches and the top 10 retrieved images. Red box denotes false positive, Green box denotes true positive.
Figure 4: The mAP@all and Prec@100 scores on TU-Berlin with different values of $\lambda_{tri}$ and $\lambda_{rn}$.

Analysis on parameter sensitivity. As shown in Fig. 4, we analyze the effect of $\lambda_{tri}$ and $\lambda_{rn}$ on the TU-Berlin dataset. With $\lambda_{rn}$ set to 8, we vary $\lambda_{tri}$ over 0.1, 0.5, 1, and 2; both mAP@all and Prec@100 increase from 0.1 to 0.5 and decrease from 0.5 to 2. We therefore set $\lambda_{tri}$ to 0.5. Similarly, with $\lambda_{tri}$ set to 1, we observe that $\lambda_{rn}$ should be set to 8.
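
The one-at-a-time procedure above (fix one weight, vary the other, keep the best value) can be sketched as follows. Everything numeric in this example except the $\lambda_{tri}$ candidates is an assumption: `toy_metric` is a synthetic stand-in for training and evaluating the model and does not reproduce the measurements in Fig. 4, and the $\lambda_{rn}$ grid is hypothetical because the text reports only the chosen value.

```python
from typing import Callable, Sequence


def best_value(candidates: Sequence[float], score: Callable[[float], float]) -> float:
    """Return the candidate with the highest score (the first one wins on ties)."""
    return max(candidates, key=score)


def toy_metric(lam_tri: float, lam_rn: float) -> float:
    # Synthetic stand-in for a retrieval metric; NOT measured data from Fig. 4.
    return -abs(lam_tri - 0.5) - abs(lam_rn - 8.0)


tri_candidates = [0.1, 0.5, 1.0, 2.0]   # values listed in the text
rn_candidates = [2.0, 4.0, 8.0, 16.0]   # hypothetical grid, not given in the text

# Sweep lambda_tri with lambda_rn fixed at 8, then lambda_rn with lambda_tri fixed at 1.
best_tri = best_value(tri_candidates, lambda t: toy_metric(t, lam_rn=8.0))
best_rn = best_value(rn_candidates, lambda r: toy_metric(lam_tri=1.0, lam_rn=r))
print(best_tri, best_rn)
```

In practice, `toy_metric` would be replaced by a function that trains the model with the given weights and returns mAP@all or Prec@100 on TU-Berlin.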

5 Conclusion

In this work, we introduce a Cross-Modal Attention Alignment Network with Auxiliary Text Description for zero-shot sketch-based image retrieval. The usage of texts in the ZS-SBIR problem demonstrates the rich zero-shot generalization ability of linguistic data. We gather and select summarizing sentences for each training category by prompting an LLM appropriately. We then extract feature tokens with the self-attention mechanism, exchange the tokens with the cross-attention mechanism, and finally align them both locally and globally. Extensive experiments conducted on three benchmark datasets demonstrate the superiority of our approach. Moreover, we also prompt the LLM in different ways to collect category-specific descriptions and compare the effect of the different prompts.

6 Acknowledgements

This work is partially supported by the National Science Foundation of China (62106108, 62276138, 62076135, and 61876087) and the Natural Science Foundation of Jiangsu Province (BK20210559).

References

  • [1] Wang, Zhipeng, et al. ”Domain-smoothing network for zero-shot sketch-based image retrieval.” arXiv preprint arXiv:2106.11841 (2021).
  • [2] Tursun, Osman, et al. ”An efficient framework for zero-shot sketch-based image retrieval.” Pattern Recognition 126 (2022): 108528.
  • [3] Yang, Yang, et al. ”Rethinking Label-Wise Cross-Modal Retrieval from A Semantic Sharing Perspective.” In IJCAI, 2021 (pp. 3300-3306).
  • [4] Tian, Jialin, et al. ”TVT: three-way vision transformer through multi-modal hypersphere learning for zero-shot sketch-based image retrieval.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 36. No. 2. 2022.
  • [5] Sain, Aneeshan, et al. ”Sketch3t: Test-time training for zero-shot sbir.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
  • [6] Liu, Qing, et al. ”Semantic-aware knowledge preservation for zero-shot sketch-based image retrieval.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
  • [7] Lin, Fengyin, et al. ”Zero-shot everything sketch-based image retrieval, and in explainable style.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
  • [8] Gupta, Sumrit, et al. ”Zero-shot sketch based image retrieval using graph transformer.” 2022 26th International Conference on Pattern Recognition (ICPR). IEEE, 2022.
  • [9] Dutta, Anjan, and Zeynep Akata. ”Semantically tied paired cycle consistency for zero-shot sketch-based image retrieval.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
  • [10] Dutta, Titir, Anurag Singh, and Soma Biswas. ”Styleguide: Zero-shot sketch-based image retrieval using style-guided image generation.” IEEE Transactions on Multimedia 23 (2020): 2833-2842.
  • [11] Dey, Sounak, et al. ”Doodle to search: Practical zero-shot sketch-based image retrieval.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.
  • [12] Deng, Cheng, et al. ”Progressive cross-modal semantic network for zero-shot sketch-based image retrieval.” IEEE Transactions on Image Processing 29 (2020): 8892-8902.
  • [13] Yelamarthi, Sasi Kiran, et al. ”A zero-shot framework for sketch based image retrieval.” Proceedings of the European Conference on Computer Vision (ECCV). 2018.
  • [14] Yang, Yang, et al. ”Learning adaptive embedding considering incremental class.” IEEE Transactions on Knowledge and Data Engineering 35.3 (2021): 2736-2749.
  • [15] Liu, Li, et al. ”Deep sketch hashing: Fast free-hand sketch-based image retrieval.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
  • [16] Naeem, Muhammad Ferjad, et al. ”I2mvformer: Large language model generated multi-view document supervision for zero-shot image classification.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
  • [17] Brown, Tom, et al. ”Language models are few-shot learners.” Advances in neural information processing systems 33 (2020): 1877-1901.
  • [18] Dosovitskiy, Alexey, et al. ”An image is worth 16x16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).
  • [19] Vaswani, Ashish, et al. ”Attention is all you need.” Advances in neural information processing systems 30 (2017).
  • [20] Chen, Chun-Fu Richard, Quanfu Fan, and Rameswar Panda. ”Crossvit: Cross-attention multi-scale vision transformer for image classification.” Proceedings of the IEEE/CVF international conference on computer vision. 2021.
  • [21] Yang, Yang, et al. ”CoVLR: Coordinating Cross-Modal Consistency and Intra-Modal Structure for Vision-Language Retrieval.” arXiv e-prints (2023): arXiv-2304.
  • [22] Doersch, Carl, Ankush Gupta, and Andrew Zisserman. ”Crosstransformers: spatially-aware few-shot transfer.” Advances in Neural Information Processing Systems 33 (2020): 21981-21993.
  • [23] Mao, Chengzhi, et al. ”Doubly right object recognition: A why prompt for visual rationales.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
  • [24] Zhang, Renrui, et al. ”Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
  • [25] Yang, Yang, et al. ”Not all out-of-distribution data are harmful to open-set active learning.” Advances in Neural Information Processing Systems 36 (2024).
  • [26] Radford, Alec, et al. ”Learning transferable visual models from natural language supervision.” International conference on machine learning. PMLR, 2021.
  • [27] Shen, Yuming, et al. ”Zero-shot sketch-image hashing.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
  • [28] Pang, Kaiyue, et al. ”Generalising fine-grained sketch-based image retrieval.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
  • [29] Sangkloy, Patsorn, et al. ”The sketchy database: learning to retrieve badly drawn bunnies.” ACM Transactions on Graphics (TOG) 35.4 (2016): 1-12.
  • [30] Deng, Jia, et al. ”Imagenet: A large-scale hierarchical image database.” 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009.
  • [31] Yang, Yang, et al. ”Semi-supervised multi-modal clustering and classification with incomplete modalities.” IEEE Transactions on Knowledge and Data Engineering 33.2 (2021): 682-695.
  • [32] Eitz, Mathias, et al. ”An evaluation of descriptors for large-scale image retrieval from sketched feature lines.” Computers and Graphics 34.5 (2010): 482-498.
  • [33] Menon, Sachit, and Carl Vondrick. ”Visual classification via description from large language models.” arXiv preprint arXiv:2210.07183 (2022).
  • [34] Yang, Yang, et al. ”Semi-supervised multi-modal multi-instance multi-label deep network with optimal transport.” IEEE Transactions on Knowledge and Data Engineering 33.2 (2021): 696-709.