\addauthor

Peng [email protected],2 \addauthorYujian [email protected] \addauthorHui [email protected] \addauthorZailong [email protected] \addauthorXubo [email protected] \addauthorYiyang [email protected],2 \addauthorGuquan [email protected],2 \addinstitution Beijing Normal University-Hong Kong Baptist University United International College.
Zhuhai, China \addinstitution Hong Kong Baptist University.
Hong Kong, China \addinstitution University of Wollongong.
Wollongong, Australia \addinstitution University of Surrey.
Guildford, United Kingdom

Dynamic Identity-Guided Attention Network for Visible-Infrared Person Re-identification

Abstract

Visible-infrared person re-identification (VI-ReID) aims to match people with the same identity across visible and infrared modalities. It is a challenging task because an individual's appearance differs greatly between the two modalities. Existing methods generally try to bridge these cross-modal differences at the image or feature level, but they do not explicitly explore discriminative embeddings. Effectively minimizing cross-modal discrepancies relies on obtaining representations that are guided by identity and consistent across modalities, while filtering out representations that are irrelevant to identity. To address these challenges, we introduce a dynamic identity-guided attention network (DIAN) that mines identity-guided and modality-consistent embeddings to bridge the gap between modalities. Specifically, to pursue a semantically richer representation, DIAN first uses orthogonal projection to fuse the features from two connected coarse and fine layers. It then uses dynamic convolution kernels to mine identity-guided and modality-consistent representations. Moreover, a cross embedding balance loss is introduced to bridge cross-modal discrepancies using the above embeddings. Experimental results on the SYSU-MM01 and RegDB datasets show that DIAN achieves state-of-the-art performance. In particular, for indoor search on SYSU-MM01, our method achieves 86.28% rank-1 accuracy and 87.41% mAP. Our code will be available soon.

1 Introduction

Person re-identification (ReID), an important field of computer vision, focuses on recognizing the same person across cameras [Ye et al.(2021b)]. VI-ReID is a subfield of ReID that specializes in matching people between images captured by visible and infrared cameras. It is challenging because of the large cross-modal discrepancies. Current methods mainly handle VI-ReID at either the image level or the feature level.

Among image-level methods, some works [Zheng et al.(2017), Wang et al.(2019b), Hao et al.(2019)] aim to reduce cross-modal differences by finding modality-invariant embeddings. Others [Li et al.(2020), Wang et al.(2020), Wei et al.(2018), Zhong et al.(2021), Ye et al.(2020), Gao et al.(2021), Liu et al.(2022)] generate intermediate images between visible and infrared data, so that the two modalities can be better aligned and integrated at the image level through an intermediate modality. Works in [Sun et al.(2018), Wei et al.(2021)] emphasize detailed features in heterogeneous human images by dividing the image into several parts and calculating their relations. Overall, image-level approaches are simple and capture holistic information, which facilitates understanding of the overall scene and context. However, they lack the capability to extract fine-grained features.

In contrast, feature-level methods can mine more fine-grained representations. The methods in [Jiang et al.(2022), Chai et al.(2023), Feng et al.(2023)] enhance joint embedding patterns by focusing on modality-related embeddings at the feature level. Similarly, researchers [Zhang et al.(2022), Chen et al.(2018), Gao et al.(2019), Chen et al.(2021a), Dai et al.(2018), Wang et al.(2019a), Zhao et al.(2021), Liang et al.(2021), Lu et al.(2023)] seek channel-level, spatial-level, or multi-scale implicit connections between modalities through various attention mechanisms. These works indicate that fine-grained features are also crucial.

Whether they operate at the image level or the feature level, however, existing methods frequently neglect the significance of identity-guided and modality-consistent embeddings, and therefore fail to bridge the gap between modalities effectively. To address this challenge, we introduce a novel network named Dynamic Identity-Guided Attention Network (DIAN). By prioritizing identity guidance and enforcing modality consistency in the embeddings, DIAN offers a promising solution to this longstanding issue. The overall architecture is shown in Fig. 1(a). First, inspired by [Yang et al.(2021)], we introduce an orthogonal fusion module (OFM) that reduces feature redundancy between connected layers and fuses them effectively; OFM generates semantically rich features through orthogonal projection. Second, in view of the superior ability of dynamic convolution kernels to process high-response information [Shen et al.(2023)], we propose an identity-guided embedding decoupling kernel (IEDK), which decouples feature maps at different scales and effectively mines identity-guided and modality-consistent embeddings. Third, we introduce a parallel progressive enhancement module (PPEM). This module enhances embeddings through parallel spatial and channel attention blocks. Moving from a serial to a parallel design maximizes the utilization of training data and avoids the data scarcity issue caused by the original serial design of attention [Vaswani et al.(2017)]. Finally, a cross embedding balance loss (CEBL) is designed to minimize cross-modal discrepancies.

To the best of our knowledge, no other method uses a similar approach to solve the VI-ReID task. In summary, our main contributions are as follows.

1. A novel dynamic identity-guided attention network (DIAN) is proposed for the VI-ReID task. In DIAN, the orthogonal fusion module (OFM) obtains semantically rich representations through a novel orthogonal feature fusion method. The identity-guided embedding decoupling kernel (IEDK) obtains identity-guided and modality-consistent embeddings at various scales. The parallel progressive enhancement module (PPEM) further enhances these embeddings.

2. The cross embedding balance loss (CEBL) is introduced to reduce cross-modal discrepancies and enhance cross-modal consistency by constraining the distribution of decoupled and enhanced embeddings.

3. Experimental results show that DIAN achieves remarkable performance on the VI-ReID task. Specifically, for indoor search on the SYSU-MM01 dataset, it achieves a Rank-1 accuracy of 86.28% and an mAP of 87.41%, outperforming existing SOTA methods.

2 Method

The overall DIAN architecture is shown in Fig. 1(a), with ResNet50 as the backbone. The OFMs fuse the features of two connected layers, producing a semantically rich representation for the subsequent decoupling step. IEDK decouples the features and filters out identity-unrelated embeddings, thereby capturing identity-guided and modality-consistent embeddings at diverse scales. PPEM then enhances these embeddings to obtain more discriminative representations. The outputs are used in $\mathcal{L}_{CEBL}$ to bridge cross-modal discrepancies effectively. Initially, visible and infrared images are fed to the ResNet block in stage 0 to obtain the visible and infrared feature maps $\mathbf{V}_{0}$ and $\mathbf{I}_{0}$. We concatenate them as $\mathbf{A}_{0}=\mathbf{Concat}(\mathbf{V}_{0},\mathbf{I}_{0})$ for joint training.

Figure 1: (a) The network architecture of DIAN and its components. (b) Three orthogonal fusion modules (OFMs). (c) Identity-guided embedding decoupling kernel (IEDK). (d) Parallel progressive enhancement module (PPEM). (e) The legend of pictures.

2.1 Orthogonal Fusion Module

Other methods typically consider only feature fusion, which can leave redundant information between adjacent layers and hinder subsequent modules from exploring features effectively. We introduce a novel two-stage orthogonal fusion module (OFM) to remove feature redundancy and obtain refined feature maps for the next step. As shown in Fig. 1(b), let $\mathbf{A}_{r(i-1)}$ ($i=1,2,3$, with $\mathbf{A}_{r0}=\mathbf{A}_{0}$) from the preceding layer be the coarse feature $\mathbf{A}_{c}\in\mathbb{R}^{C_{c}\times H_{c}\times W_{c}}$, and let $\mathbf{A}_{i}$, the output of stage $i$, be the fine feature $\mathbf{A}_{f}\in\mathbb{R}^{C_{f}\times H_{f}\times W_{f}}$. To fuse them while eliminating redundancy, we first expand the coarse feature to $\mathbf{A}_{c}\in\mathbb{R}^{C_{c}\times H_{f}\times W_{f}}$ with an up-sampling block, so that $\mathbf{A}_{c}$ has the same height and width as $\mathbf{A}_{f}$. To integrate information comprehensively, we then explore the multi-scale information within the coarse feature: $\mathbf{A}_{c}$ passes through a multi-scale block whose outputs are concatenated,

\begin{equation}
\mathbf{A}^{\prime}_{c}=\mathbf{Concat}(\mathbf{Conv}^{3\times 3}_{dilation=k}(\mathbf{A}_{c})),\ k=3,5,7,9,\quad \mathbf{A}^{\prime}_{c}\in\mathbb{R}^{4C_{c}\times H_{f}\times W_{f}}. \tag{1}
\end{equation}
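Eq. (1) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the random weights, the assumption that every branch keeps $C_c$ channels, and the choice of zero padding equal to the dilation rate (so that each branch keeps the $H_f\times W_f$ resolution stated in Eq. (1)) are all made here for illustration only.

```python
import numpy as np

def dilated_conv3x3(x, weight, dilation):
    """3x3 cross-correlation with the given dilation (hypothetical helper).

    x: (C_in, H, W); weight: (C_out, C_in, 3, 3).
    Zero padding equal to the dilation keeps the H x W resolution (assumption).
    """
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (dilation, dilation), (dilation, dilation)))
    out = np.zeros((weight.shape[0], h, w))
    for ky in range(3):
        for kx in range(3):
            # Input window seen by kernel tap (ky, kx) at every output position.
            window = padded[:, ky * dilation:ky * dilation + h,
                            kx * dilation:kx * dilation + w]
            out += np.einsum("oc,chw->ohw", weight[:, :, ky, kx], window)
    return out

def multi_scale_block(a_c, weights, dilations=(3, 5, 7, 9)):
    """Concatenate one dilated branch per rate along the channel axis."""
    branches = [dilated_conv3x3(a_c, w, d) for w, d in zip(weights, dilations)]
    return np.concatenate(branches, axis=0)

rng = np.random.default_rng(0)
c_c, h_f, w_f = 4, 12, 6
a_c = rng.standard_normal((c_c, h_f, w_f))
weights = [rng.standard_normal((c_c, c_c, 3, 3)) for _ in range(4)]
a_c_prime = multi_scale_block(a_c, weights)
print(a_c_prime.shape)  # (16, 12, 6), i.e. 4*C_c x H_f x W_f
```

Larger dilation rates widen the receptive field of a branch without adding parameters, which is why the four branches see the same coarse feature at four different scales.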

An attention module built on a softplus layer then highlights the weights of the different scales,

\begin{equation}
\mathbf{A}^{\prime\prime}_{c}=\mathbf{Attention}_{softplus}(\mathbf{A}^{\prime}_{c}),\quad \mathbf{A}^{\prime\prime}_{c}\in\mathbb{R}^{4C_{c}\times H_{f}\times W_{f}}, \tag{2}
\end{equation}

where $\mathbf{A}^{\prime\prime}_{c}$ denotes the coarse feature maps obtained by weighting each scale. Meanwhile, for the fine features $\mathbf{A}_{f}\in\mathbb{R}^{C_{f}\times H_{f}\times W_{f}}$, we use a feature expansion module. It contains an average pooling layer that resizes the features to $\mathbf{A}^{\prime}_{f}\in\mathbb{R}^{C_{f}\times 1\times 1}$ and a linear layer that further converts them to $\mathbf{A}^{\prime\prime}_{f}\in\mathbb{R}^{4C_{c}\times 1\times 1}$, which places the fine features in the same $4C_{c}$-channel space as $\mathbf{A}^{\prime\prime}_{c}$. To obtain refined features, we use orthogonal projection to reduce the redundancy between $\mathbf{A}^{\prime\prime}_{c}$ and $\mathbf{A}^{\prime\prime}_{f}$:

\begin{align}
\mathbf{A}_{proj} &= \frac{\mathbf{A}^{\prime\prime}_{c}\cdot\mathbf{A}^{\prime\prime}_{f}}{\left|\mathbf{A}^{\prime\prime}_{f}\right|^{2}}\mathbf{A}^{\prime\prime}_{f}, \tag{3}\\
\mathbf{A}^{\prime\prime}_{c}\cdot\mathbf{A}^{\prime\prime}_{f} &= \sum_{i=1}^{4C_{c}}\mathbf{A}^{\prime\prime}_{c,i}\mathbf{A}^{\prime\prime}_{f,i},\quad \left|\mathbf{A}^{\prime\prime}_{f}\right|^{2}=\sum_{i=1}^{4C_{c}}\left(\mathbf{A}^{\prime\prime}_{f,i}\right)^{2}, \tag{4}
\end{align}

where $\mathbf{A}^{\prime\prime}_{c}\cdot\mathbf{A}^{\prime\prime}_{f}$ is the dot product and $\left|\mathbf{A}^{\prime\prime}_{f}\right|^{2}$ is the squared $L_{2}$ norm of $\mathbf{A}^{\prime\prime}_{f}$. $\mathbf{A}_{proj}$ is the projection of the coarse features onto the fine features, which represents the redundant information shared by the coarse and fine features. As shown at the bottom of Fig. 1(b), the non-redundant information $\mathbf{A}_{d}$ is obtained as the difference between $\mathbf{A}_{c}$ and $\mathbf{A}_{proj}$,

\begin{equation}
\mathbf{A}_{d}=\mathbf{A}_{c}-\mathbf{A}_{proj}. \tag{5}
\end{equation}
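Eqs. (3)-(5) amount to a vector projection applied at every spatial location. The NumPy sketch below is illustrative only: it assumes that the coarse map and the pooled fine vector already share the same number of channels, and the names `coarse`, `fine`, and `orthogonal_split` are hypothetical rather than taken from the authors' code.

```python
import numpy as np

def orthogonal_split(coarse, fine, eps=1e-12):
    """Split `coarse` into its projection onto `fine` and the orthogonal remainder.

    coarse: (C, H, W) feature map; fine: (C,) pooled vector (same C assumed).
    """
    dot = np.einsum("chw,c->hw", coarse, fine)   # dot product per location, Eq. (4)
    sq_norm = np.dot(fine, fine) + eps           # squared L2 norm, Eq. (4)
    proj = (dot / sq_norm)[None, :, :] * fine[:, None, None]  # Eq. (3)
    return proj, coarse - proj                   # remainder, Eq. (5)

rng = np.random.default_rng(0)
coarse = rng.standard_normal((8, 4, 3))
fine = rng.standard_normal(8)
a_proj, a_d = orthogonal_split(coarse, fine)
# The remainder has (numerically) no component along the fine vector.
print(np.abs(np.einsum("chw,c->hw", a_d, fine)).max() < 1e-9)  # True
```

By construction the remainder is orthogonal to the fine vector at every location, so it carries only the part of the coarse feature that the fine feature does not already express.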

After that, the output $\mathbf{A}_{ri}$ of the $i$-th OFM can be obtained by

\begin{equation}
\mathbf{A}_{fuse}=\mathbf{Fc}(\mathbf{Gap}(\mathbf{Concat}(\mathbf{A}^{\prime\prime}_{f},\mathbf{A}_{d}))),\quad \mathbf{A}_{ri}=(\mathbf{A}_{fuse}\otimes\mathbf{A}_{f})+\mathbf{A}_{f}, \tag{6}
\end{equation}

where $\mathbf{Gap}$ is global average pooling, $\mathbf{Fc}$ is a linear layer, and $\otimes$ denotes element-wise multiplication. For example, $\mathbf{A}_{r1}$ is the output of the first OFM, which contains the rich semantic representations of two layers. $\mathbf{A}_{ri}$ is then treated as the coarse input of the $(i+1)$-th OFM, and it also passes through the ResNet block in stage $i+1$ to produce the fine input features $\mathbf{A}_{i+1}$ of the $(i+1)$-th OFM. For the first OFM, the coarse input is $\mathbf{A}_{r0}=\mathbf{A}_{0}$, and $\mathbf{A}_{r3}$ is the final output of the last (third) OFM.
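One possible reading of Eq. (6) is sketched below in NumPy. Several details are assumptions rather than statements from the paper: the pooled fine vector is broadcast over the spatial grid before concatenation, the linear layer maps the pooled vector to $C_f$ channels, and the result acts as a channel-wise gate on $\mathbf{A}_{f}$. The function `ofm_fuse` and its arguments are hypothetical names.

```python
import numpy as np

def ofm_fuse(a_f, a_f_vec, a_d, fc_weight, fc_bias):
    """Hypothetical reading of Eq. (6).

    a_f: (C_f, H, W) fine features; a_f_vec: (C, 1, 1) pooled fine vector;
    a_d: (C, H, W) non-redundant coarse part;
    fc_weight: (C_f, 2C); fc_bias: (C_f,).
    """
    _, h, w = a_d.shape
    tiled = np.broadcast_to(a_f_vec, (a_f_vec.shape[0], h, w))  # assumed broadcast
    stacked = np.concatenate([tiled, a_d], axis=0)              # Concat -> (2C, H, W)
    pooled = stacked.mean(axis=(1, 2))                          # Gap -> (2C,)
    a_fuse = fc_weight @ pooled + fc_bias                       # Fc -> (C_f,)
    return a_fuse[:, None, None] * a_f + a_f                    # gate plus residual

rng = np.random.default_rng(0)
c, c_f, h, w = 8, 6, 4, 3
a_f = rng.standard_normal((c_f, h, w))
a_f_vec = rng.standard_normal((c, 1, 1))
a_d = rng.standard_normal((c, h, w))
a_r = ofm_fuse(a_f, a_f_vec, a_d, rng.standard_normal((c_f, 2 * c)), np.zeros(c_f))
print(a_r.shape)  # (6, 4, 3), same shape as the fine features
```

The residual term means that the module falls back to the unmodified fine features whenever the gate is zero, so the fused output never discards the stage output.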

2.2 Identity-Guided Embedding Decoupling Kernel

The identity-guided embedding decoupling kernel (IEDK) is introduced to preserve identity-guided and modality-consistent embeddings. As shown in Fig. 1(c), IEDK takes the output of the last (third) OFM, $\mathbf{A}_{r3}$, as input and locates cross-modal identity-guided and modality-consistent embeddings at different scales. In detail, $\mathbf{A}_{r3}$ is processed in four branches for decoupling, each with a specific purpose. One branch, the original branch $\mathbf{A}_{\rm{o}}$, retains the original information without any changes. The other three branches $\mathbf{A}_{mk}$ ($k=1,3,5$) use the Unfold function with different dilation rates to extract features at diverse scales. Unlike a convolution, the Unfold function has no learnable parameters, so the inherent meaning and structure of the input are preserved during processing. The four branches can thus be expressed as

\begin{equation}
\mathbf{A}_{\rm{o}}=\mathbf{A}_{r3},\quad \mathbf{A}_{mk}=\mathbf{Unfold}_{dilation=k}(\mathbf{A}_{r3}),\ k=1,3,5. \tag{7}
\end{equation}
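The parameter-free nature of the Unfold branches in Eq. (7) can be seen in a small NumPy sketch of sliding-window extraction. The $3\times 3$ window and the zero padding equal to the dilation are assumptions, since the text specifies only the dilation rates, and `unfold3x3` is a hypothetical helper rather than the authors' code.

```python
import numpy as np

def unfold3x3(x, dilation):
    """Sliding-window extraction (im2col) with a dilated 3x3 window.

    x: (C, H, W) -> (C * 9, H * W). No learnable parameters are involved:
    every output row is a shifted copy of one input channel.
    """
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (dilation, dilation), (dilation, dilation)))
    taps = [padded[:, ky * dilation:ky * dilation + h,
                   kx * dilation:kx * dilation + w]
            for ky in range(3) for kx in range(3)]
    # Stack the 9 taps after the channel axis: row index = channel * 9 + tap.
    return np.stack(taps, axis=1).reshape(c * 9, h * w)

rng = np.random.default_rng(0)
a_r3 = rng.standard_normal((4, 6, 3))
branches = {k: unfold3x3(a_r3, k) for k in (1, 3, 5)}
print(branches[3].shape)  # (36, 18)
```

The centre tap of every window reproduces the input exactly, while the other eight taps sample neighbours that lie further away as the dilation grows.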

Next, because dynamic convolution kernels demonstrate superior performance in protecting high-response areas of the image [Shen et al.(2023)], we introduce a novel dynamic convolution kernel built with deformable convolution [Dai et al.(2017)], aiming to preserve identity-guided and modality-consistent embeddings. As shown in Fig. 1(c), the three branches apply element-wise operations with the dynamic convolution kernel to extract features at different scales. The kernel aggregates attention from both spatial and channel perspectives and captures non-linear interactions within the data, leading to a more robust representation and more discriminative identity-guided embeddings. To obtain the dynamic convolution kernel, the input $\mathbf{A}_{r3}$ is first processed by a deformable convolution layer, which can preserve high-response information, yielding

\begin{equation}
\mathbf{A}^{*}=\mathbf{Deformable\_Conv}_{deformable\_groups=8}(\mathbf{A}_{r3}). \tag{8}
\end{equation}

Because identity features are retained in the spatial domain and modality features are retained in the channel domain, $\mathbf{A}^{*}$ is processed by spatial and channel refinement to obtain discriminative identity-guided and modality-consistent embeddings. For spatial refinement,

\begin{equation}
\begin{aligned}
\mathbf{Q}_{sp}&=\mathbf{Conv}^{1\times 1}_{sp}(\mathbf{A}^{*}),\quad \mathbf{V}_{sp}=\mathbf{Gap}_{sp}(\mathbf{Conv}^{1\times 1}_{sp}(\mathbf{A}^{*})),\\
\mathbf{Z}_{sp}&=\mathbf{Sigmoid}(\mathbf{V}_{sp}\otimes\mathbf{Softmax}(\mathbf{Q}_{sp})),\quad \mathbf{A}^{*}_{sp}=\mathbf{Z}_{sp}\otimes\mathbf{A}^{*}+\mathbf{A}^{*},
\end{aligned}
\tag{9}
\end{equation}

where $\mathbf{Conv}^{1\times 1}_{sp}$ is a $1\times 1$ convolution layer, $\mathbf{Gap}_{sp}$ is global average pooling, $\mathbf{Sigmoid}$ is the sigmoid activation, and $\otimes$ is element-wise multiplication. For channel refinement,

\begin{equation}
\begin{aligned}
\mathbf{Q}_{ch}&=\mathbf{Conv}^{1\times 1}_{ch}(\mathbf{A}^{*}),\quad \mathbf{V}_{ch}=\mathbf{Conv}^{1\times 1}_{ch}(\mathbf{A}^{*}),\\
\mathbf{Z}_{ch}&=\mathbf{LayerN}(\mathbf{Softmax}(\mathbf{Q}_{ch})\otimes\mathbf{V}_{ch}),\quad \mathbf{A}^{*}_{ch}=\mathbf{Z}_{ch}\otimes\mathbf{A}^{*}+\mathbf{A}^{*},
\end{aligned}
\tag{10}
\end{equation}

where $\mathbf{Conv}^{1\times 1}_{ch}$ is a $1\times 1$ convolution layer, $\mathbf{LayerN}$ is layer normalization, and $\otimes$ is element-wise multiplication. $\mathbf{A}^{*}_{sp}$ and $\mathbf{A}^{*}_{ch}$ are the spatially refined and channel-refined features, respectively. The dynamic convolution kernel $\mathbf{W}^{*}$ is then obtained by a fusion module,

\begin{equation}
\mathbf{W}^{*} = \mathbf{Conv}^{1\times 1}(\mathbf{A}^{*}_{ch} + \mathbf{A}^{*}_{sp}). \tag{11}
\end{equation}

We can then mine the identity-guided and modality-consistent embeddings at different scales by element-wise multiplication of $\mathbf{W}^{*}$ with the feature maps obtained from the three margin branches. The origin branch preserves the original information,

\begin{equation}
\mathbf{A}^{\prime}_{\mathrm{o}} = \mathbf{A}_{\mathrm{o}}, \quad \mathbf{A}^{\prime}_{mk} = \mathbf{A}_{mk} \otimes \mathbf{W}^{*}, \quad (k = 1, 3, 5). \tag{12}
\end{equation}

After being processed by a smoothing ($\mathbf{Conv}^{3\times 3}$) module, the final outputs of IEDK are

\begin{equation}
\mathbf{A}^{\prime\prime}_{\mathrm{o}},\ \mathbf{A}^{\prime\prime}_{m1},\ \mathbf{A}^{\prime\prime}_{m3},\ \mathbf{A}^{\prime\prime}_{m5} = \mathbf{Conv}^{3\times 3}(\mathbf{A}^{\prime}_{\mathrm{o}},\ \mathbf{A}^{\prime}_{m1},\ \mathbf{A}^{\prime}_{m3},\ \mathbf{A}^{\prime}_{m5}), \tag{13}
\end{equation}

which are the processed origin, margin1, margin3, and margin5 embeddings, respectively. By considering both channel and spatial perspectives, IEDK mines purified feature maps at different scales, which makes the network focus more on identity-guided and modality-consistent embeddings. A visualization of the features output by IEDK, which illustrates its effectiveness, can be found in the supplementary material.
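The modulation step in Eqs. (11) and (12) can be illustrated with a minimal pure-Python sketch. It assumes feature maps flattened to $C \times HW$ nested lists and models the $1\times 1$ convolution as a bias-free channel-mixing matrix; the helper names (`conv1x1`, `elementwise`, `iedk_modulate`) and the fusion weights are hypothetical, and the $3\times 3$ smoothing of Eq. (13) is omitted.

```python
def conv1x1(weight, feat):
    """Bias-free 1x1 convolution: mixes channels independently at each position.

    weight: C_out x C_in nested list; feat: C_in x N nested list (N = H*W).
    """
    n_pos = len(feat[0])
    return [
        [sum(w * feat[c][p] for c, w in enumerate(row)) for p in range(n_pos)]
        for row in weight
    ]


def elementwise(op, a, b):
    """Apply a binary op entry by entry to two equally shaped C x N maps."""
    return [[op(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def iedk_modulate(a_ch, a_sp, fuse_weight, branches):
    """Hypothetical helper: build W* (Eq. 11) and modulate the branches (Eq. 12)."""
    fused = elementwise(lambda x, y: x + y, a_ch, a_sp)
    w_star = conv1x1(fuse_weight, fused)  # W* = Conv1x1(A*_ch + A*_sp)
    out = {"o": branches["o"]}  # origin branch is passed through unchanged
    for k in ("m1", "m3", "m5"):
        # margin branches are re-weighted element-wise by the dynamic kernel
        out[k] = elementwise(lambda x, y: x * y, branches[k], w_star)
    return out
```

Because $\mathbf{W}^{*}$ is computed from the input features rather than stored as a fixed parameter, the same three margin branches receive a different re-weighting for every input, while the origin branch is left untouched.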

2.3 Parallel Progressive Enhancement Module

The parallel progressive enhancement module (PPEM) enhances the embeddings in a parallel rather than serial boosting mode, which improves the representation ability of the identity-guided and modality-consistent embeddings. Because PPEM enhances the four branches in the same way, we take the margin3 embeddings $\mathbf{A}^{\prime\prime}_{m3}$ as an example. We design a shared module for further purification, consisting of a $\mathbf{Conv}^{3\times 3}_{dilation=3}$ layer, a $\mathbf{LeakyReLU}$ layer, and a second $\mathbf{Conv}^{3\times 3}_{dilation=3}$ layer. Its output $\mathbf{A}_{s}$ is

\begin{equation}
\mathbf{A}_{s} = \mathbf{Conv}^{3\times 3}_{dilation=3}(\mathbf{LeakyReLU}(\mathbf{Conv}^{3\times 3}_{dilation=3}(\mathbf{A}^{\prime\prime}_{m3}))) \in \mathbb{R}^{C\times H\times W}. \tag{14}
\end{equation}

Because identity-related features lie mainly in the spatial domain and modality-related features lie mainly in the channel domain, $\mathbf{A}_{s}$ is fed to a parallel spatial-channel enhancement module. For spatial enhancement, we first obtain the query vector $\mathbf{Q}_{se} \in \mathbb{R}^{1\times C//2}$ and the value vector $\mathbf{V}_{se} \in \mathbb{R}^{C//2\times HW}$ by $\mathbf{Conv}^{1\times 1}$.

Subsequently, the spatial attention weight matrix $\mathbf{W}_{se} = \mathbf{V}_{se} \otimes \mathbf{Softmax}(\mathbf{AvgPool}(\mathbf{Q}_{se})) \in \mathbb{R}^{1\times H\times W}$ is obtained by applying Softmax along the spatial dimension, where $\mathbf{AvgPool}$ is average pooling. Finally, we get the spatially enhanced embeddings $\mathbf{A}_{se} = \mathbf{A}_{s} \otimes \mathbf{W}_{se} \in \mathbb{R}^{C\times H\times W}$.
Similarly, for channel enhancement, we first obtain the query vector $\mathbf{Q}_{ce} \in \mathbb{R}^{HW\times 1}$ and the value vector $\mathbf{V}_{ce} \in \mathbb{R}^{C//2\times HW}$ by $\mathbf{Conv}^{1\times 1}$. The channel attention weight matrix $\mathbf{W}_{ce} = \mathbf{V}_{ce} \otimes \mathbf{Softmax}(\mathbf{Q}_{ce}) \in \mathbb{R}^{C//2\times 1\times 1}$ is then obtained by applying Softmax along the channel dimension. Finally, we get the channel-enhanced embeddings $\mathbf{A}_{ce} = \mathbf{A}_{s} \otimes \mathbf{W}_{ce} \in \mathbb{R}^{C\times H\times W}$. In this way, we obtain embeddings reinforced in both the channel and spatial domains.
We then fuse them with $\mathbf{A}_{s}$ through a two-stage module for better representations. The first stage, $\mathbf{Transform}$, consists of a $\mathbf{Conv}^{1\times 1}$ layer, a $\mathbf{LeakyReLU}$ layer, and a $\mathbf{Conv}^{1\times 1}$ layer; the second stage consists of a $\mathbf{LeakyReLU}$ layer. The final output $\mathbf{A}^{*}_{m3}$ is

\begin{equation}
\mathbf{A}_{t} = \mathbf{Transform}(\mathbf{A}_{ce} + \mathbf{A}_{se}) + \mathbf{A}_{s}, \quad \mathbf{A}^{*}_{m3} = \mathbf{LeakyReLU}(\mathbf{A}_{t}) + \mathbf{A}^{\prime\prime}_{m3}. \tag{15}
\end{equation}

In this way, we obtain enhanced information without losing the original semantic features. Similarly, the other embeddings are enhanced and denoted as $\mathbf{A}^{*}_{\mathrm{o}}$, $\mathbf{A}^{*}_{m1}$, and $\mathbf{A}^{*}_{m5}$.
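The two residual connections of Eq. (15) can be traced with a minimal pure-Python sketch. It assumes $C \times HW$ nested lists, bias-free $1\times 1$ convolutions, and a hypothetical LeakyReLU slope of 0.01; the function names and the weight arguments `w1`, `w2` are illustrative and not taken from the paper.

```python
def leaky_relu(x, slope=0.01):
    """LeakyReLU applied entry by entry to a C x N map (slope is an assumed value)."""
    return [[v if v > 0 else slope * v for v in row] for row in x]


def conv1x1(weight, feat):
    """Bias-free 1x1 convolution: weight is C_out x C_in, feat is C_in x N."""
    n_pos = len(feat[0])
    return [
        [sum(w * feat[c][p] for c, w in enumerate(row)) for p in range(n_pos)]
        for row in weight
    ]


def add(a, b):
    """Element-wise sum of two equally shaped C x N maps."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def ppem_fuse(a_ce, a_se, a_s, a_in, w1, w2, slope=0.01):
    """Hypothetical helper following Eq. (15).

    a_in plays the role of the branch input A''_m3; a_s is the shared-module output.
    """
    # Transform stage: Conv1x1 -> LeakyReLU -> Conv1x1 on the summed enhancements
    hidden = leaky_relu(conv1x1(w1, add(a_ce, a_se)), slope)
    a_t = add(conv1x1(w2, hidden), a_s)  # first residual: + A_s
    return add(leaky_relu(a_t, slope), a_in)  # second residual: + A''_m3
```

The second skip connection adds the untouched branch input back after the last activation, which is what lets the enhanced output retain the original semantic features.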

2.4 Cross Embedding Balance Loss

To take full advantage of the identity-guided and modality-consistent embeddings, we propose the cross-embedding balance loss (CEBL) $\mathcal{L}_{CEBL}$ to effectively eliminate cross-modal discrepancies. $\mathcal{L}_{CEBL}$ consists of a cross triplet loss $\mathcal{L}_{ctri}$, shown on the right-hand side of Fig. 1(a), and a balance contrastive loss $\mathcal{L}_{bc}$. Inspired by [Zhang et al.(2023), Zhang and Wang(2023)], $\mathcal{L}_{ctri}$ constrains the correlation between the diverse embeddings ($\mathbf{A}^{*}_{\mathrm{o}}$, $\mathbf{A}^{*}_{m1}$, $\mathbf{A}^{*}_{m3}$, $\mathbf{A}^{*}_{m5}$) to eliminate cross-modal discrepancies at different scales.
To delve deeper into the relationship between the visible and infrared modalities from the identity-guided and modality-consistent perspective, we split $\mathbf{A}^{*}_{\mathrm{o}}, \mathbf{A}^{*}_{m1}, \mathbf{A}^{*}_{m3}, \mathbf{A}^{*}_{m5}$ into visible embeddings $\mathbf{V}^{*}_{\mathrm{o}}, \mathbf{V}^{*}_{m1}, \mathbf{V}^{*}_{m3}, \mathbf{V}^{*}_{m5}$ and infrared embeddings $\mathbf{I}^{*}_{\mathrm{o}}, \mathbf{I}^{*}_{m1}, \mathbf{I}^{*}_{m3}, \mathbf{I}^{*}_{m5}$,
respectively.

For identity $i$, let $\mathbf{C}^{i}_{Vx}$ and $\mathbf{C}^{i}_{Ix}$ be the cluster centers of $\mathbf{V}^{*}_{x}$ and $\mathbf{I}^{*}_{x}$ ($x = \mathrm{o}, \mathrm{m1}, \mathrm{m3}, \mathrm{m5}$), respectively. Our goal is to reduce the distance between $\mathbf{C}^{i}_{Vm3}$ and $\mathbf{C}^{i}_{Iy}$ ($y \in \{\mathrm{o}, \mathrm{m1}, \mathrm{m5}\}$). Aligning embeddings in this manner bridges cross-modal discrepancies more effectively and constrains them at diverse scales. Meanwhile, to reduce intra-modal discrepancies, we increase the distance between $\mathbf{C}^{i}_{Vm3}$ and $\mathbf{C}^{j}_{Vy}$ for different identities $i \neq j$. The resulting loss term $\mathcal{L}_{V3y}$ follows the triplet loss,

\begin{equation}
\mathcal{L}_{V3y} = \sum_{\substack{i,j=1 \\ i\neq j}}^{C} \left[\alpha + \boldsymbol{D}\left(\mathbf{C}^{i}_{Vm3},\ \mathbf{C}^{i}_{Iy}\right) - \boldsymbol{D}\left(\mathbf{C}^{i}_{Vm3},\ \mathbf{C}^{j}_{Vy}\right)\right]_{+}, \tag{16}
\end{equation}

where $\boldsymbol{D}$ denotes the Euclidean distance between two cluster centers and $\alpha$ is a constant margin. As shown in Tab. 3 of the ablation studies, the margin3 branch is the most discriminative when each branch is tested alone, so we want the other branches to perform similarly to it. This is why $\mathcal{L}_{ctri}$ is anchored on the margin3 branch. Similarly, we reduce the distance between $\mathbf{C}^{i}_{Im3}$ and $\mathbf{C}^{i}_{Vy}$ ($y \in \{\mathrm{o}, \mathrm{m1}, \mathrm{m5}\}$). Meanwhile, to reduce intra-modal discrepancies, we increase the distance between $\mathbf{C}^{i}_{Im3}$ and $\mathbf{C}^{j}_{Iy}$ by

\begin{equation}
\mathcal{L}_{I3y} = \sum_{\substack{i,j=1 \\ i\neq j}}^{C} \left[\alpha + \boldsymbol{D}\left(\mathbf{C}^{i}_{Im3},\ \mathbf{C}^{i}_{Vy}\right) - \boldsymbol{D}\left(\mathbf{C}^{i}_{Im3},\ \mathbf{C}^{j}_{Iy}\right)\right]_{+}. \tag{17}
\end{equation}

Therefore, the cross triplet loss $\mathcal{L}_{ctri}$ is obtained by

\begin{equation}
\mathcal{L}_{ctri} = \sum_{y} \left(\mathcal{L}_{V3y} + \mathcal{L}_{I3y}\right), \quad y = \mathrm{o}, \mathrm{m1}, \mathrm{m5}. \tag{18}
\end{equation}
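Eqs. (16) to (18) can be written out as a minimal pure-Python sketch over per-identity cluster centers. The data layout (`centers_v[x][i]` and `centers_i[x][i]` as flat lists), the function names, and the default margin `alpha=0.3` are assumptions made for illustration; the paper does not state the margin value in this section.

```python
import math


def euclidean(u, v):
    """Euclidean distance D between two cluster centers given as flat lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_sided_loss(anchor, positive, negative, alpha):
    """Sum over ordered identity pairs (i, j), i != j, of
    [alpha + D(anchor[i], positive[i]) - D(anchor[i], negative[j])]_+ .
    """
    total = 0.0
    for i in anchor:
        for j in anchor:
            if i == j:
                continue
            margin = (
                alpha
                + euclidean(anchor[i], positive[i])
                - euclidean(anchor[i], negative[j])
            )
            total += max(margin, 0.0)  # hinge [.]_+
    return total


def cross_triplet_loss(centers_v, centers_i, alpha=0.3):
    """centers_v[x][i] / centers_i[x][i]: visible / infrared center of identity i
    in branch x, with x in {"o", "m1", "m3", "m5"}. alpha=0.3 is an assumed value.
    """
    total = 0.0
    for y in ("o", "m1", "m5"):
        # Eq. (16): visible margin3 anchor, infrared positive, visible negative
        total += one_sided_loss(centers_v["m3"], centers_i[y], centers_v[y], alpha)
        # Eq. (17): infrared margin3 anchor, visible positive, infrared negative
        total += one_sided_loss(centers_i["m3"], centers_v[y], centers_i[y], alpha)
    return total
```

In both terms the positive comes from the other modality and the negative from the anchor's own modality, so the margin3 centers act as the reference that the remaining branches are pulled towards.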

Although $\mathcal{L}_{ctri}$ pulls positive embeddings together and pushes negative embeddings apart, it can leave the distances among hard-negative samples unbalanced, so that some hard-negative samples are mistaken for positive samples. To solve this, we propose the balance contrastive loss $\mathcal{L}_{bc}$, a temperature-scaled softmax function [Wu et al.(2018), Atito et al.(2021)], to maximize the distance among all negative samples. To compute $\mathcal{L}_{bc}$, we first concatenate the cross-modal embeddings,

\begin{equation}
\mathbf{C}^{i}_{mk} = \mathbf{Concat}(\mathbf{C}^{i}_{Vmk},\ \mathbf{C}^{i}_{Imk}),\ (k = 1, 3, 5), \quad \mathbf{C}^{i}_{\mathrm{o}} = \mathbf{Concat}(\mathbf{C}^{i}_{V\mathrm{o}},\ \mathbf{C}^{i}_{I\mathrm{o}}). \tag{19}
\end{equation}

The balance cross contrastive loss $\mathcal{L}_{bcxy}$ between $\mathbf{C}^{i}_{x}$ and $\mathbf{C}^{i}_{y}$ is

\begin{equation}
\mathcal{L}_{bcxy} = \frac{\mathrm{e}^{\operatorname{sim}\left(\mathbf{C}^{i}_{x},\ \mathbf{C}^{i}_{y}\right)/\tau}}{\sum_{k=1,\,k\neq i}^{2N} \mathrm{e}^{\operatorname{sim}\left(\mathbf{C}^{i}_{x},\ \mathbf{C}^{k}_{y}\right)/\tau}}, \tag{20}
\end{equation}

and the balance contrastive loss $\mathcal{L}_{bc}$ is obtained as

\begin{equation}
\mathcal{L}_{bc} = \sum_{y} \sum_{x} \mathcal{L}_{bcxy}, \quad x, y = \mathrm{o}, \mathrm{m1}, \mathrm{m3}, \mathrm{m5}, \quad x \neq y. \tag{21}
\end{equation}
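Eqs. (19) to (21) can be sketched in pure Python as follows. The sketch evaluates the ratio exactly as printed in Eq. (20); it assumes that $\operatorname{sim}$ is cosine similarity, that the sum over $k$ runs over the other identities present in the input, and that the per-identity terms are summed, none of which is stated explicitly in this section. The function names and the default `tau=0.1` are hypothetical.

```python
import math


def concat(center_v, center_i):
    """Eq. (19): concatenate the visible and infrared centers of one identity."""
    return list(center_v) + list(center_i)


def cosine_sim(u, v):
    """Cosine similarity, assumed here for sim(., .)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bc_pair(centers_x, centers_y, i, tau):
    """Eq. (20) as printed: positive term over the sum of negative terms (k != i)."""
    positive = math.exp(cosine_sim(centers_x[i], centers_y[i]) / tau)
    negatives = sum(
        math.exp(cosine_sim(centers_x[i], centers_y[k]) / tau)
        for k in centers_y
        if k != i
    )
    return positive / negatives


def balance_contrastive(centers, tau=0.1):
    """Eq. (21): sum Eq. (20) over ordered branch pairs x != y.

    centers[x][i] is the concatenated center of identity i in branch x.
    Summing over identities i is an assumption of this sketch.
    """
    total = 0.0
    for x in centers:
        for y in centers:
            if x == y:
                continue
            for i in centers[x]:
                total += bc_pair(centers[x], centers[y], i, tau)
    return total
```

Every negative identity enters the denominator with the same temperature $\tau$, so all negatives are weighted jointly rather than one hard negative at a time as in the hinge terms of the cross triplet loss.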

Therefore, our overall loss combines the proposed loss $\mathcal{L}_{CEBL} = \mathcal{L}_{ctri} + \mathcal{L}_{bc}$, the triplet loss $\mathcal{L}_{tri}$ [Hermans et al.(2017)], and the cross-entropy loss $\mathcal{L}_{ce}$ [Ye et al.(2021b)], and can be expressed as

\begin{equation}
\mathcal{L} = \mathcal{L}_{ce} + \mathcal{L}_{tri} + \lambda\,\mathcal{L}_{CEBL}. \tag{22}
\end{equation}

3 Experimental results

3.1 Experiment settings

We test our method on two datasets. SYSU-MM01 [Wu et al.(2017)] is among the most challenging datasets for VI-ReID. It includes 29,033 visible and 15,712 infrared images captured by 4 visible and 2 infrared cameras in indoor and outdoor settings. The training set has 22,258 visible and 11,909 infrared images from 395 identities. The test set contains images of 96 individuals, split into a query set (infrared) and a gallery set (visible). Testing has two modes: all-search, which uses all images, and indoor-search, which uses only indoor images. The other dataset, RegDB [Nguyen et al.(2017)], comprises 10 visible and 10 infrared images per person, totaling 2,060 images per modality in each of the training and testing sets. During testing, both visible-to-infrared and infrared-to-visible modes are used. All 2,060 visible/infrared images are used as the query and gallery sets.

All our experiments are run on an NVIDIA A100 GPU. All input images are resized to $3\times 288\times 144$ with channel augmentation [Ye et al.(2021a)]. DIAN is adopted during both the training and inference phases. In each mini-batch, we randomly select 4 visible and 4 infrared images, with a batch size of 6. The SGD optimizer is adopted for training. Details of the learning rate are given in the supplementary material. Since two different datasets are used in the experiments, the model is tailored to each dataset. Given that attention mechanisms usually benefit from large amounts of data [Vaswani et al.(2017)], for SYSU-MM01, DIAN follows exactly the design in Section 2. RegDB is a simpler, smaller dataset, so we simplify DIAN for better performance on it by removing the OFM modules and eliminating the origin branch, which reduces the number of branches from four to three.

Finally, for the overall loss $\mathcal{L}$ in Eq. (22), based on the experimental results for different $\lambda$ values in Fig. 2(a) and Fig. 2(b), we set $\lambda$ to 0.4 for both the SYSU-MM01 and RegDB datasets. Outputs from all branches are added for testing. We use mean Average Precision (mAP) and Cumulative Matching Characteristics (CMC) to evaluate our method. mAP measures the average retrieval performance across all queries, while CMC at rank $k$ measures the fraction of queries with a correct match among the top-$k$ results.
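The two metrics can be computed from ranked match lists, as in the following minimal pure-Python sketch. It assumes that each query has already been turned into a list of booleans ordered from the most to the least similar gallery image, with `True` marking a gallery image of the same identity; the dataset-specific protocols (camera filtering, repeated gallery sampling) are omitted, and the function names are hypothetical.

```python
def rank_k_hit(matches, k):
    """1.0 if any of the top-k ranked gallery items shares the query identity."""
    return 1.0 if any(matches[:k]) else 0.0


def average_precision(matches):
    """Mean of precision@rank taken at every rank where a correct match occurs."""
    hits = 0
    precisions = []
    for rank, is_match in enumerate(matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def evaluate(all_matches, ks=(1, 10, 20)):
    """Return ({k: CMC at rank k}, mAP) averaged over all queries."""
    n_queries = len(all_matches)
    cmc = {k: sum(rank_k_hit(m, k) for m in all_matches) / n_queries for k in ks}
    mean_ap = sum(average_precision(m) for m in all_matches) / n_queries
    return cmc, mean_ap
```

Rank-$k$ only checks whether the first correct match appears early enough, whereas mAP also rewards ranking all correct matches of a query near the top, which is why the two metrics can order methods differently.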

Figure 2: Results for different $\lambda$ values on SYSU-MM01 (a) and on RegDB (b).

3.2 Comparison With Existing Methods and Ablation Studies

We compare our method with existing state-of-the-art (SOTA) methods that use no extra data. The experimental results on the SYSU-MM01 and RegDB datasets are reported in Tab. 1. More quantitative and qualitative analyses are provided in the supplementary material.

Table 1: Re-identification rates (%) on the SYSU-MM01 and RegDB datasets. Bold marks the first-ranked result and underline marks the second-ranked result.
\begin{tabular}{l|cc|cc|cc|cc}
\hline
 & \multicolumn{4}{c|}{SYSU-MM01} & \multicolumn{4}{c}{RegDB} \\
Model & \multicolumn{2}{c|}{All-Search} & \multicolumn{2}{c|}{Indoor-Search} & \multicolumn{2}{c|}{Visible to Infrared} & \multicolumn{2}{c}{Infrared to Visible} \\
 & Rank-1 & mAP & Rank-1 & mAP & Rank-1 & mAP & Rank-1 & mAP \\
\hline
cmGAN [Dai et al.(2018)] & 26.97 & 27.80 & 31.63 & 42.19 & - & - & - & - \\
AlignGAN [Wang et al.(2019a)] & 42.40 & 40.70 & 45.90 & 54.30 & 57.90 & 53.60 & - & - \\
MSR [Feng et al.(2019)] & 37.50 & 38.11 & 39.64 & 50.88 & 48.43 & 48.67 & - & - \\
JSIA [Wang et al.(2020)] & 38.10 & 36.90 & 43.80 & 52.90 & 48.53 & 49.30 & 48.12 & 48.94 \\
SDL [Kansal et al.(2020)] & 28.12 & 29.01 & 32.56 & 39.56 & 26.47 & 23.58 & 25.74 & 22.89 \\
X-Modality [Li et al.(2020)] & 49.92 & 50.73 & - & - & 62.21 & 60.18 & - & - \\
DDAG [Ye et al.(2020)] & 54.75 & 53.02 & 61.02 & 67.98 & 69.34 & 63.46 & 68.06 & 61.80 \\
NFS [Chen et al.(2021b)] & 56.91 & 55.45 & 62.79 & 69.79 & 80.54 & 72.10 & 77.95 & 69.79 \\
MSO [Gao et al.(2021)] & 58.70 & 56.42 & 63.09 & 70.31 & 73.60 & 66.90 & 74.60 & 67.50 \\
GECNet [Zhong et al.(2021)] & 53.37 & 51.83 & 60.60 & 62.89 & 82.33 & 78.45 & 78.93 & 75.58 \\
CIMA [Zhao et al.(2021)] & 57.20 & 59.30 & 66.60 & 74.70 & 78.80 & 69.40 & 77.90 & 69.40 \\
TSME [Liu et al.(2022)] & 64.23 & 61.21 & 64.80 & 71.53 & \underline{87.35} & 76.94 & \textbf{86.41} & 75.70 \\
CMTR [Liang et al.(2021)] & 62.58 & 61.33 & 67.02 & 75.40 & 80.62 & 74.42 & 81.06 & 73.75 \\
TCOM [Si et al.(2023)] & 63.92 & 60.71 & 68.35 & 73.08 & 87.04 & \underline{80.40} & 83.20 & \underline{76.73} \\
SFANet [Liu et al.(2023)] & 65.74 & 60.83 & 71.60 & 80.05 & 76.31 & 68.00 & 70.15 & 63.77 \\
PMT [Lu et al.(2023)] & 67.53 & 64.98 & 71.66 & 76.52 & 84.83 & 76.55 & 84.16 & 75.13 \\
SIDA [Gong et al.(2023)] & 68.36 & 64.19 & 73.28 & 77.49 & 81.73 & 75.07 & 79.71 & 72.60 \\
MFCS [Yang et al.(2024)] & \underline{70.59} & \underline{67.49} & \underline{75.98} & \underline{80.24} & 85.34 & 76.39 & 83.88 & 75.16 \\
\hline
DIAN (Ours) & \textbf{75.20} & \textbf{71.15} & \textbf{86.28} & \textbf{87.41} & \textbf{88.06} & \textbf{82.57} & \underline{86.07} & \textbf{80.02} \\
\hline
\end{tabular}
Table 2: DIAN Ablation Study on SYSU-MM01 (%).
Settings SYSU-MM01
OFM IEDK PPEM $\mathcal{L}_{ctri}$ $\mathcal{L}_{bc}$ Rank-1 Rank-10 Rank-20 mAP
66.05 93.08 97.37 61.89
71.29 95.41 98.08 66.91
70.81 95.19 97.82 66.92
70.16 94.61 98.00 65.00
70.16 95.90 98.82 66.77
73.65 96.77 99.13 68.95
72.05 95.19 98.03 67.14
74.89 97.81 99.00 70.05
73.66 97.82 99.41 69.10
75.20 97.84 99.53 71.15
Table 3: Rank-1 and mAP (%) of the four branches on SYSU-MM01 without $\mathcal{L}_{CEBL}$.
| Branch Name | Rank-1 | mAP |
|---|---|---|
| Origin | 72.21 | 66.74 |
| Margin1 | 71.89 | 67.25 |
| Margin3 | 72.52 | 67.37 |
| Margin5 | 72.23 | 67.13 |
| Add All | 73.65 | 68.95 |

Specifically, in the indoor-search mode of SYSU-MM01, our method achieves a Rank-1 accuracy of 86.28% and an mAP of 87.41%. In Tab. 1, bold marks the first-ranked result and underline marks the second-ranked result. In the all-search mode, our method achieves a Rank-1 accuracy of 75.20% and an mAP of 71.15%. Our model performs particularly well in the indoor-search mode because identity-guided and modality-consistent features are more evident in indoor scenes than across all scenes. As shown in Tab. 1, our method handles different query modes robustly. We also evaluate our model on RegDB with two query modes. In the visible-to-infrared mode, our method achieves a Rank-1 accuracy of 88.06% and an mAP of 82.57%.

Ablation on each component. We perform ablation studies to assess the effectiveness of each component in DIAN. In Tab. 2, the third row gives the performance of the baseline AGW [Ye et al.(2021a)Ye, Ruan, Du, and Shou] with channel augmentation, trained with the loss terms $\mathcal{L}_{ce}$ and $\mathcal{L}_{tri}$. A $\checkmark$ indicates that the corresponding module is added. Each component improves the performance of the network. DIAN without $\mathcal{L}_{ctri}$ and $\mathcal{L}_{bc}$ already improves on the baseline model, which indicates the importance of exploring identity-guided and modality-consistent embeddings. Moreover, the results in the last three rows show that DIAN with $\mathcal{L}_{ctri}$ and $\mathcal{L}_{bc}$ effectively bridges the cross-modal gap using identity-guided and modality-consistent features.

Test on four branches. As shown in Tab. 3, we test the four branches separately. Among the individual branches, Margin3 achieves the best performance (72.52% Rank-1 and 67.37% mAP), which suggests that it may extract the most discriminative identity-guided embeddings. This observation motivates the design of $\mathcal{L}_{ctri}$.
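The "Add All" setting in Tab. 3 combines the branch outputs by summation. A minimal sketch of this fusion for a single image is given below. The helper names `fuse_branches` and `l2_normalize`, the toy 2-dimensional embeddings, and the L2 normalization step are assumptions made for illustration, not a description of our released implementation.

```python
import math
from typing import Dict, List


def fuse_branches(branch_features: Dict[str, List[float]]) -> List[float]:
    """Element-wise sum of the per-branch embeddings of one image."""
    vectors = list(branch_features.values())
    if not vectors:
        raise ValueError("at least one branch embedding is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all branch embeddings must share one dimension")
    return [sum(v[i] for v in vectors) for i in range(dim)]


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length (assumed step before distance computation)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)  # leave an all-zero vector unchanged
    return [x / norm for x in vector]


# Hypothetical embeddings for the four branches named in Tab. 3.
features = {
    "Origin": [1.0, 0.0],
    "Margin1": [0.0, 2.0],
    "Margin3": [1.0, 1.0],
    "Margin5": [1.0, 1.0],
}
fused = fuse_branches(features)   # element-wise sum over the four branches
unit = l2_normalize(fused)        # unit-length test-time embedding
```

Summation keeps the embedding dimension fixed regardless of the number of branches, so the same retrieval code applies to the three-branch RegDB variant.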

Quantitative analysis and visualization. We also provide quantitative analysis and visualizations of DIAN to further validate our network. Please see the supplementary material for details.

4 Conclusion

This paper addresses the problem of VI-ReID. We design a dynamic identity-guided attention network (DIAN) to mine identity-guided and modality-consistent embeddings. In DIAN, three orthogonal fusion modules (OFM) fuse features for decoupling, an identity-guided embedding decoupling kernel (IEDK) mines discriminative identity-guided and modality-consistent features at different scales, and a parallel progressive enhancement module (PPEM) progressively enhances these features in parallel. Finally, a cross-embedding balance loss (CEBL) is introduced to bridge the gap between modalities using the identity-guided and modality-consistent embeddings. Experimental results on SYSU-MM01 and RegDB demonstrate that DIAN achieves superior performance.

References

  • [Atito et al.(2021)Atito, Awais, and Kittler] Sara Atito, Muhammad Awais, and Josef Kittler. Sit: Self-supervised vision transformer. arXiv preprint arXiv:2104.03602, 2021.
  • [Chai et al.(2023)Chai, Ling, Luo, Lin, Jiang, and Li] Zehua Chai, Yongguo Ling, Zhiming Luo, Dazhen Lin, Min Jiang, and Shaozi Li. Dual-stream transformer with distribution alignment for visible-infrared person re-identification. IEEE Trans. Circuits Syst. Video Technol., 33(11):6764–6776, 2023.
  • [Chen et al.(2021a)Chen, Fan, and Panda] Chun-Fu Richard Chen, Quanfu Fan, and Rameswar Panda. Crossvit: Cross-attention multi-scale vision transformer for image classification. In Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), pages 357–366, Oct. 2021a.
  • [Chen et al.(2018)Chen, Collins, Zhu, Papandreou, Zoph, Schroff, Adam, and Shlens] Liang-Chieh Chen, Maxwell Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, and Jon Shlens. Searching for efficient multi-scale architectures for dense image prediction. Proc. Int. Conf. Neural Inf. Process. Syst. (NeurIPS), 31, 2018.
  • [Chen et al.(2021b)Chen, Wan, Li, **g, and Sun] Yehansen Chen, Lin Wan, Zhihang Li, Qianyan **g, and Zongyuan Sun. Neural feature search for rgb-infrared person re-identification. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 587–597, Jun. 2021b.
  • [Dai et al.(2017)Dai, Qi, Xiong, Li, Zhang, Hu, and Wei] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), pages 764–773, Oct. 2017.
  • [Dai et al.(2018)Dai, Ji, Wang, Wu, and Huang] **yang Dai, Rongrong Ji, Haibin Wang, Qiong Wu, and Yuyu Huang. Cross-modality person re-identification with generative adversarial training. In Int. Joint Conf. Artif. Intell., volume 1, page 6, 2018.
  • [Feng et al.(2023)Feng, Ji, Wu, Gao, Gao, Liu, Liu, **g, and Luo] Yujian Feng, Yimu Ji, Fei Wu, Guangwei Gao, Yang Gao, Tianliang Liu, Shangdong Liu, Xiao-Yuan **g, and Jiebo Luo. Occluded visible-infrared person re-identification. IEEE Trans. Multimedia, 25:1401–1413, 2023.
  • [Feng et al.(2019)Feng, Lai, and Xie] Zhanxiang Feng, Jianhuang Lai, and Xiaohua Xie. Learning modality-specific representations for visible-infrared person re-identification. IEEE Trans. Image Process., 29:579–590, 2019.
  • [Gao et al.(2019)Gao, Cheng, Zhao, Zhang, Yang, and Torr] Shang-Hua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and Philip Torr. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell., 43(2):652–662, 2019.
  • [Gao et al.(2021)Gao, Liang, **, Gu, Liu, Li, and Lang] Yajun Gao, Tengfei Liang, Yi **, Xiaoyan Gu, Wu Liu, Yidong Li, and Congyan Lang. Mso: Multi-feature space joint optimization network for rgb-infrared person re-identification. In Proc. of the 29th ACM Int. Conf. Multimedia, pages 5257–5265, 2021.
  • [Gong et al.(2023)Gong, Zhao, Lam, Gao, and Shen] Jiahao Gong, Sanyuan Zhao, Kin-Man Lam, Xin Gao, and Jianbing Shen. Spectrum-irrelevant fine-grained representation for visible–infrared person re-identification. Comput. Vis. Image Underst., 232:103703, 2023.
  • [Hao et al.(2019)Hao, Wang, Li, and Gao] Yi Hao, Nannan Wang, Jie Li, and Xinbo Gao. Hsme: Hypersphere manifold embedding for visible thermal person re-identification. In Proc. AAAI Conf. Artif. Intell., volume 33, pages 8385–8392, 2019.
  • [Hermans et al.(2017)Hermans, Beyer, and Leibe] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
  • [Jiang et al.(2022)Jiang, Zhang, Liu, Qian, Zhang, and Wu] Kongzhu Jiang, Tianzhu Zhang, Xiang Liu, Bingqiao Qian, Yongdong Zhang, and Feng Wu. Cross-modality transformer for visible-infrared person re-identification. In Proc. Eur. Conf. Comput. Vis. (ECCV), pages 480–496. Springer, Oct. 2022.
  • [Kansal et al.(2020)Kansal, Subramanyam, Wang, and Satoh] Kajal Kansal, A Venkata Subramanyam, Zheng Wang, and Shin’ichi Satoh. Sdl: Spectrum-disentangled representation learning for visible-infrared person re-identification. IEEE Trans. Circuits Syst. Video Technol., 30(10):3422–3432, 2020.
  • [Li et al.(2020)Li, Wei, Hong, and Gong] Diangang Li, Xing Wei, Xiaopeng Hong, and Yihong Gong. Infrared-visible cross-modal person re-identification with an x modality. In Proc. AAAI Conf. Artif. Intell., volume 34, pages 4610–4617, 2020.
  • [Liang et al.(2021)Liang, **, Gao, Liu, Feng, Wang, and Li] Tengfei Liang, Yi **, Yajun Gao, Wu Liu, Songhe Feng, Tao Wang, and Yidong Li. Cmtr: Cross-modality transformer for visible-infrared person re-identification. arXiv preprint arXiv:2110.08994, 2021.
  • [Liu et al.(2023)Liu, Ma, Xia, and Li] Haojie Liu, Shun Ma, Daoxun Xia, and Shaozi Li. Sfanet: A spectrum-aware feature augmentation network for visible-infrared person reidentification. IEEE Trans. Neural Netw. Learn. Syst., 34(4):1958–1971, 2023.
  • [Liu et al.(2022)Liu, Wang, Huang, Zhang, and Han] Jianan Liu, Jialiang Wang, Nianchang Huang, Qiang Zhang, and Jungong Han. Revisiting modality-specific feature compensation for visible-infrared person re-identification. IEEE Trans. Circuits Syst. Video Technol., 32(10):7226–7240, 2022.
  • [Lu et al.(2023)Lu, Zou, and Zhang] Hu Lu, Xuezhang Zou, and **** Zhang. Learning progressive modality-shared transformers for effective visible-infrared person re-identification. In Proc. AAAI Conf. Artif. Intell., volume 37, pages 1835–1843, 2023.
  • [Nguyen et al.(2017)Nguyen, Hong, Kim, and Park] Dat Tien Nguyen, Hyung Gil Hong, Ki Wan Kim, and Kang Ryoung Park. Person recognition system based on a combination of body images from visible light and thermal cameras. Sens., 17(3):605, 2017.
  • [Shen et al.(2023)Shen, Zhao, and Zhang] Hao Shen, Zhong-Qiu Zhao, and Wandi Zhang. Adaptive dynamic filtering network for image denoising. In Proc. AAAI Conf. Artif. Intell., volume 37, pages 2227–2235, 2023.
  • [Si et al.(2023)Si, He, Li, and Gao] Tongzhen Si, Fazhi He, Penglei Li, and Xiaoxin Gao. Tri-modality consistency optimization with heterogeneous augmented images for visible-infrared person re-identification. Neural Comput., 523:170–181, 2023.
  • [Sun et al.(2018)Sun, Zheng, Yang, Tian, and Wang] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Sheng** Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proc. Eur. Conf. Comput. Vis. (ECCV), pages 480–496. Springer, Sept. 2018.
  • [Vaswani et al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Proc. Int. Conf. Neural Inf. Process. Syst. (NeurIPS), 30, 2017.
  • [Wang et al.(2020)Wang, Zhang, Yang, Cheng, Chang, Liang, and Hou] Guan-An Wang, Tianzhu Zhang, Yang Yang, Jian Cheng, Jianlong Chang, Xu Liang, and Zeng-Guang Hou. Cross-modality paired-images generation for rgb-infrared person re-identification. In Proc. AAAI Conf. Artif. Intell., volume 34, pages 12144–12151, 2020.
  • [Wang et al.(2019a)Wang, Zhang, Cheng, Liu, Yang, and Hou] Guan’an Wang, Tianzhu Zhang, Jian Cheng, Si Liu, Yang Yang, and Zengguang Hou. Rgb-infrared cross-modality person re-identification via joint pixel and feature alignment. In Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), pages 3623–3632, Oct. 2019a.
  • [Wang et al.(2019b)Wang, Wang, Zheng, Chuang, and Satoh] Zhixiang Wang, Zheng Wang, Yinqiang Zheng, Yung-Yu Chuang, and Shin’ichi Satoh. Learning to reduce dual-level discrepancy for infrared-visible person re-identification. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 618–626, Jun. 2019b.
  • [Wei et al.(2018)Wei, Zhang, Gao, and Tian] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer gan to bridge domain gap for person re-identification. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 79–88, Jun. 2018.
  • [Wei et al.(2021)Wei, Yang, Wang, and Gao] Ziyu Wei, Xi Yang, Nannan Wang, and Xinbo Gao. Flexible body partition-based adversarial learning for visible infrared person re-identification. IEEE Trans. Neural Netw. Learn. Syst., 33(9):4676–4687, 2021.
  • [Wu et al.(2017)Wu, Zheng, Yu, Gong, and Lai] Ancong Wu, Wei-Shi Zheng, Hong-Xing Yu, Shaogang Gong, and Jianhuang Lai. Rgb-infrared cross-modality person re-identification. In Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), pages 5380–5389, Oct. 2017.
  • [Wu et al.(2018)Wu, Xiong, Yu, and Lin] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 3733–3742, Jun. 2018.
  • [Yang et al.(2021)Yang, He, Fan, Shi, Xue, Li, Ding, and Huang] Min Yang, Dongliang He, Miao Fan, Baorong Shi, Xuetong Xue, Fu Li, Errui Ding, and Jizhou Huang. Dolg: Single-stage image retrieval with deep orthogonal fusion of local and global features. In Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), pages 11772–11781, Oct. 2021.
  • [Yang et al.(2024)Yang, Dong, Li, Wei, Wang, and Gao] Xi Yang, Wenjiao Dong, Meijie Li, Ziyu Wei, Nannan Wang, and Xinbo Gao. Cooperative separation of modality shared-specific features for visible-infrared person re-identification. IEEE Trans. Multimedia, pages 1–11, 2024. 10.1109/TMM.2024.3377139.
  • [Ye et al.(2020)Ye, Shen, J. Crandall, Shao, and Luo] Mang Ye, Jianbing Shen, David J. Crandall, Ling Shao, and Jiebo Luo. Dynamic dual-attentive aggregation learning for visible-infrared person re-identification. In Proc. Eur. Conf. Comput. Vis. (ECCV), pages 229–247. Springer, Aug. 2020.
  • [Ye et al.(2021a)Ye, Ruan, Du, and Shou] Mang Ye, Weijian Ruan, Bo Du, and Mike Zheng Shou. Channel augmented joint learning for visible-infrared recognition. In Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), pages 13567–13576, Oct. 2021a.
  • [Ye et al.(2021b)Ye, Shen, Lin, Xiang, Shao, and Hoi] Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven CH Hoi. Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell., 44(6):2872–2893, Aug. 2021b.
  • [Zhang et al.(2022)Zhang, Kang, Zhao, and Shen] Yiyuan Zhang, Yuhao Kang, Sanyuan Zhao, and Jianbing Shen. Dual-semantic consistency learning for visible-infrared person re-identification. IEEE Trans. Inf. Foren. Sec., 18:1554–1565, 2022.
  • [Zhang and Wang(2023)] Yukang Zhang and Hanzi Wang. Diverse embedding expansion network and low-light cross-modality benchmark for visible-infrared person re-identification. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 2153–2162, Jun. 2023.
  • [Zhang et al.(2023)Zhang, Yan, Li, and Wang] Yukang Zhang, Yan Yan, Jie Li, and Hanzi Wang. Mrcn: A novel modality restitution and compensation network for visible-infrared person re-identification. arXiv preprint arXiv:2303.14626, 2023.
  • [Zhao et al.(2021)Zhao, Liu, Chu, Lu, and Yu] Zhiwei Zhao, Bin Liu, Qi Chu, Yan Lu, and Nenghai Yu. Joint color-irrelevant consistency learning and identity-aware modality adaptation for visible-infrared cross modality person re-identification. In Proc. AAAI Conf. Artif. Intell., volume 35, pages 3520–3528, 2021.
  • [Zheng et al.(2017)Zheng, Zhang, Sun, Chandraker, Yang, and Tian] Liang Zheng, Hengheng Zhang, Shaoyan Sun, Manmohan Chandraker, Yi Yang, and Qi Tian. Person re-identification in the wild. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 1367–1376, Jul. 2017.
  • [Zhong et al.(2021)Zhong, Lu, Huang, Ye, Jia, and Lin] Xian Zhong, Tianyou Lu, Wenxin Huang, Mang Ye, Xuemei Jia, and Chia-Wen Lin. Grayscale enhancement colorization network for visible-infrared person re-identification. IEEE Trans. Circuits Syst. Video Technol., 32(3):1418–1430, 2021.