\addauthor

Peng [email protected],2 \addauthorYujian [email protected] \addauthorHui [email protected] \addauthorZailong [email protected] \addauthorXubo [email protected] \addauthorYiyang [email protected],2 \addauthorGuquan [email protected],2 \addinstitution Beijing Normal University-Hong Kong Baptist University United International College.
Zhuhai, China \addinstitution Hong Kong Baptist University.
Hong Kong, China \addinstitution University of Wollongong.
Wollongong, Australia \addinstitution University of Surrey.
Guildford, United Kingdom

Dynamic Identity-Guided Attention Network for Visible-Infrared Person Re-identification

Abstract

Visible-infrared person re-identification (VI-ReID) aims to match people with the same identity between visible and infrared modalities. VI-ReID is challenging because an individual's appearance differs greatly across modalities. Existing methods generally try to bridge the cross-modal differences at the image or feature level, but they do not sufficiently explore discriminative embeddings. Effectively minimizing these cross-modal discrepancies relies on obtaining representations that are guided by identity and consistent across modalities, while also filtering out representations that are irrelevant to identity. To address these challenges, we introduce a dynamic identity-guided attention network (DIAN) to mine identity-guided and modality-consistent embeddings, which helps bridge the gap between different modalities. Specifically, in DIAN, to pursue a semantically richer representation, we first use orthogonal projection to fuse the features from two connected coarse and fine layers. We then use dynamic convolution kernels to mine identity-guided and modality-consistent representations. In addition, a cross embedding balance loss is introduced to bridge cross-modal discrepancies using the above embeddings. Experimental results on the SYSU-MM01 and RegDB datasets show that DIAN achieves state-of-the-art performance. Specifically, for indoor search on SYSU-MM01, our method achieves 86.28% rank-1 accuracy and 87.41% mAP. Our code will be available soon.

1 Introduction

Person re-identification (ReID), an important field of computer vision, focuses on recognizing persons across cameras [Ye et al.(2021b)Ye, Shen, Lin, Xiang, Shao, and Hoi]. VI-ReID is a subfield of ReID that specializes in matching persons based on images captured by visible and infrared cameras. It is challenging because of the large cross-modal discrepancies. Current methods mainly handle VI-ReID at the image level or the feature level.

For image-level methods, researchers [Zheng et al.(2017)Zheng, Zhang, Sun, Chandraker, Yang, and Tian, Wang et al.(2019b)Wang, Wang, Zheng, Chuang, and Satoh, Hao et al.(2019)Hao, Wang, Li, and Gao] aim to reduce cross-modal differences by finding modality-invariant embeddings. Methods in [Li et al.(2020)Li, Wei, Hong, and Gong, Wang et al.(2020)Wang, Zhang, Yang, Cheng, Chang, Liang, and Hou, Wei et al.(2018)Wei, Zhang, Gao, and Tian, Zhong et al.(2021)Zhong, Lu, Huang, Ye, Jia, and Lin, Ye et al.(2020)Ye, Shen, J. Crandall, Shao, and Luo, Gao et al.(2021)Gao, Liang, Jin, Gu, Liu, Li, and Lang, Liu et al.(2022)Liu, Wang, Huang, Zhang, and Han] generate intermediate images between visible and infrared data, which allows better alignment and integration at the image level through intermediate modalities. Works in [Sun et al.(2018)Sun, Zheng, Yang, Tian, and Wang, Wei et al.(2021)Wei, Yang, Wang, and Gao] emphasize detailed features in heterogeneous human images by dividing the image into several parts and calculating their relations. Overall, image-level approaches are simple and capture holistic information, which helps in understanding the overall scene and context. However, they lack the capability to extract fine-grained features. In contrast, feature-level methods can mine more fine-grained representations. The methods in [Jiang et al.(2022)Jiang, Zhang, Liu, Qian, Zhang, and Wu, Chai et al.(2023)Chai, Ling, Luo, Lin, Jiang, and Li, Feng et al.(2023)Feng, Ji, Wu, Gao, Gao, Liu, Liu, Jing, and Luo] enhance joint embedding patterns by focusing on modality-related embeddings at the feature level. 
Similarly, researchers [Zhang et al.(2022)Zhang, Kang, Zhao, and Shen, Chen et al.(2018)Chen, Collins, Zhu, Papandreou, Zoph, Schroff, Adam, and Shlens, Gao et al.(2019)Gao, Cheng, Zhao, Zhang, Yang, and Torr, Chen et al.(2021a)Chen, Fan, and Panda, Dai et al.(2018)Dai, Ji, Wang, Wu, and Huang, Wang et al.(2019a)Wang, Zhang, Cheng, Liu, Yang, and Hou, Zhao et al.(2021)Zhao, Liu, Chu, Lu, and Yu, Liang et al.(2021)Liang, Jin, Gao, Liu, Feng, Wang, and Li, Lu et al.(2023)Lu, Zou, and Zhang] seek channel-level, spatial-level or multi-scale implicit connections between different modalities through various attention mechanisms. These works indicate that fine-grained features are also crucial. However, whether they operate at the image level or the feature level, existing methods frequently neglect the significance of identity-guided and modality-consistent embeddings, and therefore fail to bridge the gap between modalities effectively. To address this challenge, we introduce a novel network named Dynamic Identity-Guided Attention Network (DIAN). DIAN prioritizes embeddings that are identity-guided and modality-consistent. The overall architecture is shown in Fig. 1(a). Inspired by [Yang et al.(2021)Yang, He, Fan, Shi, Xue, Li, Ding, and Huang], we first introduce an orthogonal fusion module (OFM) to reduce feature redundancy between connected layers and fuse them effectively. OFM generates rich semantic features through orthogonal projection. Secondly, in view of the superior ability of dynamic convolution kernels in processing high-response information [Shen et al.(2023)Shen, Zhao, and Zhang], we propose an identity-guided embedding decoupling kernel (IEDK), which decouples feature maps at different scales and mines the identity-guided and modality-consistent embeddings. Thirdly, we introduce a parallel progressive enhancement module (PPEM). 
This module enhances embeddings through parallel spatial and channel attention blocks. Moving from a serial to a parallel design makes fuller use of the training data and avoids the data scarcity issue caused by the original serial design of the attention [Vaswani et al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin]. Finally, a cross embedding balance loss (CEBL) is designed to minimize the cross-modal discrepancies.

To the best of our knowledge, no other method uses a similar approach to solve the VI-ReID task. In summary, our main contributions are as follows.

1. A novel network, the dynamic identity-guided attention network (DIAN), is proposed for the VI-ReID task. In DIAN, the orthogonal fusion module (OFM) obtains semantically rich representations through a novel orthogonal feature fusion method. The identity-guided embedding decoupling kernel (IEDK) obtains identity-guided and modality-consistent embeddings at various scales. The parallel progressive enhancement module (PPEM) further enhances these embeddings.

2. The cross embedding balance loss (CEBL) is introduced to reduce cross-modal discrepancies and enhance cross-modal consistency by constraining the distribution of decoupled and enhanced embeddings.

3. Experimental results show that DIAN outperforms existing SOTA methods on the VI-ReID task. Specifically, for indoor search on the SYSU-MM01 dataset, it achieves a Rank-1 accuracy of 86.28% and an mAP of 87.41%.

2 Method

The overall DIAN architecture is shown in Fig. 1(a), with ResNet50 as the backbone. The OFMs fuse the features of two connected layers, producing a rich semantic representation for decoupling in the next step. IEDK decouples the features and filters out identity-unrelated embeddings, thereby capturing identity-guided and modality-consistent embeddings at diverse scales. PPEM then enhances these embeddings to obtain more discriminative representations. The outputs are used in $\mathcal{L}_{CEBL}$ to bridge cross-modal discrepancies. Initially, visible and infrared images are fed to the ResNet block in stage 0 to obtain the visible and infrared feature maps $\mathbf{V}_{0}$ and $\mathbf{I}_{0}$. We concatenate them as $\mathbf{A}_{0}=\mathbf{Concat}(\mathbf{V}_{0},\mathbf{I}_{0})$ for joint training.

Figure 1: (a) The network architecture of DIAN and its components. (b) Three orthogonal fusion modules (OFMs). (c) Identity-guided embedding decoupling kernel (IEDK). (d) Parallel progressive enhancement module (PPEM). (e) The legend of pictures.

2.1 Orthogonal Fusion Module

Other methods typically consider only feature fusion, which can leave redundant information between adjacent layers and hinder subsequent modules from exploring features effectively. We introduce a two-stage orthogonal fusion module (OFM) to remove feature redundancy and obtain refined feature maps for the next step. As shown in Fig. 1(b), let $\mathbf{A}_{r(i-1)}$ ($i=1,2,3$, with $\mathbf{A}_{r0}=\mathbf{A}_{0}$) from the preceding layer be the coarse feature $\mathbf{A}_{c}\in\mathbb{R}^{{C}_{c}\times{H}_{c}\times{W}_{c}}$, and let $\mathbf{A}_{i}$, the output of stage $i$, be the fine feature $\mathbf{A}_{f}\in\mathbb{R}^{{C}_{f}\times{H}_{f}\times{W}_{f}}$. To fuse them while eliminating redundancy, we first resize the coarse feature to $\mathbf{A}_{c}\in\mathbb{R}^{{C}_{c}\times{H}_{f}\times{W}_{f}}$ with an up-sampling block, so that $\mathbf{A}_{c}$ has the same height and width as $\mathbf{A}_{f}$. To integrate information comprehensively, we explore the multi-scale information within the coarse feature: $\mathbf{A}_{c}$ passes through a multi-scale block whose outputs are concatenated,

$$\mathbf{A}^{\prime}_{c}=\mathbf{Concat}\left(\mathbf{Conv}^{3\times 3}_{dilation=k}(\mathbf{A}_{c})\right),\ k=3,5,7,9,\quad\mathbf{A}^{\prime}_{c}\in\mathbb{R}^{4{C}_{c}\times{H}_{f}\times{W}_{f}}. \tag{1}$$
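
As a quick shape check on Eq. (1), the sketch below applies the standard output-size formula for a dilated convolution. The padding is not stated in this section, so `padding = dilation` is an assumption (it is what keeps $H_f\times W_f$ unchanged for a $3\times 3$ kernel). The helper name `conv_out_size`, the example feature-map size, and the assumption that each branch outputs $C_c$ channels are illustrative.

```python
def conv_out_size(size, kernel, dilation, padding, stride=1):
    """Standard output-size formula for a dilated convolution along one axis."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

# Assumed example: an 18 x 9 feature map with C_c = 256 coarse channels.
height, width, coarse_channels = 18, 9, 256
dilations = (3, 5, 7, 9)  # the four branches of Eq. (1)

# padding = dilation is an assumption; it preserves H and W for a 3x3 kernel.
branch_shapes = [
    (coarse_channels,
     conv_out_size(height, 3, k, padding=k),
     conv_out_size(width, 3, k, padding=k))
    for k in dilations
]

# Concatenating the four branches along the channel axis gives 4 * C_c channels.
concat_shape = (sum(shape[0] for shape in branch_shapes), height, width)
```

Under these assumptions every branch keeps the spatial size of the fine feature, which is what makes the channel-wise concatenation to $4C_c$ channels well defined.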

An attention module built on a softplus layer then highlights the weights of the different scales,

$$\mathbf{A}^{\prime\prime}_{c}=\mathbf{Attention}_{softplus}(\mathbf{A}^{\prime}_{c}),\quad\mathbf{A}^{\prime\prime}_{c}\in\mathbb{R}^{4{C}_{c}\times{H}_{f}\times{W}_{f}}, \tag{2}$$

where $\mathbf{A}^{\prime\prime}_{c}$ is the coarse feature map obtained after weighting each scale. Meanwhile, for the fine feature $\mathbf{A}_{f}\in\mathbb{R}^{{C}_{f}\times{H}_{f}\times{W}_{f}}$, we use a feature expansion module. It contains an average pooling layer that resizes the feature to $\mathbf{A}^{\prime}_{f}\in\mathbb{R}^{{C}_{f}\times 1\times 1}$ and a linear layer that converts it to $\mathbf{A}^{\prime\prime}_{f}\in\mathbb{R}^{4{C}_{c}\times 1\times 1}$, so that it has the same number of channels as $\mathbf{A}^{\prime\prime}_{c}$. To obtain refined features, we use orthogonal projection to reduce the redundancy between $\mathbf{A}^{\prime\prime}_{c}$ and $\mathbf{A}^{\prime\prime}_{f}$, by

$$\mathbf{A}_{proj}=\frac{\mathbf{A}^{\prime\prime}_{c}\cdot\mathbf{A}^{\prime\prime}_{f}}{\left|\mathbf{A}^{\prime\prime}_{f}\right|^{2}}\mathbf{A}^{\prime\prime}_{f}, \tag{3}$$

$$\mathbf{A}^{\prime\prime}_{c}\cdot\mathbf{A}^{\prime\prime}_{f}=\sum_{i=1}^{4C_{c}}\mathbf{A}^{\prime\prime}_{c,i}\mathbf{A}^{\prime\prime}_{f,i},\quad\left|\mathbf{A}^{\prime\prime}_{f}\right|^{2}=\sum_{i=1}^{4C_{c}}\left(\mathbf{A}^{\prime\prime}_{f,i}\right)^{2}, \tag{4}$$

where $\mathbf{A}^{\prime\prime}_{c}\cdot\mathbf{A}^{\prime\prime}_{f}$ is the dot product and $\left|\mathbf{A}^{\prime\prime}_{f}\right|^{2}$ is the squared $\mathbf{L}_{2}$ norm of $\mathbf{A}^{\prime\prime}_{f}$. $\mathbf{A}_{proj}$ is the projection of the coarse features onto the fine features, and it represents the redundant information contained in both. As shown at the bottom of Fig. 1(b), the non-redundant information $\mathbf{A}_{d}$ is obtained as the difference between $\mathbf{A}_{c}$ and $\mathbf{A}_{proj}$,

$$\mathbf{A}_{d}=\mathbf{A}_{c}-\mathbf{A}_{proj}. \tag{5}$$
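
The projection and difference in Eqs. (3)-(5) can be illustrated on a single channel vector. This is a minimal sketch, not the authors' implementation: it treats one spatial location as a plain list of channel values, subtracts the projection from the same vector that was projected, and uses made-up numbers. The function names `dot` and `orthogonal_residual` are hypothetical.

```python
def dot(u, v):
    """Dot product over the channel axis, as in Eq. (4)."""
    return sum(a * b for a, b in zip(u, v))

def orthogonal_residual(coarse, fine):
    """Split `coarse` into its projection onto `fine` and the remainder."""
    scale = dot(coarse, fine) / dot(fine, fine)          # Eq. (3) coefficient
    proj = [scale * f for f in fine]                     # shared (redundant) part
    resid = [c - p for c, p in zip(coarse, proj)]        # Eq. (5)-style difference
    return proj, resid

# Toy channel vectors for one spatial location (assumed values).
coarse = [3.0, 4.0, 0.0]
fine = [1.0, 0.0, 0.0]
proj, resid = orthogonal_residual(coarse, fine)
```

In this toy case the remainder has zero dot product with the fine vector, which is the sense in which the retained information is non-redundant with respect to the fine features.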

After that, the output $\mathbf{A}_{ri}$ of the $i$ -th OFM can be obtained by

$$\mathbf{A}_{fuse}=\mathbf{Fc}(\mathbf{Gap}(\mathbf{Concat}(\mathbf{A}^{\prime\prime}_{f},\mathbf{A}_{d}))),\quad\mathbf{A}_{ri}=(\mathbf{A}_{fuse}\otimes\mathbf{A}_{f})+\mathbf{A}_{f}, \tag{6}$$

where $\mathbf{Gap}$ is global average pooling, $\mathbf{Fc}$ is a linear layer and $\otimes$ denotes element-wise multiplication. For example, $\mathbf{A}_{r1}$ is the final output of the first OFM and contains rich semantic representations of two layers. For the first OFM, $\mathbf{A}_{r0}=\mathbf{A}_{0}$. $\mathbf{A}_{ri}$ is then treated as the coarse input of the $(i+1)$-th OFM. It also passes through the ResNet block in stage $i+1$ to produce the fine input feature $\mathbf{A}_{i+1}$ of the $(i+1)$-th OFM. $\mathbf{A}_{r3}$ is the final output of the last (third) OFM.

2.2 Identity-Guided Embedding Decoupling Kernel

The identity-guided embedding decoupling kernel (IEDK) is introduced to preserve identity-guided and modality-consistent embeddings. As shown in Fig. 1(c), IEDK takes the output of the last (third) OFM, $\mathbf{A}_{r3}$, as input and locates cross-modal identity-guided and modality-consistent embeddings at different scales. In detail, the input $\mathbf{A}_{r3}$ is processed in four branches for decoupling, each with a specific purpose. One branch is the original branch $\mathbf{A}_{\rm{o}}$, which retains the original information without any changes. The other three branches $\mathbf{A}_{mk}$ ($k=1,3,5$) use the Unfold function with different dilation scales to extract features at diverse scales. Unlike the convolution operation, the Unfold function has no parameters, so the inherent meaning and structure of the input are preserved during processing. The four branches can thus be expressed as

$$\mathbf{A}_{\rm{o}}=\mathbf{A}_{r3},\quad\mathbf{A}_{mk}=\mathbf{Unfold}_{dilation=k}(\mathbf{A}_{r3}),\ k=1,3,5. \tag{7}$$
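
To make the parameter-free nature of Unfold concrete, the following sketch gathers the dilated neighbourhood of every pixel of a single-channel map in plain Python. The $3\times 3$ window and zero padding are assumptions (the section states only the dilation values), and `unfold_dilated` is a hypothetical helper. In PyTorch the corresponding operation is `torch.nn.Unfold`, which takes a `dilation` argument.

```python
def unfold_dilated(fmap, dilation, kernel=3):
    """Collect the kernel x kernel dilated neighbourhood of every pixel.

    No weights are involved: values are only copied, with zeros outside the map.
    """
    h, w = len(fmap), len(fmap[0])
    half = kernel // 2
    patches = []
    for r in range(h):
        for c in range(w):
            patch = []
            for dr in range(-half, half + 1):
                for dc in range(-half, half + 1):
                    rr, cc = r + dr * dilation, c + dc * dilation
                    inside = 0 <= rr < h and 0 <= cc < w
                    patch.append(fmap[rr][cc] if inside else 0.0)
            patches.append(patch)
    return patches

# Toy 4 x 4 map whose entries are 0.0 .. 15.0 in row-major order.
fmap = [[float(4 * r + c) for c in range(4)] for r in range(4)]
p1 = unfold_dilated(fmap, dilation=1)  # analogous to the margin1 branch
p3 = unfold_dilated(fmap, dilation=3)  # analogous to the margin3 branch
```

Both calls return one 9-element patch per pixel. A larger dilation reaches values farther from the centre without adding any learnable parameter.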

Next, because the dynamic convolution kernel performs well at protecting high-response areas of the image [Shen et al.(2023)Shen, Zhao, and Zhang], we introduce a novel dynamic convolution kernel with deformable convolution [Dai et al.(2017)Dai, Qi, Xiong, Li, Zhang, Hu, and Wei], aiming to preserve identity-guided and modality-consistent embeddings. As shown in Fig. 1(c), three branches conduct element-wise operations with the dynamic convolution kernel to extract features at different scales. Our dynamic convolution kernel aggregates attention from both spatial and channel perspectives, which leads to a more robust representation and more discriminative identity-guided embeddings by capturing non-linear interactions within the data. To get the dynamic convolution kernel, the input $\mathbf{A}_{r3}$ is first processed by a deformable convolution layer, which can preserve high-response information, yielding

$$\mathbf{A}^{*}=\mathbf{Deformable\_Conv}_{deformable\_groups=8}(\mathbf{A}_{r3}). \tag{8}$$

Because the identity features are retained in the spatial domain and the modality features are retained in the channel domain, $\mathbf{A}^{*}$ is processed by spatial and channel refinement to obtain discriminative identity-guided and modality-consistent embeddings. For spatial refinement,

$$\begin{aligned}\mathbf{Q}_{sp}&=\mathbf{Conv}^{1\times 1}_{sp}(\mathbf{A}^{*}),\quad\mathbf{V}_{sp}=\mathbf{Gap}_{sp}(\mathbf{Conv}^{1\times 1}_{sp}(\mathbf{A}^{*})),\\ \mathbf{Z}_{sp}&=\mathbf{Sigmoid}(\mathbf{V}_{sp}\otimes\mathbf{Softmax}(\mathbf{Q}_{sp})),\quad\mathbf{A}^{*}_{sp}=\mathbf{Z}_{sp}\otimes\mathbf{A}^{*}+\mathbf{A}^{*},\end{aligned} \tag{9}$$

where $\mathbf{Conv}^{1\times 1}_{sp}$ is a $1\times 1$ convolution layer, $\mathbf{Gap}_{sp}$ is global average pooling, $\mathbf{Sigmoid}$ is the sigmoid activation, and $\otimes$ is element-wise multiplication. For channel refinement,

$$\begin{aligned}\mathbf{Q}_{ch}&=\mathbf{Conv}^{1\times 1}_{ch}(\mathbf{A}^{*}),\quad\mathbf{V}_{ch}=\mathbf{Conv}^{1\times 1}_{ch}(\mathbf{A}^{*}),\\ \mathbf{Z}_{ch}&=\mathbf{LayerN}(\mathbf{Softmax}(\mathbf{Q}_{ch})\otimes\mathbf{V}_{ch}),\quad\mathbf{A}^{*}_{ch}=\mathbf{Z}_{ch}\otimes\mathbf{A}^{*}+\mathbf{A}^{*},\end{aligned} \tag{10}$$

where $\mathbf{Conv}^{1\times 1}_{ch}$ is a $1\times 1$ convolution layer, $\mathbf{LayerN}$ is layer normalization, and $\otimes$ is element-wise multiplication. $\mathbf{A}^{*}_{sp}$ and $\mathbf{A}^{*}_{ch}$ are the spatially refined and channel-refined features, respectively. The dynamic convolution kernel $\mathbf{W}^{*}$ is then obtained by a fusion module,

$$\mathbf{W}^{*}=\mathbf{Conv}^{1\times 1}(\mathbf{A}^{*}_{ch}+\mathbf{A}^{*}_{sp}). \tag{11}$$

We can then mine the identity-guided and modality-consistent embeddings at different scales by element-wise multiplication of $\mathbf{W}^{*}$ with the feature maps obtained from the three branches. The original branch will preserve the original information,

$$\mathbf{A}^{\prime}_{\rm{o}}=\mathbf{A}_{\rm{o}},\quad\mathbf{A}^{\prime}_{mk}=\mathbf{A}_{mk}\otimes\mathbf{W}^{*},\ k=1,3,5. \tag{12}$$

After being processed by a smoothing ($\mathbf{Conv}^{3\times 3}$) module, the final outputs of IEDK are

$$\mathbf{A}^{\prime\prime}_{\rm{o}},\ \mathbf{A}^{\prime\prime}_{m1},\ \mathbf{A}^{\prime\prime}_{m3},\ \mathbf{A}^{\prime\prime}_{m5}=\mathbf{Conv}^{3\times 3}(\mathbf{A}^{\prime}_{\rm{o}},\ \mathbf{A}^{\prime}_{m1},\ \mathbf{A}^{\prime}_{m3},\ \mathbf{A}^{\prime}_{m5}), \tag{13}$$

which are the processed origin, margin1, margin3 and margin5 embeddings, respectively. By considering both channel and spatial perspectives, IEDK mines more purified feature maps at different scales, so that the network focuses more on identity-guided and modality-consistent embeddings. A visualization of the features output by IEDK is provided in the supplementary material.

2.3 Parallel Progressive Enhancement Module

The parallel progressive enhancement module (PPEM) enhances embeddings in a parallel rather than serial boosting mode, which improves the representation ability of identity-guided and modality-consistent embeddings. Because PPEM enhances the four branches in the same way, we take the margin3 embeddings $\mathbf{A}^{\prime\prime}_{m3}$ as an example. We design a shared module for further purification, consisting of a $\mathbf{Conv}^{3\times 3}_{dilation=3}$ layer, a $\mathbf{LeakyReLU}$ layer, and another $\mathbf{Conv}^{3\times 3}_{dilation=3}$ layer. Its output $\mathbf{A}_{s}$ is

$$\mathbf{A}_{s}=\mathbf{Conv}^{3\times 3}_{dilation=3}(\mathbf{LeakyReLU}(\mathbf{Conv}^{3\times 3}_{dilation=3}(\mathbf{A}^{\prime\prime}_{m3})))\in\mathbb{R}^{C\times H\times W}. \tag{14}$$

Because identity-related features lie mainly in the spatial domain and modality-related features lie mainly in the channel domain, $\mathbf{A}_{s}$ is fed to a parallel spatial-channel enhancement module. For spatial enhancement, we first obtain the query vector $\mathbf{Q}_{se}\in\mathbb{R}^{1\times C//2}$ and the value vector $\mathbf{V}_{se}\in\mathbb{R}^{C//2\times HW}$ by $\mathbf{Conv}^{1\times 1}$.

Subsequently, the spatial attention weight matrix $\mathbf{W}_{se}=\mathbf{V}_{se}\otimes\mathbf{Softmax}(\mathbf{AvgPool}(\mathbf{Q}_{se}))\in\mathbb{R}^{1\times H\times W}$ is obtained by applying Softmax along the spatial dimension, where $\mathbf{AvgPool}$ is average pooling. We then get the spatially enhanced embeddings $\mathbf{A}_{se}=\mathbf{A}_{s}\otimes\mathbf{W}_{se}\in\mathbb{R}^{C\times H\times W}$. Similarly, for channel enhancement, we first obtain the query vector $\mathbf{Q}_{ce}\in\mathbb{R}^{HW\times 1}$ and the value vector $\mathbf{V}_{ce}\in\mathbb{R}^{C//2\times HW}$ by $\mathbf{Conv}^{1\times 1}$. The channel attention weight matrix $\mathbf{W}_{ce}=\mathbf{V}_{ce}\otimes\mathbf{Softmax}(\mathbf{Q}_{ce})\in\mathbb{R}^{C//2\times 1\times 1}$ is then obtained by applying Softmax along the channel dimension, which gives the channel-enhanced embeddings $\mathbf{A}_{ce}=\mathbf{A}_{s}\otimes\mathbf{W}_{ce}\in\mathbb{R}^{C\times H\times W}$. The embeddings are thus reinforced in both the channel and spatial domains. We fuse them with $\mathbf{A}_{s}$ through a two-stage module for better representations. The first stage, Transform, consists of a $\mathbf{Conv}^{1\times 1}$ layer, a $\mathbf{LeakyReLU}$ layer and a $\mathbf{Conv}^{1\times 1}$ layer, and the second stage consists of a $\mathbf{LeakyReLU}$ layer. The final output $\mathbf{A}^{*}_{m3}$ is

$$\mathbf{A}_{t}=\mathbf{Transform}(\mathbf{A}_{ce}+\mathbf{A}_{se})+\mathbf{A}_{s},\quad\mathbf{A}^{*}_{m3}=\mathbf{LeakyReLU}(\mathbf{A}_{t})+\mathbf{A}^{\prime\prime}_{m3}. \tag{15}$$

In this way, we obtain enhanced information without losing the original semantic features. The other embeddings are enhanced in the same way and denoted $\mathbf{A}^{*}_{\rm{o}},\mathbf{A}^{*}_{m1}$ and $\mathbf{A}^{*}_{m5}$.

2.4 Cross Embedding Balance Loss

To take full advantage of identity-guided and modality-consistent embeddings, we propose the cross embedding balance loss (CEBL) $\mathcal{L}_{CEBL}$ to reduce cross-modal discrepancies. $\mathcal{L}_{CEBL}$ consists of a cross triplet loss $\mathcal{L}_{ctri}$, shown on the right-hand side of Fig. 1(a), and a balance contrastive loss $\mathcal{L}_{bc}$. Inspired by [Zhang et al.(2023)Zhang, Yan, Li, and Wang, Zhang and Wang(2023)], $\mathcal{L}_{ctri}$ constrains the correlation between the diverse embeddings ($\mathbf{A}^{*}_{\rm{o}},\mathbf{A}^{*}_{m1},\mathbf{A}^{*}_{m3},\mathbf{A}^{*}_{m5}$) to reduce the cross-modal discrepancies at different scales. To examine the relationship between the visible and infrared modalities from the identity-guided and modality-consistent perspective, we split $\mathbf{A}^{*}_{\rm{o}},\mathbf{A}^{*}_{m1},\mathbf{A}^{*}_{m3},\mathbf{A}^{*}_{m5}$ into visible embeddings $\mathbf{V}^{*}_{\rm{o}},\mathbf{V}^{*}_{m1},\mathbf{V}^{*}_{m3},\mathbf{V}^{*}_{m5}$ and infrared embeddings $\mathbf{I}^{*}_{\rm{o}},\mathbf{I}^{*}_{m1},\mathbf{I}^{*}_{m3},\mathbf{I}^{*}_{m5}$, respectively.

For identity $i$, let $\mathbf{C}^{i}_{Vx}$ and $\mathbf{C}^{i}_{Ix}$ be the cluster centers of $\mathbf{V}^{*}_{x}$ and $\mathbf{I}^{*}_{x}$ ($x=\rm{o},m1,m3,m5$), respectively. Our goal is to reduce the distance between $\mathbf{C}^{i}_{Vm3}$ and $\mathbf{C}^{i}_{Iy}$, where $y$ is one of $\rm{o},m1,m5$. Aligning embeddings in this manner bridges cross-modal discrepancies and constrains them at diverse scales. Meanwhile, to reduce intra-modal discrepancies, we increase the distance between $\mathbf{C}^{i}_{Vm3}$ and $\mathbf{C}^{j}_{Vy}$ for different identities $i\neq j$. The loss term $\mathcal{L}_{V3y}$ is derived from the triplet loss,

$$\mathcal{L}_{V3y}=\sum_{\substack{i,j=1\\ i\neq j}}^{C}\left[\alpha+\boldsymbol{D}\left(\mathbf{C}_{Vm3}^{i},\ \mathbf{C}_{Iy}^{i}\right)-\boldsymbol{D}\left(\mathbf{C}_{Vm3}^{i},\mathbf{C}_{Vy}^{j}\right)\right]_{+}, \tag{16}$$

where $\boldsymbol{D}$ denotes the Euclidean distance between two cluster centers and $\alpha$ is a constant margin. As shown in Tab. 3, the margin3 branch is the most discriminative when each branch is tested individually, so we want the other branches to perform similarly to it. This is why $\mathcal{L}_{ctri}$ uses the margin3 centers as anchors. Similarly, we reduce the distance between $\mathbf{C}^{i}_{Im3}$ and $\mathbf{C}^{i}_{Vy}$, where $y$ is one of $\rm{o},m1,m5$. Meanwhile, to reduce the intra-modal discrepancies, we increase the distance between $\mathbf{C}^{i}_{Im3}$ and $\mathbf{C}^{j}_{Iy}$ by

$$\mathcal{L}_{I3y}=\sum_{\substack{i,j=1\\ i\neq j}}^{C}\left[\alpha+\boldsymbol{D}\left(\mathbf{C}_{Im3}^{i},\ \mathbf{C}_{Vy}^{i}\right)-\boldsymbol{D}\left(\mathbf{C}_{Im3}^{i},\mathbf{C}_{Iy}^{j}\right)\right]_{+}. \tag{17}$$

Therefore, we can get the cross triplet loss $\mathcal{L}_{ctri}$ , by

$$\mathcal{L}_{ctri}=\sum_{y}\left(\mathcal{L}_{V3y}+\mathcal{L}_{I3y}\right),\quad y\in\{\mathrm{o},m1,m5\}. \tag{18}$$
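
Eqs. (16)-(18) can be sketched on toy cluster centers as follows. This is an illustrative sketch rather than the authors' code: centers are plain lists indexed by identity, the dictionaries keyed by branch name and the function names `cross_triplet` and `l_ctri` are hypothetical, and the margin value is chosen only to make the hinge active.

```python
import math

def euclid(u, v):
    """Euclidean distance D between two cluster centers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cross_triplet(anchor, pos, neg, alpha):
    """Hinge of Eq. (16)/(17): anchor[i] close to pos[i], far from neg[j], j != i."""
    total = 0.0
    for i in range(len(anchor)):
        for j in range(len(anchor)):
            if i != j:
                gap = alpha + euclid(anchor[i], pos[i]) - euclid(anchor[i], neg[j])
                total += max(0.0, gap)
    return total

def l_ctri(centers_v, centers_i, alpha):
    """Eq. (18): sum both directions over the non-anchor branches."""
    total = 0.0
    for y in ("o", "m1", "m5"):
        total += cross_triplet(centers_v["m3"], centers_i[y], centers_v[y], alpha)  # L_V3y
        total += cross_triplet(centers_i["m3"], centers_v[y], centers_i[y], alpha)  # L_I3y
    return total

# Toy centers for two identities; every branch shares the same values (assumed).
vis = [[0.0, 0.0], [3.0, 0.0]]
inf = [[0.0, 1.0], [3.0, 1.0]]
branches = ("o", "m1", "m3", "m5")
centers_v = {b: vis for b in branches}
centers_i = {b: inf for b in branches}

loose = l_ctri(centers_v, centers_i, alpha=0.5)   # margin already satisfied
tight = l_ctri(centers_v, centers_i, alpha=3.0)   # margin violated by 1 per pair
```

With the small margin every term is clipped to zero. With the large margin each of the two identity pairs contributes 1 to each of the six terms.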

Although $\mathcal{L}_{ctri}$ pulls positive embeddings together and pushes negative embeddings apart, it can leave the distances between hard-negative samples unbalanced, so that some hard-negative samples are treated as positive samples. To solve this, we propose the balance contrastive loss $\mathcal{L}_{bc}$, which takes the form of a temperature-scaled softmax function [Wu et al.(2018)Wu, Xiong, Yu, and Lin, Atito et al.(2021)Atito, Awais, and Kittler], to maximize the distance among all the negative samples. To get $\mathcal{L}_{bc}$, we first concatenate the cross-modal embeddings,

$$\mathbf{C}^{i}_{mk}=\mathbf{Concat}(\mathbf{C}^{i}_{Vmk},\mathbf{C}^{i}_{Imk}),\ k=1,3,5,\quad\mathbf{C}^{i}_{\rm{o}}=\mathbf{Concat}(\mathbf{C}^{i}_{V\rm{o}},\mathbf{C}^{i}_{I\rm{o}}). \tag{19}$$

The balance cross contrastive loss $\mathcal{L}_{bcxy}$ between $\mathbf{C}^{i}_{x}$ and $\mathbf{C}^{i}_{y}$ is

$$\mathcal{L}_{bcxy}=\frac{\mathrm{e}^{\operatorname{sim}\left(\mathbf{C}^{i}_{x},\mathbf{C}^{i}_{y}\right)/\tau}}{\sum_{k=1,k\neq i}^{2N}\mathrm{e}^{\operatorname{sim}\left(\mathbf{C}^{i}_{x},\mathbf{C}^{k}_{y}\right)/\tau}}, \tag{20}$$

and we get the balance contrastive loss $\mathcal{L}_{bc}$ as follows,

$$\mathcal{L}_{bc}=\sum_{y}\sum_{x}\mathcal{L}_{bcxy},\quad x,y\in\{\mathrm{o},m1,m3,m5\},\ x\neq y. \tag{21}$$
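
The temperature-scaled ratio in Eq. (20) can be evaluated on toy centers as in the sketch below. Several points are assumptions: the section does not define $\operatorname{sim}$, so cosine similarity is used; the sum over negatives runs over all identities present in the list rather than the printed upper bound $2N$; and the temperature value is arbitrary. The sketch computes only the ratio for one identity and one branch pair, and does not claim how it is reduced over identities during training. The name `bc_ratio` is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; the choice of sim(.,.) is an assumption."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def bc_ratio(centers_x, centers_y, i, tau):
    """Ratio of Eq. (20) for identity i between branches x and y."""
    positive = math.exp(cosine(centers_x[i], centers_y[i]) / tau)
    negatives = sum(
        math.exp(cosine(centers_x[i], centers_y[k]) / tau)
        for k in range(len(centers_y))
        if k != i
    )
    return positive / negatives

# Toy concatenated centers for three identities in two branches (assumed values).
centers_m1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
centers_m3 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

ratio = bc_ratio(centers_m1, centers_m3, i=0, tau=1.0)
sharper = bc_ratio(centers_m1, centers_m3, i=0, tau=0.5)
```

Lowering the temperature increases the ratio in this example, because the matching pair has the highest similarity and a smaller $\tau$ sharpens the contrast with the other identities.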

Therefore, our overall loss includes the proposed loss $\mathcal{L}_{CEBL}=\mathcal{L}_{ctri}+\mathcal{L}_{bc}$ , the triplet loss $\mathcal{L}_{tri}$ [Hermans et al.(2017)Hermans, Beyer, and Leibe], the cross-entropy loss $\mathcal{L}_{ce}$ [Ye et al.(2021b)Ye, Shen, Lin, Xiang, Shao, and Hoi], and can be expressed as

$$\mathcal{L}=\mathcal{L}_{ce}+\mathcal{L}_{tri}+\lambda\mathcal{L}_{CEBL}. \tag{22}$$

3 Experimental results

3.1 Experiment settings

We test our method on two datasets. SYSU-MM01 [Wu et al.(2017)Wu, Zheng, Yu, Gong, and Lai] is a large and challenging dataset for VI-ReID. It includes 29,033 visible and 15,712 infrared images captured by 4 visible and 2 infrared cameras in indoor and outdoor settings. The training set has 22,258 visible and 11,909 infrared images from 395 identities. The test set contains images of 96 identities, split into a query set (infrared) and a gallery set (visible). Testing is performed under two modes: all-search, which uses all images, and indoor-search, which uses only indoor images. The second dataset, RegDB [Nguyen et al.(2017)Nguyen, Hong, Kim, and Park], comprises 10 visible and 10 infrared images per person, with 2,060 visible and 2,060 infrared images in each of the training and testing sets. During testing, both visible-to-infrared and infrared-to-visible modes are used, with the 2,060 visible and 2,060 infrared test images serving as query and gallery sets.

All experiments are run on an NVIDIA A100 GPU. All input images are resized to $3\times 288\times 144$ with channel augmentation [Ye et al.(2021a)Ye, Ruan, Du, and Shou]. DIAN is used in both the training and inference phases. In each mini-batch, we randomly select 4 visible and 4 infrared images, with a batch size of 6. The SGD optimizer is adopted for training. Details of the learning rate are given in the supplementary material. Since the two datasets differ, the model is tailored to each of them. Given that attention mechanisms usually benefit from large amounts of data [Vaswani et al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin], for SYSU-MM01 DIAN follows exactly the design in Section 2. RegDB is a simpler, smaller dataset, so we simplify DIAN for it by removing the OFM modules and eliminating the origin branch, which reduces the number of branches from four to three.

Finally, for the overall loss $\mathcal{L}$ in Eq. (22), according to the experimental results for different $\lambda$ values in Fig. 2(a) and Fig. 2(b), we set $\lambda$ to 0.4 for both the SYSU-MM01 and RegDB datasets. The outputs of all branches are added for testing. We use mean Average Precision (mAP) and Cumulative Matching Characteristics (CMC) to evaluate our work. mAP averages, over all queries, the average precision of the ranked gallery, while CMC at rank $k$ is the fraction of queries for which a correct match appears among the top-$k$ results.
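
The two metrics can be sketched on a toy ranking as follows. This is a simplified illustration that assumes each query comes with a gallery already sorted by decreasing similarity. It omits dataset-specific protocol details such as camera filtering, and the function names and labels are hypothetical.

```python
def rank_k_hit(ranked_labels, query_label, k):
    """1.0 if a correct match appears in the top-k gallery results, else 0.0."""
    return 1.0 if query_label in ranked_labels[:k] else 0.0

def average_precision(ranked_labels, query_label):
    """Mean of the precision values at each position holding a correct match."""
    hits, precisions = 0, []
    for pos, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / pos)
    return sum(precisions) / len(precisions) if precisions else 0.0

def evaluate(results, k):
    """Return (CMC at rank k, mAP) over (query_label, ranked_gallery) pairs."""
    cmc = sum(rank_k_hit(r, q, k) for q, r in results) / len(results)
    mean_ap = sum(average_precision(r, q) for q, r in results) / len(results)
    return cmc, mean_ap

# Two toy queries with gallery identity labels sorted by similarity (assumed).
results = [
    ("a", ["a", "b", "a", "c"]),
    ("b", ["c", "b", "a", "b"]),
]
rank1, mean_ap = evaluate(results, k=1)
rank2, _ = evaluate(results, k=2)
```

Here the first query is matched at rank 1 and the second only at rank 2, so Rank-1 is 0.5 and Rank-2 is 1.0, while mAP also accounts for where the later correct matches fall.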

3.2 Comparison With Existing Methods and Ablation Studies

We compare our method with existing SOTA methods that do not use extra data. The experimental results on the SYSU-MM01 and RegDB datasets are reported in Tab. 1. More quantitative and qualitative analyses are given in the supplementary material.

Table 1: Re-identification rates (%) on the SYSU-MM01 and RegDB datasets. Bold marks the first-ranked result and underline marks the second-ranked result.

Model	SYSU-MM01				RegDB			
	All-Search		Indoor-Search		Visible to Infrared		Infrared to Visible	
	Rank-1	mAP	Rank-1	mAP	Rank-1	mAP	Rank-1	mAP
cmGAN [Dai et al.(2018)Dai, Ji, Wang, Wu, and Huang]	26.97	27.80	31.63	42.19	-	-	-	-
AlignGAN [Wang et al.(2019a)Wang, Zhang, Cheng, Liu, Yang, and Hou]	42.40	40.70	45.90	54.30	57.90	53.60	-	-
MSR [Feng et al.(2019)Feng, Lai, and Xie]	37.50	38.11	39.64	50.88	48.43	48.67	-	-
JSIA [Wang et al.(2020)Wang, Zhang, Yang, Cheng, Chang, Liang, and Hou]	38.10	36.90	43.80	52.90	48.53	49.30	48.12	48.94
SDL [Kansal et al.(2020)Kansal, Subramanyam, Wang, and Satoh]	28.12	29.01	32.56	39.56	26.47	23.58	25.74	22.89
X-Modality [Li et al.(2020)Li, Wei, Hong, and Gong]	49.92	50.73	-	-	62.21	60.18	-	-
DDAG [Ye et al.(2020)Ye, Shen, J. Crandall, Shao, and Luo]	54.75	53.02	61.02	67.98	69.34	63.46	68.06	61.80
NFS [Chen et al.(2021b)Chen, Wan, Li, Jing, and Sun]	56.91	55.45	62.79	69.79	80.54	72.10	77.95	69.79
MSO [Gao et al.(2021)Gao, Liang, Jin, Gu, Liu, Li, and Lang]	58.70	56.42	63.09	70.31	73.60	66.90	74.60	67.50
GECNet [Zhong et al.(2021)Zhong, Lu, Huang, Ye, Jia, and Lin]	53.37	51.83	60.60	62.89	82.33	78.45	78.93	75.58
CIMA [Zhao et al.(2021)Zhao, Liu, Chu, Lu, and Yu]	57.20	59.30	66.60	74.70	78.80	69.40	77.90	69.40
TSME [Liu et al.(2022)Liu, Wang, Huang, Zhang, and Han]	64.23	61.21	64.80	71.53	87.35	76.94	86.41	75.70
CMTR [Liang et al.(2021)Liang, Jin, Gao, Liu, Feng, Wang, and Li]	62.58	61.33	67.02	75.40	80.62	74.42	81.06	73.75
TCOM [Si et al.(2023)Si, He, Li, and Gao]	63.92	60.71	68.35	73.08	87.04	80.40	83.20	76.73
SFANet [Liu et al.(2023)Liu, Ma, Xia, and Li]	65.74	60.83	71.60	80.05	76.31	68.00	70.15	63.77
PMT [Lu et al.(2023)Lu, Zou, and Zhang]	67.53	64.98	71.66	76.52	84.83	76.55	84.16	75.13
SIDA [Gong et al.(2023)Gong, Zhao, Lam, Gao, and Shen]	68.36	64.19	73.28	77.49	81.73	75.07	79.71	72.60
MFCS [Yang et al.(2024)Yang, Dong, Li, Wei, Wang, and Gao]	70.59	67.49	75.98	80.24	85.34	76.39	83.88	75.16
DIAN(Ours)	75.20	71.15	86.28	87.41	88.06	82.57	86.07	80.02

Table 2: DIAN Ablation Study on SYSU-MM01 (%).

Settings					SYSU-MM01			
OFM	IEDK	PPEM	$\mathcal{L}_{ctri}$	$\mathcal{L}_{bc}$	Rank-1	Rank-10	Rank-20	mAP
					66.05	93.08	97.37	61.89
✓					71.29	95.41	98.08	66.91
	✓				70.81	95.19	97.82	66.92
		✓			70.16	94.61	98.00	65.00
✓	✓				70.16	95.90	98.82	66.77
✓	✓	✓			73.65	96.77	99.13	68.95
	✓	✓	✓		72.05	95.19	98.03	67.14
✓	✓	✓	✓		74.89	97.81	99.00	70.05
✓	✓	✓		✓	73.66	97.82	99.41	69.10
✓	✓	✓	✓	✓	75.20	97.84	99.53	71.15

Table 3: Rank-1 and mAP (%) of the four branches on SYSU-MM01 without $\mathcal{L}_{CEBL}$.

Four-branch performance on SYSU-MM01
Branch Name	Rank-1	mAP
Origin	72.21	66.74
Margin1	71.89	67.25
Margin3	72.52	67.37
Margin5	72.23	67.13
Add All	73.65	68.95

Specifically, on the SYSU-MM01 dataset in indoor-search mode, our method achieves a Rank-1 accuracy of 86.28% and an mAP of 87.41%. In Tab. 1, bold text marks the best result for each indicator and underlined text the second best. In all-search mode, our method achieves a Rank-1 accuracy of 75.20% and an mAP of 71.15%. We attribute the particularly strong indoor-search results to identity-guided and modality-consistent features being more evident in indoor scenes than across all scenes. As shown in Tab. 1, our method handles different query modes robustly. We also evaluate our model on RegDB with two query modes. In the visible-to-infrared mode, our method achieves a Rank-1 accuracy of 88.06% and an mAP of 82.57%; in the infrared-to-visible mode, it achieves 86.07% and 80.02%.
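For readers unfamiliar with the two indicators reported above, the following is a minimal sketch of how Rank-k accuracy and mAP are commonly computed from a query-gallery distance matrix. It is an illustration, not the evaluation code used for Tab. 1: the function name `rank_k_and_map` and the toy data are hypothetical, and the sketch omits the dataset-specific protocol details (for example, camera-based gallery filtering and repeated random gallery sampling).

```python
def rank_k_and_map(dist, query_ids, gallery_ids, ks=(1, 10, 20)):
    """Return ({k: Rank-k accuracy}, mAP), both as fractions in [0, 1].

    dist[i][j] is the distance between query i and gallery item j
    (smaller means more similar). Simplified: no camera filtering.
    """
    hits = {k: 0 for k in ks}
    ap_sum = 0.0
    valid = 0
    for i, qid in enumerate(query_ids):
        # Sort gallery indices from most to least similar.
        order = sorted(range(len(gallery_ids)), key=lambda j: dist[i][j])
        matches = [gallery_ids[j] == qid for j in order]
        if not any(matches):
            continue  # query identity absent from the gallery; skip it
        valid += 1
        first_hit = matches.index(True)  # 0-based rank of first correct match
        for k in ks:
            if first_hit < k:
                hits[k] += 1
        # Average precision: mean of precision values at each correct match.
        found = 0
        precision_sum = 0.0
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                found += 1
                precision_sum += found / rank
        ap_sum += precision_sum / found
    if valid == 0:
        raise ValueError("no query has a matching identity in the gallery")
    return {k: hits[k] / valid for k in ks}, ap_sum / valid


if __name__ == "__main__":
    # Toy example: 2 queries, 3 gallery items (hypothetical values).
    dist = [[0.1, 0.5, 0.9],
            [0.8, 0.2, 0.3]]
    cmc, mean_ap = rank_k_and_map(dist, [1, 2], [1, 3, 2], ks=(1, 2))
    print(cmc, mean_ap)
```

In the toy example, the first query finds its match at rank 1 and the second at rank 2, so Rank-1 is 0.5, Rank-2 is 1.0, and mAP is 0.75.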

Ablation on each component. We performed ablation studies to assess the effectiveness of each component in DIAN. In Tab. 2, the first row of results (no component checked) gives the performance of the baseline AGW [Ye et al.(2021a)Ye, Ruan, Du, and Shou] with channel augmentation, trained with the loss terms $\mathcal{L}_{ce}$ and $\mathcal{L}_{tri}$. A $\checkmark$ indicates that the corresponding module or loss is added. Adding any single module to the baseline improves both Rank-1 and mAP. DIAN without $\mathcal{L}_{ctri}$ and $\mathcal{L}_{bc}$ already outperforms the baseline, which indicates the importance of exploring identity-guided and modality-consistent embeddings. Moreover, the last three rows show that adding $\mathcal{L}_{ctri}$ and $\mathcal{L}_{bc}$ improves performance further, suggesting that these losses help bridge the cross-modal gap through identity-guided and modality-consistent features.
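To make the size of each improvement explicit, the short sketch below computes the gain of selected ablation settings over the baseline. The Rank-1 and mAP values are copied from Tab. 2; the dictionary keys and the helper name `gains_over_baseline` are our own illustrative labels, not identifiers from the authors' code.

```python
# (Rank-1, mAP) in %, copied from Tab. 2; keys list the enabled components.
ABLATION = {
    "baseline": (66.05, 61.89),
    "OFM": (71.29, 66.91),
    "IEDK": (70.81, 66.92),
    "PPEM": (70.16, 65.00),
    "OFM+IEDK+PPEM": (73.65, 68.95),
    "OFM+IEDK+PPEM+L_ctri": (74.89, 70.05),
    "OFM+IEDK+PPEM+L_bc": (73.66, 69.10),
    "OFM+IEDK+PPEM+L_ctri+L_bc": (75.20, 71.15),
}


def gains_over_baseline(table, baseline_key="baseline"):
    """Return {setting: (Rank-1 gain, mAP gain)} in percentage points."""
    base_r1, base_map = table[baseline_key]
    return {
        name: (round(r1 - base_r1, 2), round(m - base_map, 2))
        for name, (r1, m) in table.items()
        if name != baseline_key
    }


if __name__ == "__main__":
    for name, (d_r1, d_map) in gains_over_baseline(ABLATION).items():
        print(f"{name:28s} Rank-1 {d_r1:+.2f}  mAP {d_map:+.2f}")
```

With these values, the three modules together add 7.60 points of Rank-1 and 7.06 points of mAP over the baseline, and the full model with both losses adds 9.15 and 9.26 points, respectively.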

Test on four branches. As shown in Tab. 3, we tested the four branches separately. Among the individual branches, the Margin3 branch achieves the best performance, which suggests that it extracts the most discriminative identity-guided embeddings. This observation motivates the design of $\mathcal{L}_{ctri}$.

Quantitative analysis and visualization. We also provide a quantitative analysis and visualizations of DIAN to further validate our network. Please see the supplementary material for details.

4 Conclusion

This paper addresses the problem of VI-ReID. We design a dynamic identity-guided attention network (DIAN) to mine identity-guided and modality-consistent embeddings. DIAN introduces three orthogonal fusion modules (OFM) to fuse features for decoupling, an identity-guided embedding decoupling kernel (IEDK) to mine discriminative identity-guided and modality-consistent features at different scales, and a parallel progressive enhancement module (PPEM) to progressively enhance these features in parallel. Finally, a cross-embedding balance loss (CEBL) is introduced to bridge the gap between different modalities using the identity-guided and modality-consistent embeddings. Experimental results on SYSU-MM01 and RegDB demonstrate that DIAN achieves superior performance.

References

[Atito et al.(2021)Atito, Awais, and Kittler] Sara Atito, Muhammad Awais, and Josef Kittler. Sit: Self-supervised vision transformer. arXiv preprint arXiv:2104.03602, 2021.
[Chai et al.(2023)Chai, Ling, Luo, Lin, Jiang, and Li] Zehua Chai, Yongguo Ling, Zhiming Luo, Dazhen Lin, Min Jiang, and Shaozi Li. Dual-stream transformer with distribution alignment for visible-infrared person re-identification. IEEE Trans. Circuits Syst. Video Technol., 33(11):6764–6776, 2023.
[Chen et al.(2021a)Chen, Fan, and Panda] Chun-Fu Richard Chen, Quanfu Fan, and Rameswar Panda. Crossvit: Cross-attention multi-scale vision transformer for image classification. In Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), pages 357–366, Oct. 2021a.
[Chen et al.(2018)Chen, Collins, Zhu, Papandreou, Zoph, Schroff, Adam, and Shlens] Liang-Chieh Chen, Maxwell Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, and Jon Shlens. Searching for efficient multi-scale architectures for dense image prediction. Proc. Int. Conf. Neural Inf. Process. Syst. (NeurIPS), 31, 2018.
[Chen et al.(2021b)Chen, Wan, Li, Jing, and Sun] Yehansen Chen, Lin Wan, Zhihang Li, Qianyan Jing, and Zongyuan Sun. Neural feature search for rgb-infrared person re-identification. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 587–597, Jun. 2021b.
[Dai et al.(2017)Dai, Qi, Xiong, Li, Zhang, Hu, and Wei] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), pages 764–773, Oct. 2017.
[Dai et al.(2018)Dai, Ji, Wang, Wu, and Huang] Pingyang Dai, Rongrong Ji, Haibin Wang, Qiong Wu, and Yuyu Huang. Cross-modality person re-identification with generative adversarial training. In Int. Joint Conf. Artif. Intell., volume 1, page 6, 2018.
[Feng et al.(2023)Feng, Ji, Wu, Gao, Gao, Liu, Liu, Jing, and Luo] Yujian Feng, Yimu Ji, Fei Wu, Guangwei Gao, Yang Gao, Tianliang Liu, Shangdong Liu, Xiao-Yuan Jing, and Jiebo Luo. Occluded visible-infrared person re-identification. IEEE Trans. Multimedia, 25:1401–1413, 2023.
[Feng et al.(2019)Feng, Lai, and Xie] Zhanxiang Feng, Jianhuang Lai, and Xiaohua Xie. Learning modality-specific representations for visible-infrared person re-identification. IEEE Trans. Image Process., 29:579–590, 2019.
[Gao et al.(2019)Gao, Cheng, Zhao, Zhang, Yang, and Torr] Shang-Hua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and Philip Torr. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell., 43(2):652–662, 2019.
[Gao et al.(2021)Gao, Liang, Jin, Gu, Liu, Li, and Lang] Yajun Gao, Tengfei Liang, Yi Jin, Xiaoyan Gu, Wu Liu, Yidong Li, and Congyan Lang. Mso: Multi-feature space joint optimization network for rgb-infrared person re-identification. In Proc. of the 29th ACM Int. Conf. Multimedia, pages 5257–5265, 2021.
[Gong et al.(2023)Gong, Zhao, Lam, Gao, and Shen] Jiahao Gong, Sanyuan Zhao, Kin-Man Lam, Xin Gao, and Jianbing Shen. Spectrum-irrelevant fine-grained representation for visible–infrared person re-identification. Comput. Vis. Image Underst., 232:103703, 2023.
[Hao et al.(2019)Hao, Wang, Li, and Gao] Yi Hao, Nannan Wang, Jie Li, and Xinbo Gao. Hsme: Hypersphere manifold embedding for visible thermal person re-identification. In Proc. AAAI Conf. Artif. Intell., volume 33, pages 8385–8392, 2019.
[Hermans et al.(2017)Hermans, Beyer, and Leibe] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
[Jiang et al.(2022)Jiang, Zhang, Liu, Qian, Zhang, and Wu] Kongzhu Jiang, Tianzhu Zhang, Xiang Liu, Bingqiao Qian, Yongdong Zhang, and Feng Wu. Cross-modality transformer for visible-infrared person re-identification. In Proc. Eur. Conf. Comput. Vis. (ECCV), pages 480–496. Springer, Oct. 2022.
[Kansal et al.(2020)Kansal, Subramanyam, Wang, and Satoh] Kajal Kansal, A Venkata Subramanyam, Zheng Wang, and Shin’ichi Satoh. Sdl: Spectrum-disentangled representation learning for visible-infrared person re-identification. IEEE Trans. Circuits Syst. Video Technol., 30(10):3422–3432, 2020.
[Li et al.(2020)Li, Wei, Hong, and Gong] Diangang Li, Xing Wei, Xiaopeng Hong, and Yihong Gong. Infrared-visible cross-modal person re-identification with an x modality. In Proc. AAAI Conf. Artif. Intell., volume 34, pages 4610–4617, 2020.
[Liang et al.(2021)Liang, Jin, Gao, Liu, Feng, Wang, and Li] Tengfei Liang, Yi Jin, Yajun Gao, Wu Liu, Songhe Feng, Tao Wang, and Yidong Li. Cmtr: Cross-modality transformer for visible-infrared person re-identification. arXiv preprint arXiv:2110.08994, 2021.
[Liu et al.(2023)Liu, Ma, Xia, and Li] Haojie Liu, Shun Ma, Daoxun Xia, and Shaozi Li. Sfanet: A spectrum-aware feature augmentation network for visible-infrared person reidentification. IEEE Trans. Neural Netw. Learn. Syst., 34(4):1958–1971, 2023.
[Liu et al.(2022)Liu, Wang, Huang, Zhang, and Han] Jianan Liu, Jialiang Wang, Nianchang Huang, Qiang Zhang, and Jungong Han. Revisiting modality-specific feature compensation for visible-infrared person re-identification. IEEE Trans. Circuits Syst. Video Technol., 32(10):7226–7240, 2022.
[Lu et al.(2023)Lu, Zou, and Zhang] Hu Lu, Xuezhang Zou, and Pingping Zhang. Learning progressive modality-shared transformers for effective visible-infrared person re-identification. In Proc. AAAI Conf. Artif. Intell., volume 37, pages 1835–1843, 2023.
[Nguyen et al.(2017)Nguyen, Hong, Kim, and Park] Dat Tien Nguyen, Hyung Gil Hong, Ki Wan Kim, and Kang Ryoung Park. Person recognition system based on a combination of body images from visible light and thermal cameras. Sens., 17(3):605, 2017.
[Shen et al.(2023)Shen, Zhao, and Zhang] Hao Shen, Zhong-Qiu Zhao, and Wandi Zhang. Adaptive dynamic filtering network for image denoising. In Proc. AAAI Conf. Artif. Intell., volume 37, pages 2227–2235, 2023.
[Si et al.(2023)Si, He, Li, and Gao] Tongzhen Si, Fazhi He, Penglei Li, and Xiaoxin Gao. Tri-modality consistency optimization with heterogeneous augmented images for visible-infrared person re-identification. Neurocomputing, 523:170–181, 2023.
[Sun et al.(2018)Sun, Zheng, Yang, Tian, and Wang] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proc. Eur. Conf. Comput. Vis. (ECCV), pages 480–496. Springer, Sept. 2018.
[Vaswani et al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Proc. Int. Conf. Neural Inf. Process. Syst. (NeurIPS), 30, 2017.
[Wang et al.(2020)Wang, Zhang, Yang, Cheng, Chang, Liang, and Hou] Guan-An Wang, Tianzhu Zhang, Yang Yang, Jian Cheng, Jianlong Chang, Xu Liang, and Zeng-Guang Hou. Cross-modality paired-images generation for rgb-infrared person re-identification. In Proc. AAAI Conf. Artif. Intell., volume 34, pages 12144–12151, 2020.
[Wang et al.(2019a)Wang, Zhang, Cheng, Liu, Yang, and Hou] Guan’an Wang, Tianzhu Zhang, Jian Cheng, Si Liu, Yang Yang, and Zengguang Hou. Rgb-infrared cross-modality person re-identification via joint pixel and feature alignment. In Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), pages 3623–3632, Oct. 2019a.
[Wang et al.(2019b)Wang, Wang, Zheng, Chuang, and Satoh] Zhixiang Wang, Zheng Wang, Yinqiang Zheng, Yung-Yu Chuang, and Shin’ichi Satoh. Learning to reduce dual-level discrepancy for infrared-visible person re-identification. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 618–626, Jun. 2019b.
[Wei et al.(2018)Wei, Zhang, Gao, and Tian] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer gan to bridge domain gap for person re-identification. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 79–88, Jun. 2018.
[Wei et al.(2021)Wei, Yang, Wang, and Gao] Ziyu Wei, Xi Yang, Nannan Wang, and Xinbo Gao. Flexible body partition-based adversarial learning for visible infrared person re-identification. IEEE Trans. Neural Netw. Learn. Syst., 33(9):4676–4687, 2021.
[Wu et al.(2017)Wu, Zheng, Yu, Gong, and Lai] Ancong Wu, Wei-Shi Zheng, Hong-Xing Yu, Shaogang Gong, and Jianhuang Lai. Rgb-infrared cross-modality person re-identification. In Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), pages 5380–5389, Oct. 2017.
[Wu et al.(2018)Wu, Xiong, Yu, and Lin] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 3733–3742, Jun. 2018.
[Yang et al.(2021)Yang, He, Fan, Shi, Xue, Li, Ding, and Huang] Min Yang, Dongliang He, Miao Fan, Baorong Shi, Xuetong Xue, Fu Li, Errui Ding, and Jizhou Huang. Dolg: Single-stage image retrieval with deep orthogonal fusion of local and global features. In Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), pages 11772–11781, Oct. 2021.
[Yang et al.(2024)Yang, Dong, Li, Wei, Wang, and Gao] Xi Yang, Wenjiao Dong, Meijie Li, Ziyu Wei, Nannan Wang, and Xinbo Gao. Cooperative separation of modality shared-specific features for visible-infrared person re-identification. IEEE Trans. Multimedia, pages 1–11, 2024. 10.1109/TMM.2024.3377139.
[Ye et al.(2020)Ye, Shen, J. Crandall, Shao, and Luo] Mang Ye, Jianbing Shen, David J. Crandall, Ling Shao, and Jiebo Luo. Dynamic dual-attentive aggregation learning for visible-infrared person re-identification. In Proc. Eur. Conf. Comput. Vis. (ECCV), pages 229–247. Springer, Aug. 2020.
[Ye et al.(2021a)Ye, Ruan, Du, and Shou] Mang Ye, Weijian Ruan, Bo Du, and Mike Zheng Shou. Channel augmented joint learning for visible-infrared recognition. In Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), pages 13567–13576, Oct. 2021a.
[Ye et al.(2021b)Ye, Shen, Lin, Xiang, Shao, and Hoi] Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven CH Hoi. Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell., 44(6):2872–2893, Aug. 2021b.
[Zhang et al.(2022)Zhang, Kang, Zhao, and Shen] Yiyuan Zhang, Yuhao Kang, Sanyuan Zhao, and Jianbing Shen. Dual-semantic consistency learning for visible-infrared person re-identification. IEEE Trans. Inf. Foren. Sec., 18:1554–1565, 2022.
[Zhang and Wang(2023)] Yukang Zhang and Hanzi Wang. Diverse embedding expansion network and low-light cross-modality benchmark for visible-infrared person re-identification. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 2153–2162, Jun. 2023.
[Zhang et al.(2023)Zhang, Yan, Li, and Wang] Yukang Zhang, Yan Yan, Jie Li, and Hanzi Wang. Mrcn: A novel modality restitution and compensation network for visible-infrared person re-identification. arXiv preprint arXiv:2303.14626, 2023.
[Zhao et al.(2021)Zhao, Liu, Chu, Lu, and Yu] Zhiwei Zhao, Bin Liu, Qi Chu, Yan Lu, and Nenghai Yu. Joint color-irrelevant consistency learning and identity-aware modality adaptation for visible-infrared cross modality person re-identification. In Proc. AAAI Conf. Artif. Intell., volume 35, pages 3520–3528, 2021.
[Zheng et al.(2017)Zheng, Zhang, Sun, Chandraker, Yang, and Tian] Liang Zheng, Hengheng Zhang, Shaoyan Sun, Manmohan Chandraker, Yi Yang, and Qi Tian. Person re-identification in the wild. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 1367–1376, Jul. 2017.
[Zhong et al.(2021)Zhong, Lu, Huang, Ye, Jia, and Lin] Xian Zhong, Tianyou Lu, Wenxin Huang, Mang Ye, Xuemei Jia, and Chia-Wen Lin. Grayscale enhancement colorization network for visible-infrared person re-identification. IEEE Trans. Circuits Syst. Video Technol., 32(3):1418–1430, 2021.