
License: CC BY 4.0
arXiv:2401.06825v1 [cs.CV] 12 Jan 2024

Multi-Memory Matching for Unsupervised Visible-Infrared Person Re-Identification

Jiangming Shi1, Xiangbo Yin2, Yeyun Chen1, Yachao Zhang3, Zhizhong Zhang4, 5,
Yuan Xie4, 5*, Yanyun Qu1,2
1 Institute of Artificial Intelligence, 2 School of Informatics, Xiamen University
3 Tsinghua Shenzhen International Graduate School, Tsinghua University
4 East China Normal University, 5 Chongqing Institute of East China Normal University
[email protected], {zzzhang, yxie}@cs.ecnu.edu.cn, [email protected]
* Corresponding author.
Abstract

Unsupervised visible-infrared person re-identification (USL-VI-ReID) is a promising yet challenging retrieval task. The key challenges in USL-VI-ReID are to effectively generate pseudo-labels and establish pseudo-label correspondences across modalities without relying on any prior annotations. Recently, clustered pseudo-label methods have gained more attention in USL-VI-ReID. However, previous methods do not fully exploit individual nuances: they use a single memory to represent each identity when establishing cross-modality correspondences, which makes those correspondences ambiguous. To address this problem, we propose a Multi-Memory Matching (MMM) framework for USL-VI-ReID. We first design a Cross-Modality Clustering (CMC) module that generates pseudo-labels by clustering samples from both modalities together. To associate cross-modality clustered pseudo-labels, we design a Multi-Memory Learning and Matching (MMLM) module, ensuring that optimization explicitly focuses on the nuances of individual perspectives and establishes reliable cross-modality correspondences. Finally, we design a Soft Cluster-level Alignment (SCA) module to narrow the modality gap while mitigating the effect of noisy pseudo-labels through a soft many-to-many alignment strategy. Extensive experiments on the public SYSU-MM01 and RegDB datasets demonstrate the reliability of the established cross-modality correspondences and the effectiveness of our MMM. The source code will be released.

1 Introduction

Person re-identification (ReID) aims to match the same person across different cameras, with applications in various fields of video surveillance [49, 12, 13, 14], such as intelligent security and criminal investigation. However, in low-light conditions, the images captured by visible cameras are far from satisfactory, which makes methods [24, 14, 33, 43, 55] that primarily focus on matching visible images less effective. Fortunately, smart surveillance cameras that can switch from visible to infrared mode in poor lighting have become widespread, driving the development of visible-infrared person re-identification (VI-ReID) for 24-hour surveillance systems.

Figure 1: Comparison of different methods on ARI. ARI denotes the Adjusted Rand Index, which is a similarity measure between two clusterings. The visualization is presented in Fig. 5.

VI-ReID aims at retrieving infrared images of the same person when provided with a visible person image, and vice versa [53, 48]. Many VI-ReID methods [51, 53, 52, 28, 41, 56, 19] have shown promising progress. However, these methods rely on well-annotated cross-modality data, and producing such annotations is time-consuming and labor-intensive, thereby limiting the practical application of supervised VI-ReID methods in real-world scenarios.

To remove the laborious labeling process and speed up the automation of VI-ReID, several unsupervised VI-ReID (USL-VI-ReID) methods [44, 6, 16, 42, 47, 45] have been proposed, which try to establish cross-modality correspondences by clustering pseudo-labels and have achieved fairly good performance. However, the reliability of pseudo-labels and cross-modality correspondences has not been validated in USL-VI-ReID. We argue that both are critical to the credibility of USL-VI-ReID. To measure this reliability, we introduce the Adjusted Rand Index (ARI) [18], a widely recognized metric for clustering evaluation. The larger the ARI value, the greater the overlap between the clustered results and the ground-truth labels. More detailed explanations are presented in the supplementary materials. In Fig. 1, the RGB and IR categories denote the ARI values of visible and infrared pseudo-labels, which measure the quality of the pseudo-labels within each modality. The ALL category represents the ARI values of the overall pseudo-labels, composed of visible and infrared pseudo-labels, and serves as a metric for evaluating the reliability of cross-modality correspondences. Interestingly, we observe a peculiar phenomenon: cross-modality correspondences of previous methods are not reliable, though they achieve good performance, as shown in Fig. 1 and Tab. 1. Therefore, we raise a question: Why do noisy cross-modality correspondences still achieve good performance? The answer is that persons with different identities may share some similar features, and noisy correspondences pull these similar features closer together. However, this closer proximity may also make different persons appear more similar, rendering the retrieval of specific persons from a large gallery more challenging.
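To make the RGB, IR, and ALL readings of ARI concrete, the following is a minimal sketch of the standard pair-counting ARI formula in plain Python. The function name `adjusted_rand_index` and the toy labels are illustrative assumptions and are not taken from the paper or its supplementary materials.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same samples."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same samples")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no sample pairs to disagree on
    # Contingency counts: samples per (true identity, predicted cluster) cell.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (sum_cells - expected) / (max_index - expected)


# Toy data (hypothetical): two identities, two samples each, per modality.
true_rgb, true_ir = [0, 0, 1, 1], [0, 0, 1, 1]
# Each modality is clustered perfectly, but the cross-modality labels are swapped.
pseudo_rgb, pseudo_ir = [5, 5, 7, 7], [7, 7, 5, 5]

ari_rgb = adjusted_rand_index(true_rgb, pseudo_rgb)                        # 1.0
ari_ir = adjusted_rand_index(true_ir, pseudo_ir)                           # 1.0
ari_all = adjusted_rand_index(true_rgb + true_ir, pseudo_rgb + pseudo_ir)  # -1/6
```

In this toy case the per-modality scores are perfect while the ALL score is negative, which is the kind of gap between pseudo-label quality and correspondence quality that the ALL category is meant to expose.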

To reduce the ambiguous cross-modality correspondences in USL-VI-ReID, we develop a novel Multi-Memory Matching (MMM) framework to exploit the individual nuances. Multi-memory can store a wider array of distinct characteristics for a single identity. For example, Memory 1 can retain front-facing attributes, Memory 2 can capture rear-facing attributes, and so on. In short, multi-memory supports a more diverse representation, which is beneficial for the establishment of cross-modality correspondences. Specifically, we propose a Cross-Modality Clustering (CMC) module to generate pseudo-labels. Unlike previous methods, we not only cluster intra-modality samples but also cluster inter-modality samples to learn modality-invariant features. We note that existing methods typically rely on a single memory to represent individual characteristics and establish cross-modality correspondences. However, a single memory may not capture all individual nuances, including perspective, attire, and other factors, which naturally leads to poor cross-modality correspondences. Therefore, we design a Multi-Memory Learning and Matching (MMLM) module to obtain reliable cross-modality correspondences. Because the modules mentioned above do not directly reduce the discrepancy between the two modalities, we propose the Soft Cluster-level Alignment (SCA) module to narrow the modality gap through two soft cluster-level intra- and inter-modality alignment losses. Our MMM achieves fairly good quality of pseudo-labels and cross-modality correspondences compared with several USL-VI-ReID methods, as shown in Fig. 1.

The main contributions are summarized as follows:

  • We introduce the ARI metric to evaluate the quality of pseudo-labels and cross-modality correspondences, and we observe a curious phenomenon: cross-modality correspondences of previous methods are not reliable, though they achieve good performance.

  • We design a novel Multi-Memory Matching (MMM) framework for unsupervised VI-ReID, which exploits the individual nuances to effectively establish reliable cross-modality correspondences.

  • We introduce three effective modules: Cross-Modality Clustering (CMC), Multi-Memory Learning and Matching (MMLM), and Soft Cluster-level Alignment (SCA). These modules facilitate the generation of pseudo-labels, establish reliable cross-modality correspondences, and narrow the discrepancy between two modalities while mitigating the influence of noisy pseudo-labels.

2 Related Work

2.1 Supervised Visible-Infrared Person ReID

Visible-infrared person ReID is a challenging cross-modality image retrieval problem. Many works have been proposed to alleviate the large cross-modality gap for VI-ReID, which can be broadly categorized into two classes: image-level alignment and feature-level alignment. The image-level alignment methods [8, 35, 34] try to generate cross-modal images to excavate modality-invariant information. Moreover, several methods [20, 58, 39] introduce an auxiliary modality to assist the cross-modality retrieval task. The feature-level alignment methods [52, 25, 56, 48, 51, 32, 41] mainly map cross-modal features into a shared feature space to reduce cross-modal differences. For example, SGIEL [11] separates shape-related features from shape-erasure features through orthogonal decomposition to improve the diversity and identification of the learned representations for VI-ReID. However, the above methods heavily rely on large-scale cross-modality data annotation, which is quite expensive and time-consuming.

Figure 2: The pipeline of our framework. Different colors indicate different persons; $\bigcirc$ and $\bigtriangleup$ indicate visible and infrared features, respectively. It contains the Cross-Modality Clustering module (baseline, described in Sec. 3.2) and two key novel components: Multi-Memory Learning and Matching (MMLM, described in Sec. 3.3) and Soft Cluster-level Alignment (SCA, described in Sec. 3.4). The framework is proposed to effectively establish reliable cross-modality correspondences.

2.2 Unsupervised Single-Modality Person ReID

Existing unsupervised single-modality person ReID (USL-ReID) methods can be roughly categorized into domain translation-based methods and clustering-based methods. The domain translation-based methods [12, 13, 14, 31, 54, 22, 37] try to transfer the knowledge from the labeled source domain to the unlabeled target domain for USL-ReID. Compared with the former, the clustering-based methods [24, 23, 33, 43, 55, 7, 3], which are trained directly on the unlabeled target domain, are more challenging. The common idea of clustering-based methods is to use clustering algorithms [10] to generate pseudo-labels for training a ReID model. Pseudo-labels inevitably contain noise, so it is challenging to assign the correct label to each unlabeled image. Recently, Cluster-Contrast [9] performs contrastive learning at the cluster level, thus achieving the consistency of the feature dictionary. Although the above methods perform well on USL-ReID, they are not suitable for USL-VI-ReID due to the large cross-modality gap.

2.3 Unsupervised Visible-Infrared Person ReID

The challenge of unsupervised VI-ReID (USL-VI-ReID) is establishing reliable cross-modality correspondence. H2H [21] and OTLA [38] use a well-annotated labeled source domain for pre-training to solve USL-VI-ReID. Inspired by Cluster-Contrast [9] for USL-ReID, some clustering-based methods [44, 6, 16, 42, 47] have been proposed for USL-VI-ReID; they try to establish cross-modality correspondence by clustering pseudo-labels. Recently, it has been shown that large-scale vision-language pre-training models, e.g., CLIP [29], naturally excel in producing textual descriptions for images. To this end, CCLNet [5] leverages the text information from CLIP to improve the USL-VI-ReID task. However, none of the above methods evaluates the reliability of cross-modality correspondence; indeed, their cross-modality correspondence is not reliable. Our method aims to investigate how to establish more reliable cross-modality correspondence for USL-VI-ReID.

3 Methodology

The framework of our MMM is illustrated in Fig. 2. We begin by employing the Cross-Modality Clustering (CMC) module to generate pseudo-labels. Building upon CMC, we propose a novel Multi-Memory Learning and Matching (MMLM) module to effectively establish cross-modality correspondences. Finally, we propose the Soft Cluster-level Alignment (SCA) module to narrow the gap between two modalities while mitigating the impact of noisy pseudo-labels through two soft cluster-level intra- and inter-modality alignment losses.

3.1 Notation Definition

Suppose we have a USL-VI-ReID dataset denoted as $D=\{V,R\}$. Here, $V=\{v_{i}\}_{i=1}^{N}$ represents the visible images with $N$ samples, and $R=\{r_{i}\}_{i=1}^{M}$ denotes the infrared images with $M$ samples. We initialize their pseudo-labels as $Y^{t}$, where $t\in\{v,r\}$. Let $N_{p}$ and $M_{p}$ represent the number of visible and infrared samples with ID $p$, where $p\in\{1,2,\ldots,P^{t}\}$ and $P^{t}$ is the total number of person identities for modality $t$.
The respective feature sets of these images are denoted as $F^{v}=\{f^{v}_{1},f^{v}_{2},\ldots,f^{v}_{N}\}$ for visible samples and $F^{r}=\{f^{r}_{1},f^{r}_{2},\ldots,f^{r}_{M}\}$ for infrared samples. Our goal is to develop a cross-modality person ReID model without utilizing any labels.

3.2 Cross-Modality Clustering

Most USL-VI-ReID methods typically use clustering algorithms to generate pseudo-labels. Following this paradigm, we employ the DBSCAN algorithm [10] to generate pseudo-labels for all images, as described:

$$Y^{t}=\mathrm{DBSCAN}(F^{t}). \tag{1}$$

Unlike previous methods, we not only cluster intra-modality samples ($t=v$ or $t=r$) but also cluster inter-modality samples ($t=[v,r]$) to indirectly build cross-modality correspondence.

At the beginning of every training iteration, we calculate and retain the memory for each cluster as follows:

$$\bm{C}_{V^{p}}=\frac{1}{N_{p}}\sum_{i=1}^{N_{p}}f(V_{i}^{p}), \tag{2}$$
$$\bm{C}_{R^{p}}=\frac{1}{M_{p}}\sum_{i=1}^{M_{p}}f(R_{i}^{p}), \tag{3}$$
$$\bm{C}_{VR^{p}}=\frac{1}{A_{p}}\sum_{i=1}^{A_{p}}f(VR_{i}^{p}), \tag{4}$$

where $f(\cdot)$ is a function designated for extracting features from images across diverse modalities. We use superscripts to denote the identity: $V^{p}$ and $R^{p}$ denote the sets of visible and infrared samples with ID $p$, respectively. $VR^{p}$ represents the combined set of both modalities, with $A_{p}$ samples of the same ID $p$. The index $p$ ranges from 1 to $P^{t}$.
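The memory computation in Eqs. (2)-(4) amounts to a per-cluster mean of features. The sketch below illustrates this in plain Python, assuming features are already extracted as lists of floats. The helper name `cluster_memories` is hypothetical, and skipping the label `-1` (a common convention for DBSCAN outliers) is an assumption rather than something the text specifies.

```python
from collections import defaultdict


def cluster_memories(features, pseudo_labels, noise_label=-1):
    """Return {cluster id: mean feature vector}, in the spirit of Eqs. (2)-(4)."""
    grouped = defaultdict(list)
    for feature, label in zip(features, pseudo_labels):
        if label == noise_label:
            continue  # assumption: un-clustered samples get no memory
        grouped[label].append(feature)

    memories = {}
    for label, members in grouped.items():
        dim = len(members[0])
        # Average each feature dimension over the members of the cluster.
        memories[label] = [sum(m[d] for m in members) / len(members) for d in range(dim)]
    return memories


# Toy features (hypothetical): three clustered samples and one outlier.
toy_features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [9.0, 9.0]]
toy_labels = [0, 0, 1, -1]
toy_memories = cluster_memories(toy_features, toy_labels)
```

The same helper covers the visible, infrared, and joint cases by passing the features and pseudo-labels of the corresponding sample set.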

Then, we optimize the feature extractor using ClusterNCE [9] loss, computed as:

$$L_{V}=-\log\frac{\exp\left(C_{V}^{+}\cdot F^{v}/\tau\right)}{\sum_{p=1}^{P^{v}}\exp\left(C_{V^{p}}\cdot F^{v}/\tau\right)}, \tag{5}$$
$$L_{R}=-\log\frac{\exp\left(C_{R}^{+}\cdot F^{r}/\tau\right)}{\sum_{p=1}^{P^{r}}\exp\left(C_{R^{p}}\cdot F^{r}/\tau\right)}, \tag{6}$$
$$L_{VR}=-\log\frac{\exp\left(C_{VR}^{+}\cdot[F^{v},F^{r}]/\tau\right)}{\sum_{p=1}^{P^{v,r}}\exp\left(C_{VR^{p}}\cdot[F^{v},F^{r}]/\tau\right)}, \tag{7}$$

where $C^{+}$ is the positive memory representation and $\tau$ is a temperature hyper-parameter.

The CMC loss is defined as:

$$L_{CMC}=L_{V}+L_{R}+L_{VR}. \tag{8}$$
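As a rough illustration of the form shared by Eqs. (5)-(7), the sketch below evaluates the loss for a single feature vector against a dictionary of cluster memories. The function name `cluster_nce` and the default `tau=0.05` are assumptions made for this example; the text does not state the temperature value here, and a real implementation would operate on batched tensors.

```python
import math


def cluster_nce(feature, memories, positive_label, tau=0.05):
    """-log softmax of the positive memory's similarity, for one feature vector."""
    # Dot-product similarity to every cluster memory, scaled by the temperature.
    logits = {
        label: sum(m * f for m, f in zip(memory, feature)) / tau
        for label, memory in memories.items()
    }
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits.values())
    log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
    return log_denominator - logits[positive_label]


# Toy memories (hypothetical): two clusters with orthogonal unit memories.
toy_memories_nce = {0: [1.0, 0.0], 1: [0.0, 1.0]}
loss_correct = cluster_nce([1.0, 0.0], toy_memories_nce, positive_label=0, tau=1.0)
loss_wrong = cluster_nce([1.0, 0.0], toy_memories_nce, positive_label=1, tau=1.0)
```

Under these assumptions, the CMC loss of Eq. (8) would be the sum of three such terms, computed with the visible, infrared, and joint memories respectively.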

3.3 Multi-Memory Learning and Matching

The CMC optimizes the feature extractor using a single memory, but a single memory may not fully capture individual nuances, such as perspective and attire. Moreover, the CMC does not directly establish relations between the two modalities, thereby limiting its effectiveness in cases with significant modality discrepancies. To more effectively capture intra-identity variations and bridge the gap between the visible and infrared modalities, we propose the Multi-Memory Learning and Matching (MMLM) module, which mines a holistic representation and establishes reliable cross-modality correspondences. Specifically, we further subdivide single memory into multi-memory for a single identity, which can be formulated as K-means [26]:

$$F_{C_{V^{p}_{i}}}=\underset{n}{\arg\min}\left\{\left\|f(v_{j})-C_{V^{p}}\right\|_{2}^{2},\ \forall v_{j}\in V^{p}\right\}, \tag{9}$$
$$F_{C_{R^{p}_{i}}}=\underset{n}{\arg\min}\left\{\left\|f(r_{j})-C_{R^{p}}\right\|_{2}^{2},\ \forall r_{j}\in R^{p}\right\}, \tag{10}$$

Here, $F_{C_{V^{p}_{i}}}$ and $F_{C_{R^{p}_{i}}}$ represent the $i$-th visible and infrared feature sets of ID $p$, respectively. The index $i$ ranges from 1 to $n$, where $n$ is the number of memories for a single identity. Each memory is then computed as the mean of its feature set:

$$K_{C_{V^{p}_{i}}}=\frac{1}{|F_{C_{V^{p}_{i}}}|}\sum_{f^{v}\in F_{C_{V^{p}_{i}}}}f^{v}, \tag{11}$$
$$K_{C_{R^{p}_{i}}}=\frac{1}{|F_{C_{R^{p}_{i}}}|}\sum_{f^{r}\in F_{C_{R^{p}_{i}}}}f^{r}, \tag{12}$$

where $K_{C_{V^{p}}}$ and $K_{C_{R^{p}}}$ represent the visible and infrared multi-memory of ID $p$, respectively.
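The subdivision of one identity's features into $n$ memories can be sketched with a small K-means loop in plain Python. Everything named here (`multi_memory`, the farthest-point initialization, the iteration cap) is an illustrative assumption; the text only states that the subdivision is formulated as K-means and that each memory is the mean of its feature subset.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def multi_memory(features, n_memories, n_iters=20):
    """Split one identity's features into n_memories groups and return the group means."""
    if len(features) < n_memories:
        raise ValueError("need at least n_memories features")
    # Deterministic farthest-point initialization (an illustrative choice).
    centers = [list(features[0])]
    while len(centers) < n_memories:
        farthest = max(features, key=lambda f: min(squared_distance(f, c) for c in centers))
        centers.append(list(farthest))

    for _ in range(n_iters):
        groups = [[] for _ in centers]
        for feature in features:
            nearest = min(range(len(centers)), key=lambda i: squared_distance(feature, centers[i]))
            groups[nearest].append(feature)
        # Each memory is the mean of its subset; an empty group keeps its old center.
        updated = [mean_vector(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers


# Toy identity (hypothetical): two well-separated "views" of the same person.
toy_identity = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
toy_multi = multi_memory(toy_identity, n_memories=2)
```

Running this once per identity and per modality would yield the visible and infrared multi-memories that the matching step compares.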

By employing the multi-memory learning strategy, we achieve more diverse memories for a single identity. However, these memories still exhibit a strong implicit correlation with the modality, which negatively impacts the establishment of cross-modality correspondences. Inspired by PGM [42], we transform the cross-modality multi-memory matching problem into a weighted bipartite graph matching problem. The goal is to match each visible cluster with the infrared cluster of the corresponding identity while minimizing the cost, which is formulated as follows:

$$\begin{array}{c}
\underset{Q}{\min}\ C^{T}Q\\
\text{s.t. } \forall p\in[P^{v}],\ \forall p^{\prime}\in[P^{r}]:\ Q_{p}^{p^{\prime}}\in\{0,1\},\\
\forall p\in[P^{v}]:\ \sum_{p^{\prime}\in[P^{r}]}Q_{p}^{p^{\prime}}\leq 1,\\
\forall p^{\prime}\in[P^{r}]:\ \sum_{p\in[P^{v}]}Q_{p}^{p^{\prime}}=1,
\end{array} \tag{13}$$

where $Q=\{Q_{p}^{p^{\prime}}\}\in\mathbb{R}^{P^{v}\times P^{r}\times 1}$ indicates whether $K_{C_{V^{p}}}$ and $K_{C_{R^{p^{\prime}}}}$ belong to the same person ($Q_{p}^{p^{\prime}}=1$) or not ($Q_{p}^{p^{\prime}}=0$). $C$ and $[P^{t}]$ denote the cost matrix and the index set $\{1,\dots,P^{t}\}$, respectively. We design a simple yet effective cost expression for cross-modality multi-memory matching as follows:

$$C\left(K_{C_{V^p}}, K_{C_{R^{p'}}}\right)=\sum_{i=1}^{n}\min_{j\in\{1,\cdots,n\}}\left\|K_{V^p_i}-K_{R^{p'}_j}\right\|_2. \tag{14}$$
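As a rough illustration of Eq. (14), the sketch below computes the cost between one visible cluster and one infrared cluster, each holding $n$ memories. It reads the norm in Eq. (14) as the Euclidean distance between two memories, which is an assumption, and the helper names (`l2`, `multi_memory_cost`, `cost_matrix`) and the toy memories are hypothetical. Plain Python lists are used to keep the example dependency-free; the assignment step of Eq. (13) is not shown.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def multi_memory_cost(vis_memories, ir_memories):
    """Eq. (14): each visible memory contributes the distance to its
    nearest infrared memory; the contributions are summed."""
    return sum(min(l2(kv, kr) for kr in ir_memories) for kv in vis_memories)

def cost_matrix(vis_clusters, ir_clusters):
    """Cost between every visible cluster (rows) and infrared cluster (columns)."""
    return [[multi_memory_cost(v, r) for r in ir_clusters] for v in vis_clusters]

# Toy data (hypothetical): two clusters per modality, n = 2 memories each.
vis = [
    [[0.0, 0.0], [1.0, 0.0]],  # visible cluster 0
    [[5.0, 5.0], [6.0, 5.0]],  # visible cluster 1
]
ir = [
    [[5.0, 5.5], [6.0, 5.5]],  # infrared cluster 0
    [[0.0, 0.5], [1.0, 0.5]],  # infrared cluster 1
]
C = cost_matrix(vis, ir)
```

In this toy case the cheapest pairing is visible cluster 0 with infrared cluster 1 and vice versa, which is the kind of correspondence the matching step is meant to recover.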

Finally, we transfer the infrared pseudo-labels to the visible pseudo-labels, which can be written as:

$$Y^v := Q\,Y^r. \tag{15}$$
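A minimal sketch of the transfer in Eq. (15), assuming that $Q$ is a binary matching matrix with one row per visible cluster and that $Y^r$ can be represented as a vector holding one pseudo-label per infrared cluster. Both the representation and the name `transfer_labels` are illustrative assumptions, not the authors' implementation.

```python
def transfer_labels(Q, ir_labels):
    """Y^v := Q Y^r for a binary matching matrix Q.

    Row p of Q selects the infrared cluster matched to visible cluster p,
    so the product copies that infrared label to the visible cluster.
    """
    return [sum(q * y for q, y in zip(row, ir_labels)) for row in Q]

# Hypothetical matching: visible 0 <-> infrared 1, visible 1 <-> infrared 0,
# visible 2 <-> infrared 2.
Q = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
ir_labels = [10, 20, 30]
vis_labels = transfer_labels(Q, ir_labels)
```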

3.4 Soft Cluster-level Alignment

Pseudo-labels inherently contain noise, a problem from which even human annotations are not exempt [48], and this noise degrades performance. Prior work [2] showed that deep neural networks first learn from simple samples before fitting noisy labels. Building on this insight, we estimate the confidence of each label. To do so, we fit a two-component Gaussian Mixture Model (GMM) to the loss distribution:

$$L_{ID}^{v}=-\log p\left(Y^{v}\mid C\left(F^{v}\right)\right), \tag{16}$$
$$p\left(L_{ID}^{v}\mid\theta\right)=\sum_{k=1}^{2}\pi_k\,\phi\left(L_{ID}^{v}\mid k\right), \tag{17}$$

where $C(\cdot)$ acts as an identity classifier, $\pi_k$ represents the mixture coefficient, and $\phi(L^{v}_{ID}\mid k)$ denotes the probability density of the $k$-th component.

Subsequently, the confidence is determined by computing its posterior probability, detailed as:

$$W^{v}=p\left(k\mid L^{v}_{ID}\right), \tag{18}$$

where $k$ refers to the Gaussian component with the smaller mean, and $p(k\mid L^{v}_{ID})$ is the responsibility of the $k$-th component for $L^{v}_{ID}$, i.e., the posterior probability that the loss was generated by the low-loss (clean) component. In the same way, we can obtain the confidences $W^{r}$ and $W^{vr}$.
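The confidence estimation of Eqs. (16)-(18) can be sketched as follows. This is a minimal one-dimensional EM fit written in plain Python, not the authors' implementation; the function names, the initialization, the iteration count, and the toy losses are all assumptions made for illustration. A library GMM would normally replace the hand-written EM loop.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def posteriors(x, weights, means, variances):
    """Posterior p(k | x) for each of the two mixture components."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    if total == 0.0:  # x is far from both components; fall back to the prior
        return list(weights)
    return [j / total for j in joint]

def fit_gmm(losses, iters=50, min_var=1e-6):
    """Fit a two-component 1-D GMM with EM (Eq. 17)."""
    lo, hi = min(losses), max(losses)
    weights = [0.5, 0.5]
    means = [lo, hi]  # start one component at each end of the loss range
    init_var = max(((hi - lo) / 2.0) ** 2, min_var)
    variances = [init_var, init_var]
    for _ in range(iters):
        # E-step: responsibilities of both components for every loss value.
        resp = [posteriors(x, weights, means, variances) for x in losses]
        # M-step: re-estimate mixture coefficient, mean and variance.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # component collapsed; keep its previous parameters
            weights[k] = nk / len(losses)
            means[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, losses)) / nk
            variances[k] = max(var, min_var)
    return weights, means, variances

def label_confidence(losses):
    """Eq. (18): posterior of the smaller-mean component for each loss."""
    weights, means, variances = fit_gmm(losses)
    clean = 0 if means[0] <= means[1] else 1
    return [posteriors(x, weights, means, variances)[clean] for x in losses]

# Toy per-sample identity losses: four small (likely clean), three large (likely noisy).
losses = [0.10, 0.12, 0.09, 0.11, 2.00, 2.10, 1.90]
confidence = label_confidence(losses)
```

Samples with small loss receive a confidence close to 1 and samples with large loss a confidence close to 0, which is what allows the later memory update to down-weight likely-noisy pseudo-labels.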

To penalize the noise during optimization, the memories in Eq. (2), (3), (4) are updated by:

$$C_{V^p}:=\frac{1}{N_p}\sum_{i=1}^{N_p} f\left(V_i^p\right) W_{V^p_i}, \tag{19}$$
$$C_{R^p}:=\frac{1}{M_p}\sum_{i=1}^{M_p} f\left(R_i^p\right) W_{R^p_i}, \tag{20}$$
$$C_{VR^p}:=\frac{1}{A_p}\sum_{i=1}^{A_p} f\left(VR_i^p\right) W_{VR^p_i}, \tag{21}$$

where $W_{V^p_i}$, $W_{R^p_i}$, and $W_{VR^p_i}$ denote the confidences of samples $V^p_i$, $R^p_i$, and $VR^p_i$, respectively.
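To make the update in Eq. (19) concrete, the sketch below averages confidence-scaled features of one cluster. The function name and toy values are hypothetical; note that, following the equation as written, the sum is divided by the cluster size $N_p$ rather than by the sum of the confidences.

```python
def weighted_memory(features, confidences):
    """Eq. (19): mean over the cluster of features scaled by their confidence."""
    n = len(features)
    dim = len(features[0])
    return [
        sum(w * f[d] for f, w in zip(features, confidences)) / n
        for d in range(dim)
    ]

# Toy cluster: two confident samples and one outlier with zero confidence.
features = [[1.0, 0.0], [1.0, 0.2], [-3.0, 4.0]]
confidences = [1.0, 1.0, 0.0]
memory = weighted_memory(features, confidences)
```

The outlier does not pull the memory toward itself, since its feature is multiplied by a confidence of zero before averaging.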

To reduce the intra-modality discrepancy, we employ the distilled $C_{V^p}$ and $C_{R^p}$ to align every sample of ID $p$ to its corresponding memory in each modality. The cluster-level intra-modality alignment loss $L_{Intra}$ is defined as:

$$\begin{aligned}L_{Intra}&=L_{Intra}^{V}+L_{Intra}^{R}\\&=\sum_{p=1}^{P^v}\sum_{f^v\in F_p^v}\left\|f^v-C_{V^p}\right\|_2^2+\sum_{p=1}^{P^r}\sum_{f^r\in F_p^r}\left\|f^r-C_{R^p}\right\|_2^2,\end{aligned} \tag{22}$$

where $F^v_p$ and $F^r_p$ denote the visible and infrared feature sets of ID $p$, respectively.
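One modality's term of Eq. (22) can be sketched as below: every feature is pulled toward the memory of its own identity, and the squared distances are summed. The dictionary layout and the name `intra_loss` are assumptions for illustration; in training this would be a differentiable tensor operation.

```python
def intra_loss(features_by_id, memory_by_id):
    """Sum of squared L2 distances between each feature and its identity's memory."""
    total = 0.0
    for pid, feats in features_by_id.items():
        mem = memory_by_id[pid]
        for f in feats:
            total += sum((x - m) ** 2 for x, m in zip(f, mem))
    return total

# Toy visible-modality data: identity 0 has two samples, identity 1 has one.
features_by_id = {0: [[1.0, 0.0], [0.0, 1.0]], 1: [[2.0, 2.0]]}
memory_by_id = {0: [0.5, 0.5], 1: [2.0, 2.0]}
loss_v = intra_loss(features_by_id, memory_by_id)
```

The full $L_{Intra}$ is the sum of this quantity over the visible and infrared modalities.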

Since VI-ReID is a many-to-many matching problem, we propose a cluster-level inter-modality alignment loss, which forces the feature distribution of the visible samples to be similar to that of the infrared samples, and vice versa:

$$\begin{aligned}L_{Inter}&=L_{Inter}^{V}+L_{Inter}^{R}\\&=\frac{1}{P}\sum_{p=1}^{P}\left(\frac{1}{2}D\left(F^v_p,sg\left(F^r_p\right)\right)+\frac{1}{2}D\left(F^r_p,sg\left(F^v_p\right)\right)\right),\end{aligned} \tag{23}$$

where $sg(\cdot)$ represents the stop-gradient operation, $D(i,j)$ represents the distance between distributions $i$ and $j$, and $P=\min(P^v,P^r)$. In this paper, we employ the squared Maximum Mean Discrepancy (MMD$^2$) [15] to quantify the discrepancy between distributions. MMD$^2$ is a commonly used non-parametric metric in domain adaptation and has been observed in empirical studies to outperform other metrics such as the KL divergence. MMD$^2$ is computed as:

$$\begin{aligned}\text{MMD}^2\left(F^r_p,F^v_p\right)&=\frac{1}{|F^r_p|^2}\sum_{f^r_i\in F^r_p}\sum_{f^r_j\in F^r_p}z\left(f^r_i,f^r_j\right)\\&+\frac{1}{|F^v_p|^2}\sum_{f^v_i\in F^v_p}\sum_{f^v_j\in F^v_p}z\left(f^v_i,f^v_j\right)\\&-\frac{2}{|F^r_p||F^v_p|}\sum_{f^r_i\in F^r_p}\sum_{f^v_j\in F^v_p}z\left(f^r_i,f^v_j\right),\end{aligned} \tag{24}$$

where $z(s,s')=\exp\left(\frac{-\left\|s-s'\right\|_2^2}{2\sigma^2}\right)$ is a Gaussian kernel.
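The estimator in Eq. (24) can be sketched in a few lines. The version below is the biased estimator (it includes the $i=j$ kernel terms, as the equation does) with a single Gaussian kernel; the bandwidth `sigma=1.0`, the function names, and the toy feature sets are assumptions. The stop-gradient of Eq. (23) has no counterpart in plain Python and is therefore not modeled here.

```python
import math

def gaussian_kernel(a, b, sigma=1.0):
    """z(s, s') = exp(-||s - s'||^2 / (2 sigma^2))."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mean_kernel(xs, ys, sigma=1.0):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd2(ir_feats, vis_feats, sigma=1.0):
    """Eq. (24): within-infrared + within-visible - 2 * cross term."""
    return (
        mean_kernel(ir_feats, ir_feats, sigma)
        + mean_kernel(vis_feats, vis_feats, sigma)
        - 2.0 * mean_kernel(ir_feats, vis_feats, sigma)
    )

# Toy feature sets of one identity in the two modalities.
ir_feats = [[0.0, 0.0], [0.2, 0.1]]
vis_feats = [[2.0, 2.0], [2.1, 1.9], [1.9, 2.2]]
gap_far = mmd2(ir_feats, vis_feats)
gap_same = mmd2(ir_feats, ir_feats)
```

The value is zero when both sets coincide and grows as the two feature distributions drift apart, which is the behavior the inter-modality alignment loss penalizes.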

The SCA loss is defined as:

$$L_{SCA}=\lambda_{Intra}L_{Intra}+\lambda_{Inter}L_{Inter}, \tag{25}$$

where $\lambda_{Intra}$ and $\lambda_{Inter}$ are the balancing weights.

Overall Loss. The total loss for training the model is defined by the following equation:

$$L_{overall}=L_{CMC}+L_{SCA}. \tag{26}$$
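Table 1: Comparisons with state-of-the-art methods on SYSU-MM01 and RegDB, i.e., supervised visible-infrared person ReID (SVI-ReID), semi-supervised visible-infrared person ReID (SSVI-ReID) and unsupervised visible-infrared person ReID (USL-VI-ReID). All methods are measured by Rank-1 (%) and mAP (%). GUR* denotes the results without camera information.

| Type | Method | Venue | SYSU-MM01 All Search Rank-1 | SYSU-MM01 All Search mAP | SYSU-MM01 Indoor Search Rank-1 | SYSU-MM01 Indoor Search mAP | RegDB Visible2Thermal Rank-1 | RegDB Visible2Thermal mAP | RegDB Thermal2Visible Rank-1 | RegDB Thermal2Visible mAP |
|---|---|---|---|---|---|---|---|---|---|---|
| SVI-ReID | JSIA-ReID [36] | AAAI’20 | 38.1 | 36.9 | 43.8 | 52.9 | 48.5 | 49.3 | 48.1 | 48.9 |
| SVI-ReID | DDAG [51] | ECCV’20 | 54.8 | 53.0 | 61.0 | 68.0 | 69.4 | 63.5 | 68.1 | 61.8 |
| SVI-ReID | AGW [53] | TPAMI’21 | 47.5 | 47.7 | 54.2 | 63.0 | 70.1 | 66.4 | 70.5 | 65.9 |
| SVI-ReID | NFS [4] | CVPR’21 | 56.9 | 55.5 | 62.8 | 69.8 | 80.5 | 72.1 | 78.0 | 69.8 |
| SVI-ReID | LbA [28] | ICCV’21 | 55.4 | 54.1 | 58.5 | 66.3 | 74.2 | 67.6 | 72.4 | 65.5 |
| SVI-ReID | CAJ [52] | ICCV’21 | 69.9 | 66.9 | 76.3 | 80.4 | 85.0 | 79.1 | 84.8 | 77.8 |
| SVI-ReID | MPANet [41] | CVPR’21 | 70.6 | 68.2 | 76.7 | 81.0 | 83.7 | 80.9 | 82.8 | 80.7 |
| SVI-ReID | DART [48] | CVPR’22 | 68.7 | 66.3 | 72.5 | 78.2 | 83.6 | 75.7 | 82.0 | 73.8 |
| SVI-ReID | FMCNet [56] | CVPR’22 | 66.3 | 62.5 | 68.2 | 74.1 | 89.1 | 84.4 | 88.4 | 83.9 |
| SVI-ReID | MAUM [25] | CVPR’22 | 71.7 | 68.8 | 77.0 | 81.9 | 87.9 | 85.1 | 87.0 | 84.3 |
| SVI-ReID | MID [17] | AAAI’22 | 60.3 | 59.4 | 64.9 | 70.1 | 87.5 | 84.9 | 84.3 | 81.4 |
| SVI-ReID | LUPI [1] | ECCV’22 | 71.1 | 67.6 | 82.4 | 82.7 | 88.0 | 82.7 | 86.8 | 81.3 |
| SVI-ReID | DEEN [57] | CVPR’23 | 74.7 | 71.8 | 80.3 | 83.3 | 91.1 | 85.1 | 89.5 | 83.4 |
| SVI-ReID | SGIEL [11] | CVPR’23 | 77.1 | 72.3 | 82.1 | 83.0 | 92.2 | 86.6 | 91.1 | 85.2 |
| SVI-ReID | PartMix [19] | CVPR’23 | 77.8 | 74.6 | 81.5 | 84.4 | 85.7 | 82.3 | 84.9 | 82.5 |
| SSVI-ReID | MAUM-50 [25] | CVPR’22 | 28.8 | 36.1 | - | - | - | - | - | - |
| SSVI-ReID | MAUM-100 [25] | CVPR’22 | 38.5 | 39.2 | - | - | - | - | - | - |
| SSVI-ReID | OTLA [38] | ECCV’22 | 48.2 | 43.9 | 47.4 | 56.8 | 49.9 | 41.8 | 49.6 | 42.8 |
| SSVI-ReID | TAA [46] | TIP’23 | 48.8 | 42.3 | 50.1 | 56.0 | 62.2 | 56.0 | 63.8 | 56.5 |
| SSVI-ReID | DPIS [30] | ICCV’23 | 58.4 | 55.6 | 63.0 | 70.0 | 62.3 | 53.2 | 61.5 | 52.7 |
| USL-VI-ReID | OTLA [38] | ECCV’22 | 29.9 | 27.1 | 29.8 | 38.8 | 32.9 | 29.7 | 32.1 | 28.6 |
| USL-VI-ReID | ADCA [44] | MM’22 | 45.5 | 42.7 | 50.6 | 59.1 | 67.2 | 64.1 | 68.5 | 63.8 |
| USL-VI-ReID | NGLR [6] | arXiv’23 | 50.4 | 47.4 | 53.5 | 61.7 | 85.6 | 76.7 | 82.9 | 75.0 |
| USL-VI-ReID | MBCCM [16] | arXiv’23 | 53.1 | 48.2 | 55.2 | 62.0 | 83.8 | 77.9 | 82.8 | 76.7 |
| USL-VI-ReID | CCLNet [5] | MM’23 | 54.0 | 50.2 | 56.7 | 65.1 | 69.9 | 65.5 | 70.2 | 66.7 |
| USL-VI-ReID | GUR* [47] | ICCV’23 | 61.0 | 57.0 | 64.2 | 69.5 | 73.9 | 70.2 | 75.0 | 69.9 |
| USL-VI-ReID | PGM [42] | CVPR’23 | 57.3 | 51.8 | 56.2 | 62.7 | 69.5 | 65.4 | 69.9 | 65.2 |
| USL-VI-ReID | MMM (Ours) | - | 61.6 | 57.9 | 64.4 | 70.4 | 89.7 | 80.5 | 85.8 | 77.0 |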

4 Experiments

In this section, we conduct comprehensive experiments to verify the effectiveness of our MMM. First, we compare our MMM with several state-of-the-art methods under three settings, i.e., supervised visible-infrared person ReID (SVI-ReID), semi-supervised visible-infrared person ReID (SSVI-ReID) and unsupervised visible-infrared person ReID (USL-VI-ReID). After that, we perform ablation studies to evaluate the effectiveness of each module in our MMM. Finally, we analyze the hyper-parameters and visualizations. Unless otherwise specified, the analysis experiments are conducted on SYSU-MM01 in the single-shot all-search mode.

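Table 2: Ablation studies on SYSU-MM01 in all search mode and indoor search mode. “Baseline” means the model trained only with the CMC module. Rank-R accuracy (%) and mAP (%) are reported.

| Order | Baseline | MMLM | Intra | Inter | All Search Rank-1 | All Search Rank-5 | All Search Rank-10 | All Search Rank-20 | All Search mAP | Indoor Search Rank-1 | Indoor Search Rank-5 | Indoor Search Rank-10 | Indoor Search Rank-20 | Indoor Search mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | | | | 51.74 | 78.67 | 87.87 | 94.76 | 49.81 | 56.34 | 84.66 | 92.77 | 96.98 | 64.46 |
| 2 | | | | | 55.15 | 81.65 | 90.53 | 96.46 | 52.21 | 58.76 | 85.21 | 93.06 | 97.16 | 65.47 |
| 3 | | | | | 58.48 | 83.69 | 91.79 | 97.15 | 55.05 | 62.19 | 86.95 | 93.60 | 97.64 | 68.09 |
| 4 | | | | | 57.26 | 82.34 | 90.84 | 96.93 | 53.81 | 60.26 | 85.77 | 93.16 | 97.36 | 66.66 |
| 5 | | | | | 61.56 | 85.66 | 93.33 | 98.03 | 57.92 | 64.37 | 88.80 | 95.01 | 98.20 | 70.40 |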

4.1 Experimental Setting

Dataset. We evaluate our MMM on two benchmarks, i.e., SYSU-MM01 [40] and RegDB [27]. More detailed explanations are presented in supplementary materials.

Evaluation Protocols. Cumulative Matching Characteristics [50] and mean Average Precision (mAP) are adopted as the evaluation metrics on both datasets to quantitatively evaluate the performance of our MMM. For fair comparison, we report the results in the all-search and indoor-search modes on SYSU-MM01 using the official code. Following [52], we also report the results on RegDB in the visible-to-thermal and thermal-to-visible modes, obtained by randomly splitting the training and testing sets 10 times.
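For readers unfamiliar with the two metrics, the following simplified sketch shows how a Rank-k hit and the average precision of a single query are computed from a gallery ranking. It is not the official evaluation code: dataset-specific rules (for example, gallery construction per search mode) are not reproduced, and the function names and the toy ranking are hypothetical. Rank-k accuracy and mAP are then the averages of these per-query values over all queries.

```python
def rank_k_hit(ranked_ids, query_id, k):
    """True if a gallery image of the query identity appears in the top k."""
    return query_id in ranked_ids[:k]

def average_precision(ranked_ids, query_id):
    """Mean of the precision values measured at each correct match."""
    hits = 0
    precisions = []
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Toy ranking: gallery identities sorted by increasing distance to the query.
ranked_ids = [7, 3, 3, 5, 3]
query_id = 3
hit_at_1 = rank_k_hit(ranked_ids, query_id, 1)
hit_at_5 = rank_k_hit(ranked_ids, query_id, 5)
ap = average_precision(ranked_ids, query_id)
```

Here the first correct match is at rank 2, so the query counts as a miss for Rank-1 and a hit for Rank-5, while its average precision is the mean of 1/2, 2/3 and 3/5.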

4.2 Implementation Details

We adopt ResNet50, initialized with ImageNet pre-trained weights, as the shared backbone to extract 2048-d features. Our MMM is implemented in PyTorch. During training, random horizontal flipping and random cropping are used for data augmentation [52]. The total number of training epochs is 80. Additional settings are presented in the supplementary materials.

4.3 Results and Analysis

To clearly demonstrate the effectiveness of our MMM, we compare our MMM with several state-of-the-art methods under three settings, i.e., SVI-ReID, SSVI-ReID, and USL-VI-ReID. The quantitative results on SYSU-MM01 and RegDB are shown in Tab. 1.

Comparison with SVI-ReID Methods. Surprisingly, our MMM performs better than several supervised methods on SYSU-MM01, including JSIA-ReID [36], DDAG [51], AGW [53], NFS [4], and LbA [28]. The results show the effectiveness of our MMM. However, we have to acknowledge that there is still a certain gap between our MMM and many SVI-ReID methods due to the absence of cross-modality data annotations.

Figure 3: The effect of the hyper-parameters $n$, $\lambda_{Intra}$, and $\lambda_{Inter}$ with different values on SYSU-MM01.

Figure 4: The intra-identity and inter-identity distances on SYSU-MM01, where $\delta_i$ denotes the gap between the intra-identity distance mean and the inter-identity distance mean.

Figure 5: Visualization of the pseudo-labels of the same identity in different modalities.

Comparison with SSVI-ReID Methods. We compare our MMM with five state-of-the-art SSVI-ReID methods. Notably, our MMM achieves superior performance without using any annotations, surpassing all SSVI-ReID methods, which rely on limited annotations.

Comparison with USL-VI-ReID Methods. Compared with seven state-of-the-art USL-VI-ReID methods, our MMM achieves the best result on every metric in Tab. 1 and outperforms most of them by a significant margin, which demonstrates its effectiveness. As shown in Tab. 1, our MMM achieves 61.6% in Rank-1 and 57.9% in mAP on SYSU-MM01 in the all-search mode. The results on RegDB are even more striking: our MMM improves Rank-1 and mAP by a large margin of 15.8% and 10.3%, respectively, over GUR* in the visible-to-thermal mode. These results even surpass most SVI-ReID methods.

The above results clearly show that our MMM is effective, which highlights the significant potential of our MMM in addressing USL-VI-ReID challenges.

4.4 Ablation Study

To further analyze the effectiveness of the Multi-Memory Learning and Matching (MMLM) and Soft Cluster-level Alignment (SCA) modules, we conduct ablation studies on SYSU-MM01 under both all-search and indoor-search modes. The results are reported in Tab. 2.

Baseline. Order 1 denotes that the model is trained only with the CMC module. Although it achieves a promising performance on SYSU-MM01, it does not directly establish relations between the two modalities.

Effectiveness of MMLM. The effectiveness of the MMLM module is revealed by comparing Order 1 and Order 2: MMLM improves Rank-1 by 3.41% and mAP by 2.40% on SYSU-MM01 in the all-search mode. The results, combined with Fig. 1, demonstrate that the MMLM can help align visible and infrared pseudo-labels to establish cross-modality correspondences.

Effectiveness of Intra in SCA. As shown in Order 3 of Tab. 2, the performance improves to 58.48% in Rank-1 and 55.05% in mAP when the cluster-level intra-modality alignment loss (Intra) in SCA is added, which shows the effectiveness of Intra in reducing the intra-modality discrepancy.

Effectiveness of Inter in SCA. The cluster-level inter-modality alignment loss (Inter) is proposed to reduce the inter-modality discrepancy; our MMM reaches 57.26% in Rank-1 and 53.81% in mAP when it is added. Moreover, when Inter is combined with Intra, our MMM achieves the best performance, 61.56% in Rank-1 and 57.92% in mAP, surpassing the baseline by a large margin of 9.82% in Rank-1 and 8.11% in mAP.
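The gains quoted in this subsection can be re-derived from the all-search Rank-1 and mAP columns of Tab. 2. The short check below only performs the subtractions; the dictionary layout and the name `gain_over` are illustrative, and the numbers are copied from the table.

```python
# All-search (Rank-1, mAP) per ablation order, copied from Tab. 2.
results = {
    1: (51.74, 49.81),
    2: (55.15, 52.21),
    3: (58.48, 55.05),
    4: (57.26, 53.81),
    5: (61.56, 57.92),
}

def gain_over(order, reference=1):
    """Absolute improvement in (Rank-1, mAP) of one order over a reference order."""
    r1, m_ap = results[order]
    ref_r1, ref_map = results[reference]
    return round(r1 - ref_r1, 2), round(m_ap - ref_map, 2)

mmlm_gain = gain_over(2)   # effect of adding MMLM to the baseline
full_gain = gain_over(5)   # full model versus the baseline
```

The differences are absolute differences between percentages, i.e., percentage points.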

4.5 Analysis of Hyper-parameters

In Fig. 3, we analyze the key hyper-parameters of our MMM on SYSU-MM01, i.e., the number of memories $n$, $\lambda_{Intra}$, and $\lambda_{Inter}$. In Fig. 3 (a), we vary the number of memories from 1 to 5 while keeping $\lambda_{Intra}$ and $\lambda_{Inter}$ fixed; our MMM achieves the best performance, 61.56% in Rank-1 and 57.92% in mAP, when $n=4$. Moreover, to balance the contributions of the cluster-level intra- and inter-modality alignment losses in SCA, we study the effect of $\lambda_{Intra}$ and $\lambda_{Inter}$ by fixing one and adjusting the other. Specifically, we fix $\lambda_{Inter}=0.05$ and tune $\lambda_{Intra}$ over $[0.1, 0.5, 1.0, 1.5, 2.0]$ (Fig. 3 (b)), and we fix $\lambda_{Intra}=0.5$ and vary $\lambda_{Inter}$ over $[0.01, 0.025, 0.05, 0.075, 0.1]$ (Fig. 3 (c)). We observe that our MMM achieves high accuracy across these combinations, which shows that its performance is not sensitive to $\lambda_{Intra}$ and $\lambda_{Inter}$; the best performance is achieved with $\lambda_{Intra}=0.5$ and $\lambda_{Inter}=0.05$.

4.6 Analysis of Visualization

To further illustrate the effectiveness of MMM, we visualize the intra-identity and inter-identity distances on SYSU-MM01 in Fig. 4. As shown in Fig. 4 (a)-(d), as the proposed modules are added, the mean intra-identity distance gradually decreases while the mean inter-identity distance gradually increases, so the intra-identity and inter-identity distance distributions are pushed further apart ($\delta_{1}<\delta_{2}<\delta_{3}<\delta_{4}$). The results show that our MMM effectively reduces the cross-modality distances between samples of the same identity and enlarges the distances between samples of different identities.
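As a rough illustration of the statistics behind such a plot, the sketch below computes the mean intra-identity and inter-identity distances over visible-infrared pairs. It rests on several assumptions that the text does not confirm: Euclidean distance, cross-modality pairs only, and a gap defined as the difference of the two means standing in for $\delta$. The function name and toy features are hypothetical.

```python
import math
from itertools import product

def mean_cross_modality_distances(visible, infrared):
    """visible / infrared: lists of (identity, feature_vector) pairs.
    Returns (mean intra-identity distance, mean inter-identity distance)
    over all visible-infrared pairs, using Euclidean distance."""
    intra, inter = [], []
    for (id_v, f_v), (id_i, f_i) in product(visible, infrared):
        d = math.dist(f_v, f_i)
        (intra if id_v == id_i else inter).append(d)
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Toy 2-D features for two identities (values made up for illustration).
visible = [(0, (0.0, 0.0)), (1, (3.0, 4.0))]
infrared = [(0, (0.0, 1.0)), (1, (3.0, 3.0))]

intra_mean, inter_mean = mean_cross_modality_distances(visible, infrared)
gap = inter_mean - intra_mean  # assumed stand-in for the margin delta
```

A larger gap means that the two distance distributions are better separated.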

Moreover, we visualize the pseudo-labels assigned to samples of the same identity in different modalities. We randomly choose 3 person identities, each consisting of 4 visible images and infrared images. As shown in Fig. 5, persons of the same identity in different modalities have the same pseudo-label in our MMM (right) compared with GUR (left), which shows that our MMM can establish more reliable cross-modality correspondences.

5 Conclusion

In this paper, we introduce the Adjusted Rand Index (ARI) as a metric to measure the reliability of cross-modality correspondences and clustered pseudo-labels, and we investigate how to establish reliable cross-modality correspondences for USL-VI-ReID. To this end, we propose a Multi-Memory Matching (MMM) framework. First, we design a Cross-Modality Clustering (CMC) module to generate pseudo-labels. Unlike previous methods, which represent each identity with a single memory, we employ multiple memories in the Multi-Memory Learning and Matching (MMLM) module to capture individual nuances and establish reliable cross-modality correspondences. Additionally, we present a Soft Cluster-level Alignment (SCA) module to reduce the cross-modality gap while mitigating the effect of noisy pseudo-labels. Comprehensive experimental results show that our MMM establishes reliable cross-modality correspondences and outperforms existing USL-VI-ReID methods on SYSU-MM01 and RegDB.


Supplementary Material

VI Overview

In this document, we first introduce the ARI as a metric for evaluating the reliability of cross-modality correspondences and pseudo-labels. Secondly, we present more detailed explanations of datasets. Finally, we supplement the implementation details.

VII ARI as a Metric for Evaluating Unsupervised Cross-Modal Re-Identification

To evaluate the reliability of cross-modality correspondences and pseudo-labels in unsupervised cross-modality re-identification, this work introduces the Adjusted Rand Index (ARI) metric. ARI measures the similarity between two data clusterings and is calculated as:

$$\text{ARI}=\frac{\sum_{ij}\binom{n_{ij}}{2}-\left[\sum_{i}\binom{a_{i}}{2}\sum_{j}\binom{b_{j}}{2}\right]/\binom{N}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_{i}}{2}+\sum_{j}\binom{b_{j}}{2}\right]-\left[\sum_{i}\binom{a_{i}}{2}\sum_{j}\binom{b_{j}}{2}\right]/\binom{N}{2}} \tag{XXVII}$$

where $n_{ij}$ is the number of samples shared by the $i$-th pseudo-label cluster and the $j$-th ground-truth cluster, $a_{i}$ is the number of samples in the $i$-th cluster of pseudo-labels, $b_{j}$ is the number of samples in the $j$-th cluster of ground-truth labels, and $N$ is the total number of samples in the dataset. The binomial coefficient $\binom{n}{k}$, representing the number of ways to choose $k$ elements from $n$ distinct elements, is calculated as:

$$\binom{n}{k}=\frac{n!}{k!(n-k)!} \tag{XXVIII}$$

where $n!$ denotes $n$ factorial, the product of all positive integers up to $n$. The ARI is at most 1, with 1 indicating perfect agreement, values near 0 indicating agreement no better than chance, and negative values indicating less agreement than chance.

To demonstrate how the ARI is computed in the context of unsupervised cross-modal person re-identification, consider two clusterings of the same samples: T (ground-truth labels) and P (pseudo-labels). The ground-truth labels are not used during the training process.

Table III: Confusion matrix between pseudo-labels (rows) and ground-truth labels (columns).

| | Cluster ‘abc’ | Cluster ‘de’ | Cluster ‘fgh’ | Sum |
|---|---|---|---|---|
| Cluster ‘ab’ | 2 | 0 | 0 | 2 |
| Cluster ‘cde’ | 1 | 2 | 0 | 3 |
| Cluster ‘fgh’ | 0 | 0 | 3 | 3 |
| Sum | 3 | 2 | 3 | 8 |

Clusterings Defined:

  • Clustering T: {‘abc’, ‘de’, ‘fgh’}

  • Clustering P: {‘ab’, ‘cde’, ‘fgh’}

where a,b,d,f are visible samples and c,e,g,h are infrared samples.
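Table III can be rebuilt from the two clusterings by counting, for every sample, the pair formed by its pseudo-label cluster and its ground-truth cluster. The sketch below is illustrative; the helper name `contingency` and the dictionary encoding of the clusterings are assumptions introduced here.

```python
from collections import Counter

# Clusterings from the worked example, encoded as sample -> cluster name.
T = {"a": "abc", "b": "abc", "c": "abc", "d": "de", "e": "de",
     "f": "fgh", "g": "fgh", "h": "fgh"}   # ground-truth labels
P = {"a": "ab", "b": "ab", "c": "cde", "d": "cde", "e": "cde",
     "f": "fgh", "g": "fgh", "h": "fgh"}   # pseudo-labels

def contingency(pseudo, truth):
    """Count samples shared by each (pseudo-label cluster, ground-truth cluster) pair."""
    return Counter((pseudo[s], truth[s]) for s in pseudo)

counts = contingency(P, T)
rows = ["ab", "cde", "fgh"]   # pseudo-label clusters (table rows)
cols = ["abc", "de", "fgh"]   # ground-truth clusters (table columns)
table = [[counts[(r, c)] for c in cols] for r in rows]
row_sums = [sum(row) for row in table]
col_sums = [sum(col) for col in zip(*table)]
```

Here `table` holds the $n_{ij}$ entries, `row_sums` the $a_{i}$ values, and `col_sums` the $b_{j}$ values.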

Step 1: Calculating Pairwise Combinations Within Each Clustering

Based on Tab. III, the number of pairwise combinations within each cluster in both clusterings T and P is calculated as follows:

  • In Clustering T:

    • Cluster ‘abc’: $\binom{3}{2}=3$ pairs

    • Cluster ‘de’: $\binom{2}{2}=1$ pair

    • Cluster ‘fgh’: $\binom{3}{2}=3$ pairs

  • In Clustering P:

    • Cluster ‘ab’: $\binom{2}{2}=1$ pair

    • Cluster ‘cde’: $\binom{3}{2}=3$ pairs

    • Cluster ‘fgh’: $\binom{3}{2}=3$ pairs

Step 2: Identifying Shared Pairs Between Clusterings

The number of shared pairs between Clustering T and Clustering P is identified:

  • Shared pairs: ‘ab’, ‘de’, ‘fg’, ‘fh’, ‘gh’

  • Total shared pairs: 5 pairs
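The shared pairs can also be enumerated directly. The following sketch lists every within-cluster pair of each clustering and intersects the two sets; the helper name `within_cluster_pairs` is hypothetical, and clusters are encoded as strings of sample names for brevity.

```python
from itertools import combinations

T = ["abc", "de", "fgh"]   # ground-truth clusters
P = ["ab", "cde", "fgh"]   # pseudo-label clusters

def within_cluster_pairs(clusters):
    """All unordered sample pairs that fall in the same cluster."""
    return {frozenset(pair)
            for cluster in clusters
            for pair in combinations(cluster, 2)}

shared = within_cluster_pairs(T) & within_cluster_pairs(P)
shared_names = sorted("".join(sorted(pair)) for pair in shared)
print(shared_names)  # ['ab', 'de', 'fg', 'fh', 'gh']
```

The size of this intersection equals $\sum_{ij}\binom{n_{ij}}{2}$, the first term in the numerator of the ARI.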

Step 3: Computing the ARI

The quantities required by the ARI formula are:

  • $\sum_{ij}\binom{n_{ij}}{2}=5$ (the shared pairs from Step 2)

  • $\sum_{i}\binom{a_{i}}{2}=1+3+3=7$ (pseudo-label clusters, Clustering P)

  • $\sum_{j}\binom{b_{j}}{2}=3+1+3=7$ (ground-truth clusters, Clustering T)

  • $\binom{N}{2}=\binom{8}{2}=28$

Substituting these values into the ARI formula:

$$\text{ARI}=\frac{5-\frac{7\times 7}{28}}{\frac{7+7}{2}-\frac{7\times 7}{28}}=\frac{5-1.75}{7-1.75}=\frac{3.25}{5.25}\approx 0.619$$

The ARI value of approximately 0.619 indicates a moderate level of agreement between Clustering T and Clustering P, suggesting some consistency in the clustering results, yet not perfectly aligned. This example demonstrates the utility of ARI in providing an objective assessment of the reliability of cross-modality correspondences.
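The three steps above can be combined into one short function. This is a minimal sketch using only the Python standard library, not the authors' evaluation code; the function name `adjusted_rand_index`, the label encoding, and the handling of degenerate inputs (returning 1.0) are choices made here for illustration.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(pseudo, truth):
    """ARI between two labelings given as equal-length sequences."""
    n_pairs = comb(len(pseudo), 2)
    if n_pairs == 0:
        return 1.0  # fewer than two samples: treated as trivially identical
    n_ij = Counter(zip(pseudo, truth))   # contingency counts
    a = Counter(pseudo)                  # pseudo-label cluster sizes
    b = Counter(truth)                   # ground-truth cluster sizes
    sum_ij = sum(comb(v, 2) for v in n_ij.values())
    sum_a = sum(comb(v, 2) for v in a.values())
    sum_b = sum(comb(v, 2) for v in b.values())
    expected = sum_a * sum_b / n_pairs
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (sum_ij - expected) / (max_index - expected)

# Samples a..h in order, labelled by cluster name.
truth = ["abc", "abc", "abc", "de", "de", "fgh", "fgh", "fgh"]
pseudo = ["ab", "ab", "cde", "cde", "cde", "fgh", "fgh", "fgh"]
ari = adjusted_rand_index(pseudo, truth)
print(round(ari, 3))  # 0.619
```

Because the ARI depends only on which samples are grouped together, renaming the clusters in either labeling leaves the result unchanged.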

VIII Dataset

We evaluate our MMM on two benchmarks, i.e., SYSU-MM01 and RegDB. SYSU-MM01 is a large-scale visible-infrared person ReID dataset collected from four visible cameras and two infrared cameras in both indoor and outdoor scenes. In total, the dataset contains 287,628 visible images and 15,792 infrared images of 491 identities. Of these, 22,258 visible images and 11,909 infrared images of 395 identities form the training set. In addition, 3,803 infrared images are used as the query set, and 301 randomly selected visible images make up the gallery set. RegDB is a relatively small dataset that contains 412 identities with 4,120 visible images and 4,120 infrared images. The dataset is divided at random, with half used for training and the other half for testing.

IX Implementation Details

Our proposed framework is implemented in PyTorch. At each training step, we randomly sample 8 IDs and choose 4 visible and 4 infrared images for each to form a batch. Training images are resized to $288\times 144$. The total number of training epochs is 80. The SGD optimizer is used to train the model, with momentum set to 0.9 and weight decay set to $5\times 10^{-4}$. The Intra module is added from the 1st epoch and the Inter module from the 15th epoch. The loss temperature $\tau$ is set to 0.05.
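The batch composition and the module schedule described above can be sketched as follows. This is a framework-agnostic illustration, not the released implementation: the function names, the index dictionaries, sampling with replacement within an identity, and restricting sampling to identities present in both modalities are all assumptions. In the unsupervised setting the "IDs" would be cluster pseudo-labels.

```python
import random

NUM_IDS, NUM_VIS, NUM_IR = 8, 4, 4   # 8 IDs, 4 visible + 4 infrared images each
INTRA_START, INTER_START = 1, 15     # epochs at which each module is switched on

def sample_batch(vis_index, ir_index, rng=random):
    """vis_index / ir_index: dict mapping an identity to a list of image indices.
    Returns (visible indices, infrared indices) for one batch."""
    shared_ids = sorted(set(vis_index) & set(ir_index))
    ids = rng.sample(shared_ids, NUM_IDS)
    # choices() samples with replacement, so small clusters still fill a batch.
    vis = [i for pid in ids for i in rng.choices(vis_index[pid], k=NUM_VIS)]
    ir = [i for pid in ids for i in rng.choices(ir_index[pid], k=NUM_IR)]
    return vis, ir

def active_modules(epoch):
    """Names of the alignment modules enabled at a given 1-based epoch."""
    modules = []
    if epoch >= INTRA_START:
        modules.append("intra")
    if epoch >= INTER_START:
        modules.append("inter")
    return modules
```

With these settings, each batch holds 32 visible and 32 infrared images, and only the Intra module is active during epochs 1 to 14.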