Multi-Memory Matching for Unsupervised Visible-Infrared Person Re-Identification

Jiangming Shi¹, Xiangbo Yin², Yeyun Chen¹, Yachao Zhang³, Zhizhong Zhang⁴,⁵,
Yuan Xie⁴,⁵,*, Yanyun Qu¹,²
¹ Institute of Artificial Intelligence, ² School of Informatics, Xiamen University
³ Tsinghua Shenzhen International Graduate School, Tsinghua University
⁴ East China Normal University, ⁵ Chongqing Institute of East China Normal University
[email protected], {zzzhang, yxie}@cs.ecnu.edu.cn, [email protected]
* Corresponding author.

Abstract

Unsupervised visible-infrared person re-identification (USL-VI-ReID) is a promising yet challenging retrieval task. The key challenges in USL-VI-ReID are to effectively generate pseudo-labels and establish pseudo-label correspondences across modalities without relying on any prior annotations. Recently, clustered pseudo-label methods have gained more attention in USL-VI-ReID. However, previous methods fall short of fully exploiting individual nuances, because they use a single memory to represent each identity when establishing cross-modality correspondences, which makes those correspondences ambiguous. To address this problem, we propose a Multi-Memory Matching (MMM) framework for USL-VI-ReID. We first design a Cross-Modality Clustering (CMC) module that generates pseudo-labels by jointly clustering samples from both modalities. To associate cross-modality clustered pseudo-labels, we design a Multi-Memory Learning and Matching (MMLM) module, ensuring that optimization explicitly focuses on the nuances of individual perspectives and establishes reliable cross-modality correspondences. Finally, we design a Soft Cluster-level Alignment (SCA) module to narrow the modality gap while mitigating the effect of noisy pseudo-labels through a soft many-to-many alignment strategy. Extensive experiments on the public SYSU-MM01 and RegDB datasets demonstrate the reliability of the established cross-modality correspondences and the effectiveness of our MMM. The source code will be released.

1 Introduction

Person re-identification (ReID) aims to match the same person across different cameras, with applications in various fields of video surveillance [49, 12, 13, 14], such as intelligent security and criminal investigation. However, in low-light conditions, the images captured by visible cameras are far from satisfactory, which renders methods [24, 14, 33, 43, 55] that primarily focus on matching visible images less effective. Fortunately, smart surveillance cameras that can switch from visible to infrared modes in poor lighting environments have become widespread, driving the development of visible-infrared person re-identification (VI-ReID) for the 24-hour surveillance system.

Figure 1: Comparison with different methods on ARI. ARI denotes the Adjusted Rand Index, a similarity measure between two clusterings. The visualization is presented in Fig. 5.

VI-ReID aims at retrieving infrared images of the same person when provided with a visible person image, and vice versa [53, 48]. Many VI-ReID methods [51, 53, 52, 28, 41, 56, 19] have shown promising progress. However, these methods rely on well-annotated cross-modality data, and producing such annotations is time-consuming and labor-intensive, which limits the practical application of supervised VI-ReID methods in real-world scenarios.

To avoid the laborious labeling process and further automate VI-ReID, several unsupervised VI-ReID (USL-VI-ReID) methods [44, 6, 16, 42, 47, 45] have been proposed, which try to establish cross-modality correspondences by clustering pseudo-labels and have achieved fairly good performance. However, the reliability of the pseudo-labels and cross-modality correspondences has not been evaluated in USL-VI-ReID, and we argue that both are critical to the credibility of USL-VI-ReID. To measure this reliability, we introduce the Adjusted Rand Index (ARI) [18], a widely recognized metric for clustering evaluation. A larger ARI value indicates a greater degree of overlap between the clustered results and the ground-truth labels. More detailed explanations are presented in the supplementary materials. In Fig. 1, the RGB and IR categories denote the ARI values of visible and infrared pseudo-labels, which measure the quality of the pseudo-labels within each modality. The ALL category represents the ARI value of the overall pseudo-labels, composed of visible and infrared pseudo-labels, and serves as a metric for evaluating the reliability of cross-modality correspondences. Interestingly, we observe a peculiar phenomenon: the cross-modality correspondences of previous methods are not reliable, even though these methods achieve good performance, as shown in Fig. 1 and Tab. 1. We therefore raise a question: why do noisy cross-modality correspondences still achieve good performance? The answer is that persons with different identities may share some similar features, and noisy correspondences pull these similar features closer together. However, this closer proximity may also increase the similarity among different persons, making it more challenging to retrieve a specific person from a large gallery.
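To make the RGB / IR / ALL distinction concrete, the following is a minimal pure-Python sketch of the standard ARI formula (pair counting with a chance correction). The function name `adjusted_rand_index` and the toy labels are illustrative assumptions, not the authors' evaluation code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two labelings of the same samples (standard pair-counting form)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    # Contingency counts: samples per (true identity, predicted cluster) cell.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy data: each modality is clustered perfectly on its own, but the
# cluster IDs of the two modalities are swapped relative to each other.
true_v, pred_v = [0, 0, 1, 1], [0, 0, 1, 1]
true_r, pred_r = [0, 0, 1, 1], [1, 1, 0, 0]

ari_rgb = adjusted_rand_index(true_v, pred_v)                    # 1.0
ari_ir = adjusted_rand_index(true_r, pred_r)                     # 1.0
ari_all = adjusted_rand_index(true_v + true_r, pred_v + pred_r)  # -1/6
```

In this toy case the per-modality scores are perfect while the ALL score is negative, which is the kind of gap between pseudo-label quality and cross-modality correspondence quality that the ALL category is meant to expose.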

To reduce ambiguous cross-modality correspondences in USL-VI-ReID, we develop a novel Multi-Memory Matching (MMM) framework that exploits individual nuances. Multi-memory can store a wider array of distinct characteristics for a single identity. For example, Memory 1 can retain front-facing attributes, Memory 2 can capture rear-facing attributes, and so on. In short, multi-memory supports a more diverse representation, which is beneficial for establishing cross-modality correspondences. Specifically, we propose a Cross-Modality Clustering (CMC) module to generate pseudo-labels. Unlike previous methods, we cluster not only intra-modality samples but also inter-modality samples to learn modality-invariant features. Existing methods typically rely on a single memory to represent individual characteristics and establish cross-modality correspondences. However, a single memory may not capture all individual nuances, including perspective, attire, and other factors, which naturally leads to poor cross-modality correspondences. Therefore, we design a Multi-Memory Learning and Matching (MMLM) module to obtain reliable cross-modality correspondences. Because the modules mentioned above do not directly reduce the discrepancy between the two modalities, we further propose the Soft Cluster-level Alignment (SCA) module to narrow the modality gap through two soft cluster-level intra- and inter-modality alignment losses. As shown in Fig. 1, our MMM achieves better pseudo-label and cross-modality correspondence quality than several USL-VI-ReID methods.

The main contributions are summarized as follows:

• We introduce the ARI metric to evaluate the quality of pseudo-labels and cross-modality correspondences, and we observe a curious phenomenon: cross-modality correspondences of previous methods are not reliable, though they achieve good performance.

• We design a novel Multi-Memory Matching (MMM) framework for unsupervised VI-ReID, which exploits the individual nuances to effectively establish reliable cross-modality correspondences.

• We introduce three effective modules: Cross-Modality Clustering (CMC), Multi-Memory Learning and Matching (MMLM), and Soft Cluster-level Alignment (SCA). These modules facilitate the generation of pseudo-labels, establish reliable cross-modality correspondences, and narrow the discrepancy between two modalities while mitigating the influence of noisy pseudo-labels.

2 Related Work

2.1 Supervised Visible-Infrared Person ReID

Visible-infrared person ReID is a challenging cross-modality image retrieval problem. Many works have been proposed to alleviate the large cross-modality gap for VI-ReID, which can be broadly categorized into two classes: image-level alignment and feature-level alignment. The image-level alignment methods [8, 35, 34] try to generate cross-modal images to excavate modality-invariant information. Moreover, several methods [20, 58, 39] introduce an auxiliary modality to assist the cross-modality retrieval task. The feature-level alignment methods [52, 25, 56, 48, 51, 32, 41] mainly map cross-modal features into a shared feature space to reduce cross-modal differences. For example, SGIEL [11] separates shape-related features from shape-erasure features through orthogonal decomposition to improve the diversity and identification of the learned representations for VI-ReID. However, the above methods heavily rely on large-scale cross-modality data annotation, which is quite expensive and time-consuming.

2.2 Unsupervised Single-Modality Person ReID

Existing unsupervised single-modality person ReID (USL-ReID) methods can be roughly categorized into domain translation-based methods and clustering-based methods. The domain translation-based methods [12, 13, 14, 31, 54, 22, 37] try to transfer knowledge from a labeled source domain to the unlabeled target domain for USL-ReID. In contrast, the clustering-based methods [24, 23, 33, 43, 55, 7, 3] are trained directly on the unlabeled target domain, which is a more challenging setting. The common idea of clustering-based methods is to use clustering algorithms [10] to generate pseudo-labels for training a ReID model. Pseudo-labels inevitably contain noise, so it is challenging to assign the correct label to each unlabeled image. Recently, Cluster-Contrast [9] performs contrastive learning at the cluster level, thus achieving the consistency of the feature dictionary. Although the above methods perform well on USL-ReID, they are not suitable for USL-VI-ReID due to the large cross-modality gap.

2.3 Unsupervised Visible-Infrared Person ReID

The challenge of unsupervised VI-ReID (USL-VI-ReID) is establishing reliable cross-modality correspondence. H2H [21] and OTLA [38] use a well-annotated source domain for pre-training to address USL-VI-ReID. Inspired by Cluster-Contrast [9] for USL-ReID, several clustering-based methods [44, 6, 16, 42, 47] have been proposed for USL-VI-ReID; they try to establish cross-modality correspondence by clustering pseudo-labels. Recently, it has been shown that large-scale vision-language pre-training models, e.g., CLIP [29], naturally excel in producing textual descriptions for images. To this end, CCLNet [5] leverages the text information from CLIP to improve the USL-VI-ReID task. However, none of the above methods evaluates the reliability of its cross-modality correspondence, and in fact this correspondence is not reliable. Our method aims to investigate how to establish more reliable cross-modality correspondence for USL-VI-ReID.

3 Methodology

The framework of our MMM is illustrated in Fig. 2. We begin by employing the Cross-Modality Clustering (CMC) module to generate pseudo-labels. Building upon CMC, we propose a novel Multi-Memory Learning and Matching (MMLM) module to effectively establish cross-modality correspondences. Finally, we propose the Soft Cluster-level Alignment (SCA) module to narrow the gap between two modalities while mitigating the impact of noisy pseudo-labels through two soft cluster-level intra- and inter-modality alignment losses.

3.1 Notation Definition

Suppose we have a USL-VI-ReID dataset denoted as $D=\{V,R\}$ . Here, $V=\{v_{i}\}^{N}_{i=1}$ represents the visible images with $N$ samples, and $R=\{r_{i}\}^{M}_{i=1}$ denotes the infrared images with $M$ samples. We initialize their pseudo-labels as $Y^{t}$ , where $t\in\{v,r\}$ . Let $N_{p}$ and $M_{p}$ represent the number of visible and infrared samples with ID $p$ , where $p\in\{1,2,...,P^{t}\}$ and $P^{t}$ is the total number of person identities for modality $t$ . The respective feature sets of these images are denoted as $F^{v}=\{f^{v}_{1},f^{v}_{2},\ldots,f^{v}_{N}\}$ for visible samples and $F^{r}=\{f^{r}_{1},f^{r}_{2},\ldots,f^{r}_{M}\}$ for infrared samples, respectively. Our goal is to develop a cross-modality person ReID model without utilizing any labels.

3.2 Cross-Modality Clustering

Most USL-VI-ReID methods typically use clustering algorithms to generate pseudo-labels. Following this paradigm, we employ the DBSCAN algorithm [10] to generate pseudo-labels for all images, as described:

$$Y^{t}=\mathrm{DBSCAN}(F^{t}). \tag{1}$$

Unlike previous methods, we cluster not only intra-modality samples ($t=v$ or $t=r$) but also inter-modality samples ($t=[v,r]$) to indirectly build cross-modality correspondence.

At the beginning of every training iteration, we calculate and retain the memory for each cluster as follows:

$$C_{V^{p}}=\frac{1}{N_{p}}\sum_{i=1}^{N_{p}}f(V_{i}^{p}), \tag{2}$$

$$C_{R^{p}}=\frac{1}{M_{p}}\sum_{i=1}^{M_{p}}f(R_{i}^{p}), \tag{3}$$

$$C_{VR^{p}}=\frac{1}{A_{p}}\sum_{i=1}^{A_{p}}f(VR_{i}^{p}), \tag{4}$$

where $f(\cdot)$ is a function designated for extracting features from images across diverse modalities. We use superscripts to denote specified identity, $V^{p}$ and $R^{p}$ denote the visible and infrared modality of the same identity sample sets with ID $p$ , respectively. ${VR}^{p}$ represents the combined set of both modalities with $A_{p}$ samples of the same ID $p$ . The index $p$ ranges from 1 to $P^{t}$ .
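Eqs. (2)-(4) amount to averaging the features that share a pseudo-label. The sketch below illustrates this with plain Python lists; the helper name `cluster_memories` is hypothetical, and skipping label `-1` assumes the common DBSCAN convention that `-1` marks outliers, which the text does not state.

```python
from collections import defaultdict


def cluster_memories(features, pseudo_labels):
    """Return {pseudo-label: mean feature vector}, one memory per cluster."""
    groups = defaultdict(list)
    for feat, label in zip(features, pseudo_labels):
        if label != -1:  # assumption: -1 marks un-clustered outliers
            groups[label].append(feat)
    # Column-wise mean of each group's feature vectors.
    return {
        label: [sum(column) / len(feats) for column in zip(*feats)]
        for label, feats in groups.items()
    }


# Toy 2-d features; in the paper these would be backbone outputs f(.).
visible_feats = [[0.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
visible_labels = [0, 0, -1]
infrared_feats = [[1.0, 1.0], [3.0, 5.0]]
infrared_labels = [0, 0]

memory_v = cluster_memories(visible_feats, visible_labels)    # Eq. (2)
memory_r = cluster_memories(infrared_feats, infrared_labels)  # Eq. (3)
# Eq. (4): the same averaging over the concatenated set. Here the joint
# labels are simply reused for illustration; in CMC they come from a
# separate clustering run on the combined features.
memory_vr = cluster_memories(
    visible_feats + infrared_feats, visible_labels + infrared_labels
)
```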

Then, we optimize the feature extractor using ClusterNCE [9] loss, computed as:

$$L_{V}=-\log\frac{\exp\left(C_{V}^{+}\cdot F^{v}/\tau\right)}{\sum_{p=1}^{P^{v}}\exp\left(C_{V^{p}}\cdot F^{v}/\tau\right)}, \tag{5}$$

$$L_{R}=-\log\frac{\exp\left(C_{R}^{+}\cdot F^{r}/\tau\right)}{\sum_{p=1}^{P^{r}}\exp\left(C_{R^{p}}\cdot F^{r}/\tau\right)}, \tag{6}$$

$$L_{VR}=-\log\frac{\exp\left(C_{VR}^{+}\cdot[F^{v},F^{r}]/\tau\right)}{\sum_{p=1}^{P^{v,r}}\exp\left(C_{VR^{p}}\cdot[F^{v},F^{r}]/\tau\right)}, \tag{7}$$

where $C^{+}$ is the positive memory representation and the $\tau$ is a temperature hyper-parameter.
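For a single feature vector, each of Eqs. (5)-(7) is a softmax cross-entropy over the memories, with the positive memory as the target. The following sketch evaluates that quantity in pure Python; the function name `cluster_nce` and the default `tau=0.05` are assumptions for illustration, since the temperature value is not given in this section.

```python
import math


def cluster_nce(feature, memories, positive_label, tau=0.05):
    """-log softmax of the positive memory's similarity to `feature`.

    `memories` maps pseudo-label -> memory vector; `tau` is the temperature.
    """
    logits = {
        label: sum(a * b for a, b in zip(feature, memory)) / tau
        for label, memory in memories.items()
    }
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits.values())
    log_denominator = peak + math.log(
        sum(math.exp(value - peak) for value in logits.values())
    )
    return log_denominator - logits[positive_label]


memories = {0: [1.0, 0.0], 1: [0.0, 1.0]}
loss_correct = cluster_nce([1.0, 0.0], memories, positive_label=0)
loss_wrong = cluster_nce([1.0, 0.0], memories, positive_label=1)
```

The loss is small when the feature is most similar to its own cluster memory and grows when another memory is closer, which is what drives features toward their assigned cluster.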

The CMC loss is defined as:

$$L_{CMC}=L_{V}+L_{R}+L_{VR}. \tag{8}$$

3.3 Multi-Memory Learning and Matching

The CMC optimizes the feature extractor using a single memory, but a single memory may not fully capture individual nuances, such as perspective and attire. Moreover, the CMC does not directly establish relations between the two modalities, thereby limiting its effectiveness in cases with significant modality discrepancies. To more effectively capture intra-identity variations and bridge the gap between the visible and infrared modalities, we propose the Multi-Memory Learning and Matching (MMLM) module, which mines a holistic representation and establishes reliable cross-modality correspondences. Specifically, we further subdivide single memory into multi-memory for a single identity, which can be formulated as K-means [26]:

$$F_{C_{V^{p}_{i}}}=\underset{n}{\arg\min}\left\{\left\|f(v_{j})-C_{V^{p}}\right\|_{2}^{2},\ \forall v_{j}\in V^{p}\right\}, \tag{9}$$

$$F_{C_{R^{p}_{i}}}=\underset{n}{\arg\min}\left\{\left\|f(r_{j})-C_{R^{p}}\right\|_{2}^{2},\ \forall r_{j}\in R^{p}\right\}, \tag{10}$$

Here, $F_{C_{V^{p}_{i}}}$ and $F_{C_{R^{p}_{i}}}$ represent the $i$-th visible and infrared feature sets of ID $p$, respectively. The index $i$ ranges from 1 to $n$, where $n$ is the number of memories for a single identity.

$$K_{C_{V^{p}_{i}}}=\frac{1}{|F_{C_{V^{p}_{i}}}|}\sum_{f^{v}\in F_{C_{V^{p}_{i}}}}f^{v}, \tag{11}$$

$$K_{C_{R^{p}_{i}}}=\frac{1}{|F_{C_{R^{p}_{i}}}|}\sum_{f^{r}\in F_{C_{R^{p}_{i}}}}f^{r}, \tag{12}$$

where $K_{C_{V^{p}}}$ and $K_{C_{R^{p}}}$ represent the visible and infrared multi-memory of ID $p$ , respectively.
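The subdivision in Eqs. (9)-(12) can be illustrated with plain K-means (Lloyd's algorithm) run on the features of one identity. The sketch below is a simplified stand-in: the name `split_into_memories`, the fixed iteration cap, and the deterministic initialisation from the first $n$ features are all assumptions made to keep the example short and reproducible.

```python
def split_into_memories(features, n, iterations=20):
    """Split one identity's features into up to n sub-clusters; return their means."""
    # Deterministic initialisation for this sketch: the first n features.
    centres = [list(f) for f in features[:n]]
    for _ in range(iterations):
        # Assignment step: each feature joins its nearest centre.
        groups = [[] for _ in centres]
        for f in features:
            dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in centres]
            groups[dists.index(min(dists))].append(f)
        # Update step, as in Eqs. (11)-(12): a memory is the mean of its
        # sub-cluster. An empty sub-cluster keeps its previous centre.
        new_centres = [
            [sum(column) / len(group) for column in zip(*group)] if group else centre
            for group, centre in zip(groups, centres)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    return centres


# Toy identity with two visibly different "views" in feature space.
identity_feats = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
multi_memory = split_into_memories(identity_feats, n=2)
```

With $n=1$ the function returns the single cluster mean, so the single-memory setting of CMC is the special case of this procedure.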

By employing the multi-memory learning strategy, we achieve more diverse memories for a single identity. However, these memories still exhibit a strong implicit correlation with the modality, which negatively impacts the establishment of cross-modality correspondences. Inspired by the PGM [42], we transform the cross-modality multi-memory matching problem into a weighted bipartite graph matching. The goal is to match each visible cluster with the corresponding identity infrared cluster while minimizing the cost, which is formulated as follows:

$$\begin{array}{c}\underset{Q}{\min}\ C^{T}Q\\ \text{s.t. }\forall p\in[P^{v}],\forall p^{\prime}\in[P^{r}]:Q_{p}^{p^{\prime}}\in\{0,1\},\\ \forall p\in[P^{v}]:\sum_{p^{\prime}\in[P^{r}]}Q_{p}^{p^{\prime}}\leq 1,\\ \forall p^{\prime}\in[P^{r}]:\sum_{p\in[P^{v}]}Q_{p}^{p^{\prime}}=1,\end{array} \tag{13}$$

where ${Q}=\left\{Q_{p}^{p^{\prime}}\right\}\in\mathbb{R}^{P^{v}\times P^{r}\times 1}$ indicates whether $K_{C_{V^{p}}}$ and $K_{C_{R^{p^{\prime}}}}$ belong to the same person $\left(Q_{p}^{p^{\prime}}=1\right)$ or not $\left(Q_{p}^{p^{\prime}}=0\right)$. $C$ and $[P^{t}]$ denote the cost matrix and the index set $\{1,\dots,P^{t}\}$, respectively. We design a simple yet effective cost expression for cross-modality multi-memory matching as follows:

$$C(K_{C_{V^{p}}},K_{C_{R^{p^{\prime}}}})=\sum_{i=1}^{n}\min_{j\in\{1,\cdots,n\}}\left\|K_{C_{V^{p}_{i}}}-K_{C_{R^{p^{\prime}}_{j}}}\right\|_{2}. \tag{14}$$

Finally, we transfer the infrared pseudo-labels to the visible pseudo-labels, which can be written as:

$$Y^{v}:=QY^{r}. \tag{15}$$
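The matching step can be sketched as follows. The cost function follows the structure of Eq. (14): for every visible memory, take the distance to its nearest infrared memory, then sum. For brevity the assignment itself is solved by brute force over all candidate matchings rather than with a dedicated bipartite solver such as the Hungarian algorithm, so this version is only practical for a handful of clusters. The names `memory_cost` and `match_clusters` are hypothetical.

```python
import math
from itertools import permutations


def memory_cost(vis_memories, ir_memories):
    """Sum over visible memories of the distance to the nearest infrared memory."""
    return sum(
        min(math.dist(kv, kr) for kr in ir_memories) for kv in vis_memories
    )


def match_clusters(vis_clusters, ir_clusters):
    """Minimum-cost matching; returns {infrared cluster index: visible cluster index}.

    Mirrors the constraints of Eq. (13): every infrared cluster is matched
    exactly once and every visible cluster at most once.
    """
    if len(ir_clusters) > len(vis_clusters):
        raise ValueError("need at least as many visible as infrared clusters")
    cost = [[memory_cost(v, r) for r in ir_clusters] for v in vis_clusters]
    best, best_total = None, math.inf
    # Brute force: try every injective infrared -> visible assignment.
    for assignment in permutations(range(len(vis_clusters)), len(ir_clusters)):
        total = sum(cost[p][q] for q, p in enumerate(assignment))
        if total < best_total:
            best, best_total = assignment, total
    return dict(enumerate(best))


# Two identities, each with n = 2 memories per modality (toy values).
vis_clusters = [[[0.0, 0.0], [0.0, 1.0]], [[5.0, 5.0], [5.0, 6.0]]]
ir_clusters = [[[5.1, 5.0], [5.0, 6.2]], [[0.1, 0.0], [0.0, 0.9]]]
matching = match_clusters(vis_clusters, ir_clusters)
```

Given such a matching, Eq. (15) corresponds to giving each matched visible cluster the pseudo-label of its infrared partner.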

3.4 Soft Cluster-level Alignment

Pseudo-labels inherently contain noise, a problem from which even human annotations are not exempt [48], and this noise degrades performance. Arpit et al. [2] showed that deep neural networks first learn from simple samples before fitting noisy labels. Building on this insight, we assess the confidence associated with each label. To do so, we employ a two-component Gaussian Mixture Model (GMM) to model the loss distribution:

$$L_{ID}^{v}=-\log p\left(Y^{v}\mid C\left(F^{v}\right)\right), \tag{16}$$

$$p(L^{v}_{ID}\mid\theta)=\sum_{k=1}^{2}\pi_{k}\phi(L_{ID}^{v}\mid k), \tag{17}$$

where $C(\cdot)$ acts as an identity classifier, $\pi_{k}$ represents the mixture coefficient, and $\phi({L}^{v}_{ID}\mid k)$ denotes the probability density of the $k$-th component.

Subsequently, the confidence is determined by computing its posterior probability, detailed as:

$$W^{v}=p\left(k\mid L^{v}_{ID}\right), \tag{18}$$

where $k$ refers to the Gaussian component with the smaller mean, and $p\left(k\mid{L}^{v}_{ID}\right)$ is the posterior probability (responsibility) of the $k$-th component for ${L}^{v}_{ID}$. The confidences $W^{r}$ and $W^{vr}$ are obtained in the same way.
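The confidence estimate in Eqs. (17)-(18) can be sketched with a small expectation-maximisation loop for a one-dimensional, two-component GMM. This is an illustrative pure-Python version under simple assumptions (initial means at the smallest and largest loss, a fixed number of iterations, a small variance floor `eps`); the function name `gmm_clean_confidence` is hypothetical.

```python
import math


def gmm_clean_confidence(losses, iterations=50, eps=1e-6):
    """Fit a 2-component 1-D GMM by EM; return P(low-mean component | loss)."""
    lo, hi = min(losses), max(losses)
    means = [lo, hi]  # assumption: initialise at the extremes
    spread = max((hi - lo) ** 2 / 4, eps)
    variances = [spread, spread]
    weights = [0.5, 0.5]

    def responsibilities():
        resp = []
        for x in losses:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens)
            # If both densities underflow, fall back to an uninformative split.
            resp.append([d / total for d in dens] if total > 0 else [0.5, 0.5])
        return resp

    for _ in range(iterations):
        resp = responsibilities()  # E-step
        for k in range(2):  # M-step
            nk = sum(r[k] for r in resp) + eps
            means[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            variances[k] = (
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, losses)) / nk
                + eps
            )
            weights[k] = nk / len(losses)

    clean = 0 if means[0] <= means[1] else 1  # component with the smaller mean
    return [r[clean] for r in responsibilities()]


# Toy per-sample losses: three small (likely clean) and three large (likely noisy).
confidence = gmm_clean_confidence([0.10, 0.12, 0.09, 2.0, 2.1, 1.9])
```

Samples whose loss falls in the low-mean component receive a confidence close to 1, while high-loss samples receive a confidence close to 0.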

To penalize the noise during optimization, the memories in Eq. (2), (3), (4) are updated by:

$$C_{V^{p}}:=\frac{1}{N_{p}}\sum_{i=1}^{N_{p}}f(V_{i}^{p})W_{V^{p}_{i}}, \tag{19}$$

$$C_{R^{p}}:=\frac{1}{M_{p}}\sum_{i=1}^{M_{p}}f(R_{i}^{p})W_{R^{p}_{i}}, \tag{20}$$

$$C_{VR^{p}}:=\frac{1}{A_{p}}\sum_{i=1}^{A_{p}}f(VR_{i}^{p})W_{VR^{p}_{i}}, \tag{21}$$

where $W_{V^{p}_{i}}$ , $W_{R^{p}_{i}}$ , and $W_{VR^{p}_{i}}$ denote the confidences of samples $V^{p}_{i}$ , $R^{p}_{i}$ , and $VR^{p}_{i}$ , respectively.
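As a worked illustration of Eqs. (19)-(21), the sketch below scales each feature by its confidence before averaging. It follows the equations literally, dividing by the sample count rather than by the sum of the confidences; the function name `weighted_memory` is hypothetical.

```python
def weighted_memory(features, confidences):
    """Confidence-weighted sum of features divided by the number of samples."""
    count = len(features)
    return [
        sum(w * value for value, w in zip(column, confidences)) / count
        for column in zip(*features)
    ]


# The second sample has low confidence, so it contributes less to the memory.
memory = weighted_memory([[2.0, 0.0], [0.0, 4.0]], [1.0, 0.5])
```

With all confidences equal to 1 this reduces to the plain mean of Eqs. (2)-(4).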

To reduce the intra-modality discrepancy, we employ the distilled ${C}_{V^{p}}$ and ${C}_{R^{p}}$ to align every sample of ID $p$ to its corresponding memory in each modality. The cluster-level intra-modality alignment loss ${L}_{Intra}$ is proposed as:

$$\begin{aligned}L_{Intra}&=L_{Intra}^{V}+L_{Intra}^{R}\\ &=\sum_{p=1}^{P^{v}}\sum_{f^{v}\in F_{p}^{v}}\left\|f^{v}-C_{V^{p}}\right\|_{2}^{2}+\sum_{p=1}^{P^{r}}\sum_{f^{r}\in F_{p}^{r}}\left\|f^{r}-C_{R^{p}}\right\|_{2}^{2},\end{aligned} \tag{22}$$

where ${F}^{v}_{p}$ , ${F}^{r}_{p}$ denote visible feature and infrared feature sets of ID $p$ , respectively.
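One modality's term of Eq. (22) is a sum of squared Euclidean distances between features and the memory of their identity. A minimal sketch, with the hypothetical helper name `intra_alignment_loss` and dictionaries keyed by pseudo-label:

```python
def intra_alignment_loss(features_by_id, memories):
    """Sum of squared distances between each feature and its identity's memory."""
    loss = 0.0
    for identity, feats in features_by_id.items():
        memory = memories[identity]
        for f in feats:
            loss += sum((a - b) ** 2 for a, b in zip(f, memory))
    return loss


# L_Intra is the visible term plus the infrared term (toy values).
loss_v = intra_alignment_loss({0: [[1.0, 0.0], [0.0, 1.0]]}, {0: [0.0, 0.0]})
loss_r = intra_alignment_loss({0: [[2.0, 2.0]]}, {0: [2.0, 2.0]})
loss_intra = loss_v + loss_r
```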

Since VI-ReID is a many-to-many matching problem, we propose cluster-level inter-modality alignment loss, which forces the feature distribution of the samples from the visible modality to be similar to the feature distribution of the samples from the infrared modality and vice versa by:

$$\begin{aligned}L_{Inter}&=L_{Inter}^{V}+L_{Inter}^{R}\\ &=\frac{1}{P}\sum_{p=1}^{P}\left(\frac{1}{2}D(F^{v}_{p},sg(F^{r}_{p}))+\frac{1}{2}D(F^{r}_{p},sg(F^{v}_{p}))\right),\end{aligned} \tag{23}$$

where $sg(\cdot)$ represents the stop-gradient operation, $D(i,j)$ represents the distance between distributions $i$ and $j$, and $P=\min(P^{v},P^{r})$. In this paper, we employ the squared Maximum Mean Discrepancy ($\text{MMD}^{2}$) [15] to quantify the discrepancy between distributions. $\text{MMD}^{2}$ is a commonly used non-parametric metric in domain adaptation and has been observed in empirical studies to outperform other metrics such as KL divergence. $\text{MMD}^{2}$ is constructed as:

$$\begin{aligned}\text{MMD}^{2}(F^{r}_{p},F^{v}_{p})&=\frac{1}{|F^{r}_{p}|^{2}}\sum_{f^{r}_{i}\in F^{r}_{p}}\sum_{f^{r}_{j}\in F^{r}_{p}}z(f^{r}_{i},f^{r}_{j})\\ &+\frac{1}{|F^{v}_{p}|^{2}}\sum_{f^{v}_{i}\in F^{v}_{p}}\sum_{f^{v}_{j}\in F^{v}_{p}}z(f^{v}_{i},f^{v}_{j})\\ &-\frac{2}{|F^{r}_{p}||F^{v}_{p}|}\sum_{f^{r}_{i}\in F^{r}_{p}}\sum_{f^{v}_{j}\in F^{v}_{p}}z(f^{r}_{i},f^{v}_{j}),\end{aligned} \tag{24}$$

where $z(s,s^{\prime})=\exp\left(\frac{-\left\|s-s^{\prime}\right\|_{2}^{2}}{2\sigma^{2}}\right)$ is a Gaussian kernel.
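Eq. (24) can be evaluated directly as three averaged kernel sums. The pure-Python sketch below does exactly that for two small feature sets; the names `gaussian_kernel` and `mmd2` and the default bandwidth `sigma=1.0` are assumptions. The stop-gradient $sg(\cdot)$ of Eq. (23) has no counterpart here because nothing is differentiated.

```python
import math


def gaussian_kernel(s, t, sigma=1.0):
    """z(s, t) = exp(-||s - t||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(s, t))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def mmd2(feats_r, feats_v, sigma=1.0):
    """Squared MMD between two feature sets with a Gaussian kernel."""

    def mean_kernel(xs, ys):
        total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(feats_r, feats_r)
        + mean_kernel(feats_v, feats_v)
        - 2 * mean_kernel(feats_r, feats_v)
    )


infrared = [[0.0, 0.0], [1.0, 0.0]]
visible_near = [[0.1, 0.0], [0.9, 0.0]]
visible_far = [[5.0, 5.0], [6.0, 5.0]]

gap_near = mmd2(infrared, visible_near)
gap_far = mmd2(infrared, visible_far)
```

The value is zero when both sets coincide and grows as the two feature distributions move apart, so minimising it pulls the visible and infrared features of the same identity together.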

The SCA loss is defined as:

$$L_{SCA}=\lambda_{Intra}L_{Intra}+\lambda_{Inter}L_{Inter}, \tag{25}$$

where $\lambda_{Intra}$ and $\lambda_{Inter}$ are the balancing weights.

Overall Loss. The total loss for training the model is defined by the following equation:

$$L_{overall}=L_{CMC}+L_{SCA}. \tag{26}$$

Table 1: Comparisons with state-of-the-art methods on SYSU-MM01 and RegDB, i.e., supervised visible-infrared person ReID (SVI-ReID), semi-supervised visible-infrared person ReID (SSVI-ReID) and unsupervised visible-infrared person ReID (USL-VI-ReID). All methods are measured by Rank-1 (%) and mAP (%). GUR* denotes the results without camera information.

Settings			SYSU-MM01				RegDB
			All Search		Indoor Search		Visible2Thermal		Thermal2Visible
Type	Method	Venue	Rank-1	mAP	Rank-1	mAP	Rank-1	mAP	Rank-1	mAP
SVI-ReID	JSIA-ReID [36]	AAAI’20	38.1	36.9	43.8	52.9	48.5	49.3	48.1	48.9
	DDAG [51]	ECCV’20	54.8	53.0	61.0	68.0	69.4	63.5	68.1	61.8
	AGW [53]	TPAMI’21	47.5	47.7	54.2	63.0	70.1	66.4	70.5	65.9
	NFS [4]	CVPR’21	56.9	55.5	62.8	69.8	80.5	72.1	78.0	69.8
	LbA [28]	ICCV’21	55.4	54.1	58.5	66.3	74.2	67.6	72.4	65.5
	CAJ [52]	ICCV’21	69.9	66.9	76.3	80.4	85.0	79.1	84.8	77.8
	MPANet [41]	CVPR’21	70.6	68.2	76.7	81.0	83.7	80.9	82.8	80.7
	DART [48]	CVPR’22	68.7	66.3	72.5	78.2	83.6	75.7	82.0	73.8
	FMCNet [56]	CVPR’22	66.3	62.5	68.2	74.1	89.1	84.4	88.4	83.9
	MAUM [25]	CVPR’22	71.7	68.8	77.0	81.9	87.9	85.1	87.0	84.3
	MID [17]	AAAI’22	60.3	59.4	64.9	70.1	87.5	84.9	84.3	81.4
	LUPI [1]	ECCV’22	71.1	67.6	82.4	82.7	88.0	82.7	86.8	81.3
	DEEN [57]	CVPR’23	74.7	71.8	80.3	83.3	91.1	85.1	89.5	83.4
	SGIEL [11]	CVPR’23	77.1	72.3	82.1	83.0	92.2	86.6	91.1	85.2
	PartMix [19]	CVPR’23	77.8	74.6	81.5	84.4	85.7	82.3	84.9	82.5
SSVI-ReID	MAUM-50 [25]	CVPR’22	28.8	36.1	-	-	-	-	-	-
	MAUM-100 [25]	CVPR’22	38.5	39.2	-	-	-	-	-	-
	OTLA [38]	ECCV’22	48.2	43.9	47.4	56.8	49.9	41.8	49.6	42.8
	TAA [46]	TIP’23	48.8	42.3	50.1	56.0	62.2	56.0	63.8	56.5
	DPIS [30]	ICCV’23	58.4	55.6	63.0	70.0	62.3	53.2	61.5	52.7
USL-VI-ReID	OTLA [38]	ECCV’22	29.9	27.1	29.8	38.8	32.9	29.7	32.1	28.6
	ADCA [44]	MM’22	45.5	42.7	50.6	59.1	67.2	64.1	68.5	63.8
	NGLR [6]	arXiv’23	50.4	47.4	53.5	61.7	85.6	76.7	82.9	75.0
	MBCCM [16]	arXiv’23	53.1	48.2	55.2	62.0	83.8	77.9	82.8	76.7
	CCLNet [5]	MM’23	54.0	50.2	56.7	65.1	69.9	65.5	70.2	66.7
	GUR* [47]	ICCV’23	61.0	57.0	64.2	69.5	73.9	70.2	75.0	69.9
	PGM [42]	CVPR’23	57.3	51.8	56.2	62.7	69.5	65.4	69.9	65.2
	MMM(Ours)	-	61.6	57.9	64.4	70.4	89.7	80.5	85.8	77.0

4 Experiments

In this section, we conduct comprehensive experiments to verify the effectiveness of our MMM. First, we compare our MMM with several state-of-the-art methods under three settings, i.e., supervised visible-infrared person ReID (SVI-ReID), semi-supervised visible-infrared person ReID (SSVI-ReID) and unsupervised visible-infrared person ReID (USL-VI-ReID). After that, we perform ablation studies to evaluate the effectiveness of each module in our MMM. Finally, we perform a discussion and analysis of the hyper-parameters and visualization. If not specified, we conduct analysis experiments on SYSU-MM01 in the single-shot & all-search mode.

Table 2: Ablation studies on SYSU-MM01 in all search mode and indoor search mode. “Baseline” means the model trained only with the CMC module. Rank-R accuracy(%) and mAP(%) are reported.

	Method				All Search					Indoor Search
Order	Baseline	MMLM	Intra	Inter	Rank-1	Rank-5	Rank-10	Rank-20	mAP	Rank-1	Rank-5	Rank-10	Rank-20	mAP
1	✓				51.74	78.67	87.87	94.76	49.81	56.34	84.66	92.77	96.98	64.46
2	✓	✓			55.15	81.65	90.53	96.46	52.21	58.76	85.21	93.06	97.16	65.47
3	✓	✓	✓		58.48	83.69	91.79	97.15	55.05	62.19	86.95	93.60	97.64	68.09
4	✓	✓		✓	57.26	82.34	90.84	96.93	53.81	60.26	85.77	93.16	97.36	66.66
5	✓	✓	✓	✓	61.56	85.66	93.33	98.03	57.92	64.37	88.80	95.01	98.20	70.40

4.1 Experimental Setting

Dataset. We evaluate our MMM on two benchmarks, i.e., SYSU-MM01 [40] and RegDB [27]. More detailed explanations are presented in supplementary materials.

Evaluation Protocols. Cumulative Matching Characteristics (CMC) [50] and mean Average Precision (mAP) are adopted as the evaluation metrics on the two datasets to evaluate the performance of our MMM quantitatively. For fair comparisons, we report the results of all-search mode and indoor-search mode with the official code on SYSU-MM01. Following [52], we also report the results on RegDB by randomly splitting the training and testing set 10 times in visible-to-thermal and thermal-to-visible modes.
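For readers unfamiliar with these metrics, the sketch below computes the Rank-k hit and the average precision for a single query from a gallery already sorted by similarity. It is a simplified illustration with the hypothetical name `rank_k_and_ap`; it omits protocol details such as the dataset-specific gallery construction handled by the official evaluation code.

```python
def rank_k_and_ap(ranked_gallery_ids, query_id, k=1):
    """Return (Rank-k hit as 0 or 1, average precision) for one query."""
    hit = int(query_id in ranked_gallery_ids[:k])
    matches, precision_sum = 0, 0.0
    for position, gallery_id in enumerate(ranked_gallery_ids, start=1):
        if gallery_id == query_id:
            matches += 1
            precision_sum += matches / position  # precision at this recall point
    average_precision = precision_sum / matches if matches else 0.0
    return hit, average_precision


# Gallery sorted from most to least similar; the query has identity 1.
hit_at_1, ap = rank_k_and_ap([3, 1, 1, 2], query_id=1, k=1)
hit_at_2, _ = rank_k_and_ap([3, 1, 1, 2], query_id=1, k=2)
```

Rank-k accuracy is the mean of the hit values over all queries, and mAP is the mean of the per-query average precisions.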

4.2 Implementation Details

We adopt ResNet50, which is initialized with the ImageNet pre-trained weights, as the shared backbone to extract 2048-d features. Our MMM is implemented in PyTorch. During training, random horizontal flipping and random crop are used for data augmentation [52]. The total number of training epochs is 80. Additional detailed settings are presented in the supplementary materials.

4.3 Results and Analysis

To clearly demonstrate the effectiveness of our MMM, we compare our MMM with several state-of-the-art methods under three settings, i.e., SVI-ReID, SSVI-ReID, and USL-VI-ReID. The quantitative results on SYSU-MM01 and RegDB are shown in Tab. 1.

Comparison with SVI-ReID Methods. Surprisingly, our MMM performs better than several supervised methods on SYSU-MM01, including JSIA-ReID [36], DDAG [51], AGW [53], NFS [4], and LbA [28]. The results show the effectiveness of our MMM. However, we have to acknowledge that there is still a certain gap between our MMM and many SVI-ReID methods due to the absence of cross-modality data annotations.

Comparison with SSVI-ReID Methods. We compared our MMM with five state-of-the-art SSVI-ReID methods. Notably, our MMM achieved superior performance without the utilization of any annotations, surpassing all SSVI-ReID methods that rely on limited annotations.

Comparison with USL-VI-ReID Methods. Compared with seven state-of-the-art USL-VI-ReID methods, our MMM outperforms most of them by a significant margin, which demonstrates its effectiveness. As shown in Tab. 1, our MMM achieves 61.6% in Rank-1 and 57.9% in mAP on SYSU-MM01. Moreover, the results on RegDB are striking: our MMM improves Rank-1 and mAP by large margins of 15.8% and 10.3% over GUR* in visible-to-thermal mode. These results even surpass most SVI-ReID methods.

The above results clearly show that our MMM is effective, which highlights the significant potential of our MMM in addressing USL-VI-ReID challenges.

4.4 Ablation Study

To further analyze the effectiveness of the Multi-Memory Learning and Matching (MMLM) and Soft Cluster-level Alignment (SCA) modules, we conduct ablation studies on SYSU-MM01 under both all-search and indoor-search modes. The results are reported in Tab. 2.

Baseline. Order 1 denotes that the model is trained only with the CMC module. Although it achieves a promising performance on SYSU-MM01, it does not directly establish relations between the two modalities.

Effectiveness of MMLM. The effectiveness of the MMLM module is revealed by comparing Order 1 and Order 2. The MMLM improves Rank-1 by 3.41% and mAP by 2.40% on SYSU-MM01. The results, combined with Fig. 1, demonstrate that the MMLM can help align visible and infrared pseudo-labels to establish cross-modality correspondences.

Effectiveness of Intra in SCA. As shown in Order 3 of Tab. 2, the performance improves to 58.48% in Rank-1 and 55.05% in mAP when the cluster-level intra-modality alignment loss (Intra) in SCA is added, which shows the effectiveness of Intra in reducing the intra-modality discrepancy.

Effectiveness of Inter in SCA. The cluster-level inter-modality alignment loss (Inter) is proposed to reduce the inter-modality discrepancy. When it is added on top of MMLM (Order 4), our MMM reaches 57.26% in Rank-1 and 53.81% in mAP. Moreover, when Inter is combined with Intra (Order 5), our MMM achieves the best performance with 61.56% in Rank-1 and 57.92% in mAP, which surpasses the baseline by a large margin of 9.82% in Rank-1 and 8.11% in mAP.

4.5 Analysis of Hyper-parameters

In Fig. 3, we analyze the key hyper-parameters of our MMM on SYSU-MM01, i.e., the number of memories $n$, $\lambda_{Intra}$, and $\lambda_{Inter}$. In Fig. 3 (a), we vary the number of memories from 1 to 5 while keeping $\lambda_{Intra}$ and $\lambda_{Inter}$ fixed, which shows that our MMM achieves the best performance, 61.56% in Rank-1 and 57.92% in mAP, when $n=4$. Moreover, to balance the contributions of the cluster-level intra- and inter-modality alignment losses in SCA, we study the effect of $\lambda_{Intra}$ and $\lambda_{Inter}$ by fixing one and adjusting the other. Specifically, we keep $\lambda_{Inter}=0.05$ and tune $\lambda_{Intra}$ over $[0.1,0.5,1.0,1.5,2.0]$ (Fig. 3 (b)), and we fix $\lambda_{Intra}=0.5$ and vary $\lambda_{Inter}$ over $[0.01,0.025,0.05,0.075,0.1]$ (Fig. 3 (c)). Our MMM achieves high accuracy under the different combinations of $\lambda_{Intra}$ and $\lambda_{Inter}$, which shows that its performance is not sensitive to these two weights. The best performance is achieved with $\lambda_{Intra}=0.5$ and $\lambda_{Inter}=0.05$.

4.6 Analysis of Visualization

To further illustrate the effectiveness of MMM, we visualize the intra-identity and inter-identity distances on SYSU-MM01 in Fig. 4. As shown in Fig. 4 (a)-(d), as the proposed modules are added, the mean intra-identity distance gradually decreases while the mean inter-identity distance gradually increases, so the intra-identity and inter-identity distance distributions are pushed apart ($\delta_{1}<\delta_{2}<\delta_{3}<\delta_{4}$). The results show that our MMM can effectively reduce the cross-modality distances between samples of the same identity and enlarge the distances between samples of different identities.

Moreover, we visualize the pseudo-labels of the same identity across modalities. We randomly choose 3 person identities, each of which consists of 4 visible images and infrared images. As shown in Fig. 5, persons of the same identity in different modalities have the same pseudo-label in our MMM (right), in contrast to GUR (left), which shows that our MMM can establish more reliable cross-modality correspondences.

5 Conclusion

In this paper, we introduce a metric, the Adjusted Rand Index, to measure the quality of clustered pseudo-labels and cross-modality correspondences, and we investigate how to establish reliable cross-modality correspondences for USL-VI-ReID. To this end, we propose a Multi-Memory Matching (MMM) framework. First, we design a Cross-Modality Clustering (CMC) module to generate pseudo-labels. Unlike previous methods, we employ multi-memory in the Multi-Memory Learning and Matching (MMLM) module to capture individual nuances and establish reliable cross-modality correspondences. Additionally, we present a Soft Cluster-level Alignment (SCA) module to reduce the cross-modality gap while mitigating the effect of noisy pseudo-labels. Comprehensive experimental results show that our MMM can establish reliable cross-modality correspondences and outperforms existing USL-VI-ReID methods on SYSU-MM01 and RegDB.

References

Alehdaghi et al. [2022] Mahdi Alehdaghi, Arthur Josi, Rafael M. O. Cruz, and Eric Granger. Visible-infrared person re-identification using privileged intermediate information. In ECCV, pages 720–737, 2022.
Arpit et al. [2017] Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron C. Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. In ICML, pages 233–242, 2017.
Chen et al. [2021a] Hao Chen, Benoit Lagadec, and François Brémond. ICE: inter-instance contrastive encoding for unsupervised person re-identification. In ICCV, pages 14940–14949, 2021a.
Chen et al. [2021b] Yehansen Chen, Lin Wan, Zhihang Li, Qianyan Jing, and Zongyuan Sun. Neural feature search for rgb-infrared person re-identification. In CVPR, pages 587–597, 2021b.
Chen et al. [2023] Zhong Chen, Zhizhong Zhang, Xin Tan, Yanyun Qu, and Yuan Xie. Unveiling the power of clip in unsupervised visible-infrared person re-identification. In ACM MM, pages 3667–3675, 2023.
Cheng et al. [2023] De Cheng, Xiaojian Huang, Nannan Wang, Lingfeng He, Zhihui Li, and Xinbo Gao. Unsupervised visible-infrared person reid by collaborative learning with neighbor-guided label refinement. ArXiv:2305.12711, 2023.
Cho et al. [2022] Yoonki Cho, Woo Jae Kim, Seunghoon Hong, and Sung-Eui Yoon. Part-based pseudo label refinement for unsupervised person re-identification. In CVPR, pages 7298–7308, 2022.
Choi et al. [2020] Seokeon Choi, Sumin Lee, Youngeun Kim, Taekyung Kim, and Changick Kim. Hi-cmd: Hierarchical cross-modality disentanglement for visible-infrared person re-identification. In CVPR, pages 10254–10263, 2020.
Dai et al. [2022] Zuozhuo Dai, Guangyuan Wang, Weihao Yuan, Siyu Zhu, and Ping Tan. Cluster contrast for unsupervised person re-identification. In ACCV, pages 319–337, 2022.
Ester et al. [1996] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, pages 226–231, 1996.
Feng et al. [2023] Jiawei Feng, Ancong Wu, and Wei-Shi Zheng. Shape-erased feature learning for visible-infrared person re-identification. In CVPR, pages 22752–22761, 2023.
Fu et al. [2019] Yang Fu, Yunchao Wei, Guanshuo Wang, Yuqian Zhou, Honghui Shi, and Thomas S. Huang. Self-similarity grouping: A simple unsupervised cross domain adaptation approach for person re-identification. In ICCV, pages 6111–6120, 2019.
Ge et al. [2020a] Yixiao Ge, Dapeng Chen, and Hongsheng Li. Mutual mean-teaching: Pseudo label refinery for unsupervised domain adaptation on person re-identification. In ICLR, 2020a.
Ge et al. [2020b] Yixiao Ge, Feng Zhu, Dapeng Chen, Rui Zhao, and Hongsheng Li. Self-paced contrastive learning with hybrid memory for domain adaptive object re-id. In NeurIPS, 2020b.
Gretton et al. [2012] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
He et al. [2023] Lingfeng He, Nannan Wang, Shizhou Zhang, Zhen Wang, Xinbo Gao, et al. Efficient bilateral cross-modality cluster matching for unsupervised visible-infrared person reid. ArXiv:2305.12673, 2023.
Huang et al. [2022] Zhipeng Huang, Jiawei Liu, Liang Li, Kecheng Zheng, and Zheng-Jun Zha. Modality-adaptive mixup and invariant decomposition for rgb-infrared person re-identification. In AAAI, pages 1034–1042, 2022.
Hubert and Arabie [1985] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of classification, 2:193–218, 1985.
Kim et al. [2023] Minsu Kim, Seungryong Kim, Jungin Park, Seongheon Park, and Kwanghoon Sohn. Partmix: Regularization strategy to learn part discovery for visible-infrared person re-identification. In CVPR, pages 18621–18632, 2023.
Li et al. [2020] Diangang Li, Xing Wei, Xiaopeng Hong, and Yihong Gong. Infrared-visible cross-modal person re-identification with an X modality. In AAAI, pages 4610–4617, 2020.
Liang et al. [2021] Wenqi Liang, Guangcong Wang, Jianhuang Lai, and Xiaohua Xie. Homogeneous-to-heterogeneous: Unsupervised learning for rgb-infrared person re-identification. IEEE Trans. Image Process., 30:6392–6407, 2021.
Lin et al. [2018] Shan Lin, Haoliang Li, Chang-Tsun Li, and Alex C. Kot. Multi-task mid-level feature alignment network for unsupervised cross-dataset person re-identification. In BMVC, page 9, 2018.
Lin et al. [2019] Yutian Lin, Xuanyi Dong, Liang Zheng, Yan Yan, and Yi Yang. A bottom-up clustering approach to unsupervised person re-identification. In AAAI, pages 8738–8745, 2019.
Lin et al. [2020] Yutian Lin, Lingxi Xie, Yu Wu, Chenggang Yan, and Qi Tian. Unsupervised person re-identification via softened similarity learning. In CVPR, pages 3387–3396, 2020.
Liu et al. [2022] Jialun Liu, Yifan Sun, Feng Zhu, Hongbin Pei, Yi Yang, and Wenhui Li. Learning memory-augmented unidirectional metrics for cross-modality person re-identification. In CVPR, pages 19344–19353, 2022.
MacQueen et al. [1967] James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, pages 281–297, 1967.
Nguyen et al. [2017] Dat Tien Nguyen, Hyung Gil Hong, Ki-Wan Kim, and Kang Ryoung Park. Person recognition system based on a combination of body images from visible light and thermal cameras. Sensors, 17(3):605, 2017.
Park et al. [2021] Hyunjong Park, Sanghoon Lee, Junghyup Lee, and Bumsub Ham. Learning by aligning: Visible-infrared person re-identification using cross-modal correspondences. In ICCV, pages 12026–12035, 2021.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.
Shi et al. [2023] Jiangming Shi, Yachao Zhang, Xiangbo Yin, Yuan Xie, Zhizhong Zhang, Jianping Fan, Zhongchao Shi, and Yanyun Qu. Dual pseudo-labels interactive self-training for semi-supervised visible-infrared person re-identification. In ICCV, pages 11218–11228, 2023.
Song et al. [2020] Liangchen Song, Cheng Wang, Lefei Zhang, Bo Du, Qian Zhang, Chang Huang, and Xinggang Wang. Unsupervised domain adaptive re-identification: Theory and practice. Pattern Recognit., 102:107173, 2020.
Sun et al. [2022] Hanzhe Sun, Jun Liu, Zhizhong Zhang, Chengjie Wang, Yanyun Qu, Yuan Xie, and Lizhuang Ma. Not all pixels are matched: Dense contrastive learning for cross-modality person re-identification. In ACM MM, pages 5333–5341, 2022.
Wang and Zhang [2020] Dongkai Wang and Shiliang Zhang. Unsupervised person re-identification via multi-label classification. In CVPR, pages 10978–10987, 2020.
Wang et al. [2019] Guan’an Wang, Tianzhu Zhang, Jian Cheng, Si Liu, Yang Yang, and Zengguang Hou. Rgb-infrared cross-modality person re-identification via joint pixel and feature alignment. In ICCV, pages 3622–3631, 2019.
Wang et al. [2020a] Guan’an Wang, Yang Yang, Tianzhu Zhang, Jian Cheng, Zengguang Hou, Prayag Tiwari, and Hari Mohan Pandey. Cross-modality paired-images generation and augmentation for rgb-infrared person re-identification. Neural Networks, 128:294–304, 2020a.
Wang et al. [2020b] Guan’an Wang, Yang Yang, Tianzhu Zhang, Jian Cheng, Zengguang Hou, Prayag Tiwari, and Hari Mohan Pandey. Cross-modality paired-images generation and augmentation for rgb-infrared person re-identification. Neural Networks, 128:294–304, 2020b.
Wang et al. [2018] Jingya Wang, Xiatian Zhu, Shaogang Gong, and Wei Li. Transferable joint attribute-identity deep learning for unsupervised person re-identification. In CVPR, pages 2275–2284, 2018.
Wang et al. [2022] Jiangming Wang, Zhizhong Zhang, Mingang Chen, Yi Zhang, Cong Wang, Bin Sheng, Yanyun Qu, and Yuan Xie. Optimal transport for label-efficient visible-infrared person re-identification. In ECCV, pages 93–109, 2022.
Wei et al. [2021] Ziyu Wei, Xi Yang, Nannan Wang, and Xinbo Gao. Syncretic modality collaborative learning for visible infrared person re-identification. In ICCV, pages 225–234, 2021.
Wu et al. [2017] Ancong Wu, Wei-Shi Zheng, Hong-Xing Yu, Shaogang Gong, and Jianhuang Lai. Rgb-infrared cross-modality person re-identification. In ICCV, pages 5390–5399, 2017.
Wu et al. [2021] Qiong Wu, Pingyang Dai, Jie Chen, Chia-Wen Lin, Yongjian Wu, Feiyue Huang, Bineng Zhong, and Rongrong Ji. Discover cross-modality nuances for visible-infrared person re-identification. In CVPR, pages 4330–4339, 2021.
Wu and Ye [2023] Zesen Wu and Mang Ye. Unsupervised visible-infrared person re-identification via progressive graph matching and alternate learning. In CVPR, pages 9548–9558, 2023.
Xuan and Zhang [2021] Shiyu Xuan and Shiliang Zhang. Intra-inter camera similarity for unsupervised person re-identification. In CVPR, pages 11926–11935, 2021.
Yang et al. [2022a] Bin Yang, Mang Ye, Jun Chen, and Zesen Wu. Augmented dual-contrastive aggregation learning for unsupervised visible-infrared person re-identification. In ACM MM, pages 2843–2851, 2022a.
Yang et al. [2023a] Bin Yang, Jun Chen, Cuiqun Chen, and Mang Ye. Dual consistency-constrained learning for unsupervised visible-infrared person re-identification. IEEE Transactions on Information Forensics and Security, 2023a.
Yang et al. [2023b] Bin Yang, Jun Chen, Xianzheng Ma, and Mang Ye. Translation, association and augmentation: Learning cross-modality re-identification from single-modality annotation. IEEE Transactions on Image Processing, 32:5099–5113, 2023b.
Yang et al. [2023c] Bin Yang, Jun Chen, and Mang Ye. Towards grand unified representation learning for unsupervised visible-infrared person re-identification. In ICCV, pages 11069–11079, 2023c.
Yang et al. [2022b] Mouxing Yang, Zhenyu Huang, Peng Hu, Taihao Li, Jiancheng Lv, and Xi Peng. Learning with twin noisy labels for visible-infrared person re-identification. In CVPR, pages 14288–14297, 2022b.
Ye et al. [2018a] Mang Ye, Zheng Wang, Xiangyuan Lan, and Pong C. Yuen. Visible thermal person re-identification via dual-constrained top-ranking. In IJCAI, pages 1092–1099, 2018a.
Ye et al. [2018b] Mang Ye, Zheng Wang, Xiangyuan Lan, and Pong C. Yuen. Visible thermal person re-identification via dual-constrained top-ranking. In IJCAI, pages 1092–1099, 2018b.
Ye et al. [2020] Mang Ye, Jianbing Shen, David J. Crandall, Ling Shao, and Jiebo Luo. Dynamic dual-attentive aggregation learning for visible-infrared person re-identification. In ECCV, pages 229–247, 2020.
Ye et al. [2021] Mang Ye, Weijian Ruan, Bo Du, and Mike Zheng Shou. Channel augmented joint learning for visible-infrared recognition. In ICCV, pages 13547–13556, 2021.
Ye et al. [2022] Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven C. H. Hoi. Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell., pages 2872–2893, 2022.
Zhai et al. [2020] Yunpeng Zhai, Qixiang Ye, Shijian Lu, Mengxi Jia, Rongrong Ji, and Yonghong Tian. Multiple expert brainstorming for domain adaptive person re-identification. In ECCV, pages 594–611, 2020.
Zhang et al. [2023] Guoqing Zhang, Hongwei Zhang, Weisi Lin, Arun Kumar Chandran, and Xuan Jing. Camera contrast learning for unsupervised person re-identification. IEEE Trans. Circuits Syst. Video Technol., 33(8):4096–4107, 2023.
Zhang et al. [2022] Qiang Zhang, Changzhou Lai, Jianan Liu, Nianchang Huang, and Jungong Han. Fmcnet: Feature-level modality compensation for visible-infrared person re-identification. In CVPR, pages 7339–7348, 2022.
Zhang and Wang [2023] Yukang Zhang and Hanzi Wang. Diverse embedding expansion network and low-light cross-modality benchmark for visible-infrared person re-identification. In CVPR, pages 2153–2162, 2023.
Zhang et al. [2021] Yukang Zhang, Yan Yan, Yang Lu, and Hanzi Wang. Towards a unified middle modality learning for visible-infrared person re-identification. In ACM MM, pages 788–796, 2021.

Supplementary Material

VI Overview

In this document, we first introduce the ARI as a metric for evaluating the reliability of cross-modality correspondences and pseudo-labels. Secondly, we present more detailed explanations of datasets. Finally, we supplement the implementation details.

VII ARI as a Metric for Evaluating Unsupervised Cross-Modal Re-Identification

To evaluate the reliability of cross-modality correspondences and pseudo-labels in unsupervised cross-modality re-identification, this work introduces the Adjusted Rand Index (ARI) metric. The ARI measures the similarity between two clusterings of the same data and is calculated as:

\text{ARI}=\frac{\sum_{ij}\binom{n_{ij}}{2}-\left[\sum_{i}\binom{a_{i}}{2}\sum_{j}\binom{b_{j}}{2}\right]/\binom{N}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_{i}}{2}+\sum_{j}\binom{b_{j}}{2}\right]-\left[\sum_{i}\binom{a_{i}}{2}\sum_{j}\binom{b_{j}}{2}\right]/\binom{N}{2}}

(XXVII)

where $n_{ij}$ is the number of samples shared by the $i$ -th pseudo-label cluster and the $j$ -th ground-truth cluster, $a_{i}$ is the number of samples in the $i$ -th cluster of pseudo-labels, $b_{j}$ is the number of samples in the $j$ -th cluster of ground-truth labels, and $N$ is the total number of samples in the dataset. The binomial coefficient $\binom{n}{k}$ , representing the number of ways to choose $k$ elements from $n$ distinct elements, is calculated as:

\binom{n}{k}=\frac{n!}{k!(n-k)!}

(XXVIII)

where $n!$ denotes $n$ factorial, the product of all positive integers up to $n$ , and we take $\binom{n}{k}=0$ when $n<k$ . The ARI equals 1 for perfect agreement, is close to 0 when the agreement is no better than chance, and is negative when the agreement is worse than chance.

To demonstrate the computation of the ARI in the context of unsupervised cross-modal person re-identification, consider two clusterings of the same samples: T (ground-truth labels) and P (pseudo-labels). Note that the ground-truth labels are used only for evaluation and are not used during the training process.

Table III: Confusion matrix between pseudo-labels (rows, clustering P) and ground-truth labels (columns, clustering T).

P \ T	Cluster ‘abc’	Cluster ‘de’	Cluster ‘fgh’	Sum
Cluster ‘ab’	2	0	0	2
Cluster ‘cde’	1	2	0	3
Cluster ‘fgh’	0	0	3	3
Sum	3	2	3	8

The two clusterings are defined as:

• Clustering T: {‘abc’, ‘de’, ‘fgh’}
• Clustering P: {‘ab’, ‘cde’, ‘fgh’}

where a, b, d, f are visible samples and c, e, g, h are infrared samples.

Step 1: Calculating Pairwise Combinations Within Each Clustering

Based on Tab. III, the number of pairwise combinations within each cluster in both clusterings T and P is calculated as follows:

• In Clustering T:
  – Cluster ‘abc’: $\binom{3}{2}=3$ pairs
  – Cluster ‘de’: $\binom{2}{2}=1$ pair
  – Cluster ‘fgh’: $\binom{3}{2}=3$ pairs
• In Clustering P:
  – Cluster ‘ab’: $\binom{2}{2}=1$ pair
  – Cluster ‘cde’: $\binom{3}{2}=3$ pairs
  – Cluster ‘fgh’: $\binom{3}{2}=3$ pairs

Step 2: Identifying Shared Pairs Between Clusterings

The pairs of samples that are grouped together in both Clustering T and Clustering P are identified:

• Shared pairs: ‘ab’, ‘de’, ‘fg’, ‘fh’, ‘gh’
• Total shared pairs: 5 pairs
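The shared pairs can also be enumerated mechanically. The following minimal sketch lists every pair of samples that falls in the same cluster under both clusterings of the toy example; the helper name `co_clustered_pairs` is hypothetical and introduced only for illustration.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return the set of unordered sample pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Every 2-element subset of a cluster is a co-clustered pair.
        pairs.update(frozenset(p) for p in combinations(cluster, 2))
    return pairs


T = ["abc", "de", "fgh"]  # ground-truth clustering
P = ["ab", "cde", "fgh"]  # pseudo-label clustering

# Pairs grouped together in both clusterings.
shared = co_clustered_pairs(T) & co_clustered_pairs(P)
shared_sorted = sorted("".join(sorted(p)) for p in shared)
print(shared_sorted)  # ['ab', 'de', 'fg', 'fh', 'gh']
```

The size of this intersection equals $\sum_{ij}\binom{n_{ij}}{2}$ , the first term in the numerator of the ARI.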

Step 3: Computing the ARI

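The ARI is computed from the following quantities, where $a_{i}$ are the cluster sizes of the pseudo-labels P and $b_{j}$ are the cluster sizes of the ground-truth labels T:

• $\sum_{ij}\binom{n_{ij}}{2}=5$
• $\sum_{i}\binom{a_{i}}{2}=1+3+3=7$
• $\sum_{j}\binom{b_{j}}{2}=3+1+3=7$
• $\binom{N}{2}=\binom{8}{2}=28$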

Substituting these values into the ARI formula:

\text{ARI}=\frac{5-\frac{7\times 7}{28}}{\frac{7+7}{2}-\frac{7\times 7}{28}}=\frac{5-1.75}{7-1.75}=\frac{3.25}{5.25}\approx 0.619

The ARI value of approximately 0.619 indicates a moderate level of agreement between Clustering T and Clustering P: the two clusterings differ only in the assignment of sample c, so they are largely consistent but not perfectly aligned. This example demonstrates the utility of ARI in providing an objective assessment of the reliability of cross-modality correspondences.
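The worked example can be checked numerically. The sketch below evaluates the ARI formula directly from two label lists using only the Python standard library; the function name `adjusted_rand_index` and the integer label encoding are choices made for this illustration, not part of the released code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(pseudo, truth):
    """ARI between two labelings of the same samples (formula from this section)."""
    n = len(pseudo)
    contingency = Counter(zip(pseudo, truth))  # n_ij
    a = Counter(pseudo)                        # a_i: pseudo-label cluster sizes
    b = Counter(truth)                         # b_j: ground-truth cluster sizes

    sum_nij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())

    expected = sum_a * sum_b / comb(n, 2)      # chance-level term
    max_index = (sum_a + sum_b) / 2
    # Note: the denominator is zero for degenerate inputs (e.g. both labelings
    # put every sample in one cluster); this sketch does not handle that case.
    return (sum_nij - expected) / (max_index - expected)


# Samples a..h in order; T = {abc, de, fgh}, P = {ab, cde, fgh}.
truth = [0, 0, 0, 1, 1, 2, 2, 2]
pseudo = [0, 0, 1, 1, 1, 2, 2, 2]

ari = adjusted_rand_index(pseudo, truth)
print(round(ari, 3))  # 0.619
```

Because the formula is symmetric in the two clusterings, swapping the arguments gives the same value.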

VIII Dataset

We evaluate our MMM on two benchmarks, i.e., SYSU-MM01 and RegDB. SYSU-MM01 is a large-scale visible-infrared person ReID dataset collected from four visible cameras and two infrared cameras in both indoor and outdoor scenes. In total, the dataset contains 287,628 visible images and 15,792 infrared images of 491 identities. Among them, 22,258 visible images and 11,909 infrared images of 395 identities are used as the training set. In addition, 3,803 infrared images are used as the query set and 301 visible images are randomly selected to make up the gallery set. RegDB is a relatively small dataset, which contains 412 identities with 4,120 visible images and 4,120 infrared images. The dataset is divided at random, with half for training and the other half for testing.

IX Implementation Details

Our proposed framework is implemented in PyTorch. At each training step, we randomly sample 8 IDs and choose 4 visible and 4 infrared images for each ID to form a batch; training images are resized to $288\times 144$ . The model is trained for 80 epochs with the SGD optimizer, using a momentum of 0.9 and a weight decay of $5\times 10^{-4}$ . The Intra module is added from the $1^{st}$ epoch and the Inter module is added from the $15^{th}$ epoch. The loss temperature $\tau$ is set to 0.05.
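The batch construction described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name `sample_batch`, the dictionary inputs, and the fallback to sampling with replacement when an ID has fewer than 4 images in a modality are all hypothetical, and the sketch assumes each sampled ID (pseudo-label) already has images in both modalities.

```python
import random


def sample_batch(vis_by_id, ir_by_id, num_ids=8, per_modality=4, rng=random):
    """Sample num_ids IDs, then per_modality visible and infrared images per ID.

    vis_by_id, ir_by_id: dicts mapping an ID (pseudo-label) to a list of images.
    Returns a list of (id, modality, image) triples.
    """
    # Assumption: only IDs present in both modalities are eligible.
    shared_ids = sorted(set(vis_by_id) & set(ir_by_id))
    chosen = rng.sample(shared_ids, num_ids)

    batch = []
    for pid in chosen:
        for modality, pool in (("visible", vis_by_id[pid]),
                               ("infrared", ir_by_id[pid])):
            if len(pool) >= per_modality:
                picks = rng.sample(pool, per_modality)
            else:
                # Assumed fallback: sample with replacement for small pools.
                picks = rng.choices(pool, k=per_modality)
            batch.extend((pid, modality, img) for img in picks)
    return batch


# Toy data: 10 IDs with 5 visible and 3 infrared images each (hypothetical names).
vis_by_id = {i: [f"v{i}_{k}.jpg" for k in range(5)] for i in range(10)}
ir_by_id = {i: [f"r{i}_{k}.jpg" for k in range(3)] for i in range(10)}

batch = sample_batch(vis_by_id, ir_by_id, rng=random.Random(0))
```

With the settings in this section, each batch holds 8 × (4 + 4) = 64 images, half visible and half infrared.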