Local-Aware Global Attention Network for Person Re-Identification Based on Body and Hand Images

Nathanael L. Baisa

Nathanael L. Baisa is with the School of Computer Science and Informatics, De Montfort University, Leicester LE1 9BH, UK. Email: [email protected].

Abstract

Learning representative, robust and discriminative information from images is essential for effective person re-identification (Re-Id). In this paper, we propose a compound approach for end-to-end discriminative deep feature learning for person Re-Id based on both body and hand images. We carefully design the Local-Aware Global Attention Network (LAGA-Net), a multi-branch deep network architecture consisting of one branch for spatial attention, one branch for channel attention, one branch for global feature representations and another branch for local feature representations. The attention branches focus on the relevant features of the image while suppressing the irrelevant backgrounds. The global and local branches intend to capture global context and fine-grained information, respectively. A set of ablation studies shows that each component contributes to the increased performance of the LAGA-Net. Extensive evaluations on four popular body-based person Re-Id benchmarks and two publicly available hand datasets demonstrate that our proposed method consistently outperforms existing state-of-the-art methods.

Index Terms:
Person re-identification, Deep representation learning, Attention mechanisms, Global features, Part-level features.

I Introduction

Person re-identification (Re-Id), matching a particular person across different times, places, or cameras, has recently received a lot of attention from both industry and academia for applications such as intelligent video surveillance. It is currently the main component of visual tracking in both single-camera [1] and multi-camera [2] settings. It is also similar to image retrieval in many aspects. Given a query image, person Re-Id ranks gallery images in terms of similarity to the query image. The image can be of a person’s body, hand, face, etc. In this process, each image is represented with a feature embedding. Learning robust and discriminative feature representations is crucial to overcome the many challenges that person Re-Id faces. These challenges [3] include pose variations, occlusion, viewpoint changes, lighting changes, background clutter, noisy labels, etc. Various efforts have been made to address these challenges [4], for instance, considering the whole body [5, 6, 7], body parts [8, 9, 10] and attention mechanisms [11, 12, 13] for learning robust and discriminative feature representations from person body images in uncontrolled environments for better matching.

Person Re-Id for biometric applications has also recently received a lot of attention [14]. Hand images, one of the primary biometric traits [15, 16], deliver discriminative features for biometric person recognition. Hand images not only have less variability than other biometric modalities but also have strong and diverse features which remain relatively stable after adulthood [16, 17, 18]. Because of this, there is strong potential in investigating hand images captured by digital cameras for person recognition, especially for criminal investigation in uncontrolled environments, since they are often the only available information in cases of serious crime such as sexual abuse.

In this work, we propose a compound approach for end-to-end discriminative deep feature representation learning for person Re-Id based on both body and hand images. We carefully design the Local-Aware Global Attention Network (LAGA-Net), a multi-branch deep network consisting of 4 branches to learn deep global, attentive and local (part-level) feature representations which are robust and discriminative enough for dealing with person Re-Id challenges. Each branch has its own merit within the goal of the proposed network architecture for learning robust and discriminative deep feature representations. Specifically, the attention branches, namely the channel attention branch and the spatial attention branch, focus on the relevant features of the image while suppressing the irrelevant backgrounds. The spatial attention branch incorporates relative positional encodings into the spatial attention module to maintain translation equivariance. The global and local branches intend to capture global context and fine-grained information, respectively. By carefully designing this end-to-end compound method, we show that it is possible to effectively learn robust and discriminative feature representations for person Re-Id based on both body and hand images, unlike the previous methods [5, 6, 7, 8, 9, 10, 11, 12, 13]. Our contributions can be summarized as follows.

  1. We propose a multi-branch deep network that incorporates both channel and spatial attention modules in branches, in addition to global (without attention) and local branches, for person Re-Id based on both body and hand images. The network is computationally efficient and flexible in terms of the backbone architecture.

  2. We include relative positional encodings in the spatial attention module, considering height and width independently, to capture the spatial positions of pixels. This overcomes a weakness of attention mechanisms, namely their equivariance to pixel shuffling, for efficiently re-identifying a person based on body and hand images.

  3. We make extensive evaluations on four popular body datasets: Market-1501 [19], DukeMTMC-Re-ID [20], CUHK03 [5], MSMT17 [3] and two publicly available hand datasets: 11k [21], HD [22], and LAGA-Net significantly outperforms existing state-of-the-art methods on these datasets.

The rest of the paper is organized as follows. After the discussion of related work in Section II, the proposed method is described in Section III, including the attention modules, the overall architecture of the LAGA-Net and the loss functions. The experimental results are analyzed and compared in Section IV, followed by the discussion in Section V and the main conclusions along with suggestions for future work in Section VI.

II Related Work

II-A Body-based Person Re-Id

Many person Re-Id methods have been proposed over the last few years, with the larger performance gains obtained by methods based on deep learning, using both supervised [8, 9, 10, 11, 12, 13] and unsupervised [23, 24] learning approaches. Some of these works are based on learning global deep feature representations [5, 6, 7]. To overcome the performance limitations of person Re-Id methods based on global features, researchers shifted their attention to learning deep local (part-level) feature representations by considering person poses [8]. In this case, external pose (skeleton) estimation methods have been used to leverage human part cues. However, this comes with another disadvantage, since errors in the pose estimation can propagate to the re-identification stage. Masks are also used as external cues to remove background clutter at the pixel level and retrieve body shape information [25]. Using these external methods brings an additional computational burden. To overcome this, uniform partitioning of the images without relying on external methods was introduced for body-based person Re-Id in [9, 10] and for hand-based person identification in [16]. Though this approach gave a performance boost over the previous methods, the performance of the methods based on it is not sufficient to handle the challenges that person Re-Id methods face, probably due to the pose misalignment problem.

Recently, the self-attention mechanism, an integral component of Transformers [26], has received great attention in deep learning. The self-attention mechanism captures long-term information and interactions amongst all entities (e.g. pixels, channels, sequence elements, etc.) of the input data. It updates each pixel, for instance, by aggregating global information from all pixels in the input image. The attention mechanism has been used in [11, 12, 13] for person Re-Id by considering both channel and spatial attention, which compute correlations between all the channels and all the pixels of the input feature map, respectively. However, these methods use the attention modules across all the network layers, which makes them computationally inefficient, since the self-attention computation is very expensive if the dimension of the input data is very large. Furthermore, these methods have limited performance when applied to person body and hand images. By default, the attention mechanism does not model relative or absolute position information. To overcome this, a relative position representation of a sequence element with respect to its neighbours was proposed in [27] for natural language processing, and later incorporated into a standalone global attention-based deep network for images that models pixel interactions without using convolutions [28]. Unlike these methods, we use the attention mechanism along with convolution operations, applying the attention modules only to low-resolution feature maps in later stages of a deep network, in branches along with global (without attention) and local branches, for better computational efficiency as well as accuracy for person Re-Id based on not only body images but also hand images.

II-B Hand-based Person Re-Id

Both traditional and deep learning approaches have been combined in [21, 17] to develop person identification using hand images. After training a convolutional neural network (CNN) on digital hand images (RGB), the method in [21] used the network as a feature extractor to obtain CNN features, which were fed into a set of support vector machine (SVM) classifiers. The work in [17] used a rather similar approach with an additional data type for fusion, near-infrared (NIR) images. These methods are not end-to-end. An end-to-end approach considering both horizontal and vertical uniform partitioning has been proposed in [16]. However, all these methods have limited performance. Unlike these methods, our proposed method is a compound approach built on a multi-branch network architecture which is effective not only on body images but also on hand images for re-identifying individuals.

III Proposed Method

In this section, we introduce the two attention modules, followed by the overall architecture of the LAGA-Net and the loss functions used. The goal of the attention modules is to suppress irrelevant backgrounds while focusing on discriminative information of person appearances.

III-A Channel Attention Module

The channel attention module (CAM) aims to aggregate channel-wise feature-level information, since some channels in higher convolutional layers are semantically related; i.e., CAM computes the correlations between all the channels. The structure of CAM is given in Fig. 1a. Given the input feature map $\mathbf{E}_i \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$ and $W$ are the number of channels, height and width of the feature map, respectively, we first reshape it to produce the matrices of keys, queries and values, respectively, as $\mathbf{K} \in \mathbb{R}^{C \times HW}$, $\mathbf{Q} \in \mathbb{R}^{C \times HW}$ and $\mathbf{V} \in \mathbb{R}^{C \times HW}$. Then, the global channel attention map $\mathbf{A}_c \in \mathbb{R}^{C \times C}$ is computed using the dot-product of the query with all keys as

$$\mathbf{A}_c = \rho(\mathbf{K}\mathbf{Q}^T) \tag{1}$$

where $\mathbf{Q}^T$ denotes the matrix transpose of $\mathbf{Q}$, and $\rho$ represents the softmax normalization along each row separately. The self-attended output feature map $\mathbf{E}_o \in \mathbb{R}^{C \times H \times W}$ for the channel attention is given by

$$\mathbf{E}_o = \gamma(\mathbf{A}_c\mathbf{V}) + \mathbf{E}_i \tag{2}$$

where $\gamma$ is initialized as 0 and gradually learns to assign more weight to adjust the impact of the CAM. $\mathbf{A}_c\mathbf{V}$ means that the matrix of values $\mathbf{V}$ is weighted by the attention score $\mathbf{A}_c$.
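
As an illustration, Eqs. (1) and (2) can be sketched in a few lines of NumPy for a single feature map. This is a minimal sketch rather than the authors' implementation: the function names, the random input and the small tensor size are assumptions made for the example, and $\gamma$ is passed as a plain number instead of a learned parameter.

```python
import numpy as np

def softmax_rows(x):
    """Softmax along the last axis, so each row sums to 1."""
    x = x - x.max(axis=-1, keepdims=True)  # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def channel_attention(E_i, gamma=0.0):
    """Sketch of Eq. (1)-(2) for one feature map of shape (C, H, W)."""
    C, H, W = E_i.shape
    K = Q = V = E_i.reshape(C, H * W)   # keys, queries, values are plain reshapes
    A_c = softmax_rows(K @ Q.T)         # (C, C) channel attention map, Eq. (1)
    out = (A_c @ V).reshape(C, H, W)    # values re-weighted across channels
    return gamma * out + E_i, A_c       # residual connection, Eq. (2)

# Hypothetical toy input: 8 channels on a 4 x 2 grid.
rng = np.random.default_rng(0)
E_i = rng.standard_normal((8, 4, 2))
E_o, A_c = channel_attention(E_i, gamma=0.0)
```

With $\gamma = 0$, as at initialization, the module reduces to the identity mapping, so the branch starts from the backbone features and only gradually mixes in the attended ones.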

III-B Spatial Attention Module with Relative Positional Encodings

The goal of the spatial attention module is to aggregate the semantically similar pixels in the spatial domain of the input feature map. Though the spatial attention mechanism attends to the entire input feature map based on content (pixel values), it does not take into account the spatial positions of pixels, which makes it equivariant to pixel shuffling. To overcome this, we incorporate relative positional encodings along the rows (height) and columns (width), which is computationally efficient, so that the module maintains translation equivariance; i.e., translating (shifting) the input pixels also translates the output pixels by the same amount. The structure of the Spatial Attention Module with Relative Positional Encodings (SAM-RPE) is given in Fig. 1b. Given the input feature map $\mathbf{E}_i \in \mathbb{R}^{C \times H \times W}$, the matrices of keys $\mathbf{K} \in \mathbb{R}^{d_k \times HW}$, queries $\mathbf{Q} \in \mathbb{R}^{d_k \times HW}$ and values $\mathbf{V} \in \mathbb{R}^{C \times HW}$ are obtained by transforming it through learnable weight matrices $\mathbf{W}_K$, $\mathbf{W}_Q$ and $\mathbf{W}_V$, respectively, where $d_k = \frac{C}{8}$ is the channel dimension of the keys and queries. The learnable weight matrices $\mathbf{W}_K$, $\mathbf{W}_Q$ and $\mathbf{W}_V$ are implemented using independent pointwise ($1 \times 1$) convolution layers with batch normalization and ReLU activation. Thus, the global spatial attention map $\mathbf{A}_s \in \mathbb{R}^{HW \times HW}$ is computed as

$$\mathbf{A}_s = \rho(\mathbf{K}^T\mathbf{Q}) \tag{3}$$

where $\mathbf{K}^T$ denotes the matrix transpose of $\mathbf{K}$, and $\rho$ represents the softmax normalization along each row separately.
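
The content-based part of the spatial attention can be sketched as follows. This is a simplified illustration under stated assumptions: a pointwise ($1 \times 1$) convolution is written as a matrix product over the channel axis, the batch normalization and ReLU that follow it in the paper are omitted, and the weights are random placeholders rather than learned values. The function also returns the product $\mathbf{V}\mathbf{A}_s$, which is the content term used later in Eq. (7).

```python
import numpy as np

def softmax_rows(x):
    """Softmax along the last axis, so each row sums to 1."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def spatial_attention(E_i, W_K, W_Q, W_V):
    """Sketch of Eq. (3) plus the content term V A_s for one (C, H, W) map."""
    C, H, W = E_i.shape
    X = E_i.reshape(C, H * W)
    K = W_K @ X                     # (d_k, HW): 1x1 conv as a channel-mixing matmul
    Q = W_Q @ X                     # (d_k, HW)
    V = W_V @ X                     # (C, HW)
    A_s = softmax_rows(K.T @ Q)     # (HW, HW) spatial attention map, Eq. (3)
    return V @ A_s, A_s             # (C, HW) attended values and the map

# Hypothetical toy sizes: C = 16, so d_k = C / 8 = 2, on a 4 x 2 grid.
rng = np.random.default_rng(0)
C, H, W = 16, 4, 2
d_k = C // 8
E_i = rng.standard_normal((C, H, W))
W_K = rng.standard_normal((d_k, C))
W_Q = rng.standard_normal((d_k, C))
W_V = rng.standard_normal((C, C))
VA_s, A_s = spatial_attention(E_i, W_K, W_Q, W_V)
```

The attention map grows as $(HW)^2$, which is one reason for applying the module only to low-resolution feature maps.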

For computational efficiency, we consider the relative spatial positions along the height (rows) and width (columns) separately. To compute these relative positional attentions, we first need to represent relative shifts along the height or width of the input feature map. Let a relative position embedding for the height that needs to be learned be $\mathbf{R}_H \in \mathbb{R}^{(2H-1) \times d_k}$, where $H$ is the height of the input feature map and $d_k = \frac{C}{8}$ is the number of channels. Each row of $\mathbf{R}_H$ corresponds to a possible vertical shift, from $-(H-1)$ to $H-1$. The relative shifts in the matrix $\mathbf{R}_H$ need to be represented using absolute shifts. To do this, the re-indexing tensor $\mathbf{I}^H \in \mathbb{R}^{H \times W \times (2H-1)}$, which is used as a mask, can be defined as

$$\mathbf{I}^H_{h,i,r} = \begin{cases} 1, & \text{if } i-h=r \;\&\; |i-h| \leq H \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

where $h \in \{0,\dots,H-1\}$, $i \in \{0,\dots,W-1\}$ and $r \in \{-(H-1),\dots,0,\dots,H-1\}$.

By reshaping $\mathbf{I}^H$ to $\mathbf{I}_H \in \mathbb{R}^{HW \times (2H-1)}$, a position embedding tensor with indices of absolute shifts for the height, $\mathbf{P}_H \in \mathbb{R}^{HW \times d_k}$, is given by

$$\mathbf{P}_H = \mathbf{I}_H\mathbf{R}_H \tag{5}$$

Then, the self-attended feature map $\mathbf{E}_H \in \mathbb{R}^{C \times H \times W}$ corresponding to the height relative position embedding $\mathbf{R}_H$, which is used as keys implicitly ($\mathbf{P}_H$ explicitly), is computed as

$$\mathbf{E}_H = \mathbf{V}(\mathbf{P}_H\mathbf{Q}) \tag{6}$$

where $\mathbf{P}_H\mathbf{Q}$ corresponds to the height relative positional attention.
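
The shapes in Eqs. (4) to (6) can be checked with a short NumPy sketch that follows the equations literally. It is only an illustration: the sketch assumes $W \leq H$ (typical for tall person crops) so that every shift $i - h$ indexes one of the $2H-1$ rows of $\mathbf{R}_H$, the shift $r$ is stored at index $r + H - 1$ to keep array indices non-negative, and $\mathbf{Q}$, $\mathbf{V}$ and $\mathbf{R}_H$ are random placeholders instead of learned tensors.

```python
import numpy as np

def height_reindex(H, W):
    """Re-indexing mask of Eq. (4), with shift r stored at index r + H - 1."""
    assert W <= H, "sketch assumes W <= H so that i - h stays in [-(H-1), H-1]"
    I = np.zeros((H, W, 2 * H - 1))
    h = np.arange(H)[:, None]           # (H, 1)
    i = np.arange(W)[None, :]           # (1, W)
    I[h, i, (i - h) + H - 1] = 1.0      # one-hot over the relative shift r = i - h
    return I

def height_relative_attention(Q, V, R_H, H, W):
    """Sketch of Eq. (5)-(6): P_H = I_H R_H and E_H = V (P_H Q)."""
    I_H = height_reindex(H, W).reshape(H * W, 2 * H - 1)
    P_H = I_H @ R_H                     # (HW, d_k), Eq. (5)
    E_H = V @ (P_H @ Q)                 # (C, HW),   Eq. (6)
    return E_H.reshape(-1, H, W), P_H

# Hypothetical toy sizes: C = 16, d_k = 2, feature map 4 x 2.
rng = np.random.default_rng(0)
C, d_k, H, W = 16, 2, 4, 2
Q = rng.standard_normal((d_k, H * W))
V = rng.standard_normal((C, H * W))
R_H = rng.standard_normal((2 * H - 1, d_k))
E_H, P_H = height_relative_attention(Q, V, R_H, H, W)
```

Because each row of $\mathbf{I}_H$ is one-hot, the product in Eq. (5) simply looks up the embedding row for the shift at each position, so only $2H-1$ embeddings are learned for the height.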

The relative position embedding for the width $\mathbf{R}_W \in \mathbb{R}^{(2W-1) \times d_k}$, the re-indexing tensor $\mathbf{I}_W \in \mathbb{R}^{HW \times (2W-1)}$ and its corresponding self-attended feature map $\mathbf{E}_W \in \mathbb{R}^{C \times H \times W}$ can be obtained with a similar approach to the above height formulation since they are symmetric.

Thus, the final self-attended output feature map $\mathbf{E}_o \in \mathbb{R}^{C \times H \times W}$ for the SAM-RPE is given by

$$\mathbf{E}_o = \gamma(\mathbf{V}\mathbf{A}_s + BN(\mathbf{E}_H) + BN(\mathbf{E}_W)) + \mathbf{E}_i \tag{7}$$

where $\gamma$ is initialized as 0 and gradually learns to assign more weight to adjust the impact of the SAM-RPE, and $BN$ is a batch normalization.

Figure 1: Attention modules used in our proposed LAGA-Net: (a) Channel Attention Module (CAM), (b) Spatial Attention Module with Relative Positional Encodings (SAM-RPE).

III-C Network Architecture Overview

The overall architecture of the proposed LAGA-Net is given in Fig. 2. It incorporates two complementary attention modules, channel and spatial. These attention modules are used at the higher levels of the network, for computational efficiency, in branches along with the global (without attention) branch and the local branch, which is obtained by performing uniform horizontal partitioning. As a backbone network, we use ResNet50 [29] pretrained on ImageNet due to its precise architecture with competitive performances in some person Re-Id works [9, 10, 12, 13]. In principle, any network designed for image classification can be adapted, for example the Inception network [30] and DenseNet [31]. We keep the structure of the original ResNet50 up to and including layer 3 unchanged when we modify the backbone network to produce the LAGA-Net. We create 4 independent branches just after layer 3 of the ResNet50 to incorporate the channel and the spatial (with relative positional encodings) attention modules in branches, while keeping one global branch (without attention) and one additional local branch for which we generate 3 horizontal stripes uniformly from the output feature map.

Spatial attention branch: This branch aggregates the semantically similar pixels in the spatial domain of the input feature map and it uses a Global Average Pooling (GAP) layer to summarize the 3D tensor of activations to form a 2048-dimensional column feature vector s.

Channel attention branch: This branch aggregates the correlations between all the channels of the input feature map and it uses the GAP layer to summarize the 3D tensor of activations to form a 2048-dimensional column feature vector c.

Global branch: This branch aims to maintain global context information for discriminative feature learning, and the GAP layer is used to summarize the 3D tensor of activations to form a 2048-dimensional column feature vector g.

Local branch: For this branch, we replace the GAP layer with a conventional average pooling (AP) layer to create uniform horizontal partitions (stripes) on the 3D tensor of activations and learn 2048-dimensional part-level features $\mathbf{p}_i$, where the integer $i \in [1, 3]$; the total number of partitions used is 3.
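
The difference between the GAP used by the first three branches and the stripe-wise average pooling of the local branch can be sketched as below. This is an illustrative sketch only: the $2048 \times 24 \times 8$ feature-map size is a hypothetical choice for the example, and the function names are not taken from the paper.

```python
import numpy as np

def global_average_pool(feat):
    """GAP: collapse a (C, H, W) tensor to a single C-dimensional vector."""
    return feat.mean(axis=(1, 2))

def stripe_average_pool(feat, num_stripes=3):
    """Average-pool uniform horizontal stripes: (C, H, W) -> (num_stripes, C)."""
    stripes = np.array_split(feat, num_stripes, axis=1)  # split along the height
    return np.stack([s.mean(axis=(1, 2)) for s in stripes])

# Hypothetical layer-4 output: 2048 channels on a 24 x 8 grid.
rng = np.random.default_rng(0)
feat = rng.standard_normal((2048, 24, 8))
g = global_average_pool(feat)   # one 2048-D vector, as in the s, c and g branches
p = stripe_average_pool(feat)   # three 2048-D part-level vectors p_1, p_2, p_3
```

Together with the three GAP vectors, the three stripe vectors give the six 2048-dimensional features that are passed on to the reduction layers.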

Each reduction layer for each branch (for each stripe in the case of the local branch) is implemented using a new fully-connected (FC) layer, batch normalization (BN), leaky rectified linear unit (LReLU) and dropout with a probability of 0.5 to reduce possible over-fitting. The reduction layers are employed to convert the 2048-dimensional column feature vectors obtained after the GAP and conventional AP layers to 1024-dimensional feature vectors, which in turn are fed into the classification layers. Each classification layer, which is implemented using an FC layer followed by a softmax function, predicts the identity (ID) of each input. In addition, we change the last stride from 2 to 1 in the backbone network, i.e. we remove the last spatial down-sampling operation, which increases the size of the tensor of each branch for improved performance as observed in [7].

Figure 2: Structure of LAGA-Net. Four separate 3D tensors (one for spatial attention branch, one for channel attention branch, one for global branch and the other for local branch) are obtained by passing the input image through the stacked convolutional layers from the backbone network. S3 and S4 are the SAM-RPE after layer 3 and layer 4 (L4) of the ResNet50, respectively. Similarly, C3 and C4 are the CAM after layer 3 and layer 4 of the ResNet50, respectively. Three horizontal partitions (stripes) are also performed on L4 to produce the local branch. Given an input image, six separate 2048-D column feature vectors are obtained by passing it through the backbone network with the 4 branches (the local branch has 3 horizontal stripes). Each classifier predicts the identity (ID) of the input image during training. In case of hand-based person Re-Id, hand input images are used.

III-D Loss Functions

The LAGA-Net is optimized during training by minimizing the loss function $\mathcal{L}$ consisting of the sum of cross-entropy losses over the 6 ID predictions for identification (classification) and the sum of hard mining triplet losses over the 6 ID predictions for metric learning, i.e. each classifier predicts the identity of the input image as shown in Fig. 2. The total loss function $\mathcal{L}$ is given by

$$\mathcal{L} = \sum_{l=1}^{6}\mathcal{L}_{l,xent} + \beta\sum_{l=1}^{6}\mathcal{L}_{l,triplet} \tag{8}$$

where $\beta$ is a hyperparameter balancing the two types of losses.

For the learned features $\mathbf{f}_i$, the cross-entropy loss (softmax loss) with label smoothing [32] is given as:

$$\mathcal{L}_{l,xent} = -\sum_{i=1}^{N} q_{y_i}\log\frac{e^{\mathbf{W}^T_{y_i}\mathbf{f}_i + b_{y_i}}}{\sum_{c=1}^{C} e^{\mathbf{W}^T_c\mathbf{f}_i + b_c}} \tag{9}$$

where $N$ is the batch size, $C$ is the number of classes (identities) in the training dataset, and $\mathbf{W}_c$ and $b_c$ are the weight vector and bias for class $c$, respectively. Note that $z_c = \mathbf{W}^T_c\mathbf{f}_i + b_c$ are the logits or unnormalized probabilities. The ground-truth distribution over labels $q_{y_i}$ by including label smoothing can be given as

$$q_{y_i} = \begin{cases} 1 - \frac{C-1}{C}\epsilon, & \text{if } y_i = y \\ \frac{1}{C}\epsilon, & \text{otherwise} \end{cases} \tag{10}$$

where $y$ is the ground-truth label and $\epsilon$ is a smoothing value.
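
A small NumPy sketch of the label-smoothed cross-entropy is given below for illustration. It rests on two assumptions that should be read as such: the smoothed distribution of Eq. (10) is applied over all $C$ classes, as in the usual label-smoothing formulation, and the default $\epsilon = 0.1$ is a placeholder, not a value taken from this section. The loss is summed over the batch, matching the sum over $i$ in Eq. (9).

```python
import numpy as np

def smoothed_targets(labels, num_classes, eps=0.1):
    """Eq. (10): 1 - (C-1)/C * eps on the true class and eps / C elsewhere."""
    n = labels.shape[0]
    q = np.full((n, num_classes), eps / num_classes)
    q[np.arange(n), labels] = 1.0 - (num_classes - 1) / num_classes * eps
    return q

def label_smoothed_xent(logits, labels, eps=0.1):
    """Cross-entropy between the smoothed targets and softmax(logits)."""
    z = logits - logits.max(axis=1, keepdims=True)            # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))  # log-softmax
    q = smoothed_targets(labels, logits.shape[1], eps)
    return -(q * log_p).sum()                                 # summed over the batch

# Hypothetical batch of N = 2 samples and C = 4 identities.
logits = np.array([[2.0, 0.5, -1.0, 0.0],
                   [0.1, 0.2, 0.3, 3.0]])
labels = np.array([0, 3])
loss = label_smoothed_xent(logits, labels, eps=0.1)
```

Setting `eps=0.0` recovers the standard one-hot cross-entropy, which makes the effect of the smoothing easy to compare.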

Similarly, the batch hard mining triplet loss [33] is given as follows:

$$\mathcal{L}_{l,triplet} = \sum_{i=1}^{P}\sum_{a=1}^{K}\bigg[\alpha + \overbrace{\max_{p=1\dots K}\big\|\mathbf{f}_a^{(i)} - \mathbf{f}_p^{(i)}\big\|_2}^{\text{hardest positive}} - \underbrace{\min_{n=1\dots K,\; j=1\dots P,\; j\neq i}\big\|\mathbf{f}_a^{(i)} - \mathbf{f}_n^{(j)}\big\|_2}_{\text{hardest negative}}\bigg]_{+} \tag{11}$$

where $[A]_{+} = \max(A, 0)$, $\alpha$ is the margin hyperparameter that sets the required gap between intra-class and inter-class distances, and $\mathbf{f}_a^{(i)}$, $\mathbf{f}_p^{(i)}$ and $\mathbf{f}_n^{(j)}$ are the features extracted from the anchor, positive and negative samples, respectively. The positive and negative samples are images with the same identity as the anchor and with a different identity, respectively. The candidate triplets are constructed from the furthest positive and the closest negative sample for each anchor. These are the hardest positive and hardest negative pairs in a mini-batch of $N = PK$ images with $P$ selected identities and $K$ instances (images) per identity. In our experimental settings, we use $\beta = 0.1$, $\alpha = 1.2$, $P = 5$, $K = 4$ and $N = PK = 20$.
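
As a concrete illustration of Eq. (11), the following framework-free sketch computes the batch hard triplet loss for a small list of feature vectors. The function names are hypothetical, and the loop form is chosen for readability rather than speed; a practical implementation would normally use a vectorized pairwise distance matrix.

```python
import math


def euclidean(u, v):
    """L2 distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def batch_hard_triplet_loss(features, labels, margin=1.2):
    """Sum over anchors of [margin + hardest positive - hardest negative]_+."""
    total = 0.0
    for f_anchor, y_anchor in zip(features, labels):
        # Positives include the anchor itself (distance 0), as p runs over 1..K.
        positives = [euclidean(f_anchor, f)
                     for f, y in zip(features, labels) if y == y_anchor]
        negatives = [euclidean(f_anchor, f)
                     for f, y in zip(features, labels) if y != y_anchor]
        if not negatives:
            continue  # no other identity in the batch, so no triplet can be formed
        total += max(margin + max(positives) - min(negatives), 0.0)
    return total
```

For a toy batch with $P = 2$ and $K = 2$, `batch_hard_triplet_loss([[0, 0], [1, 0], [5, 0], [6, 0]], [0, 0, 1, 1])` is zero, because every hardest negative is already further away than the hardest positive by more than the margin.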

For both losses, we use the learned embeddings $\acute{\mathbf{s}}$, $\acute{\mathbf{c}}$, $\acute{\mathbf{g}}$, $\acute{\mathbf{p}}_1$, $\acute{\mathbf{p}}_2$ and $\acute{\mathbf{p}}_3$ (in place of $\mathbf{f}$ in Eqns. (9) and (11)) during training (see Fig. 2). During testing, however, we concatenate the 2048-D feature vectors of the 6 branches, taken just after the GAP and conventional AP, as the final feature embedding, i.e. $\mathcal{F} = [\mathbf{s}, \mathbf{c}, \mathbf{g}, \mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3]$, which is a 12288-D feature vector, and then compare the feature vector of each query image with the gallery feature vectors using cosine distance. In fact, we feed both the image $\mathbf{I}$ and its horizontally flipped version $\mathbf{I}'$ into the model and obtain their embeddings $\mathcal{F}$ and $\mathcal{F}'$. Their mean $\frac{\mathcal{F} + \mathcal{F}'}{2}$ is then used as the embedding of image $\mathbf{I}$ during testing, which improves the Re-Id performance.
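
The test-time procedure described above (concatenate the six branch features, average with the embedding of the flipped image, rank the gallery by cosine distance) can be sketched as follows. The helper names are hypothetical, the branch features are assumed to be already extracted as plain lists of floats, and cosine distance is taken here to mean one minus cosine similarity, which the text does not spell out.

```python
import math


def concat_branches(branches):
    """Concatenate per-branch vectors [s, c, g, p1, p2, p3] into one embedding."""
    return [x for branch in branches for x in branch]


def flip_averaged_embedding(branches, branches_flipped):
    """Mean of the embeddings of an image and its horizontally flipped copy."""
    f = concat_branches(branches)
    f_flipped = concat_branches(branches_flipped)
    return [(a + b) / 2.0 for a, b in zip(f, f_flipped)]


def cosine_distance(u, v):
    """Assumed definition: 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_gallery(query, gallery):
    """Gallery indices sorted from most to least similar to the query."""
    return sorted(range(len(gallery)),
                  key=lambda i: cosine_distance(query, gallery[i]))
```

With six 2048-D branch vectors, `concat_branches` returns the 12288-D embedding used for matching.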

IV Experiments

IV-A Datasets

In this section, we describe both body person Re-Id datasets such as Market-1501 [19], DukeMTMC-Re-ID [20], CUHK03 [5] and MSMT17 [3], and hand datasets such as 11k hands dataset [21] and Hong Kong Polytechnic University Hand Dorsal (HD) dataset [22].

Market-1501 [19] is captured by 6 cameras and comprises 32,668 labeled images of 1,501 identities. 751 identities (12,936 images) are used for training whereas the rest are used for testing. The testing data is sub-divided into a test probe (query) set and a test gallery set. The test probe set has 3,368 images of 750 identities, while 2,793 additional distractors are also included in the test gallery set. The overall statistics of the body-based person Re-Id datasets used in this paper are given in Table I.

DukeMTMC-Re-ID [20] is captured by 8 cameras and contains 36,411 images of 1,812 identities. Among these identities, 1,404 identities appear in more than 2 cameras while 408 identities (distractors) appear in only one camera. 702 identities are randomly chosen for training from the 1,404 identities and the rest are used for testing. One query image for each identity per camera is chosen from the testing set for the probe set, while all remaining images, including the distractors, are used for the gallery set.

CUHK03 [5] contains 13,164 images of 1,467 persons, and each identity only appears in two disjoint camera views. The new training and testing protocol proposed in [34] is adopted, in which 767 identities are used for training and 700 for testing. Both labeled and detected bounding boxes are given in CUHK03; we perform experiments on the labeled (L) bounding boxes.

MSMT17 [3] is captured by a 15-camera network (12 outdoor, 3 indoor) and is currently the largest publicly available person Re-Id dataset with 126,441 images of 4,101 identities. The training set contains 32,621 images of 1,041 persons (identities), while the test set contains 93,820 images of 3,060 persons. For the test set, 11,659 images are randomly selected as queries and the other 82,161 images are used as the gallery, i.e. we use the training-testing split of [3]. The video is collected under different weather conditions in three time slots (morning, noon, afternoon). The annotations include camera IDs, weather conditions and time slots. Thus, MSMT17 is significantly more challenging than the other three datasets due to its massive scale and its more complex and dynamic scenes.

11k hands dataset [21] (https://sites.google.com/view/11khands) has 190 subjects (identities). We use the same partitioning strategy of the dataset as in [16]. As in [16], this dataset is divided into right dorsal, left dorsal, right palmar and left palmar sub-datasets to train a hand-based person Re-Id (recognition) model. After excluding accessories and dividing the dataset as in [16], right dorsal has 143 identities, left dorsal has 146, right palmar has 143 and left palmar has 151 identities. The first half and the second half of each sub-dataset, split by identity, are used for training and testing, respectively. For instance, for right dorsal, the first 72 identities are used for training and the last 71 identities are used for testing. Similarly, the first 73, 72 and 76 identities are used for training for the left dorsal, right palmar and left palmar sub-datasets, respectively. The remaining identities of each sub-dataset (73 for left dorsal, 71 for right palmar, 75 for left palmar) are used for testing. From each identity of the test set of each sub-dataset, we randomly choose one image and put it in a common gallery for all the 11k sub-datasets. The remaining images of each identity of the test set of each sub-dataset are used as the query set for that sub-dataset. Accordingly, the 11k gallery has 290 images and the query set has 971 images for right dorsal, 988 images for left dorsal, 917 images for right palmar and 948 images for left palmar. A randomly chosen image of each identity of the training set of each sub-dataset is used as validation data for monitoring the training process. This procedure is repeated 10 times and the average performance is reported. The overall statistics of the hand-based person Re-Id datasets used in this paper are given in Table II.

HD dataset [22] (http://www4.comp.polyu.edu.hk/~csajaykr/knuckleV2.htm) has 502 identities. We use the same partitioning strategy of the dataset as in [16]. The first 251 identities and the second 251 identities of the dataset are used for training and testing, respectively. From each identity of the test set of this dataset, one image is randomly chosen and put in the gallery and the rest are used as queries (probes). Unlike the 11k dataset, the HD dataset has additional images of 213 subjects, which lack clarity or do not have second minor knuckle patterns; these are added to the HD gallery. Accordingly, the gallery for the HD dataset has 1,593 images and the query set has 1,992 images. A randomly chosen image of each identity of the training set of the dataset is used as validation data for monitoring the training process. This procedure is repeated 10 times and the average performance is reported.
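
The gallery and query construction used for the hand datasets (one randomly chosen image per test identity goes to the gallery, the rest become queries) can be sketched as below. The function name and the dictionary layout are hypothetical; pooling the four 11k sub-dataset galleries into a common gallery and adding the extra HD gallery images would be separate steps on top of this.

```python
import random


def split_gallery_query(test_images_by_id, seed=0):
    """Pick one random image per identity for the gallery; the rest are queries.

    test_images_by_id maps an identity to the list of its test images
    (an illustrative layout, not a format defined by the datasets).
    """
    rng = random.Random(seed)
    gallery, query = [], []
    for identity in sorted(test_images_by_id):
        images = list(test_images_by_id[identity])
        pick = rng.randrange(len(images))
        gallery.append((identity, images[pick]))
        query.extend((identity, img)
                     for i, img in enumerate(images) if i != pick)
    return gallery, query
```

Repeating the split with ten different seeds and averaging the resulting scores mirrors the 10-run protocol described above.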

TABLE I: Statistics of body-based person Re-Id datasets used in this paper: Market-1501 [19], DukeMTMC-Re-ID [20], CUHK03 [5] and MSMT17 [3]. Number of identities (ids), number of images and number of cameras are shown for train set, query set and gallery set of each dataset.

| Subset | Market-1501 # ids | Market-1501 # images | Market-1501 # cameras | DukeMTMC # ids | DukeMTMC # images | DukeMTMC # cameras | CUHK03 (L) # ids | CUHK03 (L) # images | CUHK03 (L) # cameras | MSMT17 # ids | MSMT17 # images | MSMT17 # cameras |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Train | 751 | 12,936 | 6 | 702 | 16,522 | 8 | 767 | 7,365 | 2 | 1,041 | 30,248 | 15 |
| Query | 750 | 3,368 | 6 | 702 | 2,228 | 8 | 700 | 1,400 | 2 | 3,060 | 11,659 | 15 |
| Gallery | 751 | 15,913 | 6 | 1,110 | 17,661 | 8 | 700 | 5,332 | 2 | 3,060 | 82,161 | 15 |

TABLE II: Statistics of hand-based person Re-Id datasets used in this paper: 11k [21] and HD [22]. Number of identities (ids) and number of images are shown for train set, query set and gallery set of each dataset. Only one camera is used to capture each of these datasets.

| Subset | D-r of 11k # ids | D-r of 11k # images | D-l of 11k # ids | D-l of 11k # images | P-r of 11k # ids | P-r of 11k # images | P-l of 11k # ids | P-l of 11k # images | HD # ids | HD # images |
|---|---|---|---|---|---|---|---|---|---|---|
| Train | 72 | 962 | 73 | 808 | 72 | 977 | 76 | 1,004 | 251 | 2,407 |
| Query | 71 | 971 | 73 | 988 | 71 | 917 | 75 | 948 | 251 | 1,992 |
| Gallery | 290 | 290 | 290 | 290 | 290 | 290 | 290 | 290 | 464 | 1,593 |

IV-B Implementation Details

We implemented the LAGA-Net using the PyTorch deep learning framework and trained it on an NVIDIA GeForce RTX 2080 Ti GPU.

Body-based person Re-Id: The input images of size $384\times128$ are resized to $\frac{9}{8}$ times their size (i.e. $432\times144$) and then randomly cropped back to $384\times128$. During training they are further augmented by random horizontal flip, normalization, color jittering and random erasing [35]. The test images are resized to $384\times128$ and only normalized.

Hand-based person Re-Id: The input images are resized to $356\times356$ and then randomly cropped to $324\times324$, augmented by random horizontal flip, normalization and color jittering during training. However, only normalization is utilized during testing, with the test images resized to $324\times324$ without a random crop.

In both cases, the images are presented in random order by reshuffling the dataset. We use a combination of cross-entropy and hard mining triplet losses over the 6 ID predictions as in Eq. (8) to train the LAGA-Net. To prevent over-fitting and over-confidence, label smoothing [32] with a smoothing value ($\epsilon$) of 0.1 is also used with the cross-entropy loss. We train the model for 70 epochs with a mini-batch size of 20 using the Adam optimizer with a weight decay factor (L2 regularization) of $5\times10^{-4}$. For the first 10 epochs, we use a warmup strategy [7], increasing the learning rate linearly from $8\times10^{-6}$ to $8\times10^{-4}$; it is then decayed to $4\times10^{-4}$ and $2\times10^{-4}$ after 40 and 60 epochs, respectively. The learning rate is divided by 10 for the existing layers of the backbone network, i.e. a ten times larger learning rate is given to the newly added layers (FC layers and batch normalizations) and the attention modules (embedding functions and batch normalizations), with appropriate weight and bias initializations.
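
The learning-rate schedule described above can be written as a small function of the epoch index. This is a sketch under stated assumptions: epochs are counted from 0, the warmup reaches its peak on the tenth epoch (index 9), and the decays take effect from epochs 40 and 60 onward. The function and group names are hypothetical, and the exact boundaries in the original training code may differ.

```python
def learning_rate(epoch, peak_lr=8e-4, warmup_start=8e-6, warmup_epochs=10):
    """Learning rate for the newly added layers at a 0-indexed epoch."""
    if epoch < warmup_epochs:
        # Linear warmup from warmup_start to peak_lr over the first epochs.
        frac = epoch / (warmup_epochs - 1)
        return warmup_start + frac * (peak_lr - warmup_start)
    if epoch < 40:
        return peak_lr
    if epoch < 60:
        return 4e-4
    return 2e-4


def param_group_lrs(epoch, backbone_scale=0.1):
    """Backbone layers train with one tenth of the new layers' learning rate."""
    lr = learning_rate(epoch)
    return {"new_layers": lr, "backbone": lr * backbone_scale}
```

Calling `param_group_lrs(epoch)` once per epoch gives the two rates that would be assigned to the corresponding optimizer parameter groups.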

IV-C Evaluation Metrics

We use the standard person Re-Id evaluation metrics, namely Cumulative Matching Characteristics (CMC) [19] (rank-1 or top-1 matching accuracy) and mean Average Precision (mAP) [19], to evaluate our proposed person Re-Id method.
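
For reference, rank-1 accuracy and mAP can be computed from per-query ranked match flags as in the sketch below. The function names are hypothetical, and the sketch assumes the ranked lists have already been filtered according to the benchmark protocol (for example, removing gallery images of the same identity taken by the same camera as the query), which is not shown here.

```python
def average_precision(ranked_matches):
    """AP for one query; ranked_matches[k] is True if the k-th result is correct."""
    hits = 0
    precisions = []
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def rank1_and_map(all_ranked_matches):
    """Rank-1 accuracy and mAP over queries that have at least one correct match."""
    valid = [m for m in all_ranked_matches if any(m)]
    rank1 = sum(1 for m in valid if m[0]) / len(valid)
    mean_ap = sum(average_precision(m) for m in valid) / len(valid)
    return rank1, mean_ap
```

When the gallery holds exactly one correct image per query, as in the hand datasets, the AP of a query reduces to the reciprocal of the rank of that image.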

IV-D Ablation Analysis

Our proposed method, as described in Section III, incorporates two complementary attention modules, channel and spatial with relative position, at a higher level of the network in branches along with the global (without attention) and local branches. The ablation analysis of these components is given in Table III with evaluations on the Market-1501 [19] body dataset and palmar right (P-r) of the 11k hands dataset [21]. As shown in this table, each component contributes to a performance gain. For body-based person Re-Id, for instance, the global branch (ResNet50 with some modifications) gives rank-1 and mAP of 93.92% and 83.21%, respectively. Incorporating the local branch with three horizontal partitions (stripes) boosts the performance to 94.41% rank-1 and 85.30% mAP. Rank-1 accuracy and mAP of 95.12% and 86.91%, respectively, are obtained by integrating the channel attention module (CAM). Incorporating the spatial attention module with relative positional encodings (SAM-RPE) contributes to the performance gain as well, giving the overall LAGA-Net performance of 96.18% rank-1 and 88.76% mAP on the Market-1501 dataset. Similarly, each component contributes to a performance gain on palmar right (P-r) of the 11k hands dataset as shown in this table.

TABLE III: Ablation analysis on components of LAGA-Net on Market-1501 [19] and palmar right (P-r) of 11k [21]. Global + Local + CAM + SAM-RPE gives LAGA-Net. The results are shown in rank-1 accuracy (%) and mAP (%).
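| Method | Market-1501 rank-1 | Market-1501 mAP | P-r (11k) rank-1 | P-r (11k) mAP |
|---|---|---|---|---|
| Global | 93.92 | 83.21 | 95.43 | 95.95 |
| + Local | 94.41 | 85.30 | 96.52 | 96.97 |
| + CAM | 95.12 | 86.91 | 97.14 | 97.45 |
| + SAM-RPE | 96.18 | 88.76 | 98.16 | 98.54 |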

IV-E Comparison with the State-of-the-art Methods

Body-based person Re-Id: We evaluate our model and report the results using rank-1 matching accuracy and mAP [19] on the Market-1501 [19], DukeMTMC-Re-ID [20], CUHK03 [5] and MSMT17 [3] datasets. For fair comparison, we did not use post-processing such as re-ranking [34] or multi-query [19]. We compare our proposed method, LAGA-Net, to many existing state-of-the-art methods and report the quantitative performance comparison in Table IV. As shown in this table, our method outperforms all other methods across all datasets in both rank-1 accuracy and mAP, except on CUHK03 where our method is ranked 2nd in rank-1 accuracy. This indicates that our method generalizes well across datasets. Specifically, our proposed method outperforms the existing part-based methods such as PCB+RPP [9] and MGN [10] by a large margin. For instance, the LAGA-Net outperforms the MGN by 13.73% in rank-1 accuracy and 10.54% in mAP on the CUHK03 dataset. Similarly, the LAGA-Net outperforms the existing attention-based methods such as MHN [11], ABD-Net [12] and RGA-Net [13]. For instance, the LAGA-Net outperforms the RGA-Net by 2.26% in rank-1 accuracy and 3.47% in mAP on the MSMT17 dataset. Our proposed method outperforms not only these supervised body-based person Re-Id methods, but also the recent unsupervised person Re-Id methods such as RLCC [23] and IICS [24], as shown in Table IV.

TABLE IV: Quantitative performance comparison of our method (LAGA-Net) with existing state-of-the-art body-based person Re-Id methods on Market-1501 [19], DukeMTMC-Re-ID [20], CUHK03 [5] and MSMT17 [3] datasets. The results are shown in rank-1 accuracy (%) and mAP (%). Best and second best results are shown in red and blue, respectively. * denotes unsupervised person Re-Id methods.
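| Method | Market-1501 rank-1 | Market-1501 mAP | DukeMTMC rank-1 | DukeMTMC mAP | CUHK03 (L) rank-1 | CUHK03 (L) mAP | MSMT17 rank-1 | MSMT17 mAP |
|---|---|---|---|---|---|---|---|---|
| MGCAM [25] | 83.79 | 74.33 | - | - | 50.14 | 50.21 | - | - |
| DGNet [6] | 94.8 | 86.0 | 86.6 | 74.8 | - | - | 77.2 | 52.3 |
| Interpreter-50 [36] | 94.74 | 87.11 | 87.84 | 75.27 | - | - | - | - |
| OSNet [37] | 94.8 | 84.9 | 88.6 | 73.5 | 72.3 | 67.8 | 78.7 | 52.9 |
| MGN [10] | 95.7 | 86.9 | 88.7 | 78.4 | 68.0 | 67.4 | - | - |
| PCB+RPP [9] | 93.8 | 81.6 | 83.3 | 69.2 | - | - | 68.2 | 40.4 |
| PDC [8] | 84.14 | 63.41 | - | - | 88.70 | - | - | - |
| BagTricks [7] | 94.5 | 85.9 | 86.4 | 76.4 | - | - | - | - |
| ABD-Net [12] | 95.60 | 88.28 | 89.00 | 78.59 | - | - | 82.30 | 60.80 |
| RGA-Net [13] | 96.10 | 88.40 | - | - | 81.10 | 77.40 | 80.30 | 57.50 |
| MHN [11] | 95.1 | 85.0 | 89.1 | 77.2 | 77.2 | 72.4 | - | - |
| IANet [38] | 94.4 | 83.1 | 87.1 | 73.4 | 92.4 | - | 75.5 | 46.8 |
| RLCC* [23] | 90.8 | 77.7 | 83.2 | 69.2 | - | - | 56.5 | 27.9 |
| IICS* [24] | 88.8 | 72.1 | 80.0 | 64.4 | - | - | 56.4 | 26.9 |
| LAGA-Net (Ours) | 96.18 | 88.76 | 89.71 | 78.92 | 81.73 | 77.94 | 82.56 | 60.97 |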

Hand-based person Re-Id: We compare our proposed method to many existing state-of-the-art hand-based Re-Id (recognition) methods such as GPA-Net [16], MBA-Net [39], RGA-Net [13] and ABD-Net [12]. The GPA-Net was designed for hand-based person identification; however, both RGA-Net and ABD-Net were designed for body-based person re-identification. Therefore, we trained both RGA-Net and ABD-Net on the hand datasets using the same experimental settings (loss function, optimizer, hyperparameters, etc.) as our method to make a fair comparison. The quantitative performance comparison of our method with the other methods is given in Table V. As shown in this table, our method outperforms all other methods across all datasets in both rank-1 accuracy and mAP. For instance, the LAGA-Net outperforms the MBA-Net by 0.81% in rank-1 accuracy and 0.73% in mAP on the HD dataset. The adapted body-based person Re-Id methods, RGA-Net [13] and ABD-Net [12], have limited performance on hand datasets. For instance, our method outperforms the RGA-Net by 5.50% in rank-1 accuracy and 4.96% in mAP on palmar right (P-r) of the 11k dataset.

TABLE V: Quantitative performance comparison of our method (LAGA-Net) with existing state-of-the-art hand-based person Re-Id methods (GPA-Net [16], MBA-Net [39], RGA-Net [13] and ABD-Net [12]) on right dorsal (D-r) of 11k, left dorsal (D-l) of 11k, right palmar (P-r) of 11k, left palmar (P-l) of 11k and HD datasets. The results are shown in rank-1 accuracy (%) and mAP (%). Best and second best results are shown in red and blue, respectively.
| Method | D-r of 11k rank-1 | D-r of 11k mAP | D-l of 11k rank-1 | D-l of 11k mAP | P-r of 11k rank-1 | P-r of 11k mAP | P-l of 11k rank-1 | P-l of 11k mAP | HD rank-1 | HD mAP |
|---|---|---|---|---|---|---|---|---|---|---|
| GPA-Net [16] | 94.80 | 95.72 | 94.87 | 95.93 | 95.83 | 96.31 | 95.72 | 96.20 | 94.64 | 95.08 |
| MBA-Net [39] | 97.45 | 97.98 | 96.71 | 97.41 | 98.05 | 98.42 | 97.42 | 97.84 | 95.12 | 95.54 |
| RGA-Net [13] | 94.77 | 95.67 | 95.30 | 95.98 | 92.66 | 93.58 | 94.95 | 95.67 | 95.06 | 95.39 |
| ABD-Net [12] | 95.89 | 96.76 | 94.26 | 95.34 | 96.21 | 96.91 | 95.54 | 96.01 | 94.93 | 95.38 |
| LAGA-Net (Ours) | 97.56 | 98.11 | 96.82 | 97.53 | 98.16 | 98.54 | 97.56 | 97.95 | 95.93 | 96.27 |

IV-F Qualitative Re-ID Results

The qualitative Re-Id results of our proposed method are shown in Figs. 3 and 4. As can be observed in Fig. 3, our proposed method (LAGA-Net) improves retrieval performance on the Market-1501 dataset over the baseline (the global component of the LAGA-Net, without attention). While the LAGA-Net retrieves correct results for all of the top-5 (top row for each query), the baseline retrieves only a few correct results (bottom row for each query). This indicates that the LAGA-Net learns more robust and discriminative feature embeddings, which help to find more correct (true positive) results than the baseline even when the persons in the images undergo significant appearance variations. We also show a qualitative exemplar image (query) of each (sub-)dataset of the hand datasets in Fig. 4 with ranked results retrieved from the gallery of each (sub-)dataset. Unlike in the body-based person Re-Id datasets, there is only one correct image per identity in the galleries of the hand-based person Re-Id datasets. Because of this, only one correct image is retrieved from the gallery of the 11k and HD hand datasets for each query image, whereas multiple correct images are retrieved from the Market-1501 gallery for each query image. Overall, our proposed method is effective on both the body-based and the hand-based person Re-Id datasets as it learns more robust and discriminative deep feature representations.

Figure 3: Some qualitative results of our method on Market-1501 [19] dataset using query vs ranked results retrieved from gallery. Left: query image, Right: a) top-5 results of the LAGA-Net, b) top-5 results of the global (without attention) component of the LAGA-Net (baseline). The green and red bounding boxes denote the correct and the wrong matches, respectively. Feature embeddings from our proposed method (LAGA-Net) give better retrieval performance.
Figure 4: Some qualitative results of our method on 11k [21] and HD [22] datasets using query vs ranked results retrieved from gallery. From top to bottom row are right dorsal of 11k, left dorsal of 11k, right palmar of 11k, left palmar of 11k and HD datasets. The green and red bounding boxes denote the correct and the wrong matches, respectively.

V Discussion

The proposed multi-branch deep network architecture, LAGA-Net, is a compound approach for end-to-end discriminative deep feature representation learning for person Re-Id based on both body and hand images. The extensive evaluation in Section IV shows that this method performs very well not only on body datasets but also on hand datasets, as shown in Table IV and Table V, respectively.

The different branches in the proposed multi-branch deep network capture different but complementary information to boost the performance of the network, as shown in the ablation analysis (see Section IV-D) on both body and hand datasets, particularly in Table III. Each branch has its own importance in the proposed network. The attention branches, i.e. the channel attention branch and the spatial attention branch, focus on the relevant features of the image while suppressing the irrelevant backgrounds. To maintain translation equivariance, relative positional encodings are integrated into the spatial attention module of the spatial attention branch. The global and local branches intend to capture global context and fine-grained information, respectively. By properly integrating these branches in our proposed network, we have demonstrated that it is possible to effectively learn robust and discriminative feature representations for person Re-Id based on both body and hand images. The main strength of the proposed method is its strong performance on both body and hand images. Integrating these different components into one compound approach yields robust and discriminative feature representations, which helps in overcoming the many challenges person Re-Id faces, such as pose variations, occlusion, view point changes, lighting changes, background clutter and noisy labels.

The proposed method has various applications. For instance, body-based person Re-Id can be used for intelligent video surveillance in less controlled or uncontrolled environments. Hand-based person Re-Id can be used for criminal investigation in uncontrolled environments, for instance, for recognizing or re-identifying perpetrators of serious crimes such as sexual abuse in cases where only hand images of the perpetrators are available, which is crucial in assisting international police forces.

VI Conclusion

In this work, we introduce a compound approach for end-to-end discriminative deep feature learning, the Local-Aware Global Attention Network (LAGA-Net), for person Re-Id based on both body and hand images. The LAGA-Net is a multi-branch deep network architecture consisting of channel and spatial attention modules in branches, in addition to global (without attention) and local branches, to learn deep attentive, global and part-level feature embeddings for more robust and discriminative person Re-Id. We also integrate relative positional encodings into the spatial attention module to capture the spatial positions of pixels and thereby overcome a weakness of attention mechanisms, namely their equivariance to pixel shuffling. The incorporation of these branches allows a deeper study of the features of person body and hand images for robust re-identification of individuals in less controlled and challenging environments. The LAGA-Net demonstrates state-of-the-art performance through extensive experiments on four popular body-based person Re-Id benchmarks and two publicly available hand datasets, where the ablation analysis shows that each component contributes to a performance gain.

References

  • [1] Nathanael L. Baisa, “Occlusion-robust online multi-object visual tracking using a GM-PHD filter with CNN-based re-identification,” Journal of Visual Communication and Image Representation, vol. 80, pp. 103279, 2021.
  • [2] Ergys Ristani and Carlo Tomasi, “Features for multi-target multi-camera tracking and re-identification,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, June 18-22, 2018, pp. 6036–6046.
  • [3] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian, “Person transfer GAN to bridge domain gap for person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [4] Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven C.H. Hoi, “Deep learning for person re-identification: A survey and outlook,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2021.
  • [5] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang, “Deepreid: Deep filter pairing neural network for person re-identification,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 152–159.
  • [6] Zhedong Zheng, Xiaodong Yang, Zhiding Yu, Liang Zheng, Yi Yang, and Jan Kautz, “Joint discriminative and generative learning for person re-identification,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2133–2142.
  • [7] Hao Luo, Wei Jiang, Youzhi Gu, Fuxu Liu, Xingyu Liao, Shenqi Lai, and Jianyang Gu, “A strong baseline and batch normalization neck for deep person re-identification,” IEEE Transactions on Multimedia, vol. 22, no. 10, pp. 2597–2609, 2020.
  • [8] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian, “Pose-driven deep convolutional model for person re-identification,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3980–3989.
  • [9] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang, “Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline),” in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
  • [10] Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou, “Learning discriminative features with multiple granularities for person re-identification,” in Proceedings of the 26th ACM International Conference on Multimedia, New York, NY, USA, 2018, MM ’18, p. 274–282, Association for Computing Machinery.
  • [11] Binghui Chen, Weihong Deng, and Jiani Hu, “Mixed high-order attention network for person re-identification,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 371–381.
  • [12] Tianlong Chen, Shaojin Ding, Jingyi Xie, Ye Yuan, Wuyang Chen, Yang Yang, Zhou Ren, and Zhangyang Wang, “ABD-Net: Attentive but diverse person re-identification,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 8350–8360.
  • [13] Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Xin Jin, and Zhibo Chen, “Relation-aware global attention for person re-identification,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 3183–3192.
  • [14] Anil K. Jain, Debayan Deb, and Joshua J. Engelsma, “Biometrics: Trust, but verify,” IEEE Transactions on Biometrics, Behavior, and Identity Science, vol. 4, no. 3, pp. 303–323, 2022.
  • [15] A. Dantcheva, P. Elia, and A. Ross, “What else does your biometric data reveal? A survey on soft biometrics,” IEEE Transactions on Information Forensics and Security, vol. 11, no. 3, pp. 441–467, 2016.
  • [16] Nathanael L. Baisa, Bryan Williams, Hossein Rahmani, Plamen Angelov, and Sue Black, “Hand-based person identification using global and part-aware deep feature representation learning,” in 2022 Eleventh International Conference on Image Processing Theory, Tools and Applications (IPTA), 2022, pp. 1–6.
  • [17] Yimin Yuan, Chaoying Tang, Shuhang Xia, Zhou Chen, and Tong Qi, “HandNet: Identification based on hand images using deep learning methods,” in Proceedings of the 2020 4th International Conference on Vision, Image and Signal Processing, New York, NY, USA, 2020, ICVISP 2020, Association for Computing Machinery.
  • [18] Abdelouahab Attia, Zahid Akhtar, and Youssef Chahir, “Feature-level fusion of major and minor dorsal finger knuckle patterns for person authentication,” Signal, Image and Video Processing, Feb. 2021.
  • [19] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1116–1124.
  • [20] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi, “Performance measures and a data set for multi-target, multi-camera tracking,” in Computer Vision – ECCV 2016 Workshops, Gang Hua and Hervé Jégou, Eds., Cham, 2016, pp. 17–35, Springer International Publishing.
  • [21] Mahmoud Afifi, “11k hands: gender recognition and biometric identification using a large dataset of hand images,” Multimedia Tools and Applications, 2019.
  • [22] A. Kumar and Z. Xu, “Personal identification using minor knuckle patterns from palm dorsal surface,” IEEE Transactions on Information Forensics and Security, vol. 11, no. 10, pp. 2338–2348, 2016.
  • [23] Xiao Zhang, Yixiao Ge, Yu Qiao, and Hongsheng Li, “Refining pseudo labels with clustering consensus over generations for unsupervised object re-identification,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 3435–3444.
  • [24] Shiyu Xuan and Shiliang Zhang, “Intra-inter camera similarity for unsupervised person re-identification,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 11921–11930.
  • [25] Chunfeng Song, Yan Huang, Wanli Ouyang, and Liang Wang, “Mask-guided contrastive attention model for person re-identification,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 1179–1188.
  • [26] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah, “Transformers in vision: A survey,” 2021.
  • [27] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani, “Self-attention with relative position representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, June 2018, pp. 464–468, Association for Computational Linguistics.
  • [28] Zhuoran Shen, Irwan Bello, Raviteja Vemulapalli, Xuhui Jia, and Ching-Hui Chen, “Global self-attention networks for image recognition,” CoRR, vol. abs/2010.03019, 2020.
  • [29] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [30] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017, AAAI’17, pp. 4278–4284, AAAI Press.
  • [31] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, July 2017, pp. 2261–2269, IEEE Computer Society.
  • [32] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2818–2826.
  • [33] Alexander Hermans, Lucas Beyer, and Bastian Leibe, “In defense of the triplet loss for person re-identification,” 2017.
  • [34] Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li, “Re-ranking person re-identification with k-reciprocal encoding,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3652–3661.
  • [35] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang, “Random erasing data augmentation,” in The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7–12, 2020, pp. 13001–13008, AAAI Press.
  • [36] Xiaodong Chen, Xinchen Liu, Wu Liu, Xiao-Ping Zhang, Yongdong Zhang, and Tao Mei, “Explainable person re-identification with attribute-guided metric distillation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 11813–11822.
  • [37] Kaiyang Zhou, Yongxin Yang, Andrea Cavallaro, and Tao Xiang, “Omni-scale feature learning for person re-identification,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 3701–3711.
  • [38] Ruibing Hou, Bingpeng Ma, Hong Chang, Xinqian Gu, Shiguang Shan, and Xilin Chen, “Interaction-and-aggregation network for person re-identification,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9309–9318.
  • [39] Nathanael L. Baisa, Bryan Williams, Hossein Rahmani, Plamen Angelov, and Sue Black, “Multi-branch with attention network for hand-based person recognition,” in 2022 26th International Conference on Pattern Recognition (ICPR), 2022.