GMT: A Robust Global Association Model for Multi-Target Multi-Camera Tracking

Huijie Fan, Tinghui Zhao, Qiang Wang, Baojie Fan, Yandong Tang, Member, IEEE, and LianQing Liu

Manuscript received xx xx, xxxx; revised xx xx, xxxx; accepted xx xx, xxxx. Date of publication xx xx, xxxx; date of current version xx xx, xxxx. This work is supported by the National Natural Science Foundation of China (62273339, 62073205, U20A20200). The associate editor coordinating the review of this manuscript and approving it for publication was Prof. xxxx (Corresponding author: Qiang Wang).

Huijie Fan, Yandong Tang and LianQing Liu are with the State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China (e-mail: [email protected]; [email protected]; [email protected]).

Tinghui Zhao is with the State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China and also with the University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: [email protected]).

Qiang Wang is with the Key Laboratory of Manufacturing Industrial Integrated, Shenyang University (e-mail: [email protected]).

Baojie Fan is with the Automation and AI College, Nanjing University of Posts and Telecommunications, Nanjing 210049, China, and also with the State Key Laboratory of Integrated Services Networks, Xi’an 710071, China (e-mail: [email protected]).
Abstract

In the task of multi-target multi-camera (MTMC) tracking of pedestrians, the data association problem is a key issue and main challenge, especially with complications arising from camera movements, lighting variations, and obstructions. However, most MTMC models adopt two-step approaches, thus heavily depending on the results of the first-step tracking in practical applications. Moreover, the same targets crossing different cameras may exhibit significant appearance variations, which further increases the difficulty of cross-camera matching. To address the aforementioned issues, we propose a global online MTMC tracking model that addresses the dependency on the first tracking stage in two-step methods and enhances cross-camera matching. Specifically, we propose a transformer-based global MTMC association module to explore target associations across different cameras and frames, generating global trajectories directly. Additionally, to integrate the appearance and spatio-temporal features of targets, we propose a feature extraction and fusion module for MTMC tracking. This module enhances feature representation and establishes correlations between the features of targets across multiple cameras. To accommodate high scene diversity and complex lighting condition variations, we have established the VisionTrack dataset, which enables the development of models that are more generalized and robust to various environments. Our model demonstrates significant improvements over comparison methods on the VisionTrack dataset and others.

Index Terms:
MTMC tracking, MTMC dataset, vision transformer.
Figure 1: Two-Step Approach Versus Global Approach in MTMC. Most two-step approaches employ off-the-shelf models for single-camera tracking, followed by inter-camera tracking models. Our global model merges these two steps, associating all targets across cameras and generating global trajectories at once.

I Introduction

Multi-target multi-camera (MTMC) tracking [1] has vast applications in numerous domains such as video surveillance [2, 3, 4], autonomous driving [5], human-computer interaction [6], anomaly action detection [7], crowd behavior analysis [8, 9], and understanding of traffic scenes [10]. Compared to single-camera tracking, multi-camera tracking uses information from cameras at various locations to enhance system robustness and alleviate occlusion issues by offering multiple viewing angles, ensuring more continuous and accurate tracking [11, 12]. However, the appearance, lighting, and background information of one object captured by different cameras usually vary significantly [13, 14], increasing the difficulty of target association in multi-camera tracking.

Most MTMC tracking models proposed in recent years employ two-step approaches, which involve single-camera tracking followed by inter-camera re-identification (Re-ID) (shown in Figure 1, two-step approach). First, single-camera tracking models, such as DeepSORT [15] or FairMOT [16], are used to associate targets across frames within one camera, generating local trajectories. Second, Re-ID models [17, 13] are employed to link these local trajectories, creating complete trajectories across multiple cameras. Although this method of target association is widely used, the performance of inter-camera tracking in the second step is greatly influenced by false positives and trajectory fragments produced by the single-camera tracking in the first step [18]. Furthermore, the two-step approach inherently separates the tracking and Re-ID processes, preventing the model from leveraging the full context of the multi-camera setup during both steps. This separation can result in the loss of valuable temporal and spatial information that could be used to improve tracking accuracy.

Global MTMC tracking models integrate both steps into a unified process (shown in Figure 1, global approach), reducing inaccuracies and enhancing the robustness of target association. Existing global MTMC tracking models are mostly based on constructing a graph which connects targets across all cameras [19, 20]. Global MTMC tracking methods are more robust compared to the two-step MTMC approaches, but they have higher computational complexity and lack real-time capabilities. These models also require sophisticated algorithms to handle the optimization of the global graph.

Figure 2: Our proposed GMT model for MTMC tracking. Our GMT model consists of three main modules. (1) The Object Detection Module detects targets frame-by-frame within the video stream. (2) The Feature Extraction Module extracts and fuses Re-ID and spatio-temporal features for each target. (3) The Global Association Module employs a transformer to produce a similarity matrix for target association and uses a memory bank for a secondary association that recovers lost trajectories of unmatched targets, ultimately creating global trajectories.

For a more efficient solution to target association in MTMC tracking, we propose a transformer-based global MTMC tracking model. As shown in Figure 2, this model directly generates global trajectories across cameras from initial detections. Our model simplifies the two-step approach and avoids the complex graph optimization process. With the transformer’s global association capabilities, our model processes all target features within a time window, enabling interaction and enhancement of target features across different cameras, and generating a similarity matrix between targets. Global trajectories for each target are constructed by processing the similarity matrix and using the Hungarian algorithm.
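The final matching step, from a similarity matrix to one-to-one assignments, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the similarity values are taken as given (in GMT they come from the transformer), the `min_sim` acceptance threshold is an assumption, and `hungarian` is a standard pure-Python Kuhn-Munkres routine written here for self-containment.

```python
def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Standard Kuhn-Munkres algorithm with row/column potentials.
    Returns a dict mapping each row index to its assigned column index.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)          # row potentials (1-indexed)
    v = [0.0] * (m + 1)          # column potentials (1-indexed)
    p = [0] * (m + 1)            # p[j]: row currently assigned to column j
    way = [0] * (m + 1)          # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:           # flip assignments along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return {p[j] - 1: j - 1 for j in range(1, m + 1) if p[j] != 0}


def associate(sim, min_sim=0.5):
    """Match rows (trajectories) to columns (detections) of a similarity matrix.

    `min_sim` is a hypothetical threshold: assigned pairs whose similarity
    falls below it are treated as unmatched.
    """
    if not sim or not sim[0]:
        return []
    n, m = len(sim), len(sim[0])
    if n <= m:
        cost = [[-s for s in row] for row in sim]      # maximise similarity
        pairs = list(hungarian(cost).items())
    else:
        # More rows than columns: solve the transposed problem.
        cost = [[-sim[i][j] for i in range(n)] for j in range(m)]
        pairs = [(i, j) for j, i in hungarian(cost).items()]
    return sorted((i, j) for i, j in pairs if sim[i][j] >= min_sim)
```

With `sim = [[0.9, 0.8], [0.85, 0.1]]`, a greedy matcher would pair row 0 with column 0 and leave row 1 with a poor match, whereas the optimal assignment pairs row 0 with column 1 and row 1 with column 0.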

Our model also addresses the limitation of existing MTMC tracking models, which often focus solely on appearance features when associating targets across cameras, by introducing a fused feature representation module that integrates Re-ID features with spatio-temporal features. This integration considers both appearance similarities and the continuity of target trajectories within time windows. The inclusion of spatio-temporal information allows for more robust and accurate target association, as it leverages both visual distinctiveness and movement patterns, leading to improved MTMC tracking performance. To recover lost trajectories, we introduce a memory bank module that preserves the historical features of each trajectory. Our end-to-end trainable model achieved competitive performance on six diverse MTMC tracking datasets.

The main contributions are summarized as follows:

  • We propose a transformer-based global MTMC tracking model that associates features of targets across different cameras and frames, directly generates global trajectories, and possesses the capability to recover lost trajectories.

  • We propose a feature representation module for MTMC tracking, which integrates the appearance and spatio-temporal features of targets to enhance their feature representation, thereby effectively establishing correlations between features of targets in multiple cameras.

  • We have established the VisionTrack dataset, a large-scale MTMC tracking dataset featuring high scene diversity and complex lighting condition variations, enabling the development of models that are more generalized and robust to various environments.

  • On the VisionTrack dataset, we attained a score of 76.0 in CVMA (Cross-View Matching Accuracy) and 81.4 in CVIDF1 (Cross-View IDF1 Score), outperforming the second-best model by 5.9 in CVMA and 13.0 in CVIDF1.

II Related Work

II-1 Object Detection

Object detection is one of the fundamental tasks in computer vision, with the mission of locating and classifying targets of specific categories within images. In recent years, object detection methods can be categorized into anchor-based and anchor-free detection algorithms. Examples of anchor-based algorithms are Faster RCNN [21], SSD [22], and YOLOv2 [23], whereas anchor-free algorithms include YOLO [24], CenterNet [25], and DETR [26]. Our work focuses on feature association between targets, thus we employ CenterNet to detect targets in every frame of the video. Other detection algorithms can also replace CenterNet within our framework.

II-2 Two-Step MTMC Tracking

Two-step MTMC tracking approaches decompose MTMC tracking into two consecutive steps: Single-Camera Tracking (SCT) [27, 28, 29, 30, 31] and Inter-Camera Tracking (ICT) [32, 33, 34, 35, 36, 37, 38]. SCT methods include the tracking-by-detection models DeepSORT [15], BYTETRACK [39], and GTR [40], and the joint detection-and-tracking model FairMOT [16]. ICT aims to associate the same targets across different cameras, thereby forming inter-camera trajectories. Inter-camera target Re-ID is a challenging task because the same targets may have different postures, lighting, and occlusions in different cameras. To better perform ICT, some models focus on optimizing and improving target association methods. DyGLIP [41] employs dynamic graphs for link prediction, combined with attention mechanisms, to establish accurate data associations among targets. TRACTA [42] formulates the MTMC data association problem as a trajectory-to-target assignment problem, proposing a restricted non-negative matrix factorization algorithm to calculate the assignment matrix. MvMHAT [43] introduces a self-supervised learning framework, establishing pairwise similarity and triplet transitive similarity for learning data association models in MTMC tracking. MIA-NET [44], designed to address the challenges of multi-camera small object tracking, introduces an inter-camera matching model that employs keypoint mapping. Additionally, some models focus on optimizing the feature representation of targets, making the features more discriminative across different cameras. Crossmot [45] introduces both single-view Re-ID embeddings and cross-view Re-ID embeddings, representing target features distinctively in the two tracking steps. Li et al. [46] proposed intra-tracklet and inter-tracklet attention modules, separately learning each target’s motion and appearance features and each trajectory’s feature representation. Cheng et al. [47] presented a graph model that initially connects detected objects across different cameras spatially, and then transforms these connections into a temporal graph for temporal association.

Figure 3: Examples of the VisionTrack dataset. Our dataset was captured in various weather conditions, times, and scenes. The images from left to right are examples taken during sunny, overcast, dusk, and night-time conditions. The same targets appearing in different cameras are marked with the same colored bounding boxes.
Figure 4: Left chart: The red line graph represents the average size of each target in different scenes, while the bar chart represents the average number of targets per frame in different scenes. The VisionTrack dataset showcases various target densities in different scenes, ranging from densely packed small targets to sparsely distributed large targets. Right chart: The VisionTrack dataset covers a variety of weather and illumination conditions, including dark and low-light scenes.

II-3 Global MTMC Tracking

Existing global MTMC tracking algorithms are mostly based on graph networks that establish relationships among targets and directly output global trajectories. Chen et al. [19] proposed a global graph network that merges SCT and ICT processes. They also introduced a new similarity measurement scheme that balances different similarities in two steps. Liu et al. [20] proposed a method using generalized maximum clique optimization to construct a global graph. Furthermore, to better calculate similarity, they used LOMO features and Hankel matrices to represent the appearance and motion features of targets, respectively.

III Dataset

III-A Previous datasets

Existing MTMC tracking datasets with overlapping fields of view include: EPFL [48], CAMPUS [49], MvMHAT [43], WILDTRACK [50], and DIVOTrack [45]. Each of these datasets suffers from one or more issues, including limited scene diversity, uniform weather and lighting conditions, and unsatisfactory annotation quality. The scarcity of high-quality datasets limits the training and application of models in MTMC tracking. To address the limitations and enrich the diversity of existing datasets, we constructed a high-quality annotated dataset, named VisionTrack, which features complex lighting conditions and enhanced scene richness (shown in the right chart of Figure 4). A comparison with existing datasets is shown in Table I.

TABLE I: Comparison between MTMC tracking datasets with overlapping fields of view. Compared to existing datasets, VisionTrack offers advantages in terms of the number of scenes, quantity of images, count of targets, complexity of lighting conditions, and the number of UAV views.
Dataset | Scenes | Views | Frames | Boxes | Moving camera | Low-light | UAV view
EPFL | 5 | 3-4 | 97K | 625K | × | × | ×
CAMPUS | 4 | 4 | 83K | 490K | × | × | ×
MvMHAT | 1 | 3-4 | 31K | 208K | ✓ | × | ×
WILDTRACK | 1 | 7 | 3K | 40K | × | × | ×
DIVOTrack | 10 | 3 | 54K | 560K | ✓ | × | one UAV
VisionTrack | 15 | 2 | 116K | 1176K | ✓ | ✓ | two UAVs
TABLE II: Notation list of GMT model
Symbol | Description
$K$ | The set of cameras in the scenario, $K=\{k_c\}_{c=1}^{C}$
$V$ | The set of videos captured from the cameras, $V=\{v_c\}_{c=1}^{C}$
$M_c$ | The set of all frames in video $v_c$, $M_c=\{m_c^t\}_{t=1}^{T_c}$
$\Gamma_i$ | The trajectory $\Gamma_i$, consisting of a set of bounding boxes, $\Gamma_i=\{\tau_i^{ct}\}_{c=1,t=1}^{C,T}$
$\tau_i^{ct}$ | The bounding box corresponding to $\Gamma_i$ in frame $m_c^t$
$P$ | The set of detected targets, $P=\{p_n\}_{n=1}^{N}$
$p_n$ | The target $p_n$, consisting of the bounding box position, time, and camera index, $p_n=(b_n,t_n,c_n)$
$f_n$ | The fused feature of target $p_n$
$p_n\in\Gamma_i$ | The detected target $p_n$ belongs to trajectory $\Gamma_i$
$p_i\leftrightarrow p_j$ | The target $p_i$ and target $p_j$ belong to the same trajectory

III-B VisionTrack

III-B1 Data collection

We captured video data using two drones, each equipped with a camera with a resolution of 1920×1080. All videos were captured at a frame rate of 30 FPS, comprising approximately 116K frames and 1176K detection boxes. Both drones were moving during the recording process. To ensure dataset diversity, we selected different scenes with varying weather and lighting conditions, including sunny, overcast, dusk, and night (shown in Figure 3). We also captured scenes with varying levels of crowd density, ranging from sparse large targets to dense small targets (shown in the left chart of Figure 4). The dataset is split approximately 1:1, with half designated for training and the other half for testing. The videos from the two cameras were manually time-aligned. We also place a high emphasis on the protection of personal privacy within our dataset. All individuals appearing in the dataset have consented to the use of their images solely for scientific research purposes.
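As a rough sanity check on these figures, the short script below derives the average number of annotated boxes per frame for each dataset in Table I, and an approximate footage duration for VisionTrack. The frame and box counts are copied from Table I. The duration estimate assumes that the 116K frame count is a total over both views, which the text does not state explicitly.

```python
# (frames, boxes) in thousands, copied from Table I.
DATASETS = {
    "EPFL": (97, 625),
    "CAMPUS": (83, 490),
    "MvMHAT": (31, 208),
    "WILDTRACK": (3, 40),
    "DIVOTrack": (54, 560),
    "VisionTrack": (116, 1176),
}


def boxes_per_frame(name):
    """Average number of annotated boxes per frame for a dataset."""
    frames_k, boxes_k = DATASETS[name]
    return boxes_k / frames_k


def footage_minutes(frames_k, fps=30.0):
    """Total footage in minutes for a frame count given in thousands.

    Assumption: the frame count is summed over all camera views.
    """
    return frames_k * 1000 / fps / 60


if __name__ == "__main__":
    for name in DATASETS:
        print(f"{name}: {boxes_per_frame(name):.1f} boxes per frame")
    vt_frames_k = DATASETS["VisionTrack"][0]
    print(f"VisionTrack footage: {footage_minutes(vt_frames_k):.1f} min at 30 FPS")
```

Under these numbers, VisionTrack averages roughly ten boxes per frame, comparable to DIVOTrack and higher than EPFL, CAMPUS, and MvMHAT.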

III-B2 Annotation

Our dataset annotation process had two steps. In the first step, annotators identified and assigned unique IDs to the same targets present in the overlapping fields of view from both cameras. In the second step, we employed the semi-automatic object tracking annotation software DarkLabel to annotate, frame by frame, the targets selected in the first step in the synchronized videos captured by the two cameras. This process was repeated until all targets in the videos were assigned unique IDs. Targets appearing in only one camera’s view were annotated solely within that view. We designed annotation guidelines for our MTMC tracking dataset, based on methods from multi-object tracking [51]:

  • Even if a portion of the target is occluded, annotators are required to estimate the complete bounding box.

  • When over half of the target’s body is out of the camera’s view, annotators disregard these targets.

  • For any given target, a unique ID is maintained throughout the entire video sequence captured by both cameras.
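The second guideline can be turned into an automatic consistency check on finished annotations. The sketch below is only an illustration under stated assumptions: it approximates "how much of the body is in view" by the fraction of the estimated full-body box area that lies inside the image, and the helper names `visible_fraction` and `should_annotate` are hypothetical. The default image size follows the 1920×1080 resolution reported for the dataset.

```python
def visible_fraction(box, width=1920, height=1080):
    """Fraction of a full-body box (x1, y1, x2, y2) that lies inside the image."""
    x1, y1, x2, y2 = box
    area = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    if area == 0.0:
        return 0.0
    # Clip the box to the image rectangle and measure the remaining area.
    inside_w = max(0.0, min(x2, width) - max(x1, 0))
    inside_h = max(0.0, min(y2, height) - max(y1, 0))
    return inside_w * inside_h / area


def should_annotate(box, width=1920, height=1080):
    """Keep a target unless more than half of its box is outside the view."""
    return visible_fraction(box, width, height) >= 0.5
```

For example, a box spanning x from -80 to 20 has only 20% of its area in view and would be skipped, while a box spanning x from -50 to 50 sits exactly at the limit and is kept.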

IV Method

To achieve robust, online MTMC tracking in overlapping fields of view, we introduce a novel transformer-based Global MTMC Tracking (GMT) model, as illustrated in Figure 2. Our GMT model consists of three main modules: a target detection module, a feature extraction and fusion module, and a global association module. These modules are discussed in detail in the following three subsections. In subsection A, we present a brief definition of MTMC tracking, as well as the object detection module. Subsection B introduces a new method for target feature extraction and fusion in MTMC tracking. In subsection C, we propose an MTMC association module, comprising a global association transformer and a memory bank, capable of achieving online MTMC tracking and lost trajectory recovery.

IV-A Overview

IV-A1 Problem definition

Given a scenario with $C$ cameras having overlapping fields of view, the camera set is denoted as $K=\{k_c\}_{c=1}^{C}$. The video stream $v_c$ is captured from camera $k_c$ and consists of $T_c$ consecutive frames, represented as $M_c=\{m_c^t\}_{t=1}^{T_c}$, where $m_c^t$ represents the frame at time $t$ in the video stream $v_c$. The video streams captured from all cameras form the video stream set $V=\{v_c\}_{c=1}^{C}$. Our aim is to perform online detection and tracking of every target captured by the synchronized cameras. The goal is to assign IDs to the corresponding targets in different cameras and produce global trajectories. Assume there are $I$ ground truth trajectories in the video stream set $V$, with each trajectory denoted by $\Gamma_i$, where $\Gamma=\{\Gamma_i\}_{i=1}^{I}$. A trajectory $\Gamma_i$, consisting of bounding boxes for the same target across all frames, is denoted by $\Gamma_i=\{\tau_i^{ct}\}_{c=1,t=1}^{C,T}$, where $\tau_i^{ct}$ represents the bounding box of trajectory $\Gamma_i$ in frame $m_c^t$.

$$\tau_i^{ct}=\begin{cases}\left(x_1^{ict},y_1^{ict},x_2^{ict},y_2^{ict}\right)^{T} & \text{if } \Gamma_i \text{ appears in } m_c^t\\ \emptyset & \text{otherwise}\end{cases} \tag{1}$$
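Equation (1) maps naturally onto a sparse lookup keyed by camera and time. The sketch below is one possible representation, not the authors' data structure. The class name `Trajectory` and its methods are hypothetical, and `None` plays the role of the empty set for frames in which the target does not appear.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Trajectory:
    """One global trajectory: boxes indexed by (camera index c, time t)."""

    track_id: int
    boxes: Dict[Tuple[int, int], Box] = field(default_factory=dict)

    def add(self, camera: int, t: int, box: Box) -> None:
        """Record the box of this target in frame (camera, t)."""
        self.boxes[(camera, t)] = box

    def tau(self, camera: int, t: int) -> Optional[Box]:
        """Return the box in frame (camera, t), or None if the target is absent."""
        return self.boxes.get((camera, t))


if __name__ == "__main__":
    traj = Trajectory(track_id=7)
    traj.add(camera=1, t=5, box=(10.0, 20.0, 50.0, 120.0))
    print(traj.tau(1, 5))  # box recorded for camera 1 at time 5
    print(traj.tau(2, 5))  # None: the target is not visible in camera 2 at time 5
```

Storing only the frames where the target appears keeps the structure compact when a target is visible in a small part of the $C \times T$ grid.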

IV-A2 Object detection

Given an image set $M=\{m_c^t\}_{c=1,t=1}^{C,T}$ captured from $C$ synchronized cameras over a time period $T$, the object detector processes each frame $m_c^t$ and detects $N_c^t$ targets. These detected targets in frame $m_c^t$ are denoted as $P^{ct}=\{p_n^{ct}\}_{n=1}^{N_c^t}$.

The set of detected targets from all frames in image set $M$ is denoted as $P=\bigcup_{c=1}^{C}\bigcup_{t=1}^{T}P^{ct}=\{p_n\}_{n=1}^{N}$, where $N=\sum_{c=1}^{C}\sum_{t=1}^{T}N_c^t$.

The corresponding bounding boxes for the detected targets in set $P$ are represented as $B=\{b_n\}_{n=1}^{N}$, where $b_n$ represents the spatial coordinates of the $n^{th}$ detected target in set $P$ and is given by $b_n=\left(x_1^n,y_1^n,x_2^n,y_2^n\right)^{T}$. Each detected target $p_n$ is defined by its bounding box, temporal index $t_n$, and camera index $c_n$, where $p_n=\left(b_n,t_n,c_n\right)$.
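The construction of the flat detection set $P$ from per-frame detections can be sketched as below. This is an illustrative data layout only. The names `Detection` and `flatten_detections` are hypothetical, and the input is assumed to be a dictionary from (camera index, time) to the list of boxes detected in that frame.

```python
from typing import Dict, List, NamedTuple, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


class Detection(NamedTuple):
    """A detected target p_n = (b_n, t_n, c_n)."""

    box: Box
    t: int
    camera: int


def flatten_detections(per_frame: Dict[Tuple[int, int], List[Box]]) -> List[Detection]:
    """Merge the per-frame sets P^{ct} into a single indexed list P."""
    detections = []
    for camera, t in sorted(per_frame):          # deterministic order
        for box in per_frame[(camera, t)]:
            detections.append(Detection(box=box, t=t, camera=camera))
    return detections


if __name__ == "__main__":
    frames = {
        (1, 1): [(0.0, 0.0, 10.0, 20.0), (30.0, 5.0, 45.0, 40.0)],
        (2, 1): [(100.0, 50.0, 130.0, 110.0)],
        (1, 2): [],
    }
    P = flatten_detections(frames)
    # N equals the sum of the per-frame counts N_c^t.
    print(len(P), sum(len(boxes) for boxes in frames.values()))
```

Keeping the time and camera indices next to each box is what later allows spatio-temporal features to be built per target.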

Our pipeline’s detection module can be integrated with most existing detectors for end-to-end training. We employ CenterNet [25] with a DLA34 [52] backbone for object detection. CenterNet is an anchor-free detector based on keypoint detection. Images are passed through CenterNet to produce heatmaps, where peaks in the heatmaps correspond to object centers. The overall loss function for the network is defined as

det=k+λsizesize+λoffoffsubscriptdetsubscript𝑘subscript𝜆sizesubscriptsizesubscript𝜆offsubscriptoff\mathcal{L}_{\text{det}}=\mathcal{L}_{k}+\lambda_{\text{size}}\mathcal{L}_{% \text{size}}+\lambda_{\text{off}}\mathcal{L}_{\text{off}}caligraphic_L start_POSTSUBSCRIPT det end_POSTSUBSCRIPT = caligraphic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT size end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT size end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT off end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT off end_POSTSUBSCRIPT (2)

where the keypoint loss $\mathcal{L}_{k}$, object size loss $\mathcal{L}_{\text{size}}$, and center offset loss $\mathcal{L}_{\text{off}}$ are consistent with their definitions in the original paper [25]. The loss weights are set as $\lambda_{\text{size}}=0.1$ and $\lambda_{\text{off}}=1.0$, which are also consistent with the original paper.
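
As a rough illustration of the two ideas above, the following sketch reads object centers off a heatmap as 3x3 local maxima and combines the three loss terms with the weights of Eq. (2). The function names, the nested-list heatmap representation, and the score threshold of 0.3 are assumptions made for this example; they are not taken from the paper or from the CenterNet code base.

```python
def heatmap_peaks(heatmap, threshold=0.3):
    """Return (row, col, score) for every cell that is a 3x3 local maximum.

    `heatmap` is a list of rows of floats. `threshold` is a hypothetical
    score cut-off chosen for this example only.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            # 3x3 neighbourhood (including the cell itself), clipped at borders.
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            # Ties are kept: equal neighbouring maxima are all reported.
            if score >= max(neighbours):
                peaks.append((r, c, score))
    return peaks


def detection_loss(l_k, l_size, l_off, lambda_size=0.1, lambda_off=1.0):
    """Weighted sum of Eq. (2), with the loss weights stated in the text."""
    return l_k + lambda_size * l_size + lambda_off * l_off


heatmap = [
    [0.1, 0.2, 0.1],
    [0.2, 0.9, 0.2],
    [0.1, 0.2, 0.1],
]
print(heatmap_peaks(heatmap))
print(detection_loss(1.0, 2.0, 0.5))
```

In a real detector the same local-maximum test is usually applied to the whole heatmap tensor at once; the explicit loops here are only meant to make the rule readable.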

Figure 5: Global association transformer. During online inference, $Q$ represents the features of targets to be matched in the current frame, and $F$ represents the features of all detected targets (including matched targets in historical frames and unmatched targets in the current frame) within the time window. During training, $Q=F$, where both the encoder and decoder inputs are the features of all detected targets.

IV-B Feature extraction module

Re-ID features are particularly effective in scenarios where targets have distinct appearances, enabling reliable identification across different camera views. However, relying solely on appearance features can be insufficient, especially in crowded scenes or when targets have similar visual characteristics. To address this limitation, our model also integrates spatio-temporal features. These features include the target’s bounding box position, temporal information, and camera index. The bounding box position provides spatial context, indicating where the target is located within the frame. Temporal information captures the motion patterns of the target over time, and the camera index distinguishes between different camera views. We employ RoIAlign [53] to extract the Re-ID features of each detected target in the set $P$. These Re-ID features encapsulate the discriminative visual attributes of each detected target, aiding in distinguishing between different individuals even under challenging conditions such as occlusions or changes in viewpoint. The Re-ID feature for the detected target $p_{n}$ is denoted as $f_{n}^{\text{roi}}$. The Re-ID features for all detected targets in $P$ are concatenated as $F^{\text{roi}}=\text{concat}(\{f_{n}^{\text{roi}}\}_{n=1}^{N})$, where $N$ is the number of targets.
Additionally, we merge the box position, temporal, and camera index features of each detected target to form a comprehensive spatio-temporal feature vector. This feature incorporates spatial location information, temporal dynamics, and camera-specific attributes, thereby enriching the feature representation with contextual information essential for accurate target association across camera views. The box position, temporal, and camera index features of the detected target $p_{n}$ are merged to form the spatio-temporal feature vector: $f_{n}^{\text{st}}=\left(\frac{x_{1}^{n}}{w},\frac{y_{1}^{n}}{h},\frac{x_{2}^{n}}{w},\frac{y_{2}^{n}}{h},\frac{t_{n}}{T},\frac{c_{n}}{C}\right)$, where $w$ and $h$ are the width and height of the frame containing $p_{n}$. The spatio-temporal features for all targets in $P$ are concatenated as $F^{\text{st}}=\text{concat}(\{f_{n}^{\text{st}}\}_{n=1}^{N})$. By passing $f_{n}^{\text{roi}}$ and $f_{n}^{\text{st}}$ through the Re-ID feature encoder $H_{\text{roi}}$ and the spatio-temporal feature encoder $H_{\text{st}}$, respectively, the fused feature for target $p_{n}$ is constructed as $f_{n}=\text{concat}(H_{\text{roi}}(f_{n}^{\text{roi}}),H_{\text{st}}(f_{n}^{\text{st}}))\in\mathbb{R}^{D}$, where $H_{\text{roi}}(f_{n}^{\text{roi}})\in\mathbb{R}^{D_{\text{roi}}}$ and $H_{\text{st}}(f_{n}^{\text{st}})\in\mathbb{R}^{D_{\text{st}}}$. Here, $D$ is the dimension of the fused feature $f_{n}$, $D_{\text{roi}}$ is the dimension of the output Re-ID feature, and $D_{\text{st}}$ is the dimension of the output spatio-temporal feature, with $D=D_{\text{st}}+D_{\text{roi}}$. The fused features of targets in $P$, denoted as $F=\text{concat}(\{f_{n}\}_{n=1}^{N})\in\mathbb{R}^{N\times D}$, serve as inputs for the global association module.
Overall, the fusion of these diverse features results in a more robust and comprehensive representation of each target. By combining appearance and spatio-temporal information, our model can better handle challenging scenarios where targets may look similar but move differently or when their appearance changes due to varying environmental conditions across cameras.
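
The construction of the spatio-temporal vector and the concatenation step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the two encoders are passed in as arbitrary callables standing in for $H_{\text{roi}}$ and $H_{\text{st}}$ (whose architectures are not specified in this section), and `num_frames` and `num_cams` play the roles of $T$ and $C$.

```python
def spatio_temporal_feature(box, t, c, frame_w, frame_h, num_frames, num_cams):
    """Build f_n^st = (x1/w, y1/h, x2/w, y2/h, t/T, c/C) for one detection."""
    x1, y1, x2, y2 = box
    return [
        x1 / frame_w, y1 / frame_h,   # top-left corner, normalised by frame size
        x2 / frame_w, y2 / frame_h,   # bottom-right corner, normalised by frame size
        t / num_frames,               # temporal index, normalised by T
        c / num_cams,                 # camera index, normalised by C
    ]


def fuse(f_roi, f_st, encode_roi, encode_st):
    """Concatenate the two encoded features: f_n = concat(H_roi(.), H_st(.))."""
    return list(encode_roi(f_roi)) + list(encode_st(f_st))


# Example with identity encoders (an assumption made purely for illustration).
f_st = spatio_temporal_feature(
    box=(192, 108, 384, 540), t=5, c=2,
    frame_w=1920, frame_h=1080, num_frames=10, num_cams=4,
)
f_roi = [0.3, -0.7, 1.2]  # made-up Re-ID embedding
f_n = fuse(f_roi, f_st, encode_roi=lambda v: v, encode_st=lambda v: v)
print(len(f_n))  # D = D_roi + D_st
```

Normalising every component to a comparable range keeps the box, time, and camera terms on a similar scale before they reach the spatio-temporal encoder.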

IV-C Global association module

In Section IV-B, we obtained the fused features $F$ of all detected targets within the image set $M$. The global association module (shown in Figure 5) enhances and interrelates target features, producing a similarity matrix between the targets to be matched in the current frame and the detected targets in the time window (including matched targets in historical frames and unmatched targets in the current frame). This matrix is then used with the Hungarian algorithm to associate unmatched targets with matched targets and to produce global trajectories. Targets that fail to match any trajectory are compared against historical trajectories stored in the memory bank to determine whether new trajectories should be generated.

IV-C1 Target association learning

Our global association transformer is constructed with an encoder layer and a decoder layer. The encoder processes $F\in\mathbb{R}^{N\times D}$, representing the features of all detected targets within a time window, where $N$ is the number of all detected targets. The decoder processes $Q\in\mathbb{R}^{N_{q}\times D}$, representing the features of unmatched targets, where $N_{q}$ is the number of unmatched targets. This architecture is designed to explore the associations between unmatched targets and matched targets, thereby facilitating identification and tracking. The encoder uses a self-attention layer and a feed-forward layer to enhance the features $F$. The enhanced features output from the encoder are denoted as $F_{e}\in\mathbb{R}^{N\times D}$. In the decoder layer, cross-attention operations are conducted between $F_{e}$ and the features output from the decoder’s self-attention layer. The enhanced features output by the decoder are $Q_{d}\in\mathbb{R}^{N_{q}\times D}$.
The similarity matrix between $Q$ and $F$ is obtained by the matrix multiplication $G=Q_{d}{F_{e}}^{T}\in\mathbb{R}^{N_{q}\times N}$. Here, $G_{ij}$ is an element of the matrix $G$ and represents the score indicating the likelihood that the $i^{th}$ target in $Q$ and the $j^{th}$ target in $F$ belong to the same trajectory.

To determine the correlation between each detected target and the ground truth trajectory, we follow an assignment method consistent with object detection. If the ground truth bounding box $\tau_{k}^{c_{n}t_{n}}$ from trajectory $\Gamma_{k}$ in frame $m_{c_{n}}^{t_{n}}$ has the highest IoU with bounding box $b_{n}$ (corresponding to target $p_{n}$) among all targets detected in $m_{c_{n}}^{t_{n}}$ and this IoU exceeds 0.6, we determine that target $p_{n}$ belongs to trajectory $\Gamma_{k}$. Otherwise, $p_{n}$ has no association with $\Gamma_{k}$.

$$p_{n}\begin{cases}\in\Gamma_{k}&\text{if }\arg\max_{i}\text{IoU}(\tau_{k}^{c_{n}t_{n}},b_{i})=n\text{ and }\text{IoU}(\tau_{k}^{c_{n}t_{n}},b_{n})>0.6\\\notin\Gamma_{k}&\text{otherwise}\end{cases} \tag{3}$$
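
The assignment rule of Eq. (3) can be sketched for a single frame as follows. This is an illustrative sketch under stated assumptions: the function names and the dictionary-based inputs are hypothetical, and the handling of two ground truth boxes that select the same detection (the one with the higher IoU wins) is a choice made here, since the text does not specify it.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_to_trajectories(gt_boxes, det_boxes, iou_threshold=0.6):
    """Apply Eq. (3) within one frame.

    gt_boxes:  dict mapping trajectory id k -> ground truth box in this frame.
    det_boxes: list of detected boxes b_i in the same frame.
    Returns a dict mapping detection index n -> trajectory id k.
    """
    best_for_det = {}  # detection index -> (trajectory id, IoU)
    for traj_id, gt in gt_boxes.items():
        if not det_boxes:
            continue
        ious = [iou(gt, b) for b in det_boxes]
        n = max(range(len(det_boxes)), key=ious.__getitem__)  # arg max over i
        if ious[n] > iou_threshold:
            # Assumption: if several trajectories pick detection n, keep the best IoU.
            if n not in best_for_det or ious[n] > best_for_det[n][1]:
                best_for_det[n] = (traj_id, ious[n])
    return {n: traj for n, (traj, _) in best_for_det.items()}


gt = {7: (0, 0, 10, 10), 8: (100, 100, 110, 110)}
dets = [(1, 0, 10, 10), (50, 50, 60, 60), (0, 0, 4, 4)]
print(assign_to_trajectories(gt, dets))
```

Detections that do not appear in the returned mapping have no associated trajectory in this frame.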

During network training, the trajectory to which each target belongs is known. Therefore, the features of all targets are used as inputs for both the encoder and the decoder, i.e., $Q=F$ and $G\in\mathbb{R}^{N\times N}$, where the features of all targets function as queries to establish associations between any two targets. This configuration enables the model to effectively learn the relationships and associations among targets, both within the same trajectory and across different trajectories during the training stage.

We represent the case where $p_{i}$ and $p_{j}$ belong to the same trajectory as $p_{i}\leftrightarrow p_{j}$. $G_{ij}$ represents the similarity score of $p_{i}\leftrightarrow p_{j}$. Considering that $p_{i}$ may not correspond to any target in frame $m_{c}^{t}$, we define the similarity score of $p_{i}\leftrightarrow\emptyset$ as $g_{i0}=0$. We apply softmax to the similarity matrix $G$ separately for each frame to obtain the association matrix $H$. For the frame $m_{c_{j}}^{t_{j}}$ where the target $p_{j}$ is detected, the element $H_{ij}$ denotes the probability of $p_{i}\leftrightarrow p_{j}$.
This is achieved by normalizing the similarity scores $G_{ij}$ across all potential targets within the same frame $m_{c_{j}}^{t_{j}}$, ensuring that the sum of probabilities for $p_{i}$ associating with any target (including the possibility that $p_{i}$ does not correspond to any target, i.e., $p_{i}\leftrightarrow\emptyset$) in that frame equals 1.

$$H_{ij}=\frac{e^{G_{ij}}}{e^{g_{i0}}+\sum_{q=1}^{N}1_{(t_{q}=t_{j}\text{ and }c_{q}=c_{j})}e^{G_{iq}}} \tag{4}$$

where $(t_{q}=t_{j}\text{ and }c_{q}=c_{j})$ denotes that target $p_{q}$ is detected in frame $m_{c_{j}}^{t_{j}}$, and $1_{(f)}=\begin{cases}1&\text{if }f=\text{true}\\0&\text{if }f=\text{false}\end{cases}$.

Similarly, on image $m_{c}^{t}$, the probability of $p_{i}\leftrightarrow\emptyset$ is

$$h_{i0}^{ct}=\frac{e^{g_{i0}}}{e^{g_{i0}}+\sum_{q=1}^{N}1_{(t_{q}=t\text{ and }c_{q}=c)}e^{G_{iq}}} \tag{5}$$

where $(t_{q}=t\text{ and }c_{q}=c)$ denotes that target $p_{q}$ is in frame $m_{c}^{t}$.
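
Eqs. (4) and (5) amount to a softmax over the columns of $G$ that share a frame, with one extra "no match" entry whose score is $g_{i0}=0$. A minimal sketch, assuming $G$ is given as a list of rows and each column carries a `(camera, time)` key; the function name and this data layout are choices made for the example only:

```python
import math
from collections import defaultdict


def association_probabilities(G, frames, null_score=0.0):
    """Per-frame softmax with a null option.

    G:      N_q x N list of lists of similarity scores G_ij.
    frames: length-N list of (camera, time) keys, one per column of G.
    Returns (H, h_null): H has the shape of G (Eq. 4); h_null[i][key] is the
    probability that target i matches nothing in frame `key` (Eq. 5).
    """
    cols_by_frame = defaultdict(list)
    for j, key in enumerate(frames):
        cols_by_frame[key].append(j)

    H = [[0.0] * len(frames) for _ in G]
    h_null = [dict() for _ in G]
    for i, row in enumerate(G):
        for key, cols in cols_by_frame.items():
            # Subtract the maximum score before exponentiating for stability.
            m = max([null_score] + [row[j] for j in cols])
            e_null = math.exp(null_score - m)
            exps = {j: math.exp(row[j] - m) for j in cols}
            denom = e_null + sum(exps.values())
            for j in cols:
                H[i][j] = exps[j] / denom
            h_null[i][key] = e_null / denom
    return H, h_null


# One query target, three detections: two in frame (1, 1), one in frame (2, 1).
G = [[0.0, 0.0, math.log(3.0)]]
frames = [(1, 1), (1, 1), (2, 1)]
H, h_null = association_probabilities(G, frames)
print(H[0], h_null[0])
```

Within each frame, the entries of `H` for that frame plus the corresponding `h_null` value sum to 1, which is the normalisation property described above.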

The ground truth of the association matrix $H$ is represented by $X$. $X_{ij}=1$ indicates $p_{i}\leftrightarrow p_{j}$.

$$X_{ij}=\begin{cases}1&\text{if }p_{i}\leftrightarrow p_{j}\\0&\text{otherwise}\end{cases} \tag{6}$$

Similarly, the ground truth for $h_{i0}^{ct}$ is denoted as $x_{i0}^{ct}$. $x_{i0}^{ct}=1$ indicates $p_{i}\leftrightarrow\emptyset$ in frame $m_{c}^{t}$.

$$x_{i0}^{ct}=\begin{cases}1&\text{if }p_{i}\leftrightarrow\emptyset\text{ in }m_{c}^{t}\\0&\text{otherwise}\end{cases} \tag{7}$$

Given that each target corresponds to at most one target in every frame (or to none, i.e., $\emptyset$), we use the multi-class cross-entropy loss in each frame to calculate the association loss between targets. In frame $m_{c}^{t}$, the loss is calculated by:

$$\mathcal{L}_{asso}^{ct}=-\frac{1}{N}\left(\sum_{i=1}^{N}\left(\sum_{j=1}^{N}X_{ij}\log(H_{ij})\cdot 1_{(t_{j}=t\text{ and }c_{j}=c)}+x_{i0}^{ct}\log(h_{i0}^{ct})\right)\right) \tag{8}$$

We take the sum of cross-entropy losses on all frames as the total target association loss:

$$\mathcal{L}_{asso}=\sum_{c=1}^{C}\sum_{t=1}^{T}\mathcal{L}_{asso}^{ct} \tag{9}$$
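
The per-frame loss of Eq. (8) and its sum over frames in Eq. (9) can be sketched as below for the training case $Q=F$, where $G$ is square. Several details are assumptions of this sketch rather than statements from the paper: the function name, the use of per-detection trajectory ids to derive $X_{ij}$ and $x_{i0}^{ct}$, counting a target as a positive for itself ($X_{ii}=1$), and treating a detection whose id is `None` as matching nothing.

```python
import math
from collections import defaultdict


def association_loss(G, frames, traj_ids, null_score=0.0):
    """Sum of per-frame cross-entropy association losses (Eqs. 8 and 9).

    G:        N x N list of lists of similarity scores (training case, Q = F).
    frames:   length-N list of (camera, time) keys, one per detection.
    traj_ids: length-N list of ground truth trajectory ids (hypothetical input).
    """
    n = len(G)
    cols_by_frame = defaultdict(list)
    for j, key in enumerate(frames):
        cols_by_frame[key].append(j)

    total = 0.0
    # Frames without detections contribute zero, so only occupied frames are visited.
    for key, cols in cols_by_frame.items():
        frame_sum = 0.0
        for i in range(n):
            # Log of the shared softmax denominator of Eqs. (4)-(5), computed stably.
            scores = [null_score] + [G[i][j] for j in cols]
            m = max(scores)
            log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
            positives = [
                j for j in cols
                if traj_ids[i] is not None and traj_ids[j] == traj_ids[i]
            ]
            if positives:
                # X_ij = 1 terms: log H_ij = G_ij - log(denominator).
                frame_sum += sum(G[i][j] - log_denom for j in positives)
            else:
                # x_i0 = 1 term: log h_i0 = g_i0 - log(denominator).
                frame_sum += null_score - log_denom
        total += -frame_sum / n  # Eq. (8), accumulated over frames as in Eq. (9)
    return total


frames = [(1, 1), (1, 2)]
uninformative = association_loss([[0.0, 0.0], [0.0, 0.0]], frames, [5, 6])
confident = association_loss([[10.0, -10.0], [-10.0, 10.0]], frames, [5, 6])
print(uninformative, confident)
```

With all-zero scores every frame offers two equally likely options (one detection or the null match), so the loss is $2\log 2$; scores that favour the correct associations drive it toward zero.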

The complete network is optimized through the following loss function:

$$\mathcal{L}=\mathcal{L}_{\text{det}}+\mathcal{L}_{asso} \tag{10}$$
Figure 6: Comparison of tracking performance with and without the memory bank module. The target wearing white clothing experienced continuous occlusion from the left image until reappearing in the right image, spanning a duration of 152 frames, which exceeds the time window limit. The absence of such a module could lead to ID switching errors.
TABLE III: Comparison of MTMC tracking results on the EPFL, CAMPUS, MvMHAT, WILDTRACK, DIVOTrack and VisionTrack datasets. CA and C1 respectively stand for CVMA and CVIDF1. The best two results are shown in red and blue.

| Methods | EPFL CA↑ | EPFL C1↑ | CAMPUS CA↑ | CAMPUS C1↑ | MvMHAT CA↑ | MvMHAT C1↑ | WILDTRACK CA↑ | WILDTRACK C1↑ | DIVOTrack CA↑ | DIVOTrack C1↑ | VisionTrack CA↑ | VisionTrack C1↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OSNet | 73.0 | 40.3 | 58.8 | 47.8 | 92.6 | 87.7 | 10.8 | 18.2 | 33.0 | 44.9 | 63.8 | 64.4 |
| Strong | 75.6 | 45.2 | 63.4 | 55.0 | 49.0 | 55.1 | 28.6 | 41.6 | 39.1 | 44.7 | 16.9 | 43.0 |
| AGW | 73.9 | 43.2 | 60.8 | 52.8 | 92.5 | 86.6 | 15.6 | 23.8 | 54.3 | 55.3 | 70.1 | 68.4 |
| MvMHAT | 30.5 | 33.7 | 56.0 | 55.6 | 70.1 | 68.4 | 10.3 | 16.2 | 58.2 | 60.7 | 39.2 | 53.6 |
| CrossMOT | 74.4 | 47.3 | 65.6 | 61.2 | 92.3 | 87.4 | 42.3 | 56.7 | 68.8 | 69.1 | 64.5 | 64.4 |
| GMT | 76.1 | 71.2 | 71.0 | 77.6 | 96.5 | 97.0 | 61.0 | 60.0 | 75.3 | 78.6 | 76.0 | 81.4 |
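TABLE IV: Comparison of MTMC tracking results on each scene of VisionTrack dataset. CA and C1 respectively stand for CVMA and CVIDF1. The best two results are shown in red and blue.

| Methods | Garden CA↑ | Garden C1↑ | Night CA↑ | Night C1↑ | Gate CA↑ | Gate C1↑ | Square CA↑ | Square C1↑ | Path CA↑ | Path C1↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| OSNet | 68.1 | 68.7 | 69.2 | 63.2 | 86.9 | 90.8 | 85.6 | 80.6 | 57.4 | 44.9 |
| Strong | 42.5 | 56.6 | 3.3 | 34.2 | 6.1 | 42.4 | 9.1 | 47.5 | 13.9 | 49.7 |
| AGW | 76.5 | 73.8 | 66.2 | 61.0 | 86.4 | 90.4 | 87.1 | 87.8 | 70.2 | 78.0 |
| MvMHAT | 75.2 | 72.1 | 62.1 | 61.5 | 86.7 | 82.2 | 86.3 | 85.3 | 45.3 | 62.4 |
| CrossMOT | 62.1 | 67.9 | 74.8 | 52.6 | 80.4 | 78.2 | 83.2 | 91.0 | 62.2 | 64.7 |
| GMT | 82.3 | 81.5 | 48.6 | 73.4 | 85.9 | 91.2 | 87.2 | 90.4 | 77.6 | 86.4 |

| Methods | Football CA↑ | Football C1↑ | Woods CA↑ | Woods C1↑ | Park CA↑ | Park C1↑ | Road CA↑ | Road C1↑ | Bridge CA↑ | Bridge C1↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| OSNet | 78.8 | 74.5 | 75.9 | 53.5 | 43.8 | 59.1 | -18.5 | 37.6 | 2.0 | 20.4 |
| Strong | 21.0 | 47.3 | 14.9 | 30.8 | 10.2 | 37.3 | -37.3 | 25.4 | -0.2 | 18.1 |
| AGW | 81.6 | 78.0 | 74.9 | 61.3 | 63.1 | 63.3 | -4.8 | 42.7 | 2.4 | 19.4 |
| MvMHAT | 32.1 | 53.6 | 67.0 | 59.0 | 20.8 | 45.4 | -28.5 | 29.5 | -0.3 | 19.2 |
| CrossMOT | 75.5 | 71.7 | 36.7 | 50.9 | 49.6 | 58.5 | -5.8 | 39.8 | -0.9 | 19.1 |
| GMT | 83.5 | 86.1 | 79.0 | 83.5 | 74.5 | 83.3 | 11.2 | 52.8 | 20.7 | 45.1 |

| Methods | Basketball CA↑ | Basketball C1↑ | Canteen CA↑ | Canteen C1↑ | Court1 CA↑ | Court1 C1↑ | Court2 CA↑ | Court2 C1↑ | Court3 CA↑ | Court3 C1↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| OSNet | 76.5 | 57.3 | 28.7 | 54.1 | 95.2 | 75.8 | 86.0 | 89.9 | 77.9 | 84.8 |
| Strong | 7.0 | 33.2 | 29.3 | 46.8 | 2.5 | 38.7 | -11.7 | 44.9 | 1.2 | 44.4 |
| AGW | 84.3 | 64.5 | 33.9 | 55.2 | 93.2 | 74.8 | 85.4 | 91.9 | 98.0 | 88.2 |
| MvMHAT | 48.3 | 48.3 | 28.0 | 50.8 | 93.1 | 76.5 | 84.6 | 88.8 | 96.2 | 87.8 |
| CrossMOT | 72.1 | 59.9 | 47.4 | 54.9 | 87.4 | 83.3 | 83.1 | 90.9 | 96.2 | 97.5 |
| GMT | 86.6 | 86.9 | 59.8 | 71.8 | 97.3 | 98.6 | 93.6 | 96.9 | 92.5 | 96.2 |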
TABLE V: Comparison of single-camera tracking metrics on VisionTrack dataset. The best two results are shown in red and blue.

| Methods | MOTA↑ | MOTP↑ | IDF1↑ | MT↑ | ML↓ | HOTA↑ | DetA↑ | AssA↑ | FP↓ | FN↓ | IDS↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| OSNet | 77.4 | 80.3 | 68.4 | 79.8 | 5.1 | 57.8 | 62.8 | 53.7 | 174982 | 187397 | 1973 |
| Strong | 19.1 | 78.7 | 45.7 | 50.4 | 30.3 | 41.3 | 34.1 | 50.4 | 304574 | 317473 | 2698 |
| AGW | 77.0 | 80.3 | 77.0 | 79.5 | 5.4 | 59.5 | 62.9 | 57.0 | 160693 | 174219 | 2243 |
| MvMHAT | 76.9 | 80.4 | 66.0 | 79.5 | 5.1 | 56.3 | 62.5 | 51.3 | 189075 | 200258 | 1680 |
| CrossMOT | 75.6 | 78.5 | 65.8 | 76.0 | 5.7 | 54.5 | 60.8 | 49.3 | 189582 | 201867 | 1767 |
| GMT | 79.7 | 81.2 | 82.7 | 76.9 | 7.7 | 67.0 | 65.0 | 69.5 | 78481 | 115016 | 843 |

IV-C2 Online Inference

During online inference, we process the video stream using a sliding time window of size $W$ with a step size of $S=1$. At time $t$, from the image set $\{m_c^t\}_{c=1}^{C}$ obtained from all camera views, we detect $N_t$ targets, denoted as $P^t=\{p_n^t\}_{n=1}^{N_t}$, with their corresponding fused features given by $F^t=\{f_n^t\}_{n=1}^{N_t}$. Let $T$ represent the current time point. The starting time of the window is then $T_s=\max(1, T-W+1)$.
We maintain a historical cache holding the fused features of all $N$ targets within the time window, where $N=\sum_{t=T_s}^{T}N_t$. All targets in the time window are denoted as $P=\bigcup_{t=T_s}^{T}P^t=\{p_n\}_{n=1}^{N}$. The input $F$ to the encoder of the global association module represents the features of all targets in the time window, where $F=\text{concat}(\{F^t\}_{t=T_s}^{T})$ and $F\in\mathbb{R}^{N\times D}$.
The input $Q$ to the decoder of the global association module represents the features of the unmatched targets in the current frame, where $Q=\text{concat}(F^T)$ and $Q\in\mathbb{R}^{N_T\times D}$. The global association transformer produces a similarity matrix $G\in\mathbb{R}^{N_T\times N}$, indicating the similarity scores between each unmatched target in the current frame and all targets in the time window. Assume that there are $N_R$ trajectories corresponding to the matched targets within the current time window, denoted as $\Gamma=\{\Gamma_i\}_{i=1}^{N_R}$.
As the targets within the time window, except for those in the current frame, have already been assigned trajectories in previous inference steps, we transform the similarity matrix $G$ into $G'$, representing associations between unmatched targets and existing trajectories, where $G'\in\mathbb{R}^{N_T\times N_R}$:

$$G' = GM$$

where the element $M_{ij}$ of the matrix $M\in\mathbb{R}^{N\times N_R}$ represents the correspondence between target $p_i$ in $P$ and trajectory $\Gamma_j$ in $\Gamma$:

$$M_{ij}=\begin{cases}1 & \text{if } p_i\in\Gamma_j\\ 0 & \text{if } p_i\notin\Gamma_j\end{cases}$$
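To make the aggregation $G' = GM$ concrete, the following minimal Python sketch builds a membership matrix and applies it to a small similarity matrix. The function names (`membership_matrix`, `matmul`), the toy scores, and the use of `None` to mark current-frame targets are illustrative assumptions rather than details of the GMT implementation. Under this reading, each entry of $G'$ sums the similarity scores of all cached targets that belong to one trajectory, and current-frame targets contribute all-zero rows to $M$ because they have no trajectory yet.

```python
def membership_matrix(target_traj_ids, num_trajectories):
    """Build M (N x N_R): M[i][j] = 1 if target i belongs to trajectory j, else 0.

    target_traj_ids[i] is the trajectory index of cached target i, or None for
    a current-frame target that has not been assigned yet (all-zero row).
    """
    M = [[0.0] * num_trajectories for _ in target_traj_ids]
    for i, j in enumerate(target_traj_ids):
        if j is not None:
            M[i][j] = 1.0
    return M


def matmul(A, B):
    """Plain-Python matrix product, sufficient for a small illustration."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# N_T = 2 current-frame targets (rows) scored against N = 6 cached targets:
# four historical targets followed by the two current-frame targets themselves.
G = [
    [0.5, 0.25, 0.0, 0.125, 0.125, 0.0],
    [0.0, 0.25, 0.5, 0.0, 0.0, 0.25],
]
# Historical targets 0 and 3 belong to trajectory 0, targets 1 and 2 to trajectory 1.
M = membership_matrix([0, 1, 1, 0, None, None], num_trajectories=2)
G_prime = matmul(G, M)  # shape N_T x N_R
print(G_prime)
```

In this toy case the first current-frame target accumulates most of its score on trajectory 0 and the second on trajectory 1, while the scores against other current-frame targets are dropped by the zero rows of $M$.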

We apply the Hungarian algorithm to $G'$ to match the current frame’s targets with all trajectories within the time window. We set the threshold for successfully associating unmatched targets with matched trajectories to $\theta_1$. If the association score exceeds $\theta_1$, the target is associated with the corresponding trajectory, forming a global trajectory. Targets with association scores below $\theta_1$ are added to the buffer $\{f_n^u\}_{n=1}^{N_u}$, where $f_n^u$ is the fused feature of the $n$-th unmatched target and $N_u$ is the number of unmatched targets.
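The assignment-then-threshold step can be sketched as follows. To keep the example dependency-free, the optimal assignment is found by exhaustive search over permutations, which reaches the same optimum as the Hungarian algorithm but at factorial cost; a practical system would more likely call a library routine such as SciPy's `linear_sum_assignment`. The function names and the toy score matrix are hypothetical, and the threshold value 0.1 is the $\theta_1$ setting reported in Section V-A2.

```python
from itertools import permutations


def best_assignment(scores):
    """Return the (row, col) pairs that maximize the total score.

    Exhaustive stand-in for the Hungarian algorithm; only suitable for
    tiny matrices such as the one in this illustration.
    """
    n_rows = len(scores)
    n_cols = len(scores[0]) if scores else 0
    if n_rows == 0 or n_cols == 0:
        return []
    if n_rows <= n_cols:
        candidates = (list(zip(range(n_rows), cols))
                      for cols in permutations(range(n_cols), n_rows))
    else:
        candidates = (list(zip(rows, range(n_cols)))
                      for rows in permutations(range(n_rows), n_cols))
    return max(candidates, key=lambda pairs: sum(scores[r][c] for r, c in pairs))


def associate(scores, threshold):
    """Split current-frame targets into accepted matches and buffered targets."""
    matched = [(r, c) for r, c in best_assignment(scores) if scores[r][c] > threshold]
    matched_rows = {r for r, _ in matched}
    unmatched = [r for r in range(len(scores)) if r not in matched_rows]
    return matched, unmatched


# Three current-frame targets scored against two existing trajectories.
G_prime = [
    [0.625, 0.25],
    [0.0, 0.75],
    [0.05, 0.0625],
]
matched, unmatched = associate(G_prime, threshold=0.1)
print(matched, unmatched)
```

Here the third target is left without a trajectory, so it would be placed in the buffer and passed on to the memory-bank stage described next.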

Due to the limited time window size, we cannot recover trajectories that fall outside the time window. We therefore maintain a memory bank to store the features of trajectories that appeared but have been absent for more than $W$ frames. Assuming the memory bank retains $R$ trajectories, it is represented by $\{\widetilde{f}^r\}_{r=1}^{R}$. For trajectory $r$, the average feature of its last $N_{\text{mem}}$ appearances represents the trajectory’s feature in the memory bank: $\widetilde{f}^r=\frac{1}{N_{\text{mem}}}\sum_{n=1}^{N_{\text{mem}}}f_n^r$, where $f_n^r$ is the fused feature of the target that belongs to trajectory $r$.
We use the global association transformer to calculate the association matrix between unmatched targets and historical trajectories, where the features passed to the transformer are $F=\text{concat}(\{\widetilde{f}^r\}_{r=1}^{R})$ and $Q=\text{concat}(\{f_n^u\}_{n=1}^{N_u})$. We set the threshold for successfully matching unmatched targets with trajectories in the memory bank to $\theta_2$. Targets with association scores above $\theta_2$ are associated with historical trajectories. Targets with scores below $\theta_2$ initiate a new trajectory.
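A minimal sketch of the memory-bank feature and the revive-or-create decision is given below. The helper names, the toy features, and the example scores are hypothetical. Two assumptions go beyond the text: a trajectory with fewer than $N_{\text{mem}}$ appearances is averaged over the appearances it has, and the score passed to `revive_or_create` is taken to be the score of the memory-bank trajectory already selected for that target. The defaults 10 and 0.2 are the $N_{\text{mem}}$ and $\theta_2$ settings reported in Section V-A2.

```python
def memory_feature(trajectory_features, n_mem=10):
    """Element-wise average of the last n_mem fused features of one trajectory."""
    if not trajectory_features:
        raise ValueError("a trajectory needs at least one feature")
    recent = trajectory_features[-n_mem:]
    dim = len(recent[0])
    return [sum(f[d] for f in recent) / len(recent) for d in range(dim)]


def revive_or_create(score, matched_id, next_new_id, theta_2=0.2):
    """Return the ID a buffered target takes: a revived one or a fresh one."""
    return matched_id if score > theta_2 else next_new_id


# Toy 2-D features; only the last two appearances are averaged here.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bank_entry = memory_feature(features, n_mem=2)

# A reappearing target that scores above theta_2 keeps its old ID,
# otherwise it starts a new trajectory.
revived = revive_or_create(score=0.35, matched_id=3, next_new_id=57)
created = revive_or_create(score=0.15, matched_id=3, next_new_id=57)
print(bank_entry, revived, created)
```

Averaging several recent appearances, rather than storing only the last one, is a simple way to make the stored feature less sensitive to a single poorly cropped or partially occluded detection.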

V Experiments

V-A Implementation details

V-A1 Training

We employ DLA-34 [54] and CenterNet [25] as the backbone and object detector, respectively. For the VisionTrack dataset, original images are first resized proportionally to $1280\times720$. Each image side is then scaled by a random factor between $0.8$ and $1.2$ before being fed into the backbone. We use the Adam optimizer with an initial learning rate of $10^{-4}$ to pretrain the backbone and object detector on the CrowdHuman [55] dataset for 45,000 iterations with a batch size of 32. Subsequently, we integrate the full network and co-train the global association transformer with the object detector on the VisionTrack dataset for 18,000 iterations using Adam, starting with a learning rate of $5\times10^{-5}$ and a batch size of 18. Training was conducted on two NVIDIA A6000 48G GPUs.

V-A2 Inference

For the VisionTrack dataset, we set the object detection threshold to 0.52. In the feature extraction module, the dimensions of the appearance features and spatio-temporal features are set to $D_{\text{roi}}=1024$ and $D_{\text{st}}=128$, respectively. In the global association module, the thresholds for associating existing trajectories and reviving historical ones are set to $\theta_1=0.1$ and $\theta_2=0.2$, respectively. During testing, we set the time window size to $W=60$, and the memory bank stores the average features of the last $N_{\text{mem}}=10$ appearances of the targets. For model output, trajectories shorter than 10 frames are discarded to reduce false positives from object detection. Inference was conducted on one NVIDIA TITAN RTX 24G GPU.
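For reference, the inference settings above can be collected in a single configuration, together with the final length filter. This is only a sketch: the key names, the dictionary layout, and the representation of a trajectory as a list of frame indices are assumptions, and trajectory length is taken to mean the number of frames in which the trajectory was observed.

```python
# Values are the ones reported in the text; the key names are hypothetical.
INFERENCE_CONFIG = {
    "det_threshold": 0.52,   # object detection confidence threshold
    "d_roi": 1024,           # appearance (Re-ID) feature dimension
    "d_st": 128,             # spatio-temporal feature dimension
    "theta_1": 0.1,          # threshold for joining an existing trajectory
    "theta_2": 0.2,          # threshold for reviving a memory-bank trajectory
    "window_size": 60,       # W, in frames
    "n_mem": 10,             # appearances averaged per memory-bank entry
    "min_track_length": 10,  # shorter trajectories are discarded
}


def drop_short_trajectories(trajectories, min_length):
    """Keep only trajectories observed in at least min_length frames."""
    return {tid: frames for tid, frames in trajectories.items()
            if len(frames) >= min_length}


# Trajectory 3 spans 40 frames; trajectory 57 spans 4 and is filtered out.
tracks = {3: list(range(40)), 57: list(range(4))}
kept = drop_short_trajectories(tracks, INFERENCE_CONFIG["min_track_length"])
print(sorted(kept))
```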

V-B Evaluation Metrics

Unlike single-camera tracking, multi-camera tracking metrics should be able to evaluate the Re-ID of the same target across different cameras. Hence, we adopt the cross-view IDF1 (CVIDF1) metric and the cross-view matching accuracy (CVMA) metric proposed in [56] as our evaluation metrics. CVIDF1, an ID-based metric, originates from the IDF1 metric in single-camera tracking and is defined as:

$$\text{CVIDF1}=\frac{2\times\text{CVIDP}\times\text{CVIDR}}{\text{CVIDP}+\text{CVIDR}} \tag{11}$$

where CVIDP and CVIDR denote the inter-camera tracking precision and recall, respectively. CVMA, which assesses multi-target tracking accuracy, originates from the MOTA metric from single-camera tracking and is defined as:

$$\text{CVMA}=1-\left(\frac{\sum_{t}m_{t}+\text{fp}_{t}+2\,\text{mme}_{t}}{\sum_{t}g_{t}}\right) \tag{12}$$

where $m_t$, $\text{fp}_t$, and $\text{mme}_t$ represent the numbers of missed detections, false positives, and inter-camera mismatches at time $t$, respectively, and $g_t$ indicates the total number of detected targets across all cameras at time $t$.
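The two metrics can be computed from per-frame counts as in the short sketch below. The function names and the toy counts are illustrative assumptions, and the sum in Eq. (12) is read as running over all three error terms, as in MOTA. This is not the official evaluation code of [56].

```python
def cvidf1(cvidp, cvidr):
    """Harmonic mean of cross-view ID precision and recall, as in Eq. (11)."""
    if cvidp + cvidr == 0:
        return 0.0
    return 2 * cvidp * cvidr / (cvidp + cvidr)


def cvma(misses, false_positives, mismatches, targets):
    """Eq. (12) from per-frame counts; mismatches carry a weight of 2."""
    errors = sum(misses) + sum(false_positives) + 2 * sum(mismatches)
    return 1 - errors / sum(targets)


# Two frames with 8 targets each: 2 misses, 0 false positives, 1 mismatch.
score = cvma(misses=[1, 1], false_positives=[0, 0], mismatches=[1, 0], targets=[8, 8])

# When the weighted errors outnumber the targets, the value becomes negative.
negative = cvma(misses=[6], false_positives=[6], mismatches=[0], targets=[8])

f1 = cvidf1(0.75, 0.75)
print(score, negative, f1)
```

Because the numerator can exceed the denominator, CVMA is not bounded below by zero, which is why several entries for the Road scene in Table IV are negative.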

TABLE VI: Ablation studies on various parameters. The number in bold represents the best result.
Parameter $D_{\text{st}}$ CA↑ C1↑ Parameter $W$ CA↑ C1↑
Feature Dimension 0 75.6 81.2 Window Size 30 75.7 80.3
64 75.7 81.5 45 75.7 81.4
128 76.0 81.4 60 76.0 81.4
256 75.9 81.0 75 75.4 81.8
512 74.5 80.5 90 75.0 81.5
Parameter $H$ CA↑ C1↑ Parameter $N_{\text{enc}}:N_{\text{dec}}$ CA↑ C1↑
Number of Heads 1 73.4 80.4 Attention Layers 1:1 76.0 81.4
4 74.1 80.3 2:1 74.8 80.3
8 76.0 81.4 1:2 75.5 80.9
16 74.2 81.3 2:2 75.8 81.1
32 75.2 81.4 3:3 75.6 80.7

V-C Comparison with Other SOTA Methods

To evaluate the performance of our model, we conducted experiments on six MTMC tracking datasets with overlapping fields of view: EPFL [48], CAMPUS [49], MvMHAT [43], WILDTRACK [50], DIVOTrack [45], and VisionTrack. We compared our model against existing methods: OSNet [57], Strong [58], AGW [59], MvMHAT [43], and CrossMOT [45]. Table III shows that our model substantially outperforms existing approaches across all datasets. On the VisionTrack dataset, our model achieved $76.0$ CVMA (CA) and $81.4$ CVIDF1 (C1), outperforming the second-best model by $5.9$ in CA and $13.0$ in C1. Tracking results on each scene of VisionTrack are shown in Table IV, where our model achieves the best or second-best result in the vast majority of scenes. We also compared our model with the other models on single-camera tracking metrics, as shown in Table V. Our model achieves 79.7 MOTA, 82.7 IDF1, and 67.0 HOTA, the highest values among the compared models on these three metrics.

V-D Ablation studies

To validate each module within our model, we conducted ablation studies on the feature extraction module, global association module, and memory bank module using the VisionTrack dataset. To prevent any bias from the object detector’s performance influencing the tracking results, the same object detection results were used during testing.

V-D1 Feature extraction module

To further investigate the impact of integrating spatio-temporal features into the Re-ID features for inter-camera target association, we present our ablation study results in Table VI. Keeping the Re-ID feature dimension fixed at $D_{\text{roi}}=1024$, we varied the dimension of the spatio-temporal feature $D_{\text{st}}$. The best CA and C1 are observed when $D_{\text{st}}$ is set to 128 and 64, respectively. Compared with the model without spatio-temporal features ($D_{\text{st}}=0$), this is an improvement of 0.4 in CVMA and 0.3 in CVIDF1. On the DIVOTrack dataset, we observed improvements of 1.4 in CVMA and 1.3 in CVIDF1 upon integrating spatio-temporal features into our model. The smaller gains on the VisionTrack dataset are largely attributable to the frequent camera movements during our dataset collection, which diminish the impact of spatio-temporal features. These results indicate that integrating an appropriate proportion of spatio-temporal features into the Re-ID features enhances target association. However, if the spatio-temporal feature becomes overly dominant in the fused feature, the tracking performance degrades.

V-D2 Global association module

Table VI shows the impact of varying the number of heads in the global association transformer on tracking performance. The number of heads noticeably affects the model’s performance (CA ranges from 73.4 to 76.0), so the optimal number must be determined experimentally. Table VI also reports the impact of varying the number of attention layers. The results indicate that increasing the number of attention layers does not improve performance. Consequently, our default model consists of one encoder layer and one decoder layer, each with 8 heads.

V-D3 Memory bank module

As illustrated in Figure 6, the integration of the memory bank module enables the model to recover lost trajectories. Without the memory bank module, a pedestrian dressed in white who re-emerges after a prolonged occlusion (lasting 152 frames, which surpasses the size of the temporal window $W$) is mistakenly recognized as a new target, with the ID switching from ID3 to ID57. However, with the memory bank module, when the target reappears, it successfully matches with the historical trajectory with ID3 in the memory bank.

V-D4 Hyperparameters

To further investigate the impact of hyperparameter settings on model performance, we examine the effects of using various temporal window sizes during inference. The experimental results are presented in Table VI. We observe that modestly increasing the size of the temporal window improves tracking performance: CA rises from 75.7 at $W=30$ to 76.0 at $W=60$. However, an excessively large temporal window degrades performance (CA drops to 75.0 at $W=90$), potentially due to the increased complexity of associating targets with more distant temporal information. To balance inference speed and model performance, we select a temporal window size of $W=60$ as our default hyperparameter.
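To illustrate how $W$ bounds the amount of history considered at each step, the sketch below computes the window start $T_s=\max(1, T-W+1)$ from Section IV-C2 and keeps a per-frame feature cache of at most $W$ frames. The class and function names are hypothetical, and the use of a fixed-length `deque` is an assumed implementation choice that matches a step size of $S=1$.

```python
from collections import deque


def window_bounds(current_time, window_size):
    """Return (T_s, T) with T_s = max(1, T - W + 1); frames are 1-indexed."""
    return max(1, current_time - window_size + 1), current_time


class WindowCache:
    """Holds the fused features of the most recent window_size frames."""

    def __init__(self, window_size):
        # With maxlen set, appending a new frame evicts the oldest one.
        self.frames = deque(maxlen=window_size)

    def push(self, frame_features):
        self.frames.append(list(frame_features))

    def all_features(self):
        """Concatenate the per-frame features, oldest frame first."""
        return [f for frame in self.frames for f in frame]


early = window_bounds(current_time=10, window_size=60)
steady = window_bounds(current_time=100, window_size=60)

cache = WindowCache(window_size=3)
for frame in (["a"], ["b", "c"], [], ["d"]):
    cache.push(frame)
print(early, steady, cache.all_features())
```

Since the number of cached targets $N$ grows with $W$, a larger window increases the size of the similarity matrix $G$ and hence the cost of each association step.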

TABLE VII: FLOPs and parameter counts of each module of the GMT model.
Module Backbone CenterNet Re-ID Encoder ST Encoder Transformer
FLOPs 183.8G 43.3G 0.2G 0.8M 0.1G
Params 15.2M 0.7M 9.1M 0.03M 5M

V-D5 Model parameter size and real-time performance

Inference was conducted on one NVIDIA TITAN RTX 24G GPU. The GMT model achieves an average frame rate of 7.5 fps across various scenes. As shown in Table VII, the backbone dominates the computational cost, accounting for 183.8G of the roughly 227.4G total FLOPs (about 81%). Future work could explore more lightweight networks to further improve real-time performance.

VI Conclusion

We propose a transformer-based global MTMC tracking model, GMT. The GMT model globally associates all targets appearing in different cameras within a time window, generating global trajectories directly from detections. To enhance the feature representation of the targets, we integrate spatio-temporal features with the target Re-ID features, enabling the target features to simultaneously capture appearance and motion patterns. To recover lost trajectories, we incorporate a memory bank module that preserves the features of historical trajectories. To address the insufficient scene diversity, lack of low-light environments, and unsatisfactory annotation quality of existing datasets, we have developed a new large-scale MTMC tracking dataset, VisionTrack. The GMT model exhibits superior performance across multiple datasets in diverse scenes and lighting conditions, demonstrating its robustness.

Our model employs a simple network to extract target Re-ID features, and it fails to accurately associate some severely occluded targets. Future research could focus on designing a more powerful Re-ID encoder specifically tailored for MTMC tracking.

References

  • [1] P. Li, J. Zhang, Z. Zhu, Y. Li, L. Jiang, and G. Huang, “State-aware re-identification feature for multi-target multi-camera tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 0–0.
  • [2] F. Yang, H. Lu, and M.-H. Yang, “Robust superpixel tracking,” IEEE Transactions on Image Processing, vol. 23, no. 4, pp. 1639–1651, 2014.
  • [3] J. Fan, X. Shen, and Y. Wu, “What are we tracking: A unified approach of tracking and recognition,” IEEE transactions on image processing, vol. 22, no. 2, pp. 549–560, 2012.
  • [4] D. Li, X. Wei, X. Hong, and Y. Gong, “Infrared-visible cross-modal person re-identification with an x modality,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 04, 2020, pp. 4610–4617.
  • [5] S. Yogamani, C. Hughes, J. Horgan, G. Sistu, P. Varley, D. O’Dea, M. Uricár, S. Milz, M. Simon, K. Amende et al., “Woodscape: A multi-task, multi-camera fisheye dataset for autonomous driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9308–9318.
  • [6] S. Panev and A. Manolova, “Improved multi-camera 3d eye tracking for human-computer interface,” in 2015 IEEE 8th International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), vol. 1.   IEEE, 2015, pp. 276–281.
  • [7] C. d. Leo and B. S. Manjunath, “Multicamera video summarization and anomaly detection from activity motifs,” ACM Transactions on Sensor Networks (TOSN), vol. 10, no. 2, pp. 1–30, 2014.
  • [8] Z. Ma, X. Wei, X. Hong, and Y. Gong, “Bayesian loss for crowd count estimation with point supervision,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 6142–6151.
  • [9] J. Ferryman and A. Shahrokni, “Pets2009: Dataset and challenge,” in 2009 Twelfth IEEE international workshop on performance evaluation of tracking and surveillance.   IEEE, 2009, pp. 1–6.
  • [10] Z. Tang, M. Naphade, M.-Y. Liu, X. Yang, S. Birchfield, S. Wang, R. Kumar, D. Anastasiu, and J.-N. Hwang, “Cityflow: A city-scale benchmark for multi-target multi-camera vehicle tracking and re-identification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8797–8806.
  • [11] Y. Xu, X. Liu, L. Qin, and S.-C. Zhu, “Cross-view people tracking by scene-centered spatio-temporal parsing,” in Proceedings of the AAAI conference on artificial intelligence, vol. 31, no. 1, 2017.
  • [12] M. Hofmann, D. Wolf, and G. Rigoll, “Hypergraphs for joint multi-view reconstruction and multi-object tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3650–3657.
  • [13] E. Ristani and C. Tomasi, “Features for multi-target multi-camera tracking and re-identification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6036–6046.
  • [14] O. Javed, K. Shafique, Z. Rasheed, and M. Shah, “Modeling inter-camera space–time and appearance relationships for tracking across non-overlapping views,” Computer Vision and Image Understanding, vol. 109, no. 2, pp. 146–162, 2008.
  • [15] N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” in 2017 IEEE international conference on image processing (ICIP).   IEEE, 2017, pp. 3645–3649.
  • [16] Y. Zhang, C. Wang, X. Wang, W. Zeng, and W. Liu, “Fairmot: On the fairness of detection and re-identification in multiple object tracking,” International Journal of Computer Vision, vol. 129, pp. 3069–3087, 2021.
  • [17] N. Jiang, S. Bai, Y. Xu, C. Xing, Z. Zhou, and W. Wu, “Online inter-camera trajectory association exploiting person re-identification and camera topology,” in Proceedings of the 26th ACM international conference on Multimedia, 2018, pp. 1457–1465.
  • [18] T. I. Amosa, P. Sebastian, L. I. Izhar, O. Ibrahim, L. S. Ayinla, A. A. Bahashwan, A. Bala, and Y. A. Samaila, “Multi-camera multi-object tracking: A review of current trends and future advances,” Neurocomputing, vol. 552, p. 126558, 2023.
  • [19] W. Chen, L. Cao, X. Chen, and K. Huang, “An equalized global graph model-based approach for multicamera object tracking,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 11, pp. 2367–2381, 2016.
  • [20] W. Liu, O. Camps, and M. Sznaier, “Multi-camera multi-object tracking,” arXiv preprint arXiv:1709.07065, 2017.
  • [21] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in neural information processing systems, vol. 28, 2015.
  • [22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14.   Springer, 2016, pp. 21–37.
  • [23] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7263–7271.
  • [24] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
  • [25] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv preprint arXiv:1904.07850, 2019.
  • [26] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European conference on computer vision.   Springer, 2020, pp. 213–229.
  • [27] C. Fu, K. Lu, G. Zheng, J. Ye, Z. Cao, B. Li, and G. Lu, “Siamese object tracking for unmanned aerial vehicle: A review and comprehensive analysis,” Artificial Intelligence Review, pp. 1–61, 2023.
  • [28] E. Ristani and C. Tomasi, “Tracking multiple people online and in real time,” in Computer Vision–ACCV 2014: 12th Asian Conference on Computer Vision, Singapore, Singapore, November 1-5, 2014, Revised Selected Papers, Part V 12.   Springer, 2015, pp. 444–459.
  • [29] M. Danelljan, G. Bhat, F. Shahbaz Khan, and M. Felsberg, “Eco: Efficient convolution operators for tracking,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 6638–6646.
  • [30] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu, “High performance visual tracking with siamese region proposal network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8971–8980.
  • [31] N. Wang, W. Zhou, J. Wang, and H. Li, “Transformer meets tracker: Exploiting temporal context for robust visual tracking,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 1571–1580.
  • [32] H.-M. Hsu, J. Cai, Y. Wang, J.-N. Hwang, and K.-J. Kim, “Multi-target multi-camera tracking of vehicles using metadata-aided re-id and trajectory-based camera link model,” IEEE Transactions on Image Processing, vol. 30, pp. 5198–5210, 2021.
  • [33] Y. Cai and G. Medioni, “Exploring context information for inter-camera multiple target tracking,” in IEEE Winter Conference on Applications of Computer Vision.   IEEE, 2014, pp. 761–768.
  • [34] H.-M. Hsu, T.-W. Huang, G. Wang, J. Cai, Z. Lei, and J.-N. Hwang, “Multi-camera tracking of vehicles based on deep features re-id and trajectory-based camera link models.” in CVPR workshops, 2019, pp. 416–424.
  • [35] Y.-G. Lee, Z. Tang, and J.-N. Hwang, “Online-learning-based human tracking across non-overlapping cameras,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 10, pp. 2870–2883, 2017.
  • [36] C. Ma, F. Yang, Y. Li, H. Jia, X. Xie, and W. Gao, “Deep trajectory post-processing and position projection for single & multiple camera multiple object tracking,” International Journal of Computer Vision, vol. 129, pp. 3255–3278, 2021.
  • [37] Z. Tang, G. Wang, H. Xiao, A. Zheng, and J.-N. Hwang, “Single-camera and inter-camera vehicle tracking and 3d speed estimation based on fusion of visual and semantic features,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 108–115.
  • [38] Y. T. Tesfaye, E. Zemene, A. Prati, M. Pelillo, and M. Shah, “Multi-target tracking in multiple non-overlapping cameras using constrained dominant sets,” arXiv preprint arXiv:1706.06196, 2017.
  • [39] Y. Zhang, P. Sun, Y. Jiang, D. Yu, F. Weng, Z. Yuan, P. Luo, W. Liu, and X. Wang, “Bytetrack: Multi-object tracking by associating every detection box,” in European Conference on Computer Vision.   Springer, 2022, pp. 1–21.
  • [40] X. Zhou, T. Yin, V. Koltun, and P. Krähenbühl, “Global tracking transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8771–8780.
  • [41] K. G. Quach, P. Nguyen, H. Le, T.-D. Truong, C. N. Duong, M.-T. Tran, and K. Luu, “DyGLIP: A dynamic graph model with link prediction for accurate multi-camera multiple object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13 784–13 793.
  • [42] Y. He, X. Wei, X. Hong, W. Shi, and Y. Gong, “Multi-target multi-camera tracking by tracklet-to-target assignment,” IEEE Transactions on Image Processing, vol. 29, pp. 5191–5205, 2020.
  • [43] Y. Gan, R. Han, L. Yin, W. Feng, and S. Wang, “Self-supervised multi-view multi-human association and tracking,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 282–290.
  • [44] Z. Liu, Y. Shang, T. Li, G. Chen, Y. Wang, Q. Hu, and P. Zhu, “Robust multi-drone multi-target tracking to resolve target occlusion: A benchmark,” IEEE Transactions on Multimedia, 2023.
  • [45] S. Hao, P. Liu, Y. Zhan, K. Jin, Z. Liu, M. Song, J.-N. Hwang, and G. Wang, “DIVOTrack: A novel dataset and baseline method for cross-view multi-object tracking in diverse open scenes,” International Journal of Computer Vision, pp. 1–16, 2023.
  • [46] Y.-J. Li, X. Weng, Y. Xu, and K. M. Kitani, “Visio-temporal attention for multi-camera multi-target association,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9834–9844.
  • [47] C.-C. Cheng, M.-X. Qiu, C.-K. Chiang, and S.-H. Lai, “ReST: A reconfigurable spatial-temporal graph model for multi-camera multi-object tracking,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10 051–10 060.
  • [48] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, “Multicamera people tracking with a probabilistic occupancy map,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 267–282, 2007.
  • [49] Y. Xu, X. Liu, Y. Liu, and S.-C. Zhu, “Multi-view people tracking via hierarchical trajectory composition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4256–4265.
  • [50] T. Chavdarova, P. Baqué, S. Bouquet, A. Maksai, C. Jose, T. Bagautdinov, L. Lettry, P. Fua, L. Van Gool, and F. Fleuret, “WILDTRACK: A multi-camera HD dataset for dense unscripted pedestrian detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5030–5039.
  • [51] Y. Cui, C. Zeng, X. Zhao, Y. Yang, G. Wu, and L. Wang, “SportsMOT: A large multi-object tracking dataset in multiple sports scenes,” arXiv preprint arXiv:2304.05170, 2023.
  • [52] H. Zhao, J. Jia, and V. Koltun, “Exploring self-attention for image recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 076–10 085.
  • [53] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
  • [54] F. Yu, D. Wang, E. Shelhamer, and T. Darrell, “Deep layer aggregation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2403–2412.
  • [55] S. Shao, Z. Zhao, B. Li, T. Xiao, G. Yu, X. Zhang, and J. Sun, “CrowdHuman: A benchmark for detecting human in a crowd,” arXiv preprint arXiv:1805.00123, 2018.
  • [56] R. Han, W. Feng, J. Zhao, Z. Niu, Y. Zhang, L. Wan, and S. Wang, “Complementary-view multiple human tracking,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 10 917–10 924.
  • [57] K. Zhou, Y. Yang, A. Cavallaro, and T. Xiang, “Omni-scale feature learning for person re-identification,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3702–3712.
  • [58] H. Luo, W. Jiang, Y. Gu, F. Liu, X. Liao, S. Lai, and J. Gu, “A strong baseline and batch normalization neck for deep person re-identification,” IEEE Transactions on Multimedia, vol. 22, no. 10, pp. 2597–2609, 2019.
  • [59] M. Ye, J. Shen, G. Lin, T. Xiang, L. Shao, and S. C. Hoi, “Deep learning for person re-identification: A survey and outlook,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 6, pp. 2872–2893, 2021.
Huijie Fan received the B.S. degree in automation from the University of Science and Technology of China, China, in 2007, and the Ph.D. degree in pattern recognition and intelligent systems from the University of Chinese Academy of Sciences, China, in 2014. She is currently a Research Scientist with the Shenyang Institute of Automation, Chinese Academy of Sciences. Her research interests include deep learning for image processing, medical image processing, and their applications.
Tinghui Zhao received the B.S. degree in automation from Zhejiang University, China, in 2022. He is currently a graduate student at the Shenyang Institute of Automation, Chinese Academy of Sciences.
Qiang Wang received the Ph.D. degree in pattern recognition and intelligent systems from the University of Chinese Academy of Sciences, China, in 2020. He is currently a Research Associate with the Key Laboratory of Manufacturing Industrial Integrated Automation, Shenyang University. His research focuses on deep learning, multitask learning, feature selection, and image restoration.
Baojie Fan received the Ph.D. degree in pattern recognition and intelligent systems from the State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences. He is currently a Professor with the Department of Automation, NJUPT. His major research interests include robot vision systems, object detection, and tracking.
Yandong Tang received the B.S. and M.S. degrees in computational mathematics from Shandong University, China, in 1984 and 1987, respectively. From 1987 to 1996, he worked at the Institute of Computing Technology, Shenyang, Chinese Academy of Sciences. From 1996 to 1998, he was engaged in research and development at Stuttgart University and Potsdam University in Germany. He received the Ph.D. degree in engineering mathematics from the Research Center (ZETEM) of Bremen University, Germany, in 2002. From 2002 to 2004, he worked at the Institute of Industrial Technology and Work Science (BIBA) at Bremen University, Germany. He is currently a Research Scientist with the Shenyang Institute of Automation, Chinese Academy of Sciences. His research interests include image processing, pattern recognition, and robot vision.
LianQing Liu received his B.S. degree in Industry Automation from Zhengzhou University, Zhengzhou, China, in 2002, and his Ph.D. in Pattern Recognition and Intelligent Systems from Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, China, in 2009. He is currently a professor at the Shenyang Institute of Automation, Chinese Academy of Sciences. His current research interests include nanorobotics, intelligent control, and biosensors.