MTMMC: A Large-Scale Real-World Multi-Modal Camera Tracking Benchmark

Sanghyun Woo1∗  Kwanyong Park2∗  Inkyu Shin3∗  Myungchul Kim3∗  In So Kweon3

1New York University    2ETRI    3KAIST
Abstract

Multi-target multi-camera tracking is a crucial task that involves identifying and tracking individuals over time using video streams from multiple cameras. This task has practical applications in various fields, such as visual surveillance, crowd behavior analysis, and anomaly detection. However, due to the difficulty and cost of collecting and labeling data, existing datasets for this task are either synthetically generated or artificially constructed within a controlled camera network setting, which limits their ability to model real-world dynamics and generalize to diverse camera configurations. To address this issue, we present MTMMC, a real-world, large-scale dataset that includes long video sequences captured by 16 multi-modal cameras in two different environments - campus and factory - across various time, weather, and season conditions. This dataset provides a challenging test-bed for studying multi-camera tracking under diverse real-world complexities and includes an additional input modality of spatially aligned and temporally synchronized RGB and thermal cameras, which enhances the accuracy of multi-camera tracking. MTMMC is a super-set of existing datasets, benefiting independent fields such as person detection, re-identification, and multiple object tracking. We provide baselines and new learning setups on this dataset and set the reference scores for future studies. The datasets, models, and test server will be made publicly available.

Figure 1: The 3D layout overview. (a) campus and (b) factory. We installed 16 multi-modal cameras in both indoor and outdoor settings, across multiple floors, with overlapping coverage. The cameras were fixed in position and angle to densely cover the building, creating a realistic surveillance camera system.
* Equal contribution
https://sites.google.com/view/mtmmc

1 Introduction

Multiple object tracking (MOT) is an essential vision task that helps us understand visual content and predict the evolution of the surroundings over time. Recent advancements in MOT, thanks to benchmarks such as MOT17 [48], BDD100K [79], Waymo [61], and TAO [13], have led to the development of more effective and efficient trackers [21, 94, 47, 49, 72]. Despite these advancements, multiple-camera tracking has seen limited exploration, largely due to the lack of appropriate datasets. The high costs associated with the collection and annotation of such data are a major bottleneck.

The datasets currently available predominantly consist of either synthetically generated data from game simulators [35] or small-scale real-world data obtained from controlled camera networks [24, 14, 17, 23, 4, 8], which assume an idealized overlap between the camera views to simplify the annotation process. However, synthetic data often fail to translate effectively to real-world scenarios due to significant domain shifts, and datasets from controlled environments do not reflect the complexities of real-world multi-camera networks. Additionally, the withdrawal of the DukeMTMC [56], previously the most extensive real-world dataset, due to privacy issues has left a considerable void in this research area.

To tackle this, we present a new benchmark called the Multi-Target Multi-Modal Camera (MTMMC) tracking dataset. The dataset was collected from two challenging environments (a campus and a factory) equipped with 16 multi-modal cameras, each placed at a different angle (see Fig. 1). The dataset consists of 25 video recordings (13 from the campus and 12 from the factory), each containing five and a half minutes of HD video captured at various times and under various weather and seasonal conditions, ensuring a rich diversity of backgrounds. To ensure compliance with data privacy standards, we collected informed consent from all participants, who explicitly agreed to the public release of the collected data for research purposes. The annotation of all trajectories was accomplished using a semi-automatic labeling system, carefully refined by crowdworkers over a year, making the dataset the largest publicly accessible MTMC tracking benchmark to date.

Significantly, our dataset contains both RGB and thermal cameras, allowing the tracker to additionally utilize thermal information for more accurate multi-camera tracking. This is the first time a dataset has provided a valid test-bed for studying the impact of multi-modal learning for multi-camera tracking. Our experiments reveal that incorporating thermal data into standard RGB camera-based trackers results in more robust tracking, motivating future research in this new direction. The construction of the MTMC dataset also facilitates progress in related subtasks, such as person detection, re-identification, and MOT.

| Dataset | # Cameras | # ID | # Frames | OV/NOV | Camera Coverage | Extra Modality | FPS | Resolution |
|---|---|---|---|---|---|---|---|---|
| PETS2009 [23] | 8 | 30 | 1,200 | OV | outdoor | | 30 | 768 × 576 |
| USC Campus [37] | 3 | 146 | 135,000 | NOV | outdoor | | 30 | 852 × 480 |
| Passageway [4] | 4 | 4 | 120,000 | OV | outdoor | | 25 | 320 × 240 |
| NLPR MCT [8] | ≤ 5 | ≤ 235 | 355,500 | NOV | in & outdoor | | 20 | 320 × 240 |
| CamNet [83] | 8 | 50 | 360,000 | NOV | in & outdoor | | 25 | 640 × 480 |
| WILDTRACK [9] | 7 | N/A | 66,626 | both | outdoor | | 60 | 1920 × 1080 |
| DukeMTMC [56] | 8 | 2,834 | 2,448,000 | NOV | outdoor | | 60 | 1920 × 1080 |
| MTA [35] | 6 | 2,840 | 2,007,360 | both | simulated | | 41 | 1920 × 1080 |
| MMPTRACK [26] | ≤ 6 | ≤ 140 | 2,979,900 | OV | indoor | | 15 | 640 × 320 |
| MTMMC (Ours) | 16 | 3,669 | 3,052,800 | both | in & outdoor | Thermal | 23 | 1920 × 1080 |

Table 1: Overview of the publicly available MTMC datasets. For each dataset, we report the number of cameras, person identities, and frames. We also report the presence of overlapping (OV) / non-overlapping (NOV) camera views, camera coverage, availability of extra input modality, annotated frame rate (FPS), and frame resolution. Our new MTMMC dataset is unprecedented in its scale and diversity. It includes 16 cameras, 3,669 person IDs, and 3 million frames, making it a challenging and large-scale dataset. The dataset also provides high-resolution and multi-modal information.

2 Related Work

Benchmarks To construct a high-quality MTMC dataset, it is crucial to have temporally synchronized videos from multiple cameras. These videos must also maintain consistent person identities across all camera views. However, this requirement results in high annotation costs. As a result, existing MTMC benchmarks are either short in duration [24, 14, 17, 23], have low video resolution [4, 37, 83, 8], or provide inconsistent person IDs [9], making them unsuitable for training generic deep trackers for real-world use cases. The most popular dataset closest to our proposal is DukeMTMC [56], but it was withdrawn due to consent and privacy issues. Recently, two large-scale MTMC datasets, MTA [35] and MMPTRACK [26], have been introduced, but they have limitations such as being obtained through game simulations or collected in controlled setups where all cameras have overlapping fields of view. Our new MTMMC dataset aims to provide a larger basis for training and testing MTMC performance than any previous dataset, making it a valuable resource for researchers.

Multi-modal Learning Unlike existing tracking datasets, our dataset features an additional thermal input modality, which opens up new research directions for multi-modal learning in multi-camera tracking. Multi-modal learning is an interesting research problem that not only improves model robustness through modality fusion, which applies to various vision tasks such as detection [39, 81, 82, 82, 82], visual object tracking [76, 33, 34, 43, 36], and segmentation [63, 93], but also enables better representations of each modality by learning the intrinsic correlations between them [75, 46, 87, 1, 58]. In this paper, we present two new experimental setups with baselines.

Multiple Object Tracking The standard way to tackle the MTMC problem involves a two-step approach: 1) generating local tracklets for all the targets within each camera; 2) associating these local tracklets across cameras when they belong to the same target. The first step, known as multiple object tracking, has been extensively studied by the community. The tracking-by-detection paradigm has emerged as the dominant approach, owing to significant improvements in object detection techniques [54, 42, 53, 41]. Recent advances in this paradigm include developing more discriminative association objectives [30, 38, 51, 66, 70, 49, 71], unifying detection and tracking [21, 62, 73, 94], or building an end-to-end framework [60, 47, 80].

Multi Camera Association Cross-camera association presents a greater challenge due to pronounced changes in object appearance between cameras, variable background conditions, and an increased number of targets to be matched. To facilitate this process, various constraints have been employed, including time conflicts [86], linear motion patterns [55], camera network topology [59, 32], geometric cues [6, 78, 10], and spatial locality [29].

Cross-camera association can be conducted in both real-time [8, 52] and offline manners [12, 86, 55, 28, 29], with the latter often favored for its enhanced accuracy. Notably, several offline global association techniques have been developed, such as hierarchical clustering [86, 91], correlation clustering [55, 5], matrix factorization [28], and adaptive locality-based association [29].

Figure 2: MTMC dataset statistics comparison. We compare the MTMMC dataset with the current largest simulated (MTA [35]) and real-world (MMPTrack [26]) datasets. Each panel summarizes a specific statistic of each dataset: (a) the instance tracks by the number of visited cameras, (b) the joint distribution of instance normalized width and height, (c) the instance track centers plotted over normalized image coordinates.

3 MTMMC

Our camera network is detailed in Fig. 1, comprising a school campus and a factory environment. This choice reflects common real-world surveillance scenarios. With 16 multi-modal cameras, our configuration spans both indoor and outdoor areas and extends over multiple floors. Each camera provides RGB and thermal data that are spatially aligned and temporally synchronized. A detailed description of all cameras is provided in the appendix.

3.1 Data Characteristics

Table 1 presents a comparative summary of our dataset and the existing datasets. The MTMMC dataset consists of 25 scenarios, each composed of 16 high-resolution RGB + Thermal videos captured at 23 fps, both indoors and outdoors, resulting in a total of 3,052,800 frames. The dataset offers diverse real-world environmental conditions, ranging from day to evening (time), sunny to cloudy (weather), and summer to fall (season). This diversity makes our dataset unique and more representative.
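As a quick sanity check, the frame total can be related to the other figures quoted above. The short sketch below assumes that every video has the same length and that each aligned RGB-thermal pair is counted as one frame; neither assumption is stated explicitly in the text.

```python
# Figures quoted in the text above.
num_scenarios = 25
num_cameras = 16
fps = 23
total_frames = 3_052_800

# Assumption: all videos have equal length and an RGB-thermal pair counts once.
num_videos = num_scenarios * num_cameras        # videos in the whole dataset
frames_per_video = total_frames // num_videos   # frames in one video
seconds_per_video = frames_per_video / fps
minutes_per_video = seconds_per_video / 60

print(num_videos, frames_per_video, round(minutes_per_video, 2))
```

Under these assumptions the total corresponds to 400 videos of 7,632 frames each, or roughly 5.5 minutes per video at 23 fps, which agrees with the stated recording length.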

In Fig. 2, we present a detailed comparison of our MTMMC dataset with two of the largest datasets in MTMC: MTA [35] and MMPTrack [26]. We focus on three critical aspects that influence the tracking performance:

  1. Track Length. We analyze the variability in the number of cameras each individual is tracked through, providing insights into the robustness of tracking across cameras. MTMMC features a broader distribution of track lengths, including more extended tracking periods compared to MTA and MMPTrack, as shown in Figure 2(a), demonstrating our dataset’s capability to challenge and train models on maintaining identities over longer sequences.

  2. Track Scale. We assess the range of target scales by normalizing bounding boxes against image dimensions. MTMMC includes a diverse range of scales, with Figure 2(b) highlighting instances of small-scale tracks that are not well-represented in other datasets, critical for training models to detect and track small or distant targets.

  3. Track Path. Our analysis of track trajectories in normalized image coordinates reveals MTMMC’s coverage of diverse movement patterns. The dataset exhibits a more comprehensive range of trajectories compared to others, as depicted in Figure 2(c), which is pivotal for algorithms predicting and maintaining track continuity amidst complex environmental dynamics.

MTMMC advances in all these three aspects over the previous datasets, providing a challenging testbed that more precisely reflects real-world conditions.
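The three statistics above can be computed from box annotations in a few lines. The following is a minimal sketch, not the authors' analysis code: the record layout `(person_id, camera_id, x, y, w, h)` and the function name `track_statistics` are hypothetical, and only the 1920 × 1080 resolution comes from Table 1.

```python
from collections import defaultdict

IMG_W, IMG_H = 1920, 1080  # frame resolution reported in Table 1


def track_statistics(boxes):
    """Summarize annotations for the three aspects discussed above.

    boxes: iterable of (person_id, camera_id, x, y, w, h) in pixels, where
    (x, y) is the top-left corner. This record layout is hypothetical.
    """
    cameras_by_person = defaultdict(set)
    norm_sizes, norm_centers = [], []
    for person_id, camera_id, x, y, w, h in boxes:
        cameras_by_person[person_id].add(camera_id)
        # Track scale: box size normalized by the image dimensions.
        norm_sizes.append((w / IMG_W, h / IMG_H))
        # Track path: box center in normalized image coordinates.
        norm_centers.append(((x + w / 2) / IMG_W, (y + h / 2) / IMG_H))
    # Track length: number of distinct cameras each identity visits.
    visited = {pid: len(cams) for pid, cams in cameras_by_person.items()}
    return visited, norm_sizes, norm_centers


toy = [
    (1, "cam01", 0, 0, 192, 540),
    (1, "cam02", 960, 270, 96, 270),
    (2, "cam01", 1728, 540, 192, 540),
]
visited, sizes, centers = track_statistics(toy)
print(visited)
```

In the toy input, identity 1 is seen by two cameras and identity 2 by one; histograms of `visited`, `sizes`, and `centers` would correspond to panels (a), (b), and (c) of Figure 2.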

We further provide a statistical analysis of key attributes of our dataset in the appendix. First, the number of objects per frame and the number of tracks per video reveal the complexity of scene contexts and the robustness required for successful multi-object tracking. Additionally, the wide demographic range, represented by the age and gender distributions of the actors, supports the development of more inclusive and unbiased tracking algorithms. Lastly, for the first time, a thermal modality is included, enabling multi-modal learning, a feature unprecedented in previous multi-object tracking datasets.

3.2 Data Collection

The data collection was conducted over two days, capturing different seasons for each environment: summer for the factory and fall for the school campus. To ensure a high degree of accuracy in temporal alignment, we employed a precise global time-stamping method for space-time synchronization. For potential frame drops, we meticulously inspected the video sequences and made adjustments by aligning timestamps and interpolating missing data. To prevent any privacy issues, we recruited 623 actors of varying ages and genders and obtained data release agreements. We ensured that all participating actors were compensated for their time and efforts. Furthermore, we conducted de-identification for the 107 non-actors involved in the recordings.

Each video lasts five and a half minutes per scenario, with 12 scenarios from the factory and 13 scenarios from the school campus. We allowed the actors to improvise their actions, provided they fit the given circumstances. For instance, actors could move luggage in the factory or play ball at school, resulting in a wide variety of behaviors being captured. Moreover, we instructed the actors to change their clothes for each scenario to ensure diverse appearances.

Notably, our new MTMMC dataset significantly improves upon the DukeMTMC dataset [56], which was collected within a narrow 1.5-hour window on a single day on campus. By extending the breadth and diversity of our collection process, we aim to provide a more solid foundation for the development of robust tracking systems.

3.3 Data Annotation

We designed an annotation pipeline to separate the single-camera tracking and the multi-camera association tasks. The single-camera tracking involves generating bounding boxes of person tracks within a camera, while the multi-camera association involves assigning consistent person-IDs across multiple cameras. By dividing the tasks, we can assign inexperienced annotators to the former and skilled workers to the latter. The reviewers carefully check the quality of the completed labels from the annotators, and this process is repeated several times until no critical errors are visible. More details are in the appendix.

Single Camera Tracking. To annotate the set of 400 videos (16 cameras × 25 scenarios), we tasked annotators with drawing bounding boxes and assigning track-IDs to each person in the video. We collected the annotations in a semi-automatic manner, as described in previous works [68, 65]. First, the annotators tracked and labeled each person in the keyframes, which were selected every five frames in a video. We then used the DeepSORT [70] algorithm to efficiently generate pseudo tracking boxes by interpolating the annotations between keyframes. The predicted tracking boxes were then carefully corrected by the annotators. Additionally, to protect the privacy of non-actors, we applied a de-identification process, which involved blurring their faces while keeping the ground truth annotations intact. This process ensured confidentiality of personal information while preserving the data integrity.
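To illustrate the idea of filling in boxes between keyframes, the sketch below uses plain linear interpolation. This is a deliberately simplified stand-in: the pipeline described above relies on DeepSORT to propose the in-between boxes, and the function name `interpolate_boxes` and the `(x, y, w, h)` box layout are assumptions made for the example.

```python
def interpolate_boxes(keyframes):
    """Fill in boxes between annotated keyframes by linear interpolation.

    keyframes: dict {frame_index: (x, y, w, h)} for a single track with at
    least one entry. Returns a dict with one box for every frame between the
    first and the last keyframe (inclusive).
    """
    frames = sorted(keyframes)
    dense = {}
    for start, end in zip(frames, frames[1:]):
        box_a, box_b = keyframes[start], keyframes[end]
        for f in range(start, end):
            t = (f - start) / (end - start)  # 0 at `start`, approaches 1 at `end`
            dense[f] = tuple(a + t * (b - a) for a, b in zip(box_a, box_b))
    # The last keyframe is not covered by the loop above, so add it explicitly.
    dense[frames[-1]] = tuple(float(v) for v in keyframes[frames[-1]])
    return dense


# Keyframes five frames apart, matching the keyframe interval stated above.
track = interpolate_boxes({0: (100, 200, 40, 80), 5: (110, 210, 40, 80)})
print(len(track), track[2])
```

The interpolated boxes play the role of the pseudo tracking boxes that annotators subsequently correct by hand.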

Multiple Camera Association. In the next step, we asked annotators to assign consistent track-IDs for the same person across the cameras for each scenario. We observed that the semi-automatic labeling approach was not sufficient to achieve satisfactory label quality for this task. Hence, we relied on careful manual labeling. After the initial labeling was completed by the annotators, the reviewers collected person-ID errors using two critical camera constraints. Firstly, one person cannot appear in multiple tracks of the same camera simultaneously. Secondly, one person cannot be visible in the view of two non-overlapping cameras simultaneously. The reviewers also checked for other remaining errors. All the collected errors were then passed to the annotators, who corrected them. The refining process was iterated twice to guarantee high-quality labels.
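The two camera constraints lend themselves to an automatic check. The following sketch shows one way such a check could look; it is not the reviewers' actual tooling. The record layout `(frame, camera_id, track_id, person_id)`, the `overlapping_pairs` set, and the function name `find_id_conflicts` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def find_id_conflicts(detections, overlapping_pairs):
    """Flag person-ID assignments that violate the two camera constraints.

    detections: iterable of (frame, camera_id, track_id, person_id).
    overlapping_pairs: set of frozenset({cam_a, cam_b}) for cameras that
    share part of their field of view. Both layouts are hypothetical.
    """
    tracks_at = defaultdict(set)   # (frame, camera, person) -> track ids
    cameras_at = defaultdict(set)  # (frame, person) -> cameras
    for frame, cam, track_id, person_id in detections:
        tracks_at[(frame, cam, person_id)].add(track_id)
        cameras_at[(frame, person_id)].add(cam)

    # Constraint 1: a person cannot occupy several tracks of one camera at once.
    same_camera = sorted(key for key, tracks in tracks_at.items() if len(tracks) > 1)
    # Constraint 2: a person cannot be seen by two non-overlapping cameras at once.
    cross_camera = sorted(
        (frame, person_id, cam_a, cam_b)
        for (frame, person_id), cams in cameras_at.items()
        for cam_a, cam_b in combinations(sorted(cams), 2)
        if frozenset((cam_a, cam_b)) not in overlapping_pairs
    )
    return same_camera, cross_camera


detections = [
    (10, "c1", 1, 7), (10, "c1", 2, 7),  # person 7 in two tracks of camera c1
    (10, "c2", 5, 7),                    # c1 and c2 overlap, so this is allowed
    (10, "c3", 9, 7),                    # c3 overlaps with no other camera
]
overlaps = {frozenset(("c1", "c2"))}
same_camera, cross_camera = find_id_conflicts(detections, overlaps)
print(same_camera, cross_camera)
```

In this toy input, person 7 triggers one same-camera conflict in `c1` and two cross-camera conflicts involving `c3`, which is the kind of error list that would be handed back to the annotators.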

3.4 Data Splits

We split the MTMMC dataset into three subsets. The train set includes 14 scenarios (7 from the factory and 7 from the campus), the validation set includes 5 scenarios (3 from the factory and 2 from the campus), and the test set includes 6 scenarios (2 from the factory and 4 from the campus).

4 Experiments

We present various experimental setups and benchmark their performance using the new MTMMC dataset. For the evaluation, we use standard metrics such as mAP for detection, Rank1 and mAP for re-identification, and CLEAR MOT and IDF1 for tracking. The experiments are conducted on the train and validation sets of the dataset. Detailed setup specifications are in the appendix.
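For reference, the two headline tracking scores follow the standard definitions MOTA = 1 − (FN + FP + IDSW) / GT and IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN). The sketch below only evaluates these formulas from already-accumulated counts; the matching that produces the counts (per-frame assignment for MOTA, a global identity assignment for IDF1) is not shown, and the numbers are made up for illustration.

```python
def mota(false_negatives, false_positives, id_switches, num_gt_boxes):
    """CLEAR MOT accuracy: one minus the ratio of all errors to ground-truth boxes."""
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt_boxes


def idf1(idtp, idfp, idfn):
    """Identity F1: harmonic mean of identity precision and identity recall."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


# Illustrative counts only; they are not taken from any table in this paper.
example_mota = mota(false_negatives=100, false_positives=50, id_switches=10, num_gt_boxes=1000)
example_idf1 = idf1(idtp=800, idfp=100, idfn=200)
print(round(example_mota, 3), round(example_idf1, 3))
```

Because the error sum in MOTA is dominated by FN and FP, it mostly reflects detection quality, while IDF1 rewards keeping the same identity over the whole trajectory, which is why both are reported.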

4.1 Sub Tasks: Detection, Re-ID, and MOT

Person Detection

We evaluate the efficacy of our dataset in training person detectors for tracking applications. We use two well-established detectors, Faster RCNN [54] and YOLOX [25], and investigate how well these models generalize and perform when trained on task-specific versus generic datasets. Specifically, we trained models on the MTMMC-Person and COCO-Person datasets and then tested their generalization performance using the MOT17 [48] dataset, which presents a variety of real-world tracking scenarios. The COCO-Person is a subset of the larger COCO [40] dataset and includes 65K natural images that depict humans. To compare fairly, we matched the size of our MTMMC-Person dataset, compiling 60K images sampled at a frame rate of 23 fps from the original video footage.

As shown in Table 2, models trained on the MTMMC-Person dataset consistently outperformed those trained on COCO-Person during the MOT17 evaluations. This suggests that the specificity of the training data to the end-use scenario is crucial. By design, the MTMMC dataset is tailored to tracking, highlighting diverse human activities, frequent occlusions, varied interactions and non-central camera angles, which are typical in real-world tracking situations. These results validate the importance of contextual alignment between training data and its target application, emphasizing the value of our specialized dataset, MTMMC, for tracking and surveillance applications.

| Method | Train on | Eval on | mAP |
|---|---|---|---|
| Faster RCNN | COCO-Person | MOT17 | 29.8 |
| Faster RCNN | MTMMC-Person | MOT17 | 31.3 |
| YOLOX | COCO-Person | MOT17 | 34.2 |
| YOLOX | MTMMC-Person | MOT17 | 38.3 |

Table 2: Detection Results.

| Method | Train on | Eval on | Rank 1 | mAP |
|---|---|---|---|---|
| AGW | Market-1501 | Market-1501 | 95.3 | 88.2 |
| AGW | MSMT17 | MSMT17 | 78.3 | 55.6 |
| AGW | MTMMC-reID | MTMMC-reID | 76.0 | 45.6 |
| AGW | MSMT17 | Market-1501 | 64.3 | 34.2 |
| AGW | MTMMC-reID | Market-1501 | 66.5 | 35.4 |

Table 3: Re-Identification Results.

Person Re-Identification

In line with the standard protocols for re-identification (Re-ID) data construction, as outlined in [88, 90], we derived our MTMMC-reID dataset from the larger MTMMC dataset. For our experiments, we used the AGW model [77] as the benchmark.

Re-ID tasks require the identification of individuals across multiple camera views and at different times. Training data characteristics significantly influence the performance of Re-ID systems. The MTMMC-reID dataset, in particular, provides a challenging training environment, as evidenced by its lower Rank-1 accuracy and mAP scores (76.0 and 45.6, respectively) compared to other datasets (see Table 3, rows 2-4). These figures highlight the demanding nature of the tracking scenarios within MTMMC-reID.

However, the dataset’s complexity is beneficial for model generalization. For instance, when a model trained on the MSMT17 [69] dataset is evaluated on Market-1501 [88], performance drops (to 64.3 Rank-1 and 34.2 mAP), indicating a loss of generalizability. Yet, if the same model is trained on MTMMC-reID and tested on Market-1501, it demonstrates better robustness with higher Rank-1 accuracy and mAP scores (66.5 and 35.4, respectively) compared to the MSMT17 training (refer to Table 3, rows 5-6). These results imply that despite the intrinsic challenges of MTMMC-reID, models trained on it are better equipped to handle new, unseen environments, underscoring the value of rigorous training environments for improved real-world applicability.
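For readers less familiar with the Re-ID metrics quoted above, the sketch below computes Rank-1 accuracy and mAP from a query-to-gallery distance matrix. It is a simplified illustration: standard Re-ID protocols typically also discard gallery samples that share both identity and camera with the query, which is omitted here, and the function name and toy values are assumptions.

```python
def rank1_and_map(dist, query_ids, gallery_ids):
    """Rank-1 accuracy and mean average precision for a retrieval result.

    dist[i][j] is the distance between query i and gallery item j
    (smaller means more similar). Same-camera filtering is NOT applied.
    """
    rank1_hits, average_precisions = 0, []
    for i, qid in enumerate(query_ids):
        order = sorted(range(len(gallery_ids)), key=lambda j: dist[i][j])
        matches = [gallery_ids[j] == qid for j in order]
        if not any(matches):
            continue  # a query without any true match cannot be scored
        rank1_hits += int(matches[0])
        hits, precisions = 0, []
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                hits += 1
                precisions.append(hits / rank)  # precision at each correct hit
        average_precisions.append(sum(precisions) / len(precisions))
    scored = len(average_precisions)
    if scored == 0:
        return 0.0, 0.0
    return rank1_hits / scored, sum(average_precisions) / scored


query_ids = [1, 2]
gallery_ids = [1, 2, 1]
dist = [
    [0.1, 0.5, 0.9],  # query 0: nearest gallery item is a correct match
    [0.4, 0.6, 0.2],  # query 1: the only correct match is ranked last
]
rank1, mean_ap = rank1_and_map(dist, query_ids, gallery_ids)
print(rank1, round(mean_ap, 3))
```

Rank-1 only looks at the top-ranked gallery item, whereas mAP accounts for the position of every correct match, which is why the two numbers can diverge as much as they do in Table 3.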

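| Method | Train on MTMMC | Train on MOT17 | Train on Misc | MTMMC IDF1 | MTMMC MOTA | MTMMC FP | MTMMC FN | MTMMC IDs | MOT17 IDF1 | MOT17 MOTA | MOT17 FP | MOT17 FN | MOT17 IDs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| JDE | ✓ | | | 42.4 | 74.6 | 146678 | 859893 | 30767 | 48.0 | 40.9 | 2311 | 29084 | 329 |
| JDE | | ✓ | MIX | 34.0 | 52.3 | 206112 | 1694301 | 27347 | 63.6 | 60.0 | 2927 | 18155 | 486 |
| JDE | ✓ | ✓ | MIX | 43.7 | 72.6 | 125770 | 964863 | 25725 | 70.5 | 65.7 | 2232 | 15759 | 469 |
| QDTrack | ✓ | | | 53.0 | 84.5 | 157529 | 475242 | 14542 | 55.3 | 43.6 | 10548 | 80197 | 449 |
| QDTrack | | ✓ | | 34.3 | 52.3 | 286382 | 1643818 | 21470 | 66.8 | 65.3 | 9324 | 45441 | 1383 |
| QDTrack | ✓ | ✓ | | 54.2 | 84.6 | 439646 | 439646 | 14106 | 70.0 | 68.6 | 6927 | 42903 | 1005 |
| CenterTrack | ✓ | | | 50.8 | 78.6 | 504642 | 353525 | 16972 | 55.0 | 45.3 | 17718 | 69870 | 903 |
| CenterTrack | | ✓ | | 25.2 | 37.0 | 629624 | 1911628 | 40656 | 62.1 | 60.5 | 6678 | 55446 | 1710 |
| CenterTrack | | ✓ | CHpre | 27.1 | 45.7 | 518692 | 1662554 | 40746 | 63.7 | 66.2 | 7128 | 45939 | 1611 |
| CenterTrack | ✓ | ✓ | CHpre | 51.6 | 80.9 | 415132 | 351162 | 16938 | 65.7 | 66.7 | 6138 | 46338 | 1407 |
| ByteTrack | ✓ | | | 64.8 | 89.7 | 112835 | 300354 | 7153 | 69.1 | 55.9 | 16896 | 54106 | 230 |
| ByteTrack | | ✓ | | 40.2 | 56.8 | 506286 | 1283368 | 13585 | 76.8 | 75.0 | 4539 | 8693 | 224 |
| ByteTrack | | ✓ | CH | 56.9 | 77.7 | 267550 | 640084 | 7547 | 79.5 | 76.6 | 10128 | 27250 | 479 |
| ByteTrack | ✓ | ✓ | CH | 64.6 | 89.1 | 147385 | 289854 | 7184 | 78.7 | 76.9 | 8504 | 28302 | 517 |

Table 4: Multi Object Tracking Results. Following the previous works, we use additional person detection data: CH denotes CrowdHuman [57], MIX indicates combined datasets of Caltech Pedestrian [16], Citypersons [84], CUHK-SYS [74], PRW [89] and ETH [19].
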
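| Method | Train on MTMMC | Train on MOTSynth | MOT17 IDF1 | MOT17 MOTA | MOT17 FP | MOT17 FN | MOT17 IDs |
|---|---|---|---|---|---|---|---|
| QDTrack | ✓ | | 55.3 | 43.6 | 10548 | 80197 | 449 |
| QDTrack | | ✓ | 54.1 | 43.1 | 11178 | 80178 | 615 |
| QDTrack | ✓ | ✓ | 60.8 | 48.9 | 14724 | 67029 | 870 |

(a) Without fine-tuning

| Method | Train on MTMMC | Train on MOTSynth | MOT17 IDF1 | MOT17 MOTA | MOT17 FP | MOT17 FN | MOT17 IDs |
|---|---|---|---|---|---|---|---|
| QDTrack | ✓ | | 68.6 | 66.6 | 9963 | 43074 | 957 |
| QDTrack | | ✓ | 70.8 | 68.7 | 9813 | 39882 | 921 |
| QDTrack | ✓ | ✓ | 72.0 | 70.2 | 8367 | 39135 | 750 |

(b) With fine-tuning

Table 5: Pre-Training on MTMMC and MOTSynth.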

Multi Object Tracking

Multi-object tracking (MOT) is a task that requires the detection and tracking of multiple objects, often people, through a sequence of video frames. The challenge lies in keeping consistent object identities despite movement, occlusions, and environmental changes. In our experiment, we employed four state-of-the-art trackers: JDE [67], QDTrack [49], CenterTrack [94], and ByteTrack [85], and our analysis focuses on three main aspects:

  1. Training and Evaluating on the Same Dataset: When models are both trained and evaluated on the same dataset, they exhibit lower performance on MTMMC compared to the MOT17 dataset. For instance, JDE achieved an IDF1 score of 42.4% on MTMMC, whereas the same model yielded a higher IDF1 of 63.6% on MOT17. This trend is consistent across all tested models, indicating that MTMMC presents a more challenging testbed.

  2. Training and Evaluating on Different Datasets: When training and evaluation datasets differed, we observed a pattern where models trained on MTMMC generally outperformed those trained on MOT17 when evaluated on the alternate dataset. For example, ByteTrack, after being trained on MTMMC and tested on MOT17, reached an IDF1 score of 69.1% and MOTA of 55.9%, which is closer to the practical upper bounds observed when trained and tested on MOT17 (IDF1 of 76.8% and MOTA of 75.0%). In contrast, when ByteTrack was trained on MOT17 and evaluated on MTMMC, it achieved a much lower IDF1 of 40.2% and MOTA of 56.8%, versus its upper-bound performance on MTMMC (IDF1 of 64.8% and MOTA of 89.7%). This suggests that the complex and diverse tracking environments found in MTMMC contribute to the development of more robust and generalizable model features.

     Notably, the above two trends within multi-object tracking mirror the tendencies observed in our Re-ID experiments. This consistency reinforces the notion that training on more complex and diverse environments effectively enhances the models’ ability to generalize and maintain accuracy when introduced to new domains.

  3. Training on Combined Datasets: The most compelling results were observed when models were trained on a mixture of both MTMMC and MOT17 datasets. This combined training approach produced the best results on both MTMMC and MOT17 evaluations. It implies that MTMMC provides a complementary training signal to MOT17. When combined, the diversity and complexity of MTMMC complement MOT17, leading to a robust tracking model.

In conclusion, these experiments underline the importance of dataset diversity and complexity in training multi-object tracking models. The demanding context provided by MTMMC helps to forge models that can handle real-world complexities effectively.
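One way to make the cross-dataset comparison concrete is to express each transfer score as a fraction of the in-domain score on the evaluation dataset. The "retention" ratio below is an illustrative quantity introduced here for that purpose, not a metric used in the paper; the four IDF1 values are the ByteTrack figures quoted in the text.

```python
# ByteTrack IDF1 scores quoted in the text (in %), keyed by (train, eval).
idf1 = {
    ("MTMMC", "MTMMC"): 64.8,
    ("MOT17", "MOT17"): 76.8,
    ("MTMMC", "MOT17"): 69.1,
    ("MOT17", "MTMMC"): 40.2,
}


def retention(train, test):
    """Cross-dataset IDF1 divided by the in-domain IDF1 on the test dataset."""
    return idf1[(train, test)] / idf1[(test, test)]


mtmmc_to_mot17 = retention("MTMMC", "MOT17")  # 69.1 / 76.8
mot17_to_mtmmc = retention("MOT17", "MTMMC")  # 40.2 / 64.8
print(f"{mtmmc_to_mot17:.2f} {mot17_to_mtmmc:.2f}")
```

By this measure a model trained on MTMMC retains about 0.90 of the in-domain IDF1 on MOT17, whereas a model trained on MOT17 retains about 0.62 on MTMMC, which is the asymmetry described in the second point above.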

Figure 3: Multi-modal Learning Setups and Baselines. (a) presents the concept of modality fusion with both input-level and feature-level fusion techniques integrating thermal data with RGB for enhanced object tracking. (b) outlines the modality drop scenario, where the model trained on combined RGB and thermal data is tested solely on RGB data, using methods like multi-modal reconstruction, knowledge distillation, and multi-modal contrastive learning.

4.2 Pre-Training: Real-world vs. Synthetic Data

In this study, we evaluate the efficacy of real-world data in improving MOT models by employing our MTMMC dataset as a foundational training set. We utilized the QDTrack [49] as our base tracker and conducted experiments to measure its performance on the MOT17 benchmark. These experiments involved pre-training the model on the MTMMC dataset and subsequently fine-tuning it on MOT17. Additionally, we drew comparisons with models pre-trained on the MOTSynth dataset [20], which is a large-scale synthetic dataset derived from extensive simulation within a gaming environment.

As detailed in Table 5, our findings illustrate that the MTMMC dataset, albeit comprising half the number of annotations compared to MOTSynth (0.5M vs. 1M), and without the aid of complex data simulation techniques, still substantially contributes to the tracking accuracy. Notably, models pre-trained on MTMMC yield an IDF1 score of 55.3 without fine-tuning (54.1 when pre-trained on MOTSynth) and see an increase to 68.6 with fine-tuning (70.8 when pre-trained on MOTSynth). While MOTSynth pre-training gives the higher score after fine-tuning, our real-world data, when combined with MOTSynth, demonstrates a clear synergy, resulting in the best IDF1 score of 72.0 after fine-tuning.

These observations underscore the continued relevance of real-world datasets. While the scalability and control offered by synthetic data are appealing, the inherent complexities and variability present in real-world data are crucial for models to learn effectively. The MTMMC dataset, therefore, remains an invaluable resource for achieving high-fidelity tracking performance, and its integration with synthetic data further enhances this advancement.

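| Method | Fusion | IDF1 | MOTA | mAP |
|---|---|---|---|---|
| RGB | | 53.0 | 84.5 | 92.8 |
| T | | 44.5 | 79.2 | 89.9 |
| RGBT-I | Input | 54.0 | 85.6 | 93.1 |
| RGBT-F | Feature | 53.9 | 86.0 | 93.5 |

(a)

| Method | IDF1 (w/o fine-tune) | MOTA (w/o fine-tune) | IDF1 (w/ fine-tune) | MOTA (w/ fine-tune) |
|---|---|---|---|---|
| RGB-Unimodal (baseline) | 55.3 | 43.6 | 68.6 | 66.6 |
| Knowledge Distill. | 55.1 | 43.2 | 70.5 | 68.0 |
| Multi-modal Recon. | 57.9 | 46.2 | 68.3 | 67.6 |
| Multi-modal Contrastive | 59.7 | 48.4 | 68.3 | 67.3 |

(b)

Table 6: Multi-modal learning results.
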
| Method | IDF1 | MOTA | FP | FN | IDs |
|---|---|---|---|---|---|
| TrackTA | 32.8 | 76.9 | 10604 | 18715 | 13364 |
| H. Cluster | 41.6 | 80.9 | 8012 | 14663 | 11072 |

(a)

| Fusion | IDF1 | MOTA | FP | FN | IDs |
|---|---|---|---|---|---|
| RGBT-I | 42.2 | 81.1 | 7823 | 14264 | 10803 |
| RGBT-F | 43.5 | 81.7 | 7301 | 13592 | 9916 |

(b)

Table 7: Multi-Target Multi-Camera Tracking Results in MTMMC. For efficient evaluation, we temporally sub-sampled the videos to 1 FPS. H. Cluster denotes hierarchical clustering. The averaged results of all the testing scenarios are shown.

4.3 Multi-modal Learning: Setups and Baselines

Multi-modal learning aims to improve scene understanding by leveraging complementary information from different sensor modalities. In this context, we explore how thermal data, when paired with RGB data, can enhance object tracking. This question stems from existing literature that demonstrates the benefits of such combinations in other domains [63, 93, 3]. Our research extends these concepts into tracking scenarios using QDTrack [49] as the base tracker. We present two new learning setups, modality fusion and drop, illustrated in Fig. 3-(a) and (b), respectively. We provide more detailed setup specifications and additional analyses in the appendix. Here, we briefly introduce the high-level concepts of the setups and then discuss the key results.

Modality Fusion

We begin with modality fusion, focusing on the explicit integration of thermal data into RGB-based tracking models. This involves comparing both input-level and feature-level fusion methods against RGB-only and thermal-only baselines. We evaluate the benefits of incorporating thermal data when it is directly available at both training and test time.
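The difference between the two fusion levels can be sketched with plain Python lists. This is only a schematic illustration under stated assumptions: channel stacking for input-level fusion and feature concatenation for feature-level fusion are assumed operators (the exact design is deferred to the appendix), and `channel_means` is a hypothetical toy stand-in for a learned encoder.

```python
def input_level_fusion(rgb, thermal):
    """Stack the thermal channel onto each RGB pixel before any encoder.

    rgb: H x W x 3 nested lists, thermal: H x W nested lists (aligned views).
    Returns an H x W x 4 image that a single encoder would consume.
    """
    return [
        [list(pixel) + [heat] for pixel, heat in zip(rgb_row, thermal_row)]
        for rgb_row, thermal_row in zip(rgb, thermal)
    ]


def channel_means(image):
    """Toy 'encoder' (hypothetical): per-channel mean over all pixels."""
    pixels = [pixel for row in image for pixel in row]
    return [sum(p[c] for p in pixels) / len(pixels) for c in range(len(pixels[0]))]


def feature_level_fusion(rgb, thermal):
    """Encode each modality separately, then concatenate the features.

    Concatenation is an assumed fusion operator chosen for illustration.
    """
    thermal_as_image = [[[heat] for heat in row] for row in thermal]
    return channel_means(rgb) + channel_means(thermal_as_image)


rgb = [[(10, 20, 30), (40, 50, 60)]]  # a 1 x 2 RGB image
thermal = [[0.5, 0.7]]                # the spatially aligned 1 x 2 thermal image
fused_input = input_level_fusion(rgb, thermal)
fused_feature = feature_level_fusion(rgb, thermal)
print(fused_input, fused_feature)
```

Input-level fusion relies on the spatial alignment of the two modalities and shares one encoder, while feature-level fusion keeps modality-specific encoders and merges their outputs later.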

Modality Drop

The modality drop setup presents a more challenging scenario. Here, the model is trained on both RGB and thermal data but is evaluated solely on RGB data. The rationale is that, during training, the model can learn generalized feature representations that are robust even when a modality is absent during testing. We introduce three strategies to harness RGB-T data effectively during training: knowledge distillation, multi-modal reconstruction, and multi-modal contrastive learning.

One practical application is using a multimodally trained tracking model in a unimodal tracking system. For instance, consider CCTV surveillance systems, which predominantly rely on RGB cameras, often due to hardware or budget constraints. Our goal is to train the model using datasets like MTMMC, which contain both RGB and thermal data, and then test its effectiveness in environments that only provide RGB data. Essentially, we aim to determine if the model can learn generic features from the combined RGB and thermal data during training, and preserve its tracking capabilities in the absence of thermal data during testing.
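As an illustration of the multi-modal contrastive strategy, the sketch below computes an InfoNCE-style loss that pulls together the RGB and thermal embeddings of the same instance and pushes apart embeddings of different instances. The loss form, the temperature value, and the function name `info_nce` are assumptions made for this example; the paper's exact objective is described in its appendix and may differ.

```python
import math


def info_nce(rgb_embs, thermal_embs, temperature=0.1):
    """InfoNCE-style loss with RGB anchors and thermal candidates.

    rgb_embs[i] and thermal_embs[i] are assumed to come from the same
    instance (non-zero vectors); every other pairing is treated as a negative.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    rgb = [normalize(v) for v in rgb_embs]
    thermal = [normalize(v) for v in thermal_embs]
    loss = 0.0
    for i, anchor in enumerate(rgb):
        logits = [sum(a * b for a, b in zip(anchor, cand)) / temperature for cand in thermal]
        # Numerically stable log-sum-exp over all candidates.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_denominator - logits[i]  # negative log-softmax of the positive
    return loss / len(rgb)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < mismatched)
```

The loss is small when paired RGB and thermal embeddings agree, so minimizing it encourages features that do not depend on which modality produced them, which is the property the modality drop setup is meant to test.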

Results

The results in Table 6-(a) showcase the performance gains from modality fusion. The integration of thermal data at both the input (RGBT-I) and feature level (RGBT-F) with the base RGB data results in improved performance, compared to using RGB or thermal data in isolation. Notably, the RGBT-F approach achieves the highest MOTA (86.0) and mAP (93.5), with an IDF1 score of 53.9 that is on par with RGBT-I (54.0). This suggests that thermal data, when fused at the feature level, provides a more discriminative tracking representation.

In Table 6-(b), we summarize the performance in the modality drop setup. We simulate the modality drop scenario by training the model on both RGB and thermal data from MTMMC, and then evaluating (or optionally fine-tuning) it on MOT17, which provides only RGB data. Here, the ‘without fine-tuning’ setting shows how well the features learned from the combined multimodal data (RGB+T) transfer directly to the RGB domain, whereas the ‘with fine-tuning’ setting evaluates how effectively these learned features serve as an initialization for further refinement. Without fine-tuning, Knowledge Distillation (KD) lags in performance (IDF1: 55.1, MOTA: 43.2), likely because of the strong dependence on thermal data imposed during distillation, which weakens generalization. In contrast, the Multi-modal Contrastive method is relatively resilient (IDF1: 59.7, MOTA: 48.4), suggesting that contrastive learning yields modality-invariant features that generalize well. With fine-tuning, KD exhibits a marked improvement (IDF1: 70.5, MOTA: 68.0), indicating its potential once adapted to the RGB domain. Conversely, the Multi-modal Contrastive method sees a smaller gain after fine-tuning (IDF1: 68.3, MOTA: 67.3). It is important to note that generalizable features do not necessarily equate to an optimal initialization for RGB-specific fine-tuning. We recognize that further investigation is necessary to fully understand the underlying mechanisms, and we leave this for future work.
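
For reference, the two metrics quoted throughout this section follow the standard CLEAR-MOT and identity-based definitions. The sketch below computes them from aggregate counts. The function names and toy counts are illustrative assumptions and are not taken from the evaluation code used in the paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


# Toy counts, chosen only to exercise the formulas.
example_mota = mota(false_negatives=10, false_positives=5, id_switches=5, num_gt=100)
example_idf1 = idf1(idtp=50, idfp=25, idfn=25)
```

MOTA penalizes detection errors and identity switches together, while IDF1 isolates identity consistency, which is why the two scores can rank methods differently.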

4.4 Multi-modal MTMC

Multi-target multi-camera (MTMC) tracking expands upon MOT by requiring the identification of multiple targets across various camera views. We build a strong baseline model to benchmark the MTMC scores on our new MTMMC dataset. Specifically, we integrate a multi-object tracker and a person Re-ID network, QDTrack [49] and BoT [45], to generate tracklet-level feature representations. On top of this, we study two leading multi-camera association (MCA) methods: optimization-based [28] and clustering-based [35].

In Table 7-(a), the results show that the hierarchical clustering-based MCA method [35] outperforms the optimization-based approach [28], which requires heavy hyper-parameter tuning. Table 7-(b) presents the results after integrating thermal information into the clustering-based method. The feature-fusion approach again yields more accurate multi-camera tracking. As a dataset and evaluation paper, we focus on establishing baseline models and benchmark scores to set the stage for follow-up research. We hope to see numerous advanced multi-modal tracking models built upon our results.

5 Conclusion

We have presented the MTMMC dataset—a large-scale, real-world, multi-modal tracking benchmark designed to advance MTMC tracking. Through our extensive experiments, we have demonstrated its efficacy in improving the performance of various sub-tasks and have highlighted its synergistic use with synthetic data for pre-training. Additionally, we introduced two new multi-modal learning setups—modality fusion and drop—and developed robust baseline models for multi-modal MTMC tracking. We hope that our contributions will reinvigorate research in MTMC and will spark new innovations in multi-modal tracking technologies, ushering in a new era of intelligent tracking systems.

Acknowledgement This work was partially supported by the NRF (NRF-2020M3H8A1115028, FY2021).

Appendix A Appendix

Figure 5: Single-camera tracking annotation pipeline. We adopt the semi-automatic labeling approach. The workers first label the key frames and then the annotations for the other frames are interpolated based on the model predictions.

In this supplementary material, we present detailed information on the following aspects:

  A) Details of Annotation,
  B) Experimental Setup Specifications,
  C) Supplementary Experiments,
  D) Specifications of Camera Hardware,
  E) Overview of the MTMMC Dataset,
  F) Licenses of the Datasets Used,
  G) Video Demonstration, and
  H) Discussion of Ethical Considerations.

Appendix B Annotation Details

B.1 Annotation Pipeline

Our annotation pipeline is illustrated in Fig. 5. We separate the single-camera tracking task from the multi-camera association. The annotation tool is built upon CVAT (https://github.com/maheriya/cvat), an open-source vision annotation tool. We further added several functionalities, such as uploading large-scale videos, efficient task management of crowd workers, and text description translations.

Figure 6: Multi-camera association. The workers are instructed to assign consistent PIDs for the same person across the cameras for each scenario.
Figure 7: When to assign dummy person IDs. (a), (b) horizontal and vertical truncation; (c), (d) severe lighting; (e) small scale.
Figure 8: When to ignore the labeling. (a), (b) reflection on transparent objects; (c) severe truncation; (d) poor lighting.
Figure 9: Amodal annotation example.

Single-camera tracking We employed 189 workers, 12 reviewers, and 2 project managers for our annotation task. The project managers provided annotation guidelines and managed the scheduling. As detailed in our main paper, we adopted a semi-automatic labeling approach. We manually annotated keyframes (with a default temporal stride of 5 frames) and used the deepSORT algorithm [70] for interpolating annotations on intermediate frames. Workers had the flexibility to adjust the temporal stride. To ensure high-quality annotations, we conducted rigorous peer reviews. In instances where bystanders were captured, we applied face blurring in post-processing to address privacy concerns. On average, each worker annotated 525 images per day. All annotations were saved in JSON format.
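
To make the keyframe-plus-interpolation idea concrete, the following minimal sketch fills in boxes between two manually labeled keyframes. It uses plain linear interpolation as a simplified stand-in; the actual pipeline relies on DeepSORT predictions [70], which are not reproduced here. The function name and the `(x, y, w, h)` box layout are assumptions made for illustration.

```python
def interpolate_boxes(key_a, key_b):
    """Linearly interpolate boxes between two keyframes.

    key_a, key_b: (frame_index, (x, y, w, h)) tuples, with key_a before key_b.
    Returns {frame_index: (x, y, w, h)} for the frames strictly in between.
    """
    (fa, box_a), (fb, box_b) = key_a, key_b
    if fb <= fa:
        raise ValueError("key_b must come after key_a")
    filled = {}
    for f in range(fa + 1, fb):
        t = (f - fa) / (fb - fa)  # 0 at key_a, 1 at key_b
        filled[f] = tuple((1 - t) * a + t * b for a, b in zip(box_a, box_b))
    return filled


# Keyframes 5 frames apart, matching the default temporal stride.
filled = interpolate_boxes((0, (100.0, 50.0, 40.0, 80.0)),
                           (5, (110.0, 50.0, 40.0, 80.0)))
```

A model-based interpolator would replace the linear blend with tracker predictions, but the bookkeeping between keyframes is the same.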

Multi-camera Association For this task, we selected a team of 8 highly skilled workers, supported by 8 reviewers and 2 project managers. Initially, we attempted model prediction-based labeling, but it failed to meet our internal quality standards. Consequently, as illustrated in Fig. 6, we shifted to manual labeling. Workers used the labels generated from the single-camera tracking phase as a base, checking for errors and matching labels across different cameras to assign final Person IDs (PIDs). In situations with ambiguous identifications, workers were permitted to use a placeholder label of ‘0’. To guarantee the accuracy of our data, we enforced two essential camera constraints and iteratively corrected any discrepancies until we observed no significant errors.

B.2 Annotation Instructions

To ensure high-quality annotations, we established specific guidelines for workers to handle exceptional cases. These guidelines were designed to maintain consistency and accuracy in challenging annotation scenarios.

  • Dummy Person ID In instances where individuals were challenging to identify due to significant changes in appearance, scale, or lighting, workers assigned a placeholder ID of ’0’. This practice was crucial for maintaining data integrity in cases where person identification was ambiguous or unreliable (refer to Fig. 7).

  • Ignore Area Our guidelines specified that reflections of persons on transparent surfaces, such as glass doors, should be excluded from annotations. Similarly, extreme cases of body part truncation or poor lighting conditions were treated as grey areas and omitted from the dataset to ensure annotation quality (refer to Fig. 8).

  • Amodal Annotation Following the MOT17 annotation standard [48], we instructed workers to label objects amodally. This approach involved marking occlusions distinctly from visible parts using dotted lines, enhancing the precision of our dataset (refer to Fig. 9).

Appendix C Experimental Details

C.1 Datasets

We provide the details of the additional data source used for the experiments in the main paper.

COCO2017 COCO [40] includes object detection and segmentation labels with 118k training images and 5k validation images. We collected only the person detection labels, COCO-Person, and used them for the person detection experiments.

Market-1501 Market-1501 [88] is a popular person re-ID dataset collected from six outdoor cameras. It contains 32,668 bounding boxes of 1,501 identities. To simulate realistic scenarios, the Deformable Part Model (DPM) [22] is employed to produce the bounding boxes of pedestrians. We used it for our person re-identification experiments.

MSMT17 MSMT17 [69] is a challenging person re-identification dataset with 126,441 bounding boxes of 4,101 identities, captured by 15 outdoor cameras, with boxes detected using Faster R-CNN [54]. The dataset includes complex scenes and significant variations in lighting and viewpoint to provide challenging cases. We utilized MSMT17 for our person re-identification experiments.

MOT17 MOT17 [48] is a widely used multi-object tracking benchmark that includes 7 training and 7 testing videos with challenging scenarios, such as frequent occlusions and crowded scenes. We split each training sequence into two halves, using the first half of the frames for training and the second half for validation. We utilized MOT17 for single-camera tracking experiments.
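
The half-split protocol can be sketched as follows. This is a minimal illustration that assumes frames are identified by integer indices; how an odd-length sequence is divided is our assumption (the extra frame goes to validation) and may differ from the convention in the codebase used for the experiments.

```python
def split_sequence_in_half(frame_ids):
    """Return (train_frames, val_frames): first half for training, second for validation."""
    frames = sorted(frame_ids)
    mid = len(frames) // 2  # with an odd length, the extra frame goes to validation
    return frames[:mid], frames[mid:]


# A toy 10-frame sequence with 1-based frame indices.
train_frames, val_frames = split_sequence_in_half(range(1, 11))
```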

MOTSynth MOTSynth [20] is a large-scale synthetic dataset created using the photorealistic video game Grand Theft Auto V. It is designed for person tracking and segmentation in urban scenarios and contains 764 full-HD videos, 1.3M frames, and 33M person instances. We utilized the official train and validation split and applied it to our transfer learning experiments.

C.2 Implementation Details

Training

Our framework is built on two well-established public codebases: mmtracking [11] and fast-reid [27]. We adhere closely to the default training recipes provided by these codebases, including the data augmentation techniques and the training schedules detailed in Table 8. For implementation, we used PyTorch v1.10 and CUDA v11.3, and ran our models on a machine equipped with an AMD EPYC 7352 (2.3GHz) CPU and an NVIDIA RTX A6000 GPU.

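| Sub-task | Model | epoch | fine-tune | optimizer | lr | momentum | weight_decay |
|---|---|---|---|---|---|---|---|
| Detection | Faster R-CNN | 24 | 4 | SGD | 0.02 | 0.9 | 0.0001 |
| Re-ID | AGW | 120 | - | adam | 0.00035 | - | 0.0005 |
| Re-ID | BOT | 120 | - | adam | 0.00035 | - | 0.0005 |
| MOT | JDE | 30 | 30 | SGD | 0.01 | 0.9 | 0.0001 |
| MOT | QDTrack | 4 | 4 | SGD | 0.02 | 0.9 | 0.0001 |
| MOT | CenterTrack | 70 | best | adam | 0.0125 | - | 0.0001 |
| MOT | ByteTrack | 300 | 40 | SGD | 0.01 | 0.9 | 0.0005 |

Table 8: Training recipes for each sub-task.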
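| Method | In. | Attn. | Asy. | IDF1 | MOTA | mAP |
|---|---|---|---|---|---|---|
| RGB | - | - | - | 53.0 | 84.5 | 92.8 |
| (1) | Add. | BAM | | 53.7 | 85.3 | 93.1 |
| (2) | Diff. | Cha. | | 53.5 | 85.1 | 93.0 |
| (2) | Diff. | Spa. | | 53.7 | 85.4 | 93.3 |
| (3) | Diff. | BAM | | 53.8 | 85.6 | 93.2 |
| RGBT-F | Diff. | BAM | | 53.9 | 86.0 | 93.5 |

(a)

| Modality | Network | # contra. loss pair | IDF1 | MOTA | mAP |
|---|---|---|---|---|---|
| RGB | baseline | 1 | 55.7 | 43.8 | 55.6 |
| RGB-T | parallel | 2 | 56.1 | 44.5 | 57.5 |
| RGB-T | parallel | 4 | 56.8 | 44.3 | 57.6 |
| RGB-T | share | 2 | 55.6 | 43.5 | 56.7 |
| RGB-T | share | 4 | 59.7 | 48.4 | 59.7 |
| RGB-T | share | 6 | 57.3 | 44.5 | 57.3 |
| RGB-T | share | 8 | 57.5 | 45.4 | 58.1 |

(b)

Table 9: Ablation studies of multi-modal learning models on MTMMC.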

Multi-modal Learning

In this section, we detail the design of two proposed multi-modal learning setups.

1) Modality Fusion

  • Input-level fusion. We employ channel-wise concatenation [18] to combine RGB and Thermal inputs into a 4-channel RGB-T input. This involves modifying the first convolutional layer in the backbone to accept 4 channels instead of 3, initializing the additional channel’s weight as the average of the RGB channel weights.

  • Feature-level fusion. Building upon prior research [81, 15, 92], we have developed a conditional attention module to utilize multi-modal data more effectively. This module uses a BAM attention block [50], processing the difference between RGB and thermal features to explicitly highlight the complementary nature of these modalities. The fusion process is executed as $F_{fuse} = F_{rgb} + BAM(F_{thermal} - F_{rgb}) \otimes F_{thermal}$, where $F_{fuse}$ represents the fused feature output and $\otimes$ is element-wise multiplication. We integrate this attention module at various levels within the Feature Pyramid Network (FPN [41]), training it end-to-end without additional adjustments.
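
The two fusion variants above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released implementation: `inflate_first_conv`, `toy_attention`, and `fuse_features` are hypothetical names, the tensor shapes are arbitrary, and a simple sigmoid gate stands in for the BAM block [50].

```python
import numpy as np


def inflate_first_conv(w_rgb):
    """Extend a (out, 3, kh, kw) conv weight to (out, 4, kh, kw).

    The added thermal channel is initialized as the mean of the RGB channel weights.
    """
    if w_rgb.ndim != 4 or w_rgb.shape[1] != 3:
        raise ValueError("expected weights of shape (out, 3, kh, kw)")
    w_thermal = w_rgb.mean(axis=1, keepdims=True)
    return np.concatenate([w_rgb, w_thermal], axis=1)


def toy_attention(x):
    """Placeholder gate in (0, 1); a stand-in for BAM, which is not reproduced here."""
    return 1.0 / (1.0 + np.exp(-x))


def fuse_features(f_rgb, f_thermal, attention=toy_attention):
    """F_fuse = F_rgb + attention(F_thermal - F_rgb) * F_thermal (element-wise)."""
    return f_rgb + attention(f_thermal - f_rgb) * f_thermal


rng = np.random.default_rng(0)
w4 = inflate_first_conv(rng.standard_normal((64, 3, 7, 7)))  # input-level fusion
f_rgb = rng.standard_normal((2, 256, 8, 8))
f_thermal = rng.standard_normal((2, 256, 8, 8))
f_fuse = fuse_features(f_rgb, f_thermal)  # feature-level fusion at one FPN level
```

Because the gate multiplies the thermal features before they are added, the RGB features pass through unchanged wherever the thermal response is zero.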

To empirically validate our design choices, we conducted ablation studies as shown in Table 9-(a). We investigated three important design choices: (1) input for the attention module, (2) attention design, and (3) asymmetric encoders. Firstly, we found that forwarding the feature difference (Diff.) between the thermal and RGB performed better than their addition (Add.). Secondly, while using either channel (Cha.) or spatial (Spa.) attention resulted in performance improvements over the simple baseline model, jointly using them (BAM) yielded the best performance, as noted in the original paper [50]. Finally, the asymmetric encoder design, which is intended to properly cope with the input information density, resulted in better tracking performance.

Figure 10: Overview of multi-target multi-camera tracker.

2) Modality Drop

  • Knowledge distillation. To transfer the knowledge learned from the RGBT-F model to the standard RGB model, we distill the final level of FPN features to the RGB model. We implement the distillation loss using the MSE loss function, defined as $L_{Distill} = L_{MSE}(F_{rgb}, F_{fuse})$. The final objective function is defined as $L = L_{Track} + \lambda L_{Distill}$, where $L_{Track}$ represents the total loss to train the base tracker [49]. We set $\lambda$ to 0.1.

  • Multi-modal reconstruction. Motivated by the MSDN framework [75], we incorporate the multi-modal reconstruction loss into the base tracker design. Specifically, we use two identical backbones, $B_1$ and $B_2$, and extract RoI features $F_1$ and $F_2$ from the RGB input. We then use $F_2$ to reconstruct the corresponding thermal information with a single deconvolution layer, which forces $F_2$ to encode the RGB-to-thermal correlations explicitly. The fused features of $F_1$ and $F_2$ are then forwarded to the tracking head, implicitly encoding the thermal information without its presence at test time. The total loss is $L = L_{track} + \lambda L_{recon}$, where $\lambda$ is set to 1.0.

  • Multi-modal contrastive learning. To enhance multi-view contrastive learning, we made two adaptations to the multiple positive contrastive loss [49], which matches the RGB instance features across the key and reference frames. Firstly, we allowed instance feature matching within a frame, such as (key-key). Secondly, we included the instance thermal features, resulting in new feature combinations, such as RGB-T or T-T. The anchor features were selected from the key frame, and the positive and negative features were selected from the reference frame. The following combinations are investigated, where each term of the form $A - B$ denotes the contrastive loss computed between feature sets $A$ and $B$:

    • $L^{2\text{-}pair}_{contra} = RGB_{Key} - RGB_{Ref} + T_{Key} - T_{Ref}$

    • $L^{4\text{-}pair}_{contra} = L^{2\text{-}pair}_{contra} + RGB_{Key} - T_{Ref} + T_{Key} - RGB_{Ref}$

    • $L^{6\text{-}pair}_{contra} = L^{4\text{-}pair}_{contra} + RGB_{Key} - T_{Key} + T_{Key} - RGB_{Key}$

    • $L^{8\text{-}pair}_{contra} = L^{6\text{-}pair}_{contra} + RGB_{Ref} - T_{Ref} + T_{Ref} - RGB_{Ref}$

    In Table 9-(b), we investigate the impact of multi-view contrastive learning along with the model design. Firstly, we find that multi-view matching generally outperforms the original RGB-based single-view matching. Secondly, sharing the backbone encoder between the two input sources leads to better results.
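
As a rough illustration of the modality drop objectives listed above, the sketch below combines a tracking loss with the MSE distillation term and enumerates which feature pairs enter each contrastive variant. It uses flat Python lists in place of feature tensors; all function names are hypothetical, and reading each pair as (anchor source, matched source) is our interpretation of the pair notation.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat feature lists."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_objective(track_loss, f_rgb, f_fuse, lam=0.1):
    """L = L_Track + lambda * L_MSE(F_rgb, F_fuse), with lambda = 0.1 as in the text."""
    return track_loss + lam * mse(f_rgb, f_fuse)


# Feature pairs added at each step of the contrastive ablation.
PAIR_STEPS = [
    [("RGB_Key", "RGB_Ref"), ("T_Key", "T_Ref")],  # 2-pair
    [("RGB_Key", "T_Ref"), ("T_Key", "RGB_Ref")],  # added for 4-pair
    [("RGB_Key", "T_Key"), ("T_Key", "RGB_Key")],  # added for 6-pair
    [("RGB_Ref", "T_Ref"), ("T_Ref", "RGB_Ref")],  # added for 8-pair
]


def contrastive_pairs(num_pairs):
    """Return the feature pairs whose losses are summed in the num_pairs-pair variant."""
    if num_pairs not in (2, 4, 6, 8):
        raise ValueError("num_pairs must be 2, 4, 6, or 8")
    pairs = []
    for step in PAIR_STEPS[: num_pairs // 2]:
        pairs.extend(step)
    return pairs


total = distillation_objective(1.0, [0.0, 0.0], [1.0, 1.0])
four_pair = contrastive_pairs(4)
```

Each variant is cumulative, so the 4-pair loss contains the 2-pair terms plus the two cross-modal key-to-reference terms.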

In our empirical evaluation, we showed that modality fusion improves tracking performance and that the modality drop approach encodes multi-modal correlations in a way that improves tracker generalizability. Building on these findings, we designed a multi-modal Multi-Target Multi-Camera (MTMC) system, whose architecture is outlined in Fig. 10. The system follows a two-stage approach. First, it generates tracklets for each video using a tracking-by-detection mechanism. These tracklets are then linked across videos through a Multi-Camera Association (MCA) process. In the main paper, we investigate the two primary MCA approaches: optimization-based [28] and clustering-based [35] methods. We have also validated that incorporating additional thermal information improves multi-camera tracking accuracy.
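
The second stage can be pictured as clustering tracklet-level features so that tracklets of the same person share one global ID. The sketch below uses a simple average-linkage agglomerative procedure with cosine distance. It is a generic illustration rather than the method of [35]; the function names, the `(camera_id, local_track_id)` keys, and the threshold value are all assumptions.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return 1.0 - sum(a * b for a, b in zip(u, v)) / norm


def associate_tracklets(features, threshold=0.3):
    """Average-linkage agglomerative clustering of tracklet features.

    features: {tracklet_key: feature_vector}, e.g. key = (camera_id, local_track_id).
    Returns {tracklet_key: global_id}. Merging stops once the closest pair of
    clusters is farther apart than `threshold` (an illustrative value).
    """
    clusters = [[key] for key in features]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dists = [cosine_distance(features[a], features[b])
                         for a in clusters[i] for b in clusters[j]]
                avg = sum(dists) / len(dists)
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return {key: gid for gid, members in enumerate(clusters) for key in members}


tracklet_features = {
    ("cam1", 0): [1.0, 0.0],
    ("cam2", 3): [0.9, 0.1],
    ("cam2", 4): [0.0, 1.0],
}
global_ids = associate_tracklets(tracklet_features)
```

In this toy input, the first two tracklets have nearly parallel features and receive the same global ID, while the third stays separate. A practical system would also apply camera and timing constraints before merging.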

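| Method | Train | Rank 1 | mAP |
|---|---|---|---|
| AGW | Market-1501 | 98.0 | 65.2 |
| AGW | MSMT17 | 94.0 | 69.9 |
| AGW | MTMMC-reID | 98.0 | 72.5 |
| BOT | Market-1501 | 96.0 | 64.4 |
| BOT | MSMT17 | 94.0 | 68.6 |
| BOT | MTMMC-reID | 98.0 | 69.5 |

Table 10: Cross-domain re-identification results.

| Method | Modality | Rank 1 | mAP |
|---|---|---|---|
| AGW | RGB | 78.8 | 45.7 |
| AGW | RGBT-I | 79.2 | 47.3 |
| AGW | RGBT-F | 80.7 | 48.4 |
| BOT | RGB | 74.4 | 42.7 |
| BOT | RGBT-I | 75.1 | 44.1 |
| BOT | RGBT-F | 77.7 | 44.6 |

Table 11: Multi-modal re-identification results.

| Method | Train set: MTMMC | Train set: MOT17 | Eval on MTMMC: IDF1 | Eval on MTMMC: MOTA | Eval on MOT17: IDF1 | Eval on MOT17: MOTA |
|---|---|---|---|---|---|---|
| BoT-SORT | | | 64.7 | 89.2 | 70.8 | 56.7 |
| BoT-SORT | | | 40.7 | 58.5 | 78.1 | 75.0 |
| OC-SORT | | | 63.4 | 88.5 | 68.8 | 55.3 |
| OC-SORT | | | 39.2 | 52.8 | 74.5 | 73.5 |

Table 12: Multi-object tracking results using SOTA trackers.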

Appendix D Additional Experiments

Cross-domain Person Re-Identification

To illustrate the broad applicability of our re-ID representation, we conducted the cross-domain experiments in Table 10 using two Re-ID models, BOT [45] and AGW [77]. The models were trained on pedestrian datasets (Market-1501 [88], MSMT17 [69], and MTMMC-reID) and tested on sports scenes [64]. The models trained on our dataset achieved the best performance, indicating that our dataset provides more generic representations.

Multi-modal Person Re-Identification

To further validate the effectiveness of incorporating thermal modality, we conducted multi-modal re-identification experiments. We implemented two fusion approaches: input-level fusion (RGBT-I) and feature-level fusion (RGBT-F). Our findings indicate performance enhancements in both the BOT [45] and AGW [77] models, thereby reaffirming the efficacy of integrating thermal modality in this context.

Multi Object Tracking w. SOTA trackers

In Table 12, we present the results of MOT experiments using two recent models [2, 7]. The results show that these algorithms exhibit lower performance when trained and tested on our MTMMC dataset than when trained and tested on the MOT17 dataset. This indicates that our MTMMC dataset presents a richer array of complexities, emphasizing its value for training and testing more advanced models.

Figure 11: (a) Multi-modal camera with Coaxial Optical System. (b) RGB image. (c) Calibrating Thermal image with RGB image. (d) Overlay image.

Appendix E Camera Hardware Specifics

We detail our RGB-Thermal multi-modal camera system, adapted from Hwang et al. [31] for surveillance. It consists of an RGB camera (1920 × 1080 resolution) and a thermal camera (320 × 240 resolution), both with a 50° horizontal field of view and a frame rate of 60 fps. A hot mirror is included to filter extraneous radiation. For uniformity, thermal images are resized to 1920 × 1080, with border areas discarded during processing. See Fig. 11 (c) and (d) for further details.

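| | Factory | Campus | Total |
|---|---|---|---|
| train (#) | 7 | 7 | 14 |
| val (#) | 3 | 2 | 5 |
| test (#) | 2 | 4 | 6 |
| Total (#) | 12 | 13 | 25 |

(a)

| | Sunny | Cloudy | Total |
|---|---|---|---|
| train (#) | 13 | 1 | 14 |
| val (#) | 3 | 2 | 5 |
| test (#) | 5 | 1 | 6 |
| Total (#) | 21 | 4 | 25 |

(b)

| | Summer | Fall | Total |
|---|---|---|---|
| train (#) | 7 | 7 | 14 |
| val (#) | 3 | 2 | 5 |
| test (#) | 2 | 4 | 6 |
| Total (#) | 12 | 13 | 25 |

(c)

Table 13: Details on the dataset split.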

Appendix F MTMMC Dataset Details

F.1 Dataset Split

We divided the 25 MTMMC scenarios into three sets, consisting of 14, 5, and 6 scenarios for the training, validation, and testing sets, respectively. The division was based on the meta information to ensure equal and well-distributed representation, shown in Table 13.

F.2 Dataset Statistics

This section presents a detailed analysis of our dataset through three key metrics: (1) objects per frame, (2) tracks per video, and (3) age and gender distribution. These metrics highlight the dataset’s unique characteristics, enhancing its effectiveness for multiple object tracking tasks.

Number of Objects per Frame

The average number of objects per frame at each site is shown in Figure 13, with error bars indicating standard deviation. Our dataset showcases a significant variation in the number of objects per camera, mirroring real-world environments characterized by challenges like occlusion and diverse movement patterns. A noteworthy observation is the correlation between object density and the complexity of association tasks (Figures 15, 16), implying that higher object densities escalate the intricacies of tracking.

Number of Tracks per Video

Figure 14 illustrates the average number of tracks per video. This measure is vital for understanding the range of tracking difficulties, demonstrating that complexity is influenced not only by the quantity of objects but also by the nature of multiple, distinct tracks. High track counts typically indicate more frequent interactions and overlapping paths, posing additional challenges and necessitating sophisticated algorithms for effective differentiation.

The Age and Gender Distributions

Figure 12 shows the age and gender distribution of actors in our dataset. By encompassing a wide range of ages and genders, the dataset is relevant to diverse real-world settings. This variety does more than represent different demographics; it introduces added complexity to tracking tasks. Different movement and interaction patterns among various age groups and genders present additional challenges, particularly in dynamic or crowded settings.

Figure 12: The age and gender distributions of actors.

F.3 Full Illustrations

We provide full illustrations of the cameras installed in the campus and factory environments in Fig. 17 and Fig. 18, respectively. These cameras were installed in various locations, such as indoors, outdoors, and across different floors, replicating a dense real-world surveillance camera system.

Appendix G Dataset License

The licenses of the datasets used in the experiments are denoted as follows:

Appendix H Video Demo

Our demo highlights the MTMMC dataset’s distinct characteristics, including multi-modal data, extended and varied tracks, and complex scenarios. It features diverse situations recorded by RGB-Thermal cameras in campus and factory settings, illustrating the dataset’s proficiency in tracking multiple targets across different camera perspectives, managing occlusions, and navigating challenging lighting conditions.

Appendix I Ethical Considerations

In creating the MTMMC dataset, our foremost commitment is to the protection of personal privacy and ethical integrity in data usage. We meticulously selected school campuses and factory locations where we had comprehensive authorization, ensuring total control over the data collection process. This approach was governed by a stringent and transparent protocol. Each participant provided informed consent through release agreements, and we rigorously de-identified all non-actor data to further safeguard privacy. This meticulous process received the endorsement of the National Information Society Agency. The Institutional Review Board (IRB) is presently conducting an in-depth review of the dataset, underscoring our dedication to ethical compliance. Furthermore, we are committed to adhering to all relevant privacy laws and regulations, ensuring the dataset aligns with the highest standards of data protection.

Our intention in releasing the MTMMC dataset is to foster advancements in multi-target multi-camera tracking research. We aim to facilitate the generation of open-source, transparent academic works, enhancing knowledge and understanding within the scholarly community. The dataset is strictly designated for non-commercial, public, and academic research, with specific use cases detailed in the accompanying agreement. We vigilantly prohibit any unauthorized use that diverges from these outlined purposes, particularly to avoid potential civil rights violations. In such instances, we will take decisive action in accordance with applicable laws.

To mitigate any inadvertent misuse by third-party groups, we have instituted robust measures:

  • Access to the MTMMC dataset is granted exclusively upon formal request. Prospective users must sign a comprehensive usage agreement, outlining their responsibilities and the ethical boundaries of data utilization.

  • We maintain a proactive stance in monitoring dataset usage. The author’s affiliated institution(s) reserve the right to report any suspicious activities or individuals to law enforcement officials or regulatory bodies, particularly in cases of legal or regulatory transgressions.

  • In collaboration with law enforcement agencies, we will actively participate in investigations and legal actions against any illicit activities involving the dataset.

  • Regular audits and reviews of dataset usage will be conducted to ensure continuous adherence to ethical standards and privacy regulations.

  • We have established training programs for all dataset users, emphasizing the importance of ethical data handling and awareness of privacy implications.

  • A transparent feedback mechanism is in place, allowing users and observers to voice ethical concerns or report misuse. This facilitates a responsive and accountable approach to data governance.

  • Our ethical practices are not static; they are subject to ongoing evaluation and refinement, reflecting the evolving landscape of data privacy and ethical norms.

  • We engage with external ethical boards and committees, seeking their guidance and oversight in maintaining the ethical integrity of our dataset.

Our comprehensive approach to ethical data management reflects our unwavering commitment to upholding the highest standards of privacy and integrity in academic research. Through these measures, we strive to ensure the MTMMC dataset serves as a valuable and responsible resource for the research community.

Refer to caption
Figure 13: Number of Objects per Frame. (a) campus and (b) factory.
Figure 14: Number of Tracks per Video. (a) campus and (b) factory.
Figure 15: Analysis on Association Performance. For the analysis, we adopt AssA [44] with the localisation threshold $\alpha=50$.
Figure 16: Analysis on Detection Performance. For the analysis, we adopt DetA [44] with the localisation threshold $\alpha=50$.
Figure 17: (top) Camera layout in the campus environment. (bottom) Examples of multi-spectral images from the campus.
Figure 18: (top) Camera layout in the factory environment. (bottom) Examples of multi-spectral images from the factory.

References

  • Abavisani et al. [2019] Mahdi Abavisani, Hamid Reza Vaezi Joze, and Vishal M Patel. Improving the performance of unimodal dynamic hand-gesture recognition with multimodal training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1165–1174, 2019.
  • Aharon et al. [2022] Nir Aharon, Roy Orfaig, and Ben-Zion Bobrovsky. Bot-sort: Robust associations multi-pedestrian tracking. arXiv preprint arXiv:2206.14651, 2022.
  • Batchuluun et al. [2019] Ganbayar Batchuluun, Dat Tien Nguyen, Tuyen Danh Pham, Chanhum Park, and Kang Ryoung Park. Action recognition from thermal videos. IEEE Access, 7:103893–103917, 2019.
  • Berclaz et al. [2011] Jerome Berclaz, Francois Fleuret, Engin Turetken, and Pascal Fua. Multiple object tracking using k-shortest paths optimization. IEEE transactions on pattern analysis and machine intelligence, 33(9):1806–1819, 2011.
  • Bonchi et al. [2014] Francesco Bonchi, David Garcia-Soriano, and Edo Liberty. Correlation clustering: from theory to practice. In KDD, page 1972, 2014.
  • Bredereck et al. [2012] Michael Bredereck, Xiaoyan Jiang, Marco Körner, and Joachim Denzler. Data association for multi-object tracking-by-detection in multi-camera networks. In 2012 Sixth International Conference on Distributed Smart Cameras (ICDSC), pages 1–6. IEEE, 2012.
  • Cao et al. [2023] Jinkun Cao, Jiangmiao Pang, Xinshuo Weng, Rawal Khirodkar, and Kris Kitani. Observation-centric sort: Rethinking sort for robust multi-object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9686–9696, 2023.
  • Cao et al. [2015] Lijun Cao, Weihua Chen, Xiaotang Chen, Shuai Zheng, and Kaiqi Huang. An equalised global graphical model-based approach for multi-camera object tracking. arXiv preprint arXiv:1502.03532, 8, 2015.
  • Chavdarova et al. [2017] Tatjana Chavdarova, Pierre Baqué, Stéphane Bouquet, Andrii Maksai, Cijo Jose, Louis Lettry, Pascal Fua, Luc Van Gool, and François Fleuret. The wildtrack multi-camera person dataset. arXiv preprint arXiv:1707.09299, 2017.
  • Chen et al. [2020] Long Chen, Haizhou Ai, Rui Chen, Zijie Zhuang, and Shuang Liu. Cross-view tracking for multi-human 3d pose estimation at over 100 fps. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3279–3288, 2020.
  • Contributors [2020] MMTracking Contributors. MMTracking: OpenMMLab video perception toolbox and benchmark. https://github.com/open-mmlab/mmtracking, 2020.
  • Das et al. [2014] Abir Das, Anirban Chakraborty, and Amit K Roy-Chowdhury. Consistent re-identification in a camera network. In European conference on computer vision, pages 330–345. Springer, 2014.
  • Dave et al. [2020] Achal Dave, Tarasha Khurana, Pavel Tokmakov, Cordelia Schmid, and Deva Ramanan. Tao: A large-scale benchmark for tracking any object. In European conference on computer vision, pages 436–454. Springer, 2020.
  • De Vleeschouwer et al. [2008] Christophe De Vleeschouwer, Fan Chen, Damien Delannay, Christophe Parisot, Christophe Chaudy, Eric Martrou, Andrea Cavallaro, et al. Distributed video acquisition and annotation for sport-event summarization. NEM summit, 8, 2008.
  • Deng et al. [2019] Liuyuan Deng, Ming Yang, Tianyi Li, Yuesheng He, and Chunxiang Wang. Rfbnet: deep multimodal networks with residual fusion blocks for rgb-d semantic segmentation. arXiv preprint arXiv:1907.00135, 2019.
  • Dollar et al. [2009] Piotr Dollar, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: A benchmark. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 304–311, 2009.
  • D’Orazio et al. [2009] Tiziana D’Orazio, Marco Leo, Nicola Mosca, Paolo Spagnolo, and Pier Luigi Mazzeo. A semi-automatic system for ground truth generation of soccer video sequences. In 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance, pages 559–564. IEEE, 2009.
  • Eigen and Fergus [2015] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE international conference on computer vision, pages 2650–2658, 2015.
  • Ess et al. [2008] Andreas Ess, Bastian Leibe, Konrad Schindler, and Luc Van Gool. A mobile vision system for robust multi-person tracking. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008.
  • Fabbri et al. [2021] Matteo Fabbri, Guillem Brasó, Gianluca Maugeri, Orcun Cetintas, Riccardo Gasparini, Aljoša Ošep, Simone Calderara, Laura Leal-Taixé, and Rita Cucchiara. Motsynth: How can synthetic data help pedestrian detection and tracking? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10849–10859, 2021.
  • Feichtenhofer et al. [2017] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Detect to track and track to detect. In Proceedings of the IEEE international conference on computer vision, pages 3038–3046, 2017.
  • Felzenszwalb et al. [2010] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence, 32(9):1627–1645, 2010.
  • Ferryman and Shahrokni [2009] James Ferryman and Ali Shahrokni. Pets2009: Dataset and challenge. In 2009 Twelfth IEEE international workshop on performance evaluation of tracking and surveillance, pages 1–6. IEEE, 2009.
  • Fleuret et al. [2007] Francois Fleuret, Jerome Berclaz, Richard Lengagne, and Pascal Fua. Multicamera people tracking with a probabilistic occupancy map. IEEE transactions on pattern analysis and machine intelligence, 30(2):267–282, 2007.
  • Ge et al. [2021] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021.
  • Han et al. [2021] Xiaotian Han, Quanzeng You, Chunyu Wang, Zhizheng Zhang, Peng Chu, Houdong Hu, Jiang Wang, and Zicheng Liu. Mmptrack: Large-scale densely annotated multi-camera multiple people tracking benchmark. arXiv preprint arXiv:2111.15157, 2021.
  • He et al. [2020a] Lingxiao He, Xingyu Liao, Wu Liu, Xinchen Liu, Peng Cheng, and Tao Mei. Fastreid: A pytorch toolbox for general instance re-identification. arXiv preprint arXiv:2006.02631, 2020a.
  • He et al. [2020b] Yuhang He, Xing Wei, Xiaopeng Hong, Weiwei Shi, and Yihong Gong. Multi-target multi-camera tracking by tracklet-to-target assignment. IEEE Transactions on Image Processing, 29:5191–5205, 2020b.
  • Hou et al. [2021] Yunzhong Hou, Zhongdao Wang, Shengjin Wang, and Liang Zheng. Adaptive affinity for associations in multi-target multi-camera tracking. IEEE Transactions on Image Processing, 31:612–622, 2021.
  • Hu et al. [2019] Hou-Ning Hu, Qi-Zhi Cai, Dequan Wang, Ji Lin, Min Sun, Philipp Krahenbuhl, Trevor Darrell, and Fisher Yu. Joint monocular 3d vehicle detection and tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5390–5399, 2019.
  • Hwang et al. [2015] Soonmin Hwang, Jaesik Park, Namil Kim, Yukyung Choi, and In So Kweon. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1037–1045, 2015.
  • Jiang et al. [2018] Na Jiang, SiChen Bai, Yue Xu, Chang Xing, Zhong Zhou, and Wei Wu. Online inter-camera trajectory association exploiting person re-identification and camera topology. In Proceedings of the 26th ACM international conference on Multimedia, pages 1457–1465, 2018.
  • Kart et al. [2018] Ugur Kart, Joni-Kristian Kamarainen, and Jiri Matas. How to make an rgbd tracker? In proceedings of the european conference on computer vision (ECCV) Workshops, pages 0–0, 2018.
  • Kart et al. [2019] Ugur Kart, Alan Lukezic, Matej Kristan, Joni-Kristian Kamarainen, and Jiri Matas. Object tracking by reconstruction with view-specific discriminative correlation filters. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1339–1348, 2019.
  • Kohl et al. [2020] Philipp Kohl, Andreas Specker, Arne Schumann, and Jurgen Beyerer. The mta dataset for multi-target multi-camera pedestrian tracking by weighted distance aggregation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 1042–1043, 2020.
  • Kristan et al. [2020] Matej Kristan, Aleš Leonardis, Jiří Matas, Michael Felsberg, Roman Pflugfelder, Joni-Kristian Kämäräinen, Martin Danelljan, Luka Čehovin Zajc, Alan Lukežič, Ondrej Drbohlav, et al. The eighth visual object tracking vot2020 challenge results. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16, pages 547–601. Springer, 2020.
  • Kuo et al. [2010] Cheng-Hao Kuo, Chang Huang, and Ram Nevatia. Inter-camera association of multi-target tracks by on-line learned appearance affinity models. In European conference on computer vision, pages 383–396. Springer, 2010.
  • Leal-Taixé et al. [2016] Laura Leal-Taixé, Cristian Canton-Ferrer, and Konrad Schindler. Learning by tracking: Siamese cnn for robust target association. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 33–40, 2016.
  • Li et al. [2019] Chengyang Li, Dan Song, Ruofeng Tong, and Min Tang. Illumination-aware faster r-cnn for robust multispectral pedestrian detection. Pattern Recognition, 85:161–171, 2019.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • Lin et al. [2017a] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017a.
  • Lin et al. [2017b] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017b.
  • Liu et al. [2018] Ye Liu, Xiao-Yuan Jing, Jianhui Nie, Hao Gao, Jun Liu, and Guo-Ping Jiang. Context-aware three-dimensional mean-shift with occlusion handling for robust object tracking in rgb-d videos. IEEE Transactions on Multimedia, 21(3):664–677, 2018.
  • Luiten et al. [2021] Jonathon Luiten, Aljosa Osep, Patrick Dendorfer, Philip Torr, Andreas Geiger, Laura Leal-Taixé, and Bastian Leibe. Hota: A higher order metric for evaluating multi-object tracking. International journal of computer vision, 129(2):548–578, 2021.
  • Luo et al. [2019] Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, and Wei Jiang. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0, 2019.
  • Luo et al. [2018] Zelun Luo, Jun-Ting Hsieh, Lu Jiang, Juan Carlos Niebles, and Li Fei-Fei. Graph distillation for action detection with privileged modalities. In Proceedings of the European Conference on Computer Vision (ECCV), pages 166–183, 2018.
  • Meinhardt et al. [2021] Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, and Christoph Feichtenhofer. Trackformer: Multi-object tracking with transformers. arXiv preprint arXiv:2101.02702, 2021.
  • Milan et al. [2016] Anton Milan, Laura Leal-Taixé, Ian Reid, Stefan Roth, and Konrad Schindler. Mot16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.
  • Pang et al. [2021] Jiangmiao Pang, Linlu Qiu, Xia Li, Haofeng Chen, Qi Li, Trevor Darrell, and Fisher Yu. Quasi-dense similarity learning for multiple object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 164–173, 2021.
  • Park et al. [2018] Jongchan Park, Sanghyun Woo, Joon-Young Lee, and In So Kweon. Bam: Bottleneck attention module. arXiv preprint arXiv:1807.06514, 2018.
  • Park et al. [2022] Kwanyong Park, Sanghyun Woo, Seoung Wug Oh, In So Kweon, and Joon-Young Lee. Per-clip video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1352–1361, 2022.
  • Quach et al. [2021] Kha Gia Quach, Pha Nguyen, Huu Le, Thanh-Dat Truong, Chi Nhan Duong, Minh-Triet Tran, and Khoa Luu. Dyglip: A dynamic graph model with link prediction for accurate multi-camera multiple object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13784–13793, 2021.
  • Redmon and Farhadi [2017] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.
  • Ren et al. [2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.
  • Ristani and Tomasi [2018] Ergys Ristani and Carlo Tomasi. Features for multi-target multi-camera tracking and re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6036–6046, 2018.
  • Ristani et al. [2016] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In European conference on computer vision, pages 17–35. Springer, 2016.
  • Shao et al. [2018] Shuai Shao, Zijian Zhao, Boxun Li, Tete Xiao, Gang Yu, Xiangyu Zhang, and Jian Sun. Crowdhuman: A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123, 2018.
  • Shin et al. [2023] Ukcheol Shin, Kwanyong Park, Byeong-Uk Lee, Kyunghyun Lee, and In So Kweon. Self-supervised monocular depth estimation from thermal images via adversarial multi-spectral adaptation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5798–5807, 2023.
  • Shiva Kumar et al. [2017] KA Shiva Kumar, KR Ramakrishnan, and GN Rathna. Distributed person of interest tracking in camera networks. In Proceedings of the 11th International Conference on Distributed Smart Cameras, pages 131–137, 2017.
  • Sun et al. [2020a] Peize Sun, Jinkun Cao, Yi Jiang, Rufeng Zhang, Enze Xie, Zehuan Yuan, Changhu Wang, and Ping Luo. Transtrack: Multiple object tracking with transformer. arXiv preprint arXiv:2012.15460, 2020a.
  • Sun et al. [2020b] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020b.
  • Sun et al. [2019a] ShiJie Sun, Naveed Akhtar, HuanSheng Song, Ajmal Mian, and Mubarak Shah. Deep affinity network for multiple object tracking. IEEE transactions on pattern analysis and machine intelligence, 43(1):104–119, 2019a.
  • Sun et al. [2019b] Yuxiang Sun, Weixun Zuo, and Ming Liu. Rtfnet: Rgb-thermal fusion network for semantic segmentation of urban scenes. IEEE Robotics and Automation Letters, 4(3):2576–2583, 2019b.
  • Van Zandycke et al. [2022] Gabriel Van Zandycke, Vladimir Somers, Maxime Istasse, Carlo Del Don, and Davide Zambrano. Deepsportradar-v1: Computer vision dataset for sports understanding with high quality annotations. In Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports, pages 1–8, 2022.
  • Wang et al. [2021] Weiyao Wang, Matt Feiszli, Heng Wang, and Du Tran. Unidentified video objects: A benchmark for dense, open-world segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10776–10785, 2021.
  • Wang et al. [2020a] Zhongdao Wang, Liang Zheng, Yixuan Liu, Yali Li, and Shengjin Wang. Towards real-time multi-object tracking. In European Conference on Computer Vision, pages 107–122. Springer, 2020a.
  • Wang et al. [2020b] Zhongdao Wang, Liang Zheng, Yixuan Liu, and Shengjin Wang. Towards real-time multi-object tracking. The European Conference on Computer Vision (ECCV), 2020b.
  • Weber et al. [2021] Mark Weber, Jun Xie, Maxwell Collins, Yukun Zhu, Paul Voigtlaender, Hartwig Adam, Bradley Green, Andreas Geiger, Bastian Leibe, Daniel Cremers, et al. Step: Segmenting and tracking every pixel. arXiv preprint arXiv:2102.11859, 2021.
  • Wei et al. [2018] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer gan to bridge domain gap for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 79–88, 2018.
  • Wojke et al. [2017] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In 2017 IEEE international conference on image processing (ICIP), pages 3645–3649. IEEE, 2017.
  • Woo et al. [2022a] Sanghyun Woo, Kwanyong Park, Seoung Wug Oh, In So Kweon, and Joon-Young Lee. Bridging images and videos: A simple learning framework for large vocabulary video object detection. In European Conference on Computer Vision, pages 238–258. Springer, 2022a.
  • Woo et al. [2022b] Sanghyun Woo, Kwanyong Park, Seoung Wug Oh, In So Kweon, and Joon-Young Lee. Tracking by associating clips. In European Conference on Computer Vision, pages 129–145. Springer, 2022b.
  • Wu et al. [2021] Jialian Wu, Jiale Cao, Liangchen Song, Yu Wang, Ming Yang, and Junsong Yuan. Track to detect and segment: An online multi-object tracker. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12352–12361, 2021.
  • Xiao et al. [2017] Tong Xiao, Shuang Li, Bochao Wang, Liang Lin, and Xiaogang Wang. Joint detection and identification feature learning for person search. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3415–3424, 2017.
  • Xu et al. [2017] Dan Xu, Wanli Ouyang, Elisa Ricci, Xiaogang Wang, and Nicu Sebe. Learning cross-modal deep representations for robust pedestrian detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5363–5371, 2017.
  • Yan et al. [2021] Song Yan, Jinyu Yang, Jani Käpylä, Feng Zheng, Aleš Leonardis, and Joni-Kristian Kämäräinen. Depthtrack: Unveiling the power of rgbd tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10725–10733, 2021.
  • Ye et al. [2021] Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven CH Hoi. Deep learning for person re-identification: A survey and outlook. IEEE transactions on pattern analysis and machine intelligence, 44(6):2872–2893, 2021.
  • You and Jiang [2020] Quanzeng You and Hao Jiang. Real-time 3d deep multi-camera tracking. arXiv preprint arXiv:2003.11753, 2020.
  • Yu et al. [2020] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2636–2645, 2020.
  • Zeng et al. [2021] Fangao Zeng, Bin Dong, Tiancai Wang, Xiangyu Zhang, and Yichen Wei. Motr: End-to-end multiple-object tracking with transformer. arXiv preprint arXiv:2105.03247, 2021.
  • Zhang et al. [2019a] Lu Zhang, Zhiyong Liu, Shifeng Zhang, Xu Yang, Hong Qiao, Kaizhu Huang, and Amir Hussain. Cross-modality interactive attention network for multispectral pedestrian detection. Information Fusion, 50:20–29, 2019a.
  • Zhang et al. [2019b] Lu Zhang, Xiangyu Zhu, Xiangyu Chen, Xu Yang, Zhen Lei, and Zhiyong Liu. Weakly aligned cross-modal learning for multispectral pedestrian detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5127–5137, 2019b.
  • Zhang et al. [2015] Shu Zhang, Elliot Staudt, Tim Faltemier, and Amit K Roy-Chowdhury. A camera network tracking (camnet) dataset and performance baseline. In 2015 IEEE Winter Conference on Applications of Computer Vision, pages 365–372. IEEE, 2015.
  • Zhang et al. [2017a] Shanshan Zhang, Rodrigo Benenson, and Bernt Schiele. Citypersons: A diverse dataset for pedestrian detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3221, 2017a.
  • Zhang et al. [2021] Yifu Zhang, Peize Sun, Yi Jiang, Dongdong Yu, Zehuan Yuan, Ping Luo, Wenyu Liu, and Xinggang Wang. Bytetrack: Multi-object tracking by associating every detection box. arXiv preprint arXiv:2110.06864, 2021.
  • Zhang et al. [2017b] Zhimeng Zhang, Jianan Wu, Xuan Zhang, and Chi Zhang. Multi-target, multi-camera tracking by hierarchical clustering: Recent progress on dukemtmc project. arXiv preprint arXiv:1712.09531, 2017b.
  • Zhao et al. [2020] Long Zhao, Xi Peng, Yuxiao Chen, Mubbasir Kapadia, and Dimitris N Metaxas. Knowledge as priors: Cross-modal knowledge generalization for datasets without superior knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6528–6537, 2020.
  • Zheng et al. [2015] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In Proceedings of the IEEE international conference on computer vision, pages 1116–1124, 2015.
  • Zheng et al. [2017a] Liang Zheng, Hengheng Zhang, Shaoyan Sun, Manmohan Chandraker, Yi Yang, and Qi Tian. Person re-identification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1367–1376, 2017a.
  • Zheng et al. [2017b] Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the IEEE international conference on computer vision, pages 3754–3762, 2017b.
  • Zhong et al. [2017] Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1318–1327, 2017.
  • Zhou et al. [2021a] Wujie Zhou, Xinyang Lin, Jingsheng Lei, Lu Yu, and Jenq-Neng Hwang. Mffenet: Multiscale feature fusion and enhancement network for rgb–thermal urban road scene parsing. IEEE Transactions on Multimedia, 24:2526–2538, 2021a.
  • Zhou et al. [2021b] Wujie Zhou, Jinfu Liu, Jingsheng Lei, Lu Yu, and Jenq-Neng Hwang. Gmnet: graded-feature multilabel-learning network for rgb-thermal urban scene semantic segmentation. IEEE Transactions on Image Processing, 30:7790–7802, 2021b.
  • Zhou et al. [2020] Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Tracking objects as points. In European Conference on Computer Vision, pages 474–490. Springer, 2020.